Advanced Topics in Machine Learning for Bioinformatics and Biomedical Engineering
April 20, 2026
Most biologically meaningful information is carried in ordered sequences, not in flat feature vectors.
Learning goals
Biological sequences are everywhere:
Example DNA sequence: ATCGATCGTA… Example protein sequence: MKTAYIAKQR…
Biological function is encoded in the order of discrete or continuous measurements taken over time or position.
Figure 2: Four sequential modalities encountered in bioinformatics. Each row represents one data type; the horizontal axis is always position or time.
A DNA or RNA sequence is a string over a small finite alphabet:
Each position \(t\) carries a symbol \(x_t \in \mathcal{V}\), typically encoded as a one-hot vector \(x_t \in \{0,1\}^{|\mathcal{V}|}\).
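A minimal sketch (plain Python/NumPy, not tied to any specific library) of how a DNA string becomes a \(T \times |\mathcal{V}|\) one-hot matrix:

```python
import numpy as np

VOCAB = "ACGT"                                  # |V| = 4 for DNA
INDEX = {nt: i for i, nt in enumerate(VOCAB)}   # map each nucleotide to a column

def one_hot(seq: str) -> np.ndarray:
    """Return a (T, |V|) one-hot encoding of a DNA sequence."""
    x = np.zeros((len(seq), len(VOCAB)), dtype=np.float32)
    for t, nt in enumerate(seq):
        x[t, INDEX[nt]] = 1.0
    return x

x = one_hot("ATCGATCGTA")
print(x.shape)   # (10, 4): one row per position, one column per symbol
```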
Key properties that require order-awareness:
A protein is a sequence of amino acids over an alphabet of 20 standard residues, \(x_t \in \mathcal{V}_{\text{aa}}\) with \(|\mathcal{V}_{\text{aa}}| = 20\).
Downstream tasks include contact map prediction, variant effect scoring, and subcellular localisation.
Not all sequential data in bioinformatics is discrete or genomic.
Common challenges: irregular sampling, missing values, very long \(T\), and patient-level variability in length.
Three properties jointly distinguish sequence problems from standard supervised learning:
These three properties are not independent: handling variable length and long range simultaneously is the central challenge that motivates sequence-specific architectures.
Recall the standard MLP: a composition of affine maps and nonlinearities.
For a single input \(\mathbf{x} \in \mathbb{R}^d\) and layer \(\ell\) out of \(L\) layers:
\[ \mathbf{h}^{(0)} = \mathbf{x}, \qquad \mathbf{h}^{(\ell)} = \phi\!\left(W^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}\right), \qquad \hat{y} = W^{(L)} \mathbf{h}^{(L-1)} + \mathbf{b}^{(L)} \]
The MLP is a universal approximator (Hornik et al. 1989) and the workhorse of most non-sequential tasks.
Where MLPs work well in bioinformatics:
An MLP with input layer of width \(d\) requires every input to have exactly \(d\) features. There is no mechanism to handle variable-length sequences natively.
Figure 3: Left: an MLP requires exactly d input features. Right: biological sequences come in variable lengths — three sequences of lengths 3, 8 and 4 cannot share the same fixed-width input layer.
The two engineering workarounds both lose information:
Figure 4: MLP fixed-size input vs variable-length sequences.
Even if we fix the length problem – e.g., by applying an MLP independently at each position – the MLP has no way to pass information from one position to the next.
Figure 5: Left: independent MLPs applied at each position share no information (\(\times\)). Right: recurrent connections allow context from \(h_{t-1}\) to influence \(h_t\) (purple arrows).
For example, an MLP applied position-by-position cannot:

A model suited to sequential biological data needs at least four properties:
| Property | Requirement |
|---|---|
| Variable-length input | Same parameters for any \(T\) |
| Position awareness | Representations respect order |
| Context propagation | Information at \(t\) influences representations at \(t' > t\) |
| Parameter efficiency | Parameters do not grow with \(T\) |
Models we will study in this course, in order of increasing capacity:
Before recurrent networks, the standard approach was a sliding window: extract a fixed-width context of \(k\) positions around each target position and apply an MLP to that window.
Figure 6: Left: a sliding window of size k=3 extracts a local context and applies an MLP. Right: a recurrent model accumulates all past context in h_t without a fixed horizon.
For a window of size \(k\) centred at position \(t\), the input to the MLP is:
\[ \mathbf{v}_t = \bigl[x_{t - \lfloor k/2 \rfloor}, \;\dots,\; x_t, \;\dots,\; x_{t + \lfloor k/2 \rfloor}\bigr] \in \mathbb{R}^{k \cdot d_{\text{in}}} \]
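A sketch of the window construction, assuming zero-padding at the sequence boundaries (the helper name `sliding_windows` is illustrative):

```python
import numpy as np

def sliding_windows(x: np.ndarray, k: int) -> np.ndarray:
    """x: (T, d_in) sequence. Returns (T, k*d_in) windows centred at each position."""
    T, d_in = x.shape
    half = k // 2
    padded = np.vstack([np.zeros((half, d_in)), x, np.zeros((half, d_in))])  # zero-pad both ends
    return np.stack([padded[t:t + k].reshape(-1) for t in range(T)])

x = np.random.randn(8, 4)          # e.g. a one-hot DNA sequence of length 8
v = sliding_windows(x, k=3)        # each row is the fixed-width input to the MLP
print(v.shape)                     # (8, 12)
```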
Strengths:
Limitations:
In general, we will encounter the following sequence-mapping variants:
Figure 7: Three standard sequence task settings. (a) One label for the whole sequence. (b) One label per position. (c) Tokens generated one at a time, each conditioned on all previous tokens.
A single label is produced for the entire input sequence of length \(T\).
\[ f_\theta : \mathbb{R}^{T \times d_{\text{in}}} \longrightarrow \mathbb{R}^{k} \]
The model reads all \(T\) positions and emits one prediction vector.
Some examples:
A label is produced at every position in the input sequence.
\[ f_\theta : \mathbb{R}^{T \times d_{\text{in}}} \longrightarrow \mathbb{R}^{T \times k} \]
The output has the same length \(T\) as the input; labels are aligned position by position.
Some examples:
The model generates one token at a time, conditioning on all previously generated tokens.
\[ p(x_1, \dots, x_T) = \prod_{t=1}^{T} p\!\left(x_t \mid x_1, \dots, x_{t-1}\right) \]
At each step the model samples the next token, then feeds it back as input.
Some examples:
This is the same principle used in large language models. The only difference in biology is that the vocabulary is small (\(|\mathcal{V}| = 4\) or \(20\)) but the sequences can be very long.
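A minimal sketch of the sampling loop, assuming a hypothetical `model` that maps a prefix of tokens to next-token probabilities:

```python
import torch

def generate(model, bos_token: int, max_len: int) -> list[int]:
    """Sample a sequence token by token; each step conditions on the full prefix."""
    tokens = [bos_token]
    for _ in range(max_len):
        probs = model(torch.tensor(tokens).unsqueeze(0))        # p(x_t | x_1..x_{t-1}), shape (1, |V|)
        next_token = torch.multinomial(probs[0], num_samples=1).item()
        tokens.append(next_token)                               # feed the sample back as input
    return tokens
```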
A recurrent model maintains a hidden state \(h_t\) that is updated at every step:
\[ h_t = f\!\left(x_t,\, h_{t-1}\right) \]
Three consequences that directly address the MLP limitations:
The gradient signal must propagate through many time steps to update weights that influenced early states. This leads to the vanishing gradient problem – the main motivation for LSTMs and GRUs.
The function \(f\) uses the same parameters at every time step. For the simplest recurrent model:
\[ h_t = \phi\!\left(W_x\, x_t + W_h\, h_{t-1} + b\right) \]
where \(W_x \in \mathbb{R}^{d_h \times d_{\text{in}}}\), \(W_h \in \mathbb{R}^{d_h \times d_h}\), and \(b \in \mathbb{R}^{d_h}\) are shared across all \(T\) steps.
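A minimal sketch of this recurrence in PyTorch, with the same \((W_x, W_h, b)\) reused at every step (toy dimensions):

```python
import torch

d_in, d_h, T = 4, 16, 100
W_x = torch.randn(d_h, d_in) * 0.1
W_h = torch.randn(d_h, d_h) * 0.1
b = torch.zeros(d_h)

x = torch.randn(T, d_in)            # one sequence; any length T works
h = torch.zeros(d_h)                # initial hidden state h_0
for t in range(T):
    h = torch.tanh(W_x @ x[t] + W_h @ h + b)   # same (W_x, W_h, b) at every step
```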
Figure 8: Left: an MLP uses distinct weight matrices at each layer. Right: a recurrent model reuses the same (W, b) at every time step, keeping the parameter count independent of sequence length T.
Sharing parameters across time has two important consequences.
Generalisation to unseen lengths
The model is trained on sequences of length up to \(T_{\max}\), but the same recurrence \(h_t = f(x_t, h_{t-1})\) can be unrolled for any \(T\) at test time. No architecture change or retraining is required.
Regularisation by structure
Using the same \(W_x\), \(W_h\), \(b\) everywhere forces the model to learn a universal rule for updating context – not a position-specific rule. This reduces the effective number of parameters from \(O(T \cdot d_h^2)\) to \(O(d_h^2)\).
Note
Parameter sharing is the same principle as weight tying in convolutional networks: a filter is applied at every spatial position. In RNNs the axis of sharing is time rather than space.
At each time step \(t\), the unit receives two inputs:
It produces a new hidden state:
\[ h_t = \tanh(W_h h_{t-1} + W_x x_t + b) \]

Unfolded
The hidden state \(h_t\) compresses the entire history \((x_1, x_2, \dots, x_t)\) into a fixed-size vector.
\[ h_t = \tanh(W_h\, h_{t-1} + W_x\, x_t + b) \]
Note
The representation is lossy: \(h_t\) has \(d_h\) entries regardless of how long the sequence is. Early inputs can only influence the output if their effect survives repeated multiplication and squashing through \(\tanh\).
The hidden state \(h_t\) can be read out at every step or only at the end, depending on the task.
Sequence-to-sequence (e.g., per-residue secondary structure prediction):
\[ y_t = W_y\, h_t + b_y, \qquad t = 1, \dots, T \]
Sequence-to-one (e.g., classify a promoter sequence as active/inactive):
\[ y = W_y\, h_T + b_y \]
A key property: \(W_h\), \(W_x\), and \(b\) are the same at every step.
Tip
Weight sharing is what makes RNNs fundamentally different from a feed-forward network with \(T\) separate layers. It also makes gradient computation more involved, because changing \(W_h\) affects \(h_1\), which affects \(h_2\), and so on.
In a feed-forward network, backprop walks backward through layers.
In an RNN, the computational graph is the unrolled chain \(h_1 \to h_2 \to \cdots \to h_T\).
Backpropagation Through Time (BPTT) is simply backprop applied to this unrolled graph (Werbos 1990):
Note
BPTT is not a new algorithm. It is standard backprop on a graph that happens to repeat the same operation \(T\) times with shared weights.
Consider a loss that depends only on the final state: \(L = \ell(h_T)\).
The gradient of \(L\) with respect to \(h_t\) (for \(t < T\)) requires chaining through every intermediate step:
\[ \frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T} \prod_{s=t+1}^{T} \frac{\partial h_s}{\partial h_{s-1}} \]
Each Jacobian factor comes from the recurrence \(h_s = \tanh(W_h\, h_{s-1} + W_x\, x_s + b)\):
\[ \frac{\partial h_s}{\partial h_{s-1}} = \mathrm{diag}\!\bigl(\tanh'(z_s)\bigr)\; W_h \]
where \(z_s = W_h\, h_{s-1} + W_x\, x_s + b\) and \(\tanh'(z) = 1 - \tanh^2(z)\).
Expanding the product:
\[ \frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T} \prod_{s=t+1}^{T} \mathrm{diag}\!\bigl(1 - h_s^2\bigr)\; W_h \]
This is a product of \((T - t)\) matrices, all involving \(W_h\) (Pascanu et al. 2013).
Warning
This is the fundamental difficulty of training RNNs on long sequences: the gradient signal must survive many multiplicative steps to reach early time steps.
Since \(W_h\) appears at every step, the total gradient is a sum over time:
\[ \frac{\partial L}{\partial W_h} = \sum_{t=1}^{T} \frac{\partial L}{\partial h_t}\; \frac{\partial h_t}{\partial W_h}\bigg|_{\text{local}} \]
where the local term is:
\[ \frac{\partial h_t}{\partial W_h}\bigg|_{\text{local}} = \mathrm{diag}\!\bigl(1 - h_t^2\bigr)\; h_{t-1}^\top \]
The same pattern applies to \(W_x\) and \(b\). Each step contributes a term, and the difficulty lies in the factor \(\frac{\partial L}{\partial h_t}\), which requires the long product from the previous slide.
Why does diag() appear here? The recurrence uses a vector activation:
\[ h_s = \tanh(z_s) \]
where \(z_s \in \mathbb{R}^{d_h}\). This means that \(\tanh\) is applied elementwise:
\[ h_{s,i} = \tanh(z_{s,i}) \qquad \text{for each } i = 1,\dots,d_h \]
So the Jacobian of \(h_s\) with respect to \(z_s\) has entries
\[ \left(\frac{\partial h_s}{\partial z_s}\right)_{ij} = \frac{\partial h_{s,i}}{\partial z_{s,j}}, \]
which vanish whenever \(i \neq j\), because coordinate \(h_{s,i}\) depends only on its own input \(z_{s,i}\).
Therefore:
\[ \frac{\partial h_s}{\partial z_s} = \begin{bmatrix} \tanh'(z_{s,1}) & 0 & \cdots & 0\\ 0 & \tanh'(z_{s,2}) & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \tanh'(z_{s,d_h}) \end{bmatrix} = \operatorname{diag}\!\bigl(\tanh'(z_s)\bigr) \]
This is exactly the diag() term in the BPTT product.
Then, since
\[ z_s = W_h h_{s-1} + W_x x_s + b \]
we also have
\[ \frac{\partial z_s}{\partial h_{s-1}} = W_h \]
and by the chain rule:
\[ \frac{\partial h_s}{\partial h_{s-1}} = \frac{\partial h_s}{\partial z_s} \frac{\partial z_s}{\partial h_{s-1}} = \operatorname{diag}\!\bigl(\tanh'(z_s)\bigr)\, W_h = \mathrm{diag}(1 - h_s^2)\; W_h \]
Let
\[ z_s = \begin{bmatrix} z_1\\ z_2 \end{bmatrix}, \qquad h_s = \begin{bmatrix} \tanh(z_1)\\ \tanh(z_2) \end{bmatrix} \]
Then
\[ \frac{\partial h_s}{\partial z_s} = \begin{bmatrix} \frac{\partial \tanh(z_1)}{\partial z_1} & \frac{\partial \tanh(z_1)}{\partial z_2}\\[4pt] \frac{\partial \tanh(z_2)}{\partial z_1} & \frac{\partial \tanh(z_2)}{\partial z_2} \end{bmatrix} = \begin{bmatrix} \tanh'(z_1) & 0\\ 0 & \tanh'(z_2) \end{bmatrix} \]
So diag() is just a compact way to say:
each coordinate gets multiplied by its own \(\tanh'\) value, and there are no cross-terms between coordinates.
Note
In backprop code we rarely build this diagonal matrix explicitly. Multiplying by \(\operatorname{diag}(v)\) is usually implemented as an elementwise multiply by the vector \(v\).
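A quick PyTorch check of this equivalence:

```python
import torch

v = torch.tensor([0.3, -0.5, 0.9])        # e.g. the vector tanh'(z_s)
g = torch.tensor([1.0, 2.0, 3.0])         # an incoming gradient
assert torch.allclose(torch.diag(v) @ g, v * g)   # diag(v) @ g equals the elementwise product v * g
```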
Recall the product that links \(h_T\) back to \(h_t\):
\[ \prod_{s=t+1}^{T} \mathrm{diag}(1 - h_s^2)\; W_h \]
Vanishing occurs when the spectral norm of each factor is less than 1:
Consequence: the network cannot learn long-range dependencies. Parameters are updated based almost entirely on the last few steps.
Exploding occurs when the spectral norm of \(W_h\) is large enough that the product grows exponentially:
\[ \left\| \prod_{s=t+1}^{T} \mathrm{diag}(1 - h_s^2)\; W_h \right\| \;\approx\; \rho^{T-t} \]
where \(\rho\) is an effective per-step amplification factor.
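A toy numerical illustration of both regimes, using a random \(W_h\) rescaled to a chosen spectral norm and random surrogate hidden states (an assumed, simplified setting):

```python
import torch

def product_norm(spectral_norm: float, steps: int, d_h: int = 32) -> float:
    """Norm of prod_s diag(tanh'(z_s)) W_h after `steps` factors, for random surrogate states."""
    W = torch.randn(d_h, d_h)
    W = W / torch.linalg.matrix_norm(W, ord=2) * spectral_norm   # set the largest singular value
    J = torch.eye(d_h)
    for _ in range(steps):
        h = torch.tanh(torch.randn(d_h))                 # surrogate hidden state h_s
        J = torch.diag(1 - h**2) @ W @ J                 # one BPTT factor
    return torch.linalg.matrix_norm(J, ord=2).item()

print(product_norm(0.9, 50))   # shrinks toward 0: vanishing
print(product_norm(3.0, 50))   # grows rapidly: exploding (may overflow)
```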
Note
Vanishing is silent (the model simply fails to learn). Exploding is loud (training crashes). Both stem from the same cause: a long chain of multiplications.
Long-range dependencies are common in biological sequences:
If the gradient vanishes over 20–50 steps, the RNN treats these dependencies as invisible. The model learns only local patterns.
Tip
This limitation motivated the development of gated architectures (LSTM, GRU) and, later, attention-based models (Transformers), which provide alternative gradient pathways.
Gradient clipping caps the norm of the gradient vector before each parameter update.
Norm clipping (most common): if \(\|\nabla_\theta L\| > \tau\), rescale:
\[ \nabla_\theta L \leftarrow \frac{\tau}{\|\nabla_\theta L\|}\; \nabla_\theta L \]
Note
Clipping does not solve vanishing gradients. It only prevents the training process from crashing when gradients explode.
In PyTorch, gradient clipping is a single line before the optimizer step:
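For example (a minimal sketch, assuming `model`, `tau`, and the usual backward/step calls around it):

```python
import torch

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=tau)  # rescale if ||grad|| > tau
optimiser.step()
```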
Practical guidelines:
Warning
If you must clip aggressively (\(\tau < 0.5\)), the model is likely ill-conditioned. Consider reducing the learning rate, changing the architecture (e.g., LSTM/GRU), or shortening the sequences via truncated BPTT.
Full BPTT unrolls all \(T\) steps, which can be expensive in memory and computation.
Truncated BPTT limits how far gradients are propagated through the unrolled graph:
Trade-off:
Note
Truncated BPTT does not prevent vanishing gradients within the \(K\)-step window. It mainly controls memory and compute, while giving an approximate training signal for long sequences.
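A minimal sketch of \(K\)-step truncation using `.detach()` to cut the graph between chunks (`rnn_cell`, `readout`, `loss_fn`, `x`, `y`, `T`, `d_h`, and the optimiser are assumed to exist):

```python
import torch

K = 50                                   # truncation window
h = torch.zeros(1, d_h)                  # initial hidden state
for start in range(0, T, K):
    h = h.detach()                       # cut the graph: gradients stop at the chunk boundary
    loss = 0.0
    for t in range(start, min(start + K, T)):
        h = rnn_cell(x[t].unsqueeze(0), h)                    # one recurrent step
        loss = loss + loss_fn(readout(h), y[t].unsqueeze(0))  # accumulate per-step loss
    optimiser.zero_grad()
    loss.backward()                      # BPTT only over the last K steps
    optimiser.step()
```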
| Property | Vanilla RNN |
|---|---|
| Parameter sharing | Same \(W_h, W_x, b\) at every step |
| Handles variable-length input | Yes, by construction |
| Long-range dependencies | Poor — gradients vanish exponentially |
| Training stability | Requires gradient clipping |
| Typical effective memory | 10–20 steps in practice |
The vanilla RNN updates its state by overwriting it at every step:
\[ h_t = \tanh(W_h\, h_{t-1} + W_x\, x_t + b) \]
The core idea
Instead of forcing all information through a single multiplicative bottleneck, introduce a separate memory channel with additive updates and learnable gates that control what to remember, what to forget, and what to output.
An LSTM cell (Hochreiter and Schmidhuber 1997) maintains two vectors at each time step:
Cell state \(C_t \in \mathbb{R}^{d_h}\) — the long-term memory. Updated mostly by addition, so gradients flow without repeated matrix multiplication.
Hidden state \(h_t \in \mathbb{R}^{d_h}\) — the short-term output. A filtered, squashed view of \(C_t\), exposed to the rest of the network.
\[ C_t \;\longrightarrow\; \text{carries information across many steps (highway)} \] \[ h_t \;\longrightarrow\; \text{used for predictions and passed to the next step} \]
Note
The cell state \(C_t\) is what distinguishes an LSTM from a vanilla RNN. It acts as a conveyor belt: information can travel along it with minimal interference, unless a gate explicitly modifies it.
LSTM cell: the cell state \(C_t\) flows along the top as a highway. Four components — forget gate (\(f_t\)), input gate (\(i_t\)), candidate (\(\tilde{C}_t\)), and output gate (\(o_t\)) — regulate information flow.
The LSTM cell computes four intermediate quantities at each step, all from a linear combination of \(h_{t-1}\) and \(x_t\):
| Gate | Symbol | Activation | Role |
|---|---|---|---|
| Forget | \(f_t\) | sigmoid | How much of \(C_{t-1}\) to keep |
| Input | \(i_t\) | sigmoid | How much of the new candidate to write |
| Candidate | \(\tilde{C}_t\) | tanh | What new information to propose |
| Output | \(o_t\) | sigmoid | What part of \(C_t\) to expose as \(h_t\) |
The forget gate reads \([h_{t-1}, x_t]\) and produces \(f_t\) through a sigmoid. It then multiplies \(C_{t-1}\) element-wise, controlling which memory dimensions survive. Forget gate biases are often initialized to 1.0 so the network starts by remembering, not forgetting.
\[ f_t = \sigma\!\bigl(W_f\, [h_{t-1},\, x_t] + b_f\bigr) \]

Dimensions:
Per-dimension behaviour:
The input gate reads \([h_{t-1}, x_t]\) through a sigmoid and produces \(i_t\). It controls how much of the new candidate to write into the cell state. The input gate is the volume knob; the candidate (next) provides the content.
\[ i_t = \sigma\!\bigl(W_i\, [h_{t-1},\, x_t] + b_i\bigr) \]

Dimensions:
Separation of roles:
The candidate computes \(\tilde{C}_t\) through a \(\tanh\) activation. Unlike the three gates, it proposes content rather than control. Combined with the input gate as \(i_t \odot \tilde{C}_t\), it determines what new information enters the cell state.
\[ \tilde{C}_t = \tanh\!\bigl(W_C\, [h_{t-1},\, x_t] + b_C\bigr) \]

Dimensions:
Key properties:
The cell state is updated by forgetting selectively and adding selectively:
\[ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \]
The critical difference from vanilla RNNs
This update is additive in \(C_{t-1}\). When \(f_t \approx 1\) and \(i_t \approx 0\), the cell state passes through unchanged — no matrix multiplication, no \(\tanh\) squashing. This is why gradients can survive over long distances.
The output gate selects which dimensions of the squashed cell state \(\tanh(C_t)\) are exposed as \(h_t\). The cell can store information internally that it does not yet reveal — like a notebook where you choose which page to show.
\[ o_t = \sigma\!\bigl(W_o\, [h_{t-1},\, x_t] + b_o\bigr), \qquad h_t = o_t \odot \tanh(C_t) \]

Dimensions:
Two-step output:
\(C_t\) can store values outside \((-1,1)\) because only the output view \(h_t\) is squashed. This gives \(C_t\) more dynamic range.
Complete LSTM cell with all gates active. The red arrow shows the cell state highway; the four blocks at the bottom compute the gates and candidate from \([h_{t-1}, x_t]\).
All four equations for one time step:
\[\begin{eqnarray} f_t &=& \sigma\bigl(W_f [h_{t-1}, x_t] + b_f\bigr) \\ i_t &=& \sigma\bigl(W_i [h_{t-1}, x_t] + b_i\bigr) \\ \tilde{C}_t &=& \tanh\bigl(W_C [h_{t-1}, x_t] + b_C\bigr) \\ C_t &=& f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\ o_t &=& \sigma\bigl(W_o [h_{t-1}, x_t] + b_o\bigr) \\ h_t &=& o_t \odot \tanh(C_t) \end{eqnarray}\]

Inputs: \(h_{t-1}, C_{t-1}, x_t\). Outputs: \(h_t, C_t\).
To consolidate, trace the computation for a single time step:
Tip
In practice, the four matrix multiplications (\(W_f, W_i, W_C, W_o\)) are fused into a single large multiplication and then split. This is why you will see implementations using a single weight matrix of size \(4 d_h \times (d_h + d_{\text{in}})\).
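A sketch of one LSTM step with the four projections fused into one matrix and then split, as the tip describes (toy dimensions, not the internals of nn.LSTM):

```python
import torch

d_in, d_h = 8, 16
W = torch.randn(4 * d_h, d_h + d_in) * 0.1     # fused [W_f; W_i; W_C; W_o]
b = torch.zeros(4 * d_h)

def lstm_step(x_t, h_prev, C_prev):
    z = W @ torch.cat([h_prev, x_t]) + b       # one big matmul, then split into four parts
    f, i, g, o = z.chunk(4)
    f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
    C = f * C_prev + i * torch.tanh(g)         # additive cell-state update
    h = o * torch.tanh(C)                      # gated, squashed output
    return h, C

h, C = lstm_step(torch.randn(d_in), torch.zeros(d_h), torch.zeros(d_h))
```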
Each gate has its own weight matrix and bias:
Total parameters per LSTM layer:
\[ 4\, d_h\, (d_h + d_{\text{in}}) + 4\, d_h = 4\, d_h\, (d_h + d_{\text{in}} + 1) \]
Note
An LSTM has roughly 4 times the parameters of a vanilla RNN with the same \(d_h\) and \(d_{\text{in}}\). This is the cost of gating. In efficient implementations, the four matrix multiplications are fused into one large product.
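A quick sanity check against nn.LSTM, which stores two bias vectors per layer (b_ih and b_hh), so its total is \(4\, d_h\, (d_h + d_{\text{in}} + 2)\):

```python
import torch.nn as nn

d_in, d_h = 20, 64
lstm = nn.LSTM(input_size=d_in, hidden_size=d_h, num_layers=1)
n_params = sum(p.numel() for p in lstm.parameters())
print(n_params)                        # 22016
print(4 * d_h * (d_h + d_in + 2))      # 22016: PyTorch keeps separate b_ih and b_hh
```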
The LSTM cell state evolves as:
\[ C_s = f_s \odot C_{s-1} + i_s \odot \tilde{C}_s \]
where:
The key term is:
\[ f_s \odot C_{s-1} \]
which carries the previous memory forward by elementwise scaling.
The cell state at time \(s\) is a vector:
\[ C_s \in \mathbb{R}^{d_h}, \qquad C_s = \begin{bmatrix} C_s^{(1)} \\ C_s^{(2)} \\ \vdots \\ C_s^{(d_h)} \end{bmatrix} \]
Here \(j\) indexes one component of the cell vector – one memory dimension of the LSTM – and \(s\) is the time step. Each coordinate satisfies:
\[ C_s^{(j)} = f_s^{(j)} C_{s-1}^{(j)} + \cdots \]
Each coordinate evolves independently in this derivative, which is why the Jacobian is diagonal.
Along the direct cell-state path, differentiating the cell state with respect to the previous one gives:
\[ \frac{\partial C_s}{\partial C_{s-1}} = \mathrm{diag}(f_s) \]
This is because each coordinate satisfies:
\[ C_s^{(j)} = f_s^{(j)} C_{s-1}^{(j)} + \cdots \]
so:
\[ \frac{\partial C_s^{(j)}}{\partial C_{s-1}^{(j)}} = f_s^{(j)} \]
Thus, the direct-path Jacobian is diagonal: each memory dimension is scaled independently.
Note
The full derivative also includes indirect terms because the gates depend on \(h_{s-1}\), which in turn depends on earlier states. The diagonal expression isolates the additive memory pathway that makes LSTMs trainable over longer spans.
From time \(T\) back to time \(t\), the gradient along the cell-state path is:
\[ \frac{\partial C_T}{\partial C_t} = \prod_{s=t+1}^{T} \mathrm{diag}(f_s) \]
For each coordinate \(j\), this becomes:
\[ \frac{\partial C_T^{(j)}}{\partial C_t^{(j)}} = \prod_{s=t+1}^{T} f_s^{(j)} \]
So the gradient is passed backward by repeated multiplication by the forget gates.
If \(f_s \approx 1\), the product stays close to 1, and gradients flow across long time spans.
| | Vanilla RNN | LSTM (cell state path) |
|---|---|---|
| Gradient factor per step | \(\mathrm{diag}(\tanh'(z_s))\; W_h\) | \(\mathrm{diag}(f_s)\) |
| Involves weight matrix | Yes | No |
| Spectral norm | Can be \(> 1\) or \(\ll 1\) | In \((0, 1)\) per element |
| Additive path | No | Yes |
| Effective memory | 10–20 steps | 100+ steps possible |
Tip
The forget gate acts as a learnable decay rate per dimension. The network decides, based on the input, how long to retain each piece of information. This is far more flexible than the fixed decay imposed by repeated multiplication through \(W_h\).
LSTMs have been widely applied to biological sequence tasks:
Note
In many of these tasks, LSTMs were the dominant architecture before Transformers. They remain a strong baseline and are often preferred when data is limited, because they have fewer parameters than attention-based models of comparable capacity.
Warning
LSTMs are slower to train than vanilla RNNs (4x more parameters, 4x more computation per step) and cannot be parallelised across time steps. For very long sequences (\(T > 1000\)), consider Transformers or structured state-space models.
The LSTM introduced four components (three gates plus a candidate) and two state vectors (\(C_t\), \(h_t\)). This works well, but:
The core idea
The GRU merges the cell state and hidden state into a single vector \(h_t\), and reduces the gate count from three to two: a reset gate and an update gate. It retains the key benefit of LSTMs — learnable, per-dimension control over memory retention — with fewer parameters.
A GRU cell maintains a single state vector:
Two gates control how \(h_t\) is updated:
| Gate | Symbol | Role |
|---|---|---|
| Reset | \(r_t\) | How much of \(h_{t-1}\) to use when computing the candidate |
| Update | \(z_t\) | How much of \(h_{t-1}\) to carry forward vs replace |
Note
Compare with the LSTM: the GRU has no separate cell state \(C_t\) and no output gate. The update gate \(z_t\) plays the combined role of the LSTM forget and input gates.
GRU cell: the hidden state \(h_t\) is updated through a convex combination controlled by the update gate \(z_t\). The reset gate \(r_t\) modulates how much of \(h_{t-1}\) influences the candidate \(\tilde{h}_t\).
The reset gate reads \([h_{t-1}, x_t]\) and produces \(r_t\) through a sigmoid. It controls how much of the previous hidden state flows into the candidate computation. When \(r_t \approx 0\), the candidate ignores the past and behaves like a plain feed-forward layer.
\[ r_t = \sigma\bigl(W_r\, [h_{t-1},\, x_t] + b_r\bigr) \]

Dimensions:
Per-dimension behaviour:
The update gate reads \([h_{t-1}, x_t]\) and produces \(z_t\) through a sigmoid. It controls the interpolation between the old state \(h_{t-1}\) and the new candidate \(\tilde{h}_t\). A value of \(z_t \approx 1\) means “keep the old state”; \(z_t \approx 0\) means “replace with the candidate”.
\[ z_t = \sigma \bigl(W_z\, [h_{t-1},\, x_t] + b_z\bigr) \]

Dimensions:
Dual role:
The candidate \(\tilde{h}_t\) is computed through a \(\tanh\) activation, using the reset-gated version of \(h_{t-1}\). It proposes new content, analogous to the LSTM candidate \(\tilde{C}_t\).
\[ \tilde{h}_t = \tanh\!\bigl(W_h\, [r_t \odot h_{t-1},\, x_t] + b_h\bigr) \]

Dimensions:
Key difference from LSTM:
The new hidden state is a convex interpolation between the old state and the candidate:
\[ h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \]
The additive memory mechanism
When \(z_t \approx 1\), the hidden state passes through nearly unchanged — no matrix multiplication, no \(\tanh\) compression. This is the same principle that gives LSTMs long-range gradient flow, achieved here with a simpler structure.
Complete GRU cell. The reset gate modulates \(h_{t-1}\) for the candidate; the update gate interpolates between old and new states.
All three equations for one time step:
\[\begin{eqnarray} r_t &=& \sigma\bigl(W_r [h_{t-1}, x_t] + b_r\bigr) \\ z_t &=& \sigma\bigl(W_z [h_{t-1}, x_t] + b_z\bigr) \\ \tilde{h}_t &=& \tanh\bigl(W_h [r_t \odot h_{t-1}, x_t] + b_h\bigr) \\ h_t &=& z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \end{eqnarray}\]

Input: \(h_{t-1}, x_t\). Output: \(h_t\).
To consolidate, trace the computation for a single time step:
Tip
In practice, the three matrix multiplications (\(W_r, W_z, W_h\)) are fused into a single product using a weight matrix of size \(3 d_h \times (d_h + d_{\text{in}})\), then split and routed to the appropriate activations.
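A sketch of one GRU step in the same spirit; here only the two gate projections are fused, because the candidate takes the reset-gated state \(r_t \odot h_{t-1}\) as input (toy dimensions):

```python
import torch

d_in, d_h = 8, 16
W_rz = torch.randn(2 * d_h, d_h + d_in) * 0.1   # fused [W_r; W_z]
b_rz = torch.zeros(2 * d_h)
W_hh = torch.randn(d_h, d_h + d_in) * 0.1       # candidate weights W_h
b_h = torch.zeros(d_h)

def gru_step(x_t, h_prev):
    r, z = torch.sigmoid(W_rz @ torch.cat([h_prev, x_t]) + b_rz).chunk(2)
    h_tilde = torch.tanh(W_hh @ torch.cat([r * h_prev, x_t]) + b_h)   # reset-gated candidate
    return z * h_prev + (1 - z) * h_tilde                             # convex interpolation

h = gru_step(torch.randn(d_in), torch.zeros(d_h))
```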
Each of the 3 components has its own weight matrix and bias:
Total parameters per GRU layer:
\[ 3\, d_h\, (d_h + d_{\text{in}}) + 3\, d_h = 3\, d_h\, (d_h + d_{\text{in}} + 1) \]
Note
A GRU has 3/4 the parameters of an LSTM with the same \(d_h\) and \(d_{\text{in}}\). For the same parameter budget, a GRU can use a larger hidden size, which sometimes compensates for the simpler gating structure.
\[ h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \]
Differentiating with respect to the previous state (dominant term):
\[ \frac{\partial h_t}{\partial h_{t-1}} \approx \mathrm{diag}(z_t) \]
Over many steps, the gradient along the direct path is:
\[ \frac{\partial h_T^{(j)}}{\partial h_t^{(j)}} \approx \prod_{s=t+1}^{T} z_s^{(j)} \]
When \(z_s \approx 1\), the product stays close to 1 and gradients survive.
Note
This is exactly analogous to the LSTM cell-state gradient \(\prod f_s^{(j)}\), with the update gate \(z_t\) playing the role of the forget gate \(f_t\).
| | Vanilla RNN | LSTM (cell path) | GRU (hidden path) |
|---|---|---|---|
| Gradient factor | \(\mathrm{diag}(\tanh')\; W_h\) | \(\mathrm{diag}(f_s)\) | \(\mathrm{diag}(z_s)\) |
| Weight matrix | Yes | No | No |
| Additive path | No | Yes (\(C_t\)) | Yes (\(h_t\)) |
| State vectors | 1 (\(h_t\)) | 2 (\(h_t, C_t\)) | 1 (\(h_t\)) |
| Gates | 0 | 3 | 2 |
| Params (relative) | 1x | 4x | 3x |
Note
On many benchmarks, GRUs and LSTMs perform comparably. The choice often depends on dataset size, sequence length, and computational budget rather than on a clear architectural winner. When in doubt, try both and compare on validation data.
GRUs have been applied to many of the same tasks as LSTMs:
Tip
When data is scarce — a common situation in clinical and rare-disease settings — the GRU’s lower parameter count can be a decisive advantage over the LSTM.
Warning
Like LSTMs, GRUs process time steps sequentially and cannot be parallelised across the time axis. For very long sequences or large-scale pretraining, Transformers are generally preferred.
Note
Gates (LSTM, GRU) address vanishing gradients architecturally. The practical choices on this slide address the remaining engineering problems.
The recurrent weight matrix \(W_h \in \mathbb{R}^{d_h \times d_h}\) is applied at every time step:
\[ h_t = \phi\bigl(W_x\, x_t + W_h\, h_{t-1} + b\bigr) \]
After \(T\) steps, the effective Jacobian involves products of \(W_h\):
\[ \frac{\partial h_T}{\partial h_0} \;\propto\; \prod_{t=1}^{T} \mathrm{diag}\bigl(\phi'(z_t)\bigr)\; W_h \]
Sample \(W_h\) as a random orthogonal matrix: all singular values equal 1.
\[ W_h^\top W_h = I \quad\Rightarrow\quad \sigma_i(W_h) = 1 \;\;\forall\, i \]
Tip
For LSTM forget gates, initialise the forget bias to 1 (or higher). This keeps the gate open at the start of training, so information flows through the cell before the network has learned what to forget.
Recall the LSTM forget gate:
\[ f_t = \sigma\bigl(W_f\, x_t + U_f\, h_{t-1} + b_f\bigr) \]
This simple trick, proposed by Jozefowicz et al. (2015), often matters more than the choice of weight initialisation scheme.
Note
In PyTorch, nn.LSTM packs all four gate biases into one vector. You must index the correct slice to set the forget bias independently.
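A sketch of this initialisation for a single-layer nn.LSTM; the packed gates are ordered input, forget, cell, output, so the forget slice is \([d_h, 2 d_h)\):

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=20, hidden_size=64, num_layers=1)
d_h = lstm.hidden_size
for name, param in lstm.named_parameters():
    if "weight_hh" in name:
        nn.init.orthogonal_(param)               # orthogonal recurrent weights
    elif "bias" in name:
        nn.init.zeros_(param)
        param.data[d_h:2 * d_h] = 1.0            # forget-gate slice; b_ih and b_hh are summed
```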
Even with good initialisation, occasional gradient spikes can destabilise training.
Gradient norm clipping rescales the full gradient vector when its norm exceeds a threshold \(\tau\):
\[ g \leftarrow \begin{cases} g & \text{if } \|g\| \le \tau \\[4pt] \tau \;\dfrac{g}{\|g\|} & \text{if } \|g\| > \tau \end{cases} \]
Warning
Clipping does not fix vanishing gradients. It only prevents explosions. For vanishing gradients, use gated architectures (LSTM, GRU).
Logging \(\|g\|\) per training step is one of the most useful diagnostics for recurrent models.
In PyTorch, compute the total norm before clipping:
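For example, clip_grad_norm_ returns the total norm measured before clipping, so it can be logged directly (`model`, `tau`, and a `grad_norms` list are assumed):

```python
import torch

total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=tau)
grad_norms.append(float(total_norm))     # log per step; plot alongside the learning curve
```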
Tip
Plot gradient norms alongside the learning curve. If norms spike exactly when the loss jumps, you have an exploding-gradient event.
For tanh or sigmoid activations, saturation means most hidden units are near \(\pm 1\) (tanh) or near 0 / 1 (sigmoid).
Note
LSTM cell states \(C_t\) are unbounded by design – they do not saturate. This is a key reason why LSTMs handle long-range dependencies better than vanilla RNNs.
Use packed sequences (pack_padded_sequence) to avoid computing on padding tokens.

| Symptom | Likely cause | Remedy |
|---|---|---|
| Loss becomes NaN | Exploding gradients | Lower \(\tau\), reduce learning rate |
| Loss plateaus immediately | Vanishing gradients or dead gates | Check initialisation, try LSTM/GRU |
| Train loss drops, val loss rises | Overfitting | Add dropout, reduce model size |
| Val loss is noisy | Batch too small or val set too small | Increase batch size, check data split |
Task: classify protein subcellular localisation from amino-acid sequences.
\[ x_t \;\xrightarrow{\text{embedding}}\; e_t \in \mathbb{R}^{d_e} \;\xrightarrow{\text{LSTM}}\; h_t \in \mathbb{R}^{d_h} \;\xrightarrow{\text{pool + linear}}\; z \in \mathbb{R}^{k} \;\xrightarrow{\text{softmax}}\; \hat{y} \]
LSTM sequence classifier with embedding, recurrent layers, pooling, and softmax output
The LSTM produces one hidden vector \(h_t\) per time step. We need a single vector for classification.
Common pooling strategies:
Tip
For protein sequences, mean pooling over unpadded positions is a robust default. It treats all residues equally and avoids the positional bias of using only \(h_T\).
The padding mask \(M \in \{0,1\}^{B \times T}\) is essential here: pool only over positions where \(M_{b,t} = 1\).
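A sketch of masked mean pooling, assuming batch-first hidden states of shape \((B, T, d_h)\) and a mask of shape \((B, T)\):

```python
import torch

def masked_mean(h: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """h: (B, T, d_h) hidden states; mask: (B, T) with 1 at real positions, 0 at padding."""
    mask = mask.unsqueeze(-1).to(h.dtype)            # (B, T, 1)
    summed = (h * mask).sum(dim=1)                   # sum over real positions only
    lengths = mask.sum(dim=1).clamp(min=1.0)         # (B, 1) true lengths
    return summed / lengths                          # (B, d_h) pooled representation
```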
for epoch in range(max_epochs):
    for batch in dataloader:
        x, lengths, y = batch                    # padded input, true lengths, labels
        x_packed = pack(x, lengths)              # pack_padded_sequence: skip padding positions
        packed_out, _ = lstm(x_packed)           # forward pass through the LSTM
        h, _ = unpack(packed_out)                # pad_packed_sequence: padded hidden states
        pooled = masked_mean(h, lengths)         # mean over true residues only
        logits = linear(pooled)                  # map to k classes
        loss = cross_entropy(logits, y)
        optimiser.zero_grad()
        loss.backward()                          # backward pass (BPTT)
        norm = clip_grad_norm_(params, tau)      # clip global norm and log it
        optimiser.step()
    val_loss = evaluate(val_loader)
    scheduler.step(val_loss)                     # reduce LR on plateau
    if val_loss < best:
        best = val_loss                          # track the best validation loss
        save_checkpoint()
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            break                                # early stopping
During training, log at every epoch:
After training, inspect:
Warning
Split proteins by sequence identity, not by random sampling. Homologous sequences in both train and test sets inflate performance and hide poor generalisation.
| Concern | Recommendation |
|---|---|
| Weight init (\(W_h\)) | Orthogonal; forget bias = 1 for LSTM |
| Gradient explosions | Clip global norm, \(\tau \in [1, 5]\) |
| Vanishing gradients | Use LSTM or GRU |
| Optimiser | Adam, lr \(\approx 10^{-3}\) |
| Regularisation | Dropout between layers; weight decay |
| Monitoring | Log gradient norms + learning curves every epoch |
| Stopping criterion | Early stopping on validation loss |
| Batching | Pack sequences; sort by length |
So far, sequence-to-sequence meant one label aligned to each input position:
\[ X = (x_1,\dots,x_T) \longrightarrow Y = (y_1,\dots,y_T) \]
Many biological and biomedical tasks are more general:
\[ X = (x_1,\dots,x_T) \longrightarrow Y = (y_1,\dots,y_{T'}) \]
where \(T'\) may differ from \(T\).
Examples:
An RNN encoder reads the input sequence and compresses it into a context vector:
\[ h_t = f_{\text{enc}}(x_t, h_{t-1}), \qquad c = h_T \]
An RNN decoder generates the output sequence one token at a time:
\[ s_t = f_{\text{dec}}(y_{t-1}, s_{t-1}, c), \qquad p(y_t \mid y_{<t}, X) = \mathrm{softmax}(W_o s_t + b_o) \]
Note
This is the classic RNN encoder–decoder idea used in neural machine translation (Sutskever et al. 2014; Cho et al. 2014).
The encoder–decoder solves variable input/output lengths, but it creates a new problem:
\[ x_1,\dots,x_T \quad \longrightarrow \quad c = h_T \quad \longrightarrow \quad y_1,\dots,y_{T'} \]
The whole input must pass through a single fixed-size vector \(c\).
Consequences:
Tip
The next lecture introduces attention as the solution: instead of forcing the decoder to rely only on \(c = h_T\), let it look back at all encoder states \(h_1,\dots,h_T\) at every output step (Bahdanau et al. 2015).
RNNs gave us three ideas that Transformers keep:
Transformers change the mechanism:
| Question | RNN answer | Transformer answer |
|---|---|---|
| How does context move? | Step by step through \(h_t\) | Directly through attention weights |
| How many sequential steps from position 1 to \(T\)? | \(O(T)\) | \(O(1)\) per layer |
| What is the bottleneck? | Fixed hidden state / context vector | Quadratic attention cost |
We will use the following conventions throughout the course.
Single sequence
Batch of sequences (training)
Tip
Sequences in a batch typically have different true lengths. Padding extends each to \(T_{\max}\) with a special token (or zeros), and \(M\) tells the model which positions to ignore in the loss and in any attention computation.
We will keep the same conventions as in the MLP slides:
Sequence input
Hidden state

b2slab.upc.edu alexandre.perera@upc.edu