Trace an RNN over ordered events, see why gradients fade or grow, and use LSTM and GRU gates to control memory.
An incident assistant might assign probabilities to three actions: ignore, rollback, or escalate. A single "rollback requested" event is useful, but it means something different after "error logged" and "reproduction confirmed" than it does on its own.
A sequence model reads ordered evidence. Recurrent neural networks (RNNs) do this by applying one learned update repeatedly and carrying a hidden state from one event to the next. LSTMs and GRUs keep the recurrent idea, then add learned gates that control what state should survive.[1][2]
The order of events is part of the input. The small accumulator below isn't a trained RNN, but it exposes the requirement: a late event has more immediate effect because it's applied after earlier state has decayed.
```python
import numpy as np

signals = {
    "error logged": 1.0,
    "reproduction confirmed": 0.4,
    "rollback requested": 0.9,
}

def running_state(events):
    state = 0.0
    for event in events:
        state = 0.6 * state + signals[event]
    return state

incident_order = ["error logged", "reproduction confirmed", "rollback requested"]
reversed_order = list(reversed(incident_order))

print("incident order:", round(running_state(incident_order), 3))
print("reversed order:", round(running_state(reversed_order), 3))
print("same events:", sorted(incident_order) == sorted(reversed_order))
```

```text
incident order: 1.5
reversed order: 1.564
same events: True
```

The events are identical as a set, but the final state differs. A useful model must therefore accept variable-length sequences without discarding order.
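For contrast, here is a minimal sketch of an order-blind summary, assuming the same illustrative `signals` values. Summing the events gives one number for both orderings, so a model fed only this sum could not tell the two histories apart. The helper name `bag_of_events` is hypothetical.

```python
signals = {
    "error logged": 1.0,
    "reproduction confirmed": 0.4,
    "rollback requested": 0.9,
}

def bag_of_events(events):
    # Hypothetical order-blind summary: addition ignores event order.
    return sum(signals[event] for event in events)

incident_order = ["error logged", "reproduction confirmed", "rollback requested"]
reversed_order = list(reversed(incident_order))

# Compare with a tolerance because floating-point addition is not exactly associative.
same_summary = abs(bag_of_events(incident_order) - bag_of_events(reversed_order)) < 1e-12
print("order-blind summaries match:", same_summary)
```

The decayed accumulator separates the two orderings; the plain sum does not.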
A basic RNN cell updates one state vector:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

Here $x_t$ is the current event vector and $h_{t-1}$ is memory from earlier events. The same matrices $W_{xh}$ and $W_{hh}$ are reused at every timestep. A longer history causes more applications of the cell, not more parameters.
In one dimension, the update is easy to inspect:
```python
import numpy as np

events = [1.0, 0.4, 0.9]
w_input, w_state, bias = 0.8, 0.5, 0.0
state = 0.0

for step, event in enumerate(events, start=1):
    state = np.tanh(w_input * event + w_state * state + bias)
    print(f"h_{step} = {state:.3f}")
```

```text
h_1 = 0.664
h_2 = 0.573
h_3 = 0.764
```

The cell turns three inputs into one final value. Real cells use vectors so different dimensions can carry different kinds of evidence.
Represent each incident event with three binary features:
| Event | error_seen | repro_confirmed | rollback_requested |
|---|---|---|---|
| Error logged | 1 | 0 | 0 |
| Reproduction confirmed | 0 | 1 | 0 |
| Rollback requested | 0 | 0 | 1 |
Use a two-dimensional hidden state, small enough to print at every step:
```python
import numpy as np

events = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
W_xh = np.array([
    [0.8, 0.2, 0.1],
    [0.1, 0.7, 0.4],
])
W_hh = np.array([
    [0.5, 0.1],
    [0.0, 0.6],
])
b_h = np.zeros(2)

state = np.zeros(2)
for step, event in enumerate(events, start=1):
    state = np.tanh(W_xh @ event + W_hh @ state + b_h)
    print(f"h_{step} =", np.round(state, 3))

print("final shape:", state.shape)
```

```text
h_1 = [0.664 0.1  ]
h_2 = [0.494 0.641]
h_3 = [0.39  0.655]
final shape: (2,)
```
Attach a classifier to the last hidden state exactly as you attached one to a feature vector in the previous chapter:
```python
import numpy as np

final_state = np.array([0.390, 0.655])
W_out = np.array([
    [0.7, -0.3],   # ignore
    [-0.4, 0.8],   # rollback
    [0.1, 0.2],    # escalate
])
labels = ["ignore", "rollback", "escalate"]

logits = W_out @ final_state
shifted = logits - logits.max()
probabilities = np.exp(shifted) / np.exp(shifted).sum()

print("logits:", np.round(logits, 3))
print("probabilities:", np.round(probabilities, 3))
print("prediction:", labels[int(probabilities.argmax())])
```

```text
logits: [0.076 0.368 0.17 ]
probabilities: [0.291 0.389 0.32 ]
prediction: rollback
```

Weight sharing is the core economy of recurrence. This count stays fixed as history length grows:
```python
input_size, hidden_size, output_size = 3, 2, 3
rnn_parameters = hidden_size * input_size + hidden_size * hidden_size + hidden_size
head_parameters = output_size * hidden_size + output_size

for timesteps in [3, 30, 300]:
    print(f"{timesteps:>3} events -> {rnn_parameters + head_parameters} parameters")
```

```text
  3 events -> 21 parameters
 30 events -> 21 parameters
300 events -> 21 parameters
```

The compression is useful but severe: every past event must influence the decision through two final hidden numbers.
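Training unfolds the recurrence across time and backpropagates through every copy. This is backpropagation through time (BPTT). For a loss $L$ attached to the final hidden state $h_T$, its gradient path back to the initial state $h_0$ contains a product of recurrent Jacobians:

$$\frac{\partial L}{\partial h_0} = \frac{\partial L}{\partial h_T} \, \frac{\partial h_T}{\partial h_{T-1}} \, \frac{\partial h_{T-1}}{\partial h_{T-2}} \cdots \frac{\partial h_1}{\partial h_0}$$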
If a model receives losses at several timesteps, their backward paths add together. Each long route still contains repeated factors.
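The repeated factors can be inspected numerically. The sketch below assumes the small tanh cell and one-hot events from the vector example, where each per-step Jacobian is $\operatorname{diag}(1 - h_t^2)\, W_{hh}$. It multiplies those Jacobians across three steps and checks the product against central finite differences on the initial state. The helper name `run_from` is hypothetical.

```python
import numpy as np

W_xh = np.array([[0.8, 0.2, 0.1], [0.1, 0.7, 0.4]])
W_hh = np.array([[0.5, 0.1], [0.0, 0.6]])
events = np.eye(3)  # three one-hot events

def run_from(h0):
    # Hypothetical helper: final state reached from a given initial state.
    h = h0
    for x in events:
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h

# Analytic route: multiply the per-step Jacobians dh_t/dh_{t-1}.
h = np.zeros(2)
product = np.eye(2)
for x in events:
    h = np.tanh(W_xh @ x + W_hh @ h)
    step_jacobian = np.diag(1.0 - h**2) @ W_hh  # tanh'(a) = 1 - tanh(a)^2
    product = step_jacobian @ product           # newest factor multiplies on the left

# Numerical check: central differences on each coordinate of the initial state.
eps = 1e-6
numeric = np.column_stack([
    (run_from(eps * e) - run_from(-eps * e)) / (2 * eps) for e in np.eye(2)
])

product_norm = np.linalg.norm(product, 2)
print("product matches finite differences:", np.allclose(product, numeric, atol=1e-6))
print("spectral norm of product below 1:", bool(product_norm < 1.0))
```

With these weights every factor shrinks the signal, so the product's norm falls as more steps are chained.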
In a scalar simplification, a local derivative below 1 shrinks the signal exponentially:
```python
for steps in [1, 4, 8, 12]:
    gradient_factor = 0.6 ** steps
    print(f"{steps:>2} recurrent steps: {gradient_factor:.6f}")
```

```text
 1 recurrent steps: 0.600000
 4 recurrent steps: 0.129600
 8 recurrent steps: 0.016796
12 recurrent steps: 0.002177
```

The reverse failure also occurs. Repeated factors above 1 can make gradients too large:
```python
import numpy as np

raw_gradient = np.array([1.4 ** 12, -0.8 * (1.4 ** 12)])
max_norm = 5.0
norm = np.linalg.norm(raw_gradient)
clipped = raw_gradient * min(1.0, max_norm / norm)

print("raw norm:", round(float(norm), 3))
print("clipped norm:", round(float(np.linalg.norm(clipped)), 3))
print("clipped gradient:", np.round(clipped, 3))
```

```text
raw norm: 72.604
clipped norm: 5.0
clipped gradient: [ 3.904 -3.123]
```

Gradient clipping can stop an exploding update from destabilizing training. It doesn't give a plain RNN a reliable long memory; a vanished signal is already gone.
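In PyTorch, norm-based rescaling of this kind is available as `torch.nn.utils.clip_grad_norm_`, which computes one norm across all the parameters it is given and scales their gradients in place. The sketch below assumes a tiny `nn.RNN` with a linear head, and it multiplies the loss by 1000 purely to force a large gradient. The model, target label, and scale factor are illustrative assumptions, not a training recipe.

```python
import torch
from torch import nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=3, hidden_size=2, batch_first=True)
head = nn.Linear(2, 3)
params = list(rnn.parameters()) + list(head.parameters())

events = torch.eye(3).unsqueeze(0)   # one history: (batch=1, timesteps=3, features=3)
target = torch.tensor([1])           # illustrative label index

_, hidden = rnn(events)
# Illustrative only: the 1000x scale exaggerates the gradient so clipping triggers.
loss = 1000.0 * nn.functional.cross_entropy(head(hidden[-1]), target)
loss.backward()

max_norm = 5.0
# Returns the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(params, max_norm=max_norm)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in params))

print("clipping was needed:", bool(norm_before > max_norm))
print("norm after clipping within limit:", bool(norm_after <= max_norm + 1e-4))
```

In a training loop the call would sit between `loss.backward()` and the optimizer step.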
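Long Short-Term Memory (LSTM) introduced gated memory cells designed to preserve useful learning signals over long delays.[1] The version below uses the later, common forget-gate form. Gers et al. added the adaptive forget gate so a cell could learn when to reset state in continual streams. With a cell state $c_t$, the model has learned controls for keeping, writing, and exposing memory:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate: keep)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate: write)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate: expose)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The key structural difference is the additive cell update. If the model learns $f_t$ near 1 and $i_t$ near 0 for an important memory dimension, that value can persist instead of being replaced at every step.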
One cell dimension is enough to show the mechanism:
```python
import numpy as np

cell = 0.0
steps = [
    ("damage recorded", 0.00, 0.95, 0.90, 0.80),
    ("routine scan", 0.98, 0.02, 0.10, 0.80),
    ("photo verified", 0.97, 0.15, 0.50, 0.90),
]

for label, forget, write, candidate, expose in steps:
    cell = forget * cell + write * candidate
    hidden = expose * np.tanh(cell)
    print(f"{label:15} c={cell:.3f} h={hidden:.3f}")
```

```text
damage recorded c=0.855 h=0.555
routine scan    c=0.840 h=0.549
photo verified  c=0.890 h=0.640
```

`routine scan` changes little because its write gate is small and its forget gate is high. These values explain the update; during training the network learns its gates from data.
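In a real cell the gate values are not hand-picked: each gate is a sigmoid of the current input and the previous hidden state. The following is a minimal NumPy sketch of one vector LSTM step in the forget-gate form. The weights are random and untrained, and the names `W`, `U`, `b`, and `lstm_step` are assumptions for illustration, so the printed numbers carry no meaning beyond showing the data flow.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 2

# Hypothetical untrained parameters; rows are stacked as forget, write, expose, candidate.
W = rng.normal(scale=0.5, size=(4 * hidden_size, input_size))
U = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)

def lstm_step(x, h_prev, c_prev):
    pre = W @ x + U @ h_prev + b
    forget, write, expose, candidate = np.split(pre, 4)
    forget, write, expose = sigmoid(forget), sigmoid(write), sigmoid(expose)  # each in (0, 1)
    candidate = np.tanh(candidate)                # proposed new content in (-1, 1)
    c = forget * c_prev + write * candidate       # additive cell update
    h = expose * np.tanh(c)                       # exposed hidden state
    return h, c, (forget, write, expose)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in np.eye(3):
    h, c, (forget, write, expose) = lstm_step(x, h, c)
    print("forget:", np.round(forget, 3), "write:", np.round(write, 3), "cell:", np.round(c, 3))
```

Training adjusts `W`, `U`, and `b` so that the forget and write gates take useful values for each kind of event.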
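The Gated Recurrent Unit (GRU) uses one hidden state instead of separate cell and exposed states.[2] Using the convention in which $z_t$ is the fraction kept from old state:

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
$$

With this notation, $z_t$ close to 1 keeps the prior value. The reset gate $r_t$ controls how much earlier state influences the candidate $\tilde{h}_t$. Some libraries and explanations name or arrange GRU terms differently, so check the equation before interpreting a printed gate.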
```python
old_state = 0.80
candidate = -0.20

for keep_gate in [0.95, 0.50, 0.05]:
    new_state = keep_gate * old_state + (1.0 - keep_gate) * candidate
    print(f"z={keep_gate:.2f} -> h={new_state:.3f}")
```

```text
z=0.95 -> h=0.750
z=0.50 -> h=0.300
z=0.05 -> h=-0.150
```
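The interpolation above takes the candidate as given. A fuller sketch of one GRU step, under the same keep-gate convention, also computes the reset gate and lets it scale the old state before the candidate is formed. The weights are random and untrained, biases are omitted for brevity, and the names `W_z`, `U_z`, `W_r`, `U_r`, `W_h`, `U_h`, and `gru_step` are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
input_size, hidden_size = 3, 2

# Hypothetical untrained parameters for the keep gate (z), reset gate (r), and candidate (h).
W_z, W_r, W_h = (rng.normal(scale=0.5, size=(hidden_size, input_size)) for _ in range(3))
U_z, U_r, U_h = (rng.normal(scale=0.5, size=(hidden_size, hidden_size)) for _ in range(3))

def gru_step(x, h_prev):
    z = sigmoid(W_z @ x + U_z @ h_prev)                 # fraction of old state kept
    r = sigmoid(W_r @ x + U_r @ h_prev)                 # how much old state feeds the candidate
    candidate = np.tanh(W_h @ x + U_h @ (r * h_prev))   # proposed replacement state
    return z * h_prev + (1.0 - z) * candidate, z

h = np.zeros(hidden_size)
for x in np.eye(3):
    h, z = gru_step(x, h)
    print("keep gate:", np.round(z, 3), "state:", np.round(h, 3))
```

Because the new state is a convex mix of the old state and a tanh candidate, every entry stays inside (-1, 1).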
A production batch often contains claim histories with different lengths. Framework code pads the shorter histories to one rectangular tensor, but those padding rows aren't real events. If a final-state classifier reads a GRU after it processes padding, the filler values can silently change the decision.
PyTorch's `nn.GRU` accepts a packed sequence. `pack_padded_sequence` records each history's real length so the recurrent cell stops updating that history before its padding rows. Setting `batch_first=True` makes the input shape `(batch, timesteps, features)`, which matches the tensor printed below:
```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(7)
gru = nn.GRU(input_size=3, hidden_size=2, batch_first=True)
lengths = torch.tensor([3, 2])
base = torch.tensor([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
])
changed_padding = base.clone()
changed_padding[1, 2] = torch.tensor([9.0, -9.0, 9.0])

def packed_final(batch):
    packed = pack_padded_sequence(
        batch, lengths.cpu(), batch_first=True, enforce_sorted=False
    )
    _, hidden = gru(packed)
    return hidden[-1]

def unpacked_final(batch):
    _, hidden = gru(batch)
    return hidden[-1]

packed_changed = not torch.allclose(packed_final(base), packed_final(changed_padding))
unpacked_changed = not torch.allclose(unpacked_final(base), unpacked_final(changed_padding))

print("batch shape:", tuple(base.shape))
print("packed final shape:", tuple(packed_final(base).shape))
print("packed state changed by padding:", packed_changed)
print("unpacked state changed by padding:", unpacked_changed)
```

```text
batch shape: (2, 3, 3)
packed final shape: (2, 2)
packed state changed by padding: False
unpacked state changed by padding: True
```

The second history has only two real events. Packing makes its final state independent of the third row, while the unpacked GRU treats that row as another event. Masking can solve the same class of problem when an architecture exposes the right state control, but padding must never become evidence.
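The masking alternative can be sketched with a hand-written cell, where the state update is under direct control. The example below swaps the GRU for the plain tanh RNN cell with the fixed illustrative weights used earlier, and it copies the previous state forward wherever a timestep is padding. The helper name `masked_final` is hypothetical.

```python
import numpy as np

W_xh = np.array([[0.8, 0.2, 0.1], [0.1, 0.7, 0.4]])
W_hh = np.array([[0.5, 0.1], [0.0, 0.6]])

def masked_final(batch, lengths):
    # batch: (batch, timesteps, features); lengths: real event count per history.
    states = np.zeros((batch.shape[0], W_hh.shape[0]))
    for t in range(batch.shape[1]):
        proposed = np.tanh(batch[:, t] @ W_xh.T + states @ W_hh.T)
        is_real = (t < lengths)[:, None]              # False on padding rows
        states = np.where(is_real, proposed, states)  # padding keeps the old state
    return states

lengths = np.array([3, 2])
base = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
])
changed_padding = base.copy()
changed_padding[1, 2] = [9.0, -9.0, 9.0]

unchanged = np.allclose(masked_final(base, lengths), masked_final(changed_padding, lengths))
print("masked state ignores padding:", unchanged)
```

The second history's final state equals its state after two real events, whatever values fill the third row.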
Cho et al. trained an RNN encoder-decoder in which an encoder turns a source sequence into a fixed-size representation and a decoder generates a target sequence from it.[2] Sutskever et al. demonstrated the approach with multilayer LSTMs for sequence-to-sequence translation.[3]
The fixed-size interface is easy to see: both a short claim history and a longer one leave the encoder as the same two-number state.
```python
import numpy as np

W_xh = np.array([[0.8, 0.2, 0.1], [0.1, 0.7, 0.4]])
W_hh = np.array([[0.5, 0.1], [0.0, 0.6]])

def encode(events):
    state = np.zeros(2)
    for event in events:
        state = np.tanh(W_xh @ event + W_hh @ state)
    return state

history = np.eye(3)
for length in [1, 2, 3]:
    context = encode(history[:length])
    print(f"{length} event(s): shape={context.shape}, context={np.round(context, 3)}")
```

```text
1 event(s): shape=(2,), context=[0.664 0.1  ]
2 event(s): shape=(2,), context=[0.494 0.641]
3 event(s): shape=(2,), context=[0.39  0.655]
```

That fixed-size context is both an interface and a bottleneck. The next chapter studies bottlenecks directly: encoders that learn compact representations and decoders that reconstruct from them.
An RNN has a serial dependency: $h_t$ can't be computed before $h_{t-1}$. In the Transformer's comparison of layer types, for a sequence of length $n$, a recurrent layer has $O(n)$ sequential operations and an $O(n)$ maximum path length between positions. A self-attention layer has $O(1)$ sequential operations and an $O(1)$ maximum path length.[4] Full self-attention compares every allowed pair of positions, so its computation and attention-matrix size grow quadratically with sequence length.
| Question | Recurrent layer | Self-attention layer |
|---|---|---|
| Can positions be computed in parallel within a layer? | No, state is serial | Yes, with the attention mask applied |
| Longest path between two positions in one layer | $O(n)$ | $O(1)$ |
| Must all earlier evidence fit in one running state? | Yes | No, each allowed position can be attended to directly |
This doesn't make recurrence useless. It establishes the design pressure that leads into attention: long dependency routes and a fixed running-state bottleneck.
The comparison is about processing positions already present in a sequence. Autoregressive generation still emits one new token at a time because each next token depends on tokens generated earlier.
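The contrast with a serial loop can be made concrete with a minimal single-head, causally masked, scaled dot-product attention sketch in NumPy. The projection matrices are random and untrained, and the sizes `n` and `d` are arbitrary illustrative choices. The point is only that one matrix product scores every pair of positions at once, producing an `(n, n)` matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                                    # illustrative sequence length and width
X = rng.normal(size=(n, d))
# Hypothetical untrained query, key, and value projections.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)                  # every pair scored in one product: (n, n)
causal = np.tril(np.ones((n, n), dtype=bool))  # position t may attend to positions <= t
scores = np.where(causal, scores, -np.inf)

# Row-wise softmax; masked entries become exactly zero.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
output = weights @ V

print("score matrix shape:", scores.shape)
print("output shape:", output.shape)
print("future positions get zero weight:", bool(np.all(weights[~causal] == 0.0)))
```

No step in this layer waits on a previous position's result, but the score matrix has `n * n` entries, which is the quadratic cost noted above.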
[1] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8).

[2] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP 2014.

[3] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. NeurIPS 2014.

[4] Vaswani, A., et al. (2017). Attention Is All You Need.