RNNs, LSTMs, GRUs, and Sequence Modeling

Trace an RNN over ordered events, see why gradients fade or grow, and use LSTM and GRU gates to control memory.


An incident assistant can use action probabilities to choose among ignore, rollback, and escalate. One event saying "rollback requested" is useful, but it means something different after "error logged" and "reproduction confirmed" than it does by itself.

A sequence model reads ordered evidence. Recurrent neural networks (RNNs) do this by applying one learned update repeatedly and carrying a hidden state from one event to the next. LSTMs and GRUs keep the recurrent idea, then add learned gates that control what state should survive.[1][2]

Figure: Line chart tracks two hidden-state dimensions from zero through error logged, reproduction confirmed, and rollback requested events, ending at 0.390 and 0.655 beside a stacked action bar with rollback highest at 38.9%.
The same recurrent update changes both hidden coordinates after each ordered event. The final state `[0.390, 0.655]` is compressed history, and the action head assigns the largest probability to `rollback`.

An ordered history changes a decision

The order of events is part of the input. The small accumulator below isn't a trained RNN, but it exposes the requirement: a late event has more immediate effect because it's applied after earlier state has decayed.

order-changes-the-state.py

```python
import numpy as np

# Hand-picked signal strength for each event type.
signals = {
    "error logged": 1.0,
    "reproduction confirmed": 0.4,
    "rollback requested": 0.9,
}

def running_state(events):
    state = 0.0
    for event in events:
        # Decay the old state, then add the current event's signal.
        state = 0.6 * state + signals[event]
    return state

incident_order = ["error logged", "reproduction confirmed", "rollback requested"]
reversed_order = list(reversed(incident_order))

print("incident order:", round(running_state(incident_order), 3))
print("reversed order:", round(running_state(reversed_order), 3))
print("same events:", sorted(incident_order) == sorted(reversed_order))
```

Output

```text
incident order: 1.5
reversed order: 1.564
same events: True
```

The events are identical as a set, but the final state differs. A useful model must therefore accept variable-length sequences without discarding order.
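As a contrast, the sketch below pools the same signals with a plain sum, which is an order-blind aggregate. The `pooled_state` helper is a hypothetical baseline added only for illustration; `signals` and `running_state` repeat the definitions above so the block runs by itself.

```python
signals = {
    "error logged": 1.0,
    "reproduction confirmed": 0.4,
    "rollback requested": 0.9,
}

def running_state(events):
    # Order-aware accumulator from the example above.
    state = 0.0
    for event in events:
        state = 0.6 * state + signals[event]
    return state

def pooled_state(events):
    # Hypothetical order-blind baseline: a sum ignores event position.
    return sum(signals[event] for event in events)

incident_order = ["error logged", "reproduction confirmed", "rollback requested"]
reversed_order = incident_order[::-1]

# Compare with a small tolerance because float addition order can differ.
pooled_same = abs(pooled_state(incident_order) - pooled_state(reversed_order)) < 1e-12
ordered_same = abs(running_state(incident_order) - running_state(reversed_order)) < 1e-12

print("pooled states match:", pooled_same)
print("ordered states match:", ordered_same)
```

Running a reversed sequence through a model is a cheap test for whether order is being used at all.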

A recurrent state carries earlier evidence

A basic RNN cell updates one state vector:

$$h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b_h)$$

Here $x_t$ is the current event vector and $h_{t-1}$ is memory from earlier events. The same matrices $W_{xh}$ and $W_{hh}$ are reused at every timestep. A longer history causes more applications of the cell, not more parameters.

In one dimension, the update is easy to inspect:

one-dimensional-rnn.py
```python
import numpy as np

events = [1.0, 0.4, 0.9]
w_input, w_state, bias = 0.8, 0.5, 0.0
state = 0.0

for step, event in enumerate(events, start=1):
    state = np.tanh(w_input * event + w_state * state + bias)
    print(f"h_{step} = {state:.3f}")
```

Output

```text
h_1 = 0.664
h_2 = 0.573
h_3 = 0.764
```

The cell turns three inputs into one final value. Real cells use vectors so different dimensions can carry different kinds of evidence.

Trace a vector RNN

Represent each incident event with three binary features:

| Event | error_seen | repro_confirmed | rollback_requested |
| --- | --- | --- | --- |
| Error logged | 1 | 0 | 0 |
| Reproduction confirmed | 0 | 1 | 0 |
| Rollback requested | 0 | 0 | 1 |

Use a two-dimensional hidden state, small enough to print at every step:

vector-rnn-forward-pass.py
```python
import numpy as np

# One one-hot row per event, in incident order.
events = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
W_xh = np.array([
    [0.8, 0.2, 0.1],
    [0.1, 0.7, 0.4],
])
W_hh = np.array([
    [0.5, 0.1],
    [0.0, 0.6],
])
b_h = np.zeros(2)

state = np.zeros(2)
for step, event in enumerate(events, start=1):
    # The same W_xh, W_hh, and b_h are reused at every timestep.
    state = np.tanh(W_xh @ event + W_hh @ state + b_h)
    print(f"h_{step} =", np.round(state, 3))

print("final shape:", state.shape)
```

Output

```text
h_1 = [0.664 0.1  ]
h_2 = [0.494 0.641]
h_3 = [0.39  0.655]
final shape: (2,)
```
Figure: Three unrolled RNN cell applications update one-hot incident events into hidden states while one W_xh matrix and one W_hh matrix are reused before final action logits favor rollback.
Each timestep uses the same two weight matrices, so longer histories add computation without adding parameters. The final state `[0.390, 0.655]` produces the three action logits shown at right.

Attach a classifier to the last hidden state exactly as you attached one to a feature vector in the previous chapter:

final-state-action-head.py
```python
import numpy as np

final_state = np.array([0.390, 0.655])
W_out = np.array([
    [0.7, -0.3],   # ignore
    [-0.4, 0.8],   # rollback
    [0.1, 0.2],    # escalate
])
labels = ["ignore", "rollback", "escalate"]

logits = W_out @ final_state
# Subtract the max logit before exponentiating for numerical stability.
shifted = logits - logits.max()
probabilities = np.exp(shifted) / np.exp(shifted).sum()

print("logits:", np.round(logits, 3))
print("probabilities:", np.round(probabilities, 3))
print("prediction:", labels[int(probabilities.argmax())])
```

Output

```text
logits: [0.076 0.368 0.17 ]
probabilities: [0.291 0.389 0.32 ]
prediction: rollback
```

Weight sharing is the core economy of recurrence. This count stays fixed as history length grows:

shared-parameter-count.py
```python
input_size, hidden_size, output_size = 3, 2, 3
# W_xh + W_hh + b_h
rnn_parameters = hidden_size * input_size + hidden_size * hidden_size + hidden_size
# W_out + output bias
head_parameters = output_size * hidden_size + output_size

for timesteps in [3, 30, 300]:
    print(f"{timesteps:>3} events -> {rnn_parameters + head_parameters} parameters")
```

Output

```text
  3 events -> 21 parameters
 30 events -> 21 parameters
300 events -> 21 parameters
```

The compression is useful but severe: every past event must influence the decision through two final hidden numbers.

When early learning signals fade

Training unfolds the recurrence across time and backpropagates through every copy. This is backpropagation through time (BPTT). For a loss attached to the final hidden state, its gradient path back to the initial state contains a product of recurrent Jacobians:

$$\frac{\partial L}{\partial h_0} = \frac{\partial L}{\partial h_T} \prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}}$$

If a model receives losses at several timesteps, their backward paths add together. Each long route still contains repeated factors.
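For the tanh cell defined earlier, each factor in that product is $\partial h_t / \partial h_{t-1} = \mathrm{diag}(1 - h_t^2)\,W_{hh}$. The sketch below multiplies those factors for the three-event example and checks the result against finite differences. The `run` helper and the finite-difference check are illustrative additions, and the bias is taken to be zero as in the forward-pass example.

```python
import numpy as np

W_xh = np.array([[0.8, 0.2, 0.1], [0.1, 0.7, 0.4]])
W_hh = np.array([[0.5, 0.1], [0.0, 0.6]])
events = np.eye(3)  # the three one-hot incident events

def run(h0):
    """Return the final state and the product of per-step Jacobians."""
    state = h0
    total = np.eye(2)
    for event in events:
        state = np.tanh(W_xh @ event + W_hh @ state)
        # d h_t / d h_{t-1} = diag(1 - h_t^2) @ W_hh for a tanh cell.
        step_jacobian = np.diag(1.0 - state**2) @ W_hh
        # Later steps multiply on the left: J_T ... J_2 J_1.
        total = step_jacobian @ total
    return state, total

h0 = np.zeros(2)
final_state, jacobian = run(h0)

# Central finite differences as an independent check of d h_3 / d h_0.
eps = 1e-6
numeric = np.zeros((2, 2))
for j in range(2):
    bump = np.zeros(2)
    bump[j] = eps
    numeric[:, j] = (run(h0 + bump)[0] - run(h0 - bump)[0]) / (2 * eps)

print("analytic d h_3 / d h_0:")
print(np.round(jacobian, 4))
print("matches finite differences:", np.allclose(jacobian, numeric, atol=1e-6))
print("spectral norm:", round(float(np.linalg.norm(jacobian, 2)), 4))
```

With these weights the spectral norm of the product is below 1 after only three steps, which is the vector form of the shrinking signal.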

In a scalar simplification, a local derivative below 1 shrinks the signal exponentially:

vanishing-gradient-product.py
```python
for steps in [1, 4, 8, 12]:
    gradient_factor = 0.6 ** steps
    print(f"{steps:>2} recurrent steps: {gradient_factor:.6f}")
```

Output

```text
 1 recurrent steps: 0.600000
 4 recurrent steps: 0.129600
 8 recurrent steps: 0.016796
12 recurrent steps: 0.002177
```

The reverse failure also occurs. Repeated factors above 1 can make gradients too large:

exploding-gradients-and-clipping.py
```python
import numpy as np

raw_gradient = np.array([1.4 ** 12, -0.8 * (1.4 ** 12)])
max_norm = 5.0
norm = np.linalg.norm(raw_gradient)
# Rescale only when the norm exceeds max_norm; the direction is preserved.
clipped = raw_gradient * min(1.0, max_norm / norm)

print("raw norm:", round(float(norm), 3))
print("clipped norm:", round(float(np.linalg.norm(clipped)), 3))
print("clipped gradient:", np.round(clipped, 3))
```

Output

```text
raw norm: 72.604
clipped norm: 5.0
clipped gradient: [ 3.904 -3.123]
```

Gradient clipping can stop an exploding update from destabilizing training. It doesn't give a plain RNN a reliable long memory; a vanished signal is already gone.

LSTM adds a controlled memory track

Long Short-Term Memory (LSTM) introduced gated memory cells designed to preserve useful learning signals over long delays.[1] The version below uses the later, common forget-gate form. Gers et al. added the adaptive forget gate so a cell could learn when to reset state in continual streams. With a cell state $c_t$, the model has learned controls for keeping, writing, and exposing memory:

$$
\begin{aligned}
f_t &= \sigma(W_f[h_{t-1},x_t] + b_f) && \text{how much old memory to keep} \\
i_t &= \sigma(W_i[h_{t-1},x_t] + b_i) && \text{how much candidate to write} \\
\tilde{c}_t &= \tanh(W_c[h_{t-1},x_t] + b_c) && \text{candidate content} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_o[h_{t-1},x_t] + b_o) && \text{how much memory to expose} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The key structural difference is the additive cell update. If the model learns $f_t$ near 1 and $i_t$ near 0 for an important memory dimension, that value can persist instead of being replaced at every step.

One cell dimension is enough to show the mechanism:

lstm-keep-write-expose.py
```python
import numpy as np

cell = 0.0
# Each row: (label, forget gate, write gate, candidate, expose gate).
steps = [
    ("damage recorded", 0.00, 0.95, 0.90, 0.80),
    ("routine scan", 0.98, 0.02, 0.10, 0.80),
    ("photo verified", 0.97, 0.15, 0.50, 0.90),
]

for label, forget, write, candidate, expose in steps:
    cell = forget * cell + write * candidate
    hidden = expose * np.tanh(cell)
    print(f"{label:15} c={cell:.3f} h={hidden:.3f}")
```

Output

```text
damage recorded c=0.855 h=0.555
routine scan    c=0.840 h=0.549
photo verified  c=0.890 h=0.640
```

The `routine scan` step changes the cell little because its write gate is small and its forget gate is high. The gate values here are hand-picked to explain the update; during training the network learns its gates from data.
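The same six equations can be sketched as a vector cell. This is a minimal sketch, not a trained model: the weights are random, the `lstm_step` helper and the `params` dictionary are hypothetical names, and the forget-gate bias of 1.0 is an illustrative assumption rather than a value from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update in the forget-gate form written above."""
    concat = np.concatenate([h_prev, x])            # [h_{t-1}, x_t]
    f = sigmoid(params["W_f"] @ concat + params["b_f"])      # keep old memory
    i = sigmoid(params["W_i"] @ concat + params["b_i"])      # write candidate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])
    c = f * c_prev + i * c_tilde                    # additive cell update
    o = sigmoid(params["W_o"] @ concat + params["b_o"])      # expose memory
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 2
params = {}
for gate in "fico":
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
    params[f"b_{gate}"] = np.zeros(hidden_size)
params["b_f"] += 1.0  # assumption: start with a bias toward keeping memory

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for step, x in enumerate(np.eye(input_size), start=1):
    h, c = lstm_step(x, h, c, params)
    print(f"step {step}: c={np.round(c, 3)} h={np.round(h, 3)}")

# Four gate blocks, each with a weight matrix and a bias vector.
parameter_count = sum(value.size for value in params.values())
print("parameters:", parameter_count)
```

The parameter count is four times one gate block, and like the plain RNN it doesn't depend on how many events are processed.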

GRU combines the keep and write choice

The Gated Recurrent Unit (GRU) uses one hidden state instead of separate cell and exposed states.[2] Using the convention in which $z_t$ is the fraction kept from old state:

$$
\begin{aligned}
z_t &= \sigma(W_z[h_{t-1},x_t]) \\
r_t &= \sigma(W_r[h_{t-1},x_t]) \\
\tilde{h}_t &= \tanh(W_h[r_t \odot h_{t-1},x_t]) \\
h_t &= z_t \odot h_{t-1} + (1-z_t) \odot \tilde{h}_t
\end{aligned}
$$

With this notation, $z_t$ close to 1 keeps the prior value. The reset gate $r_t$ controls how much earlier state influences the candidate $\tilde{h}_t$. Some libraries and explanations name or arrange GRU terms differently, so check the equation before interpreting a printed gate.

gru-keep-gate.py
```python
old_state = 0.80
candidate = -0.20

for keep_gate in [0.95, 0.50, 0.05]:
    # z is the fraction kept from the old state in this convention.
    new_state = keep_gate * old_state + (1.0 - keep_gate) * candidate
    print(f"z={keep_gate:.2f} -> h={new_state:.3f}")
```

Output

```text
z=0.95 -> h=0.750
z=0.50 -> h=0.300
z=0.05 -> h=-0.150
```
Figure: Three state-flow diagrams compare a plain RNN nonlinear rewrite, an LSTM retaining 98% of cell memory plus a 2% write, and a GRU blending 95% old state with 5% candidate state.
A plain RNN routes old state and new input through one nonlinear rewrite. LSTM and GRU add explicit mixture paths, so training can set high keep coefficients when earlier evidence still matters.
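A vector version of the GRU equations can be sketched the same way. The weights below are random and untrained, `gru_step` is a hypothetical helper, and the step follows the bias-free convention written above, with $z_t$ as the fraction kept.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W_z, W_r, W_h):
    """One GRU update where z is the fraction of old state that is kept."""
    concat = np.concatenate([h_prev, x])                 # [h_{t-1}, x_t]
    z = sigmoid(W_z @ concat)                            # update (keep) gate
    r = sigmoid(W_r @ concat)                            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))
    h = z * h_prev + (1.0 - z) * h_tilde                 # blend old and candidate
    return h, z, r, h_tilde

rng = np.random.default_rng(1)
input_size, hidden_size = 3, 2
shape = (hidden_size, hidden_size + input_size)
W_z, W_r, W_h = (rng.normal(scale=0.5, size=shape) for _ in range(3))

h = np.zeros(hidden_size)
for step, x in enumerate(np.eye(input_size), start=1):
    h_prev = h
    h, z, r, h_tilde = gru_step(x, h_prev, W_z, W_r, W_h)
    print(f"step {step}: z={np.round(z, 3)} h={np.round(h, 3)}")
```

Because each new coordinate is a weighted average, it always lies between the old value and the candidate value for that coordinate.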

Pack padding before a library GRU

A production batch often contains incident histories with different lengths. Framework code pads the shorter histories to one rectangular tensor, but those padding rows aren't real events. If a final-state classifier reads a GRU after it processes padding, the filler values can silently change the decision.

PyTorch's `nn.GRU` accepts a packed sequence. `pack_padded_sequence` records each history's real length so the recurrent cell stops updating that history before its padding rows. Setting `batch_first=True` makes the input shape `(batch, timesteps, features)`, which matches the tensor printed below:

packed-gru-ignores-padding.py
```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(7)
gru = nn.GRU(input_size=3, hidden_size=2, batch_first=True)
lengths = torch.tensor([3, 2])
base = torch.tensor([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
])
# Same batch, but the padding row of the second history holds large values.
changed_padding = base.clone()
changed_padding[1, 2] = torch.tensor([9.0, -9.0, 9.0])

def packed_final(batch):
    packed = pack_padded_sequence(
        batch, lengths.cpu(), batch_first=True, enforce_sorted=False
    )
    _, hidden = gru(packed)
    return hidden[-1]

def unpacked_final(batch):
    _, hidden = gru(batch)
    return hidden[-1]

packed_changed = not torch.allclose(packed_final(base), packed_final(changed_padding))
unpacked_changed = not torch.allclose(unpacked_final(base), unpacked_final(changed_padding))

print("batch shape:", tuple(base.shape))
print("packed final shape:", tuple(packed_final(base).shape))
print("packed state changed by padding:", packed_changed)
print("unpacked state changed by padding:", unpacked_changed)
```

Output

```text
batch shape: (2, 3, 3)
packed final shape: (2, 2)
packed state changed by padding: False
unpacked state changed by padding: True
```

The second history has only two real events. Packing makes its final state independent of the third row, while the unpacked GRU treats that row as another event. Masking can solve the same class of problem when an architecture exposes the right state control, but padding must never become evidence.
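The masking alternative can be sketched with the hand-written NumPy RNN from earlier, where the state update is directly under your control. The `masked_final_state` helper is hypothetical; it freezes a history's state once the timestep index reaches that history's real length.

```python
import numpy as np

W_xh = np.array([[0.8, 0.2, 0.1], [0.1, 0.7, 0.4]])
W_hh = np.array([[0.5, 0.1], [0.0, 0.6]])

batch = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],  # last row is padding
])
lengths = np.array([3, 2])

def masked_final_state(batch, lengths):
    batch_size, timesteps, _ = batch.shape
    state = np.zeros((batch_size, W_hh.shape[0]))
    for t in range(timesteps):
        proposed = np.tanh(batch[:, t] @ W_xh.T + state @ W_hh.T)
        # Only histories that still have a real event at step t are updated.
        is_real = (t < lengths)[:, None]
        state = np.where(is_real, proposed, state)
    return state

changed_padding = batch.copy()
changed_padding[1, 2] = [9.0, -9.0, 9.0]

final_base = masked_final_state(batch, lengths)
final_changed = masked_final_state(changed_padding, lengths)

print("final states:")
print(np.round(final_base, 3))
print("changed by padding:", not np.allclose(final_base, final_changed))
```

The second row ends at the state the vector RNN reached after two events, so the classifier reads the same value whatever the padding row contains.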

An encoder can compress a whole sequence

Cho et al. trained an RNN encoder-decoder in which an encoder turns a source sequence into a fixed-size representation and a decoder generates a target sequence from it.[2] Sutskever et al. demonstrated the approach with multilayer LSTMs for sequence-to-sequence translation.[3]

The fixed-size interface is easy to see: a short incident history and a longer one both leave the encoder as a two-number state of the same shape, even though the values differ.

fixed-size-encoder-state.py
```python
import numpy as np

W_xh = np.array([[0.8, 0.2, 0.1], [0.1, 0.7, 0.4]])
W_hh = np.array([[0.5, 0.1], [0.0, 0.6]])

def encode(events):
    state = np.zeros(2)
    for event in events:
        state = np.tanh(W_xh @ event + W_hh @ state)
    return state

# Rows of the identity matrix are the three one-hot events in order.
history = np.eye(3)
for length in [1, 2, 3]:
    context = encode(history[:length])
    print(f"{length} event(s): shape={context.shape}, context={np.round(context, 3)}")
```

Output

```text
1 event(s): shape=(2,), context=[0.664 0.1  ]
2 event(s): shape=(2,), context=[0.494 0.641]
3 event(s): shape=(2,), context=[0.39  0.655]
```

That fixed-size context is both an interface and a bottleneck. The next chapter studies bottlenecks directly: encoders that learn compact representations and decoders that reconstruct from them.

Why attention shortens the route

An RNN has a serial dependency: $h_3$ can't be computed before $h_2$. In the Transformer's comparison of layer types, a recurrent layer has $O(n)$ sequential operations and an $O(n)$ maximum path length between positions. A self-attention layer has $O(1)$ sequential operations and an $O(1)$ maximum path length.[4] Full self-attention compares every allowed pair of positions, so its computation and attention-matrix size grow quadratically with sequence length.

| Question | Recurrent layer | Self-attention layer |
| --- | --- | --- |
| Can positions be computed in parallel within a layer? | No, state is serial | Yes, with the attention mask applied |
| Longest path between two positions in one layer | $O(n)$ | $O(1)$ |
| Must all earlier evidence fit in one running state? | Yes | No, each allowed position can be attended to directly |

This doesn't make recurrence useless. It establishes the design pressure that leads into attention: long dependency routes and a fixed running-state bottleneck.

The comparison is about processing positions already present in a sequence. Autoregressive generation still emits one new token at a time because each next token depends on tokens generated earlier.
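To make the contrast concrete, here is a minimal sketch of one masked scaled dot-product attention step over three positions. Everything in it is illustrative: the width `d = 4`, the random position vectors `X`, and the projection matrices `W_q`, `W_k`, and `W_v` are assumptions, not values from this lesson.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 4                        # three positions, assumed width 4
X = rng.normal(size=(n, d))        # stand-in vectors for the three events
W_q, W_k, W_v = (rng.normal(scale=0.5, size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
# One matrix product scores every pair of positions: shape (n, n).
scores = Q @ K.T / np.sqrt(d)

# Causal mask: position t may attend only to positions <= t.
allowed = np.tril(np.ones((n, n), dtype=bool))
scores = np.where(allowed, scores, -np.inf)

# Row-wise softmax; masked entries become exactly zero.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
output = weights @ V

print("attention matrix shape:", weights.shape)
print(np.round(weights, 3))
print("output shape:", output.shape)
```

No loop over timesteps appears: all rows are computed together, and the last row reads the first position directly instead of through two intermediate states. The `(n, n)` weight matrix is also where the quadratic cost shows up.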

Debugging checks

  • Print tensor shapes at the cell boundary: input feature size, hidden size, and output size should agree with the weight matrices.
  • Confirm that a timestep loop reuses weights. Creating new weights inside the loop changes the model.
  • Check which GRU convention your implementation uses before reading gate values.
  • For padded batches, prevent padding from updating state through packing or masking. The packed-GRU example above shows the failure directly.
  • Clip exploding gradients when needed, then diagnose vanishing memory separately.
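The first check can be sketched as a small assertion helper. `check_rnn_shapes` is a hypothetical function written for this lesson's NumPy layout, where `W_xh` has shape `(hidden, input)` and `W_out` has shape `(output, hidden)`.

```python
import numpy as np

def check_rnn_shapes(x, W_xh, W_hh, b_h, W_out):
    """Assert that cell-boundary shapes agree and return the inferred sizes."""
    hidden_size, input_size = W_xh.shape
    assert x.shape == (input_size,), f"event has shape {x.shape}, expected ({input_size},)"
    assert W_hh.shape == (hidden_size, hidden_size), f"W_hh has shape {W_hh.shape}"
    assert b_h.shape == (hidden_size,), f"b_h has shape {b_h.shape}"
    assert W_out.shape[1] == hidden_size, f"W_out has shape {W_out.shape}"
    return {"input": input_size, "hidden": hidden_size, "output": W_out.shape[0]}

x = np.array([1.0, 0.0, 0.0])
W_xh = np.zeros((2, 3))
W_hh = np.zeros((2, 2))
b_h = np.zeros(2)
W_out = np.zeros((3, 2))

sizes = check_rnn_shapes(x, W_xh, W_hh, b_h, W_out)
print("sizes:", sizes)

# A recurrent matrix with the wrong shape is caught before any forward pass.
try:
    check_rnn_shapes(x, W_xh, np.zeros((3, 3)), b_h, W_out)
    mismatch_caught = False
except AssertionError as error:
    mismatch_caught = True
    print("caught:", error)
```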

Mastery check

Key concepts

  • Ordered events require a state that's updated in order.
  • An RNN reuses one cell's weights across all timesteps.
  • The final hidden state can feed a standard logits and softmax action head.
  • BPTT multiplies recurrent derivatives across time, producing vanishing or exploding gradients.
  • LSTM uses a separate cell state plus forget, input, and output gates.
  • GRU uses update and reset gates around a single hidden state.
  • Packed recurrent inputs prevent padding rows from changing a shorter history's final state.
  • Encoder-decoder recurrence exposes the fixed-size sequence bottleneck.
  • Self-attention reduces the path length between distant positions within a layer.

Evaluation rubric

  • Foundational: Writes the RNN recurrence and explains weight sharing across timesteps.
  • Foundational: Traces the three-event forward pass and maps the final state to action logits.
  • Intermediate: Explains vanishing and exploding gradients as repeated derivative products.
  • Intermediate: Calculates one LSTM cell update and interprets its gates.
  • Intermediate: Reads a declared GRU update-gate convention without reversing keep and write behavior.
  • Intermediate: Demonstrates why a padded batch needs packing or masking before a final-state classifier.
  • Advanced: Compares a recurrent encoder bottleneck with a self-attention layer's direct paths.

Common pitfalls

  • Symptom: Reordering events doesn't affect the result. Cause: History was pooled as an unordered set. Fix: Process events in timestamp order and test a reversed sequence.
  • Symptom: Parameter count is claimed to grow with sequence length. Cause: Unrolled computation was confused with separate parameters. Fix: Mark every timestep as sharing W_xh and W_hh.
  • Symptom: Training produces unstable updates. Cause: Recurrent gradients explode. Fix: Inspect gradient norms and apply clipping while checking learning-rate and recurrence choices.
  • Symptom: Early evidence is never learned. Cause: Gradients vanish through many recurrent updates. Fix: Test gated recurrence or an attention-based architecture for the required dependency length.
  • Symptom: A GRU gate is explained backward. Cause: A different update-gate convention was assumed. Fix: State the exact equation before interpreting z_t.
  • Symptom: Short histories change prediction when batch padding changes. Cause: Padding rows updated the recurrent state as if they were events. Fix: Pack real sequence lengths or mask state updates before classification.

Mastery Check

1. An incident model must handle histories of 3, 30, or 300 events, keep event order, and not add learned weights for longer histories. Which design matches a basic RNN?
2. At step 3 of a vector RNN, h_2 = [0.494, 0.641], W_xh x_3 = [0.1, 0.4], and W_hh = [[0.5, 0.1], [0, 0.6]]. The zero-bias update applies tanh. For ignore, rollback, and escalate, W_out = [[0.7, -0.3], [-0.4, 0.8], [0.1, 0.2]]. Which result follows?
3. During BPTT, suppose the relevant scalar derivative along an early event's path is 0.6 at each of 12 recurrent steps. What does this imply, and why is gradient clipping not the fix for that signal?
4. An LSTM has c_{t-1} = 0.855. At a routine scan, its forget gate is 0.98, input gate is 0.02, candidate is 0.10, and output gate is 0.80. Using c_t = f_t c_{t-1} + i_t candidate and h_t = o_t tanh(c_t), what are the new states?
5. A GRU uses h_t = z_t h_{t-1} + (1 - z_t) candidate, where z_t is explicitly the fraction kept from old state. If h_{t-1} = 0.80, candidate = -0.20, and z_t = 0.95, what is h_t?
6. In a batch_first GRU batch with shape (2, 3, 3), the real lengths are [3, 2]. The third row of the second history is padding, and changing only that row changes the final state when the batch is run unpacked. What should the classifier use?
7. A recurrent encoder with hidden size 2 processes one history containing 2 events and another containing 200 events. A decoder receives only each encoder's final state. Which statement describes the interface and its bottleneck?
8. A model must compare distant positions already present in a long sequence within one layer. Which trade-off matches the recurrent layer versus full self-attention comparison?
A recurrent encoder compressed an ordered history into a fixed-size state; next you'll train encoders and decoders around a learned representation bottleneck.
References

[1] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8). https://doi.org/10.1162/neco.1997.9.8.1735

[2] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP 2014. https://arxiv.org/abs/1406.1078

[3] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. NeurIPS 2014. https://arxiv.org/abs/1409.3215

[4] Vaswani, A., et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762