Easy · Model Architecture

RNNs, LSTMs, and GRUs

Build intuition for recurrent neural networks before the Transformer era: unroll a tiny RNN by hand on a support-ticket event sequence, implement forward pass in NumPy, discover vanishing gradients, then see how LSTM and GRU gates solve the memory problem. Understand seq2seq encoder-decoder, BPTT, and exactly why self-attention and transformers displaced RNNs for long-context modeling in modern LLMs.


A support agent opens a ticket that has been open for four days. The history shows: customer reported "battery drains in 2 hours" (day 1), uploaded a blurry photo of the device (day 2), replied "still not fixed after restart" (day 3), and today added "I need a replacement before my trip." The AI system must decide the next action: escalate to tier-2, request more diagnostics, or offer a refund label. The decision depends on the entire conversation thread, not just the latest message.

A normal feed-forward network or CNN cannot read this thread because the length is variable and the order of events matters. Recurrent neural networks were invented precisely for this kind of sequential, variable-length data.[1]


💡 Key insight: An RNN re-uses the exact same weights at every timestep while carrying a hidden state forward. The hidden state is the network's only memory of everything that happened before the current input. This single design lets the model handle sequences of any length with a fixed number of parameters - but it also creates the vanishing gradient problem that LSTMs and GRUs were built to solve.


RNNs LSTMs GRUs chapter visual: unrolled RNN recurrence over support ticket events, vanishing gradient warning, LSTM cell with forget/input/output gates and cell state, GRU simplification, seq2seq encoder-decoder, BPTT arrows, why transformers win (parallel paths, direct long-range connections), NumPy forward pass success, and bridge to modern LLM hidden-state / KV-cache thinking.
Visual anchor: This chapter moves from one recurrent update to the full family of sequence models. Keep the hidden state in mind as the recurring idea; LSTM gates, GRU gates, seq2seq encoders, and modern KV caches all preserve or reshape that state.

The sequence modeling problem

Most of the data that matters in production AI is sequential: the ordered list of events in a support ticket, the tokens in a customer email, the clickstream that led to a purchase, or the steps in a warehouse pick list. Each new observation depends on the history that came before it.

A regular neural network expects a fixed-size vector. You could concatenate the entire history into one giant vector, but then:

  • Different tickets have different numbers of updates.
  • Early events get the same treatment as recent ones even though their influence should decay or persist differently.
  • You lose the natural notion of "time" or "order."

Recurrence solves the variable-length problem by processing one element at a time while maintaining a running summary (the hidden state).

The basic RNN recurrence

At every timestep $t$ an RNN computes:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

  • $x_t$ is the input at this step (a feature vector or token embedding).
  • $h_{t-1}$ is the hidden state from the previous step.
  • $W_{xh}$ and $W_{hh}$ are the same two weight matrices used at every timestep.
  • The output at this step can be produced from $h_t$ (or from the final $h_T$ after the whole sequence).

Because the same weights are reused, the model size does not grow with sequence length. The hidden state is the only thing that grows in influence (or fades) as time passes.
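A quick way to see this parameter efficiency is to count the weights: they depend only on the input and hidden dimensions, never on the sequence length. A minimal sketch, using the same 3-dimensional input and 4-dimensional hidden state as the worked example below:

python
# Parameter count of a vanilla RNN is fixed by input_dim and hidden_dim alone;
# it does not change whether the ticket has 3 events or 300.
input_dim, hidden_dim = 3, 4
n_params = (
    hidden_dim * input_dim      # W_xh
    + hidden_dim * hidden_dim   # W_hh
    + hidden_dim                # b_h
)
print(n_params)  # 32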

Let's make this concrete with a tiny support-ticket example.

Hand-worked RNN forward pass on a 3-event ticket

We will trace a 3-timestep sequence by hand and then implement the identical calculation in NumPy.

Each event is represented by a 3-dimensional feature vector:

  • urgency (0–1)
  • complexity (0–1)
  • customer tier score (higher = more important)

Sequence:

  • t=0: [0.2, 0.1, 0.8] - low urgency, simple issue, premium customer
  • t=1: [0.9, 0.7, 0.6] - high urgency, complex, mid-tier
  • t=2: [0.3, 0.2, 0.9] - medium urgency, simple, premium

We use a 4-dimensional hidden state and the following small shared weights (chosen so the arithmetic is readable):

$$W_{xh} = \begin{bmatrix} 0.5 & 0.1 & -0.2 \\ -0.3 & 0.8 & 0.4 \\ 0.2 & -0.1 & 0.6 \\ 0.7 & 0.3 & -0.4 \end{bmatrix} \qquad W_{hh} = \begin{bmatrix} 0.4 & 0.2 & -0.1 & 0.3 \\ 0.1 & 0.6 & 0.2 & -0.2 \\ -0.3 & 0.1 & 0.5 & 0.4 \\ 0.2 & -0.4 & 0.1 & 0.7 \end{bmatrix} \qquad b_h = [0.1, -0.05, 0.2, 0.0]$$

  • Timestep 0 (initial hidden state $h_0 = [0, 0, 0, 0]$): the network sees the first low-urgency premium event. The hidden state becomes a mild representation of "premium customer with easy issue."
  • Timestep 1: we feed the high-urgency complex update. The recurrence mixes the new input with the previous hidden state using the shared weights. After the matrix multiplies and tanh, the hidden state reflects both the initial premium context and the sudden high-urgency signal.
  • Timestep 2: the third event arrives. The hidden state is updated once more. By the end of the sequence the 4-dimensional vector $h_3$ encodes the entire history in a compressed form that can be passed to a classifier head or a decoder.

The full numeric trace and the resulting logits appear in the illustration below.

Unrolled RNN forward pass on a 3-timestep support ticket event sequence. Shows input one-hot or feature vectors for events (low priority, high urgency, resolved), shared weight matrices Wxh and Whh, hidden state evolution h0→h1→h2→h3, and the final output prediction. Numeric values computed by hand match the worked example in the article. Arrows show recurrence of the hidden state.
Visual anchor: Trace the exact numbers from the worked example. The same $W_{xh}$ and $W_{hh}$ are applied at every step while the hidden state carries cumulative context. This is the mental picture you should have for any RNN.

NumPy implementation of the forward pass

Here is a minimal, runnable implementation that reproduces the numbers above.

python
import numpy as np

def rnn_forward(X, Wxh, Whh, bh, h0=None):
    """
    X : (T, input_dim) - sequence of T feature vectors
    Returns hidden states (T, hidden_dim) and the final hidden state.
    """
    T, input_dim = X.shape
    hidden_dim = Whh.shape[0]
    if h0 is None:
        h = np.zeros(hidden_dim)
    else:
        h = h0.copy()

    hidden_states = []
    for t in range(T):
        x_t = X[t]
        # Linear transform of the input + recurrence + bias, squashed by tanh
        h = np.tanh(Wxh @ x_t + Whh @ h + bh)
        hidden_states.append(h.copy())
    return np.stack(hidden_states), h

# Same numbers as the worked example
X = np.array([
    [0.2, 0.1, 0.8],
    [0.9, 0.7, 0.6],
    [0.3, 0.2, 0.9],
])
Wxh = np.array([
    [0.5, 0.1, -0.2],
    [-0.3, 0.8, 0.4],
    [0.2, -0.1, 0.6],
    [0.7, 0.3, -0.4],
])
Whh = np.array([
    [0.4, 0.2, -0.1, 0.3],
    [0.1, 0.6, 0.2, -0.2],
    [-0.3, 0.1, 0.5, 0.4],
    [0.2, -0.4, 0.1, 0.7],
])
bh = np.array([0.1, -0.05, 0.2, 0.0])

hiddens, final_h = rnn_forward(X, Wxh, Whh, bh)
print("Final hidden state:", np.round(final_h, 2))
# Final hidden state: approx. [0.42 0.71 0.85 0.1]

You can now attach a small output layer (a single linear projection + softmax) to turn final_h into action probabilities: escalate, request photo, close ticket, etc.
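As a sketch of what that head could look like (the action names and random output weights below are illustrative placeholders, not part of the worked example; it reuses np and final_h from the snippet above):

python
def softmax(z):
    z = z - z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical action head: project the 4-dim hidden state to 4 action logits.
actions = ["escalate", "request_diagnostics", "offer_refund", "close_ticket"]
rng = np.random.default_rng(0)
W_out = rng.normal(scale=0.5, size=(len(actions), 4))   # (num_actions, hidden_dim)
b_out = np.zeros(len(actions))

probs = softmax(W_out @ final_h + b_out)
for name, p in zip(actions, probs):
    print(f"{name:>20s}: {p:.2f}")
# The weights are untrained and random, so these probabilities are meaningless
# until the head (and the RNN) are trained on labeled tickets.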

Backpropagation through time (BPTT)

Training an RNN means computing gradients of the loss with respect to the shared weights $W_{xh}$ and $W_{hh}$. Because the weights are reused at every timestep, the gradient for $W_{hh}$ must flow backward through every timestep that used it.

This is exactly backpropagation, but the graph is the unrolled chain of $T$ steps. The algorithm is therefore called Backpropagation Through Time.

The problem appears immediately: the gradient for an early hidden state $h_0$ includes the product

$$\frac{\partial L}{\partial h_0} = \frac{\partial L}{\partial h_T} \cdot \left( \prod_{k=1}^{T} \frac{\partial h_k}{\partial h_{k-1}} \right)$$

Each term $\partial h_k / \partial h_{k-1}$ contains $W_{hh}$ multiplied by the derivative of tanh (which is ≤ 1). If the eigenvalues of $W_{hh}$ are smaller than 1 in magnitude, the product shrinks exponentially with $T$. After 20–30 steps the gradient from the first event is effectively zero. That is the vanishing gradient problem.

The opposite problem (exploding gradients) happens when the spectral radius of $W_{hh}$ is greater than 1: gradients grow without bound and training becomes unstable.
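You can watch both effects numerically. Since the tanh derivative at step $k$ is $1 - h_k^2$, each Jacobian $\partial h_k / \partial h_{k-1}$ can be formed directly from the recorded hidden states. A small sketch reusing rnn_forward and the weights from the NumPy example above (tiling the 3-event ticket is just a way to simulate a long sequence):

python
# The gradient reaching h_0 is scaled by the product of per-step Jacobians:
#   dh_k/dh_{k-1} = diag(1 - h_k^2) @ W_hh
# Repeat the 3-event ticket 20 times to simulate a 60-step sequence.
long_X = np.tile(X, (20, 1))
hs, _ = rnn_forward(long_X, Wxh, Whh, bh)

J = np.eye(4)
norms = []
for h in hs[::-1]:                       # walk backwards through time
    J = J @ (np.diag(1 - h**2) @ Whh)
    norms.append(np.linalg.norm(J))

print("gradient scale after  5 steps:", norms[4])
print("gradient scale after 60 steps:", norms[-1])
# With these small weights the norm should shrink toward zero: the first
# event's contribution to the gradient effectively disappears. A recurrent
# matrix with spectral radius above 1 would instead make the norms explode.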

LSTM: protecting the memory with gates

The Long Short-Term Memory cell (Hochreiter & Schmidhuber, 1997) introduces an explicit memory cell $c_t$ that is protected from the vanishing problem by multiplicative gates.

At each timestep the LSTM computes three gates (all sigmoid activations, values in [0,1]) and a candidate cell update:

$$\begin{aligned} f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\ i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\ \tilde{c}_t &= \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) && \text{(candidate)} \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\ o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\ h_t &= o_t \odot \tanh(c_t) \end{aligned}$$
  • The forget gate decides what to erase from the cell state.
  • The input gate decides what new information to write.
  • The output gate decides what parts of the cell state to expose in the hidden state $h_t$.

Because the cell state $c_t$ is updated by addition (not by repeated multiplication with $W_{hh}$), gradients can flow backward through many timesteps with only the forget gate values (between 0 and 1) as multipliers. When the forget gate stays close to 1, the gradient does not vanish.
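To make the gate mechanics concrete, here is a minimal NumPy sketch of one LSTM step. It assumes the common trick of stacking the four transforms into a single weight matrix; the random initialization and the reuse of the 3-event ticket sequence X from the earlier snippet are illustrative, not a trained model:

python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep. W maps the concatenated [h_prev, x_t] to the stacked
    pre-activations of the forget, input, candidate, and output transforms."""
    hx = np.concatenate([h_prev, x_t])
    H = h_prev.shape[0]
    z = W @ hx + b                          # shape (4*H,)
    f = sigmoid(z[0*H:1*H])                 # forget gate
    i = sigmoid(z[1*H:2*H])                 # input gate
    c_tilde = np.tanh(z[2*H:3*H])           # candidate cell update
    o = sigmoid(z[3*H:4*H])                 # output gate
    c = f * c_prev + i * c_tilde            # additive cell-state update
    h = o * np.tanh(c)
    return h, c

H, D = 4, 3                                 # hidden and input sizes as before
rng = np.random.default_rng(0)
W = rng.normal(scale=0.3, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in X:                               # the 3-event ticket from earlier
    h, c = lstm_step(x_t, h, c, W, b)
print("LSTM hidden state after 3 events:", np.round(h, 2))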

GRU: a lighter gated RNN

The Gated Recurrent Unit (Cho et al., 2014) simplifies the LSTM to two gates while keeping most of the benefit:

$$\begin{aligned} z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]) && \text{(update gate)} \\ r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]) && \text{(reset gate)} \\ \tilde{h}_t &= \tanh(W \cdot [r_t \odot h_{t-1}, x_t]) \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \end{aligned}$$

No separate cell state, no output gate. Many practitioners find GRU performance nearly identical to LSTM on moderate-length sequences while using ~25% fewer parameters and running faster.
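The same style of sketch for one GRU step (reusing np, X, sigmoid, and rng from the snippets above; the random weights are again illustrative) shows what disappears: no cell state and no output gate, only an interpolation controlled by the update gate.

python
def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU timestep following the equations above (biases omitted,
    matching the formulas)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx)                          # update gate
    r = sigmoid(Wr @ hx)                          # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))
    return (1 - z) * h_prev + z * h_tilde         # blend old and candidate state

H, D = 4, 3
Wz, Wr, Wh = [rng.normal(scale=0.3, size=(H, H + D)) for _ in range(3)]
h = np.zeros(H)
for x_t in X:
    h = gru_step(x_t, h, Wz, Wr, Wh)
print("GRU hidden state after 3 events:", np.round(h, 2))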

Sequence-to-sequence (seq2seq) with encoder-decoder RNNs

A common pattern before 2017 was the encoder-decoder architecture:

  1. An encoder RNN (usually LSTM or GRU) reads the entire input sequence (the full support ticket thread) and produces a final hidden state that acts as a compressed "context vector."
  2. A decoder RNN (another LSTM/GRU) starts from that context vector and generates an output sequence one token at a time (the resolution summary, the next recommended action, or a reply).

At inference time the decoder feeds its own previous prediction back as input at each step; during training, the ground-truth previous token is usually fed instead (a trick known as teacher forcing).

This pattern powered early neural machine translation and many early dialogue systems. The context vector was the bottleneck - it had to compress an arbitrarily long input into a fixed-size vector.
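Structurally, the whole pattern fits in a short sketch. The encoder_step, decoder_step, and output_head callables below are hypothetical stand-ins for recurrent cells and a vocabulary projection; this shows the shape of the loop, not a trained translator:

python
def seq2seq_generate(input_seq, encoder_step, decoder_step, output_head,
                     start_token, end_token, max_len=20):
    """Encoder-decoder generation loop with greedy decoding at inference time."""
    # 1. Encode the whole input sequence into one fixed-size context vector.
    h = None
    for x_t in input_seq:
        h = encoder_step(x_t, h)
    context = h                        # the bottleneck the chapter describes

    # 2. Decode one token at a time, feeding each prediction back as input.
    h_dec, token, output = context, start_token, []
    for _ in range(max_len):
        h_dec = decoder_step(token, h_dec)
        token = output_head(h_dec)     # e.g. argmax over the vocabulary
        if token == end_token:
            break
        output.append(token)
    return output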

Why transformers displaced RNNs for long context

The Transformer paper ("Attention Is All You Need", 2017) exposed several fundamental limitations of RNNs that become fatal at scale:

Aspect | RNN / LSTM | Transformer
Path length for distant tokens | O(T) - must travel through every intermediate hidden state | O(1) - direct attention connection
Training parallelism | Sequential; each step waits for the previous one | Fully parallel across the sequence
Gradient flow | Chained multiplications through time | Residual connections + attention give direct highways
Long-range dependencies | Vanish after ~20–50 steps | Learned directly at any distance

Because every token can attend to every other token in one layer, the model does not need to compress history into a single running hidden state. Residual connections around attention and feed-forward sublayers keep gradients flowing even in 100-layer networks.

The hidden-state mental model is not gone, however. Modern LLMs still maintain a form of running state during generation (the KV cache). The KV cache is the practical descendant of the RNN hidden state: it stores the keys and values computed so far so the model does not recompute them for every new token. Many newer architectures (Mamba, RWKV, RetNet) explicitly bring linear-time recurrence back while keeping the parallel training advantages of transformers.
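A toy sketch of that contrast, assuming nothing beyond single-head scaled dot-product attention with made-up random projections (no real model, positions, or tokenizer): each new token's key and value are computed once and appended to a cache, so the state grows with the sequence instead of being compressed into one vector like $h_t$.

python
def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V

# Toy incremental decoding loop: the KV cache grows by one entry per token.
d = 4
rng = np.random.default_rng(1)
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
cache_K, cache_V = [], []
for step in range(5):
    x = rng.normal(size=d)             # stand-in for the current token embedding
    cache_K.append(Wk @ x)             # computed once, reused by all later tokens
    cache_V.append(Wv @ x)
    out = attend(Wq @ x, np.stack(cache_K), np.stack(cache_V))
print("cached entries after 5 tokens:", len(cache_K))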

What you can now do

  • Trace any variable-length sequence through a plain RNN by hand and in NumPy.
  • Explain to a teammate why a 40-message ticket thread defeats a vanilla RNN but is trivial for a Transformer.
  • Choose between LSTM and GRU when you actually need a recurrent component (streaming, very low memory, online state).
  • Recognize the architectural ideas that survived the Transformer revolution (running state, gated memory) inside today's production LLM inference engines.

Evaluation Rubric
  • 1
    Writes the RNN recurrence equation and explains that the same weight matrices are reused at every timestep
  • 2
    Traces a complete hand-worked forward pass for a 3–4 event support-ticket sequence showing how the hidden state evolves
  • 3
    Identifies the vanishing gradient problem: repeated multiplication by the recurrent weight matrix during BPTT
  • 4
    Draws or describes the LSTM cell with forget gate f_t, input gate i_t, candidate cell update, and output gate o_t
  • 5
    Implements a minimal RNN forward pass (no training) in pure NumPy with correct tensor shapes at each timestep
  • 6
    Compares LSTM vs GRU parameter count and explains when the simpler GRU is often sufficient
  • 7
    Explains why the Transformer’s attention + residual design gives direct gradient paths and O(1) dependency distance compared with RNN’s O(T) chain
Common Pitfalls
  • Treating the hidden state as a simple summary instead of realizing it is the only carrier of information from all previous timesteps
  • Forgetting that every timestep re-uses exactly the same Wxh and Whh matrices. This is both the strength (parameter efficiency) and the source of vanishing gradients
  • Implementing BPTT incorrectly by detaching the hidden state or truncating the graph too early in PyTorch
  • Assuming LSTM always beats GRU; on many moderate-length sequence tasks the GRU performs similarly with 25% fewer parameters
  • Believing that padding and packing in RNN batches is optional. Incorrect masking leads to leakage from padding tokens into the hidden state
  • Trying to stack 10+ plain RNN layers without residuals or layer norm; the vanishing problem compounds with depth as well as length

Key Concepts Tested
  • Recurrence relation: hidden state h_t computed from the current input and previous hidden state using shared weights
  • Unrolled computational-graph view of an RNN over a variable-length sequence of ticket events
  • Vanishing and exploding gradients during BPTT and why long-range dependencies are hard for plain RNNs
  • LSTM cell state and three gates (forget, input, output) that protect and update memory across timesteps
  • GRU with only two gates (update and reset) as a lighter-weight alternative to the LSTM
  • Seq2seq encoder-decoder architecture for mapping one sequence (ticket history) to another (resolution steps)
  • Why transformers win: O(1) maximum path length, full parallelism, and residual connections for gradient flow
  • Hand-worked 3-timestep RNN forward pass with explicit matrix multiplications and tanh activations in NumPy
  • Mental-model bridge: RNN hidden-state intuition lives on in the KV cache and modern state-space models
Next Step
Next: Continue to Autoencoders, VAEs, and Generative Modeling Basics

You now know how neural networks can carry information through time in a hidden state. The next chapter turns from sequence memory to representation compression, showing how an encoder can squeeze images or examples into a latent code that later generative models can reuse.

References

[1] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature, 323.

[2] Vaswani, A., et al. (2017). Attention Is All You Need.

[3] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.
