LeetLLM
LearnFeaturesBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Features
  • Blog

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 155 articles completed

🛠️Computing Foundations0/6
NumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models & Image GenerationReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnInference & Production ScaleMamba & State Space Models
🚀HardInference Optimization

Mamba & State Space Models

Master linear-time sequence modeling: from S4 and HiPPO to Mamba's selective recurrence, Mamba-2's SSD framework, Mamba-3's inference-first refinements, and modern hybrid Transformer-SSM designs.

35 min read
Learning path
Step 138 of 155 in the full curriculum
Mixture of Experts ArchitectureReasoning & Test-Time Compute

Mamba & State Space Models

Mixture of Experts showed one way to spend less compute per token: route each token through only a few feed-forward network (FFN) experts. State Space Models (SSMs) attack a different cost: attention over long histories.

State space models such as Mamba are alternatives to transformer attention for long sequences. This chapter explains the tradeoff: linear-time recurrence and selective state updates instead of keeping every previous token visible through attention.

Imagine processing a 1,000-event order timeline. One strategy is to keep every previous event open and visible while reading each new one. This approach catches every detail, but by event 500, your workspace is buried under history and it becomes unbearably slow. That's how transformers behave during autoregressive generation: each new token attends over the growing key-value (KV) cache, so decoding gets more expensive as context grows.

State Space Models (SSMs) take a different approach: instead of rereading everything, they keep a compact running summary that updates with each new event. In recurrent form, the update cost stays linear in sequence length and the model carries a fixed-size state instead of a context-length-dependent cache.[1] The Mamba architecture[1] made selective SSMs competitive with transformers at language-model scale, and later work like Mamba-2[2], Mamba-3[3], and hybrid architectures like Jamba[4] and Nemotron-H[5] pushed that line of work further.

Understanding SSMs matters because long-context serving is often limited by attention's growing KV cache, while Mamba-style models push historical information into a fixed-size recurrent state during decoding.[1]

Set expectations up front: a fixed recurrent state removes KV-cache growth during decode, but it also creates a retrieval risk because history is compressed rather than explicitly addressable. Hybrid Transformer-SSM models are one candidate when a workload needs both bounded decode state and occasional attention access. This chapter teaches the mechanism and the measurements needed to decide whether that trade pays off.

Transformer vs. Mamba comparison showing growing KV-cache access versus one fixed-size recurrent state. Transformer vs. Mamba comparison showing growing KV-cache access versus one fixed-size recurrent state.
Attention keeps token-level history available through a growing KV cache. Mamba-style SSMs keep a fixed recurrent state, reducing decode-state growth while turning history into a compressed summary.

From control theory to language models

The math behind SSMs is rooted in continuous systems, but the core idea is simple. Think of a warehouse routing controller: it has a current state (current backlog and carrier capacity), receives input (new orders and scan updates), and produces output (which lane or carrier should handle the next package). The state updates based on both its previous state and the new input. SSMs do exactly this for text: they maintain a hidden state that continuously updates as each new word arrives.

State Space Models originate from control theory, where they describe dynamical systems via continuous differential equations. The key variables are:

  • x(t)∈Rx(t) \in \mathbb{R}x(t)∈R: the input signal
  • h(t)∈Rsh(t) \in \mathbb{R}^sh(t)∈Rs: the hidden state, a vector of dimension sss that acts as a compressed summary of all past inputs
  • y(t)∈Ry(t) \in \mathbb{R}y(t)∈R: the output
  • A∈Rs×s\mathbf{A} \in \mathbb{R}^{s \times s}A∈Rs×s: the state transition matrix, shaping how much past information to retain
  • B∈Rs×1\mathbf{B} \in \mathbb{R}^{s \times 1}B∈Rs×1: the input matrix, shaping how much new information to absorb
  • C∈R1×s\mathbf{C} \in \mathbb{R}^{1 \times s}C∈R1×s: the output matrix, which reads out the relevant part of the state

These equations are often written for scalar input/output to keep the notation readable. Real neural layers run many channels in parallel, but the same recurrence applies channel-wise.

From these, the continuous dynamics are:

h′(t)=Ah(t)+Bx(t)h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t)h′(t)=Ah(t)+Bx(t) y(t)=Ch(t)+Dx(t)y(t) = \mathbf{C}h(t) + \mathbf{D}x(t)y(t)=Ch(t)+Dx(t)

What this means

At every moment, the hidden state h(t)h(t)h(t) does two things simultaneously: it decays based on A\mathbf{A}A (controlled forgetting), and it absorbs new input scaled by B\mathbf{B}B. The output y(t)y(t)y(t) is then a linear readout of the state via C\mathbf{C}C. The D\mathbf{D}D term is a skip connection that lets the input pass through unchanged, similar to a residual connection. Think of the hidden state as a running summary that continuously updates as new tokens arrive.

Discretization

Tokens are discrete, but the equations above describe smooth, continuous change. To apply them to token sequences, we discretize by sampling at intervals of Δ\DeltaΔ timesteps. The standard approach is zero-order hold (ZOH), which converts the continuous matrices into discrete equivalents:[6]

Aˉ=eΔABˉ=(ΔA)−1(eΔA−I)⋅ΔB\begin{aligned} \bar{\mathbf{A}} &= e^{\Delta \mathbf{A}} \\ \bar{\mathbf{B}} &= (\Delta \mathbf{A})^{-1} \left(e^{\Delta \mathbf{A}} - \mathbf{I}\right) \cdot \Delta \mathbf{B} \end{aligned}AˉBˉ​=eΔA=(ΔA)−1(eΔA−I)⋅ΔB​

where eΔAe^{\Delta \mathbf{A}}eΔA is the matrix exponential. These discrete matrices encode the same dynamics as the continuous equations, but in a form suitable for step-by-step updates.

The discrete recurrence then becomes:

ht=Aˉht−1+Bˉxth_t = \bar{\mathbf{A}} h_{t-1} + \bar{\mathbf{B}} x_tht​=Aˉht−1​+Bˉxt​ yt=Cht+Dxty_t = \mathbf{C} h_t + \mathbf{D} x_tyt​=Cht​+Dxt​

At each step: decay the old state by Aˉ\bar{\mathbf{A}}Aˉ, absorb the new input scaled by Bˉ\bar{\mathbf{B}}Bˉ, and read out through C\mathbf{C}C. The D\mathbf{D}D term is a skip connection (the input contributes directly to the output). For this recurrent layer, one decode update uses state sized by the model rather than by preceding context length.

Walkthrough with numbers

To make the recurrence concrete, imagine a one-dimensional state tracking the running importance of package scan events. Set:

  • Aˉ=0.9\bar{A} = 0.9Aˉ=0.9 (the state keeps 90% of its previous value)
  • Bˉ=0.2\bar{B} = 0.2Bˉ=0.2 (new input contributes 20%)
  • Cˉ=1.0\bar{C} = 1.0Cˉ=1.0 (read the full state)
  • Dˉ=0\bar{D} = 0Dˉ=0 (no skip)

Now process four scans with importance scores x=[3,1,4,2]x = [3, 1, 4, 2]x=[3,1,4,2]:

StepInput xtx_txt​State ht=0.9 ht−1+0.2 xth_t = 0.9 \, h_{t-1} + 0.2 \, x_tht​=0.9ht−1​+0.2xt​Output yty_tyt​
130.9⋅0+0.2⋅3=0.600.9 \cdot 0 + 0.2 \cdot 3 = 0.600.9⋅0+0.2⋅3=0.600.60
210.9⋅0.60+0.2⋅1=0.740.9 \cdot 0.60 + 0.2 \cdot 1 = 0.740.9⋅0.60+0.2⋅1=0.740.74
340.9⋅0.74+0.2⋅4=1.470.9 \cdot 0.74 + 0.2 \cdot 4 = 1.470.9⋅0.74+0.2⋅4=1.471.47
420.9⋅1.47+0.2⋅2=1.720.9 \cdot 1.47 + 0.2 \cdot 2 = 1.720.9⋅1.47+0.2⋅2=1.721.72

Notice how the state smooths the input sequence. A high input like 4 pulls the state up, but the old state still carries weight. This compression is the whole point: instead of storing every scan, the SSM keeps a fixed-size summary.

discrete-recurrence.py
1inputs = [3, 1, 4, 2] 2a_bar = 0.9 3b_bar = 0.2 4state = 0.0 5states = [] 6 7for value in inputs: 8 state = a_bar * state + b_bar * value 9 states.append(round(state, 3)) 10 11print("states:", states) 12print("final state:", states[-1])
Output
1states: [0.6, 0.74, 1.466, 1.719] 2final state: 1.719

If Aˉ\bar{A}Aˉ were 0.1 instead of 0.9, the final state would change sharply. With the same inputs and Bˉ=0.2\bar{B}=0.2Bˉ=0.2, the final state is about 0.4830.4830.483 instead of 1.7191.7191.719: earlier events contribute far less. Whether that is harmful depends on what information the task needs to retain.

S4: structured state spaces

The Structured State Space for Sequence Modeling (S4)[6] advanced neural SSMs by giving them structured long-range dynamics and efficient computation views. The following diagram illustrates the basic S4 architecture, where the hidden state is updated iteratively using structured matrices.

State space recurrence diagram showing input and prior state becoming one token-step update and output readout. State space recurrence diagram showing input and prior state becoming one token-step update and output readout.
The SSM layer is easiest to read as a state update: combine the previous state with the new input, produce a new state, then read an output. S4 made this practical by adding structure that preserves long-range memory and supports efficient training.

HiPPO initialization

In S4, the continuous-time A\mathbf{A}A matrix starts from the HiPPO (High-order Polynomial Projection Operators) framework, which compresses input history into coefficients of orthogonal polynomials.[7][6]

Without this structure, a long recurrence would tend to wash out older information. HiPPO gives S4 a principled way to preserve information over long horizons instead of relying on an arbitrary recurrent matrix to discover that behavior from scratch.

Dual computation modes

S4 can be computed in two equivalent ways. For notational simplicity, the costs below are written for one channel; a full layer runs many channels in parallel.

Recurrent mode (generation)

Efficient for generation: ht=Aˉht−1+Bˉxt,yt=Cht+Dxth_t = \bar{\mathbf{A}} h_{t-1} + \bar{\mathbf{B}} x_t, \quad y_t = \mathbf{C} h_t + \mathbf{D} x_tht​=Aˉht−1​+Bˉxt​,yt​=Cht​+Dxt​

Time: O(L⋅s)O(L \cdot s)O(L⋅s), Memory: O(s)O(s)O(s) per step.

Convolutional mode (training)

Efficient for training: y=x∗Kˉ,where Kˉt=CAˉtBˉy = x * \bar{K}, \quad \text{where } \bar{K}_t = \mathbf{C} \bar{\mathbf{A}}^t \bar{\mathbf{B}}y=x∗Kˉ,where Kˉt​=CAˉtBˉ

In plain terms

The recurrence above can be "unrolled" into a single convolution with a pre-computed kernel Kˉ\bar{K}Kˉ. This allows processing the entire sequence at once using FFT (Fast Fourier Transform), so the sequence-mixing part of training runs in O(Llog⁡L)O(L \log L)O(LlogL) time instead of a naive token-by-token loop. Same math, different execution strategy.

The LTI problem

Here's S4's critical limitation. Imagine a conveyor belt in a factory that processes every item at exactly the same speed with exactly the same steps, whether it's a diamond ring or a pebble. That's efficient, but it can't prioritize. It can't be told to "slow down for the diamond and speed past the pebbles."

S4 is a Linear Time-Invariant (LTI) system: matrices A\mathbf{A}A, B\mathbf{B}B, and C\mathbf{C}C are fixed regardless of input content. It can represent position-dependent decay through recurrence, but it cannot change its write/read policy because a particular input token is important.[1]

This is different from attention, where query-key interactions provide content-dependent access. Fixed LTI SSMs struggle on selective-copy style tasks used to test content-aware memory; Mamba addresses that weakness with input-dependent parameters.

Mamba: selective state spaces

Mamba[1] solves the LTI limitation with one core idea: make part of the SSM input-dependent.

The selective mechanism

Think of a conveyor scanner moving through package events. Routine scans such as "label printed" may need a small state update, while a phrase like "address corrected after failed delivery" may need a stronger update. The content of each event conditions the update rule. That is Mamba's selective mechanism.

Instead of fixed B\mathbf{B}B, C\mathbf{C}C, and Δ\DeltaΔ, Mamba computes them as functions of the input:

Bt=LinearB(xt),Ct=LinearC(xt),Δt=softplus(LinearΔ(xt))\mathbf{B}_t = \text{Linear}_B(x_t), \quad \mathbf{C}_t = \text{Linear}_C(x_t), \quad \Delta_t = \text{softplus}(\text{Linear}_\Delta(x_t))Bt​=LinearB​(xt​),Ct​=LinearC​(xt​),Δt​=softplus(LinearΔ​(xt​))

Unlike S4, which uses the same B\mathbf{B}B, C\mathbf{C}C, and Δ\DeltaΔ for every token, Mamba computes them fresh from each token's content. The learned transition A\mathbf{A}A and skip term D\mathbf{D}D remain input-independent in Mamba-1; selectivity enters through Bt\mathbf{B}_tBt​, Ct\mathbf{C}_tCt​, and Δt\Delta_tΔt​.[1]

The step size Δt\Delta_tΔt​ is especially important: it controls how much the model updates its hidden state:

  • Large Δt\Delta_tΔt​: Decay older state faster and absorb more of the current input
  • Small Δt\Delta_tΔt​: Keep more of the previous state and make a smaller update
delta-controls-retention.py
1from math import exp 2 3a = -1.0 4for delta in (0.1, 1.0, 2.0): 5 retained_fraction = exp(delta * a) 6 print(f"delta={delta:.1f} retains {retained_fraction:.3f} of previous state")
Output
1delta=0.1 retains 0.905 of previous state 2delta=1.0 retains 0.368 of previous state 3delta=2.0 retains 0.135 of previous state

This gives Mamba a learned, content-dependent recurrent update while retaining linear-length recurrent computation. It does not give every output explicit access to every prior token as attention does.

Beginners often assume Mamba "remembers everything" because it avoids the KV cache. It doesn't. A large Δt\Delta_tΔt​ can make the model forget past tokens just as aggressively as attention sometimes ignores distant positions. The difference is that Mamba learns how much to forget per token, not whether to access a full history. In a package-tracking model, a large Δt\Delta_tΔt​ on a routine "label printed" scan could wash out the previous "address corrected" event if the model hasn't learned to gate that update carefully.

Architecture

A Mamba block replaces attention with a specific stack: projection, causal convolution, selective SSM update, and gating.[1] It uses SiLU (Sigmoid Linear Unit) activations and softplus (a smooth approximation of ReLU defined as log(1 + e^x), which is always positive and differentiable everywhere) to keep the step sizes positive:

Mamba block diagram showing projection split, causal convolution, selective scan, gate path, and output projection. Mamba block diagram showing projection split, causal convolution, selective scan, gate path, and output projection.
A Mamba block is not just a recurrence. The production block projects the token, mixes nearby tokens with a causal convolution, computes token-dependent SSM parameters, runs selective scan, gates the result, and projects back to the model dimension.

The code below focuses on the selective recurrence itself rather than the entire fused production block. It leaves out the causal convolution and output gate on purpose, so the core SSM update stays easy to inspect and the tensor shapes stay correct.

architecture.py
1import torch 2import torch.nn as nn 3import torch.nn.functional as F 4 5class SelectiveSSM(nn.Module): 6 """Reference selective SSM. Clear and correct, but not hardware optimized.""" 7 8 def __init__(self, d_model: int, d_state: int = 64): 9 super().__init__() 10 self.u_proj = nn.Linear(d_model, d_state) 11 self.b_proj = nn.Linear(d_model, d_state) 12 self.c_proj = nn.Linear(d_model, d_state) 13 self.dt_proj = nn.Linear(d_model, d_state) 14 self.a_log = nn.Parameter(torch.zeros(d_state)) 15 self.out_proj = nn.Linear(d_state, d_model) 16 17 def forward(self, x: torch.Tensor) -> torch.Tensor: 18 batch, length, _ = x.shape 19 A = -torch.exp(self.a_log).view(1, -1) # Stable diagonal dynamics. 20 h = x.new_zeros(batch, A.size(-1)) 21 outputs = [] 22 23 for t in range(length): 24 x_t = x[:, t, :] 25 u_t = self.u_proj(x_t) 26 B_t = self.b_proj(x_t) 27 C_t = self.c_proj(x_t) 28 delta_t = F.softplus(self.dt_proj(x_t)) 29 30 A_bar = torch.exp(delta_t * A) 31 B_bar = torch.where( 32 A.abs() > 1e-6, 33 ((A_bar - 1.0) / A) * B_t, 34 delta_t * B_t, 35 ) 36 37 h = A_bar * h + B_bar * u_t 38 y_t = self.out_proj(C_t * h) + x_t # Simple skip connection. 39 outputs.append(y_t) 40 41 return torch.stack(outputs, dim=1) 42 43torch.manual_seed(0) 44ssm = SelectiveSSM(d_model=8, d_state=4) 45x = torch.randn(1, 2, 8) 46y = ssm(x) 47 48print("input shape:", tuple(x.shape)) 49print("output shape:", tuple(y.shape)) 50print("state dimension:", ssm.a_log.numel())
Output
1input shape: (1, 2, 8) 2output shape: (1, 2, 8) 3state dimension: 4

Separate from recurrence math, the serving-memory claim is easy to sanity-check numerically.

state_scaling.py
1def transformer_kv_bytes(layers: int, heads: int, head_dim: int, tokens: int, bytes_per_value: int = 2) -> int: 2 return 2 * layers * heads * head_dim * tokens * bytes_per_value 3 4def ssm_state_bytes( 5 layers: int, 6 inner_channels: int, 7 state_size: int, 8 conv_width: int, 9 bytes_per_value: int = 2, 10) -> int: 11 recurrent_state = layers * inner_channels * state_size 12 convolution_state = layers * inner_channels * (conv_width - 1) 13 return (recurrent_state + convolution_state) * bytes_per_value 14 15layers = 32 16heads = 32 17head_dim = 128 18inner_channels = 4096 19state_size = 16 20conv_width = 4 21 22for tokens in [128, 512, 2048, 8192]: 23 kv_mb = transformer_kv_bytes(layers, heads, head_dim, tokens) / 1024**2 24 ssm_mb = ssm_state_bytes(layers, inner_channels, state_size, conv_width) / 1024**2 25 print(f"context={tokens:5d} -> transformer_kv={kv_mb:7.1f} MB, ssm_state={ssm_mb:4.1f} MB")
Output
1context= 128 -> transformer_kv= 64.0 MB, ssm_state= 4.8 MB 2context= 512 -> transformer_kv= 256.0 MB, ssm_state= 4.8 MB 3context= 2048 -> transformer_kv= 1024.0 MB, ssm_state= 4.8 MB 4context= 8192 -> transformer_kv= 4096.0 MB, ssm_state= 4.8 MB

What goes wrong when you build this from scratch

The code above is shape-correct for the simplified recurrence, but beginners hit predictable snags when they try to extend it. Here are three common symptoms, their causes, and how to fix them.

SymptomCauseFix
State values explode to NaN within a few stepsA_bar can exceed 1.0 if softplus isn't used to keep Δt\Delta_tΔt​ positive, or if A\mathbf{A}A isn't constrained negativeKeep softplus on Δt\Delta_tΔt​ and initialize a_log so that A = -exp(a_log) stays negative
Model acts like a plain RNN with no content awarenessB_t and C_t are accidentally hardcoded as constants instead of Linear(x_t) projectionsVerify that b_proj and c_proj are actually called with the current token x_t inside the loop
Training is unbearably slow on long sequencesUsing a Python for loop over the sequence dimension on GPUThe reference code above is for reading, not training. Production Mamba relies on fused CUDA scan kernels that keep state in SRAM, not Python loops

The hardware-aware parallel scan

The key engineering innovation in Mamba is the hardware-aware scan algorithm. At first glance, the recurrence ht=Aˉtht−1+Bˉtxth_t = \bar{\mathbf{A}}_t h_{t-1} + \bar{\mathbf{B}}_t x_tht​=Aˉt​ht−1​+Bˉt​xt​ appears strictly sequential. If ht−1h_{t-1}ht−1​ must be known before computing hth_tht​, processing a sequence seems to require a slow token-by-token loop, which would severely bottleneck training on GPUs compared to the heavily parallelized attention mechanism.

The bottleneck isn't only arithmetic. It's also memory traffic: if every token update spills intermediate state back to high-bandwidth memory (HBM), the recurrence becomes bandwidth-bound. Mamba addresses that by fusing discretization and state updates so the recurrent state stays in fast on-chip static random-access memory (SRAM) as much as possible, then using an associative scan primitive to recover parallelism across positions.[1]

Because the recurrence is associative after the right reformulation, the model can group updates and compute intermediate states in a tree-like structure. Consider the task of calculating a running total for a long list of numbers. Doing it sequentially means adding one number at a time. A parallel scan is like splitting the list into smaller chunks, calculating subtotals for each chunk simultaneously, and then combining those subtotals in a tree-like fashion. This way, the final running total is calculated much faster.

Selective scan diagram showing chunked training reduction and one-step serving recurrence. Selective scan diagram showing chunked training reduction and one-step serving recurrence.
Training builds all hidden states with a chunked scan. Serving only advances one recurrent state at a time.

The algorithm operates in three phases to improve hardware utilization:

  1. First, it computes all the input-dependent parameters and pairs of (Aˉt,Bˉtxt)(\bar{\mathbf{A}}_t, \bar{\mathbf{B}}_t x_t)(Aˉt​,Bˉt​xt​) in parallel across the entire sequence.
  2. Next, it applies the parallel prefix sum using a highly optimized associative operator to combine these terms.
  3. Finally, it reads off all the hidden state values hth_tht​ simultaneously.
associative-scan-composition.py
1def compose(right: tuple[float, float], left: tuple[float, float]) -> tuple[float, float]: 2 """Compose h -> a*h + b transforms: apply left, then right.""" 3 a_right, b_right = right 4 a_left, b_left = left 5 return a_right * a_left, a_right * b_left + b_right 6 7steps = [(0.9, 0.6), (0.9, 0.2), (0.9, 0.8), (0.9, 0.4)] 8left_chunk = compose(steps[1], steps[0]) 9right_chunk = compose(steps[3], steps[2]) 10whole_sequence = compose(right_chunk, left_chunk) 11 12print("left chunk transform:", tuple(round(value, 3) for value in left_chunk)) 13print("right chunk transform:", tuple(round(value, 3) for value in right_chunk)) 14print("state from zero:", round(whole_sequence[1], 3))
Output
1left chunk transform: (0.81, 0.74) 2right chunk transform: (0.81, 1.12) 3state from zero: 1.719

By breaking the worst sequential dependency, this hardware-aware approach achieves O(L)O(L)O(L) total work with much smaller critical-path depth than a naive recurrent loop. That makes Mamba trainable on GPUs, but it still depends on custom scan kernels and doesn't map as cleanly to tensor-core matmuls as transformers do. That remaining gap is what Mamba-2 addresses.

Mamba-2: state space duality

While Mamba proved that selective SSMs were capable of high performance, the parallel scan operation was still a custom hardware path that underused the matrix-multiplication engines (Tensor Cores) that make GPUs so fast.

Mamba-2[2] attacked that limitation through Structured State Space Duality (SSD). The core observation is that selective SSMs and variants of attention can both be written as operations over structured semiseparable matrices. That lets Mamba-2 use a chunked algorithm where:

  • intra-chunk outputs are computed in parallel
  • chunk final states are computed with batched matrix multiplications
  • only the much shorter inter-chunk state passing remains a scan
  • the initial-state contribution for each chunk is added back with more batched matrix multiplications

In the SSD algorithm, more work is expressed as matrix multiplications, while a reduced inter-chunk recurrent core remains.

Mamba-1 vs. Mamba-2 comparison

FeatureMamba-1Mamba-2
Core OperationSelective ScanSSD block decomposition
Hardware ExecutionCustom scan-heavy CUDA kernelsMostly batched matmuls plus a shorter scan
Core Layer SpeedLinear-time scan2-8x reported speedup over Mamba-1 core layer[2]
Theoretical LinkSelective SSM recurrenceSSD bridge to structured masked attention

The Mamba-2 paper reports 2-8x faster core-layer execution than Mamba-1 while retaining competitive language-model results in its experiments.[2] Treat this as evidence for the SSD implementation path, not a guarantee for every end-to-end server.

ssd-chunk-boundary-count.py
1sequence_length = 16384 2chunk_size = 256 3chunks = sequence_length // chunk_size 4 5print("tokens processed within chunks:", sequence_length) 6print("chunk boundary states to scan:", chunks) 7print("boundary reduction factor:", f"{sequence_length / chunks:.0f}x")
Output
1tokens processed within chunks: 16384 2chunk boundary states to scan: 64 3boundary reduction factor: 256x

Beyond Mamba-2: Mamba-3

Work on SSMs did not stop at SSD. Mamba-3[3] takes an explicitly inference-first view and changes the recurrent update along three axes: a more expressive trapezoidal discretization (a second-order accurate update that generalizes Mamba-1/Mamba-2's simpler scheme), a complex-valued state update evaluated on state-tracking tasks such as parity and modular arithmetic, and a multi-input, multi-output (MIMO) formulation intended to raise arithmetic intensity during memory-bound decode.

In the paper's 1.5B-scale experiments, Mamba-3 improved average downstream accuracy over the next-best linear baseline, and the MIMO variant improved it further. The paper also reports matching Mamba-2 perplexity with half the state size.[3] These are experiment-scoped results to validate again in a deployment stack.

Training infrastructure for SSMs

While transformers have benefited from years of optimization (like FlashAttention and highly tuned PyTorch implementations), SSMs require different training infrastructure. The selective scan that makes Mamba efficient can't be expressed as a plain Python loop without throwing away the whole performance story.

The need for custom kernels

Training Mamba efficiently requires more than just writing a for loop in Python. The original implementation relies on custom kernels that keep the state in fast SRAM (Static Random-Access Memory) on the GPU rather than writing it back to slower HBM (High Bandwidth Memory). Early adoption was slower for exactly this reason: the ecosystem had to build SSM-specific kernels, serving paths, and state management instead of reusing the standard transformer stack. This is similar to how a regional courier hub must design custom conveyor layouts for its specific package mix, rather than plugging into a universal postal facility built for letter-sized mail.

Complexity comparison

Comparing the computational complexity of transformers and SSMs reveals why SSMs are so attractive for long sequences. The following chart and table isolate the sequence-mixing part of one layer, not the MLP or projection costs.

Generation state scaling chart showing transformer KV cache growing with context length while Mamba recurrent state stays flat. Generation state scaling chart showing transformer KV cache growing with context length while Mamba recurrent state stays flat.
The chart normalizes each path to its own 4K state footprint and shows growth, not absolute bytes or full model cost. Transformer KV cache grows with context length; Mamba recurrent state is fixed by layer design.
OperationTrainingGeneration (per token)Generation State / Cache
Self-attentionO(L2d)O(L^2 d)O(L2d)O(Ld)O(Ld)O(Ld)O(L⋅d)O(L \cdot d)O(L⋅d) grows linearly
S4 (LTI SSM)O(Llog⁡L⋅d)O(L \log L \cdot d)O(LlogL⋅d) (FFT convolution)O(s⋅d)O(s \cdot d)O(s⋅d)O(s⋅d)O(s \cdot d)O(s⋅d) constant
Mamba (selective SSM)O(L⋅s⋅d)O(L \cdot s \cdot d)O(L⋅s⋅d) (parallel scan)O(s⋅d)O(s \cdot d)O(s⋅d)O(s⋅d)O(s \cdot d)O(s⋅d) constant
Linear attentionO(L⋅d2)O(L \cdot d^2)O(L⋅d2)O(d2)O(d^2)O(d2)O(d2)O(d^2)O(d2) constant

The critical advantage during generation: SSMs maintain a fixed-size state with roughly s⋅ds \cdot ds⋅d elements per layer regardless of context length. A transformer serving 128K context must keep on the order of 128K × d KV cache (Key-Value cache, which stores the results of previous key and value computations to avoid re-computing them for each new token) activations per layer. In a fulfillment center processing 100,000 package events per day, an SSM-based routing model keeps the same dashboard-sized state at event 1 and event 100,000, while a transformer would need a warehouse-sized filing system that grows with every scan.

That is the core analogy: full manifest versus fixed summary. A transformer's KV cache keeps every scan event for every package in a growing manifest. An SSM's state is a fixed-size operations summary; no matter how many events arrived, the summary has the same size because it keeps compressed state. The tradeoff is that sometimes a specific event would have been preferable to the summary.

The original Mamba paper reported roughly 5x higher inference throughput than transformers of similar size in its evaluated generation setup.[1] Treat that as a reported result, not a guarantee: real speedups depend on batch size, context length, quality requirements, and how mature the SSM kernels are relative to optimized attention.

reported-throughput-gate.py
1baseline_tokens_per_second = 120 2reported_multiplier = 5.0 3measured_candidate_tokens_per_second = 410 4quality_regression_points = 0.4 5allowed_regression_points = 0.5 6 7reported_tokens_per_second = baseline_tokens_per_second * reported_multiplier 8passes_local_gate = ( 9 measured_candidate_tokens_per_second > baseline_tokens_per_second 10 and quality_regression_points <= allowed_regression_points 11) 12 13print("paper-scale illustration:", reported_tokens_per_second, "tokens/s") 14print("locally measured candidate:", measured_candidate_tokens_per_second, "tokens/s") 15print("candidate passes measured gate:", passes_local_gate)
Output
1paper-scale illustration: 600.0 tokens/s 2locally measured candidate: 410 tokens/s 3candidate passes measured gate: True

Hybrid architectures: combining trade-offs

Think of a fulfillment tracker that usually updates a compact running state (SSM mode, fast and efficient), but occasionally opens the exact event history for a tricky address correction or failed delivery scan (attention mode, slower but precise).

Fixed-state SSM designs can regress on tasks requiring precise retrieval from context, while attention-only stacks incur KV-cache growth as contexts expand. Hybrid architectures are candidates for workloads that must balance those risks:

Hybrid models: Jamba and Nemotron-H

Jamba[4] interleaves Transformer and Mamba layers in a structured hybrid stack and adds MoE to some multilayer perceptron (MLP) layers. In the released configuration, each 8-layer Jamba block uses a 1:7 attention-to-Mamba ratio, with MoE applied every other layer. Nemotron-H[5] uses attention in roughly 8% of layers (4 self-attention layers out of 52 in the 8B model) and fills the remainder with an even split of Mamba-2 and FFN layers.

The hybrid lesson is broader than any one model family: keep most depth on the fixed-state path, then add attention layers when evaluation shows that explicit token access matters. The main design choice is how often those layers appear and how much of the stack stays on Mamba-family layers.

  • Architecture: Alternating blocks with Mamba layers handling the bulk of computation and attention layers acting as retrieval checkpoints
  • MoE (Mixture of Experts) integration: Jamba uses routing on some MLP (Multi-Layer Perceptron) layers to increase total capacity without activating every parameter

Reported hybrid-model evaluations motivate three measurements:

  • long-context throughput versus an attention-only baseline
  • KV-cache memory reduction as the number of attention layers falls
  • retrieval and in-context-learning quality versus both pure SSM and attention-only baselines
hybrid-kv-cache-fraction.py
1total_layers = 32 2attention_layers = 4 3full_attention_kv_gb = 16.0 4hybrid_kv_gb = full_attention_kv_gb * attention_layers / total_layers 5 6print("attention layer fraction:", f"{attention_layers / total_layers:.1%}") 7print("illustrative full-attention KV:", f"{full_attention_kv_gb:.1f} GB") 8print("illustrative hybrid KV:", f"{hybrid_kv_gb:.1f} GB")
Output
1attention layer fraction: 12.5% 2illustrative full-attention KV: 16.0 GB 3illustrative hybrid KV: 2.0 GB

Why hybrids work

The attention layers serve as explicit-access checkpoints: tokens periodically get attention over the context, while Mamba layers process the remaining positions through recurrent state:

Hybrid architecture diagram showing sparse attention checkpoints among mostly Mamba or Mamba-2 layers. Hybrid architecture diagram showing sparse attention checkpoints among mostly Mamba or Mamba-2 layers.
Hybrid stacks keep most layers on the fixed-state path, then insert attention layers to test recall-heavy behavior against memory cost. Jamba also uses MoE in some MLP layers, while Nemotron-H includes Mamba-2, self-attention, and FFN layers.

When to use SSMs vs. transformers vs. hybrids

Choosing the right architecture depends heavily on the specific workload, context length, and memory constraints. The choice matrix below gives a practical heuristic instead of pretending there's one universal winner.

Architecture choice chart mapping exact recall, context growth, and memory budget to transformer, SSM, or hybrid choices. Architecture choice chart mapping exact recall, context growth, and memory budget to transformer, SSM, or hybrid choices.
The architecture choice depends on failure mode. Use attention when exact copying matters, SSMs when fixed memory matters more, and hybrids when the workload needs both long-context efficiency and periodic exact retrieval.
ScenarioCandidate to benchmarkWhy
Standard chat (short to moderate context)TransformerMature optimized tooling; validate quality and latency
Long document analysis (128K+)HybridTest memory savings against retrieval accuracy
Real-time streamingSSMFixed decode state is easier to budget as history grows
In-context learning (few-shot)TransformerExplicit token access is a strong baseline for copying/matching
Edge deployment (limited memory)SSM or SSM-heavy hybridFixed state is easier to budget than a growing KV cache
RAG (Retrieval-Augmented Generation) with precise retrievalTransformer or HybridEvaluate whether attention layers preserve required evidence use
architecture-selection-gate.py
1candidates = { 2 "transformer": {"recall": 0.97, "decode_memory_gb": 18.0, "p95_ms": 210}, 3 "hybrid": {"recall": 0.96, "decode_memory_gb": 6.0, "p95_ms": 150}, 4 "ssm": {"recall": 0.89, "decode_memory_gb": 2.0, "p95_ms": 120}, 5} 6requirements = {"recall": 0.95, "decode_memory_gb": 8.0, "p95_ms": 180} 7 8approved = [ 9 name 10 for name, metrics in candidates.items() 11 if metrics["recall"] >= requirements["recall"] 12 and metrics["decode_memory_gb"] <= requirements["decode_memory_gb"] 13 and metrics["p95_ms"] <= requirements["p95_ms"] 14] 15 16print("approved candidates:", approved)
Output
1approved candidates: ['hybrid']

When evaluating SSMs for deployment, profile actual workloads rather than relying on theoretical complexity alone. SSM candidates are motivated when long-context decoding makes KV-cache growth a bottleneck. Prompt ingestion still scales with input length, and for short-context batch inference, optimized transformer implementations with FlashAttention may be faster in practice due to mature GPU kernels.

prefill-versus-decode-accounting.py
1transformer = {"prefill_ms": 120, "decode_ms_per_token": 3.2} 2ssm = {"prefill_ms": 135, "decode_ms_per_token": 1.8} 3 4for generated_tokens in (16, 256): 5 transformer_total = transformer["prefill_ms"] + generated_tokens * transformer["decode_ms_per_token"] 6 ssm_total = ssm["prefill_ms"] + generated_tokens * ssm["decode_ms_per_token"] 7 winner = "ssm" if ssm_total < transformer_total else "transformer" 8 print(f"generated={generated_tokens}: winner={winner}, delta_ms={abs(ssm_total - transformer_total):.1f}")
Output
1generated=16: winner=ssm, delta_ms=7.4 2generated=256: winner=ssm, delta_ms=343.4

Mastery check

Evaluation rubric

  • Explains the core serving tradeoff: attention keeps visible token history, while SSMs keep a fixed recurrent state
  • Works through continuous dynamics, ZOH discretization, and the discrete recurrence without losing the meaning of each matrix
  • Explains why S4 was useful but limited by fixed, input-independent dynamics
  • Explains what makes Mamba selective: token-dependent Bt\mathbf{B}_tBt​, Ct\mathbf{C}_tCt​, and Δt\Delta_tΔt​
  • Distinguishes Mamba-1's scan-heavy kernel path from Mamba-2's SSD reformulation and tensor-core-friendly matmul path
  • Explains why hybrids keep sparse attention checkpoints instead of replacing attention everywhere

Common pitfalls

Symptom: Exact-copy or few-shot evals regress even though long-context throughput looks better. Cause: Team treated fixed recurrent state as if it offered the same token-level retrieval as attention. Pure SSMs compress history, so exact copying can get worse. Fix: Benchmark attention-only and hybrid candidates when retrieval is core behavior; measure whether inserted attention layers recover enough accuracy for their cache cost.

Symptom: Benchmarks say "linear time," but GPU throughput is still disappointing. Cause: Implementation fell back to Python loops or weak scan kernels. Mamba-1 needs hardware-aware fused scans, and Mamba-2 matters because it moves more work onto batched matmuls. Fix: Profile real kernels, not equations. Check whether stack is using optimized scan or SSD kernels before claiming a serving win.

Symptom: Routine tokens wipe out earlier important facts. Cause: Model learned a poor selective update, so a large Δt\Delta_tΔt​ or weak read/write projections let unimportant tokens overwrite useful state. Fix: Evaluate long-range recall tasks directly. If exact recovery matters, add attention checkpoints or reduce burden on recurrent state.

Symptom: Decode memory looks great, but prompt latency is still too high. Cause: Fixed recurrent state only helps after state is built. Prefill still has to process whole prompt and can remain expensive. Fix: Separate prefill and decode in profiling. SSMs help most when long generation dominates or when context keeps growing over time.

Symptom: Architecture review lumps RWKV, linear attention, S4, Mamba, and Mamba-2 into one bucket. Cause: "Linear" became shorthand for several different mechanisms with different retrieval, kernel, and training tradeoffs. Fix: Name specific trick. For Mamba, key distinction is token-dependent Bt\mathbf{B}_tBt​, Ct\mathbf{C}_tCt​, and Δt\Delta_tΔt​, not merely "non-quadratic scaling."

Follow-up questions

A team wants 256K-token log triage on 24 GB GPUs. Why is a pure transformer stack risky, and why might a hybrid be safer than a pure SSM?

A pure transformer stack is risky because KV-cache growth scales with context length, so long decode can become a memory bottleneck quickly. A pure SSM removes context-length-dependent decode-state growth, but it may lose retrieval accuracy on rare important log lines. A hybrid is a candidate because it adds some attention access while reducing cache growth versus an attention-only stack; validate it with those log-retrieval cases.

Why can't standard S4-style SSMs perform in-context learning as effectively as attention-based models?

Standard SSMs such as S4 use fixed, input-independent dynamics, so every token is processed with the same update rule regardless of content. That makes prompt-dependent routing, exact copying, and few-shot pattern matching harder than in attention models. Mamba improves this by making B\mathbf{B}B, C\mathbf{C}C, and Δ\DeltaΔ input-dependent, but the full history is still compressed into a fixed-size state.

A benchmark shows better long-context throughput after switching to Mamba-1, but GPU utilization is still poor. What should team inspect next?

Inspect kernel path, not only asymptotic complexity. Mamba-1 depends on hardware-aware fused scans, so Python loops or weak kernels can erase theoretical gain. This is exactly why Mamba-2 matters: it reformulates more of work into tensor-core-friendly batched matrix multiplications.

What is real serving advantage of SSMs over transformers for long-context generation, and what does that advantage not solve?

Real advantage is constant-size recurrent state during decode, which keeps generation memory from growing with context length. That can improve long-context throughput and fit tighter edge budgets. It does not solve exact token retrieval automatically, and it does not make prompt ingestion free because prefill still has to process whole input.

Why did Jamba keep attention layers instead of using only Mamba layers?

Jamba's released configuration uses a 1:7 attention-to-Mamba ratio inside each block. That design reduces the number of KV-cached attention layers and reintroduces occasional explicit token access. Treat its reported results as evidence for that configuration; a production choice still needs throughput, memory, and recall-heavy evaluations.[4]

What this unlocks next

You now have a working mental model of how SSMs compress sequence history into fixed recurrent state, why Mamba's selective mechanism matters, and where hybrid architectures sit in the design space. The next article in the path is Reasoning & Test-Time Compute. It builds on these serving concepts: additional generated tokens alter decode memory, latency, batching, and architecture trade-offs.

Next Step
Continue to Reasoning & Test-Time Compute

State-space models show one way to handle long sequences efficiently; reasoning systems ask when it is worth spending extra inference compute instead.

PreviousMixture of Experts Architecture
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu & Dao · 2023

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Dao & Gu · 2024

Mamba-3: Improved Sequence Modeling using State Space Principles

Gu, A., et al. · 2026

Jamba: A Hybrid Transformer-Mamba Language Model

AI21 Labs · 2024

Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

NVIDIA · 2025

Efficiently Modeling Long Sequences with Structured State Spaces

Gu et al. · 2022 · ICLR 2022

HiPPO: Recurrent Memory with Optimal Polynomial Projections.

Gu, A., Dao, T., Ermon, S., Rudra, A., & Ré, C. · 2020 · NeurIPS 2020