LearnInference & Production ScaleMamba & State Space Models

🚀HardInference Optimization

Mamba & State Space Models

Master linear-time sequence modeling: from S4 and HiPPO to Mamba's selective recurrence, Mamba-2's SSD framework, Mamba-3's inference-first refinements, and modern hybrid Transformer-SSM designs.

35 min read

Learning path

Step 138 of 155 in the full curriculum

Mixture of Experts Architecture Reasoning & Test-Time Compute

Mamba & State Space Models

Mixture of Experts showed one way to spend less compute per token: route each token through only a few feed-forward network (FFN) experts. State Space Models (SSMs) attack a different cost: attention over long histories.

State space models such as Mamba are alternatives to transformer attention for long sequences. This chapter explains the tradeoff: linear-time recurrence and selective state updates instead of keeping every previous token visible through attention.

Imagine processing a 1,000-event order timeline. One strategy is to keep every previous event open and visible while reading each new one. This approach catches every detail, but by event 500, your workspace is buried under history and it becomes unbearably slow. That's how transformers behave during autoregressive generation: each new token attends over the growing key-value (KV) cache, so decoding gets more expensive as context grows.

State Space Models (SSMs) take a different approach: instead of rereading everything, they keep a compact running summary that updates with each new event. In recurrent form, the update cost stays linear in sequence length and the model carries a fixed-size state instead of a context-length-dependent cache.^[1] The Mamba architecture^[1] made selective SSMs competitive with transformers at language-model scale, and later work like Mamba-2^[2], Mamba-3^[3], and hybrid architectures like Jamba^[4] and Nemotron-H^[5] pushed that line of work further.

Understanding SSMs matters because long-context serving is often limited by attention's growing KV cache, while Mamba-style models push historical information into a fixed-size recurrent state during decoding.^[1]

Set expectations up front: a fixed recurrent state removes KV-cache growth during decode, but it also creates a retrieval risk because history is compressed rather than explicitly addressable. Hybrid Transformer-SSM models are one candidate when a workload needs both bounded decode state and occasional attention access. This chapter teaches the mechanism and the measurements needed to decide whether that trade pays off.

Transformer vs. Mamba comparison showing growing KV-cache access versus one fixed-size recurrent state. — Attention keeps token-level history available through a growing KV cache. Mamba-style SSMs keep a fixed recurrent state, reducing decode-state growth while turning history into a compressed summary.

From control theory to language models

The math behind SSMs is rooted in continuous systems, but the core idea is simple. Think of a warehouse routing controller: it has a current state (current backlog and carrier capacity), receives input (new orders and scan updates), and produces output (which lane or carrier should handle the next package). The state updates based on both its previous state and the new input. SSMs do exactly this for text: they maintain a hidden state that continuously updates as each new word arrives.

State Space Models originate from control theory, where they describe dynamical systems via continuous differential equations. The key variables are:

$x(t) \in \mathbb{R}$ : the input signal
$h(t) \in \mathbb{R}^s$ : the hidden state, a vector of dimension $s$ that acts as a compressed summary of all past inputs
$y(t) \in \mathbb{R}$ : the output
$\mathbf{A} \in \mathbb{R}^{s \times s}$ : the state transition matrix, shaping how much past information to retain
$\mathbf{B} \in \mathbb{R}^{s \times 1}$ : the input matrix, shaping how much new information to absorb
$\mathbf{C} \in \mathbb{R}^{1 \times s}$ : the output matrix, which reads out the relevant part of the state

These equations are often written for scalar input/output to keep the notation readable. Real neural layers run many channels in parallel, but the same recurrence applies channel-wise.

From these, the continuous dynamics are:

$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t)$ $y(t) = \mathbf{C}h(t) + \mathbf{D}x(t)$

What this means

At every moment, the hidden state $h(t)$ does two things simultaneously: it decays based on $\mathbf{A}$ (controlled forgetting), and it absorbs new input scaled by $\mathbf{B}$ . The output $y(t)$ is then a linear readout of the state via $\mathbf{C}$ . The $\mathbf{D}$ term is a skip connection that lets the input pass through unchanged, similar to a residual connection. Think of the hidden state as a running summary that continuously updates as new tokens arrive.

Discretization

Tokens are discrete, but the equations above describe smooth, continuous change. To apply them to token sequences, we discretize by sampling at intervals of $\Delta$ timesteps. The standard approach is zero-order hold (ZOH), which converts the continuous matrices into discrete equivalents:^[6]

\begin{aligned} \bar{\mathbf{A}} &= e^{\Delta \mathbf{A}} \\ \bar{\mathbf{B}} &= (\Delta \mathbf{A})^{-1} \left(e^{\Delta \mathbf{A}} - \mathbf{I}\right) \cdot \Delta \mathbf{B} \end{aligned}

where $e^{\Delta \mathbf{A}}$ is the matrix exponential. These discrete matrices encode the same dynamics as the continuous equations, but in a form suitable for step-by-step updates.

The discrete recurrence then becomes:

$h_t = \bar{\mathbf{A}} h_{t-1} + \bar{\mathbf{B}} x_t$ $y_t = \mathbf{C} h_t + \mathbf{D} x_t$

At each step: decay the old state by $\bar{\mathbf{A}}$ , absorb the new input scaled by $\bar{\mathbf{B}}$ , and read out through $\mathbf{C}$ . The $\mathbf{D}$ term is a skip connection (the input contributes directly to the output). For this recurrent layer, one decode update uses state sized by the model rather than by preceding context length.

Walkthrough with numbers

To make the recurrence concrete, imagine a one-dimensional state tracking the running importance of package scan events. Set:

$\bar{A} = 0.9$ (the state keeps 90% of its previous value)
$\bar{B} = 0.2$ (new input contributes 20%)
$\bar{C} = 1.0$ (read the full state)
$\bar{D} = 0$ (no skip)

Now process four scans with importance scores $x = [3, 1, 4, 2]$ :

Step	Input $x_t$	State $h_t = 0.9 \, h_{t-1} + 0.2 \, x_t$	Output $y_t$
1	3	$0.9 \cdot 0 + 0.2 \cdot 3 = 0.60$	0.60
2	1	$0.9 \cdot 0.60 + 0.2 \cdot 1 = 0.74$	0.74
3	4	$0.9 \cdot 0.74 + 0.2 \cdot 4 = 1.47$	1.47
4	2	$0.9 \cdot 1.47 + 0.2 \cdot 2 = 1.72$	1.72

Notice how the state smooths the input sequence. A high input like 4 pulls the state up, but the old state still carries weight. This compression is the whole point: instead of storing every scan, the SSM keeps a fixed-size summary.

discrete-recurrence.py

inputs = [3, 1, 4, 2]
a_bar = 0.9
b_bar = 0.2
state = 0.0
states = []

for value in inputs:
    state = a_bar * state + b_bar * value
    states.append(round(state, 3))

print("states:", states)
print("final state:", states[-1])

Output

states: [0.6, 0.74, 1.466, 1.719]
final state: 1.719

If $\bar{A}$ were 0.1 instead of 0.9, the final state would change sharply. With the same inputs and $\bar{B}=0.2$ , the final state is about $0.483$ instead of $1.719$ : earlier events contribute far less. Whether that is harmful depends on what information the task needs to retain.

S4: structured state spaces

The Structured State Space for Sequence Modeling (S4)^[6] advanced neural SSMs by giving them structured long-range dynamics and efficient computation views. The following diagram illustrates the basic S4 architecture, where the hidden state is updated iteratively using structured matrices.

State space recurrence diagram showing input and prior state becoming one token-step update and output readout. — The SSM layer is easiest to read as a state update: combine the previous state with the new input, produce a new state, then read an output. S4 made this practical by adding structure that preserves long-range memory and supports efficient training.

HiPPO initialization

In S4, the continuous-time $\mathbf{A}$ matrix starts from the HiPPO (High-order Polynomial Projection Operators) framework, which compresses input history into coefficients of orthogonal polynomials.^[7]^[6]

Without this structure, a long recurrence would tend to wash out older information. HiPPO gives S4 a principled way to preserve information over long horizons instead of relying on an arbitrary recurrent matrix to discover that behavior from scratch.

Dual computation modes

S4 can be computed in two equivalent ways. For notational simplicity, the costs below are written for one channel; a full layer runs many channels in parallel.

Recurrent mode (generation)

Efficient for generation: $h_t = \bar{\mathbf{A}} h_{t-1} + \bar{\mathbf{B}} x_t, \quad y_t = \mathbf{C} h_t + \mathbf{D} x_t$

Time: $O(L \cdot s)$ , Memory: $O(s)$ per step.

Convolutional mode (training)

Efficient for training: $y = x * \bar{K}, \quad \text{where } \bar{K}_t = \mathbf{C} \bar{\mathbf{A}}^t \bar{\mathbf{B}}$

In plain terms

The recurrence above can be "unrolled" into a single convolution with a pre-computed kernel $\bar{K}$ . This allows processing the entire sequence at once using FFT (Fast Fourier Transform), so the sequence-mixing part of training runs in $O(L \log L)$ time instead of a naive token-by-token loop. Same math, different execution strategy.

The LTI problem

Here's S4's critical limitation. Imagine a conveyor belt in a factory that processes every item at exactly the same speed with exactly the same steps, whether it's a diamond ring or a pebble. That's efficient, but it can't prioritize. It can't be told to "slow down for the diamond and speed past the pebbles."

S4 is a Linear Time-Invariant (LTI) system: matrices $\mathbf{A}$ , $\mathbf{B}$ , and $\mathbf{C}$ are fixed regardless of input content. It can represent position-dependent decay through recurrence, but it cannot change its write/read policy because a particular input token is important.^[1]

This is different from attention, where query-key interactions provide content-dependent access. Fixed LTI SSMs struggle on selective-copy style tasks used to test content-aware memory; Mamba addresses that weakness with input-dependent parameters.

Mamba: selective state spaces

Mamba^[1] solves the LTI limitation with one core idea: make part of the SSM input-dependent.

The selective mechanism

Think of a conveyor scanner moving through package events. Routine scans such as "label printed" may need a small state update, while a phrase like "address corrected after failed delivery" may need a stronger update. The content of each event conditions the update rule. That is Mamba's selective mechanism.

Instead of fixed $\mathbf{B}$ , $\mathbf{C}$ , and $\Delta$ , Mamba computes them as functions of the input:

$\mathbf{B}_t = \text{Linear}_B(x_t), \quad \mathbf{C}_t = \text{Linear}_C(x_t), \quad \Delta_t = \text{softplus}(\text{Linear}_\Delta(x_t))$

Unlike S4, which uses the same $\mathbf{B}$ , $\mathbf{C}$ , and $\Delta$ for every token, Mamba computes them fresh from each token's content. The learned transition $\mathbf{A}$ and skip term $\mathbf{D}$ remain input-independent in Mamba-1; selectivity enters through $\mathbf{B}_t$ , $\mathbf{C}_t$ , and $\Delta_t$ .^[1]

The step size $\Delta_t$ is especially important: it controls how much the model updates its hidden state:

Large $\Delta_t$ : Decay older state faster and absorb more of the current input
Small $\Delta_t$ : Keep more of the previous state and make a smaller update

delta-controls-retention.py

from math import exp

a = -1.0
for delta in (0.1, 1.0, 2.0):
    retained_fraction = exp(delta * a)
    print(f"delta={delta:.1f} retains {retained_fraction:.3f} of previous state")

Output

delta=0.1 retains 0.905 of previous state
delta=1.0 retains 0.368 of previous state
delta=2.0 retains 0.135 of previous state

This gives Mamba a learned, content-dependent recurrent update while retaining linear-length recurrent computation. It does not give every output explicit access to every prior token as attention does.

Beginners often assume Mamba "remembers everything" because it avoids the KV cache. It doesn't. A large $\Delta_t$ can make the model forget past tokens just as aggressively as attention sometimes ignores distant positions. The difference is that Mamba learns how much to forget per token, not whether to access a full history. In a package-tracking model, a large $\Delta_t$ on a routine "label printed" scan could wash out the previous "address corrected" event if the model hasn't learned to gate that update carefully.

Architecture

A Mamba block replaces attention with a specific stack: projection, causal convolution, selective SSM update, and gating.^[1] It uses SiLU (Sigmoid Linear Unit) activations and softplus (a smooth approximation of ReLU defined as log(1 + e^x), which is always positive and differentiable everywhere) to keep the step sizes positive:

Mamba block diagram showing projection split, causal convolution, selective scan, gate path, and output projection. — A Mamba block is not just a recurrence. The production block projects the token, mixes nearby tokens with a causal convolution, computes token-dependent SSM parameters, runs selective scan, gates the result, and projects back to the model dimension.

The code below focuses on the selective recurrence itself rather than the entire fused production block. It leaves out the causal convolution and output gate on purpose, so the core SSM update stays easy to inspect and the tensor shapes stay correct.

architecture.py

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Reference selective SSM. Clear and correct, but not hardware optimized."""

    def __init__(self, d_model: int, d_state: int = 64):
        super().__init__()
        self.u_proj = nn.Linear(d_model, d_state)
        self.b_proj = nn.Linear(d_model, d_state)
        self.c_proj = nn.Linear(d_model, d_state)
        self.dt_proj = nn.Linear(d_model, d_state)
        self.a_log = nn.Parameter(torch.zeros(d_state))
        self.out_proj = nn.Linear(d_state, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, length, _ = x.shape
        A = -torch.exp(self.a_log).view(1, -1)  # Stable diagonal dynamics.
        h = x.new_zeros(batch, A.size(-1))
        outputs = []

        for t in range(length):
            x_t = x[:, t, :]
            u_t = self.u_proj(x_t)
            B_t = self.b_proj(x_t)
            C_t = self.c_proj(x_t)
            delta_t = F.softplus(self.dt_proj(x_t))

            A_bar = torch.exp(delta_t * A)
            B_bar = torch.where(
                A.abs() > 1e-6,
                ((A_bar - 1.0) / A) * B_t,
                delta_t * B_t,
            )

            h = A_bar * h + B_bar * u_t
            y_t = self.out_proj(C_t * h) + x_t  # Simple skip connection.
            outputs.append(y_t)

        return torch.stack(outputs, dim=1)

torch.manual_seed(0)
ssm = SelectiveSSM(d_model=8, d_state=4)
x = torch.randn(1, 2, 8)
y = ssm(x)

print("input shape:", tuple(x.shape))
print("output shape:", tuple(y.shape))
print("state dimension:", ssm.a_log.numel())

Output

input shape: (1, 2, 8)
output shape: (1, 2, 8)
state dimension: 4

Separate from recurrence math, the serving-memory claim is easy to sanity-check numerically.

state_scaling.py

def transformer_kv_bytes(layers: int, heads: int, head_dim: int, tokens: int, bytes_per_value: int = 2) -> int:
    return 2 * layers * heads * head_dim * tokens * bytes_per_value

def ssm_state_bytes(
    layers: int,
    inner_channels: int,
    state_size: int,
    conv_width: int,
    bytes_per_value: int = 2,
) -> int:
    recurrent_state = layers * inner_channels * state_size
    convolution_state = layers * inner_channels * (conv_width - 1)
    return (recurrent_state + convolution_state) * bytes_per_value

layers = 32
heads = 32
head_dim = 128
inner_channels = 4096
state_size = 16
conv_width = 4

for tokens in [128, 512, 2048, 8192]:
    kv_mb = transformer_kv_bytes(layers, heads, head_dim, tokens) / 1024**2
    ssm_mb = ssm_state_bytes(layers, inner_channels, state_size, conv_width) / 1024**2
    print(f"context={tokens:5d} -> transformer_kv={kv_mb:7.1f} MB, ssm_state={ssm_mb:4.1f} MB")

Output

context=  128 -> transformer_kv=   64.0 MB, ssm_state= 4.8 MB
context=  512 -> transformer_kv=  256.0 MB, ssm_state= 4.8 MB
context= 2048 -> transformer_kv= 1024.0 MB, ssm_state= 4.8 MB
context= 8192 -> transformer_kv= 4096.0 MB, ssm_state= 4.8 MB

What goes wrong when you build this from scratch

The code above is shape-correct for the simplified recurrence, but beginners hit predictable snags when they try to extend it. Here are three common symptoms, their causes, and how to fix them.

Symptom	Cause	Fix
State values explode to `NaN` within a few steps	`A_bar` can exceed 1.0 if `softplus` isn't used to keep $\Delta_t$ positive, or if $\mathbf{A}$ isn't constrained negative	Keep `softplus` on $\Delta_t$ and initialize `a_log` so that `A = -exp(a_log)` stays negative
Model acts like a plain RNN with no content awareness	`B_t` and `C_t` are accidentally hardcoded as constants instead of `Linear(x_t)` projections	Verify that `b_proj` and `c_proj` are actually called with the current token `x_t` inside the loop
Training is unbearably slow on long sequences	Using a Python `for` loop over the sequence dimension on GPU	The reference code above is for reading, not training. Production Mamba relies on fused CUDA scan kernels that keep state in SRAM, not Python loops

The hardware-aware parallel scan

The key engineering innovation in Mamba is the hardware-aware scan algorithm. At first glance, the recurrence $h_t = \bar{\mathbf{A}}_t h_{t-1} + \bar{\mathbf{B}}_t x_t$ appears strictly sequential. If $h_{t-1}$ must be known before computing $h_t$ , processing a sequence seems to require a slow token-by-token loop, which would severely bottleneck training on GPUs compared to the heavily parallelized attention mechanism.

The bottleneck isn't only arithmetic. It's also memory traffic: if every token update spills intermediate state back to high-bandwidth memory (HBM), the recurrence becomes bandwidth-bound. Mamba addresses that by fusing discretization and state updates so the recurrent state stays in fast on-chip static random-access memory (SRAM) as much as possible, then using an associative scan primitive to recover parallelism across positions.^[1]

Because the recurrence is associative after the right reformulation, the model can group updates and compute intermediate states in a tree-like structure. Consider the task of calculating a running total for a long list of numbers. Doing it sequentially means adding one number at a time. A parallel scan is like splitting the list into smaller chunks, calculating subtotals for each chunk simultaneously, and then combining those subtotals in a tree-like fashion. This way, the final running total is calculated much faster.

Selective scan diagram showing chunked training reduction and one-step serving recurrence. — Training builds all hidden states with a chunked scan. Serving only advances one recurrent state at a time.

The algorithm operates in three phases to improve hardware utilization:

First, it computes all the input-dependent parameters and pairs of $(\bar{\mathbf{A}}_t, \bar{\mathbf{B}}_t x_t)$ in parallel across the entire sequence.
Next, it applies the parallel prefix sum using a highly optimized associative operator to combine these terms.
Finally, it reads off all the hidden state values $h_t$ simultaneously.

associative-scan-composition.py

def compose(right: tuple[float, float], left: tuple[float, float]) -> tuple[float, float]:
    """Compose h -> a*h + b transforms: apply left, then right."""
    a_right, b_right = right
    a_left, b_left = left
    return a_right * a_left, a_right * b_left + b_right

steps = [(0.9, 0.6), (0.9, 0.2), (0.9, 0.8), (0.9, 0.4)]
left_chunk = compose(steps[1], steps[0])
right_chunk = compose(steps[3], steps[2])
whole_sequence = compose(right_chunk, left_chunk)

print("left chunk transform:", tuple(round(value, 3) for value in left_chunk))
print("right chunk transform:", tuple(round(value, 3) for value in right_chunk))
print("state from zero:", round(whole_sequence[1], 3))

Output

left chunk transform: (0.81, 0.74)
right chunk transform: (0.81, 1.12)
state from zero: 1.719

By breaking the worst sequential dependency, this hardware-aware approach achieves $O(L)$ total work with much smaller critical-path depth than a naive recurrent loop. That makes Mamba trainable on GPUs, but it still depends on custom scan kernels and doesn't map as cleanly to tensor-core matmuls as transformers do. That remaining gap is what Mamba-2 addresses.

Mamba-2: state space duality

While Mamba proved that selective SSMs were capable of high performance, the parallel scan operation was still a custom hardware path that underused the matrix-multiplication engines (Tensor Cores) that make GPUs so fast.

Mamba-2^[2] attacked that limitation through Structured State Space Duality (SSD). The core observation is that selective SSMs and variants of attention can both be written as operations over structured semiseparable matrices. That lets Mamba-2 use a chunked algorithm where:

intra-chunk outputs are computed in parallel
chunk final states are computed with batched matrix multiplications
only the much shorter inter-chunk state passing remains a scan
the initial-state contribution for each chunk is added back with more batched matrix multiplications

In the SSD algorithm, more work is expressed as matrix multiplications, while a reduced inter-chunk recurrent core remains.

Mamba-1 vs. Mamba-2 comparison

Feature	Mamba-1	Mamba-2
Core Operation	Selective Scan	SSD block decomposition
Hardware Execution	Custom scan-heavy CUDA kernels	Mostly batched matmuls plus a shorter scan
Core Layer Speed	Linear-time scan	2-8x reported speedup over Mamba-1 core layer^[2]
Theoretical Link	Selective SSM recurrence	SSD bridge to structured masked attention

The Mamba-2 paper reports 2-8x faster core-layer execution than Mamba-1 while retaining competitive language-model results in its experiments.^[2] Treat this as evidence for the SSD implementation path, not a guarantee for every end-to-end server.

ssd-chunk-boundary-count.py

sequence_length = 16384
chunk_size = 256
chunks = sequence_length // chunk_size

print("tokens processed within chunks:", sequence_length)
print("chunk boundary states to scan:", chunks)
print("boundary reduction factor:", f"{sequence_length / chunks:.0f}x")

Output

tokens processed within chunks: 16384
chunk boundary states to scan: 64
boundary reduction factor: 256x

Beyond Mamba-2: Mamba-3

Work on SSMs did not stop at SSD. Mamba-3^[3] takes an explicitly inference-first view and changes the recurrent update along three axes: a more expressive trapezoidal discretization (a second-order accurate update that generalizes Mamba-1/Mamba-2's simpler scheme), a complex-valued state update evaluated on state-tracking tasks such as parity and modular arithmetic, and a multi-input, multi-output (MIMO) formulation intended to raise arithmetic intensity during memory-bound decode.

In the paper's 1.5B-scale experiments, Mamba-3 improved average downstream accuracy over the next-best linear baseline, and the MIMO variant improved it further. The paper also reports matching Mamba-2 perplexity with half the state size.^[3] These are experiment-scoped results to validate again in a deployment stack.

Training infrastructure for SSMs

While transformers have benefited from years of optimization (like FlashAttention and highly tuned PyTorch implementations), SSMs require different training infrastructure. The selective scan that makes Mamba efficient can't be expressed as a plain Python loop without throwing away the whole performance story.

The need for custom kernels

Training Mamba efficiently requires more than just writing a for loop in Python. The original implementation relies on custom kernels that keep the state in fast SRAM (Static Random-Access Memory) on the GPU rather than writing it back to slower HBM (High Bandwidth Memory). Early adoption was slower for exactly this reason: the ecosystem had to build SSM-specific kernels, serving paths, and state management instead of reusing the standard transformer stack. This is similar to how a regional courier hub must design custom conveyor layouts for its specific package mix, rather than plugging into a universal postal facility built for letter-sized mail.

Complexity comparison

Comparing the computational complexity of transformers and SSMs reveals why SSMs are so attractive for long sequences. The following chart and table isolate the sequence-mixing part of one layer, not the MLP or projection costs.

Generation state scaling chart showing transformer KV cache growing with context length while Mamba recurrent state stays flat. — The chart normalizes each path to its own 4K state footprint and shows growth, not absolute bytes or full model cost. Transformer KV cache grows with context length; Mamba recurrent state is fixed by layer design.

Operation	Training	Generation (per token)	Generation State / Cache
Self-attention	$O(L^2 d)$	$O(Ld)$	$O(L \cdot d)$ grows linearly
S4 (LTI SSM)	$O(L \log L \cdot d)$ (FFT convolution)	$O(s \cdot d)$	$O(s \cdot d)$ constant
Mamba (selective SSM)	$O(L \cdot s \cdot d)$ (parallel scan)	$O(s \cdot d)$	$O(s \cdot d)$ constant
Linear attention	$O(L \cdot d^2)$	$O(d^2)$	$O(d^2)$ constant

The critical advantage during generation: SSMs maintain a fixed-size state with roughly $s \cdot d$ elements per layer regardless of context length. A transformer serving 128K context must keep on the order of 128K × d KV cache (Key-Value cache, which stores the results of previous key and value computations to avoid re-computing them for each new token) activations per layer. In a fulfillment center processing 100,000 package events per day, an SSM-based routing model keeps the same dashboard-sized state at event 1 and event 100,000, while a transformer would need a warehouse-sized filing system that grows with every scan.

That is the core analogy: full manifest versus fixed summary. A transformer's KV cache keeps every scan event for every package in a growing manifest. An SSM's state is a fixed-size operations summary; no matter how many events arrived, the summary has the same size because it keeps compressed state. The tradeoff is that sometimes a specific event would have been preferable to the summary.

The original Mamba paper reported roughly 5x higher inference throughput than transformers of similar size in its evaluated generation setup.^[1] Treat that as a reported result, not a guarantee: real speedups depend on batch size, context length, quality requirements, and how mature the SSM kernels are relative to optimized attention.

reported-throughput-gate.py

baseline_tokens_per_second = 120
reported_multiplier = 5.0
measured_candidate_tokens_per_second = 410
quality_regression_points = 0.4
allowed_regression_points = 0.5

reported_tokens_per_second = baseline_tokens_per_second * reported_multiplier
passes_local_gate = (
    measured_candidate_tokens_per_second > baseline_tokens_per_second
    and quality_regression_points <= allowed_regression_points
)

print("paper-scale illustration:", reported_tokens_per_second, "tokens/s")
print("locally measured candidate:", measured_candidate_tokens_per_second, "tokens/s")
print("candidate passes measured gate:", passes_local_gate)

Output

paper-scale illustration: 600.0 tokens/s
locally measured candidate: 410 tokens/s
candidate passes measured gate: True

Hybrid architectures: combining trade-offs

Think of a fulfillment tracker that usually updates a compact running state (SSM mode, fast and efficient), but occasionally opens the exact event history for a tricky address correction or failed delivery scan (attention mode, slower but precise).

Fixed-state SSM designs can regress on tasks requiring precise retrieval from context, while attention-only stacks incur KV-cache growth as contexts expand. Hybrid architectures are candidates for workloads that must balance those risks:

Hybrid models: Jamba and Nemotron-H

Jamba^[4] interleaves Transformer and Mamba layers in a structured hybrid stack and adds MoE to some multilayer perceptron (MLP) layers. In the released configuration, each 8-layer Jamba block uses a 1:7 attention-to-Mamba ratio, with MoE applied every other layer. Nemotron-H^[5] uses attention in roughly 8% of layers (4 self-attention layers out of 52 in the 8B model) and fills the remainder with an even split of Mamba-2 and FFN layers.

The hybrid lesson is broader than any one model family: keep most depth on the fixed-state path, then add attention layers when evaluation shows that explicit token access matters. The main design choice is how often those layers appear and how much of the stack stays on Mamba-family layers.

Architecture: Alternating blocks with Mamba layers handling the bulk of computation and attention layers acting as retrieval checkpoints
MoE (Mixture of Experts) integration: Jamba uses routing on some MLP (Multi-Layer Perceptron) layers to increase total capacity without activating every parameter

Reported hybrid-model evaluations motivate three measurements:

long-context throughput versus an attention-only baseline
KV-cache memory reduction as the number of attention layers falls
retrieval and in-context-learning quality versus both pure SSM and attention-only baselines

hybrid-kv-cache-fraction.py

total_layers = 32
attention_layers = 4
full_attention_kv_gb = 16.0
hybrid_kv_gb = full_attention_kv_gb * attention_layers / total_layers

print("attention layer fraction:", f"{attention_layers / total_layers:.1%}")
print("illustrative full-attention KV:", f"{full_attention_kv_gb:.1f} GB")
print("illustrative hybrid KV:", f"{hybrid_kv_gb:.1f} GB")

Output

attention layer fraction: 12.5%
illustrative full-attention KV: 16.0 GB
illustrative hybrid KV: 2.0 GB

Why hybrids work

The attention layers serve as explicit-access checkpoints: tokens periodically get attention over the context, while Mamba layers process the remaining positions through recurrent state:

Hybrid architecture diagram showing sparse attention checkpoints among mostly Mamba or Mamba-2 layers. — Hybrid stacks keep most layers on the fixed-state path, then insert attention layers to test recall-heavy behavior against memory cost. Jamba also uses MoE in some MLP layers, while Nemotron-H includes Mamba-2, self-attention, and FFN layers.

When to use SSMs vs. transformers vs. hybrids

Choosing the right architecture depends heavily on the specific workload, context length, and memory constraints. The choice matrix below gives a practical heuristic instead of pretending there's one universal winner.

Architecture choice chart mapping exact recall, context growth, and memory budget to transformer, SSM, or hybrid choices. — The architecture choice depends on failure mode. Use attention when exact copying matters, SSMs when fixed memory matters more, and hybrids when the workload needs both long-context efficiency and periodic exact retrieval.

Scenario	Candidate to benchmark	Why
Standard chat (short to moderate context)	Transformer	Mature optimized tooling; validate quality and latency
Long document analysis (128K+)	Hybrid	Test memory savings against retrieval accuracy
Real-time streaming	SSM	Fixed decode state is easier to budget as history grows
In-context learning (few-shot)	Transformer	Explicit token access is a strong baseline for copying/matching
Edge deployment (limited memory)	SSM or SSM-heavy hybrid	Fixed state is easier to budget than a growing KV cache
RAG (Retrieval-Augmented Generation) with precise retrieval	Transformer or Hybrid	Evaluate whether attention layers preserve required evidence use

architecture-selection-gate.py

candidates = {
    "transformer": {"recall": 0.97, "decode_memory_gb": 18.0, "p95_ms": 210},
    "hybrid": {"recall": 0.96, "decode_memory_gb": 6.0, "p95_ms": 150},
    "ssm": {"recall": 0.89, "decode_memory_gb": 2.0, "p95_ms": 120},
}
requirements = {"recall": 0.95, "decode_memory_gb": 8.0, "p95_ms": 180}

approved = [
    name
    for name, metrics in candidates.items()
    if metrics["recall"] >= requirements["recall"]
    and metrics["decode_memory_gb"] <= requirements["decode_memory_gb"]
    and metrics["p95_ms"] <= requirements["p95_ms"]
]

print("approved candidates:", approved)

Output

approved candidates: ['hybrid']

When evaluating SSMs for deployment, profile actual workloads rather than relying on theoretical complexity alone. SSM candidates are motivated when long-context decoding makes KV-cache growth a bottleneck. Prompt ingestion still scales with input length, and for short-context batch inference, optimized transformer implementations with FlashAttention may be faster in practice due to mature GPU kernels.

prefill-versus-decode-accounting.py

transformer = {"prefill_ms": 120, "decode_ms_per_token": 3.2}
ssm = {"prefill_ms": 135, "decode_ms_per_token": 1.8}

for generated_tokens in (16, 256):
    transformer_total = transformer["prefill_ms"] + generated_tokens * transformer["decode_ms_per_token"]
    ssm_total = ssm["prefill_ms"] + generated_tokens * ssm["decode_ms_per_token"]
    winner = "ssm" if ssm_total < transformer_total else "transformer"
    print(f"generated={generated_tokens}: winner={winner}, delta_ms={abs(ssm_total - transformer_total):.1f}")

Output

generated=16: winner=ssm, delta_ms=7.4
generated=256: winner=ssm, delta_ms=343.4

Mastery check

Evaluation rubric

Explains the core serving tradeoff: attention keeps visible token history, while SSMs keep a fixed recurrent state
Works through continuous dynamics, ZOH discretization, and the discrete recurrence without losing the meaning of each matrix
Explains why S4 was useful but limited by fixed, input-independent dynamics
Explains what makes Mamba selective: token-dependent $\mathbf{B}_t$ , $\mathbf{C}_t$ , and $\Delta_t$
Distinguishes Mamba-1's scan-heavy kernel path from Mamba-2's SSD reformulation and tensor-core-friendly matmul path
Explains why hybrids keep sparse attention checkpoints instead of replacing attention everywhere

Common pitfalls

Symptom: Exact-copy or few-shot evals regress even though long-context throughput looks better. Cause: Team treated fixed recurrent state as if it offered the same token-level retrieval as attention. Pure SSMs compress history, so exact copying can get worse. Fix: Benchmark attention-only and hybrid candidates when retrieval is core behavior; measure whether inserted attention layers recover enough accuracy for their cache cost.

Symptom: Benchmarks say "linear time," but GPU throughput is still disappointing. Cause: Implementation fell back to Python loops or weak scan kernels. Mamba-1 needs hardware-aware fused scans, and Mamba-2 matters because it moves more work onto batched matmuls. Fix: Profile real kernels, not equations. Check whether stack is using optimized scan or SSD kernels before claiming a serving win.

Symptom: Routine tokens wipe out earlier important facts. Cause: Model learned a poor selective update, so a large $\Delta_t$ or weak read/write projections let unimportant tokens overwrite useful state. Fix: Evaluate long-range recall tasks directly. If exact recovery matters, add attention checkpoints or reduce burden on recurrent state.

Symptom: Decode memory looks great, but prompt latency is still too high. Cause: Fixed recurrent state only helps after state is built. Prefill still has to process whole prompt and can remain expensive. Fix: Separate prefill and decode in profiling. SSMs help most when long generation dominates or when context keeps growing over time.

Symptom: Architecture review lumps RWKV, linear attention, S4, Mamba, and Mamba-2 into one bucket. Cause: "Linear" became shorthand for several different mechanisms with different retrieval, kernel, and training tradeoffs. Fix: Name specific trick. For Mamba, key distinction is token-dependent $\mathbf{B}_t$ , $\mathbf{C}_t$ , and $\Delta_t$ , not merely "non-quadratic scaling."

Follow-up questions

A team wants 256K-token log triage on 24 GB GPUs. Why is a pure transformer stack risky, and why might a hybrid be safer than a pure SSM?

A pure transformer stack is risky because KV-cache growth scales with context length, so long decode can become a memory bottleneck quickly. A pure SSM removes context-length-dependent decode-state growth, but it may lose retrieval accuracy on rare important log lines. A hybrid is a candidate because it adds some attention access while reducing cache growth versus an attention-only stack; validate it with those log-retrieval cases.

Why can't standard S4-style SSMs perform in-context learning as effectively as attention-based models?

Standard SSMs such as S4 use fixed, input-independent dynamics, so every token is processed with the same update rule regardless of content. That makes prompt-dependent routing, exact copying, and few-shot pattern matching harder than in attention models. Mamba improves this by making $\mathbf{B}$ , $\mathbf{C}$ , and $\Delta$ input-dependent, but the full history is still compressed into a fixed-size state.

A benchmark shows better long-context throughput after switching to Mamba-1, but GPU utilization is still poor. What should team inspect next?

Inspect kernel path, not only asymptotic complexity. Mamba-1 depends on hardware-aware fused scans, so Python loops or weak kernels can erase theoretical gain. This is exactly why Mamba-2 matters: it reformulates more of work into tensor-core-friendly batched matrix multiplications.

What is real serving advantage of SSMs over transformers for long-context generation, and what does that advantage not solve?

Real advantage is constant-size recurrent state during decode, which keeps generation memory from growing with context length. That can improve long-context throughput and fit tighter edge budgets. It does not solve exact token retrieval automatically, and it does not make prompt ingestion free because prefill still has to process whole input.

Why did Jamba keep attention layers instead of using only Mamba layers?

Jamba's released configuration uses a 1:7 attention-to-Mamba ratio inside each block. That design reduces the number of KV-cached attention layers and reintroduces occasional explicit token access. Treat its reported results as evidence for that configuration; a production choice still needs throughput, memory, and recall-heavy evaluations.^[4]

What this unlocks next

You now have a working mental model of how SSMs compress sequence history into fixed recurrent state, why Mamba's selective mechanism matters, and where hybrid architectures sit in the design space. The next article in the path is Reasoning & Test-Time Compute. It builds on these serving concepts: additional generated tokens alter decode memory, latency, batching, and architecture trade-offs.

Next Step

Continue to Reasoning & Test-Time Compute

State-space models show one way to handle long sequences efficiently; reasoning systems ask when it is worth spending extra inference compute instead.

PreviousMixture of Experts Architecture

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu & Dao · 2023

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Dao & Gu · 2024

Mamba-3: Improved Sequence Modeling using State Space Principles

Gu, A., et al. · 2026

Jamba: A Hybrid Transformer-Mamba Language Model

AI21 Labs · 2024

Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

NVIDIA · 2025

Efficiently Modeling Long Sequences with Structured State Spaces

Gu et al. · 2022 · ICLR 2022

HiPPO: Recurrent Memory with Optimal Polynomial Projections.

Gu, A., Dao, T., Ermon, S., Rudra, A., & Ré, C. · 2020 · NeurIPS 2020

Back to Topics

LearnInference & Production ScaleMamba & State Space Models

🚀HardInference Optimization

Mamba & State Space Models

Master linear-time sequence modeling: from S4 and HiPPO to Mamba's selective recurrence, Mamba-2's SSD framework, Mamba-3's inference-first refinements, and modern hybrid Transformer-SSM designs.

35 min read

Learning path

Step 138 of 155 in the full curriculum

Mixture of Experts Architecture Reasoning & Test-Time Compute

Mamba & State Space Models

From control theory to language models

State Space Models originate from control theory, where they describe dynamical systems via continuous differential equations. The key variables are:

$x(t) \in \mathbb{R}$ : the input signal
$h(t) \in \mathbb{R}^s$ : the hidden state, a vector of dimension $s$ that acts as a compressed summary of all past inputs
$y(t) \in \mathbb{R}$ : the output
$\mathbf{A} \in \mathbb{R}^{s \times s}$ : the state transition matrix, shaping how much past information to retain
$\mathbf{B} \in \mathbb{R}^{s \times 1}$ : the input matrix, shaping how much new information to absorb
$\mathbf{C} \in \mathbb{R}^{1 \times s}$ : the output matrix, which reads out the relevant part of the state

These equations are often written for scalar input/output to keep the notation readable. Real neural layers run many channels in parallel, but the same recurrence applies channel-wise.

From these, the continuous dynamics are:

$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t)$ $y(t) = \mathbf{C}h(t) + \mathbf{D}x(t)$

What this means

Discretization

\begin{aligned} \bar{\mathbf{A}} &= e^{\Delta \mathbf{A}} \\ \bar{\mathbf{B}} &= (\Delta \mathbf{A})^{-1} \left(e^{\Delta \mathbf{A}} - \mathbf{I}\right) \cdot \Delta \mathbf{B} \end{aligned}

where $e^{\Delta \mathbf{A}}$ is the matrix exponential. These discrete matrices encode the same dynamics as the continuous equations, but in a form suitable for step-by-step updates.

The discrete recurrence then becomes:

$h_t = \bar{\mathbf{A}} h_{t-1} + \bar{\mathbf{B}} x_t$ $y_t = \mathbf{C} h_t + \mathbf{D} x_t$

Walkthrough with numbers

To make the recurrence concrete, imagine a one-dimensional state tracking the running importance of package scan events. Set:

$\bar{A} = 0.9$ (the state keeps 90% of its previous value)
$\bar{B} = 0.2$ (new input contributes 20%)
$\bar{C} = 1.0$ (read the full state)
$\bar{D} = 0$ (no skip)

Now process four scans with importance scores $x = [3, 1, 4, 2]$ :

Step	Input $x_t$	State $h_t = 0.9 \, h_{t-1} + 0.2 \, x_t$	Output $y_t$
1	3	$0.9 \cdot 0 + 0.2 \cdot 3 = 0.60$	0.60
2	1	$0.9 \cdot 0.60 + 0.2 \cdot 1 = 0.74$	0.74
3	4	$0.9 \cdot 0.74 + 0.2 \cdot 4 = 1.47$	1.47
4	2	$0.9 \cdot 1.47 + 0.2 \cdot 2 = 1.72$	1.72

discrete-recurrence.py

inputs = [3, 1, 4, 2]
a_bar = 0.9
b_bar = 0.2
state = 0.0
states = []

for value in inputs:
    state = a_bar * state + b_bar * value
    states.append(round(state, 3))

print("states:", states)
print("final state:", states[-1])

Output

states: [0.6, 0.74, 1.466, 1.719]
final state: 1.719

S4: structured state spaces

HiPPO initialization

Dual computation modes

S4 can be computed in two equivalent ways. For notational simplicity, the costs below are written for one channel; a full layer runs many channels in parallel.

Recurrent mode (generation)

Efficient for generation: $h_t = \bar{\mathbf{A}} h_{t-1} + \bar{\mathbf{B}} x_t, \quad y_t = \mathbf{C} h_t + \mathbf{D} x_t$

Time: $O(L \cdot s)$ , Memory: $O(s)$ per step.

Convolutional mode (training)

Efficient for training: $y = x * \bar{K}, \quad \text{where } \bar{K}_t = \mathbf{C} \bar{\mathbf{A}}^t \bar{\mathbf{B}}$

In plain terms

The LTI problem

Mamba: selective state spaces

Mamba^[1] solves the LTI limitation with one core idea: make part of the SSM input-dependent.

The selective mechanism

Instead of fixed $\mathbf{B}$ , $\mathbf{C}$ , and $\Delta$ , Mamba computes them as functions of the input:

$\mathbf{B}_t = \text{Linear}_B(x_t), \quad \mathbf{C}_t = \text{Linear}_C(x_t), \quad \Delta_t = \text{softplus}(\text{Linear}_\Delta(x_t))$

The step size $\Delta_t$ is especially important: it controls how much the model updates its hidden state:

Large $\Delta_t$ : Decay older state faster and absorb more of the current input
Small $\Delta_t$ : Keep more of the previous state and make a smaller update

delta-controls-retention.py

from math import exp

a = -1.0
for delta in (0.1, 1.0, 2.0):
    retained_fraction = exp(delta * a)
    print(f"delta={delta:.1f} retains {retained_fraction:.3f} of previous state")

Output

delta=0.1 retains 0.905 of previous state
delta=1.0 retains 0.368 of previous state
delta=2.0 retains 0.135 of previous state

Architecture

architecture.py

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Reference selective SSM. Clear and correct, but not hardware optimized."""

    def __init__(self, d_model: int, d_state: int = 64):
        super().__init__()
        self.u_proj = nn.Linear(d_model, d_state)
        self.b_proj = nn.Linear(d_model, d_state)
        self.c_proj = nn.Linear(d_model, d_state)
        self.dt_proj = nn.Linear(d_model, d_state)
        self.a_log = nn.Parameter(torch.zeros(d_state))
        self.out_proj = nn.Linear(d_state, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, length, _ = x.shape
        A = -torch.exp(self.a_log).view(1, -1)  # Stable diagonal dynamics.
        h = x.new_zeros(batch, A.size(-1))
        outputs = []

        for t in range(length):
            x_t = x[:, t, :]
            u_t = self.u_proj(x_t)
            B_t = self.b_proj(x_t)
            C_t = self.c_proj(x_t)
            delta_t = F.softplus(self.dt_proj(x_t))

            A_bar = torch.exp(delta_t * A)
            B_bar = torch.where(
                A.abs() > 1e-6,
                ((A_bar - 1.0) / A) * B_t,
                delta_t * B_t,
            )

            h = A_bar * h + B_bar * u_t
            y_t = self.out_proj(C_t * h) + x_t  # Simple skip connection.
            outputs.append(y_t)

        return torch.stack(outputs, dim=1)

torch.manual_seed(0)
ssm = SelectiveSSM(d_model=8, d_state=4)
x = torch.randn(1, 2, 8)
y = ssm(x)

print("input shape:", tuple(x.shape))
print("output shape:", tuple(y.shape))
print("state dimension:", ssm.a_log.numel())

Output

input shape: (1, 2, 8)
output shape: (1, 2, 8)
state dimension: 4

Separate from recurrence math, the serving-memory claim is easy to sanity-check numerically.

state_scaling.py

def transformer_kv_bytes(layers: int, heads: int, head_dim: int, tokens: int, bytes_per_value: int = 2) -> int:
    return 2 * layers * heads * head_dim * tokens * bytes_per_value

def ssm_state_bytes(
    layers: int,
    inner_channels: int,
    state_size: int,
    conv_width: int,
    bytes_per_value: int = 2,
) -> int:
    recurrent_state = layers * inner_channels * state_size
    convolution_state = layers * inner_channels * (conv_width - 1)
    return (recurrent_state + convolution_state) * bytes_per_value

layers = 32
heads = 32
head_dim = 128
inner_channels = 4096
state_size = 16
conv_width = 4

for tokens in [128, 512, 2048, 8192]:
    kv_mb = transformer_kv_bytes(layers, heads, head_dim, tokens) / 1024**2
    ssm_mb = ssm_state_bytes(layers, inner_channels, state_size, conv_width) / 1024**2
    print(f"context={tokens:5d} -> transformer_kv={kv_mb:7.1f} MB, ssm_state={ssm_mb:4.1f} MB")

Output

context=  128 -> transformer_kv=   64.0 MB, ssm_state= 4.8 MB
context=  512 -> transformer_kv=  256.0 MB, ssm_state= 4.8 MB
context= 2048 -> transformer_kv= 1024.0 MB, ssm_state= 4.8 MB
context= 8192 -> transformer_kv= 4096.0 MB, ssm_state= 4.8 MB

What goes wrong when you build this from scratch

The code above is shape-correct for the simplified recurrence, but beginners hit predictable snags when they try to extend it. Here are three common symptoms, their causes, and how to fix them.

Symptom	Cause	Fix
State values explode to `NaN` within a few steps	`A_bar` can exceed 1.0 if `softplus` isn't used to keep $\Delta_t$ positive, or if $\mathbf{A}$ isn't constrained negative	Keep `softplus` on $\Delta_t$ and initialize `a_log` so that `A = -exp(a_log)` stays negative
Model acts like a plain RNN with no content awareness	`B_t` and `C_t` are accidentally hardcoded as constants instead of `Linear(x_t)` projections	Verify that `b_proj` and `c_proj` are actually called with the current token `x_t` inside the loop
Training is unbearably slow on long sequences	Using a Python `for` loop over the sequence dimension on GPU	The reference code above is for reading, not training. Production Mamba relies on fused CUDA scan kernels that keep state in SRAM, not Python loops

The hardware-aware parallel scan

The algorithm operates in three phases to improve hardware utilization:

First, it computes all the input-dependent parameters and pairs of $(\bar{\mathbf{A}}_t, \bar{\mathbf{B}}_t x_t)$ in parallel across the entire sequence.
Next, it applies the parallel prefix sum using a highly optimized associative operator to combine these terms.
Finally, it reads off all the hidden state values $h_t$ simultaneously.

associative-scan-composition.py

def compose(right: tuple[float, float], left: tuple[float, float]) -> tuple[float, float]:
    """Compose h -> a*h + b transforms: apply left, then right."""
    a_right, b_right = right
    a_left, b_left = left
    return a_right * a_left, a_right * b_left + b_right

steps = [(0.9, 0.6), (0.9, 0.2), (0.9, 0.8), (0.9, 0.4)]
left_chunk = compose(steps[1], steps[0])
right_chunk = compose(steps[3], steps[2])
whole_sequence = compose(right_chunk, left_chunk)

print("left chunk transform:", tuple(round(value, 3) for value in left_chunk))
print("right chunk transform:", tuple(round(value, 3) for value in right_chunk))
print("state from zero:", round(whole_sequence[1], 3))

Output

left chunk transform: (0.81, 0.74)
right chunk transform: (0.81, 1.12)
state from zero: 1.719

Mamba-2: state space duality

intra-chunk outputs are computed in parallel
chunk final states are computed with batched matrix multiplications
only the much shorter inter-chunk state passing remains a scan
the initial-state contribution for each chunk is added back with more batched matrix multiplications

In the SSD algorithm, more work is expressed as matrix multiplications, while a reduced inter-chunk recurrent core remains.

Mamba-1 vs. Mamba-2 comparison

Feature	Mamba-1	Mamba-2
Core Operation	Selective Scan	SSD block decomposition
Hardware Execution	Custom scan-heavy CUDA kernels	Mostly batched matmuls plus a shorter scan
Core Layer Speed	Linear-time scan	2-8x reported speedup over Mamba-1 core layer^[2]
Theoretical Link	Selective SSM recurrence	SSD bridge to structured masked attention

ssd-chunk-boundary-count.py

sequence_length = 16384
chunk_size = 256
chunks = sequence_length // chunk_size

print("tokens processed within chunks:", sequence_length)
print("chunk boundary states to scan:", chunks)
print("boundary reduction factor:", f"{sequence_length / chunks:.0f}x")

Output

tokens processed within chunks: 16384
chunk boundary states to scan: 64
boundary reduction factor: 256x

Beyond Mamba-2: Mamba-3

Training infrastructure for SSMs

The need for custom kernels

Complexity comparison

Operation	Training	Generation (per token)	Generation State / Cache
Self-attention	$O(L^2 d)$	$O(Ld)$	$O(L \cdot d)$ grows linearly
S4 (LTI SSM)	$O(L \log L \cdot d)$ (FFT convolution)	$O(s \cdot d)$	$O(s \cdot d)$ constant
Mamba (selective SSM)	$O(L \cdot s \cdot d)$ (parallel scan)	$O(s \cdot d)$	$O(s \cdot d)$ constant
Linear attention	$O(L \cdot d^2)$	$O(d^2)$	$O(d^2)$ constant

reported-throughput-gate.py

baseline_tokens_per_second = 120
reported_multiplier = 5.0
measured_candidate_tokens_per_second = 410
quality_regression_points = 0.4
allowed_regression_points = 0.5

reported_tokens_per_second = baseline_tokens_per_second * reported_multiplier
passes_local_gate = (
    measured_candidate_tokens_per_second > baseline_tokens_per_second
    and quality_regression_points <= allowed_regression_points
)

print("paper-scale illustration:", reported_tokens_per_second, "tokens/s")
print("locally measured candidate:", measured_candidate_tokens_per_second, "tokens/s")
print("candidate passes measured gate:", passes_local_gate)

Output

paper-scale illustration: 600.0 tokens/s
locally measured candidate: 410 tokens/s
candidate passes measured gate: True

Hybrid architectures: combining trade-offs

Hybrid models: Jamba and Nemotron-H

Architecture: Alternating blocks with Mamba layers handling the bulk of computation and attention layers acting as retrieval checkpoints
MoE (Mixture of Experts) integration: Jamba uses routing on some MLP (Multi-Layer Perceptron) layers to increase total capacity without activating every parameter

Reported hybrid-model evaluations motivate three measurements:

long-context throughput versus an attention-only baseline
KV-cache memory reduction as the number of attention layers falls
retrieval and in-context-learning quality versus both pure SSM and attention-only baselines

hybrid-kv-cache-fraction.py

total_layers = 32
attention_layers = 4
full_attention_kv_gb = 16.0
hybrid_kv_gb = full_attention_kv_gb * attention_layers / total_layers

print("attention layer fraction:", f"{attention_layers / total_layers:.1%}")
print("illustrative full-attention KV:", f"{full_attention_kv_gb:.1f} GB")
print("illustrative hybrid KV:", f"{hybrid_kv_gb:.1f} GB")

Output

attention layer fraction: 12.5%
illustrative full-attention KV: 16.0 GB
illustrative hybrid KV: 2.0 GB

Why hybrids work

The attention layers serve as explicit-access checkpoints: tokens periodically get attention over the context, while Mamba layers process the remaining positions through recurrent state:

When to use SSMs vs. transformers vs. hybrids

Scenario	Candidate to benchmark	Why
Standard chat (short to moderate context)	Transformer	Mature optimized tooling; validate quality and latency
Long document analysis (128K+)	Hybrid	Test memory savings against retrieval accuracy
Real-time streaming	SSM	Fixed decode state is easier to budget as history grows
In-context learning (few-shot)	Transformer	Explicit token access is a strong baseline for copying/matching
Edge deployment (limited memory)	SSM or SSM-heavy hybrid	Fixed state is easier to budget than a growing KV cache
RAG (Retrieval-Augmented Generation) with precise retrieval	Transformer or Hybrid	Evaluate whether attention layers preserve required evidence use

architecture-selection-gate.py

candidates = {
    "transformer": {"recall": 0.97, "decode_memory_gb": 18.0, "p95_ms": 210},
    "hybrid": {"recall": 0.96, "decode_memory_gb": 6.0, "p95_ms": 150},
    "ssm": {"recall": 0.89, "decode_memory_gb": 2.0, "p95_ms": 120},
}
requirements = {"recall": 0.95, "decode_memory_gb": 8.0, "p95_ms": 180}

approved = [
    name
    for name, metrics in candidates.items()
    if metrics["recall"] >= requirements["recall"]
    and metrics["decode_memory_gb"] <= requirements["decode_memory_gb"]
    and metrics["p95_ms"] <= requirements["p95_ms"]
]

print("approved candidates:", approved)

Output

approved candidates: ['hybrid']

prefill-versus-decode-accounting.py

transformer = {"prefill_ms": 120, "decode_ms_per_token": 3.2}
ssm = {"prefill_ms": 135, "decode_ms_per_token": 1.8}

for generated_tokens in (16, 256):
    transformer_total = transformer["prefill_ms"] + generated_tokens * transformer["decode_ms_per_token"]
    ssm_total = ssm["prefill_ms"] + generated_tokens * ssm["decode_ms_per_token"]
    winner = "ssm" if ssm_total < transformer_total else "transformer"
    print(f"generated={generated_tokens}: winner={winner}, delta_ms={abs(ssm_total - transformer_total):.1f}")

Output

generated=16: winner=ssm, delta_ms=7.4
generated=256: winner=ssm, delta_ms=343.4

Mastery check

Evaluation rubric

Explains the core serving tradeoff: attention keeps visible token history, while SSMs keep a fixed recurrent state
Works through continuous dynamics, ZOH discretization, and the discrete recurrence without losing the meaning of each matrix
Explains why S4 was useful but limited by fixed, input-independent dynamics
Explains what makes Mamba selective: token-dependent $\mathbf{B}_t$ , $\mathbf{C}_t$ , and $\Delta_t$
Distinguishes Mamba-1's scan-heavy kernel path from Mamba-2's SSD reformulation and tensor-core-friendly matmul path
Explains why hybrids keep sparse attention checkpoints instead of replacing attention everywhere

Common pitfalls

Follow-up questions

A team wants 256K-token log triage on 24 GB GPUs. Why is a pure transformer stack risky, and why might a hybrid be safer than a pure SSM?

Why can't standard S4-style SSMs perform in-context learning as effectively as attention-based models?

A benchmark shows better long-context throughput after switching to Mamba-1, but GPU utilization is still poor. What should team inspect next?

What is real serving advantage of SSMs over transformers for long-context generation, and what does that advantage not solve?

Why did Jamba keep attention layers instead of using only Mamba layers?

What this unlocks next

Next Step

Continue to Reasoning & Test-Time Compute

State-space models show one way to handle long sequences efficiently; reasoning systems ask when it is worth spending extra inference compute instead.

PreviousMixture of Experts Architecture

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu & Dao · 2023

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Dao & Gu · 2024

Mamba-3: Improved Sequence Modeling using State Space Principles

Gu, A., et al. · 2026

Jamba: A Hybrid Transformer-Mamba Language Model

AI21 Labs · 2024

Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

NVIDIA · 2025

Efficiently Modeling Long Sequences with Structured State Spaces

Gu et al. · 2022 · ICLR 2022

HiPPO: Recurrent Memory with Optimal Polynomial Projections.

Gu, A., Dao, T., Ermon, S., Rudra, A., & Ré, C. · 2020 · NeurIPS 2020

Mamba & State Space Models

Mamba & State Space Models

What is the core serving tradeoff between attention and Mamba-style SSMs?

From control theory to language models

In the state-space equations, what role does the hidden state play?

What this means

Discretization

Walkthrough with numbers

If you change A-bar to 0.5 and B-bar to 0.5, what is the final state after processing [3, 1, 4, 2] from zero state?

S4: structured state spaces

HiPPO initialization

Dual computation modes

Recurrent mode (generation)

Convolutional mode (training)

In plain terms

Why does S4 need two computation views?

The LTI problem

Why is a fixed LTI SSM weaker than attention for content-dependent retrieval?

Mamba: selective state spaces

The selective mechanism

If Δt\Delta_tΔt​ is very large, does the model remember more or less of the previous state?

What does "selective" mean in Mamba?

Architecture

Why is the reference PyTorch loop useful for learning but wrong for production training?

What goes wrong when you build this from scratch

The hardware-aware parallel scan

What makes the selective scan "hardware-aware"?

Mamba-2: state space duality

Why does Mamba-2 matter if Mamba-1 was already linear time?

Mamba-1 vs. Mamba-2 comparison

Beyond Mamba-2: Mamba-3

What kind of improvement does Mamba-3 target compared with earlier Mamba work?

Training infrastructure for SSMs

The need for custom kernels

Complexity comparison

Why is constant-memory decoding both an advantage and a limitation?

Hybrid architectures: combining trade-offs

Hybrid models: Jamba and Nemotron-H

Why do hybrids keep occasional attention layers instead of using only SSM layers?

Why hybrids work

When to use SSMs vs. transformers vs. hybrids

When should a production team prefer a transformer over an SSM-heavy model despite worse long-context asymptotics?

Mastery check

Evaluation rubric

Common pitfalls

Follow-up questions

A team wants 256K-token log triage on 24 GB GPUs. Why is a pure transformer stack risky, and why might a hybrid be safer than a pure SSM?

Why can't standard S4-style SSMs perform in-context learning as effectively as attention-based models?

A benchmark shows better long-context throughput after switching to Mamba-1, but GPU utilization is still poor. What should team inspect next?

What is real serving advantage of SSMs over transformers for long-context generation, and what does that advantage not solve?

Why did Jamba keep attention layers instead of using only Mamba layers?

What this unlocks next

Mamba & State Space Models

Mamba & State Space Models

What is the core serving tradeoff between attention and Mamba-style SSMs?

From control theory to language models

In the state-space equations, what role does the hidden state play?

What this means

Discretization

Walkthrough with numbers

If you change A-bar to 0.5 and B-bar to 0.5, what is the final state after processing [3, 1, 4, 2] from zero state?

S4: structured state spaces

HiPPO initialization

Dual computation modes

Recurrent mode (generation)

Convolutional mode (training)

In plain terms

Why does S4 need two computation views?

The LTI problem

Why is a fixed LTI SSM weaker than attention for content-dependent retrieval?

Mamba: selective state spaces

The selective mechanism

If Δt\Delta_tΔt​ is very large, does the model remember more or less of the previous state?

What does "selective" mean in Mamba?

Architecture

Why is the reference PyTorch loop useful for learning but wrong for production training?

What goes wrong when you build this from scratch

The hardware-aware parallel scan

What makes the selective scan "hardware-aware"?

Mamba-2: state space duality