Master linear-time sequence modeling: from S4 and HiPPO to Mamba's selective recurrence, Mamba-2's SSD framework, Mamba-3's inference-first refinements, and modern hybrid Transformer-SSM designs.
Mixture of Experts showed one way to spend less compute per token: route each token through only a few feed-forward network (FFN) experts. State Space Models (SSMs) attack a different cost: attention over long histories.
State space models such as Mamba are alternatives to transformer attention for long sequences. This chapter explains the tradeoff: linear-time recurrence and selective state updates instead of keeping every previous token visible through attention.
Imagine processing a 1,000-event order timeline. One strategy is to keep every previous event open and visible while reading each new one. This approach catches every detail, but by event 500, your workspace is buried under history and it becomes unbearably slow. That's how transformers behave during autoregressive generation: each new token attends over the growing key-value (KV) cache, so decoding gets more expensive as context grows.
State Space Models (SSMs) take a different approach: instead of rereading everything, they keep a compact running summary that updates with each new event. In recurrent form, the update cost stays linear in sequence length and the model carries a fixed-size state instead of a context-length-dependent cache.[1] The Mamba architecture[1] made selective SSMs competitive with transformers at language-model scale, and later work like Mamba-2[2], Mamba-3[3], and hybrid architectures like Jamba[4] and Nemotron-H[5] pushed that line of work further.
Understanding SSMs matters because long-context serving is often limited by attention's growing KV cache, while Mamba-style models push historical information into a fixed-size recurrent state during decoding.[1]
Set expectations up front: a fixed recurrent state removes KV-cache growth during decode, but it also creates a retrieval risk because history is compressed rather than explicitly addressable. Hybrid Transformer-SSM models are one candidate when a workload needs both bounded decode state and occasional attention access. This chapter teaches the mechanism and the measurements needed to decide whether that trade pays off.
The math behind SSMs is rooted in continuous systems, but the core idea is simple. Think of a warehouse routing controller: it has a current state (current backlog and carrier capacity), receives input (new orders and scan updates), and produces output (which lane or carrier should handle the next package). The state updates based on both its previous state and the new input. SSMs do exactly this for text: they maintain a hidden state that continuously updates as each new word arrives.
State Space Models originate from control theory, where they describe dynamical systems via continuous differential equations. The key variables are:
These equations are often written for scalar input/output to keep the notation readable. Real neural layers run many channels in parallel, but the same recurrence applies channel-wise.
From these, the continuous dynamics are:
At every moment, the hidden state does two things simultaneously: it decays based on (controlled forgetting), and it absorbs new input scaled by . The output is then a linear readout of the state via . The term is a skip connection that lets the input pass through unchanged, similar to a residual connection. Think of the hidden state as a running summary that continuously updates as new tokens arrive.
Tokens are discrete, but the equations above describe smooth, continuous change. To apply them to token sequences, we discretize by sampling at intervals of timesteps. The standard approach is zero-order hold (ZOH), which converts the continuous matrices into discrete equivalents:[6]
where is the matrix exponential. These discrete matrices encode the same dynamics as the continuous equations, but in a form suitable for step-by-step updates.
The discrete recurrence then becomes:
At each step: decay the old state by , absorb the new input scaled by , and read out through . The term is a skip connection (the input contributes directly to the output). For this recurrent layer, one decode update uses state sized by the model rather than by preceding context length.
To make the recurrence concrete, imagine a one-dimensional state tracking the running importance of package scan events. Set:
Now process four scans with importance scores :
| Step | Input | State | Output |
|---|---|---|---|
| 1 | 3 | 0.60 | |
| 2 | 1 | 0.74 | |
| 3 | 4 | 1.47 | |
| 4 | 2 | 1.72 |
Notice how the state smooths the input sequence. A high input like 4 pulls the state up, but the old state still carries weight. This compression is the whole point: instead of storing every scan, the SSM keeps a fixed-size summary.
1inputs = [3, 1, 4, 2]
2a_bar = 0.9
3b_bar = 0.2
4state = 0.0
5states = []
6
7for value in inputs:
8 state = a_bar * state + b_bar * value
9 states.append(round(state, 3))
10
11print("states:", states)
12print("final state:", states[-1])1states: [0.6, 0.74, 1.466, 1.719]
2final state: 1.719If were 0.1 instead of 0.9, the final state would change sharply. With the same inputs and , the final state is about instead of : earlier events contribute far less. Whether that is harmful depends on what information the task needs to retain.
The Structured State Space for Sequence Modeling (S4)[6] advanced neural SSMs by giving them structured long-range dynamics and efficient computation views. The following diagram illustrates the basic S4 architecture, where the hidden state is updated iteratively using structured matrices.
In S4, the continuous-time matrix starts from the HiPPO (High-order Polynomial Projection Operators) framework, which compresses input history into coefficients of orthogonal polynomials.[7][6]
Without this structure, a long recurrence would tend to wash out older information. HiPPO gives S4 a principled way to preserve information over long horizons instead of relying on an arbitrary recurrent matrix to discover that behavior from scratch.
S4 can be computed in two equivalent ways. For notational simplicity, the costs below are written for one channel; a full layer runs many channels in parallel.
Efficient for generation:
Time: , Memory: per step.
Efficient for training:
The recurrence above can be "unrolled" into a single convolution with a pre-computed kernel . This allows processing the entire sequence at once using FFT (Fast Fourier Transform), so the sequence-mixing part of training runs in time instead of a naive token-by-token loop. Same math, different execution strategy.
Here's S4's critical limitation. Imagine a conveyor belt in a factory that processes every item at exactly the same speed with exactly the same steps, whether it's a diamond ring or a pebble. That's efficient, but it can't prioritize. It can't be told to "slow down for the diamond and speed past the pebbles."
S4 is a Linear Time-Invariant (LTI) system: matrices , , and are fixed regardless of input content. It can represent position-dependent decay through recurrence, but it cannot change its write/read policy because a particular input token is important.[1]
This is different from attention, where query-key interactions provide content-dependent access. Fixed LTI SSMs struggle on selective-copy style tasks used to test content-aware memory; Mamba addresses that weakness with input-dependent parameters.
Mamba[1] solves the LTI limitation with one core idea: make part of the SSM input-dependent.
Think of a conveyor scanner moving through package events. Routine scans such as "label printed" may need a small state update, while a phrase like "address corrected after failed delivery" may need a stronger update. The content of each event conditions the update rule. That is Mamba's selective mechanism.
Instead of fixed , , and , Mamba computes them as functions of the input:
Unlike S4, which uses the same , , and for every token, Mamba computes them fresh from each token's content. The learned transition and skip term remain input-independent in Mamba-1; selectivity enters through , , and .[1]
The step size is especially important: it controls how much the model updates its hidden state:
1from math import exp
2
3a = -1.0
4for delta in (0.1, 1.0, 2.0):
5 retained_fraction = exp(delta * a)
6 print(f"delta={delta:.1f} retains {retained_fraction:.3f} of previous state")1delta=0.1 retains 0.905 of previous state
2delta=1.0 retains 0.368 of previous state
3delta=2.0 retains 0.135 of previous stateThis gives Mamba a learned, content-dependent recurrent update while retaining linear-length recurrent computation. It does not give every output explicit access to every prior token as attention does.
Beginners often assume Mamba "remembers everything" because it avoids the KV cache. It doesn't. A large can make the model forget past tokens just as aggressively as attention sometimes ignores distant positions. The difference is that Mamba learns how much to forget per token, not whether to access a full history. In a package-tracking model, a large on a routine "label printed" scan could wash out the previous "address corrected" event if the model hasn't learned to gate that update carefully.
A Mamba block replaces attention with a specific stack: projection, causal convolution, selective SSM update, and gating.[1] It uses SiLU (Sigmoid Linear Unit) activations and softplus (a smooth approximation of ReLU defined as log(1 + e^x), which is always positive and differentiable everywhere) to keep the step sizes positive:
The code below focuses on the selective recurrence itself rather than the entire fused production block. It leaves out the causal convolution and output gate on purpose, so the core SSM update stays easy to inspect and the tensor shapes stay correct.
1import torch
2import torch.nn as nn
3import torch.nn.functional as F
4
5class SelectiveSSM(nn.Module):
6 """Reference selective SSM. Clear and correct, but not hardware optimized."""
7
8 def __init__(self, d_model: int, d_state: int = 64):
9 super().__init__()
10 self.u_proj = nn.Linear(d_model, d_state)
11 self.b_proj = nn.Linear(d_model, d_state)
12 self.c_proj = nn.Linear(d_model, d_state)
13 self.dt_proj = nn.Linear(d_model, d_state)
14 self.a_log = nn.Parameter(torch.zeros(d_state))
15 self.out_proj = nn.Linear(d_state, d_model)
16
17 def forward(self, x: torch.Tensor) -> torch.Tensor:
18 batch, length, _ = x.shape
19 A = -torch.exp(self.a_log).view(1, -1) # Stable diagonal dynamics.
20 h = x.new_zeros(batch, A.size(-1))
21 outputs = []
22
23 for t in range(length):
24 x_t = x[:, t, :]
25 u_t = self.u_proj(x_t)
26 B_t = self.b_proj(x_t)
27 C_t = self.c_proj(x_t)
28 delta_t = F.softplus(self.dt_proj(x_t))
29
30 A_bar = torch.exp(delta_t * A)
31 B_bar = torch.where(
32 A.abs() > 1e-6,
33 ((A_bar - 1.0) / A) * B_t,
34 delta_t * B_t,
35 )
36
37 h = A_bar * h + B_bar * u_t
38 y_t = self.out_proj(C_t * h) + x_t # Simple skip connection.
39 outputs.append(y_t)
40
41 return torch.stack(outputs, dim=1)
42
43torch.manual_seed(0)
44ssm = SelectiveSSM(d_model=8, d_state=4)
45x = torch.randn(1, 2, 8)
46y = ssm(x)
47
48print("input shape:", tuple(x.shape))
49print("output shape:", tuple(y.shape))
50print("state dimension:", ssm.a_log.numel())1input shape: (1, 2, 8)
2output shape: (1, 2, 8)
3state dimension: 4Separate from recurrence math, the serving-memory claim is easy to sanity-check numerically.
1def transformer_kv_bytes(layers: int, heads: int, head_dim: int, tokens: int, bytes_per_value: int = 2) -> int:
2 return 2 * layers * heads * head_dim * tokens * bytes_per_value
3
4def ssm_state_bytes(
5 layers: int,
6 inner_channels: int,
7 state_size: int,
8 conv_width: int,
9 bytes_per_value: int = 2,
10) -> int:
11 recurrent_state = layers * inner_channels * state_size
12 convolution_state = layers * inner_channels * (conv_width - 1)
13 return (recurrent_state + convolution_state) * bytes_per_value
14
15layers = 32
16heads = 32
17head_dim = 128
18inner_channels = 4096
19state_size = 16
20conv_width = 4
21
22for tokens in [128, 512, 2048, 8192]:
23 kv_mb = transformer_kv_bytes(layers, heads, head_dim, tokens) / 1024**2
24 ssm_mb = ssm_state_bytes(layers, inner_channels, state_size, conv_width) / 1024**2
25 print(f"context={tokens:5d} -> transformer_kv={kv_mb:7.1f} MB, ssm_state={ssm_mb:4.1f} MB")1context= 128 -> transformer_kv= 64.0 MB, ssm_state= 4.8 MB
2context= 512 -> transformer_kv= 256.0 MB, ssm_state= 4.8 MB
3context= 2048 -> transformer_kv= 1024.0 MB, ssm_state= 4.8 MB
4context= 8192 -> transformer_kv= 4096.0 MB, ssm_state= 4.8 MBThe code above is shape-correct for the simplified recurrence, but beginners hit predictable snags when they try to extend it. Here are three common symptoms, their causes, and how to fix them.
| Symptom | Cause | Fix |
|---|---|---|
State values explode to NaN within a few steps | A_bar can exceed 1.0 if softplus isn't used to keep positive, or if isn't constrained negative | Keep softplus on and initialize a_log so that A = -exp(a_log) stays negative |
| Model acts like a plain RNN with no content awareness | B_t and C_t are accidentally hardcoded as constants instead of Linear(x_t) projections | Verify that b_proj and c_proj are actually called with the current token x_t inside the loop |
| Training is unbearably slow on long sequences | Using a Python for loop over the sequence dimension on GPU | The reference code above is for reading, not training. Production Mamba relies on fused CUDA scan kernels that keep state in SRAM, not Python loops |
The key engineering innovation in Mamba is the hardware-aware scan algorithm. At first glance, the recurrence appears strictly sequential. If must be known before computing , processing a sequence seems to require a slow token-by-token loop, which would severely bottleneck training on GPUs compared to the heavily parallelized attention mechanism.
The bottleneck isn't only arithmetic. It's also memory traffic: if every token update spills intermediate state back to high-bandwidth memory (HBM), the recurrence becomes bandwidth-bound. Mamba addresses that by fusing discretization and state updates so the recurrent state stays in fast on-chip static random-access memory (SRAM) as much as possible, then using an associative scan primitive to recover parallelism across positions.[1]
Because the recurrence is associative after the right reformulation, the model can group updates and compute intermediate states in a tree-like structure. Consider the task of calculating a running total for a long list of numbers. Doing it sequentially means adding one number at a time. A parallel scan is like splitting the list into smaller chunks, calculating subtotals for each chunk simultaneously, and then combining those subtotals in a tree-like fashion. This way, the final running total is calculated much faster.
The algorithm operates in three phases to improve hardware utilization:
1def compose(right: tuple[float, float], left: tuple[float, float]) -> tuple[float, float]:
2 """Compose h -> a*h + b transforms: apply left, then right."""
3 a_right, b_right = right
4 a_left, b_left = left
5 return a_right * a_left, a_right * b_left + b_right
6
7steps = [(0.9, 0.6), (0.9, 0.2), (0.9, 0.8), (0.9, 0.4)]
8left_chunk = compose(steps[1], steps[0])
9right_chunk = compose(steps[3], steps[2])
10whole_sequence = compose(right_chunk, left_chunk)
11
12print("left chunk transform:", tuple(round(value, 3) for value in left_chunk))
13print("right chunk transform:", tuple(round(value, 3) for value in right_chunk))
14print("state from zero:", round(whole_sequence[1], 3))1left chunk transform: (0.81, 0.74)
2right chunk transform: (0.81, 1.12)
3state from zero: 1.719By breaking the worst sequential dependency, this hardware-aware approach achieves total work with much smaller critical-path depth than a naive recurrent loop. That makes Mamba trainable on GPUs, but it still depends on custom scan kernels and doesn't map as cleanly to tensor-core matmuls as transformers do. That remaining gap is what Mamba-2 addresses.
While Mamba proved that selective SSMs were capable of high performance, the parallel scan operation was still a custom hardware path that underused the matrix-multiplication engines (Tensor Cores) that make GPUs so fast.
Mamba-2[2] attacked that limitation through Structured State Space Duality (SSD). The core observation is that selective SSMs and variants of attention can both be written as operations over structured semiseparable matrices. That lets Mamba-2 use a chunked algorithm where:
In the SSD algorithm, more work is expressed as matrix multiplications, while a reduced inter-chunk recurrent core remains.
| Feature | Mamba-1 | Mamba-2 |
|---|---|---|
| Core Operation | Selective Scan | SSD block decomposition |
| Hardware Execution | Custom scan-heavy CUDA kernels | Mostly batched matmuls plus a shorter scan |
| Core Layer Speed | Linear-time scan | 2-8x reported speedup over Mamba-1 core layer[2] |
| Theoretical Link | Selective SSM recurrence | SSD bridge to structured masked attention |
The Mamba-2 paper reports 2-8x faster core-layer execution than Mamba-1 while retaining competitive language-model results in its experiments.[2] Treat this as evidence for the SSD implementation path, not a guarantee for every end-to-end server.
1sequence_length = 16384
2chunk_size = 256
3chunks = sequence_length // chunk_size
4
5print("tokens processed within chunks:", sequence_length)
6print("chunk boundary states to scan:", chunks)
7print("boundary reduction factor:", f"{sequence_length / chunks:.0f}x")1tokens processed within chunks: 16384
2chunk boundary states to scan: 64
3boundary reduction factor: 256xWork on SSMs did not stop at SSD. Mamba-3[3] takes an explicitly inference-first view and changes the recurrent update along three axes: a more expressive trapezoidal discretization (a second-order accurate update that generalizes Mamba-1/Mamba-2's simpler scheme), a complex-valued state update evaluated on state-tracking tasks such as parity and modular arithmetic, and a multi-input, multi-output (MIMO) formulation intended to raise arithmetic intensity during memory-bound decode.
In the paper's 1.5B-scale experiments, Mamba-3 improved average downstream accuracy over the next-best linear baseline, and the MIMO variant improved it further. The paper also reports matching Mamba-2 perplexity with half the state size.[3] These are experiment-scoped results to validate again in a deployment stack.
While transformers have benefited from years of optimization (like FlashAttention and highly tuned PyTorch implementations), SSMs require different training infrastructure. The selective scan that makes Mamba efficient can't be expressed as a plain Python loop without throwing away the whole performance story.
Training Mamba efficiently requires more than just writing a for loop in Python. The original implementation relies on custom kernels that keep the state in fast SRAM (Static Random-Access Memory) on the GPU rather than writing it back to slower HBM (High Bandwidth Memory). Early adoption was slower for exactly this reason: the ecosystem had to build SSM-specific kernels, serving paths, and state management instead of reusing the standard transformer stack. This is similar to how a regional courier hub must design custom conveyor layouts for its specific package mix, rather than plugging into a universal postal facility built for letter-sized mail.
Comparing the computational complexity of transformers and SSMs reveals why SSMs are so attractive for long sequences. The following chart and table isolate the sequence-mixing part of one layer, not the MLP or projection costs.
| Operation | Training | Generation (per token) | Generation State / Cache |
|---|---|---|---|
| Self-attention | grows linearly | ||
| S4 (LTI SSM) | (FFT convolution) | constant | |
| Mamba (selective SSM) | (parallel scan) | constant | |
| Linear attention | constant |
The critical advantage during generation: SSMs maintain a fixed-size state with roughly elements per layer regardless of context length. A transformer serving 128K context must keep on the order of 128K × d KV cache (Key-Value cache, which stores the results of previous key and value computations to avoid re-computing them for each new token) activations per layer. In a fulfillment center processing 100,000 package events per day, an SSM-based routing model keeps the same dashboard-sized state at event 1 and event 100,000, while a transformer would need a warehouse-sized filing system that grows with every scan.
That is the core analogy: full manifest versus fixed summary. A transformer's KV cache keeps every scan event for every package in a growing manifest. An SSM's state is a fixed-size operations summary; no matter how many events arrived, the summary has the same size because it keeps compressed state. The tradeoff is that sometimes a specific event would have been preferable to the summary.
The original Mamba paper reported roughly 5x higher inference throughput than transformers of similar size in its evaluated generation setup.[1] Treat that as a reported result, not a guarantee: real speedups depend on batch size, context length, quality requirements, and how mature the SSM kernels are relative to optimized attention.
1baseline_tokens_per_second = 120
2reported_multiplier = 5.0
3measured_candidate_tokens_per_second = 410
4quality_regression_points = 0.4
5allowed_regression_points = 0.5
6
7reported_tokens_per_second = baseline_tokens_per_second * reported_multiplier
8passes_local_gate = (
9 measured_candidate_tokens_per_second > baseline_tokens_per_second
10 and quality_regression_points <= allowed_regression_points
11)
12
13print("paper-scale illustration:", reported_tokens_per_second, "tokens/s")
14print("locally measured candidate:", measured_candidate_tokens_per_second, "tokens/s")
15print("candidate passes measured gate:", passes_local_gate)1paper-scale illustration: 600.0 tokens/s
2locally measured candidate: 410 tokens/s
3candidate passes measured gate: TrueThink of a fulfillment tracker that usually updates a compact running state (SSM mode, fast and efficient), but occasionally opens the exact event history for a tricky address correction or failed delivery scan (attention mode, slower but precise).
Fixed-state SSM designs can regress on tasks requiring precise retrieval from context, while attention-only stacks incur KV-cache growth as contexts expand. Hybrid architectures are candidates for workloads that must balance those risks:
Jamba[4] interleaves Transformer and Mamba layers in a structured hybrid stack and adds MoE to some multilayer perceptron (MLP) layers. In the released configuration, each 8-layer Jamba block uses a 1:7 attention-to-Mamba ratio, with MoE applied every other layer. Nemotron-H[5] uses attention in roughly 8% of layers (4 self-attention layers out of 52 in the 8B model) and fills the remainder with an even split of Mamba-2 and FFN layers.
The hybrid lesson is broader than any one model family: keep most depth on the fixed-state path, then add attention layers when evaluation shows that explicit token access matters. The main design choice is how often those layers appear and how much of the stack stays on Mamba-family layers.
Reported hybrid-model evaluations motivate three measurements:
1total_layers = 32
2attention_layers = 4
3full_attention_kv_gb = 16.0
4hybrid_kv_gb = full_attention_kv_gb * attention_layers / total_layers
5
6print("attention layer fraction:", f"{attention_layers / total_layers:.1%}")
7print("illustrative full-attention KV:", f"{full_attention_kv_gb:.1f} GB")
8print("illustrative hybrid KV:", f"{hybrid_kv_gb:.1f} GB")1attention layer fraction: 12.5%
2illustrative full-attention KV: 16.0 GB
3illustrative hybrid KV: 2.0 GBThe attention layers serve as explicit-access checkpoints: tokens periodically get attention over the context, while Mamba layers process the remaining positions through recurrent state:
Choosing the right architecture depends heavily on the specific workload, context length, and memory constraints. The choice matrix below gives a practical heuristic instead of pretending there's one universal winner.
| Scenario | Candidate to benchmark | Why |
|---|---|---|
| Standard chat (short to moderate context) | Transformer | Mature optimized tooling; validate quality and latency |
| Long document analysis (128K+) | Hybrid | Test memory savings against retrieval accuracy |
| Real-time streaming | SSM | Fixed decode state is easier to budget as history grows |
| In-context learning (few-shot) | Transformer | Explicit token access is a strong baseline for copying/matching |
| Edge deployment (limited memory) | SSM or SSM-heavy hybrid | Fixed state is easier to budget than a growing KV cache |
| RAG (Retrieval-Augmented Generation) with precise retrieval | Transformer or Hybrid | Evaluate whether attention layers preserve required evidence use |
1candidates = {
2 "transformer": {"recall": 0.97, "decode_memory_gb": 18.0, "p95_ms": 210},
3 "hybrid": {"recall": 0.96, "decode_memory_gb": 6.0, "p95_ms": 150},
4 "ssm": {"recall": 0.89, "decode_memory_gb": 2.0, "p95_ms": 120},
5}
6requirements = {"recall": 0.95, "decode_memory_gb": 8.0, "p95_ms": 180}
7
8approved = [
9 name
10 for name, metrics in candidates.items()
11 if metrics["recall"] >= requirements["recall"]
12 and metrics["decode_memory_gb"] <= requirements["decode_memory_gb"]
13 and metrics["p95_ms"] <= requirements["p95_ms"]
14]
15
16print("approved candidates:", approved)1approved candidates: ['hybrid']When evaluating SSMs for deployment, profile actual workloads rather than relying on theoretical complexity alone. SSM candidates are motivated when long-context decoding makes KV-cache growth a bottleneck. Prompt ingestion still scales with input length, and for short-context batch inference, optimized transformer implementations with FlashAttention may be faster in practice due to mature GPU kernels.
1transformer = {"prefill_ms": 120, "decode_ms_per_token": 3.2}
2ssm = {"prefill_ms": 135, "decode_ms_per_token": 1.8}
3
4for generated_tokens in (16, 256):
5 transformer_total = transformer["prefill_ms"] + generated_tokens * transformer["decode_ms_per_token"]
6 ssm_total = ssm["prefill_ms"] + generated_tokens * ssm["decode_ms_per_token"]
7 winner = "ssm" if ssm_total < transformer_total else "transformer"
8 print(f"generated={generated_tokens}: winner={winner}, delta_ms={abs(ssm_total - transformer_total):.1f}")1generated=16: winner=ssm, delta_ms=7.4
2generated=256: winner=ssm, delta_ms=343.4Symptom: Exact-copy or few-shot evals regress even though long-context throughput looks better. Cause: Team treated fixed recurrent state as if it offered the same token-level retrieval as attention. Pure SSMs compress history, so exact copying can get worse. Fix: Benchmark attention-only and hybrid candidates when retrieval is core behavior; measure whether inserted attention layers recover enough accuracy for their cache cost.
Symptom: Benchmarks say "linear time," but GPU throughput is still disappointing. Cause: Implementation fell back to Python loops or weak scan kernels. Mamba-1 needs hardware-aware fused scans, and Mamba-2 matters because it moves more work onto batched matmuls. Fix: Profile real kernels, not equations. Check whether stack is using optimized scan or SSD kernels before claiming a serving win.
Symptom: Routine tokens wipe out earlier important facts. Cause: Model learned a poor selective update, so a large or weak read/write projections let unimportant tokens overwrite useful state. Fix: Evaluate long-range recall tasks directly. If exact recovery matters, add attention checkpoints or reduce burden on recurrent state.
Symptom: Decode memory looks great, but prompt latency is still too high. Cause: Fixed recurrent state only helps after state is built. Prefill still has to process whole prompt and can remain expensive. Fix: Separate prefill and decode in profiling. SSMs help most when long generation dominates or when context keeps growing over time.
Symptom: Architecture review lumps RWKV, linear attention, S4, Mamba, and Mamba-2 into one bucket. Cause: "Linear" became shorthand for several different mechanisms with different retrieval, kernel, and training tradeoffs. Fix: Name specific trick. For Mamba, key distinction is token-dependent , , and , not merely "non-quadratic scaling."
A pure transformer stack is risky because KV-cache growth scales with context length, so long decode can become a memory bottleneck quickly. A pure SSM removes context-length-dependent decode-state growth, but it may lose retrieval accuracy on rare important log lines. A hybrid is a candidate because it adds some attention access while reducing cache growth versus an attention-only stack; validate it with those log-retrieval cases.
Standard SSMs such as S4 use fixed, input-independent dynamics, so every token is processed with the same update rule regardless of content. That makes prompt-dependent routing, exact copying, and few-shot pattern matching harder than in attention models. Mamba improves this by making , , and input-dependent, but the full history is still compressed into a fixed-size state.
Inspect kernel path, not only asymptotic complexity. Mamba-1 depends on hardware-aware fused scans, so Python loops or weak kernels can erase theoretical gain. This is exactly why Mamba-2 matters: it reformulates more of work into tensor-core-friendly batched matrix multiplications.
Real advantage is constant-size recurrent state during decode, which keeps generation memory from growing with context length. That can improve long-context throughput and fit tighter edge budgets. It does not solve exact token retrieval automatically, and it does not make prompt ingestion free because prefill still has to process whole input.
Jamba's released configuration uses a 1:7 attention-to-Mamba ratio inside each block. That design reduces the number of KV-cached attention layers and reintroduces occasional explicit token access. Treat its reported results as evidence for that configuration; a production choice still needs throughput, memory, and recall-heavy evaluations.[4]
You now have a working mental model of how SSMs compress sequence history into fixed recurrent state, why Mamba's selective mechanism matters, and where hybrid architectures sit in the design space. The next article in the path is Reasoning & Test-Time Compute. It builds on these serving concepts: additional generated tokens alter decode memory, latency, batching, and architecture trade-offs.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Gu & Dao · 2023
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Dao & Gu · 2024
Mamba-3: Improved Sequence Modeling using State Space Principles
Gu, A., et al. · 2026
Jamba: A Hybrid Transformer-Mamba Language Model
AI21 Labs · 2024
Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models
NVIDIA · 2025
Efficiently Modeling Long Sequences with Structured State Spaces
Gu et al. · 2022 · ICLR 2022
HiPPO: Recurrent Memory with Optimal Polynomial Projections.
Gu, A., Dao, T., Ermon, S., Rudra, A., & Ré, C. · 2020 · NeurIPS 2020