Multi-Query & Grouped-Query Attention

Compare MHA, MQA, and GQA architectures, calculate their KV cache footprint, and reason about memory-limited serving tradeoffs.

Every active token consumes key-value (KV) cache memory during decode. Model architecture can shrink that cache before serving tricks like paging, scheduling, or quantization are applied.

Picture a transformer serving stack where every attention head keeps its own K/V projection. With five heads, the cache is small. With sixty-four heads and long contexts, KV memory grows quickly. One fix is to have groups of heads share one K/V projection instead of giving every query head its own.

That's the idea behind Grouped-Query Attention (GQA) and Multi-Query Attention (MQA). In decoder-side self-attention with standard multi-head attention (MHA), each query head has its own Key and Value projections.^{[1]Reference 1Attention Is All You Need.https://arxiv.org/abs/1706.03762} Incremental decode therefore keeps separate cached K/V state per head. As conversations get longer and you serve more users simultaneously, this cache can grow to hundreds of gigabytes. GQA and MQA reduce that dynamic memory cost by sharing cached data across heads. If KV memory is the admission bottleneck, that saving can increase concurrency; it isn't an automatic throughput or quality guarantee.

Why KV cache memory becomes the bottleneck

Every time a model generates a token, it saves the Key and Value vectors from that step so it doesn't have to recompute them later. This KV cache is essential for performance, but it grows linearly with sequence length, and in standard Multi-Head Attention (MHA) every head caches its own Keys and Values. For big models, this adds up fast.

A tiny example to build intuition

Before scaling up to billions of parameters, use a miniature case you can count on your fingers.

Suppose a single layer has 2 attention heads and each head has a dimension of 4. For one token, we must store a Key vector and a Value vector for every head. That's 2 heads times 2 vectors (K and V) times 4 numbers each:

Component	Count	Numbers stored
Key vectors	2 heads	2 x 4 = 8
Value vectors	2 heads	2 x 4 = 8
Total per token		16

If you serve 8 requests in parallel and each request reaches 2,048 tokens, the cache grows to 16 x 2,048 x 8 = 262,144 numbers for that one layer alone. Scale that to 80 layers and you get about 21 million numbers, even for this toy model. The problem isn't the math itself; it's the memory needed to hold all those numbers and the bandwidth needed to read them from High Bandwidth Memory (HBM) on every decoding step.

count-kv-elements.py

def kv_elements(layers: int, tokens: int, batch: int, kv_heads: int, head_dim: int) -> int:
    return 2 * layers * tokens * batch * kv_heads * head_dim

tiny_one_layer = kv_elements(layers=1, tokens=2048, batch=8, kv_heads=2, head_dim=4)
scaled_layers = kv_elements(layers=80, tokens=2048, batch=8, kv_heads=2, head_dim=4)
print("tiny one-layer elements:", tiny_one_layer)
print("with 80 layers:", scaled_layers)

Output

tiny one-layer elements: 262144
with 80 layers: 20971520

Compact MHA, GQA, and MQA K/V head-sharing comparison. — MHA keeps one K/V pair per query head. GQA and MQA share K/V heads to shrink decode cache.

In standard MHA, each of the $h$ attention heads maintains its own Key ( $K$ ) and Value ( $V$ ) projections. The attention computation for a single layer uses softmax over scaled query-key scores:

$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$

Where $Q \in \mathbb{R}^{h \times n \times d_k}$ , $K \in \mathbb{R}^{h \times n \times d_k}$ , and $V \in \mathbb{R}^{h \times n \times d_k}$ for a sequence of length $n$ . Each head learns a different attention pattern by attending to different aspects of the input. The KV cache stores $K$ and $V$ for all previously computed tokens so the model doesn't have to recompute them on each decoding step.
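To make the formula concrete, here is a minimal single-head sketch in plain Python. The vectors are made-up toy values and `single_head_attention` is an illustrative helper rather than a library function; real models run the same arithmetic as batched tensor kernels.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_attention(query, keys, values):
    """Return (weights, output) for one query vector over n cached positions."""
    d_k = len(query)
    # Scaled dot product of the query with every cached key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the cached value vectors.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output

# Toy inputs (assumed values): three cached positions, d_k = 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = single_head_attention([1.0, 0.0], keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output:", [round(x, 3) for x in output])
```

The keys and values lists are exactly what a KV cache holds for one head: one row per previously seen token.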

KV cache memory with standard MHA

For standard Multi-Head Attention (MHA) with $h$ heads, each with dimension $d_k$ , read the cache size formula term by term before you apply it:

2 counts the Key and Value tensors separately.
$h$ is the number of attention heads.
$d_k$ is the dimension inside each head.
bytes per element depends on the storage format: 2 bytes for FP16 (16-bit floating point), 1 byte for INT8 (8-bit integer), and so on.

Multiply those four numbers and you get the cache for a single token in a single layer:

$\text{KV cache per token per layer} = 2 \times h \times d_k \times \text{bytes per element}$

For the full cache, multiply by layers, sequence length, and batch size:

$\text{Total KV cache} = 2 \times L \times S \times B \times h \times d_k \times \text{bytes per element}$

Where $L$ is layers, $S$ is sequence length, and $B$ is batch size.

Common trap: $2 \times h \times d_k \times \text{bytes}$ is only the cache for one token in one layer. For a full serving estimate, you still need to multiply by layers, sequence length, and batch size.
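The two formulas can be kept apart in code so the trap is harder to fall into. This is a small sketch under assumed dimensions (32 KV heads of width 128, 32 layers, 2,048 tokens, batch 4); the shape is hypothetical and chosen only so the arithmetic is easy to check.

```python
def kv_bytes_per_token_per_layer(kv_heads: int, head_dim: int, bytes_per_element: int) -> int:
    # The leading 2 counts the Key and Value tensors separately.
    return 2 * kv_heads * head_dim * bytes_per_element

def total_kv_bytes(layers: int, seq_len: int, batch: int,
                   kv_heads: int, head_dim: int, bytes_per_element: int) -> int:
    # Full serving estimate: per-token cost times layers, tokens, and batch.
    per_token = kv_bytes_per_token_per_layer(kv_heads, head_dim, bytes_per_element)
    return layers * seq_len * batch * per_token

# Hypothetical shape for illustration only.
for label, width in (("FP16", 2), ("INT8", 1)):
    per_token = kv_bytes_per_token_per_layer(32, 128, width)
    total = total_kv_bytes(32, 2048, 4, 32, 128, width)
    print(f"{label}: {per_token} B per token per layer, {total / 1e9:.2f} GB total")
```

Halving the bytes per element halves both numbers, which is why the cache dtype belongs in the estimate alongside the head count.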

Concrete example (72B-style decoder dimensions)

Qwen2.5-72B^{[2]Reference 2Qwen2.5 Technical Reporthttps://arxiv.org/abs/2412.15115} is a useful reference point because it uses 80 layers, 64 query heads, and a head dimension of 128. The published model uses GQA-8, not full MHA, but if a model with those same dimensions used FP16 MHA, the KV cache would be:

Parameter	Value
$d_{\text{model}}$	8192
$h$ (heads)	64
$d_k$	128
Layers	80
Sequence length	4096
Batch size	32

$\text{KV cache} = 2 \times 80 \times 4096 \times 32 \times 64 \times 128 \times 2 \text{ bytes} \approx \mathbf{344 \text{ GB}}$

(Equivalent formulation using $d_{\text{model}} = h \times d_k = 8192$ : $2 \times 80 \times 4096 \times 32 \times 8192 \times 2$ bytes.)

$2$ counts K and V tensors, $80$ is layers, $4096$ is tokens per request, $32$ is batch size, $64$ is heads, $128$ is head dimension ( $d_k$ ), and the final $2$ is FP16 bytes per element.

That's more than twice the model weights themselves (~144 GB in FP16). For long-context or high-concurrency decoding, moving that cache through HBM can dominate the incremental attention path.^{[3]Reference 3Efficiently Scaling Transformer Inference.https://arxiv.org/abs/2211.05102}

The next figure uses the same 72B-style dimensions, but shifts the shape of the workload from 32 parallel 4K requests to one 128K request. Both cases contain 131,072 cached token positions, so full MHA lands in the same 344 GB range.

KV-cache memory comparison for a 72B-style decoder at 128K context: MHA stores 64 KV heads for about 344 GB, GQA-8 stores 8 KV heads for about 43 GB, and MQA stores 1 KV head for about 5.4 GB. — At long context, the stored KV-head count can decide whether one request fits at all. Query heads stay fixed here; the shrinking tanks show cache-only memory before weights and runtime overhead.

In model serving, this is the difference between a full-MHA cache that doesn't fit on a single 80 GB GPU and a GQA-8 cache around 43 GB before weights, runtime buffers, and fragmentation. Whether either configuration is viable still depends on weights, quantization, hardware layout, scheduler policy, and measured traffic.

compare-attention-kv-footprints.py

def kv_cache_gb(kv_heads: int, tokens: int = 4096, batch: int = 32) -> float:
    bytes_used = 2 * 80 * tokens * batch * kv_heads * 128 * 2
    return bytes_used / 1e9

for name, kv_heads in (("MHA", 64), ("GQA-8", 8), ("MQA", 1)):
    print(f"{name}: {kv_cache_gb(kv_heads):.2f} GB")

Output

MHA: 343.60 GB
GQA-8: 42.95 GB
MQA: 5.37 GB

Multi-Query Attention (MQA)

The extreme compression

Back to the K/V-sharing picture: MQA takes the extreme approach. Instead of every query head keeping its own K/V projection, all query heads share one K/V projection. This saves enormous storage, but some query heads might lose details that would have helped their specific work.

MQA^{[4]Reference 4Fast Transformer Decoding: One Write-Head is All You Need.https://arxiv.org/abs/1911.02150} uses one shared K head and one shared V head across all query heads:

$h$ separate query projections (same as standard MHA)
1 shared key projection for all $h$ query heads
1 shared value projection for all $h$ query heads

In the head-sharing figure above, MQA is the rightmost design: every query head routes through the same cached K/V pair.

How much memory MQA saves

Return to our tiny example: 2 heads, dimension 4, one token. Under MHA we stored 16 numbers. Under MQA we store only 1 Key and 1 Value head, so the count drops to $2 \times 1 \times 4 = 8$ numbers. That's a 2x reduction for 2 heads. At 64 heads the reduction becomes 64x.

The memory footprint of MQA is drastically smaller because the number of attention heads ( $h$ ) in the standard MHA formula is replaced by $1$ :

$\text{MQA KV cache per token per layer} = 2 \times 1 \times d_k \times \text{bytes per element}$

Where $1$ means a single shared KV head (instead of $h$ heads), so cache per token per layer scales with $d_k$ rather than with $h \cdot d_k$ .

Instead of storing separate K and V for each of the 64 heads, MQA stores just one shared K and one shared V, cutting the cache by 64x. All query heads read the same cached key-value pair.

The reduction factor equals the number of query heads: with $h$ query heads and a single KV head, the cache shrinks by a factor of $h$. The tiny 2-head example above gives a 2x reduction; the same rule gives 64x at 64 heads.

Method	KV heads	Cache per token per layer (FP16)	Savings
MHA	64	$2 \times 64 \times 128 \times 2 = 32\text{ KiB}$	1x
MQA	1	$2 \times 1 \times 128 \times 2 = 512\text{ B}$	64x

For the same 72B-style dimensions with batch=32 and seq=4096, MHA uses 344 GB of KV cache. MQA cuts that by 64x to ~5.4 GB.

Performance vs. quality tradeoffs in MQA

MQA can reduce quality because all heads share the same key-value representation. Query heads can still ask different questions, but they no longer get separate K/V subspaces.

Common mistake: Treating MQA quality loss as either zero or catastrophic. Real impact depends on model size and task. Shazeer's original paper motivates MQA as a serving optimization, and later GQA work shows why many larger models want more than one KV head.^{[4]Reference 4Fast Transformer Decoding: One Write-Head is All You Need.https://arxiv.org/abs/1911.02150}^{[5]Reference 5GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.https://arxiv.org/abs/2305.13245}

Return to the transformer serving stack: MQA gives every query head the same K/V representation. Each query head can still attend to a different aspect of the context (a separate query), but every answer starts from one shared K/V representation. That's efficient, but it can hide a detail that a specialized K/V projection would have preserved.

A minimal MQA implementation

This basic PyTorch implementation of Multi-Query Attention projects the input into multiple query heads, but uses only a single shared key and value projection across all query heads. The key and value tensors are broadcast across query heads during attention scoring to produce the final output.

a-minimal-mqa-implementation.py

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiQueryAttention(nn.Module):
    """
    Multi-Query Attention (MQA).
    Q gets h separate heads; K and V share a single head.
    """
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        # h separate query projections
        self.W_q = nn.Linear(d_model, d_model)
        # ONE shared key and value projection (output dim is d_k, not d_model)
        self.W_k = nn.Linear(d_model, self.d_k)
        self.W_v = nn.Linear(d_model, self.d_k)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
        B, N, D = x.shape

        # Q: [batch, seq, d_model] -> [batch, heads, seq, d_k]
        Q = self.W_q(x).view(B, N, self.n_heads, self.d_k).transpose(1, 2)

        # K and V: [batch, seq, d_k] -> [batch, 1, seq, d_k]
        # The singleton head dimension broadcasts across all query heads automatically
        K = self.W_k(x).unsqueeze(1)   # [B, 1, N, d_k]
        V = self.W_v(x).unsqueeze(1)   # [B, 1, N, d_k]

        # Attention scores: Q @ K^T
        # PyTorch broadcasts the singleton head dimension to match h
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # shape: [B, h, N, N]

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn = F.softmax(scores, dim=-1)

        # Weighted sum: attn @ V
        # V's singleton head dimension broadcasts to h
        out = torch.matmul(attn, V)     # [B, h, N, d_k]
        out = out.transpose(1, 2).contiguous().view(B, N, D)
        return self.W_o(out)

# Shape smoke test
mqa = MultiQueryAttention(d_model=64, n_heads=4)
x = torch.randn(2, 10, 64)  # batch=2, seq=10
out = mqa(x)
print("MQA output shape:", out.shape)

Output

MQA output shape: torch.Size([2, 10, 64])
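Sharing one K/V head also shrinks the K and V projection weights, which follows from the `nn.Linear(d_model, self.d_k)` layers in the class above. The sketch below counts those parameters for the same toy sizes; `linear_params` and `kv_projection_params` are illustrative helpers, and the saving applies to weights, which is a separate (and usually much smaller) effect than the cache saving.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    # Weight matrix plus optional bias vector, as in a standard linear layer.
    return in_features * out_features + (out_features if bias else 0)

def kv_projection_params(d_model: int, n_heads: int, n_kv_heads: int) -> int:
    d_k = d_model // n_heads
    # K and V each project d_model -> n_kv_heads * d_k.
    return 2 * linear_params(d_model, n_kv_heads * d_k)

mha_params = kv_projection_params(d_model=64, n_heads=4, n_kv_heads=4)
mqa_params = kv_projection_params(d_model=64, n_heads=4, n_kv_heads=1)
print("MHA K/V projection params:", mha_params)
print("MQA K/V projection params:", mqa_params)
print("ratio:", mha_params / mqa_params)
```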

Grouped-Query Attention (GQA)

The balanced compromise

MQA saves the most memory, but forcing all query heads to share one K/V pair can reduce quality on some workloads. The compromise is to split the query heads into groups, with each group sharing one K/V head. That's GQA: less aggressive than MQA (one K/V head for everyone), but more memory-efficient than MHA (one K/V head per query head).

GQA^{[5]Reference 5GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.https://arxiv.org/abs/2305.13245} is the compromise between MHA and MQA. Instead of 1 KV head (MQA) or $h$ KV heads (MHA), use $g$ KV groups where $1 < g < h$ :

$\frac{h}{g} = \text{queries per group}$

With $g$ groups and $h$ total query heads, each group of $h/g$ queries shares one K and one V. For example, Mistral 7B's 32 query heads with 8 KV groups means 4 queries share each KV pair. That's a 4:1 ratio: 4x less KV cache than MHA, but with more representational diversity than MQA's single KV pair.

GQA sits between the two extremes: instead of every query head having its own cached K/V (MHA) or all heads sharing one (MQA), heads are organized into groups that each share one K/V pair. Query heads within a group still compute different queries, but they read the same cached K/V data.

map-query-heads-to-kv-groups.py

def kv_group_for_query(query_head: int, query_heads: int, kv_heads: int) -> int:
    assert query_heads % kv_heads == 0
    return query_head // (query_heads // kv_heads)

assignments = [kv_group_for_query(head, query_heads=8, kv_heads=2) for head in range(8)]
print("query to KV group:", assignments)
print("cache reduction:", 8 // 2, "x")

Output

query to KV group: [0, 0, 0, 0, 1, 1, 1, 1]
cache reduction: 4 x

GQA in practice

Production model configurations make the pattern easier to compare:

Model	Query heads ( $h$ )	KV groups ( $g$ )	Query:KV ratio
Qwen2.5-72B^{[2]Reference 2Qwen2.5 Technical Reporthttps://arxiv.org/abs/2412.15115}	64	8	8:1
Llama 2 70B^{[6]Reference 6Llama 2: Open Foundation and Fine-Tuned Chat Models.https://arxiv.org/abs/2307.09288}	64	8	8:1
Mistral 7B^{[7]Reference 7Mistral 7B.https://arxiv.org/abs/2310.06825}	32	8	4:1
Gemma 2 9B^{[8]Reference 8Gemma 2: Improving Open Language Models at a Practical Sizehttps://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf}	16	8	2:1

Beyond memory savings, GQA interacts with tensor-parallel serving. Tensor parallelism splits one model's work across several accelerators. Each runtime has its own sharding rules. For example, vLLM first requires the total query-head count to divide evenly by the tensor-parallel (TP) degree. It divides KV heads across shards while that's possible, then replicates KV heads when TP exceeds the KV-head count so every shard still owns at least one.^{[9]Reference 9vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttentionhttps://github.com/vllm-project/vllm}

check-vllm-kv-head-sharding.py

def vllm_sharding_plan(query_heads: int, kv_heads: int, tensor_parallel: int) -> str:
    if query_heads % tensor_parallel != 0:
        return "invalid: query heads must divide evenly across TP shards"
    if tensor_parallel <= kv_heads:
        return f"even: {kv_heads // tensor_parallel} KV heads per shard"
    if tensor_parallel % kv_heads != 0:
        return "replicated: runtime-specific KV ownership"
    replicas_per_kv_head = tensor_parallel // kv_heads
    return f"replicated: each KV head appears on {replicas_per_kv_head} shards"

print("TP=4:", vllm_sharding_plan(query_heads=64, kv_heads=8, tensor_parallel=4))
print("TP=16:", vllm_sharding_plan(query_heads=64, kv_heads=8, tensor_parallel=16))

Output

TP=4: even: 2 KV heads per shard
TP=16: replicated: each KV head appears on 2 shards

Memory comparison (72B-style dimensions, FP16, seq=4096)

These are per-request KV cache sizes at the given sequence length:

Method	KV groups	KV cache per request	Quality posture
MHA	64	~10.7 GB	Reference architecture
GQA-8	8	~1.34 GB	Validate converted or trained model on task evals
MQA	1	~0.17 GB	Strongest sharing constraint; evaluate carefully

In Ainslie et al., intermediate group counts recovered much of MHA quality while retaining MQA-like inference benefits after uptraining.^{[5]Reference 5GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.https://arxiv.org/abs/2305.13245} The correct group count for another model remains an architecture and evaluation choice, not a guarantee inherited from that experiment.

Adoption across model families

Different model families have made distinct architectural choices based on their serving requirements and quality targets:

Architecture	Typical usage	Design Rationale
MHA	Older decoder designs, or deployments where KV memory is less constrained	Maximum per-head flexibility, highest KV-cache cost
MQA	Serving-first deployments that need the smallest possible KV cache	Aggressive memory reduction with the strongest sharing constraint
GQA	Published models such as Llama 2 34B/70B, Qwen2.5-72B, Mistral 7B, and Gemma 2 9B	Intermediate KV-head count; evaluate quality and serving together

Key observations

MQA is best understood as a serving optimization: shared KV heads cut cache size and reduce memory traffic during incremental decoding.^{[4]Reference 4Fast Transformer Decoding: One Write-Head is All You Need.https://arxiv.org/abs/1911.02150}^{[3]Reference 3Efficiently Scaling Transformer Inference.https://arxiv.org/abs/2211.05102}
Many modern decoder-only LLMs use grouped KV heads, but the choice is model-specific rather than family-wide. Published examples here include Llama 2 34B/70B, Mistral 7B, Qwen2.5-72B, and Gemma 2 9B.^{[6]Reference 6Llama 2: Open Foundation and Fine-Tuned Chat Models.https://arxiv.org/abs/2307.09288}^{[7]Reference 7Mistral 7B.https://arxiv.org/abs/2310.06825}^{[2]Reference 2Qwen2.5 Technical Reporthttps://arxiv.org/abs/2412.15115}^{[8]Reference 8Gemma 2: Improving Open Language Models at a Practical Sizehttps://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf}
There's no universal best group count. Model size, quality target, and serving stack decide whether 2:1, 4:1, or 8:1 is right.^{[5]Reference 5GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.https://arxiv.org/abs/2305.13245}

Converting MHA to GQA via uptraining

A key practical insight for teams adapting older models: you don't need to train a GQA architecture from scratch. You can convert an existing MHA model to GQA through a process called "uptraining."

MHA to GQA conversion: Converting MHA to GQA is like a company merger. Several departments each had their own filing system (KV heads). Instead of rebuilding from scratch, you merge departments into groups, average their files, and then give the new organization time to adapt through continued training. Quality must be evaluated after the conversion.

The conversion has two main steps:

Mean-pool KV heads: Group the existing Key and Value heads into the desired number of partitions (e.g. merging 64 heads into 8 groups of 8). Take the mean of the weights for the $K$ and $V$ projection matrices within each group. This retains information from the original MHA heads as a useful initialization, but it doesn't preserve the original attention computation exactly.
Fine-tune (uptrain): Train the converted model for a small fraction of the original pretraining budget using the standard next-token prediction objective. This allows the model to adapt its internal representations to the newly merged KV projections.

Compact GQA uptraining flow from MHA checkpoint to validation. — GQA uptraining mean-pools old K/V heads, trains briefly, then validates quality and serving metrics.

The original GQA paper^{[5]Reference 5GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.https://arxiv.org/abs/2305.13245} showed that this recipe reaches quality close to MHA with 5% of the original pretraining compute in its T5.1.1 encoder-decoder experiments. The paper also calls out an important limit: it didn't compare decoder-only models or a same-size GQA model trained from scratch. For a fresh pretraining run, teams usually bake the desired KV-head pattern into the architecture from day one. Uptraining matters when you inherit an older MHA checkpoint and want serving gains without restarting pretraining.

mean-pool-kv-heads-for-uptraining.py

def mean_pool_heads(heads: list[list[float]], group_size: int) -> list[list[float]]:
    assert len(heads) % group_size == 0
    pooled: list[list[float]] = []
    for start in range(0, len(heads), group_size):
        group = heads[start : start + group_size]
        pooled.append([
            sum(values) / len(group)
            for values in zip(*group)
        ])
    return pooled

mha_k_heads = [[1.0, 3.0], [3.0, 5.0], [10.0, 12.0], [14.0, 16.0]]
print("GQA initial K heads:", mean_pool_heads(mha_k_heads, group_size=2))

Output

GQA initial K heads: [[2.0, 4.0], [12.0, 14.0]]

Beyond GQA: Multi-Head Latent Attention (MLA)

GQA reduces the number of distinct KV heads stored in the cache. Multi-Head Latent Attention (MLA) takes a different route: it reduces the dimensionality of what must be stored per position by learning a low-rank latent representation.

How MLA works (high-level)

Instead of caching full per-head Key and Value content vectors for every token, the model does the following:

Input hidden states are projected into a compact latent vector of dimension $d_c$ .
Only this compact latent state (plus a small amount of decoupled positional information) is written into the KV cache.
The architecture defines learned up-projections from that latent state. In an optimized inference implementation, those projection matrices can be absorbed into the query and output paths so decode doesn't materialize full per-head cached K/V again.^{[10]Reference 10DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434}
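The absorption step rests on a simple identity: scoring a query against an up-projected key gives the same number as scoring a pre-transformed query against the cached latent. The sketch below checks that identity on tiny made-up matrices. It is a content-only toy under stated assumptions: one head, no values, no decoupled positional component, and hypothetical names (`W_down`, `W_up_k`) that are not DeepSeek-V2's actual parameters.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy sizes: hidden width 4, latent width d_c = 2, head width 3.
W_down = [[0.5, -1.0, 0.0, 2.0],
          [1.0, 0.0, 1.5, -0.5]]   # hidden -> latent
W_up_k = [[1.0, 2.0],
          [0.0, -1.0],
          [3.0, 0.5]]              # latent -> per-head key
hidden = [1.0, 2.0, -1.0, 0.5]
query = [0.5, 1.0, -2.0]

# Only this compact vector would be written to the cache.
latent = matvec(W_down, hidden)

# Naive path: rebuild the full key from the latent, then score.
naive_score = dot(query, matvec(W_up_k, latent))

# Absorbed path: fold the up-projection into the query, score against the latent.
absorbed_query = matvec(transpose(W_up_k), query)
absorbed_score = dot(absorbed_query, latent)

print("cached values per token:", len(latent))
print("naive:", naive_score, "absorbed:", absorbed_score)
```

Both paths produce the same score, but the absorbed path never materializes the full key, which is why the cache only needs the latent.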

DeepSeek-V2 also separates a positional RoPE component from compressed content so the cache can retain the required position-dependent term without preventing content compression.^{[10]Reference 10DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434}

DeepSeek-V2 reports a 93.3% reduction in deployed KV-cache memory footprint compared with DeepSeek 67B.^{[10]Reference 10DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434} That deployed comparison also includes KV-cache quantization, so don't treat 93.3% as an MLA-only head-count ratio. MLA's architectural payload depends on the chosen latent width $d_c$ , positional component, and execution path.

GQA versus MLA comparison separating an exact same-shape GQA cache ratio from DeepSeek-V2 paper-reported deployed cache savings versus DeepSeek 67B. — GQA head ratios are exact when model dimensions stay fixed. DeepSeek-V2's reported MLA result uses a different deployment comparison, so evaluate it separately.

Trade-offs and adoption

Aspect	GQA	MLA
Mechanism	Fewer KV heads ( $g < h$ )	Low-rank latent compression + up-projection
Kernel compatibility	Fits GQA-aware attention kernels	Needs an MLA-aware latent-cache implementation
Compression measure	KV-head ratio gives exact cache-factor comparison	Cached latent width and positional component determine savings
Published examples	Llama 2 70B, Qwen2.5-72B, Mistral 7B, Gemma 2 9B	DeepSeek-V2

Why the runtime distinction matters. GQA slots into grouped-attention implementations and familiar tensor-parallel strategies. MLA changes the cached representation and needs a compatible execution path. A smaller cache on paper is only useful when the selected runtime implements that path efficiently.

This architectural fork appears in production interviews: "When would you choose GQA over a more aggressive compression scheme like MLA?" The defensible answer is that GQA has a simpler K/V-head contract, while MLA can compress harder if the selected model and runtime both support its latent path.

DeepSeek-V2 documents a 512-dimensional compressed K/V latent and a 64-dimensional decoupled key component. That means MLA stores 576 values per layer and token before storage-precision choices. For a shape comparison, an illustrative GQA layout with 128-dimensional heads stores $2 \times g \times 128$ values, so MLA's 576-value payload sits at the same width as 2.25 GQA heads. This is an architecture-level payload comparison, not a claim that unrelated models will have the same latency or quality.^{[10]Reference 10DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Modelhttps://arxiv.org/abs/2405.04434}

compare-cached-state-widths.py

def gqa_cached_values_per_token(kv_heads: int, head_dim: int) -> int:
    return 2 * kv_heads * head_dim

gqa_8_values = gqa_cached_values_per_token(kv_heads=8, head_dim=128)
mla_example_values = 512 + 64  # compact content latent plus positional component
equivalent_gqa_heads = mla_example_values / (2 * 128)
print("GQA-8 cached values:", gqa_8_values)
print("MLA-style cached values:", mla_example_values)
print("MLA-style width in GQA-head units:", equivalent_gqa_heads)

Output

GQA-8 cached values: 2048
MLA-style cached values: 576
MLA-style width in GQA-head units: 2.25

Serving impact

Concurrency impact (72B-style dimensions, 4K context)

Reducing KV cache doesn't guarantee an equal throughput gain, but it does raise the memory-limited concurrency ceiling. Using the same 80-layer, 64-query-head, 128-dim-head example:

Memory-limited concurrency visual where MHA fits one active stream, GQA-8 fits about eight streams, and MQA raises the theoretical KV-cache-limited ceiling to about sixty-four streams before real serving overhead. — Head sharing raises the memory-limited batch ceiling when KV cache is the bottleneck. Treat the dots as a theoretical ceiling; kernels, weights, prefill, and scheduler policy decide realized throughput.

Config	KV Cache per Request	Relative Memory-Limited Concurrency Ceiling
MHA	~10.7 GB	1x
GQA-8	~1.34 GB	~8x
MQA	~0.17 GB	~64x

If KV cache is what limits batch size, those ratios are good first-order estimates. Real systems land below that theoretical ceiling because weights, scheduler overhead, kernel efficiency, and interconnect traffic still matter.^{[3]Reference 3Efficiently Scaling Transformer Inference.https://arxiv.org/abs/2211.05102}
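A ceiling like this can be estimated by dividing a KV memory budget by the per-request cache. The sketch below assumes a hypothetical 176 GB KV budget (for instance, 4x80 GB of raw accelerator memory minus roughly 144 GB of FP16 weights) and ignores runtime buffers and fragmentation, so treat the results as upper bounds rather than achievable batch sizes.

```python
def per_request_kv_bytes(kv_heads: int, tokens: int = 4096, layers: int = 80,
                         head_dim: int = 128, bytes_per_element: int = 2) -> int:
    # Same 72B-style dimensions as the table above, FP16 cache.
    return 2 * layers * tokens * kv_heads * head_dim * bytes_per_element

def max_concurrent_requests(kv_budget_bytes: int, kv_heads: int) -> int:
    # Integer division: a partially fitting request does not count.
    return kv_budget_bytes // per_request_kv_bytes(kv_heads)

KV_BUDGET_BYTES = 176 * 10**9  # assumption: memory left for KV cache after weights

ceilings = {
    name: max_concurrent_requests(KV_BUDGET_BYTES, kv_heads)
    for name, kv_heads in (("MHA", 64), ("GQA-8", 8), ("MQA", 1))
}
for name, ceiling in ceilings.items():
    print(f"{name}: at most {ceiling} concurrent 4K requests")
```

Because of the integer rounding, the ratios come out close to 8x and 64x rather than exactly equal to them.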

Long-context impact

KV savings become larger in absolute bytes with long-context models. For the same 72B-style dimensions, per-request cache grows linearly with sequence length:

Sequence Length	MHA KV Cache (approx.)	GQA-8 KV Cache (approx.)	Savings
4K	~10.7 GB	~1.34 GB	8x
32K	~85.9 GB	~10.7 GB	8x
128K	~343.6 GB	~42.9 GB	8x
1M	~2.75 TB	~343.6 GB	8x

At long context, head-sharing or another cache compression strategy becomes an explicit capacity decision. At 128K, full MHA uses roughly 344 GB of KV state per request in this example, before weights, allocator fragmentation, or extra concurrency. GQA-8 drops that cache to about 43 GB, which is still expensive and still needs an end-to-end memory budget.

budget-long-context-cache.py

def kv_cache_gb(kv_heads: int, tokens: int) -> float:
    return 2 * 80 * tokens * kv_heads * 128 * 2 / 1e9

mha_128k = kv_cache_gb(kv_heads=64, tokens=131_072)
gqa_128k = kv_cache_gb(kv_heads=8, tokens=131_072)
print(f"MHA 128K cache: {mha_128k:.1f} GB")
print(f"GQA-8 128K cache: {gqa_128k:.1f} GB")
print("GQA cache plus 144 GB weights fits in 4x80 GB raw:", gqa_128k + 144 <= 320)

Output

MHA 128K cache: 343.6 GB
GQA-8 128K cache: 42.9 GB
GQA cache plus 144 GB weights fits in 4x80 GB raw: True

Common mistake: Treating long-context support as an architectural context-window claim rather than a serving budget. A long support history, codebase prompt, or retrieved document bundle can consume the same KV allocation. Calculate active-token memory before promising concurrency.

How serving engines handle GQA head expansion

When building or working with inference engines, you must handle the KV head expansion efficiently. This conceptual implementation shows how an engine matches the smaller number of KV heads to the larger number of query heads before computing attention.

how-serving-engines-handle-gqa-head.py

import math
import torch
import torch.nn.functional as F

def gqa_attention_reference(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor,
                            n_heads: int, n_kv_heads: int) -> torch.Tensor:
    """
    Readable GQA implementation concept.

    Args:
        Q: [batch, n_heads, seq, d_k]
        K: [batch, n_kv_heads, seq, d_k]
        V: [batch, n_kv_heads, seq, d_k]
        n_heads: Number of query heads
        n_kv_heads: Number of KV heads
    """
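    # Expand KV heads to match Q heads.
    # This materializes repeated K/V for clarity. Production serving kernels
    # compute grouped attention without building these larger tensors.
    # No causal mask is applied here; this reference only illustrates shapes.
    assert n_heads % n_kv_heads == 0, "query heads must divide evenly into KV groups"
    heads_per_group = n_heads // n_kv_heads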

    # [batch, n_kv_heads, seq, d_k] -> [batch, n_kv_heads * group_size, seq, d_k]
    K_expanded = K.repeat_interleave(heads_per_group, dim=1)
    V_expanded = V.repeat_interleave(heads_per_group, dim=1)

    # Standard attention from here
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K_expanded.transpose(-2, -1)) / math.sqrt(d_k)
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, V_expanded)

# Shape smoke test
B, N, h, h_kv, d = 1, 8, 32, 8, 64
Q = torch.randn(B, h, N, d)
K = torch.randn(B, h_kv, N, d)
V = torch.randn(B, h_kv, N, d)
out = gqa_attention_reference(Q, K, V, h, h_kv)
print("GQA output shape:", out.shape)

Output

GQA output shape: torch.Size([1, 32, 8, 64])

Modern GQA-aware kernels, including FlashAttention's current flash_attn_with_kvcache path and FlashInfer^{[11]Reference 11FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.https://arxiv.org/abs/2501.01005}, handle fewer KV heads without materializing the expanded tensors. This avoids the memory overhead of repeat_interleave and computes grouped attention directly in the fused kernel. The next chapter explains the serving-side cache layout; FlashAttention^{[12]Reference 12FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.https://arxiv.org/abs/2307.08691} gets its own kernel-level chapter after prefix caching.

Production tip: When evaluating models for serving, total parameter count isn't enough. num_kv_heads, context length, and KV-cache dtype often dominate concurrency. GQA-8 raises the memory-limited concurrency ceiling by roughly 8x versus same-dimension MHA, but realized cost per query still depends on batching, quantization, and kernels.
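One way to apply this tip is to compute KV bytes per token straight from a model's configuration. The sketch below assumes Hugging Face-style field names (`num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `hidden_size`); those names and the fallback rules are assumptions, so check the actual config of the model being evaluated.

```python
def kv_bytes_per_token(config: dict, bytes_per_element: int = 2) -> int:
    heads = config["num_attention_heads"]
    # Assumption: a missing KV-head field means full MHA.
    kv_heads = config.get("num_key_value_heads", heads)
    # Assumption: head width defaults to hidden_size / query heads.
    head_dim = config.get("head_dim", config["hidden_size"] // heads)
    return 2 * config["num_hidden_layers"] * kv_heads * head_dim * bytes_per_element

# Illustrative config using the 72B-style dimensions from this lesson.
gqa_config = {
    "hidden_size": 8192,
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
}
mha_config = {k: v for k, v in gqa_config.items() if k != "num_key_value_heads"}

print("GQA-8 KV bytes per token:", kv_bytes_per_token(gqa_config))
print("MHA KV bytes per token:", kv_bytes_per_token(mha_config))
```

Multiplying the per-token figure by the expected number of active tokens gives a quick cache budget before any benchmarking.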

KV cache update correctness

In serving code, the hardest GQA bug is often not the attention formula. It's updating the cache correctly one token at a time.

Figure: GQA cache update invariant. Write narrow KV heads, read with all query heads, then prove incremental decode matches full-prefix recompute.

For one decode step, the runtime should:

Project only the new token into Q, K, and V.
Apply RoPE or another positional transform using the absolute position of that token.
Append the new K/V slice at the next cache position.
Run attention with Q from the new token against all cached K/V positions.
Track cache length per request, because different requests finish at different times.

The cache shape should use KV heads, not query heads:

text

K_cache: (batch, n_kv_heads, max_seq, d_k)
V_cache: (batch, n_kv_heads, max_seq, d_k)

In an MHA model, n_kv_heads == n_heads. MQA sets n_kv_heads == 1. GQA sits between them with 1 < n_kv_heads < n_heads.

enforce-cache-append-cursor.py

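def append_kv(cache: list[str], cursor: int, token_value: str) -> int:
    """Append one token's K/V entry and return the next write position."""
    # The cursor must equal the current length. A smaller value would target
    # an already-filled position; a larger one would leave a gap.
    if cursor != len(cache):
        raise ValueError("cursor would overwrite or skip cache state")
    cache.append(token_value)
    return cursor + 1

cache: list[str] = []
cursor = append_kv(cache, cursor=0, token_value="token-0-kv")
cursor = append_kv(cache, cursor=cursor, token_value="token-1-kv")
print("cache length:", len(cache))
try:
    # Stale cursor: position 1 is already filled, so the write is rejected.
    append_kv(cache, cursor=1, token_value="stale-write")
except ValueError as exc:
    print("blocked:", exc)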

Output

cache length: 2
blocked: cursor would overwrite or skip cache state

Common cache-update bugs look like this:

Symptom	Likely bug	Check
answer quality degrades after a few tokens	overwrote position `t - 1` instead of appending at `t`	print cache length after every decode step
works at batch size 1, fails under batching	used one global cache length for all requests	store per-request positions
GQA memory is still huge	allocated cache with query heads instead of KV heads	assert `K_cache.size(1) == n_kv_heads`
long-context output becomes incoherent	applied RoPE with local chunk position instead of absolute position	log the position id used for each appended token

That's why cache tests should compare a cached decode path against a full-prefix recompute path on the same tiny prompt. The next-token logits (raw scores before probabilities) should match closely. If they don't, the bug is usually position IDs, mask shape, or cache append order.
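The parity check described above can be sketched at the level of a single attention layer. This is a toy under stated assumptions, not a model test: the projection matrices are random stand-ins, there is no RoPE and no LM head, so it compares attention outputs rather than logits. All names (`attend`, `split`, `Wq`, `Wk`, `Wv`) are hypothetical.

```python
import math
import torch

torch.manual_seed(0)
n_heads, n_kv_heads, d_k, d_model, T = 8, 2, 16, 64, 12
group = n_heads // n_kv_heads

# Random projections standing in for one attention layer (assumption).
Wq = torch.randn(d_model, n_heads * d_k) / math.sqrt(d_model)
Wk = torch.randn(d_model, n_kv_heads * d_k) / math.sqrt(d_model)
Wv = torch.randn(d_model, n_kv_heads * d_k) / math.sqrt(d_model)
x = torch.randn(T, d_model)  # one request, T token embeddings

def split(t, heads):
    """[T, heads * d_k] -> [heads, T, d_k]."""
    return t.view(t.size(0), heads, d_k).transpose(0, 1)

def attend(q, k, v, causal):
    """q: [n_heads, Tq, d_k]; k, v: [n_kv_heads, Tk, d_k]."""
    k = k.repeat_interleave(group, dim=0)  # reference-style expansion
    v = v.repeat_interleave(group, dim=0)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if causal:
        Tq, Tk = scores.shape[-2:]
        mask = torch.ones(Tq, Tk, dtype=torch.bool).tril()
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Path 1: full-prefix recompute with a causal mask.
full = attend(split(x @ Wq, n_heads), split(x @ Wk, n_kv_heads),
              split(x @ Wv, n_kv_heads), causal=True)

# Path 2: incremental decode with a cache sized by KV heads.
K_cache = torch.zeros(n_kv_heads, T, d_k)
V_cache = torch.zeros(n_kv_heads, T, d_k)
max_err = 0.0
for t in range(T):
    xt = x[t : t + 1]
    K_cache[:, t] = split(xt @ Wk, n_kv_heads)[:, 0]  # append at position t
    V_cache[:, t] = split(xt @ Wv, n_kv_heads)[:, 0]
    q = split(xt @ Wq, n_heads)
    step = attend(q, K_cache[:, : t + 1], V_cache[:, : t + 1], causal=False)
    max_err = max(max_err, (step[:, 0] - full[:, t]).abs().max().item())

print("K_cache heads:", K_cache.size(0))
print("max abs difference:", max_err)
```

Writing to `t - 1` instead of `t`, or slicing the cache to `t` instead of `t + 1`, makes the difference jump well above floating-point noise, which is the signal a parity test is meant to catch.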

What to check before moving on

By now, you should be able to reason from the architecture diagram to serving cost instead of only naming the attention variant.

Skill	What a strong answer includes
KV-cache sizing	Full cache math multiplies K/V tensors, layers, sequence length, batch size, KV heads, head dimension, and bytes per element.
MHA vs MQA vs GQA	MHA stores one K/V pair per query head, MQA stores one shared K/V pair, and GQA stores an intermediate number of K/V groups.
Memory reduction factor	Savings follow `num_query_heads / num_key_value_heads` for the KV cache, such as 4x for Mistral 7B and 8x for Qwen2.5-72B.
Quality tradeoff	Fewer KV heads reduce memory traffic, but they also reduce K/V representational capacity.
Uptraining	Existing MHA checkpoints can be converted by mean-pooling K/V heads and continuing language-model training for a small compute budget.
Serving fit	GQA appears in the cited model architectures and can raise memory-limited concurrency when the runtime supports its head layout efficiently.
GQA vs MLA choice	GQA shrinks the number of full K/V heads. MLA stores a compact latent and needs an MLA-aware inference path that can avoid full cached K/V materialization. The right choice depends on runtime support and cache pressure.
Memory win vs throughput win	KV-cache savings can raise the memory-limited batch ceiling without translating into the same wall-clock speedup if compute, prefill, or scheduler overhead still dominates.

Follow-up questions

A 70B-style chat model must serve 128K-token context. What memory story do you defend?

For 80 layers, 64 query heads, head dimension 128, FP16 cache, and full MHA, the per-token-per-layer cache is 2 * 64 * 128 * 2 = 32,768 bytes (32 KiB). Across all 80 layers, that's 2.5 MiB per token. At 128K context (131,072 tokens), one request needs 320 GiB, or about 344 GB, of KV cache under full MHA. GQA-8 cuts that by 64 / 8 = 8x to 40 GiB, or about 43 GB, per request. That's much better, but it still isn't a deployment proof: add roughly 140 to 144 GB of FP16 weights for a 70B to 72B model, plus runtime buffers, fragmentation, and any additional active requests. Strong answers mention head sharing plus paging, scheduling, retrieval, or KV-cache quantization.
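The arithmetic in this answer can be reproduced with a few lines of Python. This is a small sketch of the sizing formula only; the helper name `kv_cache_bytes` is hypothetical, and it ignores weights, runtime buffers, and fragmentation.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, batch=1, bytes_per_elem=2):
    """KV-cache bytes: 2 tensors (K and V) per layer, per token, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens * batch

tokens = 128 * 1024  # 131,072 tokens of context
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, tokens=tokens)
gqa8 = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=tokens)

for name, size in [("MHA", mha), ("GQA-8", gqa8)]:
    print(f"{name}: {size / 2**30:.0f} GiB = {size / 1e9:.1f} GB")
# MHA: 320 GiB = 343.6 GB
# GQA-8: 40 GiB = 42.9 GB
```

Printing both binary (GiB) and decimal (GB) units helps avoid the roughly 7% mismatch that appears when the two are mixed in a capacity plan.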

A team wants GQA but can't afford full retraining. What rollout plan makes sense?

Use uptraining. Group old K/V heads, mean-pool each group into one new K head and one new V head, then continue language-model training for a small fraction of original pretraining budget so the model adapts to reduced K/V capacity. After that, validate both task quality and serving behavior: perplexity or downstream evals, KV-cache size, tensor-parallel sharding fit, and decode throughput under realistic batch and context settings. Ainslie et al. used this path in T5.1.1 encoder-decoder experiments; apply the recipe to decoder-only models as an engineering choice that still needs direct evaluation.^{[5]Reference 5GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.https://arxiv.org/abs/2305.13245}
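The mean-pooling initialization can be sketched as a reshape and an average. This is a minimal illustration, assuming the K or V projection weight is laid out head-major along dim 0 (the `torch.nn.Linear.weight` convention) and that consecutive heads form a group; real checkpoints may order heads differently. The helper name `mean_pool_kv_heads` is hypothetical, and the continued-training step is not shown.

```python
import torch

def mean_pool_kv_heads(W, n_heads, n_kv_heads):
    """Pool a K or V projection from n_heads down to n_kv_heads.

    W: [n_heads * d_k, d_model], head-major along dim 0 (assumed layout).
    Returns: [n_kv_heads * d_k, d_model].
    """
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    group = n_heads // n_kv_heads
    d_k = W.size(0) // n_heads
    # [n_kv_heads, group, d_k, d_model]: consecutive heads form one group.
    grouped = W.reshape(n_kv_heads, group, d_k, W.size(1))
    # Average the heads inside each group into one shared head.
    return grouped.mean(dim=1).reshape(n_kv_heads * d_k, W.size(1))

torch.manual_seed(0)
n_heads, n_kv_heads, d_k, d_model = 8, 2, 4, 16
W_k = torch.randn(n_heads * d_k, d_model)
W_k_gqa = mean_pool_kv_heads(W_k, n_heads, n_kv_heads)
print("pooled K projection:", tuple(W_k_gqa.shape))

# The first pooled head should equal the mean of original heads 0..3.
expected_first = W_k[: 4 * d_k].reshape(4, d_k, d_model).mean(dim=0)
print("first group matches:", torch.allclose(W_k_gqa[:d_k], expected_first))
```

The same function would be applied to the V projection. Query and output projections keep their original shapes.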

`num_attention_heads = 64`, `num_key_value_heads = 8`, and vLLM tensor parallel degree is 16. Why does replication appear?

The 64 query heads split evenly across 16 shards (4 per shard), but only 8 KV heads exist. vLLM keeps at least one KV head on every shard, so each KV head is stored on two shards and the cluster holds 16 KV-head copies instead of 8. That duplication cuts into the physical cache saving you expected from GQA. Check the selected runtime's sharding rules instead of inferring deployment memory from architecture alone.^{[9]Reference 9vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttentionhttps://github.com/vllm-project/vllm}
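The replication arithmetic can be modeled in a few lines. This is a simplified sketch of the rule described in this answer, not vLLM's actual code; the helper name `kv_heads_per_shard` and its divisibility checks are assumptions made for illustration.

```python
def kv_heads_per_shard(n_kv_heads, tp_degree):
    """Return (KV heads stored per shard, copies of each KV head).

    Simplified model: every shard holds at least one KV head, so KV heads
    are replicated once the TP degree exceeds the KV-head count.
    """
    if tp_degree <= n_kv_heads:
        if n_kv_heads % tp_degree != 0:
            raise ValueError("n_kv_heads must be divisible by tp_degree")
        return n_kv_heads // tp_degree, 1
    if tp_degree % n_kv_heads != 0:
        raise ValueError("tp_degree must be a multiple of n_kv_heads")
    return 1, tp_degree // n_kv_heads

for tp in (4, 8, 16):
    per_shard, copies = kv_heads_per_shard(n_kv_heads=8, tp_degree=tp)
    total = per_shard * tp
    print(f"TP={tp}: {per_shard} KV head(s) per shard, "
          f"{copies} copy/copies of each, {total} stored in total")
# TP=4: 2 KV head(s) per shard, 1 copy/copies of each, 8 stored in total
# TP=8: 1 KV head(s) per shard, 1 copy/copies of each, 8 stored in total
# TP=16: 1 KV head(s) per shard, 2 copy/copies of each, 16 stored in total
```

Under this model, going from TP=8 to TP=16 leaves per-GPU KV cache unchanged rather than halving it, which is where the expected saving goes missing.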

Your runtime offers both GQA and MLA paths today. How do you choose between them?

Choose GQA when your selected runtime supports grouped heads and its sharding layout fits the model. Choose MLA when your model is MLA-native, your stack ships its latent-cache path, and cache pressure is dominant enough to justify that execution contract.

Cached decode diverges from full recompute after 200 tokens. What do you test first?

Check that cache shape uses n_kv_heads, not query heads, and that absolute positions, cache cursors, and append order all stay correct. The quickest proof is a tiny-prompt parity test: cached decode logits should match full-prefix recompute closely at each step.

Common pitfalls

"Model weights are the same thing as KV cache"

Symptom: You calculate that a quantized model fits in GPU memory, then the server still runs out of memory under long context or high concurrency.
Cause: Weights are static model parameters. The KV cache is dynamic per-request state. GQA mainly shrinks dynamic K/V activations, not the feed-forward layers or the full weight tensor.
Fix: Budget weights and KV cache separately. Then add activation buffers, allocator slack, scheduler state, and fragmentation.

"GQA speeds up training"

Symptom: You benchmark training throughput and see almost no change after switching from MHA to GQA.
Cause: Training often has a different bottleneck from incremental decoding. GQA reduces decode cache state, while training processes whole sequences and may remain dominated by large attention and feed-forward computations.
Fix: Measure decode throughput (tokens per second during autoregressive generation), not training throughput. GQA's wins show up when you're repeatedly loading the KV cache for each new token, not when you're computing the full attention matrix once.

"MQA always destroys quality"

Symptom: You reject MQA outright even for small, latency-sensitive models where the quality difference isn't visible on your task.
Cause: MQA imposes the strongest sharing constraint, but impact depends on model size, task, and training recipe. Treating it as always catastrophic is as wrong as treating it as free.
Fix: Compare task evals and serving metrics for your workload. Use GQA when you need a safer quality/serving compromise.

"I saved 8x on KV cache, so my serving cost dropped 8x"

Symptom: You calculate a huge KV cache reduction, but measured latency or cost only improves by a fraction of that.
Cause: KV cache is one piece of the puzzle. Model weights, feed-forward network (FFN) compute, attention arithmetic, scheduler overhead, and interconnect traffic still matter. If your batch was previously limited by compute rather than memory, shrinking the cache won't move the needle as much.
Fix: Profile end-to-end latency with a realistic batch size. Use a tool like NVIDIA Nsight Systems or vLLM's built-in metrics to see whether you're memory-bound or compute-bound before betting on GQA alone.

"repeat_interleave exploded my memory during GQA training"

Symptom: Out-of-memory errors when you naively expand KV heads with repeat_interleave inside a training loop.
Cause: repeat_interleave materializes a larger tensor in memory. For 32 query heads and 8 KV groups, that temporarily creates a 4x larger K/V tensor before the matmul.
Fix: Use FlashAttention or FlashInfer, which handle GQA natively without materializing the expanded tensor. If you must write a custom kernel, use an implicit broadcast or fused kernel rather than explicit expansion.

"I can regroup heads any way I want during GQA conversion"

Symptom: The converted checkpoint loads, but quality drops sharply or tensor-parallel shards disagree after serving rollout.
Cause: Head grouping isn't arbitrary. Query and KV heads often follow a fixed ordering in the weight matrices, and tensor-parallel sharding can interleave or chunk that order in specific ways. Regrouping without respecting the original layout silently changes which heads share a K/V projection.
Fix: Preserve the model's published head ordering during regrouping, then run a tiny parity test and shard-level smoke test before broader evals. Don't assume heads 0-7 are always the intended first group.

"Architecture-level KV savings always survive tensor parallelism"

Symptom: Tensor-parallel serving uses more memory than your head-ratio estimate predicted.
Cause: Physical cache layout is runtime-specific. In vLLM, query heads must divide evenly across TP shards, and KV heads are replicated when TP degree exceeds KV-head count so each shard owns at least one.
Fix: Check model config and runtime sharding rules before choosing TP degree. Benchmark the physical layout you intend to deploy.^{[9]Reference 9vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttentionhttps://github.com/vllm-project/vllm}

A quick check

Try this without looking back at the tables:

A model has 40 layers, 32 query heads, 8 KV groups, head dimension 128, and FP16 KV-cache elements. You serve a batch of 8 requests, each at 2,048 tokens. How large is the KV cache under GQA? Show your work.

Solution sketch

Per token per layer: 2 (K and V) x 8 (KV groups) x 128 (head dim) x 2 (FP16 bytes) = 4,096 bytes.
Per request: 2,048 tokens x 40 layers x 4,096 bytes = 335,544,320 bytes ≈ 320 MiB.
Batch of 8: 8 x 320 MiB ≈ 2.5 GiB.

Compare to MHA: 32 KV heads instead of 8 gives a 4x larger cache, so the same batch would need roughly 10 GiB just for the KV cache.

What to remember and where to go next

KV Cache Bottleneck: For long contexts or high concurrency, KV cache can consume more memory than model weights.
MQA vs GQA: MQA uses 1 KV head (largest cache saving and strongest sharing constraint). GQA uses $g$ KV heads to trade cache size against representational capacity; measure quality.
Savings follow the head ratio: KV cache savings are num_query_heads / num_kv_heads. A 64-query-head model with 8 KV heads gets 8x savings; a 32-query-head model with 8 KV heads gets 4x.
Uptraining: You can convert MHA checkpoints to GQA by mean-pooling KV heads and continuing training for a small compute budget, then validating quality and serving behavior.
Serving Support: Modern GQA-aware kernels such as FlashAttention's current flash_attn_with_kvcache path and FlashInfer^{[11]Reference 11FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.https://arxiv.org/abs/2501.01005} avoid explicit K/V expansion in the hot path.

Practice drill

Create a GQA deployment worksheet for one candidate model:

Calculate KV-cache bytes for MHA, MQA, and at least two GQA group sizes at the target context length and batch size.
Add a parity test plan that compares cached decode logits against full-prefix recompute for short and long prompts.
Record kernel support: whether the runtime reads grouped KV heads directly or materializes repeated K/V tensors.
Choose a quality check that must pass before a lower-KV-head variant replaces the baseline.

The worksheet should make cache savings, runtime support, and quality risk visible in one review.

Next Step

Continue to KV Cache & PagedAttention

There, you'll understand KV cache storage strategies for multi-tenant LLM inference, including PagedAttention, memory fragmentation, and vLLM architecture.

References

[1] Vaswani, A., et al. (2017). Attention Is All You Need.

[2] Qwen Team (2024). Qwen2.5 Technical Report.

[3] Pope, R., et al. (2023). Efficiently Scaling Transformer Inference. arXiv preprint.

[4] Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv preprint.

[5] Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023.

[6] Touvron, H., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint.

[7] Jiang, A. Q., et al. (2023). Mistral 7B.

[8] Gemma Team, Google DeepMind (2024). Gemma 2: Improving Open Language Models at a Practical Size.

[9] vLLM Team (2024). vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention.

[10] DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.

[11] Ye, Z., et al. (2025). FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.

[12] Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. ICLR 2024.
