Calculate KV cache capacity, trace paged block allocation, and separate memory packing from prefix reuse and scheduling tradeoffs.
The previous chapter shrank each token's key-value (KV) footprint with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). This chapter keeps that smaller cache from wasting memory as variable-length requests enter and leave a serving batch.
When a commerce assistant answers a shopper's question, it has to remember the entire conversation: cart history, product details, refund policy, delivery ETA, and support messages. Each new token the model generates would require re-reading that whole history from scratch if it didn't have a way to remember what it already computed.
That memory is called the KV cache (Key-Value cache). It stores the intermediate results the model needs to attend back to earlier tokens without recomputing them. The cache makes generation fast, but it creates a hard engineering problem: it consumes a huge amount of expensive GPU memory, and that consumption grows with every token and every simultaneous shopper.
Managing this memory efficiently is central to LLM serving. PagedAttention, introduced with vLLM, borrows ideas from operating-system paging to reduce fragmentation and pack active request state more tightly.[1]
This article builds on two ideas from earlier in the curriculum. First, Transformer self-attention: the model computes Query, Key, and Value vectors to decide which earlier tokens matter for the current token.[2] Second, autoregressive generation: the model produces one token at a time, and each new token can attend to every token that came before it. If those prerequisites are fresh, the rest of this article follows naturally.
Without a KV cache, generating each new token is like re-reading the entire conversation from scratch every time you want to say one more sentence. The KV cache is like keeping a running notebook: each time you process a token, you jot down a summary (Key) and the relevant content (Value). When the next token needs to reference earlier context, you flip through the notebook instead of re-reading everything. The notebook grows with each token, which is why memory management matters so much.
In a Transformer model, generating the next token requires attending to all previous tokens in the sequence.[2] The attention mechanism for a query token at step is defined as:
To understand this formula: the current token's query is compared against all previous keys (via dot product, scaled by ). The softmax function then determines which previous tokens are most relevant, and those tokens' values are blended accordingly. Without caching, the keys and values would need to be recomputed from scratch at every step.
Here and represent the Key and Value matrices for all tokens from position 1 to .
Without caching, generating token would require recomputing the Key and Value vectors for all previous tokens at every single step. That means the cost of generating each new token grows linearly with sequence length, giving a quadratic total cost for the full sequence. The KV cache eliminates the repeated K/V projection work, but the attention operation still has to read the full cached history for each new token.
The KV cache solves this by storing the Key () and Value () vectors for all previous tokens in GPU memory. At step , we only need to compute for the current token, and append to the cache.
With the KV cache, the decode phase (generating one new token at a time) still scales linearly per step in sequence length because each new token attends over all cached tokens. As cache reads grow, long-context decode can become memory-bandwidth-sensitive even after caching.[1]
The size of the KV cache isn't trivial. To see why, let's count the bytes for a concrete model first, then write the general formula.
| Example architecture | Layers | KV Heads | Cache/token | 4K Context | 32K Context | |
|---|---|---|---|---|---|---|
| 32-layer, 8-KV-head GQA | 32 | 8 | 128 | 128 KiB | 512 MiB | 4 GiB |
| 80-layer, 8-KV-head GQA | 80 | 8 | 128 | 320 KiB | 1.25 GiB | 10 GiB |
| 96-layer, 96-KV-head MHA | 96 | 96 | 128 | 4.5 MiB | 18 GiB | 144 GiB |
For the 80-layer, 8-KV-head GQA example, a single 4K-context request already consumes 1.25 GiB of KV cache. A batch of 128 such requests would need 160 GiB of High Bandwidth Memory (HBM), before weights and runtime buffers. That is beyond a single 80 GB-class accelerator, so memory capacity can cap throughput before raw FLOPs do.
If you plug in 80 layers, 64 KV heads, head dimension 128, and FP16 precision, you get 2.5 MiB per token. Keeping the layers, head dimension, and precision fixed but reducing the KV heads from 64 to 8 drops that to 320 KiB per token. At 4K context, that is the difference between 10 GiB and 1.25 GiB of cache for a single request.
Where do these numbers come from? The memory required per token follows a simple rule:
To interpret this formula: for each token, we store a Key and Value vector (hence the "2") across every layer and every KV head. Each vector has dimensions, and is the precision size. For plain Multi-Head Attention (MHA), . For Grouped-Query Attention (GQA) or Multi-Query Attention (MQA), the number of KV heads is smaller, which directly shrinks the cache.[3][4]
Where:
This is the tensor-payload calculation. A deployed engine may also reserve quantization scales, block tables, temporary workspaces, communication buffers, graph memory, and allocator headroom. Use the formula to reason about slopes, then profile the runtime before setting admission limits.
1def cache_gib(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int, tokens: int) -> float:
2 bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
3 return bytes_per_token * tokens / (1024 ** 3)
4
5for label, heads in [("MHA-64", 64), ("GQA-8", 8), ("MQA-1", 1)]:
6 size = cache_gib(layers=80, kv_heads=heads, head_dim=128, bytes_per_value=2, tokens=4096)
7 print(f"{label}: {size:.2f} GiB at 4K tokens")1MHA-64: 10.00 GiB at 4K tokens
2GQA-8: 1.25 GiB at 4K tokens
3MQA-1: 0.16 GiB at 4K tokensIn a naive serving system, memory for the KV cache is allocated as a single, contiguous buffer for each request. Because the system can't know ahead of time how many tokens the model will ultimately generate before stopping, it often reserves enough memory to accommodate the configured maximum possible sequence length (for instance, 4096 tokens).
This pessimistic allocation strategy leads to massive inefficiencies due to three distinct forms of memory fragmentation:
In the vLLM paper's evaluation traces, existing contiguous allocators used only 20.4-38.2% of KV memory to hold live token states, which means the rest was lost to fragmentation and over-reservation.[1] That directly limits the maximum batch size and throttles throughput.
1bytes_per_token = 320 * 1024
2max_tokens = 8192
3live_tokens = [600, 1250, 4000, 128]
4
5reserved = len(live_tokens) * max_tokens * bytes_per_token
6live = sum(live_tokens) * bytes_per_token
7waste_fraction = 1 - live / reserved
8
9print(f"reserved: {reserved / 1024**3:.2f} GiB")
10print(f"live state: {live / 1024**3:.2f} GiB")
11print(f"unused reservation: {waste_fraction:.1%}")1reserved: 10.00 GiB
2live state: 1.82 GiB
3unused reservation: 81.8%PagedAttention adapts the concept of paging from operating systems to LLM serving.[1] Instead of contiguous buffers, the KV cache is divided into fixed-size blocks. In the original vLLM paper, each block holds 16 tokens.[1]
Naive KV cache allocation is like reserving a full pallet position for every possible package size, even when most orders only need a small bin. Most reserved space goes unused. PagedAttention is like a shared bin system that assigns fixed-size slots only as requests need them, and those slots don't have to sit next to each other. When a request finishes, its slots are immediately available for the next request. Waste is mostly limited to each request's final partially filled block.
This allows the system to allocate memory dynamically, one block at a time, only when needed.
| Metric in vLLM evaluation | Contiguous baselines | PagedAttention |
|---|---|---|
| Fragmentation / reservation waste | High on evaluated traces | Tail waste below 4% on evaluated traces |
| Live-token cache utilization | 20.4-38.2% on reported traces | Above 96% on reported traces |
| Admission constraint | Large per-request reservations | Blocks allocated for live tokens |
| Throughput implication | Baseline systems in comparison | Higher when recovered HBM admits useful work |
In the vLLM paper, waste stays under 4% because only the last partially filled block of each sequence can be underutilized, plus a small amount of block-table metadata.[1] On top of that memory-efficiency gain, the paper reports 2-4× higher throughput at comparable latency versus systems such as FasterTransformer and Orca, with bigger gains on longer sequences and more complex decoding workloads.[1]
Block size is a real trade-off. Smaller blocks reduce tail waste and improve sharing granularity, but they increase block-table lookups and can reduce GPU kernel efficiency. Larger blocks improve memory access regularity, but they waste more space in the final block. In vLLM's ablation, block sizes from 16 to 128 worked best on ShareGPT-like traces, shorter Alpaca-like traces favored 16 or 32, and the default landed at 16.[1]
1from math import ceil
2
3def allocated_tokens(live_tokens: int, block_size: int) -> int:
4 return ceil(live_tokens / block_size) * block_size
5
6lengths = [505, 1000, 4096]
7block_size = 16
8for length in lengths:
9 allocated = allocated_tokens(length, block_size)
10 print(f"{length} live tokens -> {allocated} slots, {allocated - length} unused")
11
12assert max(allocated_tokens(length, block_size) - length for length in lengths) < block_size1505 live tokens -> 512 slots, 7 unused
21000 live tokens -> 1008 slots, 8 unused
34096 live tokens -> 4096 slots, 0 unusedThe following Python code simulates how a PagedAttention manager allocates and tracks blocks for requests. This logic is critical for understanding how variable-length requests are mapped to fixed-size memory pools. In a real system (vLLM), this manager runs on the CPU, while custom CUDA kernels handle the actual attention computation on the GPU.
The PagedKVCacheManager takes the block size and total block count as initialization parameters. When requests arrive, it takes a request ID and assigns logical blocks, returning the physical block index and slot offset where the new token's Key and Value tensors should be written as tokens are generated.
1class PagedKVCacheManager:
2 """
3 Manages the mapping between logical pages and physical GPU blocks.
4 Doesn't store the actual tensors (that happens in GPU memory).
5 """
6
7 def __init__(self, num_blocks: int, block_size: int):
8 if num_blocks <= 0 or block_size <= 0:
9 raise ValueError("num_blocks and block_size must be positive")
10 self.block_size = block_size
11 # Stack of free physical block indices
12 self.free_blocks: list[int] = list(range(num_blocks))
13
14 # Maps request_id -> list of physical block indices
15 # Example: { "req_1": [7, 2, 15] }
16 self.block_tables: dict[str, list[int]] = {}
17
18 # Maps request_id -> number of tokens generated so far
19 self.seq_lens: dict[str, int] = {}
20
21 def allocate(self, request_id: str) -> None:
22 """Initialize a new request with one empty block."""
23 if request_id in self.block_tables:
24 raise ValueError(f"request already allocated: {request_id}")
25 if not self.free_blocks:
26 raise MemoryError("GPU Out of Memory")
27
28 first_block = self.free_blocks.pop()
29 self.block_tables[request_id] = [first_block]
30 self.seq_lens[request_id] = 0
31
32 def append_slot(self, request_id: str) -> tuple[int, int]:
33 """
34 Signals that a new token is being generated.
35 Allocates a new block if the current last block is full.
36 Returns (physical_block_index, slot_offset_within_block).
37 """
38 if request_id not in self.block_tables:
39 raise KeyError(f"unknown request: {request_id}")
40 current_blocks = self.block_tables[request_id]
41
42 token_idx = self.seq_lens[request_id]
43
44 # Allocate a fresh block when the previous one is full.
45 if token_idx > 0 and token_idx % self.block_size == 0:
46 if not self.free_blocks:
47 raise MemoryError("GPU Out of Memory")
48 new_block = self.free_blocks.pop()
49 current_blocks.append(new_block)
50
51 block_id = current_blocks[-1]
52 slot_offset = token_idx % self.block_size
53 self.seq_lens[request_id] = token_idx + 1
54 return block_id, slot_offset
55
56 def free(self, request_id: str) -> None:
57 """Release all blocks associated with a request back to the pool."""
58 if request_id in self.block_tables:
59 blocks = self.block_tables.pop(request_id)
60 self.free_blocks.extend(blocks)
61 del self.seq_lens[request_id]
62
63manager = PagedKVCacheManager(num_blocks=20, block_size=4)
64manager.allocate("A")
65for _ in range(5):
66 print("A", manager.append_slot("A"))
67
68manager.allocate("B")
69print("B", manager.append_slot("B"))
70
71manager.free("A")
72print("free blocks:", len(manager.free_blocks))1A (19, 0)
2A (19, 1)
3A (19, 2)
4A (19, 3)
5A (18, 0)
6B (17, 0)
7free blocks: 19Notice how allocate only grabs a single block to start. As append_slot is called during the autoregressive generation loop, new blocks are pulled from free_blocks strictly on-demand. When the request completes, free returns all blocks to the shared pool. In this fixed-size toy pool, no variable-sized gap is stranded; a production runtime must also account for tensors and allocator state outside this KV block pool.
Let's walk through what happens when two requests arrive in a manager with block_size = 4 and num_blocks = 20.
Step 1: Request A arrives.
allocate("A") pops physical block 19 from free_blocks. The block table becomes {"A": [19]}. The sequence length is 0.
Step 2: Request A generates four tokens.
append_slot("A") is called four times. The first three tokens fit into block 19 at offsets 0, 1, and 2. On the fourth call, token_idx = 3, which is still within block 19 (offset 3). After the fourth token, seq_lens["A"] = 4.
Step 3: Request A generates a fifth token.
Now token_idx = 4 and 4 % 4 == 0, so the current block is full. append_slot("A") pops a new block (18, the next index off the free stack) from free_blocks, appends it to the block table, and writes the fifth token at offset 0 inside block 18. The block table is now {"A": [19, 18]}, which matches the A (18, 0) line in the output above.
Step 4: Request B arrives.
allocate("B") pops block 17. Block table: {"A": [19, 18], "B": [17]}.
Step 5: Request A finishes.
free("A") returns blocks 19 and 18 to free_blocks. They are immediately available for the next incoming request, without leaving a variable-sized gap in this fixed-block pool.
This trace shows the core dynamic: blocks are grabbed one at a time, only when needed, and returned instantly when the request completes.
Real engines add two important pieces we omitted for clarity: a reference count per physical block (for shared prefixes) and copy-on-write when a sequence needs to write into a block that's currently shared.
Thinking PagedAttention speeds up attention computation is a common mistake. It doesn't. It speeds up memory allocation and packing, which lets you fit more requests in the same GPU memory. The attention math itself stays the same; what changes is how the K and V tensors are laid out in HBM so the GPU can find them without wasting space.
1block_size = 4
2block_table = [19, 18, 5]
3
4def resolve(token_position: int) -> tuple[int, int]:
5 logical_block, slot = divmod(token_position, block_size)
6 return block_table[logical_block], slot
7
8for position in [0, 3, 4, 8, 10]:
9 print(f"token {position}: physical block, slot = {resolve(position)}")1token 0: physical block, slot = (19, 0)
2token 3: physical block, slot = (19, 3)
3token 4: physical block, slot = (18, 0)
4token 8: physical block, slot = (5, 0)
5token 10: physical block, slot = (5, 2)vLLM combines PagedAttention with a sophisticated scheduler that manages the lifecycle of requests. Because memory is now a shared resource managed much like an operating system's virtual memory, the scheduler must gracefully handle preemption when the GPU's memory capacity is fully exhausted.
In a highly loaded multi-tenant system, the GPU block pool can still become exhausted. When that happens, the scheduler has to preempt some work and recover later. Two common strategies are:[1]
Once GPU space opens up, the request resumes either from swapped-in blocks or from a newly rebuilt cache. Recomputing trades extra FLOPs for simpler recovery, while swapping trades FLOPs for CPU-GPU traffic. Either way, frequent preemption hurts latency, so production systems still try to size batches to avoid it.[1]
Suppose a two-rank deployment serves an 80-layer model with 8 KV heads, head dimension 128, FP16 cache, and an 8K context window. From the table earlier, that cache adds 320 KiB per token, or 2.5 GiB per full request across the model. If the KV heads are evenly sharded over the two ranks, each rank stores 1.25 GiB of that request. Applying the paper's under-4% tail-waste result as an estimate gives about 1.30 GiB per rank.[1]
Do not derive an admission limit from advertised HBM minus model weights alone. Profile how much per-rank memory remains after weights, kernels, communication buffers, graphs, and safety headroom. If that measured KV budget is 8 GiB per rank, only six fully occupied 8K requests fit under the simplified estimate below. PagedAttention does not change this full-context case much; it changes variable-length traffic by allocating blocks for live state rather than reserving every maximum-length tail.
1from math import floor
2
3bytes_per_token = 2 * 80 * 8 * 128 * 2
4context_tokens = 8192
5tp_ranks = 2
6tail_overhead = 1.04
7kv_budget_per_rank_gib = 8.0
8
9per_request_per_rank_gib = bytes_per_token * context_tokens / tp_ranks / (1024 ** 3)
10estimated_with_tail = per_request_per_rank_gib * tail_overhead
11capacity = floor(kv_budget_per_rank_gib / estimated_with_tail)
12
13print(f"per request per rank: {estimated_with_tail:.2f} GiB")
14print(f"admit at most {capacity} full 8K requests under this KV budget")1per request per rank: 1.30 GiB
2admit at most 6 full 8K requests under this KV budgetPagedAttention pairs especially well with continuous batching (also called iteration-level scheduling).
In traditional static batching, the system waits for requests to arrive, bundles them into a batch, and processes them together until all requests in the batch are completed. Because sequences have variable lengths, the shorter requests finish early, leaving GPU compute units idle while waiting for the longest request to complete.
Continuous batching, made explicit in systems like Orca, reevaluates the active batch at every decode iteration.[5] As soon as one request emits an End-of-Sequence (EOS) token and finishes, the scheduler ejects it, frees its KV cache blocks, and admits another waiting request into the running batch.
By operating at the iteration level rather than the request level, the engine avoids leaving slots idle just because one request finished earlier than the others. Continuous batching and PagedAttention solve different bottlenecks, but they work especially well together: one keeps compute busy, and the other makes memory reuse immediate.
| Strategy | Idle Time on skewed generations | When are requests added? | Throughput posture |
|---|---|---|---|
| Static Batching | Can be high while waiting for longest sequence | Only when the entire previous batch completes | Baseline for comparison |
| Continuous Batching | Can be lower | At iteration boundaries after requests finish | Can improve on skewed workloads |
Continuous batching transforms the batch from a static grouping into a sliding window of active computations. Orca reported up to 36.9× higher throughput than FasterTransformer at the same latency target on GPT-3 175B, because late arrivals can join ongoing work and early finishers can leave immediately.[5] PagedAttention makes that scheduling policy much easier to sustain under tight memory budgets.
1from math import ceil
2
3block_size = 16
4live_tokens = {"A": 48, "B": 5}
5
6def blocks(tokens: int) -> int:
7 return ceil(tokens / block_size)
8
9before = sum(blocks(tokens) for tokens in live_tokens.values())
10del live_tokens["A"] # A emits EOS; scheduler retires it before next iteration.
11live_tokens["C"] = 30
12after = sum(blocks(tokens) for tokens in live_tokens.values())
13
14print(f"blocks before retire/admit: {before}")
15print(f"blocks after retire A and admit C: {after}")
16assert after <= before1blocks before retire/admit: 4
2blocks after retire A and admit C: 3Block tables make shared physical blocks possible, but they do not by themselves decide which independently arriving requests may reuse state. In the original vLLM design, copy-on-write lets output sequences forked from the same prompt, such as parallel sampling or beam-search candidates, share their prompt blocks until a branch writes divergent state.[1]
Reusing a system prompt across separate user requests requires an additional prefix cache policy: retain completed prefix blocks, match an exactly compatible token prefix, apply isolation keys where needed, and only then map new requests to the cached state. vLLM documents this cross-request feature as Automatic Prefix Caching.[6] The next lesson covers those matching and isolation rules in depth.
Suppose 100 requests are allowed to reuse the same 1000-token system-instruction prefix:
If all 100 requests qualify, that arrangement saves roughly 99% of prefix-block storage in this example. It does not save unique user input or generated-token cache, and it must not cross tenant or adapter boundaries without an isolation policy.
1from math import ceil
2
3requests = 100
4prefix_tokens = 1000
5block_size = 16
6prefix_blocks = ceil(prefix_tokens / block_size)
7
8without_cache = requests * prefix_blocks
9with_compatible_cache = prefix_blocks
10saved_fraction = 1 - with_compatible_cache / without_cache
11
12print(f"prefix blocks per copy: {prefix_blocks}")
13print(f"saved prefix-block storage: {saved_fraction:.1%}")
14assert saved_fraction == 0.991prefix blocks per copy: 63
2saved prefix-block storage: 99.0%Newer engines like SGLang extend this with RadixAttention.[7] Instead of discarding KV state after each generation call, the runtime keeps a least-recently used (LRU) cache of KV entries for all requests inside a radix tree.
That makes it particularly effective for multi-turn chat and agentic workloads where long prompt prefixes repeat across calls.[7]
FlashInfer is a kernel library for serving workloads that accelerates attention over paged KV caches and variable-length sequences.[8] It doesn't replace PagedAttention at the memory-manager level. Instead, it speeds up the GPU kernels that consume those layouts during decode.
These names are easy to mix up, but they solve different problems.[1][9]
Modern serving stacks often use both or close variants together.[8] PagedAttention makes the cache fit and stay shareable. FlashAttention-style kernels make attention itself faster once those tensors are in place.
PagedAttention optimizes how memory is managed by changing the allocation strategy. However, it doesn't change the actual mathematical size of the tensors. Model architecture changes optimize how much memory is needed in the first place.
Grouped-Query Attention (GQA) is a structural technique that can sharply shrink baseline cache size. Instead of maintaining a separate Key and Value head for every single Query head (as in traditional Multi-Head Attention), GQA groups multiple Query heads together to share a Key and Value head. Ainslie et al. found intermediate group counts recovered much of MHA quality while retaining MQA-like inference benefits after uptraining; other models still require their own quality evaluation.[3]
By sharing KV heads, the memory footprint required for each token drops proportionally to the group size. For instance, in a 70-billion parameter model, moving from 64 distinct KV heads down to 8 (a group size of 8) reduces the per-token KV cache memory by exactly 8 times.
| Attention Type | Example KV Heads | KV Cache/token | Evaluation posture |
|---|---|---|---|
| Multi-Head (MHA) | 64 | 2,560 KiB | Reference cache size |
| Grouped-Query (GQA)[3] | 8 | 320 KiB | 8x smaller cache; measure task quality |
| Multi-Query (MQA)[4] | 1 | 40 KiB | 64x smaller cache; measure task quality |
GQA reduces the size of the problem, while PagedAttention solves the allocation of the problem. They are complementary when the selected model architecture and serving runtime support both.
Confusing the KV cache with the model weights is another common mistake. The weights are static parameters loaded once at startup and shared across every request. The KV cache is per-request state that grows as the conversation proceeds. If you size a GPU for weights alone and forget the cache, you'll hit out-of-memory errors the moment you try to serve more than a handful of long conversations.
LLM inference has two distinct phases:[1][10]
PagedAttention primarily optimizes the Decode phase, where the system rereads a growing KV cache on every token. In practice, many engines still colocate prefills and decodes on the same GPU via continuous batching or chunked prefill, but that creates interference: a large prompt prefill can stall ongoing decode streams for everyone else. That's why newer serving stacks also use chunked prefill and, at larger scale, full prefill/decode disaggregation to keep time-to-first-token (TTFT) and inter-token latency smooth.[5][11][10]
1traces = [
2 {"prompt_tokens": 8000, "ttft_ms": 780, "inter_token_ms": 22},
3 {"prompt_tokens": 200, "ttft_ms": 48, "inter_token_ms": 23},
4 {"prompt_tokens": 220, "ttft_ms": 51, "inter_token_ms": 61},
5]
6
7for trace in traces:
8 if trace["ttft_ms"] > 500:
9 diagnosis = "inspect prefill admission or chunking"
10 elif trace["inter_token_ms"] > 50:
11 diagnosis = "inspect decode pressure and KV budget"
12 else:
13 diagnosis = "within sample thresholds"
14 print(f"{trace['prompt_tokens']} prompt tokens: {diagnosis}")18000 prompt tokens: inspect prefill admission or chunking
2200 prompt tokens: within sample thresholds
3220 prompt tokens: inspect decode pressure and KV budgetEven with PagedAttention, the KV cache still grows linearly with context length. The 80-layer, 8-KV-head, 128-dim GQA example from earlier needs 40 GiB of KV cache at 128K context in FP16. At that scale, KV memory can rival or exceed the room left after loading weights, so long-context serving may require distribution, KV-cache quantization, explicit compression, or smaller admitted contexts.
In Tensor Parallelism (TP), a runtime may shard KV heads across ranks when the model layout permits it. If the number or grouping of KV heads does not divide cleanly across ranks, a runtime may replicate some KV state instead. Capacity planning must use the actual deployed layout, not assume ideal division.
Paged allocation still happens per rank, but logical block growth has to stay aligned across ranks so token positions 0-15, 16-31, and so on map to the correct local KV blocks on every worker.
The physical block IDs don't have to match across GPUs. What matters is consistent logical bookkeeping: each rank must know which local block corresponds to each logical block for its shard of the cache.
Implementations usually synchronize allocation decisions at the scheduler level and maintain per-rank block tables. That lets each GPU run its local attention kernels independently without renegotiating memory layout during every decode step.
1def kv_heads_per_rank(kv_heads: int, tp_ranks: int) -> tuple[int, str]:
2 if kv_heads % tp_ranks == 0:
3 return kv_heads // tp_ranks, "evenly sharded"
4 return kv_heads, "conservative replicated-state budget"
5
6for kv_heads, ranks in [(8, 2), (8, 3), (1, 4)]:
7 local_heads, posture = kv_heads_per_rank(kv_heads, ranks)
8 print(f"{kv_heads} KV heads on TP={ranks}: budget {local_heads} local heads ({posture})")18 KV heads on TP=2: budget 4 local heads (evenly sharded)
28 KV heads on TP=3: budget 8 local heads (conservative replicated-state budget)
31 KV heads on TP=4: budget 1 local heads (conservative replicated-state budget)Symptom: Capacity planning looks fine on paper, then long chats trigger OOM as soon as concurrency rises. Cause: Static weights were counted, but dynamic per-request KV growth was ignored. Fix: Budget GPU memory for weights and for worst realistic live KV state at your target concurrency.
Symptom: Throughput stalls even after scheduler tuning. Cause: The batch policy improved, but the allocator still strands large empty tails and gaps. Fix: Evaluate scheduler behavior and allocator behavior separately. PagedAttention helps because it changes memory packing, not because it changes the attention math.
Symptom: Benchmarks look amazing, but production gains disappoint. Cause: Reported speedups depended on sequence-length skew, latency target, or decoding workload that did not match your route mix. Fix: Compare serving systems under the same latency target, prompt lengths, output lengths, and concurrency pattern.
Symptom: Teams optimize decode memory layout, but TTFT still swings wildly. Cause: Decode is often memory-bandwidth-sensitive, but prefill is a separate compute-heavy phase that can still interfere with live traffic. Fix: Track prefill and decode separately, then decide whether you need chunked prefill, disaggregation, or cache compression.
Symptom: Scheduler falls back to CPU swap and latency explodes. Cause: Swapping preserved work, but PCIe transfer and restore time were ignored. Fix: Treat swap as a tradeoff, not a free safety valve. Compare it against recompute under your actual prompt lengths and hardware path.
Symptom: A team adopts vLLM but still cannot fit the desired context length. Cause: Allocation waste dropped, but the mathematical KV footprint per token stayed the same. Fix: Use GQA, MQA, KV quantization, or cache compression when the per-token cache itself is too large.
You now understand why memory capacity can cap serving throughput, and how PagedAttention turns fragmented KV allocation into a compact block pool. The next step is to reuse cache work across compatible requests. In Prefix Caching and Prompt Caching, you will see how repeated policy text, tool schemas, and shared instructions can avoid repeated prefill work without bypassing isolation rules.
Efficient Memory Management for Large Language Model Serving with PagedAttention.
Kwon, W., et al. · 2023 · SOSP 2023
Attention Is All You Need.
Vaswani, A., et al. · 2017
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.
Ainslie, J., et al. · 2023 · EMNLP 2023
Fast Transformer Decoding: One Write-Head is All You Need.
Shazeer, N. · 2019 · arXiv preprint
Orca: A Distributed Serving System for Transformer-Based Generative Models.
Yu, G.-I., et al. · 2022 · OSDI 2022
Automatic Prefix Caching
vLLM · 2026
SGLang: Efficient Execution of Structured Language Model Programs
Zheng, L., Yin, L., Xie, Z., et al. · 2024 · NeurIPS 2024
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.
Ye, Z., et al. · 2025
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.
Zhong, Y., et al. · 2024 · OSDI 2024
Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.
Agrawal, A., et al. · 2023 · arXiv preprint
SnapKV: Compressing KV Cache by Selecting Global Attention Patterns.
Li, Y., et al. · 2024
H2O: Heavy-Hitter Oracle for KV Cache Eviction in LLMs.
Zhang, Z., et al. · 2023