LearnInference & Production ScaleKV Cache & PagedAttention

🚀HardInference Optimization

KV Cache & PagedAttention

Calculate KV cache capacity, trace paged block allocation, and separate memory packing from prefix reuse and scheduling tradeoffs.

36 min read

Learning path

Step 125 of 155 in the full curriculum

Multi-Query & Grouped-Query Attention Prefix Caching and Prompt Caching

KV Cache & PagedAttention

The previous chapter shrank each token's key-value (KV) footprint with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). This chapter keeps that smaller cache from wasting memory as variable-length requests enter and leave a serving batch.

When a commerce assistant answers a shopper's question, it has to remember the entire conversation: cart history, product details, refund policy, delivery ETA, and support messages. Each new token the model generates would require re-reading that whole history from scratch if it didn't have a way to remember what it already computed.

That memory is called the KV cache (Key-Value cache). It stores the intermediate results the model needs to attend back to earlier tokens without recomputing them. The cache makes generation fast, but it creates a hard engineering problem: it consumes a huge amount of expensive GPU memory, and that consumption grows with every token and every simultaneous shopper.

Managing this memory efficiently is central to LLM serving. PagedAttention, introduced with vLLM, borrows ideas from operating-system paging to reduce fragmentation and pack active request state more tightly.^[1]

Why the KV Cache Exists

This article builds on two ideas from earlier in the curriculum. First, Transformer self-attention: the model computes Query, Key, and Value vectors to decide which earlier tokens matter for the current token.^[2] Second, autoregressive generation: the model produces one token at a time, and each new token can attend to every token that came before it. If those prerequisites are fresh, the rest of this article follows naturally.

Autoregressive generation

Without a KV cache, generating each new token is like re-reading the entire conversation from scratch every time you want to say one more sentence. The KV cache is like keeping a running notebook: each time you process a token, you jot down a summary (Key) and the relevant content (Value). When the next token needs to reference earlier context, you flip through the notebook instead of re-reading everything. The notebook grows with each token, which is why memory management matters so much.

In a Transformer model, generating the next token requires attending to all previous tokens in the sequence.^[2] The attention mechanism for a query token $q_t$ at step $t$ is defined as:

$\text{Attention}(q_t, K_{1:t}, V_{1:t}) = \text{softmax}\left(\frac{q_t K_{1:t}^T}{\sqrt{d_k}}\right) V_{1:t}$

To understand this formula: the current token's query $q_t$ is compared against all previous keys $K_{1:t}$ (via dot product, scaled by $\sqrt{d_k}$ ). The softmax function then determines which previous tokens are most relevant, and those tokens' values $V_{1:t}$ are blended accordingly. Without caching, the keys and values would need to be recomputed from scratch at every step.

Here $K_{1:t}$ and $V_{1:t}$ represent the Key and Value matrices for all tokens from position 1 to $t$ .

Without caching, generating token $t$ would require recomputing the Key and Value vectors for all $t-1$ previous tokens at every single step. That means the cost of generating each new token grows linearly with sequence length, giving a quadratic total cost for the full sequence. The KV cache eliminates the repeated K/V projection work, but the attention operation still has to read the full cached history for each new token.

KV cache: cache K, V from previous steps

The KV cache solves this by storing the Key ( $K$ ) and Value ( $V$ ) vectors for all previous tokens in GPU memory. At step $t$ , we only need to compute $q_t, k_t, v_t$ for the current token, and append $k_t, v_t$ to the cache.

KV cache decode loop showing each new token computing only its own query, key, and value, appending key and value to cache, then attending over all cached history. — The cache saves repeated K/V projection work. It does not remove the growing read: each new query still attends over cached history.

With the KV cache, the decode phase (generating one new token at a time) still scales linearly per step in sequence length because each new token attends over all cached tokens. As cache reads grow, long-context decode can become memory-bandwidth-sensitive even after caching.^[1]

KV cache memory analysis

The size of the KV cache isn't trivial. To see why, let's count the bytes for a concrete model first, then write the general formula.

Memory footprint examples (FP16)

Example architecture	Layers	KV Heads	$d_{head}$	Cache/token	4K Context	32K Context
32-layer, 8-KV-head GQA	32	8	128	128 KiB	512 MiB	4 GiB
80-layer, 8-KV-head GQA	80	8	128	320 KiB	1.25 GiB	10 GiB
96-layer, 96-KV-head MHA	96	96	128	4.5 MiB	18 GiB	144 GiB

KV cache memory footprint comparison for GQA and MHA architectures across 4K and 32K contexts. — KV-cache footprint depends on layer count, KV heads, head dimension, precision, and sequence length. Full MHA grows far faster than GQA.

For the 80-layer, 8-KV-head GQA example, a single 4K-context request already consumes 1.25 GiB of KV cache. A batch of 128 such requests would need 160 GiB of High Bandwidth Memory (HBM), before weights and runtime buffers. That is beyond a single 80 GB-class accelerator, so memory capacity can cap throughput before raw FLOPs do.

If you plug in 80 layers, 64 KV heads, head dimension 128, and FP16 precision, you get 2.5 MiB per token. Keeping the layers, head dimension, and precision fixed but reducing the KV heads from 64 to 8 drops that to 320 KiB per token. At 4K context, that is the difference between 10 GiB and 1.25 GiB of cache for a single request.

Where do these numbers come from? The memory required per token follows a simple rule:

\text{Bytes per token} = 2 \times n_{\text{layers}} \times n_{\text{kv-heads}} \times d_{\text{head}} \times P_{\text{bytes}}

To interpret this formula: for each token, we store a Key and Value vector (hence the "2") across every layer and every KV head. Each vector has $d_{\text{head}}$ dimensions, and $P_{\text{bytes}}$ is the precision size. For plain Multi-Head Attention (MHA), $n_{\text{kv-heads}} = n_{\text{heads}}$ . For Grouped-Query Attention (GQA) or Multi-Query Attention (MQA), the number of KV heads is smaller, which directly shrinks the cache.^[3]^[4]

Where:

$2$ : For both Key and Value matrices
$n_{\text{layers}}$ : Number of transformer layers
$n_{\text{kv-heads}}$ : Number of KV heads. In MHA this equals the number of attention heads; in GQA/MQA it's smaller
$d_{\text{head}}$ : Dimension of each head
$P_{\text{bytes}}$ : Stored K/V payload precision (2 bytes for FP16, 1 byte for FP8/INT8)

This is the tensor-payload calculation. A deployed engine may also reserve quantization scales, block tables, temporary workspaces, communication buffers, graph memory, and allocator headroom. Use the formula to reason about slopes, then profile the runtime before setting admission limits.

calculate-kv-cache-footprint.py

def cache_gib(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int, tokens: int) -> float:
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * tokens / (1024 ** 3)

for label, heads in [("MHA-64", 64), ("GQA-8", 8), ("MQA-1", 1)]:
    size = cache_gib(layers=80, kv_heads=heads, head_dim=128, bytes_per_value=2, tokens=4096)
    print(f"{label}: {size:.2f} GiB at 4K tokens")

Output

MHA-64: 10.00 GiB at 4K tokens
GQA-8: 1.25 GiB at 4K tokens
MQA-1: 0.16 GiB at 4K tokens

The memory fragmentation problem

In a naive serving system, memory for the KV cache is allocated as a single, contiguous buffer for each request. Because the system can't know ahead of time how many tokens the model will ultimately generate before stopping, it often reserves enough memory to accommodate the configured maximum possible sequence length (for instance, 4096 tokens).

Side-by-side view of contiguous KV reservation wasting tail space versus paged blocks that allocate only live token chunks. — Contiguous allocators trap empty tails. PagedAttention allocates small live blocks on demand, so waste is mostly limited to the final partial block.

Types of waste

This pessimistic allocation strategy leads to massive inefficiencies due to three distinct forms of memory fragmentation:

Internal Fragmentation: This occurs when a request reserves the maximum space (e.g., 4096 tokens) but finishes generation early (e.g., after just 200 tokens). The remaining 3896 pre-allocated slots sit completely empty but can't be used by any other request, representing a massive loss of usable capacity.
External Fragmentation: Over time, as requests of various sizes enter and leave the system, free memory becomes scattered in small gaps between the allocated buffers. Even if the total free memory is large enough to handle a new request, the system can't launch it because there's no single contiguous block large enough to hold it.
Reservation Waste: Because the system guarantees contiguous memory for the maximum possible length, it can't oversubscribe memory based on average sequence lengths. This forces the system to operate highly conservatively.

In the vLLM paper's evaluation traces, existing contiguous allocators used only 20.4-38.2% of KV memory to hold live token states, which means the rest was lost to fragmentation and over-reservation.^[1] That directly limits the maximum batch size and throttles throughput.

measure-reservation-waste.py

bytes_per_token = 320 * 1024
max_tokens = 8192
live_tokens = [600, 1250, 4000, 128]

reserved = len(live_tokens) * max_tokens * bytes_per_token
live = sum(live_tokens) * bytes_per_token
waste_fraction = 1 - live / reserved

print(f"reserved: {reserved / 1024**3:.2f} GiB")
print(f"live state: {live / 1024**3:.2f} GiB")
print(f"unused reservation: {waste_fraction:.1%}")

Output

reserved: 10.00 GiB
live state: 1.82 GiB
unused reservation: 81.8%

PagedAttention (vLLM)

The key insight: virtual memory for KV cache

PagedAttention adapts the concept of paging from operating systems to LLM serving.^[1] Instead of contiguous buffers, the KV cache is divided into fixed-size blocks. In the original vLLM paper, each block holds 16 tokens.^[1]

Naive KV cache allocation is like reserving a full pallet position for every possible package size, even when most orders only need a small bin. Most reserved space goes unused. PagedAttention is like a shared bin system that assigns fixed-size slots only as requests need them, and those slots don't have to sit next to each other. When a request finishes, its slots are immediately available for the next request. Waste is mostly limited to each request's final partially filled block.

Logical Blocks: The sequence of tokens as seen by the model (contiguous indices 0, 1, 2...).
Physical Blocks: Where the data lives in GPU memory (non-contiguous, scattered).
Block Table: A mapping that translates logical block indices to physical block addresses.

This allows the system to allocate memory dynamically, one block at a time, only when needed.

PagedAttention block table mapping logical token blocks to non-contiguous physical GPU memory blocks. — The model sees contiguous logical blocks. The allocator is free to place each block anywhere in the physical GPU block pool.

Efficiency gains

Metric in vLLM evaluation	Contiguous baselines	PagedAttention
Fragmentation / reservation waste	High on evaluated traces	Tail waste below 4% on evaluated traces
Live-token cache utilization	20.4-38.2% on reported traces	Above 96% on reported traces
Admission constraint	Large per-request reservations	Blocks allocated for live tokens
Throughput implication	Baseline systems in comparison	Higher when recovered HBM admits useful work

In the vLLM paper, waste stays under 4% because only the last partially filled block of each sequence can be underutilized, plus a small amount of block-table metadata.^[1] On top of that memory-efficiency gain, the paper reports 2-4× higher throughput at comparable latency versus systems such as FasterTransformer and Orca, with bigger gains on longer sequences and more complex decoding workloads.^[1]

Block size is a real trade-off. Smaller blocks reduce tail waste and improve sharing granularity, but they increase block-table lookups and can reduce GPU kernel efficiency. Larger blocks improve memory access regularity, but they waste more space in the final block. In vLLM's ablation, block sizes from 16 to 128 worked best on ShareGPT-like traces, shorter Alpaca-like traces favored 16 or 32, and the default landed at 16.^[1]

bound-paged-tail-waste.py

from math import ceil

def allocated_tokens(live_tokens: int, block_size: int) -> int:
    return ceil(live_tokens / block_size) * block_size

lengths = [505, 1000, 4096]
block_size = 16
for length in lengths:
    allocated = allocated_tokens(length, block_size)
    print(f"{length} live tokens -> {allocated} slots, {allocated - length} unused")

assert max(allocated_tokens(length, block_size) - length for length in lengths) < block_size

Output

live tokens -> 512 slots, 7 unused
live tokens -> 1008 slots, 8 unused
live tokens -> 4096 slots, 0 unused

Implementation sketch

The following Python code simulates how a PagedAttention manager allocates and tracks blocks for requests. This logic is critical for understanding how variable-length requests are mapped to fixed-size memory pools. In a real system (vLLM), this manager runs on the CPU, while custom CUDA kernels handle the actual attention computation on the GPU.

The PagedKVCacheManager takes the block size and total block count as initialization parameters. When requests arrive, it takes a request ID and assigns logical blocks, returning the physical block index and slot offset where the new token's Key and Value tensors should be written as tokens are generated.

implementation-sketch.py

class PagedKVCacheManager:
    """
    Manages the mapping between logical pages and physical GPU blocks.
    Doesn't store the actual tensors (that happens in GPU memory).
    """

    def __init__(self, num_blocks: int, block_size: int):
        if num_blocks <= 0 or block_size <= 0:
            raise ValueError("num_blocks and block_size must be positive")
        self.block_size = block_size
        # Stack of free physical block indices
        self.free_blocks: list[int] = list(range(num_blocks))

        # Maps request_id -> list of physical block indices
        # Example: { "req_1": [7, 2, 15] }
        self.block_tables: dict[str, list[int]] = {}

        # Maps request_id -> number of tokens generated so far
        self.seq_lens: dict[str, int] = {}

    def allocate(self, request_id: str) -> None:
        """Initialize a new request with one empty block."""
        if request_id in self.block_tables:
            raise ValueError(f"request already allocated: {request_id}")
        if not self.free_blocks:
            raise MemoryError("GPU Out of Memory")

        first_block = self.free_blocks.pop()
        self.block_tables[request_id] = [first_block]
        self.seq_lens[request_id] = 0

    def append_slot(self, request_id: str) -> tuple[int, int]:
        """
        Signals that a new token is being generated.
        Allocates a new block if the current last block is full.
        Returns (physical_block_index, slot_offset_within_block).
        """
        if request_id not in self.block_tables:
            raise KeyError(f"unknown request: {request_id}")
        current_blocks = self.block_tables[request_id]

        token_idx = self.seq_lens[request_id]

        # Allocate a fresh block when the previous one is full.
        if token_idx > 0 and token_idx % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("GPU Out of Memory")
            new_block = self.free_blocks.pop()
            current_blocks.append(new_block)

        block_id = current_blocks[-1]
        slot_offset = token_idx % self.block_size
        self.seq_lens[request_id] = token_idx + 1
        return block_id, slot_offset

    def free(self, request_id: str) -> None:
        """Release all blocks associated with a request back to the pool."""
        if request_id in self.block_tables:
            blocks = self.block_tables.pop(request_id)
            self.free_blocks.extend(blocks)
            del self.seq_lens[request_id]

manager = PagedKVCacheManager(num_blocks=20, block_size=4)
manager.allocate("A")
for _ in range(5):
    print("A", manager.append_slot("A"))

manager.allocate("B")
print("B", manager.append_slot("B"))

manager.free("A")
print("free blocks:", len(manager.free_blocks))

Output

A (19, 0)
A (19, 1)
A (19, 2)
A (19, 3)
A (18, 0)
B (17, 0)
free blocks: 19

Notice how allocate only grabs a single block to start. As append_slot is called during the autoregressive generation loop, new blocks are pulled from free_blocks strictly on-demand. When the request completes, free returns all blocks to the shared pool. In this fixed-size toy pool, no variable-sized gap is stranded; a production runtime must also account for tensors and allocator state outside this KV block pool.

Tracing a small example

Let's walk through what happens when two requests arrive in a manager with block_size = 4 and num_blocks = 20.

Step 1: Request A arrives. allocate("A") pops physical block 19 from free_blocks. The block table becomes {"A": [19]}. The sequence length is 0.

Step 2: Request A generates four tokens. append_slot("A") is called four times. The first three tokens fit into block 19 at offsets 0, 1, and 2. On the fourth call, token_idx = 3, which is still within block 19 (offset 3). After the fourth token, seq_lens["A"] = 4.

Step 3: Request A generates a fifth token. Now token_idx = 4 and 4 % 4 == 0, so the current block is full. append_slot("A") pops a new block (18, the next index off the free stack) from free_blocks, appends it to the block table, and writes the fifth token at offset 0 inside block 18. The block table is now {"A": [19, 18]}, which matches the A (18, 0) line in the output above.

Step 4: Request B arrives. allocate("B") pops block 17. Block table: {"A": [19, 18], "B": [17]}.

Step 5: Request A finishes. free("A") returns blocks 19 and 18 to free_blocks. They are immediately available for the next incoming request, without leaving a variable-sized gap in this fixed-block pool.

This trace shows the core dynamic: blocks are grabbed one at a time, only when needed, and returned instantly when the request completes.

Real engines add two important pieces we omitted for clarity: a reference count per physical block (for shared prefixes) and copy-on-write when a sequence needs to write into a block that's currently shared.

Thinking PagedAttention speeds up attention computation is a common mistake. It doesn't. It speeds up memory allocation and packing, which lets you fit more requests in the same GPU memory. The attention math itself stays the same; what changes is how the K and V tensors are laid out in HBM so the GPU can find them without wasting space.

resolve-logical-block-reads.py

block_size = 4
block_table = [19, 18, 5]

def resolve(token_position: int) -> tuple[int, int]:
    logical_block, slot = divmod(token_position, block_size)
    return block_table[logical_block], slot

for position in [0, 3, 4, 8, 10]:
    print(f"token {position}: physical block, slot = {resolve(position)}")

Output

token 0: physical block, slot = (19, 0)
token 3: physical block, slot = (19, 3)
token 4: physical block, slot = (18, 0)
token 8: physical block, slot = (5, 0)
token 10: physical block, slot = (5, 2)

vLLM architecture

vLLM combines PagedAttention with a sophisticated scheduler that manages the lifecycle of requests. Because memory is now a shared resource managed much like an operating system's virtual memory, the scheduler must gracefully handle preemption when the GPU's memory capacity is fully exhausted.

vLLM serving architecture showing requests flowing through scheduler, LLM engine, PagedAttention KV manager, GPU block pool, and preemption recovery by recompute or CPU swap. — PagedAttention makes memory block-based, but scheduler still owns hard production decision: admit, preempt, recompute, or swap.

Preemption: recompute vs. swap

In a highly loaded multi-tenant system, the GPU block pool can still become exhausted. When that happens, the scheduler has to preempt some work and recover later. Two common strategies are:^[1]

Recomputing: Discard the preempted request's KV blocks and rebuild them later by rerunning the prompt and prior decode steps. This spends extra FLOPs, but avoids Peripheral Component Interconnect Express (PCIe) transfer overhead.
Swapping: Move the KV blocks from GPU HBM to CPU RAM, then swap them back in later. This preserves previous work, but consumes PCIe bandwidth.

Once GPU space opens up, the request resumes either from swapped-in blocks or from a newly rebuilt cache. Recomputing trades extra FLOPs for simpler recovery, while swapping trades FLOPs for CPU-GPU traffic. Either way, frequent preemption hurts latency, so production systems still try to size batches to avoid it.^[1]

Suppose a two-rank deployment serves an 80-layer model with 8 KV heads, head dimension 128, FP16 cache, and an 8K context window. From the table earlier, that cache adds 320 KiB per token, or 2.5 GiB per full request across the model. If the KV heads are evenly sharded over the two ranks, each rank stores 1.25 GiB of that request. Applying the paper's under-4% tail-waste result as an estimate gives about 1.30 GiB per rank.^[1]

Do not derive an admission limit from advertised HBM minus model weights alone. Profile how much per-rank memory remains after weights, kernels, communication buffers, graphs, and safety headroom. If that measured KV budget is 8 GiB per rank, only six fully occupied 8K requests fit under the simplified estimate below. PagedAttention does not change this full-context case much; it changes variable-length traffic by allocating blocks for live state rather than reserving every maximum-length tail.

admit-full-context-requests.py

from math import floor

bytes_per_token = 2 * 80 * 8 * 128 * 2
context_tokens = 8192
tp_ranks = 2
tail_overhead = 1.04
kv_budget_per_rank_gib = 8.0

per_request_per_rank_gib = bytes_per_token * context_tokens / tp_ranks / (1024 ** 3)
estimated_with_tail = per_request_per_rank_gib * tail_overhead
capacity = floor(kv_budget_per_rank_gib / estimated_with_tail)

print(f"per request per rank: {estimated_with_tail:.2f} GiB")
print(f"admit at most {capacity} full 8K requests under this KV budget")

Output

per request per rank: 1.30 GiB
admit at most 6 full 8K requests under this KV budget

Continuous batching (iteration-level batching)

PagedAttention pairs especially well with continuous batching (also called iteration-level scheduling).

Static batching limitations

In traditional static batching, the system waits for $N$ requests to arrive, bundles them into a batch, and processes them together until all requests in the batch are completed. Because sequences have variable lengths, the shorter requests finish early, leaving GPU compute units idle while waiting for the longest request to complete.

The continuous batching approach

Continuous batching, made explicit in systems like Orca, reevaluates the active batch at every decode iteration.^[5] As soon as one request emits an End-of-Sequence (EOS) token and finishes, the scheduler ejects it, frees its KV cache blocks, and admits another waiting request into the running batch.

By operating at the iteration level rather than the request level, the engine avoids leaving slots idle just because one request finished earlier than the others. Continuous batching and PagedAttention solve different bottlenecks, but they work especially well together: one keeps compute busy, and the other makes memory reuse immediate.

Strategy	Idle Time on skewed generations	When are requests added?	Throughput posture
Static Batching	Can be high while waiting for longest sequence	Only when the entire previous batch completes	Baseline for comparison
Continuous Batching	Can be lower	At iteration boundaries after requests finish	Can improve on skewed workloads

Continuous batching transforms the batch from a static grouping into a sliding window of active computations. Orca reported up to 36.9× higher throughput than FasterTransformer at the same latency target on GPT-3 175B, because late arrivals can join ongoing work and early finishers can leave immediately.^[5] PagedAttention makes that scheduling policy much easier to sustain under tight memory budgets.

recycle-blocks-at-iteration-boundaries.py

from math import ceil

block_size = 16
live_tokens = {"A": 48, "B": 5}

def blocks(tokens: int) -> int:
    return ceil(tokens / block_size)

before = sum(blocks(tokens) for tokens in live_tokens.values())
del live_tokens["A"]  # A emits EOS; scheduler retires it before next iteration.
live_tokens["C"] = 30
after = sum(blocks(tokens) for tokens in live_tokens.values())

print(f"blocks before retire/admit: {before}")
print(f"blocks after retire A and admit C: {after}")
assert after <= before

Output

blocks before retire/admit: 4
blocks after retire A and admit C: 3

How serving engines build on PagedAttention

Copy-on-write for branches, prefix caching across requests

Block tables make shared physical blocks possible, but they do not by themselves decide which independently arriving requests may reuse state. In the original vLLM design, copy-on-write lets output sequences forked from the same prompt, such as parallel sampling or beam-search candidates, share their prompt blocks until a branch writes divergent state.^[1]

Reusing a system prompt across separate user requests requires an additional prefix cache policy: retain completed prefix blocks, match an exactly compatible token prefix, apply isolation keys where needed, and only then map new requests to the cached state. vLLM documents this cross-request feature as Automatic Prefix Caching.^[6] The next lesson covers those matching and isolation rules in depth.

Suppose 100 requests are allowed to reuse the same 1000-token system-instruction prefix:

Without prefix caching: Store 100 copies of the reusable prefix blocks.
With compatible prefix caching: Store one physical prefix copy and map permitted requests to those read-only blocks until their unique continuation begins.

If all 100 requests qualify, that arrangement saves roughly 99% of prefix-block storage in this example. It does not save unique user input or generated-token cache, and it must not cross tenant or adapter boundaries without an isolation policy.

estimate-prefix-block-reuse.py

from math import ceil

requests = 100
prefix_tokens = 1000
block_size = 16
prefix_blocks = ceil(prefix_tokens / block_size)

without_cache = requests * prefix_blocks
with_compatible_cache = prefix_blocks
saved_fraction = 1 - with_compatible_cache / without_cache

print(f"prefix blocks per copy: {prefix_blocks}")
print(f"saved prefix-block storage: {saved_fraction:.1%}")
assert saved_fraction == 0.99

Output

prefix blocks per copy: 63
saved prefix-block storage: 99.0%

RadixAttention (SGLang)

Newer engines like SGLang extend this with RadixAttention.^[7] Instead of discarding KV state after each generation call, the runtime keeps a least-recently used (LRU) cache of KV entries for all requests inside a radix tree.

Hit: Reuse existing KV blocks immediately (no recomputation needed).
Miss: Compute only the new tokens.

That makes it particularly effective for multi-turn chat and agentic workloads where long prompt prefixes repeat across calls.^[7]

FlashInfer and kernel optimization

FlashInfer is a kernel library for serving workloads that accelerates attention over paged KV caches and variable-length sequences.^[8] It doesn't replace PagedAttention at the memory-manager level. Instead, it speeds up the GPU kernels that consume those layouts during decode.

PagedAttention vs. FlashAttention

These names are easy to mix up, but they solve different problems.^[1]^[9]

PagedAttention: A KV-cache layout and allocation strategy. It places cached K/V tensors in blocks and supports block sharing; a prefix-cache policy decides cross-request reuse.
FlashAttention: An IO-aware attention kernel. It tiles attention so the GPU does less wasteful movement between HBM and on-chip memory, especially during full attention computation.^[9]

Modern serving stacks often use both or close variants together.^[8] PagedAttention makes the cache fit and stay shareable. FlashAttention-style kernels make attention itself faster once those tensors are in place.

Reducing KV cache at the architecture level

PagedAttention optimizes how memory is managed by changing the allocation strategy. However, it doesn't change the actual mathematical size of the tensors. Model architecture changes optimize how much memory is needed in the first place.

Grouped-Query Attention (GQA) is a structural technique that can sharply shrink baseline cache size. Instead of maintaining a separate Key and Value head for every single Query head (as in traditional Multi-Head Attention), GQA groups multiple Query heads together to share a Key and Value head. Ainslie et al. found intermediate group counts recovered much of MHA quality while retaining MQA-like inference benefits after uptraining; other models still require their own quality evaluation.^[3]

By sharing KV heads, the memory footprint required for each token drops proportionally to the group size. For instance, in a 70-billion parameter model, moving from 64 distinct KV heads down to 8 (a group size of 8) reduces the per-token KV cache memory by exactly 8 times.

Attention Type	Example KV Heads	KV Cache/token	Evaluation posture
Multi-Head (MHA)	64	2,560 KiB	Reference cache size
Grouped-Query (GQA)^[3]	8	320 KiB	8x smaller cache; measure task quality
Multi-Query (MQA)^[4]	1	40 KiB	64x smaller cache; measure task quality

GQA reduces the size of the problem, while PagedAttention solves the allocation of the problem. They are complementary when the selected model architecture and serving runtime support both.

Practical considerations

Confusing the KV cache with the model weights is another common mistake. The weights are static parameters loaded once at startup and shared across every request. The KV cache is per-request state that grows as the conversation proceeds. If you size a GPU for weights alone and forget the cache, you'll hit out-of-memory errors the moment you try to serve more than a handful of long conversations.

1. The prefill vs. decode bottleneck

LLM inference has two distinct phases:^[1]^[10]

Prefill: Processing the input prompt. For sufficiently large prompt batches, dense matrix multiplications often make this phase compute-heavy and highly parallelizable.
Decode: Generating tokens one by one. At common serving batch sizes, repeatedly loading weights and growing KV state often makes this phase memory-bandwidth-sensitive.

PagedAttention primarily optimizes the Decode phase, where the system rereads a growing KV cache on every token. In practice, many engines still colocate prefills and decodes on the same GPU via continuous batching or chunked prefill, but that creates interference: a large prompt prefill can stall ongoing decode streams for everyone else. That's why newer serving stacks also use chunked prefill and, at larger scale, full prefill/decode disaggregation to keep time-to-first-token (TTFT) and inter-token latency smooth.^[5]^[11]^[10]

separate-prefill-and-decode-slos.py

traces = [
    {"prompt_tokens": 8000, "ttft_ms": 780, "inter_token_ms": 22},
    {"prompt_tokens": 200, "ttft_ms": 48, "inter_token_ms": 23},
    {"prompt_tokens": 220, "ttft_ms": 51, "inter_token_ms": 61},
]

for trace in traces:
    if trace["ttft_ms"] > 500:
        diagnosis = "inspect prefill admission or chunking"
    elif trace["inter_token_ms"] > 50:
        diagnosis = "inspect decode pressure and KV budget"
    else:
        diagnosis = "within sample thresholds"
    print(f"{trace['prompt_tokens']} prompt tokens: {diagnosis}")

Output

prompt tokens: inspect prefill admission or chunking
prompt tokens: within sample thresholds
prompt tokens: inspect decode pressure and KV budget

2. Context window limits

Even with PagedAttention, the KV cache still grows linearly with context length. The 80-layer, 8-KV-head, 128-dim GQA example from earlier needs 40 GiB of KV cache at 128K context in FP16. At that scale, KV memory can rival or exceed the room left after loading weights, so long-context serving may require distribution, KV-cache quantization, explicit compression, or smaller admitted contexts.

Compression / eviction: Research systems such as SnapKV^[12] and H2O^[13] reduce the cache by keeping only the tokens that seem most useful for future attention.
Quantization: Storing KV cache in FP8 instead of FP16 halves the bytes per cached value, which roughly doubles KV-cache-limited capacity when the cache is the bottleneck.

3. Distributed inference

In Tensor Parallelism (TP), a runtime may shard KV heads across ranks when the model layout permits it. If the number or grouping of KV heads does not divide cleanly across ranks, a runtime may replicate some KV state instead. Capacity planning must use the actual deployed layout, not assume ideal division.

Paged allocation still happens per rank, but logical block growth has to stay aligned across ranks so token positions 0-15, 16-31, and so on map to the correct local KV blocks on every worker.

The physical block IDs don't have to match across GPUs. What matters is consistent logical bookkeeping: each rank must know which local block corresponds to each logical block for its shard of the cache.

Implementations usually synchronize allocation decisions at the scheduler level and maintain per-rank block tables. That lets each GPU run its local attention kernels independently without renegotiating memory layout during every decode step.

avoid-assuming-ideal-kv-sharding.py

def kv_heads_per_rank(kv_heads: int, tp_ranks: int) -> tuple[int, str]:
    if kv_heads % tp_ranks == 0:
        return kv_heads // tp_ranks, "evenly sharded"
    return kv_heads, "conservative replicated-state budget"

for kv_heads, ranks in [(8, 2), (8, 3), (1, 4)]:
    local_heads, posture = kv_heads_per_rank(kv_heads, ranks)
    print(f"{kv_heads} KV heads on TP={ranks}: budget {local_heads} local heads ({posture})")

Output

KV heads on TP=2: budget 4 local heads (evenly sharded)
KV heads on TP=3: budget 8 local heads (conservative replicated-state budget)
KV heads on TP=4: budget 1 local heads (conservative replicated-state budget)

Mastery check

Key concepts

KV cache: Stores old Keys and Values so decode does not recompute them on every step.
Fragmentation: Lost usable HBM from empty tails, scattered gaps, or worst-case reservations.
PagedAttention: Splits cache into fixed-size logical and physical blocks connected by a block table.
Continuous batching: Rebuilds active batch each decode step so finished requests free slots immediately.
Copy-on-write: Lets forked decoding branches share existing blocks until one branch needs a private write; independent request reuse requires prefix-cache matching.

Evaluation rubric

Foundational: Explains why the KV cache speeds decode but still grows with sequence length and active requests.
Foundational: Calculates bytes per token from layers, KV heads, head dimension, and precision.
Intermediate: Distinguishes internal fragmentation, scattered free gaps, and worst-case reservation waste in a naive contiguous allocator.
Intermediate: Explains how PagedAttention uses logical blocks, physical blocks, and a block table to allocate only live cache state.
Advanced: Connects PagedAttention to continuous batching, branch copy-on-write, optional cross-request prefix caching, and scheduler-level preemption decisions.
Advanced: Separates allocator optimizations from architecture optimizations such as GQA, MQA, and KV-cache quantization.
Advanced: Explains why prefill and decode remain different bottlenecks even after PagedAttention.

Follow-up questions

Common pitfalls

KV cache is treated like model weights

Symptom: Capacity planning looks fine on paper, then long chats trigger OOM as soon as concurrency rises. Cause: Static weights were counted, but dynamic per-request KV growth was ignored. Fix: Budget GPU memory for weights and for worst realistic live KV state at your target concurrency.

Fragmentation is blamed on batching alone

Symptom: Throughput stalls even after scheduler tuning. Cause: The batch policy improved, but the allocator still strands large empty tails and gaps. Fix: Evaluate scheduler behavior and allocator behavior separately. PagedAttention helps because it changes memory packing, not because it changes the attention math.

Throughput numbers are quoted without workload shape

Symptom: Benchmarks look amazing, but production gains disappoint. Cause: Reported speedups depended on sequence-length skew, latency target, or decoding workload that did not match your route mix. Fix: Compare serving systems under the same latency target, prompt lengths, output lengths, and concurrency pattern.

Prefill and decode are collapsed into one bottleneck

Symptom: Teams optimize decode memory layout, but TTFT still swings wildly. Cause: Decode is often memory-bandwidth-sensitive, but prefill is a separate compute-heavy phase that can still interfere with live traffic. Fix: Track prefill and decode separately, then decide whether you need chunked prefill, disaggregation, or cache compression.

Swap is assumed to be free recovery

Symptom: Scheduler falls back to CPU swap and latency explodes. Cause: Swapping preserved work, but PCIe transfer and restore time were ignored. Fix: Treat swap as a tradeoff, not a free safety valve. Compare it against recompute under your actual prompt lengths and hardware path.

PagedAttention is expected to shrink bytes per token

Symptom: A team adopts vLLM but still cannot fit the desired context length. Cause: Allocation waste dropped, but the mathematical KV footprint per token stayed the same. Fix: Use GQA, MQA, KV quantization, or cache compression when the per-token cache itself is too large.

Key takeaways

Budget live state: In memory-limited serving configurations, KV cache capacity can limit batch size before raw compute does.
Fragmentation Kills Performance: Naive contiguous allocation can waste roughly 60-80% of KV memory on real traces.^[1]
PagedAttention Improves Allocation: Fixed-size non-contiguous blocks avoid variable-sized KV gaps and bound per-sequence tail waste within the block pool.
Sharing needs policy: Copy-on-write supports forked branches; reuse across independent requests additionally requires compatible prefix matching and isolation.
Architecture Still Matters: Techniques like GQA reduce the size of the KV cache, while PagedAttention makes that smaller cache pack efficiently.

You now understand why memory capacity can cap serving throughput, and how PagedAttention turns fragmented KV allocation into a compact block pool. The next step is to reuse cache work across compatible requests. In Prefix Caching and Prompt Caching, you will see how repeated policy text, tool schemas, and shared instructions can avoid repeated prefill work without bypassing isolation rules.

Next Step

Continue to Prefix Caching and Prompt Caching

You now understand how PagedAttention packs live KV state. The next article covers the matching and isolation rules needed to reuse computed prefixes across compatible requests.

PreviousMulti-Query & Grouped-Query Attention

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

Efficient Memory Management for Large Language Model Serving with PagedAttention.

Kwon, W., et al. · 2023 · SOSP 2023

Attention Is All You Need.

Vaswani, A., et al. · 2017

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.

Ainslie, J., et al. · 2023 · EMNLP 2023

Fast Transformer Decoding: One Write-Head is All You Need.

Shazeer, N. · 2019 · arXiv preprint

Orca: A Distributed Serving System for Transformer-Based Generative Models.

Yu, G.-I., et al. · 2022 · OSDI 2022

Automatic Prefix Caching

vLLM · 2026

SGLang: Efficient Execution of Structured Language Model Programs

Zheng, L., Yin, L., Xie, Z., et al. · 2024 · NeurIPS 2024

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.

Ye, Z., et al. · 2025

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.

Zhong, Y., et al. · 2024 · OSDI 2024

Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.

Agrawal, A., et al. · 2023 · arXiv preprint

SnapKV: Compressing KV Cache by Selecting Global Attention Patterns.

Li, Y., et al. · 2024

H2O: Heavy-Hitter Oracle for KV Cache Eviction in LLMs.

Zhang, Z., et al. · 2023

Back to Topics

LearnInference & Production ScaleKV Cache & PagedAttention

🚀HardInference Optimization

KV Cache & PagedAttention

Calculate KV cache capacity, trace paged block allocation, and separate memory packing from prefix reuse and scheduling tradeoffs.

36 min read

Learning path

Step 125 of 155 in the full curriculum

Multi-Query & Grouped-Query Attention Prefix Caching and Prompt Caching

KV Cache & PagedAttention

Why the KV Cache Exists

Autoregressive generation

In a Transformer model, generating the next token requires attending to all previous tokens in the sequence.^[2] The attention mechanism for a query token $q_t$ at step $t$ is defined as:

$\text{Attention}(q_t, K_{1:t}, V_{1:t}) = \text{softmax}\left(\frac{q_t K_{1:t}^T}{\sqrt{d_k}}\right) V_{1:t}$

Here $K_{1:t}$ and $V_{1:t}$ represent the Key and Value matrices for all tokens from position 1 to $t$ .

KV cache: cache K, V from previous steps

KV cache memory analysis

The size of the KV cache isn't trivial. To see why, let's count the bytes for a concrete model first, then write the general formula.

Memory footprint examples (FP16)

Example architecture	Layers	KV Heads	$d_{head}$	Cache/token	4K Context	32K Context
32-layer, 8-KV-head GQA	32	8	128	128 KiB	512 MiB	4 GiB
80-layer, 8-KV-head GQA	80	8	128	320 KiB	1.25 GiB	10 GiB
96-layer, 96-KV-head MHA	96	96	128	4.5 MiB	18 GiB	144 GiB

Where do these numbers come from? The memory required per token follows a simple rule:

\text{Bytes per token} = 2 \times n_{\text{layers}} \times n_{\text{kv-heads}} \times d_{\text{head}} \times P_{\text{bytes}}

Where:

$2$ : For both Key and Value matrices
$n_{\text{layers}}$ : Number of transformer layers
$n_{\text{kv-heads}}$ : Number of KV heads. In MHA this equals the number of attention heads; in GQA/MQA it's smaller
$d_{\text{head}}$ : Dimension of each head
$P_{\text{bytes}}$ : Stored K/V payload precision (2 bytes for FP16, 1 byte for FP8/INT8)

calculate-kv-cache-footprint.py

def cache_gib(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int, tokens: int) -> float:
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * tokens / (1024 ** 3)

for label, heads in [("MHA-64", 64), ("GQA-8", 8), ("MQA-1", 1)]:
    size = cache_gib(layers=80, kv_heads=heads, head_dim=128, bytes_per_value=2, tokens=4096)
    print(f"{label}: {size:.2f} GiB at 4K tokens")

Output

MHA-64: 10.00 GiB at 4K tokens
GQA-8: 1.25 GiB at 4K tokens
MQA-1: 0.16 GiB at 4K tokens

The memory fragmentation problem

Types of waste

This pessimistic allocation strategy leads to massive inefficiencies due to three distinct forms of memory fragmentation:

Internal Fragmentation: This occurs when a request reserves the maximum space (e.g., 4096 tokens) but finishes generation early (e.g., after just 200 tokens). The remaining 3896 pre-allocated slots sit completely empty but can't be used by any other request, representing a massive loss of usable capacity.
External Fragmentation: Over time, as requests of various sizes enter and leave the system, free memory becomes scattered in small gaps between the allocated buffers. Even if the total free memory is large enough to handle a new request, the system can't launch it because there's no single contiguous block large enough to hold it.
Reservation Waste: Because the system guarantees contiguous memory for the maximum possible length, it can't oversubscribe memory based on average sequence lengths. This forces the system to operate highly conservatively.

measure-reservation-waste.py

bytes_per_token = 320 * 1024
max_tokens = 8192
live_tokens = [600, 1250, 4000, 128]

reserved = len(live_tokens) * max_tokens * bytes_per_token
live = sum(live_tokens) * bytes_per_token
waste_fraction = 1 - live / reserved

print(f"reserved: {reserved / 1024**3:.2f} GiB")
print(f"live state: {live / 1024**3:.2f} GiB")
print(f"unused reservation: {waste_fraction:.1%}")

Output

reserved: 10.00 GiB
live state: 1.82 GiB
unused reservation: 81.8%

PagedAttention (vLLM)

The key insight: virtual memory for KV cache

Logical Blocks: The sequence of tokens as seen by the model (contiguous indices 0, 1, 2...).
Physical Blocks: Where the data lives in GPU memory (non-contiguous, scattered).
Block Table: A mapping that translates logical block indices to physical block addresses.

This allows the system to allocate memory dynamically, one block at a time, only when needed.

Efficiency gains

Metric in vLLM evaluation	Contiguous baselines	PagedAttention
Fragmentation / reservation waste	High on evaluated traces	Tail waste below 4% on evaluated traces
Live-token cache utilization	20.4-38.2% on reported traces	Above 96% on reported traces
Admission constraint	Large per-request reservations	Blocks allocated for live tokens
Throughput implication	Baseline systems in comparison	Higher when recovered HBM admits useful work

bound-paged-tail-waste.py

from math import ceil

def allocated_tokens(live_tokens: int, block_size: int) -> int:
    return ceil(live_tokens / block_size) * block_size

lengths = [505, 1000, 4096]
block_size = 16
for length in lengths:
    allocated = allocated_tokens(length, block_size)
    print(f"{length} live tokens -> {allocated} slots, {allocated - length} unused")

assert max(allocated_tokens(length, block_size) - length for length in lengths) < block_size

Output

live tokens -> 512 slots, 7 unused
live tokens -> 1008 slots, 8 unused
live tokens -> 4096 slots, 0 unused

Implementation sketch

implementation-sketch.py

class PagedKVCacheManager:
    """
    Manages the mapping between logical pages and physical GPU blocks.
    Doesn't store the actual tensors (that happens in GPU memory).
    """

    def __init__(self, num_blocks: int, block_size: int):
        if num_blocks <= 0 or block_size <= 0:
            raise ValueError("num_blocks and block_size must be positive")
        self.block_size = block_size
        # Stack of free physical block indices
        self.free_blocks: list[int] = list(range(num_blocks))

        # Maps request_id -> list of physical block indices
        # Example: { "req_1": [7, 2, 15] }
        self.block_tables: dict[str, list[int]] = {}

        # Maps request_id -> number of tokens generated so far
        self.seq_lens: dict[str, int] = {}

    def allocate(self, request_id: str) -> None:
        """Initialize a new request with one empty block."""
        if request_id in self.block_tables:
            raise ValueError(f"request already allocated: {request_id}")
        if not self.free_blocks:
            raise MemoryError("GPU Out of Memory")

        first_block = self.free_blocks.pop()
        self.block_tables[request_id] = [first_block]
        self.seq_lens[request_id] = 0

    def append_slot(self, request_id: str) -> tuple[int, int]:
        """
        Signals that a new token is being generated.
        Allocates a new block if the current last block is full.
        Returns (physical_block_index, slot_offset_within_block).
        """
        if request_id not in self.block_tables:
            raise KeyError(f"unknown request: {request_id}")
        current_blocks = self.block_tables[request_id]

        token_idx = self.seq_lens[request_id]

        # Allocate a fresh block when the previous one is full.
        if token_idx > 0 and token_idx % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("GPU Out of Memory")
            new_block = self.free_blocks.pop()
            current_blocks.append(new_block)

        block_id = current_blocks[-1]
        slot_offset = token_idx % self.block_size
        self.seq_lens[request_id] = token_idx + 1
        return block_id, slot_offset

    def free(self, request_id: str) -> None:
        """Release all blocks associated with a request back to the pool."""
        if request_id in self.block_tables:
            blocks = self.block_tables.pop(request_id)
            self.free_blocks.extend(blocks)
            del self.seq_lens[request_id]

manager = PagedKVCacheManager(num_blocks=20, block_size=4)
manager.allocate("A")
for _ in range(5):
    print("A", manager.append_slot("A"))

manager.allocate("B")
print("B", manager.append_slot("B"))

manager.free("A")
print("free blocks:", len(manager.free_blocks))

Output

A (19, 0)
A (19, 1)
A (19, 2)
A (19, 3)
A (18, 0)
B (17, 0)
free blocks: 19

Tracing a small example

Let's walk through what happens when two requests arrive in a manager with block_size = 4 and num_blocks = 20.

Step 1: Request A arrives. allocate("A") pops physical block 19 from free_blocks. The block table becomes {"A": [19]}. The sequence length is 0.

Step 4: Request B arrives. allocate("B") pops block 17. Block table: {"A": [19, 18], "B": [17]}.

This trace shows the core dynamic: blocks are grabbed one at a time, only when needed, and returned instantly when the request completes.

resolve-logical-block-reads.py

block_size = 4
block_table = [19, 18, 5]

def resolve(token_position: int) -> tuple[int, int]:
    logical_block, slot = divmod(token_position, block_size)
    return block_table[logical_block], slot

for position in [0, 3, 4, 8, 10]:
    print(f"token {position}: physical block, slot = {resolve(position)}")

Output

token 0: physical block, slot = (19, 0)
token 3: physical block, slot = (19, 3)
token 4: physical block, slot = (18, 0)
token 8: physical block, slot = (5, 0)
token 10: physical block, slot = (5, 2)

vLLM architecture

Preemption: recompute vs. swap

In a highly loaded multi-tenant system, the GPU block pool can still become exhausted. When that happens, the scheduler has to preempt some work and recover later. Two common strategies are:^[1]

Recomputing: Discard the preempted request's KV blocks and rebuild them later by rerunning the prompt and prior decode steps. This spends extra FLOPs, but avoids Peripheral Component Interconnect Express (PCIe) transfer overhead.
Swapping: Move the KV blocks from GPU HBM to CPU RAM, then swap them back in later. This preserves previous work, but consumes PCIe bandwidth.

admit-full-context-requests.py

from math import floor

bytes_per_token = 2 * 80 * 8 * 128 * 2
context_tokens = 8192
tp_ranks = 2
tail_overhead = 1.04
kv_budget_per_rank_gib = 8.0

per_request_per_rank_gib = bytes_per_token * context_tokens / tp_ranks / (1024 ** 3)
estimated_with_tail = per_request_per_rank_gib * tail_overhead
capacity = floor(kv_budget_per_rank_gib / estimated_with_tail)

print(f"per request per rank: {estimated_with_tail:.2f} GiB")
print(f"admit at most {capacity} full 8K requests under this KV budget")

Output

per request per rank: 1.30 GiB
admit at most 6 full 8K requests under this KV budget

Continuous batching (iteration-level batching)

PagedAttention pairs especially well with continuous batching (also called iteration-level scheduling).

Static batching limitations

The continuous batching approach

Strategy	Idle Time on skewed generations	When are requests added?	Throughput posture
Static Batching	Can be high while waiting for longest sequence	Only when the entire previous batch completes	Baseline for comparison
Continuous Batching	Can be lower	At iteration boundaries after requests finish	Can improve on skewed workloads

recycle-blocks-at-iteration-boundaries.py

from math import ceil

block_size = 16
live_tokens = {"A": 48, "B": 5}

def blocks(tokens: int) -> int:
    return ceil(tokens / block_size)

before = sum(blocks(tokens) for tokens in live_tokens.values())
del live_tokens["A"]  # A emits EOS; scheduler retires it before next iteration.
live_tokens["C"] = 30
after = sum(blocks(tokens) for tokens in live_tokens.values())

print(f"blocks before retire/admit: {before}")
print(f"blocks after retire A and admit C: {after}")
assert after <= before

Output

blocks before retire/admit: 4
blocks after retire A and admit C: 3

How serving engines build on PagedAttention

Copy-on-write for branches, prefix caching across requests

Suppose 100 requests are allowed to reuse the same 1000-token system-instruction prefix:

Without prefix caching: Store 100 copies of the reusable prefix blocks.
With compatible prefix caching: Store one physical prefix copy and map permitted requests to those read-only blocks until their unique continuation begins.

estimate-prefix-block-reuse.py

from math import ceil

requests = 100
prefix_tokens = 1000
block_size = 16
prefix_blocks = ceil(prefix_tokens / block_size)

without_cache = requests * prefix_blocks
with_compatible_cache = prefix_blocks
saved_fraction = 1 - with_compatible_cache / without_cache

print(f"prefix blocks per copy: {prefix_blocks}")
print(f"saved prefix-block storage: {saved_fraction:.1%}")
assert saved_fraction == 0.99

Output

prefix blocks per copy: 63
saved prefix-block storage: 99.0%

RadixAttention (SGLang)

Hit: Reuse existing KV blocks immediately (no recomputation needed).
Miss: Compute only the new tokens.

That makes it particularly effective for multi-turn chat and agentic workloads where long prompt prefixes repeat across calls.^[7]

FlashInfer and kernel optimization

PagedAttention vs. FlashAttention

These names are easy to mix up, but they solve different problems.^[1]^[9]

PagedAttention: A KV-cache layout and allocation strategy. It places cached K/V tensors in blocks and supports block sharing; a prefix-cache policy decides cross-request reuse.
FlashAttention: An IO-aware attention kernel. It tiles attention so the GPU does less wasteful movement between HBM and on-chip memory, especially during full attention computation.^[9]

Reducing KV cache at the architecture level

Attention Type	Example KV Heads	KV Cache/token	Evaluation posture
Multi-Head (MHA)	64	2,560 KiB	Reference cache size
Grouped-Query (GQA)^[3]	8	320 KiB	8x smaller cache; measure task quality
Multi-Query (MQA)^[4]	1	40 KiB	64x smaller cache; measure task quality

GQA reduces the size of the problem, while PagedAttention solves the allocation of the problem. They are complementary when the selected model architecture and serving runtime support both.

Practical considerations

1. The prefill vs. decode bottleneck

LLM inference has two distinct phases:^[1]^[10]

Prefill: Processing the input prompt. For sufficiently large prompt batches, dense matrix multiplications often make this phase compute-heavy and highly parallelizable.
Decode: Generating tokens one by one. At common serving batch sizes, repeatedly loading weights and growing KV state often makes this phase memory-bandwidth-sensitive.

separate-prefill-and-decode-slos.py

traces = [
    {"prompt_tokens": 8000, "ttft_ms": 780, "inter_token_ms": 22},
    {"prompt_tokens": 200, "ttft_ms": 48, "inter_token_ms": 23},
    {"prompt_tokens": 220, "ttft_ms": 51, "inter_token_ms": 61},
]

for trace in traces:
    if trace["ttft_ms"] > 500:
        diagnosis = "inspect prefill admission or chunking"
    elif trace["inter_token_ms"] > 50:
        diagnosis = "inspect decode pressure and KV budget"
    else:
        diagnosis = "within sample thresholds"
    print(f"{trace['prompt_tokens']} prompt tokens: {diagnosis}")

Output

prompt tokens: inspect prefill admission or chunking
prompt tokens: within sample thresholds
prompt tokens: inspect decode pressure and KV budget

2. Context window limits

Compression / eviction: Research systems such as SnapKV^[12] and H2O^[13] reduce the cache by keeping only the tokens that seem most useful for future attention.
Quantization: Storing KV cache in FP8 instead of FP16 halves the bytes per cached value, which roughly doubles KV-cache-limited capacity when the cache is the bottleneck.

3. Distributed inference

Paged allocation still happens per rank, but logical block growth has to stay aligned across ranks so token positions 0-15, 16-31, and so on map to the correct local KV blocks on every worker.

avoid-assuming-ideal-kv-sharding.py

def kv_heads_per_rank(kv_heads: int, tp_ranks: int) -> tuple[int, str]:
    if kv_heads % tp_ranks == 0:
        return kv_heads // tp_ranks, "evenly sharded"
    return kv_heads, "conservative replicated-state budget"

for kv_heads, ranks in [(8, 2), (8, 3), (1, 4)]:
    local_heads, posture = kv_heads_per_rank(kv_heads, ranks)
    print(f"{kv_heads} KV heads on TP={ranks}: budget {local_heads} local heads ({posture})")

Output

KV heads on TP=2: budget 4 local heads (evenly sharded)
KV heads on TP=3: budget 8 local heads (conservative replicated-state budget)
KV heads on TP=4: budget 1 local heads (conservative replicated-state budget)

Mastery check

Key concepts

KV cache: Stores old Keys and Values so decode does not recompute them on every step.
Fragmentation: Lost usable HBM from empty tails, scattered gaps, or worst-case reservations.
PagedAttention: Splits cache into fixed-size logical and physical blocks connected by a block table.
Continuous batching: Rebuilds active batch each decode step so finished requests free slots immediately.
Copy-on-write: Lets forked decoding branches share existing blocks until one branch needs a private write; independent request reuse requires prefix-cache matching.

Evaluation rubric

Foundational: Explains why the KV cache speeds decode but still grows with sequence length and active requests.
Foundational: Calculates bytes per token from layers, KV heads, head dimension, and precision.
Intermediate: Distinguishes internal fragmentation, scattered free gaps, and worst-case reservation waste in a naive contiguous allocator.
Intermediate: Explains how PagedAttention uses logical blocks, physical blocks, and a block table to allocate only live cache state.
Advanced: Connects PagedAttention to continuous batching, branch copy-on-write, optional cross-request prefix caching, and scheduler-level preemption decisions.
Advanced: Separates allocator optimizations from architecture optimizations such as GQA, MQA, and KV-cache quantization.
Advanced: Explains why prefill and decode remain different bottlenecks even after PagedAttention.

Follow-up questions

Common pitfalls

KV cache is treated like model weights

Fragmentation is blamed on batching alone

Throughput numbers are quoted without workload shape

Prefill and decode are collapsed into one bottleneck

Swap is assumed to be free recovery

PagedAttention is expected to shrink bytes per token

Key takeaways

Budget live state: In memory-limited serving configurations, KV cache capacity can limit batch size before raw compute does.
Fragmentation Kills Performance: Naive contiguous allocation can waste roughly 60-80% of KV memory on real traces.^[1]
PagedAttention Improves Allocation: Fixed-size non-contiguous blocks avoid variable-sized KV gaps and bound per-sequence tail waste within the block pool.
Sharing needs policy: Copy-on-write supports forked branches; reuse across independent requests additionally requires compatible prefix matching and isolation.
Architecture Still Matters: Techniques like GQA reduce the size of the KV cache, while PagedAttention makes that smaller cache pack efficiently.

Next Step

Continue to Prefix Caching and Prompt Caching

You now understand how PagedAttention packs live KV state. The next article covers the matching and isolation rules needed to reuse computed prefixes across compatible requests.

PreviousMulti-Query & Grouped-Query Attention

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

Efficient Memory Management for Large Language Model Serving with PagedAttention.

Kwon, W., et al. · 2023 · SOSP 2023

Attention Is All You Need.

Vaswani, A., et al. · 2017

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.

Ainslie, J., et al. · 2023 · EMNLP 2023

Fast Transformer Decoding: One Write-Head is All You Need.

Shazeer, N. · 2019 · arXiv preprint

Orca: A Distributed Serving System for Transformer-Based Generative Models.

Yu, G.-I., et al. · 2022 · OSDI 2022

Automatic Prefix Caching

vLLM · 2026

SGLang: Efficient Execution of Structured Language Model Programs

Zheng, L., Yin, L., Xie, Z., et al. · 2024 · NeurIPS 2024

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.

Ye, Z., et al. · 2025

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.

Zhong, Y., et al. · 2024 · OSDI 2024

Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.

Agrawal, A., et al. · 2023 · arXiv preprint

SnapKV: Compressing KV Cache by Selecting Global Attention Patterns.

Li, Y., et al. · 2024

H2O: Heavy-Hitter Oracle for KV Cache Eviction in LLMs.

Zhang, Z., et al. · 2023

KV Cache & PagedAttention

KV Cache & PagedAttention

What problem does the KV cache solve, and what new problem does it create?

Why the KV Cache Exists

Autoregressive generation

KV cache: cache K, V from previous steps

Why doesn't the KV cache make long-context decode constant time?

KV cache memory analysis

Memory footprint examples (FP16)

For an 80-layer GQA model with 8 KV heads, head dimension 128, and FP16 cache, how many bytes does one token add?

The memory fragmentation problem

Types of waste

Why does naive contiguous KV allocation waste memory on variable-length requests?

PagedAttention (vLLM)

The key insight: virtual memory for KV cache

What does the block table do in PagedAttention?

Efficiency gains

Why isn't the smallest possible block size always best?

Implementation sketch

Tracing a small example

In the toy manager, when is a new physical block allocated?

vLLM architecture

Preemption: recompute vs. swap

Why doesn't PagedAttention magically increase capacity for requests that all use the full maximum context?

Continuous batching (iteration-level batching)

Static batching limitations

The continuous batching approach

Why do continuous batching and PagedAttention work well together?

How serving engines build on PagedAttention

Copy-on-write for branches, prefix caching across requests

Why is PagedAttention block sharing not the same as cross-request prefix caching?

RadixAttention (SGLang)

FlashInfer and kernel optimization

PagedAttention vs. FlashAttention

How should you explain PagedAttention versus FlashAttention in one sentence?

Reducing KV cache at the architecture level

Why are GQA and PagedAttention complementary rather than competing optimizations?

Practical considerations

1. The prefill vs. decode bottleneck

2. Context window limits

3. Distributed inference

In tensor-parallel serving, do physical KV block IDs need to match on every GPU?

Mastery check

Key concepts

Evaluation rubric

Follow-up questions

An engine reserves 8K tokens for every request, but most conversations stop after 600 tokens. What waste is dominating?

Why does PagedAttention help less when every active request uses the full maximum context?

A route has healthy block utilization, but TTFT spikes whenever one huge prompt arrives. Which bottleneck are you seeing?

Two agent branches share same long system prompt, then diverge. Why does copy-on-write matter?

In tensor-parallel serving, must physical block IDs match across GPUs?

Common pitfalls

KV cache is treated like model weights

Fragmentation is blamed on batching alone

Throughput numbers are quoted without workload shape

Prefill and decode are collapsed into one bottleneck

Swap is assumed to be free recovery

PagedAttention is expected to shrink bytes per token

Key takeaways

KV Cache & PagedAttention

KV Cache & PagedAttention

What problem does the KV cache solve, and what new problem does it create?

Why the KV Cache Exists

Autoregressive generation

KV cache: cache K, V from previous steps

Why doesn't the KV cache make long-context decode constant time?

KV cache memory analysis

Memory footprint examples (FP16)

For an 80-layer GQA model with 8 KV heads, head dimension 128, and FP16 cache, how many bytes does one token add?

The memory fragmentation problem

Types of waste

Why does naive contiguous KV allocation waste memory on variable-length requests?

PagedAttention (vLLM)

The key insight: virtual memory for KV cache

What does the block table do in PagedAttention?

Efficiency gains

Why isn't the smallest possible block size always best?

Implementation sketch

Tracing a small example

In the toy manager, when is a new physical block allocated?