LeetLLM
LearnFeaturesBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Features
  • Blog

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 155 articles completed

🛠️Computing Foundations0/6
NumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models & Image GenerationReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnInference & Production ScaleKV Cache & PagedAttention
🚀HardInference Optimization

KV Cache & PagedAttention

Calculate KV cache capacity, trace paged block allocation, and separate memory packing from prefix reuse and scheduling tradeoffs.

36 min read
Learning path
Step 125 of 155 in the full curriculum
Multi-Query & Grouped-Query AttentionPrefix Caching and Prompt Caching

KV Cache & PagedAttention

The previous chapter shrank each token's key-value (KV) footprint with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). This chapter keeps that smaller cache from wasting memory as variable-length requests enter and leave a serving batch.

When a commerce assistant answers a shopper's question, it has to remember the entire conversation: cart history, product details, refund policy, delivery ETA, and support messages. Each new token the model generates would require re-reading that whole history from scratch if it didn't have a way to remember what it already computed.

That memory is called the KV cache (Key-Value cache). It stores the intermediate results the model needs to attend back to earlier tokens without recomputing them. The cache makes generation fast, but it creates a hard engineering problem: it consumes a huge amount of expensive GPU memory, and that consumption grows with every token and every simultaneous shopper.

Managing this memory efficiently is central to LLM serving. PagedAttention, introduced with vLLM, borrows ideas from operating-system paging to reduce fragmentation and pack active request state more tightly.[1]


Why the KV Cache Exists

This article builds on two ideas from earlier in the curriculum. First, Transformer self-attention: the model computes Query, Key, and Value vectors to decide which earlier tokens matter for the current token.[2] Second, autoregressive generation: the model produces one token at a time, and each new token can attend to every token that came before it. If those prerequisites are fresh, the rest of this article follows naturally.

Autoregressive generation

Without a KV cache, generating each new token is like re-reading the entire conversation from scratch every time you want to say one more sentence. The KV cache is like keeping a running notebook: each time you process a token, you jot down a summary (Key) and the relevant content (Value). When the next token needs to reference earlier context, you flip through the notebook instead of re-reading everything. The notebook grows with each token, which is why memory management matters so much.

In a Transformer model, generating the next token requires attending to all previous tokens in the sequence.[2] The attention mechanism for a query token qtq_tqt​ at step ttt is defined as:

Attention(qt,K1:t,V1:t)=softmax(qtK1:tTdk)V1:t\text{Attention}(q_t, K_{1:t}, V_{1:t}) = \text{softmax}\left(\frac{q_t K_{1:t}^T}{\sqrt{d_k}}\right) V_{1:t}Attention(qt​,K1:t​,V1:t​)=softmax(dk​​qt​K1:tT​​)V1:t​

To understand this formula: the current token's query qtq_tqt​ is compared against all previous keys K1:tK_{1:t}K1:t​ (via dot product, scaled by dk\sqrt{d_k}dk​​). The softmax function then determines which previous tokens are most relevant, and those tokens' values V1:tV_{1:t}V1:t​ are blended accordingly. Without caching, the keys and values would need to be recomputed from scratch at every step.

Here K1:tK_{1:t}K1:t​ and V1:tV_{1:t}V1:t​ represent the Key and Value matrices for all tokens from position 1 to ttt.

Without caching, generating token ttt would require recomputing the Key and Value vectors for all t−1t-1t−1 previous tokens at every single step. That means the cost of generating each new token grows linearly with sequence length, giving a quadratic total cost for the full sequence. The KV cache eliminates the repeated K/V projection work, but the attention operation still has to read the full cached history for each new token.

KV cache: cache K, V from previous steps

The KV cache solves this by storing the Key (KKK) and Value (VVV) vectors for all previous tokens in GPU memory. At step ttt, we only need to compute qt,kt,vtq_t, k_t, v_tqt​,kt​,vt​ for the current token, and append kt,vtk_t, v_tkt​,vt​ to the cache.

KV cache decode loop showing each new token computing only its own query, key, and value, appending key and value to cache, then attending over all cached history. KV cache decode loop showing each new token computing only its own query, key, and value, appending key and value to cache, then attending over all cached history.
The cache saves repeated K/V projection work. It does not remove the growing read: each new query still attends over cached history.

With the KV cache, the decode phase (generating one new token at a time) still scales linearly per step in sequence length because each new token attends over all cached tokens. As cache reads grow, long-context decode can become memory-bandwidth-sensitive even after caching.[1]


KV cache memory analysis

The size of the KV cache isn't trivial. To see why, let's count the bytes for a concrete model first, then write the general formula.

Memory footprint examples (FP16)

Example architectureLayersKV Headsdheadd_{head}dhead​Cache/token4K Context32K Context
32-layer, 8-KV-head GQA328128128 KiB512 MiB4 GiB
80-layer, 8-KV-head GQA808128320 KiB1.25 GiB10 GiB
96-layer, 96-KV-head MHA96961284.5 MiB18 GiB144 GiB
KV cache memory footprint comparison for GQA and MHA architectures across 4K and 32K contexts. KV cache memory footprint comparison for GQA and MHA architectures across 4K and 32K contexts.
KV-cache footprint depends on layer count, KV heads, head dimension, precision, and sequence length. Full MHA grows far faster than GQA.

For the 80-layer, 8-KV-head GQA example, a single 4K-context request already consumes 1.25 GiB of KV cache. A batch of 128 such requests would need 160 GiB of High Bandwidth Memory (HBM), before weights and runtime buffers. That is beyond a single 80 GB-class accelerator, so memory capacity can cap throughput before raw FLOPs do.

If you plug in 80 layers, 64 KV heads, head dimension 128, and FP16 precision, you get 2.5 MiB per token. Keeping the layers, head dimension, and precision fixed but reducing the KV heads from 64 to 8 drops that to 320 KiB per token. At 4K context, that is the difference between 10 GiB and 1.25 GiB of cache for a single request.

Where do these numbers come from? The memory required per token follows a simple rule:

Bytes per token=2×nlayers×nkv-heads×dhead×Pbytes\text{Bytes per token} = 2 \times n_{\text{layers}} \times n_{\text{kv-heads}} \times d_{\text{head}} \times P_{\text{bytes}}Bytes per token=2×nlayers​×nkv-heads​×dhead​×Pbytes​

To interpret this formula: for each token, we store a Key and Value vector (hence the "2") across every layer and every KV head. Each vector has dheadd_{\text{head}}dhead​ dimensions, and PbytesP_{\text{bytes}}Pbytes​ is the precision size. For plain Multi-Head Attention (MHA), nkv-heads=nheadsn_{\text{kv-heads}} = n_{\text{heads}}nkv-heads​=nheads​. For Grouped-Query Attention (GQA) or Multi-Query Attention (MQA), the number of KV heads is smaller, which directly shrinks the cache.[3][4]

Where:

  • 222: For both Key and Value matrices
  • nlayersn_{\text{layers}}nlayers​: Number of transformer layers
  • nkv-headsn_{\text{kv-heads}}nkv-heads​: Number of KV heads. In MHA this equals the number of attention heads; in GQA/MQA it's smaller
  • dheadd_{\text{head}}dhead​: Dimension of each head
  • PbytesP_{\text{bytes}}Pbytes​: Stored K/V payload precision (2 bytes for FP16, 1 byte for FP8/INT8)

This is the tensor-payload calculation. A deployed engine may also reserve quantization scales, block tables, temporary workspaces, communication buffers, graph memory, and allocator headroom. Use the formula to reason about slopes, then profile the runtime before setting admission limits.

calculate-kv-cache-footprint.py
1def cache_gib(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int, tokens: int) -> float: 2 bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value 3 return bytes_per_token * tokens / (1024 ** 3) 4 5for label, heads in [("MHA-64", 64), ("GQA-8", 8), ("MQA-1", 1)]: 6 size = cache_gib(layers=80, kv_heads=heads, head_dim=128, bytes_per_value=2, tokens=4096) 7 print(f"{label}: {size:.2f} GiB at 4K tokens")
Output
1MHA-64: 10.00 GiB at 4K tokens 2GQA-8: 1.25 GiB at 4K tokens 3MQA-1: 0.16 GiB at 4K tokens

The memory fragmentation problem

In a naive serving system, memory for the KV cache is allocated as a single, contiguous buffer for each request. Because the system can't know ahead of time how many tokens the model will ultimately generate before stopping, it often reserves enough memory to accommodate the configured maximum possible sequence length (for instance, 4096 tokens).

Side-by-side view of contiguous KV reservation wasting tail space versus paged blocks that allocate only live token chunks. Side-by-side view of contiguous KV reservation wasting tail space versus paged blocks that allocate only live token chunks.
Contiguous allocators trap empty tails. PagedAttention allocates small live blocks on demand, so waste is mostly limited to the final partial block.

Types of waste

This pessimistic allocation strategy leads to massive inefficiencies due to three distinct forms of memory fragmentation:

  1. Internal Fragmentation: This occurs when a request reserves the maximum space (e.g., 4096 tokens) but finishes generation early (e.g., after just 200 tokens). The remaining 3896 pre-allocated slots sit completely empty but can't be used by any other request, representing a massive loss of usable capacity.
  2. External Fragmentation: Over time, as requests of various sizes enter and leave the system, free memory becomes scattered in small gaps between the allocated buffers. Even if the total free memory is large enough to handle a new request, the system can't launch it because there's no single contiguous block large enough to hold it.
  3. Reservation Waste: Because the system guarantees contiguous memory for the maximum possible length, it can't oversubscribe memory based on average sequence lengths. This forces the system to operate highly conservatively.

In the vLLM paper's evaluation traces, existing contiguous allocators used only 20.4-38.2% of KV memory to hold live token states, which means the rest was lost to fragmentation and over-reservation.[1] That directly limits the maximum batch size and throttles throughput.

measure-reservation-waste.py
1bytes_per_token = 320 * 1024 2max_tokens = 8192 3live_tokens = [600, 1250, 4000, 128] 4 5reserved = len(live_tokens) * max_tokens * bytes_per_token 6live = sum(live_tokens) * bytes_per_token 7waste_fraction = 1 - live / reserved 8 9print(f"reserved: {reserved / 1024**3:.2f} GiB") 10print(f"live state: {live / 1024**3:.2f} GiB") 11print(f"unused reservation: {waste_fraction:.1%}")
Output
1reserved: 10.00 GiB 2live state: 1.82 GiB 3unused reservation: 81.8%

PagedAttention (vLLM)

The key insight: virtual memory for KV cache

PagedAttention adapts the concept of paging from operating systems to LLM serving.[1] Instead of contiguous buffers, the KV cache is divided into fixed-size blocks. In the original vLLM paper, each block holds 16 tokens.[1]

Naive KV cache allocation is like reserving a full pallet position for every possible package size, even when most orders only need a small bin. Most reserved space goes unused. PagedAttention is like a shared bin system that assigns fixed-size slots only as requests need them, and those slots don't have to sit next to each other. When a request finishes, its slots are immediately available for the next request. Waste is mostly limited to each request's final partially filled block.

  • Logical Blocks: The sequence of tokens as seen by the model (contiguous indices 0, 1, 2...).
  • Physical Blocks: Where the data lives in GPU memory (non-contiguous, scattered).
  • Block Table: A mapping that translates logical block indices to physical block addresses.

This allows the system to allocate memory dynamically, one block at a time, only when needed.

PagedAttention block table mapping logical token blocks to non-contiguous physical GPU memory blocks. PagedAttention block table mapping logical token blocks to non-contiguous physical GPU memory blocks.
The model sees contiguous logical blocks. The allocator is free to place each block anywhere in the physical GPU block pool.

Efficiency gains

Metric in vLLM evaluationContiguous baselinesPagedAttention
Fragmentation / reservation wasteHigh on evaluated tracesTail waste below 4% on evaluated traces
Live-token cache utilization20.4-38.2% on reported tracesAbove 96% on reported traces
Admission constraintLarge per-request reservationsBlocks allocated for live tokens
Throughput implicationBaseline systems in comparisonHigher when recovered HBM admits useful work

In the vLLM paper, waste stays under 4% because only the last partially filled block of each sequence can be underutilized, plus a small amount of block-table metadata.[1] On top of that memory-efficiency gain, the paper reports 2-4× higher throughput at comparable latency versus systems such as FasterTransformer and Orca, with bigger gains on longer sequences and more complex decoding workloads.[1]

Block size is a real trade-off. Smaller blocks reduce tail waste and improve sharing granularity, but they increase block-table lookups and can reduce GPU kernel efficiency. Larger blocks improve memory access regularity, but they waste more space in the final block. In vLLM's ablation, block sizes from 16 to 128 worked best on ShareGPT-like traces, shorter Alpaca-like traces favored 16 or 32, and the default landed at 16.[1]

bound-paged-tail-waste.py
1from math import ceil 2 3def allocated_tokens(live_tokens: int, block_size: int) -> int: 4 return ceil(live_tokens / block_size) * block_size 5 6lengths = [505, 1000, 4096] 7block_size = 16 8for length in lengths: 9 allocated = allocated_tokens(length, block_size) 10 print(f"{length} live tokens -> {allocated} slots, {allocated - length} unused") 11 12assert max(allocated_tokens(length, block_size) - length for length in lengths) < block_size
Output
1505 live tokens -> 512 slots, 7 unused 21000 live tokens -> 1008 slots, 8 unused 34096 live tokens -> 4096 slots, 0 unused

Implementation sketch

The following Python code simulates how a PagedAttention manager allocates and tracks blocks for requests. This logic is critical for understanding how variable-length requests are mapped to fixed-size memory pools. In a real system (vLLM), this manager runs on the CPU, while custom CUDA kernels handle the actual attention computation on the GPU.

The PagedKVCacheManager takes the block size and total block count as initialization parameters. When requests arrive, it takes a request ID and assigns logical blocks, returning the physical block index and slot offset where the new token's Key and Value tensors should be written as tokens are generated.

implementation-sketch.py
1class PagedKVCacheManager: 2 """ 3 Manages the mapping between logical pages and physical GPU blocks. 4 Doesn't store the actual tensors (that happens in GPU memory). 5 """ 6 7 def __init__(self, num_blocks: int, block_size: int): 8 if num_blocks <= 0 or block_size <= 0: 9 raise ValueError("num_blocks and block_size must be positive") 10 self.block_size = block_size 11 # Stack of free physical block indices 12 self.free_blocks: list[int] = list(range(num_blocks)) 13 14 # Maps request_id -> list of physical block indices 15 # Example: { "req_1": [7, 2, 15] } 16 self.block_tables: dict[str, list[int]] = {} 17 18 # Maps request_id -> number of tokens generated so far 19 self.seq_lens: dict[str, int] = {} 20 21 def allocate(self, request_id: str) -> None: 22 """Initialize a new request with one empty block.""" 23 if request_id in self.block_tables: 24 raise ValueError(f"request already allocated: {request_id}") 25 if not self.free_blocks: 26 raise MemoryError("GPU Out of Memory") 27 28 first_block = self.free_blocks.pop() 29 self.block_tables[request_id] = [first_block] 30 self.seq_lens[request_id] = 0 31 32 def append_slot(self, request_id: str) -> tuple[int, int]: 33 """ 34 Signals that a new token is being generated. 35 Allocates a new block if the current last block is full. 36 Returns (physical_block_index, slot_offset_within_block). 37 """ 38 if request_id not in self.block_tables: 39 raise KeyError(f"unknown request: {request_id}") 40 current_blocks = self.block_tables[request_id] 41 42 token_idx = self.seq_lens[request_id] 43 44 # Allocate a fresh block when the previous one is full. 45 if token_idx > 0 and token_idx % self.block_size == 0: 46 if not self.free_blocks: 47 raise MemoryError("GPU Out of Memory") 48 new_block = self.free_blocks.pop() 49 current_blocks.append(new_block) 50 51 block_id = current_blocks[-1] 52 slot_offset = token_idx % self.block_size 53 self.seq_lens[request_id] = token_idx + 1 54 return block_id, slot_offset 55 56 def free(self, request_id: str) -> None: 57 """Release all blocks associated with a request back to the pool.""" 58 if request_id in self.block_tables: 59 blocks = self.block_tables.pop(request_id) 60 self.free_blocks.extend(blocks) 61 del self.seq_lens[request_id] 62 63manager = PagedKVCacheManager(num_blocks=20, block_size=4) 64manager.allocate("A") 65for _ in range(5): 66 print("A", manager.append_slot("A")) 67 68manager.allocate("B") 69print("B", manager.append_slot("B")) 70 71manager.free("A") 72print("free blocks:", len(manager.free_blocks))
Output
1A (19, 0) 2A (19, 1) 3A (19, 2) 4A (19, 3) 5A (18, 0) 6B (17, 0) 7free blocks: 19

Notice how allocate only grabs a single block to start. As append_slot is called during the autoregressive generation loop, new blocks are pulled from free_blocks strictly on-demand. When the request completes, free returns all blocks to the shared pool. In this fixed-size toy pool, no variable-sized gap is stranded; a production runtime must also account for tensors and allocator state outside this KV block pool.

Tracing a small example

Let's walk through what happens when two requests arrive in a manager with block_size = 4 and num_blocks = 20.

Step 1: Request A arrives. allocate("A") pops physical block 19 from free_blocks. The block table becomes {"A": [19]}. The sequence length is 0.

Step 2: Request A generates four tokens. append_slot("A") is called four times. The first three tokens fit into block 19 at offsets 0, 1, and 2. On the fourth call, token_idx = 3, which is still within block 19 (offset 3). After the fourth token, seq_lens["A"] = 4.

Step 3: Request A generates a fifth token. Now token_idx = 4 and 4 % 4 == 0, so the current block is full. append_slot("A") pops a new block (18, the next index off the free stack) from free_blocks, appends it to the block table, and writes the fifth token at offset 0 inside block 18. The block table is now {"A": [19, 18]}, which matches the A (18, 0) line in the output above.

Step 4: Request B arrives. allocate("B") pops block 17. Block table: {"A": [19, 18], "B": [17]}.

Step 5: Request A finishes. free("A") returns blocks 19 and 18 to free_blocks. They are immediately available for the next incoming request, without leaving a variable-sized gap in this fixed-block pool.

This trace shows the core dynamic: blocks are grabbed one at a time, only when needed, and returned instantly when the request completes.

Real engines add two important pieces we omitted for clarity: a reference count per physical block (for shared prefixes) and copy-on-write when a sequence needs to write into a block that's currently shared.

Thinking PagedAttention speeds up attention computation is a common mistake. It doesn't. It speeds up memory allocation and packing, which lets you fit more requests in the same GPU memory. The attention math itself stays the same; what changes is how the K and V tensors are laid out in HBM so the GPU can find them without wasting space.

resolve-logical-block-reads.py
1block_size = 4 2block_table = [19, 18, 5] 3 4def resolve(token_position: int) -> tuple[int, int]: 5 logical_block, slot = divmod(token_position, block_size) 6 return block_table[logical_block], slot 7 8for position in [0, 3, 4, 8, 10]: 9 print(f"token {position}: physical block, slot = {resolve(position)}")
Output
1token 0: physical block, slot = (19, 0) 2token 3: physical block, slot = (19, 3) 3token 4: physical block, slot = (18, 0) 4token 8: physical block, slot = (5, 0) 5token 10: physical block, slot = (5, 2)

vLLM architecture

vLLM combines PagedAttention with a sophisticated scheduler that manages the lifecycle of requests. Because memory is now a shared resource managed much like an operating system's virtual memory, the scheduler must gracefully handle preemption when the GPU's memory capacity is fully exhausted.

vLLM serving architecture showing requests flowing through scheduler, LLM engine, PagedAttention KV manager, GPU block pool, and preemption recovery by recompute or CPU swap. vLLM serving architecture showing requests flowing through scheduler, LLM engine, PagedAttention KV manager, GPU block pool, and preemption recovery by recompute or CPU swap.
PagedAttention makes memory block-based, but scheduler still owns hard production decision: admit, preempt, recompute, or swap.

Preemption: recompute vs. swap

In a highly loaded multi-tenant system, the GPU block pool can still become exhausted. When that happens, the scheduler has to preempt some work and recover later. Two common strategies are:[1]

  1. Recomputing: Discard the preempted request's KV blocks and rebuild them later by rerunning the prompt and prior decode steps. This spends extra FLOPs, but avoids Peripheral Component Interconnect Express (PCIe) transfer overhead.
  2. Swapping: Move the KV blocks from GPU HBM to CPU RAM, then swap them back in later. This preserves previous work, but consumes PCIe bandwidth.

Once GPU space opens up, the request resumes either from swapped-in blocks or from a newly rebuilt cache. Recomputing trades extra FLOPs for simpler recovery, while swapping trades FLOPs for CPU-GPU traffic. Either way, frequent preemption hurts latency, so production systems still try to size batches to avoid it.[1]

Suppose a two-rank deployment serves an 80-layer model with 8 KV heads, head dimension 128, FP16 cache, and an 8K context window. From the table earlier, that cache adds 320 KiB per token, or 2.5 GiB per full request across the model. If the KV heads are evenly sharded over the two ranks, each rank stores 1.25 GiB of that request. Applying the paper's under-4% tail-waste result as an estimate gives about 1.30 GiB per rank.[1]

Do not derive an admission limit from advertised HBM minus model weights alone. Profile how much per-rank memory remains after weights, kernels, communication buffers, graphs, and safety headroom. If that measured KV budget is 8 GiB per rank, only six fully occupied 8K requests fit under the simplified estimate below. PagedAttention does not change this full-context case much; it changes variable-length traffic by allocating blocks for live state rather than reserving every maximum-length tail.

admit-full-context-requests.py
1from math import floor 2 3bytes_per_token = 2 * 80 * 8 * 128 * 2 4context_tokens = 8192 5tp_ranks = 2 6tail_overhead = 1.04 7kv_budget_per_rank_gib = 8.0 8 9per_request_per_rank_gib = bytes_per_token * context_tokens / tp_ranks / (1024 ** 3) 10estimated_with_tail = per_request_per_rank_gib * tail_overhead 11capacity = floor(kv_budget_per_rank_gib / estimated_with_tail) 12 13print(f"per request per rank: {estimated_with_tail:.2f} GiB") 14print(f"admit at most {capacity} full 8K requests under this KV budget")
Output
1per request per rank: 1.30 GiB 2admit at most 6 full 8K requests under this KV budget


Continuous batching (iteration-level batching)

PagedAttention pairs especially well with continuous batching (also called iteration-level scheduling).

Static batching limitations

In traditional static batching, the system waits for NNN requests to arrive, bundles them into a batch, and processes them together until all requests in the batch are completed. Because sequences have variable lengths, the shorter requests finish early, leaving GPU compute units idle while waiting for the longest request to complete.

The continuous batching approach

Continuous batching, made explicit in systems like Orca, reevaluates the active batch at every decode iteration.[5] As soon as one request emits an End-of-Sequence (EOS) token and finishes, the scheduler ejects it, frees its KV cache blocks, and admits another waiting request into the running batch.

By operating at the iteration level rather than the request level, the engine avoids leaving slots idle just because one request finished earlier than the others. Continuous batching and PagedAttention solve different bottlenecks, but they work especially well together: one keeps compute busy, and the other makes memory reuse immediate.

StrategyIdle Time on skewed generationsWhen are requests added?Throughput posture
Static BatchingCan be high while waiting for longest sequenceOnly when the entire previous batch completesBaseline for comparison
Continuous BatchingCan be lowerAt iteration boundaries after requests finishCan improve on skewed workloads

Continuous batching transforms the batch from a static grouping into a sliding window of active computations. Orca reported up to 36.9× higher throughput than FasterTransformer at the same latency target on GPT-3 175B, because late arrivals can join ongoing work and early finishers can leave immediately.[5] PagedAttention makes that scheduling policy much easier to sustain under tight memory budgets.

recycle-blocks-at-iteration-boundaries.py
1from math import ceil 2 3block_size = 16 4live_tokens = {"A": 48, "B": 5} 5 6def blocks(tokens: int) -> int: 7 return ceil(tokens / block_size) 8 9before = sum(blocks(tokens) for tokens in live_tokens.values()) 10del live_tokens["A"] # A emits EOS; scheduler retires it before next iteration. 11live_tokens["C"] = 30 12after = sum(blocks(tokens) for tokens in live_tokens.values()) 13 14print(f"blocks before retire/admit: {before}") 15print(f"blocks after retire A and admit C: {after}") 16assert after <= before
Output
1blocks before retire/admit: 4 2blocks after retire A and admit C: 3

How serving engines build on PagedAttention

Copy-on-write for branches, prefix caching across requests

Block tables make shared physical blocks possible, but they do not by themselves decide which independently arriving requests may reuse state. In the original vLLM design, copy-on-write lets output sequences forked from the same prompt, such as parallel sampling or beam-search candidates, share their prompt blocks until a branch writes divergent state.[1]

Reusing a system prompt across separate user requests requires an additional prefix cache policy: retain completed prefix blocks, match an exactly compatible token prefix, apply isolation keys where needed, and only then map new requests to the cached state. vLLM documents this cross-request feature as Automatic Prefix Caching.[6] The next lesson covers those matching and isolation rules in depth.

Suppose 100 requests are allowed to reuse the same 1000-token system-instruction prefix:

  • Without prefix caching: Store 100 copies of the reusable prefix blocks.
  • With compatible prefix caching: Store one physical prefix copy and map permitted requests to those read-only blocks until their unique continuation begins.

If all 100 requests qualify, that arrangement saves roughly 99% of prefix-block storage in this example. It does not save unique user input or generated-token cache, and it must not cross tenant or adapter boundaries without an isolation policy.

estimate-prefix-block-reuse.py
1from math import ceil 2 3requests = 100 4prefix_tokens = 1000 5block_size = 16 6prefix_blocks = ceil(prefix_tokens / block_size) 7 8without_cache = requests * prefix_blocks 9with_compatible_cache = prefix_blocks 10saved_fraction = 1 - with_compatible_cache / without_cache 11 12print(f"prefix blocks per copy: {prefix_blocks}") 13print(f"saved prefix-block storage: {saved_fraction:.1%}") 14assert saved_fraction == 0.99
Output
1prefix blocks per copy: 63 2saved prefix-block storage: 99.0%

RadixAttention (SGLang)

Newer engines like SGLang extend this with RadixAttention.[7] Instead of discarding KV state after each generation call, the runtime keeps a least-recently used (LRU) cache of KV entries for all requests inside a radix tree.

  • Hit: Reuse existing KV blocks immediately (no recomputation needed).
  • Miss: Compute only the new tokens.

That makes it particularly effective for multi-turn chat and agentic workloads where long prompt prefixes repeat across calls.[7]

FlashInfer and kernel optimization

FlashInfer is a kernel library for serving workloads that accelerates attention over paged KV caches and variable-length sequences.[8] It doesn't replace PagedAttention at the memory-manager level. Instead, it speeds up the GPU kernels that consume those layouts during decode.

PagedAttention vs. FlashAttention

These names are easy to mix up, but they solve different problems.[1][9]

  • PagedAttention: A KV-cache layout and allocation strategy. It places cached K/V tensors in blocks and supports block sharing; a prefix-cache policy decides cross-request reuse.
  • FlashAttention: An IO-aware attention kernel. It tiles attention so the GPU does less wasteful movement between HBM and on-chip memory, especially during full attention computation.[9]

Modern serving stacks often use both or close variants together.[8] PagedAttention makes the cache fit and stay shareable. FlashAttention-style kernels make attention itself faster once those tensors are in place.


Reducing KV cache at the architecture level

PagedAttention optimizes how memory is managed by changing the allocation strategy. However, it doesn't change the actual mathematical size of the tensors. Model architecture changes optimize how much memory is needed in the first place.

Grouped-Query Attention (GQA) is a structural technique that can sharply shrink baseline cache size. Instead of maintaining a separate Key and Value head for every single Query head (as in traditional Multi-Head Attention), GQA groups multiple Query heads together to share a Key and Value head. Ainslie et al. found intermediate group counts recovered much of MHA quality while retaining MQA-like inference benefits after uptraining; other models still require their own quality evaluation.[3]

By sharing KV heads, the memory footprint required for each token drops proportionally to the group size. For instance, in a 70-billion parameter model, moving from 64 distinct KV heads down to 8 (a group size of 8) reduces the per-token KV cache memory by exactly 8 times.

Attention TypeExample KV HeadsKV Cache/tokenEvaluation posture
Multi-Head (MHA)642,560 KiBReference cache size
Grouped-Query (GQA)[3]8320 KiB8x smaller cache; measure task quality
Multi-Query (MQA)[4]140 KiB64x smaller cache; measure task quality

GQA reduces the size of the problem, while PagedAttention solves the allocation of the problem. They are complementary when the selected model architecture and serving runtime support both.


Practical considerations

Confusing the KV cache with the model weights is another common mistake. The weights are static parameters loaded once at startup and shared across every request. The KV cache is per-request state that grows as the conversation proceeds. If you size a GPU for weights alone and forget the cache, you'll hit out-of-memory errors the moment you try to serve more than a handful of long conversations.

1. The prefill vs. decode bottleneck

LLM inference has two distinct phases:[1][10]

  • Prefill: Processing the input prompt. For sufficiently large prompt batches, dense matrix multiplications often make this phase compute-heavy and highly parallelizable.
  • Decode: Generating tokens one by one. At common serving batch sizes, repeatedly loading weights and growing KV state often makes this phase memory-bandwidth-sensitive.

PagedAttention primarily optimizes the Decode phase, where the system rereads a growing KV cache on every token. In practice, many engines still colocate prefills and decodes on the same GPU via continuous batching or chunked prefill, but that creates interference: a large prompt prefill can stall ongoing decode streams for everyone else. That's why newer serving stacks also use chunked prefill and, at larger scale, full prefill/decode disaggregation to keep time-to-first-token (TTFT) and inter-token latency smooth.[5][11][10]

separate-prefill-and-decode-slos.py
1traces = [ 2 {"prompt_tokens": 8000, "ttft_ms": 780, "inter_token_ms": 22}, 3 {"prompt_tokens": 200, "ttft_ms": 48, "inter_token_ms": 23}, 4 {"prompt_tokens": 220, "ttft_ms": 51, "inter_token_ms": 61}, 5] 6 7for trace in traces: 8 if trace["ttft_ms"] > 500: 9 diagnosis = "inspect prefill admission or chunking" 10 elif trace["inter_token_ms"] > 50: 11 diagnosis = "inspect decode pressure and KV budget" 12 else: 13 diagnosis = "within sample thresholds" 14 print(f"{trace['prompt_tokens']} prompt tokens: {diagnosis}")
Output
18000 prompt tokens: inspect prefill admission or chunking 2200 prompt tokens: within sample thresholds 3220 prompt tokens: inspect decode pressure and KV budget

2. Context window limits

Even with PagedAttention, the KV cache still grows linearly with context length. The 80-layer, 8-KV-head, 128-dim GQA example from earlier needs 40 GiB of KV cache at 128K context in FP16. At that scale, KV memory can rival or exceed the room left after loading weights, so long-context serving may require distribution, KV-cache quantization, explicit compression, or smaller admitted contexts.

  • Compression / eviction: Research systems such as SnapKV[12] and H2O[13] reduce the cache by keeping only the tokens that seem most useful for future attention.
  • Quantization: Storing KV cache in FP8 instead of FP16 halves the bytes per cached value, which roughly doubles KV-cache-limited capacity when the cache is the bottleneck.

3. Distributed inference

In Tensor Parallelism (TP), a runtime may shard KV heads across ranks when the model layout permits it. If the number or grouping of KV heads does not divide cleanly across ranks, a runtime may replicate some KV state instead. Capacity planning must use the actual deployed layout, not assume ideal division.

Paged allocation still happens per rank, but logical block growth has to stay aligned across ranks so token positions 0-15, 16-31, and so on map to the correct local KV blocks on every worker.

The physical block IDs don't have to match across GPUs. What matters is consistent logical bookkeeping: each rank must know which local block corresponds to each logical block for its shard of the cache.

Implementations usually synchronize allocation decisions at the scheduler level and maintain per-rank block tables. That lets each GPU run its local attention kernels independently without renegotiating memory layout during every decode step.

avoid-assuming-ideal-kv-sharding.py
1def kv_heads_per_rank(kv_heads: int, tp_ranks: int) -> tuple[int, str]: 2 if kv_heads % tp_ranks == 0: 3 return kv_heads // tp_ranks, "evenly sharded" 4 return kv_heads, "conservative replicated-state budget" 5 6for kv_heads, ranks in [(8, 2), (8, 3), (1, 4)]: 7 local_heads, posture = kv_heads_per_rank(kv_heads, ranks) 8 print(f"{kv_heads} KV heads on TP={ranks}: budget {local_heads} local heads ({posture})")
Output
18 KV heads on TP=2: budget 4 local heads (evenly sharded) 28 KV heads on TP=3: budget 8 local heads (conservative replicated-state budget) 31 KV heads on TP=4: budget 1 local heads (conservative replicated-state budget)

Mastery check

Key concepts

  • KV cache: Stores old Keys and Values so decode does not recompute them on every step.
  • Fragmentation: Lost usable HBM from empty tails, scattered gaps, or worst-case reservations.
  • PagedAttention: Splits cache into fixed-size logical and physical blocks connected by a block table.
  • Continuous batching: Rebuilds active batch each decode step so finished requests free slots immediately.
  • Copy-on-write: Lets forked decoding branches share existing blocks until one branch needs a private write; independent request reuse requires prefix-cache matching.

Evaluation rubric

  • Foundational: Explains why the KV cache speeds decode but still grows with sequence length and active requests.
  • Foundational: Calculates bytes per token from layers, KV heads, head dimension, and precision.
  • Intermediate: Distinguishes internal fragmentation, scattered free gaps, and worst-case reservation waste in a naive contiguous allocator.
  • Intermediate: Explains how PagedAttention uses logical blocks, physical blocks, and a block table to allocate only live cache state.
  • Advanced: Connects PagedAttention to continuous batching, branch copy-on-write, optional cross-request prefix caching, and scheduler-level preemption decisions.
  • Advanced: Separates allocator optimizations from architecture optimizations such as GQA, MQA, and KV-cache quantization.
  • Advanced: Explains why prefill and decode remain different bottlenecks even after PagedAttention.

Follow-up questions

Common pitfalls

KV cache is treated like model weights

Symptom: Capacity planning looks fine on paper, then long chats trigger OOM as soon as concurrency rises. Cause: Static weights were counted, but dynamic per-request KV growth was ignored. Fix: Budget GPU memory for weights and for worst realistic live KV state at your target concurrency.

Fragmentation is blamed on batching alone

Symptom: Throughput stalls even after scheduler tuning. Cause: The batch policy improved, but the allocator still strands large empty tails and gaps. Fix: Evaluate scheduler behavior and allocator behavior separately. PagedAttention helps because it changes memory packing, not because it changes the attention math.

Throughput numbers are quoted without workload shape

Symptom: Benchmarks look amazing, but production gains disappoint. Cause: Reported speedups depended on sequence-length skew, latency target, or decoding workload that did not match your route mix. Fix: Compare serving systems under the same latency target, prompt lengths, output lengths, and concurrency pattern.

Prefill and decode are collapsed into one bottleneck

Symptom: Teams optimize decode memory layout, but TTFT still swings wildly. Cause: Decode is often memory-bandwidth-sensitive, but prefill is a separate compute-heavy phase that can still interfere with live traffic. Fix: Track prefill and decode separately, then decide whether you need chunked prefill, disaggregation, or cache compression.

Swap is assumed to be free recovery

Symptom: Scheduler falls back to CPU swap and latency explodes. Cause: Swapping preserved work, but PCIe transfer and restore time were ignored. Fix: Treat swap as a tradeoff, not a free safety valve. Compare it against recompute under your actual prompt lengths and hardware path.

PagedAttention is expected to shrink bytes per token

Symptom: A team adopts vLLM but still cannot fit the desired context length. Cause: Allocation waste dropped, but the mathematical KV footprint per token stayed the same. Fix: Use GQA, MQA, KV quantization, or cache compression when the per-token cache itself is too large.


Key takeaways

  1. Budget live state: In memory-limited serving configurations, KV cache capacity can limit batch size before raw compute does.
  2. Fragmentation Kills Performance: Naive contiguous allocation can waste roughly 60-80% of KV memory on real traces.[1]
  3. PagedAttention Improves Allocation: Fixed-size non-contiguous blocks avoid variable-sized KV gaps and bound per-sequence tail waste within the block pool.
  4. Sharing needs policy: Copy-on-write supports forked branches; reuse across independent requests additionally requires compatible prefix matching and isolation.
  5. Architecture Still Matters: Techniques like GQA reduce the size of the KV cache, while PagedAttention makes that smaller cache pack efficiently.

You now understand why memory capacity can cap serving throughput, and how PagedAttention turns fragmented KV allocation into a compact block pool. The next step is to reuse cache work across compatible requests. In Prefix Caching and Prompt Caching, you will see how repeated policy text, tool schemas, and shared instructions can avoid repeated prefill work without bypassing isolation rules.

Next Step
Continue to Prefix Caching and Prompt Caching

You now understand how PagedAttention packs live KV state. The next article covers the matching and isolation rules needed to reuse computed prefixes across compatible requests.

PreviousMulti-Query & Grouped-Query Attention
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

Efficient Memory Management for Large Language Model Serving with PagedAttention.

Kwon, W., et al. · 2023 · SOSP 2023

Attention Is All You Need.

Vaswani, A., et al. · 2017

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.

Ainslie, J., et al. · 2023 · EMNLP 2023

Fast Transformer Decoding: One Write-Head is All You Need.

Shazeer, N. · 2019 · arXiv preprint

Orca: A Distributed Serving System for Transformer-Based Generative Models.

Yu, G.-I., et al. · 2022 · OSDI 2022

Automatic Prefix Caching

vLLM · 2026

SGLang: Efficient Execution of Structured Language Model Programs

Zheng, L., Yin, L., Xie, Z., et al. · 2024 · NeurIPS 2024

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.

Ye, Z., et al. · 2025

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.

Zhong, Y., et al. · 2024 · OSDI 2024

Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.

Agrawal, A., et al. · 2023 · arXiv preprint

SnapKV: Compressing KV Cache by Selecting Global Attention Patterns.

Li, Y., et al. · 2024

H2O: Heavy-Hitter Oracle for KV Cache Eviction in LLMs.

Zhang, Z., et al. · 2023