Understand the two-phase inference process (prefill vs decode), derive the KV cache memory formula, and learn production optimizations like chunked prefill and disaggregation.
When you send a prompt to ChatGPT and see the response appear word by word, you're watching inference: the process of a trained model generating output. But why does it stream one word at a time? Why is the first word slow and the rest faster? And why does a longer conversation make everything slower? Understanding these mechanics (the two-phase process, the memory bottleneck, and the key performance metrics) is essential for anyone building or optimizing production LLM systems.
💡 Key insight: LLM inference has two distinct phases (prefill and decode) with fundamentally different hardware bottlenecks. Knowing which phase is the bottleneck, and why, is the foundation for every optimization in this space, from KV cache management to continuous batching.
Every LLM request goes through two distinct computational phases with fundamentally different hardware bottlenecks:
🍽️ Analogy, restaurant kitchen: Prefill is like a chef reading the entire recipe at once, gathering all ingredients, and prepping the mise en place. It's intense upfront work but highly parallelizable (multiple sous chefs can chop simultaneously). Decode is like plating one dish at a time in a specific order: each plate (token) must wait for the previous one, and the bottleneck is how fast you can carry ingredients from the fridge (memory bandwidth), not how fast you can cook (compute).
The model processes your entire input prompt in parallel. Every token is attended to simultaneously in a single forward pass. This is compute-bound, limited by GPU FLOPS, not memory bandwidth. For example, when starting generation, the prefill phase processes the input text to produce the first token:
```text
Input: "Explain quantum computing in simple terms"
→ 6 tokens processed simultaneously
→ Produces KV cache entries for all 6 tokens
→ Produces logits (unnormalized probability scores) for the FIRST output token
```
The time from request to the first output token is called TTFT (Time to First Token). For a user staring at a chat interface, this is the most noticeable latency.
| Use Case | TTFT Target | Why |
|---|---|---|
| Real-time voice | < 150ms | Conversational flow |
| Code completion | < 200ms | Developer productivity |
| Chat/conversational | < 500ms | User patience |
| Batch processing | < 2s | Background job |
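These targets can be checked empirically by timing a token stream. Below is a minimal sketch; `toy_stream` is a hypothetical stand-in for any streaming inference API that yields tokens:

```python
import time

def measure_ttft_and_tps(stream):
    """Time to first token (TTFT) and decode-phase tokens/sec (TPS)."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream:  # any iterable that yields tokens
        n_tokens += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()
    end = time.perf_counter()
    ttft = first_token_at - start
    # TPS is measured over the decode phase only (tokens after the first)
    tps = (n_tokens - 1) / (end - first_token_at) if n_tokens > 1 else 0.0
    return ttft, tps

def toy_stream():
    time.sleep(0.05)          # simulated prefill before the first token
    for token in ["Quantum", "computing", "is", "like"]:
        yield token
        time.sleep(0.01)      # simulated per-token decode step

ttft, tps = measure_ttft_and_tps(toy_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, TPS: {tps:.0f} tok/s")
```

Note that TTFT and TPS must be measured separately: averaging total tokens over total time hides a slow prefill behind a fast decode.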
After producing the first token, the model generates subsequent tokens one at a time, autoregressively. Each new token requires a forward pass through the entire model, but only processes that single new token (reusing cached K/V from all previous tokens). This step-by-step process builds the response iteratively:
```text
Step 1: "Quantum"   → add to KV cache → forward pass → "computing"
Step 2: "computing" → add to KV cache → forward pass → "is"
Step 3: "is"        → add to KV cache → forward pass → "like"
...
```
This phase is memory-bandwidth bound, not compute-bound. The bottleneck is reading the model weights and KV cache from GPU HBM (High Bandwidth Memory) for each token. The matrix multiplications are small (batch=1), so the GPU's arithmetic units are mostly idle, waiting for data to arrive.
The key difference is arithmetic intensity (FLOPs per byte loaded from memory):
| Phase | Tokens Processed | Matrix Size | Arithmetic Intensity | Bottleneck |
|---|---|---|---|---|
| Prefill | All prompt tokens at once | Large batch matmul | High (many FLOPs/byte) | Compute (TFLOPS) |
| Decode | 1 at a time | Thin matmul (batch=1) | Low (few FLOPs/byte) | Memory bandwidth (TB/s) |
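The contrast can be made concrete with a back-of-envelope arithmetic intensity calculation for the core `[batch, d] @ [d, d]` matmul. This is a simplified model (it ignores attention and kernel fusion), with an illustrative hidden size:

```python
def matmul_arithmetic_intensity(batch: int, d: int, dtype_bytes: int = 2) -> float:
    """FLOPs per byte moved for [batch, d] @ [d, d] (the core transformer matmul)."""
    flops = 2 * batch * d * d  # one multiply-accumulate = 2 FLOPs
    # Bytes: load input [batch, d], load weights [d, d], store output [batch, d]
    bytes_moved = dtype_bytes * (batch * d + d * d + batch * d)
    return flops / bytes_moved

d = 8192  # hidden size in the ballpark of a 70B model
prefill = matmul_arithmetic_intensity(batch=4096, d=d)  # whole prompt at once
decode = matmul_arithmetic_intensity(batch=1, d=d)      # one token at a time

print(f"Prefill intensity: {prefill:.0f} FLOPs/byte")  # ~2048: compute-bound
print(f"Decode intensity:  {decode:.1f} FLOPs/byte")   # ~1.0: memory-bound
```

For reference, an H100's "ridge point" is roughly 990 TFLOPS / 3.35 TB/s ≈ 295 FLOPs/byte: prefill sits far above it (compute-bound), decode far below it (memory-bound).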
On an H100 GPU: ~990 TFLOPS (Tera Floating Point Operations Per Second) compute, ~3.35 TB/s HBM bandwidth. During decode, the GPU is reading ~140GB of model weights to process a single token. The computation itself takes microseconds, but loading the weights takes milliseconds. This memory-bound nature is what motivates architectural changes like FlashAttention[1] to minimize HBM access.
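A rough sanity check of those numbers (illustrative only; real kernels overlap weight loads with compute and batch multiple requests together):

```python
# Decode is memory-bound: a lower bound on per-token latency is
# (bytes that must be read from HBM) / (HBM bandwidth).
WEIGHTS_BYTES = 140e9        # 70B parameters in FP16 (2 bytes each)
HBM_BANDWIDTH = 3.35e12      # H100 HBM3 bandwidth, bytes/s
PEAK_FLOPS = 990e12          # H100 peak FP16 tensor-core FLOPs/s

step_time_s = WEIGHTS_BYTES / HBM_BANDWIDTH   # ~42 ms just to stream the weights
max_tokens_per_s = 1 / step_time_s            # ~24 tokens/s ceiling at batch=1

compute_time_s = 2 * 70e9 / PEAK_FLOPS        # ~0.14 ms (~2 FLOPs/param/token)
# Memory traffic dominates compute by roughly 300x: the GPU mostly waits on HBM.
print(f"{step_time_s * 1000:.1f} ms/token, ratio {step_time_s / compute_time_s:.0f}x")
```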
During attention, each layer computes Key and Value projections for every token. Without caching, generating each new token would require recomputing attention over all previous tokens from scratch, making generation quadratic in sequence length.
📝 Analogy, exam reference sheet: The KV cache is like a reference sheet you build during an exam. For each question (token) you've already answered, you write down the key facts (K) and your reasoning (V) on the sheet. When the next question references a previous one, you glance at your reference sheet instead of re-deriving everything from scratch. The sheet grows with each question, and its size determines how many questions you can handle before running out of paper (GPU memory).
The KV cache stores these K and V tensors. As the sequence grows, the KV cache accumulates data for each token to avoid redundant calculations:
```text
Token 1: Compute K₁, V₁ → Store in cache
Token 2: Compute K₂, V₂ → Store; Attend to [K₁,K₂], [V₁,V₂]
Token 3: Compute K₃, V₃ → Store; Attend to [K₁,K₂,K₃], [V₁,V₂,V₃]
...
```
For a single sequence:

KV cache bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × dtype_bytes

Reading the formula: for every layer (num_layers), every KV head (num_kv_heads), and every position in the sequence (seq_len), we store a Key vector and a Value vector (the "2") of dimension head_dim, each value taking dtype_bytes bytes. Multiply it all together and this cache can easily reach gigabytes for long sequences.
Where:
| Parameter | Value |
|---|---|
| Layers (num_layers) | 80 |
| KV heads (num_kv_heads) | 8 (GQA, not 64 query heads!) |
| Head dim (head_dim) | 128 |
| Sequence length (seq_len) | 4,096 |
| Dtype | FP16 (2 bytes) |
2 × 80 × 8 × 128 × 4,096 × 2 bytes ≈ 1.34 GB per sequence

Reading the formula: 2 is for K and V, 80 is the number of layers, 8 is KV heads, 128 is head dimension, 4,096 is sequence length, and the final 2 is FP16 bytes per value.
With GQA (8 KV heads instead of 64), this is 8ร smaller than it would be with MHA. Without GQA, the same model would need 10.7 GB per sequence.
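The arithmetic can be verified directly; this sketch reproduces the 1.34 GB (GQA) and 10.7 GB (hypothetical MHA) figures:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Per-sequence KV cache: 2 (K and V) x layers x KV heads x head dim x length."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Llama 3.1 70B-style configuration at 4,096 tokens of context
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096)

print(f"GQA (8 KV heads):  {gqa / 1e9:.2f} GB")   # 1.34 GB
print(f"MHA (64 KV heads): {mha / 1e9:.2f} GB")   # 10.74 GB
```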
🎯 Production tip: Serving a 70B model to 100 concurrent users requires precise KV cache budgeting: Model weights (140 GB FP16) + KV cache (1.34 GB × 100 users = 134 GB) = 274 GB total. Feasible on 4× H100-80GB with tensor parallelism (splitting the model across multiple GPUs).
In production environments, context length is not a static limit determined solely by the model's architecture. Instead, it is a dynamic memory budget that dictates how many concurrent users your system can support. Every additional token of context required by one user reduces the available VRAM for everyone else.
To serve models at scale, inference engines must strictly enforce these budgets. When a request comes in, the system checks the available GPU memory. If the required KV cache for the new request (plus existing ones) exceeds the remaining capacity, the request must wait in a queue. This is why managing the TTFT-TPS tradeoff is so important for overall throughput.
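A minimal sketch of such an admission check follows. The class and method names here are hypothetical; real engines like vLLM track memory at block granularity rather than whole-request reservations:

```python
from collections import deque

# Per-token KV cost for the 70B configuration above: 2 x 80 layers x 8 KV heads
# x 128 head dim x 2 bytes (FP16)
BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2

class AdmissionController:
    """Admit a request only if its worst-case KV cache fits; otherwise queue it."""

    def __init__(self, free_kv_bytes: int):
        self.free = free_kv_bytes
        self.queue = deque()

    def submit(self, request_id: str, max_tokens: int) -> bool:
        need = max_tokens * BYTES_PER_TOKEN
        if need <= self.free:
            self.free -= need              # reserve KV memory up front
            return True                    # admitted: prefill can start
        self.queue.append((request_id, max_tokens))
        return False                       # queued until memory is released

    def release(self, max_tokens: int):
        self.free += max_tokens * BYTES_PER_TOKEN  # request finished

ctrl = AdmissionController(free_kv_bytes=4 * 10**9)  # 4 GB KV budget
admitted = ctrl.submit("a", max_tokens=8192)   # fits (~2.7 GB): admitted
queued = ctrl.submit("b", max_tokens=8192)     # would exceed the budget: queued
```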
You can dynamically calculate the maximum affordable context length by subtracting the model weights' footprint from total GPU memory, then dividing the remainder by the number of concurrent users and the per-token KV cache size:
```python
def max_context_for_budget(
    gpu_memory_gb: float,
    model_memory_gb: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # FP16
    num_concurrent: int = 1,
) -> int:
    """Calculate maximum context length given memory constraints."""
    available_memory = (gpu_memory_gb - model_memory_gb) * 1e9

    # Memory per token in KV cache
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

    # Divide by concurrent users
    budget_per_user = available_memory / num_concurrent

    return int(budget_per_user / bytes_per_token)

# Example: Llama 3.1 70B on 4×H100-80GB (320GB total)
max_tokens = max_context_for_budget(
    gpu_memory_gb=320,
    model_memory_gb=140,  # FP16 weights
    num_layers=80,
    num_kv_heads=8,  # GQA: 8 KV heads (not 64!)
    head_dim=128,
    num_concurrent=50,
)
# Result: ~11,000 tokens per user with 50 concurrent users
```
This calculation drives critical deployment decisions. If you need to support 100 concurrent users but only have the budget for 5,000 tokens each, you might need to add another GPU node, reduce the model precision using quantization, or implement stricter context window limits at the application layer.
Large prefills block decode operations, creating the TTFT-TPS tradeoff: prioritizing new prefills stalls existing decode streams, while prioritizing decodes delays new requests' first tokens.
Chunked prefill splits long prompts into smaller chunks, interleaving them with decode steps:
🏭 Analogy, factory assembly line: Without chunked prefill, it's like shutting down the entire factory assembly line to set up for a new product. All existing products stop moving while you retool. With chunked prefill, you retool one station at a time while the rest of the line keeps running. Existing requests keep flowing (decode continues) while the new request is gradually set up (prefilled in chunks).
The timeline below illustrates how chunked prefill avoids stalling decodes:
```text
Without chunked prefill:
  [Prefill 10K tokens ===========================] [Decode...Decode...Decode...]
  ↑ All decode requests stall during this prefill

With chunked prefill (chunk=2048):
  [Prefill chunk1][Decode][Prefill chunk2][Decode][Prefill chunk3][Decode]...
  ↑ Decode requests continue between chunks
```
Benefits: Better GPU utilization (mixing compute-bound prefill with memory-bound decode), smoother token streaming, reduced tail latency. Enabled by default in vLLM V1 and discussed in the context of efficient memory management systems like PagedAttention[4].
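The interleaving can be sketched as a toy scheduler. This is a simplification: real schedulers such as vLLM's combine prefill chunks and decode steps in a single batched forward pass rather than alternating them:

```python
def chunked_schedule(prompt_len: int, chunk_size: int, decode_steps: int):
    """Interleave prefill chunks of a new request with decode steps of running ones."""
    timeline = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        timeline.append(f"prefill[{chunk}]")
        remaining -= chunk
        if remaining > 0:
            timeline.append("decode")   # running requests each emit a token
    timeline.extend(["decode"] * decode_steps)  # decode continues after prefill ends
    return timeline

# A 10K-token prompt split into 2048-token chunks: decodes run between chunks
schedule = chunked_schedule(prompt_len=10_000, chunk_size=2048, decode_steps=3)
print(schedule)
```

With a single monolithic prefill, every running request would wait the full 10K-token prefill; here the longest decode stall is one 2048-token chunk.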
Modern systems (Splitwise[5], DistServe[6], Mooncake[7]) separate prefill and decode onto different GPU pools: a prefill pool processes incoming prompts, then ships the resulting KV cache over the interconnect to a decode pool that streams out tokens.

Why this works: the two phases have opposite bottlenecks. The prefill pool can be provisioned and batched for compute throughput, the decode pool for memory bandwidth and large batch sizes, and neither phase can stall the other.
Store KV cache in INT8 or even INT4 instead of FP16 to halve/quarter memory with minimal quality loss (see our model quantization deep-dive for the techniques behind weight and activation quantization):

2 × 80 × 8 × 128 × 4,096 × 1 byte ≈ 0.67 GB per sequence (vs 1.34 GB in FP16)

Reading the formula: INT8 uses 1 byte per value instead of 2 bytes (FP16), so KV memory is cut in half for the same sequence length and concurrency.
This doubles the number of concurrent users you can serve on the same hardware.
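A minimal per-tensor INT8 quantization sketch using NumPy. Production KV quantization typically uses per-head or per-channel scales for better accuracy, but the memory arithmetic is the same:

```python
import numpy as np

def quantize_kv_int8(kv: np.ndarray):
    """Symmetric per-tensor quantization: FP16 values -> INT8 plus one scale."""
    scale = np.abs(kv).max() / 127.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float16) * scale

# One token's K (or V) entries: [kv_heads, head_dim]
kv = np.random.randn(8, 128).astype(np.float16)
q, scale = quantize_kv_int8(kv)

print(f"FP16: {kv.nbytes} bytes, INT8: {q.nbytes} bytes")  # memory halved
err = np.abs(dequantize_kv(q, scale) - kv).max()            # small round-trip error
```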
โ "LLMs generate tokens one at a time" Only true for the decode phase. Prefill processes the entire prompt in parallel.
โ "More GPUs = faster generation" For a single request, model parallelism adds communication latency. Faster generation requires higher memory bandwidth, not more compute.
โ "Context length is just a model limitation" In production, context length is a memory budget decision. Shorter contexts = more concurrent users.
โ "TTFT and TPS improve together" They're often in tension. Optimizing TTFT (prioritizing prefills) hurts TPS (stalling decodes), and vice versa.
[1] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022
[2] Orca: A Distributed Serving System for Transformer-Based Generative Models. Yu, G.-I., et al. · 2022 · OSDI 2022
[3] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Ainslie, J., et al. · 2023 · EMNLP 2023
[4] Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP 2023
[5] Splitwise: Efficient Generative LLM Inference Using Phase Splitting. Patel, P., et al. · 2023
[6] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. Zhong, Y., et al. · 2024 · OSDI 2024
[7] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. Qin, Y., et al. · 2024