Understand the two-phase inference process (prefill vs decode), derive the KV cache memory formula, and learn production optimizations like chunked prefill and disaggregation.
When you send a message to ChatGPT, you notice something peculiar about how the response appears. There's often a brief pause, then the first word appears, followed by the rest streaming out word by word. Why that initial pause? Why does it stream one word at a time instead of appearing instantly? And why do longer conversations eventually feel slower?
This behavior isn't a quirk of the interface. It's the physics of LLM inference: the process of running a trained model to generate text. Understanding these mechanics (two-phase nature of generation, memory bottlenecks, and how engineers measure performance) is essential for anyone building or optimizing production AI systems.
Key insight: LLM inference has two distinct phases (prefill and decode) with fundamentally different hardware bottlenecks. Knowing which phase is the bottleneck, and why, is the foundation for every optimization in this space, from Key-Value (KV) cache management to continuous batching.
Every LLM request goes through two distinct computational phases with fundamentally different hardware bottlenecks:
Analogy (restaurant kitchen): Prefill is like a chef reading the entire recipe at once, gathering all ingredients, and prepping the mise en place. It's intense upfront work but highly parallelizable (multiple sous chefs can chop simultaneously). Decode is like plating one dish at a time in a specific order: each plate (token) must wait for the previous one, and the bottleneck is how fast you can carry ingredients from the fridge (memory bandwidth), not how fast you can cook (compute).
The diagram below illustrates the sequential dependency between the highly parallel prefill phase and the autoregressive decode phase:
The model processes your entire input prompt in parallel. Every token is attended to simultaneously in a single forward pass. This is compute-bound, limited by GPU Floating Point Operations Per Second (FLOPS), not memory bandwidth. For example, when starting generation, the prefill phase processes the input text to produce the first token. The following trace shows how a short input sentence is transformed into initial states and the first prediction:
```text
Input: "Explain quantum computing in simple terms"
→ 6 tokens processed simultaneously
→ Produces KV cache entries for all 6 tokens
→ Produces logits (unnormalized probability scores) for the FIRST output token
```
The time from request to the first output token is called TTFT (Time to First Token). For a user staring at a chat interface, this is the most noticeable latency.
| Use Case | TTFT Target | Why |
|---|---|---|
| Real-time voice | < 150ms | Conversational flow |
| Code completion | < 200ms | Developer productivity |
| Chat/conversational | < 500ms | User patience |
| Batch processing | < 2s | Background job |
After producing the first token, the model generates subsequent tokens one at a time, autoregressively. Each new token requires a forward pass through the entire model, but only processes that single new token (reusing cached K/V from all previous tokens). This step-by-step process builds the response iteratively by taking the previously generated token, adding it to the ongoing cache, and predicting the next word. The sequence below demonstrates how the model constructs a phrase one token at a time:
```text
Step 1: "Quantum"   → add to KV cache → forward pass → "computing"
Step 2: "computing" → add to KV cache → forward pass → "is"
Step 3: "is"        → add to KV cache → forward pass → "like"
...
```
This phase is memory-bandwidth bound, not compute-bound. The bottleneck is reading the model weights and KV cache from GPU HBM (High Bandwidth Memory) for each token. The matrix multiplications are small (batch=1), so the GPU's arithmetic units are mostly idle, waiting for data to arrive.
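To make the loop concrete, here's a minimal sketch of greedy decoding against a KV cache. The `model_forward` function and its signature are hypothetical stand-ins for a real engine's forward pass; what matters is the structure: one parallel pass over the prompt (prefill), then one pass per generated token, each reusing the cache built so far.

```python
def greedy_decode(model_forward, prompt_tokens, max_new_tokens, eos_token_id):
    """Minimal autoregressive decode loop.

    Assumes a hypothetical API:
        model_forward(tokens, kv_cache) -> (logits_for_last_position, updated_kv_cache)
    """
    # Prefill: process the entire prompt in one pass, building the initial KV cache
    logits, kv_cache = model_forward(prompt_tokens, kv_cache=None)

    generated = []
    for _ in range(max_new_tokens):
        next_token = max(range(len(logits)), key=lambda i: logits[i])  # greedy argmax
        if next_token == eos_token_id:
            break
        generated.append(next_token)
        # Decode: feed only the newest token; K/V for all earlier tokens come from the cache
        logits, kv_cache = model_forward([next_token], kv_cache=kv_cache)
    return generated
```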
The key difference is arithmetic intensity (FLOPs per byte loaded from memory):
| Phase | Tokens Processed | Matrix Size | Arithmetic Intensity | Bottleneck |
|---|---|---|---|---|
| Prefill | All prompt tokens at once | Large batch matmul | High (many FLOPs/byte) | Compute (TFLOPS) |
| Decode | 1 at a time | Thin matmul (batch=1) | Low (few FLOPs/byte) | Memory bandwidth (TB/s) |
On an H100 GPU: ~990 TFLOPS (TeraFLOPS) compute, ~3.35 TB/s HBM bandwidth. During decode, the GPU is reading ~140GB of model weights to process a single token. The computation itself takes microseconds, but loading the weights takes milliseconds. This memory-bound nature is what motivates architectural changes like FlashAttention[1] to minimize intermediate HBM access during prefill, and PagedAttention[2] for efficient KV cache management during decode.
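A back-of-the-envelope calculation makes the imbalance concrete. The sketch below reuses the H100 figures quoted above and assumes roughly 2 FLOPs per parameter per generated token, a common rule of thumb rather than a measured number:

```python
# Rough per-token decode cost for a 70B-parameter model in FP16 on an H100
params = 70e9                  # model parameters
weight_bytes = params * 2      # FP16: 2 bytes per parameter (~140 GB)
flops_per_token = 2 * params   # ~2 FLOPs per parameter per token (rule of thumb)

peak_flops = 990e12            # ~990 TFLOPS compute (figure quoted above)
hbm_bandwidth = 3.35e12        # ~3.35 TB/s HBM bandwidth, in bytes/sec

compute_time = flops_per_token / peak_flops  # ~0.14 ms of arithmetic
memory_time = weight_bytes / hbm_bandwidth   # ~42 ms just to stream the weights once

print(f"compute: {compute_time * 1e3:.2f} ms, memory: {memory_time * 1e3:.2f} ms")
# Memory traffic dominates by roughly 300x: decode is memory-bandwidth bound.
```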
To evaluate and optimize an inference system, engineers rely on four standard metrics that capture different parts of the user experience and system throughput. Balancing these metrics often involves direct tradeoffs.
TTFT measures the prefill latency: specifically, how quickly the model processes the initial prompt and generates the very first output token. Because the entire prompt is processed in parallel, TTFT is primarily dominated by the prompt's length and the model's overall size. For long prompts, TTFT scales roughly linearly with the token count.
This metric is highly visible to users. It's critical for interactive applications like chat interfaces, voice assistants, and real-time code completion, where a delay of even a second can feel sluggish.
TPS (Tokens Per Second, also called "decode throughput") measures how fast the model generates subsequent output tokens after the first token is produced. This metric is constrained by memory bandwidth rather than compute power.
For a single request on a high-end GPU, TPS typically ranges between 30 and 80 tokens per second. However, in production environments, systems use batched inference (often using continuous batching[3]) to process multiple requests concurrently, yielding hundreds or thousands of aggregate TPS across the system.
ITL (Inter-Token Latency) represents the time elapsed between generating consecutive output tokens. For a single request running in isolation, ITL is simply the inverse of TPS ($\text{ITL} = 1/\text{TPS}$).
In real-world applications, human readers perceive stuttering or unnatural pauses when ITL exceeds roughly 100 milliseconds. Under heavy batching, ITL can increase because more requests are contending for the same GPU resources.
While ITL measures the raw time between tokens from the user's perspective, TPOT (Time Per Output Token) is a system-level metric that accounts for scheduling overhead, queueing delays, and batching contention.
In an actively serving system, TPOT is typically greater than or equal to ITL. Monitoring TPOT helps engineers understand when the inference engine is overloaded and scheduling delays are impacting the overall output rate.
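All four metrics can be derived from a single trace of token arrival times. The sketch below assumes you've recorded the request submission time plus a timestamp for each streamed token; it's illustrative plumbing, not tied to any particular serving framework, and note that exact TPOT conventions vary between stacks (here it spreads the full generation time over all output tokens, which is why it ends up ≥ ITL when TTFT or queueing delays are significant):

```python
def latency_metrics(request_time: float, token_times: list[float]) -> dict:
    """Derive TTFT, mean ITL, TPS, and a TPOT-style average from token timestamps (seconds)."""
    ttft = token_times[0] - request_time                         # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0                 # mean inter-token latency
    decode_time = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / decode_time if decode_time else 0.0
    tpot = (token_times[-1] - request_time) / len(token_times)   # includes prefill + queueing time
    return {"ttft_s": ttft, "itl_s": itl, "tps": tps, "tpot_s": tpot}

# Example: first token arrives after 400 ms, then one token every 25 ms
times = [0.4 + 0.025 * i for i in range(41)]
print(latency_metrics(request_time=0.0, token_times=times))
# {'ttft_s': 0.4, 'itl_s': 0.025, 'tps': 40.0, 'tpot_s': ~0.034}
```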
For a single decode stream, TPS is governed by memory bandwidth. To generate one token, the GPU must load the full model weights from HBM. A rough estimate:

$$\text{TPS}_{\text{max}} \approx \frac{\text{HBM bandwidth}}{\text{model weight bytes}}$$

For a 70B model in FP16 on an H100 (3.35 TB/s HBM bandwidth, ~140 GB weights):

$$\text{TPS}_{\text{max}} \approx \frac{3.35\ \text{TB/s}}{140\ \text{GB}} \approx 24\ \text{tokens/sec}$$
This is the theoretical ceiling. Real-world single-stream throughput lands around 10-25 tokens/sec for large models because the simple calculation omits KV cache reads during attention, activation memory, kernel launch overhead, and the fact that the full model weights can't be kept in on-chip SRAM between steps. Highly optimized engines (e.g., TensorRT-LLM) can push single-stream throughput toward the upper end of this range, while batched inference on the same hardware yields much higher aggregate throughput by sharing weight loads across concurrent requests.
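The same bandwidth-ceiling logic shows why batching raises aggregate throughput so dramatically: the weights are streamed once per decode step and shared by every request in the batch, while each request only adds its own KV cache reads. The sketch below reuses the 70B/GQA figures used throughout this article (~140 GB of weights, ~1.34 GB of KV cache per 4K-token sequence); it's a rough ceiling estimate, not a measurement:

```python
def aggregate_decode_tps(batch_size, weight_gb=140.0, kv_gb_per_req=1.34, hbm_tbps=3.35):
    """Bandwidth-ceiling estimate of aggregate decode throughput (tokens/sec).

    Assumes each decode step streams the model weights once (shared by the batch)
    plus every request's KV cache, and ignores compute and overlap effects.
    """
    bytes_per_step = (weight_gb + batch_size * kv_gb_per_req) * 1e9
    steps_per_sec = hbm_tbps * 1e12 / bytes_per_step
    return batch_size * steps_per_sec

for bs in (1, 8, 32, 64):
    print(f"batch={bs:>2}: ~{aggregate_decode_tps(bs):,.0f} tok/s aggregate")
# batch= 1: ~24 tok/s, batch=64: ~950 tok/s — throughput scales until KV traffic
# (and KV memory capacity) becomes the limiting factor.
```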
🔬 Research insight: This memory-bandwidth bound is fundamental to transformer inference and is why every major serving optimization (PagedAttention[2], continuous batching[3], quantization) ultimately attacks the same bottleneck: reducing bytes moved per token or increasing effective HBM bandwidth.
In production deployments, the true limit on how many users a system can serve concurrently is almost always memory capacity, specifically the memory required to store the KV cache.
During attention, each layer computes Key and Value projections for every token. Without caching, generating token $t$ would require recomputing the keys and values for all previous tokens from scratch at every step, making generation quadratic in sequence length.
💡 Analogy (exam reference sheet): The KV cache is like a reference sheet you build during an exam. For each question (token) you've already answered, you write down the key facts (K) and your reasoning (V) on the sheet. When the next question references a previous one, you glance at your reference sheet instead of re-deriving everything from scratch. The sheet grows with each question, and its size determines how many questions you can handle before running out of paper (GPU memory).
The KV cache stores these K and V tensors. As the sequence grows, the KV cache accumulates data for each token to avoid redundant calculations. This step-by-step accumulation allows the model to compute attention for only the newest token against the historical cache. The trace below shows how the cache expands with each generated token:
```text
Token 1: Compute K₁, V₁ → Store in cache
Token 2: Compute K₂, V₂ → Store; Attend to [K₁,K₂], [V₁,V₂]
Token 3: Compute K₃, V₃ → Store; Attend to [K₁,K₂,K₃], [V₁,V₂,V₃]
...
```
To manage this growing memory dynamically without fragmentation, systems like vLLM use PagedAttention[2], which divides the KV cache into fixed-size blocks (pages) similar to operating system virtual memory.
For a single sequence:

$$\text{KV cache size (bytes)} = 2 \times n_{\text{layers}} \times n_{\text{kv}} \times d_{\text{head}} \times L \times b$$

Reading the formula: for every layer ($n_{\text{layers}}$), every KV head ($n_{\text{kv}}$), and every position in the sequence ($L$), we store a Key vector and a Value vector (the "2") of dimension $d_{\text{head}}$, each taking $b$ bytes. Multiply it all together and this cache can easily reach gigabytes for long sequences.

Where, for a 70B-class model with GQA:
| Parameter | Value |
|---|---|
| Layers ($n_{\text{layers}}$) | 80 |
| KV heads ($n_{\text{kv}}$) | 8 (GQA, not 64 query heads!) |
| Head dim ($d_{\text{head}}$) | 128 |
| Sequence length ($L$) | 4,096 |
| Dtype ($b$) | FP16 (2 bytes) |
Here's the breakdown: the formula multiplies 2 (for K and V) by 80 layers, 8 KV heads, a head dimension of 128, a sequence length of 4,096, and 2 bytes per value (FP16), which comes to roughly 1.34 GB per sequence. With GQA[4] (8 KV heads instead of 64), this is 8× smaller than it would be with Multi-Head Attention (MHA). Without GQA, the same model would need 10.7 GB per sequence.
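As a quick sanity check, the arithmetic below reproduces those per-sequence figures directly from the table's parameters:

```python
# Per-sequence KV cache for the 70B-class example above (FP16, 4,096-token context)
kv_bytes = 2 * 80 * 8 * 128 * 4096 * 2   # 2 (K and V) x layers x KV heads x head dim x seq len x bytes
print(kv_bytes / 1e9)                    # ~1.34 GB with GQA (8 KV heads)
print(kv_bytes * 64 / 8 / 1e9)           # ~10.7 GB if every one of the 64 query heads had its own K/V (MHA)
```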
🎯 Production tip: Serving a 70B model to 100 concurrent users requires precise KV cache budgeting: Model weights (140 GB FP16) + KV cache (1.34 GB × 100 users = 134 GB) = 274 GB total. Feasible on 4× H100-80GB with tensor parallelism (splitting the model across multiple GPUs).
In production environments, context length isn't a static limit determined solely by the model's architecture. Instead, it's a dynamic memory budget that dictates how many concurrent users your system can support. Every additional token of context required by one user reduces the available GPU Video Random Access Memory (VRAM) for everyone else.
To serve models at scale, inference engines must strictly enforce these budgets. When a request comes in, the system checks the available GPU memory. If the required KV cache for the new request (plus existing ones) exceeds the remaining capacity, the request must wait in a queue. This is why managing the TTFT-TPS tradeoff is so important for overall throughput.
To calculate the maximum affordable context length, we can write a simple capacity planning function. The function takes the total GPU memory, the model's static weight footprint, and its architectural parameters (layers, KV heads, and head dimension) as inputs. It computes the available memory per user and divides it by the per-token KV cache size, returning the maximum number of context tokens each user can be allotted:
```python
def max_context_for_budget(
    gpu_memory_gb: float,
    model_memory_gb: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # FP16
    num_concurrent: int = 1,
) -> int:
    """Calculate maximum context length given memory constraints."""
    available_memory = (gpu_memory_gb - model_memory_gb) * 1e9

    # Memory per token in KV cache
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

    # Divide by concurrent users
    budget_per_user = available_memory / num_concurrent

    return int(budget_per_user / bytes_per_token)


# Example: Qwen3.5 72B on 4×H100-80GB (320GB total)
max_tokens = max_context_for_budget(
    gpu_memory_gb=320,
    model_memory_gb=140,  # FP16 weights
    num_layers=80,
    num_kv_heads=8,  # GQA: 8 KV heads (not 64!)
    head_dim=128,
    num_concurrent=50,
)
# Result: ~11,000 tokens per user with 50 concurrent users
```
This calculation drives critical deployment decisions. If you need to support 100 concurrent users but only have the budget for 5,000 tokens each, you might need to add another GPU node, reduce the model precision using quantization, or implement stricter context window limits at the application layer.
As the mechanics of inference reveal, balancing compute-bound prefilling and memory-bound decoding is challenging. Engineering teams use advanced serving techniques to smooth out these tradeoffs and maximize hardware utilization.
Large prefills block decode operations, causing a TTFT-TPS tradeoff: prioritizing new prefills stalls existing decode streams.
Chunked prefill splits long prompts into smaller chunks, interleaving them with decode steps:
💡 Analogy (factory assembly line): Without chunked prefill, it's like shutting down the entire factory assembly line to set up for a new product. All existing products stop moving while you retool. With chunked prefill, you retool one station at a time while the rest of the line keeps running. Existing requests keep flowing (decode continues) while the new request is gradually set up (prefilled in chunks).
The timeline below illustrates how chunked prefill avoids stalling decodes by breaking up the massive prefill block. By interleaving smaller prefill chunks with ongoing decode steps, the system maintains a steady flow of output tokens for existing users while gradually processing the new prompt:
```text
Without chunked prefill:
  [Prefill 10K tokens ===========================][Decode...Decode...Decode...]
  ↑ All decode requests stall during this prefill

With chunked prefill (chunk=2048):
  [Prefill chunk1][Decode][Prefill chunk2][Decode][Prefill chunk3][Decode]...
  ↑ Decode requests continue between chunks
```
The payoff: better GPU utilization (mixing compute-bound prefill with memory-bound decode), smoother token streaming, and reduced tail latency. Chunked prefill is enabled by default in vLLM V1 and formalized in systems like Sarathi-Serve[5].
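The scheduling idea behind chunked prefill can be sketched as a toy loop: every engine step has a token budget, which is filled first with one token for each running decode request and then with a slice of the oldest pending prefill. This is a simplification for intuition, not the actual scheduler of vLLM or Sarathi-Serve:

```python
def schedule_step(decode_reqs, prefill_queue, token_budget=2048):
    """One iteration of a toy chunked-prefill scheduler.

    decode_reqs:   ids of requests currently in the decode phase (1 token each per step)
    prefill_queue: [request_id, remaining_prompt_tokens] entries awaiting prefill
    """
    # Decode work is tiny but latency-critical, so it is admitted first.
    budget = token_budget - len(decode_reqs)
    work = {"decode": list(decode_reqs), "prefill": {}}

    # Spend the leftover budget on a chunk of the oldest waiting prefill.
    if budget > 0 and prefill_queue:
        req_id, remaining = prefill_queue[0]
        chunk = min(budget, remaining)
        work["prefill"][req_id] = chunk
        prefill_queue[0][1] -= chunk
        if prefill_queue[0][1] == 0:
            prefill_queue.pop(0)
            decode_reqs.append(req_id)  # prompt fully processed: request starts decoding
    return work

# A 10K-token prompt arrives while four requests are already decoding:
decoding, waiting = ["r1", "r2", "r3", "r4"], [["r5", 10_000]]
for step in range(6):
    print(step, schedule_step(decoding, waiting))
# Existing requests keep emitting a token every step while r5's prompt is
# prefilled roughly 2K tokens at a time, instead of stalling them behind one huge prefill.
```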
In a standard setup, a single GPU handles both the prefill and decode phases for its assigned requests. However, because prefill is compute-bound and decode is memory-bandwidth bound, using the same hardware for both leads to resource underutilization. Modern systems (Splitwise[6], DistServe[7], Mooncake[8]) solve this by separating prefill and decode onto different GPU pools: dedicated prefill workers process incoming prompts and hand the resulting KV cache off to decode workers that stream out tokens, so each pool can be provisioned and scaled for its own bottleneck.
Store the KV cache in INT8 or even INT4 instead of FP16 to halve or quarter its memory with minimal quality loss (see our model quantization deep-dive for the techniques behind weight and activation quantization):

$$\text{KV cache (INT8)} = 2 \times n_{\text{layers}} \times n_{\text{kv}} \times d_{\text{head}} \times L \times 1\ \text{byte} = \tfrac{1}{2} \times \text{KV cache (FP16)}$$

Reading the formula: INT8 uses 1 byte per value instead of 2 bytes (FP16), so KV memory is cut in half for the same sequence length and concurrency.
While quantizing weights reduces the static memory footprint of the model, quantizing the KV cache specifically attacks the dynamic memory bottleneck that limits concurrency. Modern serving engines apply this quantization on-the-fly, converting activations to INT8 before storing them in the cache and dequantizing them back to higher precision during the attention computation. This immediately doubles the number of concurrent users you can serve on the same hardware.
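A minimal version of this on-the-fly quantization can be sketched with a single per-tensor scale: compute the scale from the tensor's maximum magnitude, store the cache as INT8 plus that scale, and dequantize when attention needs the values. Real engines use finer-grained (per-channel or per-head) scales and fused kernels; this sketch only shows the core idea:

```python
import numpy as np

def quantize_kv(kv: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize a K or V tensor to INT8 with one per-tensor scale."""
    scale = float(np.abs(kv).max()) / 127.0 or 1.0       # guard against an all-zero tensor
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    """Restore an approximate FP16 tensor for the attention computation."""
    return q.astype(np.float16) * np.float16(scale)

# One layer's worth of cached keys: [seq_len, kv_heads, head_dim]
k = np.random.randn(4096, 8, 128).astype(np.float16)
k_q, k_scale = quantize_kv(k)
print(f"{k.nbytes / 1e6:.1f} MB -> {k_q.nbytes / 1e6:.1f} MB")   # ~8.4 MB -> ~4.2 MB per layer
err = np.abs(dequantize_kv(k_q, k_scale).astype(np.float32) - k.astype(np.float32)).mean()
print(f"mean abs error: {err:.4f}")
```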
When working with LLM inference, it's easy to misunderstand where the bottlenecks actually lie. Here are a few common pitfalls to avoid.
A common misconception is that LLMs are purely sequential, generating everything strictly one token at a time. That is only true for the decode phase. During the prefill phase, the entire prompt is processed in parallel: all input tokens are passed through the transformer layers simultaneously to compute their initial Key and Value states, which constructs the starting KV cache. Assuming generation is entirely sequential ignores the massive burst of parallel compute that happens before the first output token is ever produced.
Another misconception is that adding GPUs makes a single request generate faster. Throwing more compute at the problem doesn't linearly speed up token generation for a single request. During decode, the bottleneck is loading weights from memory (memory bandwidth), not arithmetic operations. In fact, for a single request, excessive model parallelism across multiple GPUs can actually slow down generation because the communication overhead (passing intermediate tensors between GPUs via NVLink) outweighs the minor compute gains. Faster single-stream generation requires GPUs with higher memory bandwidth (e.g., H200 vs H100), not simply adding more GPUs.
It's also tempting to treat the model's advertised context window as the practical limit. While models are trained with a maximum context window (e.g., 1M+ tokens for GPT-5.4), in production, context length is primarily a memory budget decision. Every token held in the context window consumes physical VRAM on the GPU in the form of the KV cache. If you allow users to use the full 1M+ context, you can serve drastically fewer concurrent users on the same hardware. Engineering teams often cap context lengths well below the model's theoretical maximum to ensure sufficient memory is available for high concurrency.
Finally, TTFT and TPS can't be optimized independently: the two metrics are often in direct tension. Optimizing for TTFT generally means dedicating large, uninterrupted blocks of GPU time to process incoming prefill requests as fast as possible. However, doing so stalls any ongoing decode requests, causing the system's overall TPS (and the user's inter-token latency) to suffer. Conversely, prioritizing smooth token streaming (high TPS) means delaying new prefills. System designers must carefully tune scheduling policies like chunked prefill to balance these competing requirements.
1. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022
2. Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP 2023
3. Orca: A Distributed Serving System for Transformer-Based Generative Models. Yu, G.-I., et al. · 2022 · OSDI 2022
4. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Ainslie, J., et al. · 2023 · EMNLP 2023
5. Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. Agrawal, A., et al. · 2023 · arXiv preprint
6. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. Patel, P., et al. · 2023
7. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. Zhong, Y., et al. · 2024 · OSDI 2024
8. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. Qin, Y., et al. · 2024