Understand the two-phase inference process (prefill vs decode), derive the KV cache memory formula, and learn production optimizations like chunked prefill and prefill/decode disaggregation.
The previous chapter showed how multi-agent systems turn one product action into planning, retrieval, tool calls, review, and several model calls. This chapter zooms into one of those calls. Once a request reaches an inference engine, three terms matter immediately: time to first token (TTFT), tokens per second (TPS), and the key-value (KV) cache that grows with the conversation.
When you send a message to ChatGPT, you notice something peculiar about how the response appears. There's often a brief pause, then the first visible text appears, followed by the rest streaming out in small chunks. Why that initial pause? Why does it stream token by token instead of appearing instantly? And why do longer conversations eventually feel slower?
This behavior isn't a quirk of the interface. It's the physics of LLM inference: the process of running a trained model to generate text. Understanding these mechanics (two-phase nature of generation, memory bottlenecks, and how engineers measure performance) is important for anyone building or optimizing production AI systems. Longer conversations often feel slower for a concrete reason: each decode step has to read a growing KV cache on top of the model weights.
Key insight: LLM inference has two distinct phases (prefill and decode) that usually emphasize different hardware bottlenecks. Knowing which phase is the bottleneck, and why, is the foundation for major optimizations in this space, from Key-Value (KV) cache management to continuous batching.
Every LLM request goes through two distinct computational phases with fundamentally different hardware bottlenecks:
Analogy (fulfillment desk): Prefill is like reading the entire order history, return policy, and carrier notes at once before the first reply. It's intense upfront work but highly parallelizable. Decode is like scanning one outbound box at a time in strict order: each new token waits for the previous token, and the bottleneck is how fast the system can fetch stored state from memory.
The diagram below illustrates the sequential dependency between the highly parallel prefill phase and the autoregressive decode phase:
In an unchunked baseline, the model processes the input prompt in parallel during prefill. Prompt tokens participate in large matrix operations, so the GPU can run dense matrix multiplies efficiently. For long prompts on modern accelerators, this phase is usually compute-bound, limited more by available FLOPs than by memory bandwidth. Later in this lesson, chunked prefill deliberately divides that work for scheduling reasons. A toy trace looks like this:
```text
Input: "Explain why order 102 is delayed"
→ Tokenize prompt
→ Process all prompt tokens in one forward pass
→ Produce KV cache entries for the prompt
→ Produce logits (unnormalized probability scores) for the FIRST output token
```

The time from request to the first output token is called TTFT (Time to First Token). For a user staring at a chat interface, this is the most noticeable latency. Do not copy one generic latency target into every product; establish a service-level objective (SLO) from the interaction mode and measured user tolerance.
| Use Case | Primary pressure | What to measure |
|---|---|---|
| Real-time voice | Turn-taking delay | End-to-end TTFT and audio pipeline overhead |
| Code completion | Interruption to typing | Tail TTFT for short prompts |
| Chat/conversational | Visible waiting | TTFT plus streamed inter-token latency (ITL) |
| Batch processing | Job completion | Throughput and cost, ahead of TTFT |
After producing the first token, the model generates subsequent tokens one at a time, autoregressively. Each new token requires a forward pass through the entire model, but only processes that single new token (reusing cached K/V from all previous tokens). The new token is appended to the running sequence, added to the KV cache, and fed back into the model to predict the next token. The trace below shows how the response grows:
```text
Step 1: Output so far: "Order" → add to KV cache → forward pass → next token: "102"
Step 2: Output so far: "Order 102" → add to KV cache → forward pass → next token: "is"
Step 3: Output so far: "Order 102 is" → add to KV cache → forward pass → next token: "delayed"
...
```

For a single decode stream, this phase is usually memory-bandwidth bound, not compute-bound. The bottleneck is reading the model weights and KV cache from GPU HBM (High Bandwidth Memory) for each token. The matrix multiplications are thin relative to the amount of data that must be moved, so the GPU's arithmetic units often spend more time waiting for bytes than doing math. As the response grows, the attention kernel also has to read a larger cached prefix, so per-token latency tends to rise with sequence length even when the model weights stay fixed.
The following small timeline separates first-token latency from decode cadence. It is intentionally a measurement exercise, not a model benchmark.
```python
arrival_ms = 0
token_times_ms = [320, 355, 392, 428]  # arrival time of each output token

# TTFT: request arrival to the first output token
ttft_ms = token_times_ms[0] - arrival_ms
# ITL: gap between each pair of consecutive output tokens
itls_ms = [
    current - previous
    for previous, current in zip(token_times_ms, token_times_ms[1:])
]
mean_itl_ms = sum(itls_ms) / len(itls_ms)
tps = 1000 / mean_itl_ms

print("TTFT:", ttft_ms, "ms")
print("decode ITLs:", itls_ms, "ms")
print(f"decode TPS: {tps:.1f}")
```

```text
TTFT: 320 ms
decode ITLs: [35, 37, 36] ms
decode TPS: 27.8
```

The key difference between the two phases is arithmetic intensity: the number of floating-point operations (FLOPs) the GPU can perform per byte of data it must fetch from high-bandwidth memory (HBM).
Why prefill has high arithmetic intensity. During prefill the model loads each weight matrix (the ~140 GB of parameters) once and reuses it across the entire prompt batch of tokens. The expensive memory traffic is amortized over a large number of matrix multiplications. The GPU's tensor cores stay saturated with useful work; the bottleneck becomes raw compute throughput (TFLOPS, trillions of floating-point operations per second).
Why decode has low arithmetic intensity. For each new output token, the model effectively streams through the weight tensor (~140 GB for a 70B-class BF16/FP16 model) to perform what is close to a matrix-vector product when the effective batch is small. The number of FLOPs per byte loaded collapses. The GPU's arithmetic units spend much of their time waiting for the next wave of weights and KV cache entries to arrive from HBM. Memory bandwidth (TB/s) therefore becomes the limiting factor.
| Phase | Tokens Processed | Effective Batch | Arithmetic Intensity | Bottleneck |
|---|---|---|---|---|
| Prefill | All prompt tokens | Large (full prompt) | High (many FLOPs/byte) | Compute (TFLOPS) |
| Decode | 1 at a time | 1 | Low (few FLOPs/byte) | Memory bandwidth (TB/s) |
The roofline model[1] makes this concrete: a kernel's achievable throughput is capped by either peak compute or by (memory bandwidth × arithmetic intensity), whichever is lower. Below a hardware-specific intensity threshold (the "ridge point"), you are bandwidth-bound; above it, compute-bound. Prefill often sits above that threshold, while low-batch decode often sits below it. An H100 SXM GPU has 80 GB of HBM and peak HBM bandwidth of 3.35 TB/s.[2] A 70B-class model in BF16/FP16 (16-bit floating-point formats) needs roughly 138-140 GB for weights alone, so it cannot be served unsharded on one H100-80GB.[3] That asymmetry is why input/output-aware (IO-aware) kernels like FlashAttention[4] matter for long prefills, while PagedAttention[5] focuses on fitting and reusing the KV cache efficiently during serving.
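The roofline rule can be sketched in a few lines. This is a minimal illustration, not a benchmark: the 989 TFLOPS peak is an assumed dense BF16 figure for one H100 SXM (substitute the datasheet value for your part), the 8192-wide square projection is a hypothetical layer shape, and `matmul_intensity` and `attainable_flops` are illustrative helper names.

```python
# Assumed hardware numbers for illustration; substitute your datasheet values.
PEAK_FLOPS = 989e12        # assumption: dense BF16 peak for one H100 SXM
HBM_BYTES_PER_S = 3.35e12  # peak HBM bandwidth quoted in the text


def matmul_intensity(tokens: int, d_in: int, d_out: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte for (tokens x d_in) @ (d_in x d_out), each tensor moved once."""
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per weight per token
    bytes_moved = bytes_per_value * (d_in * d_out + tokens * d_in + tokens * d_out)
    return flops / bytes_moved


def attainable_flops(intensity: float) -> float:
    """Roofline cap: the lower of peak compute and bandwidth x intensity."""
    return min(PEAK_FLOPS, HBM_BYTES_PER_S * intensity)


ridge_point = PEAK_FLOPS / HBM_BYTES_PER_S
hidden = 8192  # hypothetical square projection size

decode_intensity = matmul_intensity(tokens=1, d_in=hidden, d_out=hidden)
prefill_intensity = matmul_intensity(tokens=4096, d_in=hidden, d_out=hidden)

print(f"ridge point: {ridge_point:.0f} FLOPs/byte")
for name, intensity in (("decode", decode_intensity), ("prefill", prefill_intensity)):
    bound = "compute" if intensity >= ridge_point else "bandwidth"
    cap_tflops = attainable_flops(intensity) / 1e12
    print(f"{name}: {intensity:.1f} FLOPs/byte, {bound}-bound, cap {cap_tflops:.1f} TFLOPS")
```

With these assumed numbers, a single-token matrix-vector product lands near 1 FLOP per byte, far below the ridge point, while a 4,096-token prefill lands well above it. Real kernels fuse operations and cache tiles, so treat the intensities as order-of-magnitude guides.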
To evaluate and optimize an inference system, engineers rely on four standard metrics that capture different parts of the user experience and system throughput. Balancing these metrics often involves direct tradeoffs.
TTFT measures how long it takes before the first output token appears.[6] At the model-kernel level, TTFT is dominated by the prefill phase. In a real serving stack, end-to-end TTFT also includes tokenization, queueing, scheduling, and network overhead. Once those are under control, TTFT grows at least linearly with prompt length, and faster at very long contexts where the quadratic attention term starts to dominate.
This metric is highly visible to users. It's critical for interactive applications like chat interfaces, voice assistants, and real-time code completion, where a delay of even a second can feel sluggish.
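To see why prompt length drives TTFT, the sketch below uses a common back-of-the-envelope approximation: a forward pass costs about 2 FLOPs per parameter per token, ignoring the attention term. Every number here is an assumption for illustration (the 989 TFLOPS per-GPU peak, the 40% utilization, and the queue, tokenizer, and network overheads), and `prefill_seconds` and `ttft_ms` are hypothetical helper names.

```python
def prefill_seconds(
    params: float, prompt_tokens: int, peak_flops: float, utilization: float
) -> float:
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    flops_needed = 2 * params * prompt_tokens
    return flops_needed / (peak_flops * utilization)


def ttft_ms(queue_ms: float, tokenize_ms: float, prefill_ms: float, network_ms: float) -> float:
    """End-to-end TTFT is the prefill time plus serving-stack overheads."""
    return queue_ms + tokenize_ms + prefill_ms + network_ms


aggregate_peak = 4 * 989e12  # assumption: four GPUs at an assumed dense BF16 peak
short_ms = prefill_seconds(70e9, 1024, aggregate_peak, utilization=0.4) * 1000
long_ms = prefill_seconds(70e9, 8192, aggregate_peak, utilization=0.4) * 1000

for label, prefill in (("1K prompt", short_ms), ("8K prompt", long_ms)):
    total = ttft_ms(queue_ms=20, tokenize_ms=2, prefill_ms=prefill, network_ms=15)
    print(f"{label}: prefill {prefill:.0f} ms, end-to-end TTFT {total:.0f} ms")
```

The estimate scales linearly because it omits attention; measured prefill times on a real engine will diverge from it as prompts grow.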
TPS (also called "decode throughput") measures how fast the model generates subsequent output tokens after the first token is produced. For low-batch decode, memory bandwidth is often the dominant constraint. Exact results depend on model architecture, quantization, parallelism, engine, context length, and batch shape; measure the deployed configuration rather than using a generic tokens-per-second claim. Production systems use batched inference, often with continuous batching,[7] to improve aggregate throughput across requests.
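ITL represents the time elapsed between generating consecutive output tokens.[6] For a single request running in isolation, ITL is simply the inverse of TPS ($\text{ITL} = 1 / \text{TPS}$), so a 36 ms gap between tokens corresponds to about 27.8 tokens per second.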
In interactive UIs, ITL above roughly 100 milliseconds often starts to feel choppy. Under heavy batching, ITL can increase because more requests are contending for the same GPU resources.
While ITL is the individual gap between adjacent streamed tokens, TPOT is the average time per generated output token after the first token. In practice, TPOT is often computed as the mean of a request's ITL values or as an aggregate benchmark statistic across requests.[6]
Monitoring both metrics matters: ITL exposes jitter and stalls in the stream, while TPOT summarizes overall decode pacing. Under load, prefill interruptions, scheduling, and batching contention can make both worse.
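A small example shows why both numbers are worth tracking. The timestamps are hypothetical and include one deliberate stall; TPOT is computed here as the mean gap after the first token.

```python
# Hypothetical token arrival times with one stall before the fifth token.
token_times_ms = [300, 335, 370, 405, 640, 675]

# ITL: every individual gap between adjacent streamed tokens.
itls_ms = [later - earlier for earlier, later in zip(token_times_ms, token_times_ms[1:])]

# TPOT: average time per output token after the first one.
tpot_ms = (token_times_ms[-1] - token_times_ms[0]) / (len(token_times_ms) - 1)
worst_itl_ms = max(itls_ms)

print("ITLs:", itls_ms)
print(f"TPOT: {tpot_ms:.1f} ms")
print(f"worst ITL: {worst_itl_ms} ms")
```

The TPOT of 75 ms looks acceptable on its own, while the ITL list exposes a single 235 ms stall that a user would notice.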
The four metrics are summarized below:
| Metric | What it measures | Phase | What drives it | Useful aggregation |
|---|---|---|---|---|
| TTFT | Time until first output token appears | Prefill path | Prompt length, model size, queueing | Median and tail latency |
| TPS | Speed of token generation after the first | Decode | Memory bandwidth, batching | Per-request and aggregate rate |
| ITL | Time between consecutive tokens | Decode | Scheduling and contention | Distribution of token gaps |
| TPOT | Average time per output token after first | Decode | Scheduling, batching, contention | Request or benchmark mean |
An alert can route to the relevant investigation path without pretending to diagnose root cause by itself:
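```python
def investigate(ttft_p95_ms: int, itl_p95_ms: int) -> str:
    # Example thresholds only; derive real ones from your own SLOs.
    slow_first_token = ttft_p95_ms > 900
    choppy_stream = itl_p95_ms > 120
    if slow_first_token and choppy_stream:
        return "inspect queueing, prefill, and decode scheduling"
    if slow_first_token:
        return "inspect queueing and prefill"
    if choppy_stream:
        return "inspect decode scheduling and memory pressure"
    return "within example thresholds"

print("long initial pause:", investigate(ttft_p95_ms=1100, itl_p95_ms=60))
print("choppy stream:", investigate(ttft_p95_ms=350, itl_p95_ms=160))
```

```text
long initial pause: inspect queueing and prefill
choppy stream: inspect decode scheduling and memory pressure
```

For a single decode stream, TPS is governed by memory bandwidth. To generate one token, the GPU has to stream roughly the full set of model weights from HBM. A rough estimate:

$$\text{TPS}_{\text{single stream}} \lesssim \frac{\text{aggregate HBM bandwidth}}{\text{weight bytes read per token}}$$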
A 70B BF16/FP16 weight tensor does not fit on one H100-80GB. Suppose it is tensor-parallel sharded across four H100 SXM GPUs. Each GPU holds about one quarter of the 140 GB weight tensor and can read its shard at up to 3.35 TB/s; equivalently, the four devices offer an idealized aggregate 13.4 TB/s before communication and runtime losses.[2][3] Dividing that aggregate bandwidth by the weight bytes gives $13{,}400\ \text{GB/s} \div 140\ \text{GB} \approx 95.7$ tokens per second.
This is an optimistic bandwidth bound, not a throughput promise. It omits tensor-parallel communication, KV cache reads during attention, activation memory, kernel overhead, less-than-peak sustained bandwidth, and scheduling. Benchmark the selected engine and parallelism layout to get real TPS. Batched inference can improve aggregate throughput because multiple sequences share weight reads across concurrent decode work.
```python
def ideal_weight_stream_tps(
    model_gb: float, bandwidth_gb_per_s: float, tensor_parallel_gpus: int
) -> float:
    # Each GPU streams its own shard, so the shards are read in parallel.
    aggregate_bandwidth = bandwidth_gb_per_s * tensor_parallel_gpus
    return aggregate_bandwidth / model_gb

model_gb = 140
h100_capacity_gb = 80
tensor_parallel_gpus = 4

# Capacity check first: the bandwidth bound is meaningless if the weights do not fit.
print("fits on one H100-80GB:", model_gb <= h100_capacity_gb)
bound = ideal_weight_stream_tps(model_gb, 3350, tensor_parallel_gpus)
print(f"four-GPU ideal shard-read bound: {bound:.1f} tokens/s")
```

```text
fits on one H100-80GB: False
four-GPU ideal shard-read bound: 95.7 tokens/s
```

Research note: This memory-bandwidth bound is fundamental to transformer inference and is why many serving optimizations (PagedAttention[5], continuous batching[7], quantization) attack the same bottleneck: reducing bytes moved per token or increasing effective HBM bandwidth.
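The same bandwidth argument explains why batching helps and why the benefit flattens. The sketch below assumes each decode step reads the weights once and every active sequence's KV cache once; the 1.34 GB per-sequence KV footprint is a hypothetical value, `ideal_aggregate_tps` is an illustrative name, and compute limits, communication, and kernel overhead are ignored.

```python
def ideal_aggregate_tps(
    weight_gb: float, kv_gb_per_seq: float, batch: int, bandwidth_gb_per_s: float
) -> float:
    """Idealized bound: one step reads the weights once plus each sequence's KV cache."""
    if batch < 1:
        raise ValueError("batch must be at least 1")
    step_seconds = (weight_gb + batch * kv_gb_per_seq) / bandwidth_gb_per_s
    return batch / step_seconds  # every sequence emits one token per step


WEIGHT_GB = 140.0
KV_GB_PER_SEQ = 1.34            # hypothetical per-sequence KV footprint
BANDWIDTH_GB_PER_S = 4 * 3350.0  # idealized aggregate across four GPUs

bounds = {
    batch: ideal_aggregate_tps(WEIGHT_GB, KV_GB_PER_SEQ, batch, BANDWIDTH_GB_PER_S)
    for batch in (1, 8, 32)
}
for batch, tps in bounds.items():
    print(f"batch {batch:>2}: {tps:,.0f} aggregate tokens/s, {tps / batch:.1f} per stream")
```

Aggregate throughput rises with batch size because the weight read is shared, but per-stream speed falls as KV reads grow, which is one reason scheduling has to balance throughput against ITL.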
After weights and runtime buffers are resident, the limit on how many active sequences a deployment can admit is often remaining memory capacity, specifically the memory required for their KV cache.
During attention, each layer computes Key and Value projections for every token. Without caching, each decode step would have to rerun the full prefix through the model and recompute old K/V tensors again and again. That repeated work gets expensive fast as the sequence grows.
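A rough count makes the cost of skipping the cache concrete. This sketch counts token positions pushed through the model, not FLOPs, and both function names are illustrative.

```python
def token_passes_without_cache(prompt_tokens: int, output_tokens: int) -> int:
    """Every step reruns the whole prefix: the prompt plus all tokens generated so far."""
    if output_tokens < 1:
        raise ValueError("expected at least one output token")
    return sum(prompt_tokens + generated for generated in range(output_tokens))


def token_passes_with_cache(prompt_tokens: int, output_tokens: int) -> int:
    """Prefill the prompt once, then process one new token per later step."""
    if output_tokens < 1:
        raise ValueError("expected at least one output token")
    return prompt_tokens + (output_tokens - 1)


without = token_passes_without_cache(prompt_tokens=1000, output_tokens=200)
cached = token_passes_with_cache(prompt_tokens=1000, output_tokens=200)
print(f"without cache: {without:,} token passes")
print(f"with cache:    {cached:,} token passes")
```

Caching removes the repeated prefix work, but each cached step still has to read the stored K/V tensors, which is why the cache shifts the cost from compute to memory.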
Analogy (shipment trace): The KV cache is like a shipment trace you build while processing an order conversation. For each token already processed, the model stores key routing facts (K) and useful details (V). When the next token references earlier context, the model reads that trace instead of re-deriving the whole prompt from scratch. The trace grows with each token, and its size determines how many concurrent conversations fit in GPU memory.
The KV cache stores these K and V tensors. As the sequence grows, the KV cache accumulates data for each token to avoid redundant calculations. This step-by-step accumulation allows the model to compute attention for only the newest token against the historical cache. The trace below shows how the cache expands with each generated token:
```text
Token 1: Compute K₁, V₁ → Store in cache
Token 2: Compute K₂, V₂ → Store; Attend to [K₁,K₂], [V₁,V₂]
Token 3: Compute K₃, V₃ → Store; Attend to [K₁,K₂,K₃], [V₁,V₂,V₃]
...
```

To manage this growing memory dynamically without fragmentation, systems like vLLM use PagedAttention[5], which divides the KV cache into fixed-size blocks (pages) similar to operating system virtual memory.
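The paging idea can be sketched as a toy allocator. This is not vLLM's implementation: `BlockAllocator`, its methods, and the 16-token block size are assumptions chosen to show how a per-sequence block table grows one fixed-size block at a time.

```python
import math

BLOCK_TOKENS = 16  # assumption: tokens per KV block; real engines make this configurable


class BlockAllocator:
    """Toy fixed-size block pool with a per-sequence block table."""

    def __init__(self, total_blocks: int) -> None:
        self.free_blocks = list(range(total_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.token_counts: dict[str, int] = {}

    def append_tokens(self, seq_id: str, new_tokens: int) -> bool:
        """Grow a sequence, taking new blocks only when its last block is full."""
        table = self.block_tables.get(seq_id, [])
        total = self.token_counts.get(seq_id, 0) + new_tokens
        needed = math.ceil(total / BLOCK_TOKENS) - len(table)
        if needed > len(self.free_blocks):
            return False  # caller should queue or preempt instead of overcommitting
        table.extend(self.free_blocks.pop() for _ in range(needed))
        self.block_tables[seq_id] = table
        self.token_counts[seq_id] = total
        return True

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.token_counts.pop(seq_id, None)


pool = BlockAllocator(total_blocks=8)
print("prefill a (40 tokens):", pool.append_tokens("a", 40))
print("decode a (+1 token):", pool.append_tokens("a", 1))
print("blocks held by a:", len(pool.block_tables["a"]))
print("admit b (100 tokens):", pool.append_tokens("b", 100))
pool.release("a")
print("admit b after release:", pool.append_tokens("b", 100))
```

Because blocks need not be contiguous, the only wasted space is the unused tail of each sequence's last block, and a finished sequence's blocks are immediately reusable by any other request.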
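For a single sequence:

$$\text{KV cache bytes} = 2 \times n_{\text{layers}} \times n_{\text{kv}} \times d_{\text{head}} \times n_{\text{tokens}} \times b$$

Reading the formula: for every layer ($n_{\text{layers}}$), every KV head ($n_{\text{kv}}$), every position in the sequence ($n_{\text{tokens}}$), we store a Key vector and a Value vector (the "2") of dimension $d_{\text{head}}$, each element taking $b$ bytes. Multiply it all together and this cache can easily reach gigabytes for long sequences.

Where, for a 70B-class example:

| Parameter | Value |
|---|---|
| Layers ($n_{\text{layers}}$) | 80 |
| KV heads ($n_{\text{kv}}$) | 8 (GQA, not 64 query heads!) |
| Head dim ($d_{\text{head}}$) | 128 |
| Sequence length ($n_{\text{tokens}}$) | 4,096 |
| Dtype ($b$) | FP16 (2 bytes) |

Here's the breakdown: the formula multiplies 2 (for K and V) by 80 layers, 8 KV heads, a head dimension of 128, a sequence length of 4,096, and 2 bytes per value (for FP16):

$$2 \times 80 \times 8 \times 128 \times 4{,}096 \times 2 = 1{,}342{,}177{,}280\ \text{bytes} \approx 1.34\ \text{GB}\ (1.25\ \text{GiB})$$

This matches the common 64-query-head / 8-KV-head grouped-query attention (GQA)[8] geometry used by many 70B-class models. With GQA, this is 8× smaller than it would be with standard Multi-Head Attention (MHA). Without GQA, the same geometry would need about 10.74 GB (10.0 GiB) per sequence.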
```python
def kv_cache_bytes(
    layers: int, kv_heads: int, head_dim: int, tokens: int, bytes_per_value: int
) -> int:
    # The leading 2 covers one Key and one Value vector per head, layer, and token.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

gqa_bytes = kv_cache_bytes(80, 8, 128, 4096, 2)   # 8 KV heads (GQA)
mha_bytes = kv_cache_bytes(80, 64, 128, 4096, 2)  # 64 KV heads (full MHA)

print(f"GQA cache: {gqa_bytes / 1e9:.2f} GB")
print(f"MHA cache: {mha_bytes / 1e9:.2f} GB")
print("MHA / GQA:", mha_bytes // gqa_bytes)
```

```text
GQA cache: 1.34 GB
MHA cache: 10.74 GB
MHA / GQA: 8
```
Production note: Serving a 70B model to 100 concurrent users at 4,096 tokens each requires precise KV cache budgeting: model weights (~138-140 GB BF16/FP16) + KV cache (1.34 GB × 100 users = 134 GB) = about 274 GB of raw footprint before allocator overhead, runtime buffers, and communication memory. On paper that fits within the 320 GB of 4× H100-80GB with tensor parallelism, but the remaining margin of roughly 46 GB has to cover all of that overhead.
Try it yourself: A colleague says you can serve a 7B model (32 layers, 8 KV heads, 128 head dimension, FP16) to 200 concurrent users, each holding 2,048 tokens of context, on a single 80 GB GPU. The model weights take about 14 GB. Use the formula to see why raw weights plus KV memory are insufficient for an admission decision.
```python
def kv_gb_per_sequence(tokens: int) -> float:
    # 2 (K and V) x 32 layers x 8 KV heads x 128 head dim x tokens, at 2 bytes per value
    values = 2 * 32 * 8 * 128 * tokens
    return values * 2 / 1e9

users = 200
raw_total_gb = 14 + users * kv_gb_per_sequence(tokens=2048)
print(f"raw weights plus KV: {raw_total_gb:.2f} GB")
for reserve_gb in (8, 16):
    admitted = raw_total_gb + reserve_gb <= 80
    print(f"with {reserve_gb} GB runtime reserve: {admitted}")
```

```text
raw weights plus KV: 67.69 GB
with 8 GB runtime reserve: True
with 16 GB runtime reserve: False
```

The raw calculation leaves only a narrow margin. Whether 200 active sequences fit depends on measured activation, workspace, allocator, and fragmentation headroom for the actual serving engine; do not turn an unmeasured reserve into a promised concurrency count.
In production environments, context length isn't a static limit determined solely by the model's architecture. Instead, it's a dynamic memory budget that dictates how many concurrent users your system can support. Every additional token of context required by one user reduces the available GPU memory (VRAM, Video RAM) for everyone else.
To serve models at scale, inference engines have to enforce these budgets strictly. When a request comes in, the system checks the available GPU memory. If the required KV cache for the new request (plus existing ones) exceeds the remaining capacity, the request waits in a queue. That's why schedulers track memory pressure right alongside latency metrics.
To calculate the maximum affordable context length, we can write a simple capacity planning function. It takes the total GPU memory, the model's static weight footprint, a runtime reserve, the architectural parameters (layers, KV heads, and head dimension), and the number of concurrent users. It subtracts weights and reserve from total memory, splits what remains evenly across users, and divides each share by the per-token KV cache size, returning the maximum context (prompt plus generated tokens) each user can hold:
```python
def max_context_for_budget(
    gpu_memory_gb: float,
    model_memory_gb: float,
    runtime_reserve_gb: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # FP16
    num_concurrent: int = 1,
) -> int:
    """Quick planning estimate using decimal GB for consistency with GPU datasheets."""
    available_memory = (gpu_memory_gb - model_memory_gb - runtime_reserve_gb) * 1e9
    if available_memory <= 0:
        return 0  # weights and reserve already exceed the memory budget

    # Memory per token in KV cache
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

    # Divide by concurrent users
    budget_per_user = available_memory / num_concurrent

    return int(budget_per_user / bytes_per_token)

# Example: 70B-class model on 4×H100-80GB (320 GB raw)
max_tokens = max_context_for_budget(
    gpu_memory_gb=320,
    model_memory_gb=140,    # FP16 weights
    runtime_reserve_gb=40,  # engine buffers, workspaces, and allocator margin
    num_layers=80,
    num_kv_heads=8,         # GQA: 8 KV heads (not 64!)
    head_dim=128,
    num_concurrent=50,
)
print(f"max context per user: {max_tokens:,} tokens")
```

```text
max context per user: 8,544 tokens
```

This calculation drives critical deployment decisions. If you need to support 100 concurrent users but only have the budget for 5,000 tokens each, you might need to add another GPU node, reduce the model precision using quantization, or implement stricter context window limits at the application layer. Treat the result as an upper bound, not a safe production limit: you still need headroom for activations, communication buffers, allocator slack, and the serving runtime itself.
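The options in that paragraph can be compared with the same arithmetic. This is a sketch under stated assumptions: the 70B-class geometry is hard-coded, the runtime reserve is assumed to double when the GPU count doubles, and the 1-byte KV case ignores scaling metadata. `tokens_per_user` is an illustrative helper, not an engine API.

```python
def tokens_per_user(
    total_gb: float, weights_gb: float, reserve_gb: float, users: int, kv_bytes_per_value: int
) -> int:
    """Per-user context budget for an 80-layer, 8-KV-head, 128-dim geometry."""
    bytes_per_token = 2 * 80 * 8 * 128 * kv_bytes_per_value
    available_bytes = max(total_gb - weights_gb - reserve_gb, 0) * 1e9
    return int(available_bytes / users / bytes_per_token)


baseline = tokens_per_user(320, 140, 40, users=100, kv_bytes_per_value=2)
more_gpus = tokens_per_user(640, 140, 80, users=100, kv_bytes_per_value=2)  # assumed 8 GPUs
fp8_kv = tokens_per_user(320, 140, 40, users=100, kv_bytes_per_value=1)     # 8-bit KV cache

print(f"4 GPUs, FP16 KV:  {baseline:,} tokens per user")
print(f"8 GPUs, FP16 KV:  {more_gpus:,} tokens per user")
print(f"4 GPUs, 8-bit KV: {fp8_kv:,} tokens per user")
```

Each option buys context in a different way, and each carries its own cost (hardware spend, quality evaluation for quantization), so the numbers are a starting point for the decision rather than the decision itself.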
Common mistake: Using the query head count instead of the KV head count. A 70B-class model might have 64 query heads but only 8 KV heads thanks to GQA. If you plug 64 into the formula, you get an 8× memory overestimate and a pessimistic concurrency plan. Check the model config for `num_key_value_heads`, not `num_attention_heads`.
An online admission check can reserve KV capacity using each request's prompt plus output budget rather than admitting every request at the maximum architectural context:
```python
# KV bytes per token: 2 (K and V) x 80 layers x 8 KV heads x 128 head dim x 2 bytes (FP16)
BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2

def kv_gb(tokens: int) -> float:
    return tokens * BYTES_PER_TOKEN / 1e9

def admit(existing_tokens: list[int], new_tokens: int, kv_budget_gb: float) -> bool:
    # Reserve for the new request's full prompt-plus-output budget up front.
    needed = sum(kv_gb(tokens) for tokens in existing_tokens) + kv_gb(new_tokens)
    return needed <= kv_budget_gb

active = [4096] * 40
print("admit 16K request:", admit(active, 16_384, kv_budget_gb=60))
print("admit 64K request:", admit(active, 65_536, kv_budget_gb=60))
```

```text
admit 16K request: True
admit 64K request: False
```

Once you understand the two-phase bottleneck, the next question is how production systems work around it. Engineering teams use three major techniques to smooth the tradeoffs between prefill and decode and maximize hardware utilization.
On shared serving hardware, large prefills can delay decode operations, creating a TTFT-TPS tradeoff: prioritizing new prefills can stall existing decode streams.
Chunked prefill splits long prompts into smaller chunks and interleaves them with decode steps.
Analogy (factory assembly line): Without chunked prefill, it's like shutting down the entire factory assembly line to set up for a new product. All existing products stop moving while you retool. With chunked prefill, you retool one station at a time while the rest of the line keeps running. Existing requests keep flowing (decode continues) while the new request is gradually set up (prefilled in chunks).
The timeline below illustrates how chunked prefill avoids stalling decodes by breaking up the massive prefill block. By interleaving smaller prefill chunks with ongoing decode steps, the system maintains a steady flow of output tokens for existing users while gradually processing the new prompt:
```text
Without chunked prefill:
  [Prefill 10K tokens ===========================] [Decode...Decode...Decode...]
   ↑ All decode requests stall during this prefill

With chunked prefill (chunk=2048):
  [Prefill chunk1][Decode][Prefill chunk2][Decode][Prefill chunk3][Decode]...
   ↑ Decode requests continue between chunks
```

Chunked scheduling can improve GPU utilization by mixing compute-heavy prefill with memory-heavy decode, while protecting streaming cadence and tail latency. vLLM documents chunked prefill as a scheduling optimization,[9] and systems like Sarathi-Serve[10] study it explicitly.
```python
def chunked_schedule(prompt_tokens: int, chunk_tokens: int) -> list[str]:
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    actions: list[str] = []
    remaining = prompt_tokens
    while remaining > 0:
        # Prefill at most one chunk, then give active decode streams a turn.
        processed = min(chunk_tokens, remaining)
        actions.append(f"prefill {processed}")
        remaining -= processed
        actions.append("decode active streams")
    return actions

for action in chunked_schedule(prompt_tokens=6144, chunk_tokens=2048):
    print(action)
```

```text
prefill 2048
decode active streams
prefill 2048
decode active streams
prefill 2048
decode active streams
```

In a standard setup, a single GPU handles both prefill and decode phases for its assigned requests. However, because prefill is usually compute-bound and decode is usually memory-bandwidth bound, using the same hardware for both can create head-of-line blocking and muddle hardware sizing. Modern systems (Splitwise[11], DistServe[12], Mooncake[13]) explore separating prefill and decode onto different GPU pools when that isolation benefit outweighs the KV-transfer cost.
You do pay for moving KV state across the interconnect, so disaggregation is most attractive when prompts are long, traffic is bursty, or TTFT/ITL isolation matters more than the extra transfer overhead.
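The transfer cost itself is easy to bound from the KV cache formula. The sketch below assumes the 70B-class FP16 geometry used earlier and two hypothetical sustained link rates; real transfers add protocol overhead and can overlap with compute, so treat the result as a first estimate. `kv_transfer_ms` is an illustrative helper name.

```python
# KV bytes per token: 2 (K and V) x 80 layers x 8 KV heads x 128 head dim x 2 bytes (FP16)
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2


def kv_transfer_ms(prompt_tokens: int, link_gb_per_s: float) -> float:
    """Time to ship one prompt's KV cache at an assumed sustained link rate."""
    if link_gb_per_s <= 0:
        raise ValueError("link_gb_per_s must be positive")
    payload_gb = prompt_tokens * KV_BYTES_PER_TOKEN / 1e9
    return payload_gb / link_gb_per_s * 1000


# Hypothetical sustained rates: a slower network path and a fast GPU interconnect.
for label, link_gb_per_s in (("slower link", 25.0), ("fast interconnect", 400.0)):
    ms = kv_transfer_ms(prompt_tokens=10_000, link_gb_per_s=link_gb_per_s)
    print(f"{label}: {ms:.1f} ms to move a 10K-token KV cache")
```

An estimate like this supplies the transfer side of the comparison; the interference side has to come from measurements of the colocated deployment.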
```python
def choose_layout(shared_phase_interference_ms: int, kv_transfer_ms: int) -> str:
    # Disaggregate only when moving KV state costs less than the interference it removes.
    if kv_transfer_ms < shared_phase_interference_ms:
        return "separate prefill and decode pools"
    return "keep phases colocated"

print("bursty workload:", choose_layout(shared_phase_interference_ms=95, kv_transfer_ms=20))
print("small prompts:", choose_layout(shared_phase_interference_ms=8, kv_transfer_ms=20))
```

```text
bursty workload: separate prefill and decode pools
small prompts: keep phases colocated
```

Store KV cache in an 8-bit format such as FP8 instead of FP16/BF16 to roughly halve the cache footprint. Research systems have demonstrated sub-8-bit KV cache quantization, including 3-bit and 2-bit methods, with model-specific quality evaluation required before deployment.[14][15] vLLM documents FP8 KV-cache support and scaling configuration; support and accuracy trade-offs depend on the engine and hardware.[16] See our model quantization deep-dive for the techniques behind weight and activation quantization. For the raw K/V payload:

$$\text{KV bytes}_{\text{8-bit}} = 2 \times n_{\text{layers}} \times n_{\text{kv}} \times d_{\text{head}} \times n_{\text{tokens}} \times 1 = \tfrac{1}{2}\,\text{KV bytes}_{\text{16-bit}}$$

Reading the formula: with the same layer count ($n_{\text{layers}}$), KV head count ($n_{\text{kv}}$), head dimension ($d_{\text{head}}$), and token count ($n_{\text{tokens}}$), an 8-bit K/V tensor payload uses 1 byte per value instead of 2 bytes (FP16/BF16), so its raw payload is halved for the same sequence length and concurrency. Scaling metadata and runtime buffers mean allocated memory savings may differ slightly.
While quantizing weights reduces the static memory footprint of the model, quantizing the KV cache specifically attacks the dynamic memory bottleneck that limits concurrency. Some serving engines now support KV cache quantization directly. If KV memory is the dominant constraint, moving from 16-bit to 8-bit caching can come close to doubling concurrency on the same hardware. In practice, the gain is smaller once you account for model weights, allocator overhead, and other runtime buffers.
```python
def users_from_kv_budget(kv_budget_gb: float, gb_per_user: float) -> int:
    return int(kv_budget_gb / gb_per_user)

fp16_gb_per_user = 1.342                # 4,096-token sequence at 2 bytes per value
fp8_gb_per_user = fp16_gb_per_user / 2  # raw payload only, ignoring scaling metadata
kv_budget_gb = 80

print("FP16 users from KV budget:", users_from_kv_budget(kv_budget_gb, fp16_gb_per_user))
print("FP8 users from KV budget:", users_from_kv_budget(kv_budget_gb, fp8_gb_per_user))
print("runtime headroom still required:", True)
```

```text
FP16 users from KV budget: 59
FP8 users from KV budget: 119
runtime headroom still required: True
```

When sizing the KV cache for a GQA model, use `num_key_value_heads` rather than the full query-head count.

Symptom: You describe the whole request as sequential and then cannot explain the large pause before token 1.
Cause: Decode is sequential, but prefill processes the full prompt in parallel before streaming begins.
Fix: Split the request into two phases every time you reason about latency: prefill for first-token delay, decode for streaming cadence.

Symptom: A team adds more GPUs and expects single-request TPS to scale linearly.
Cause: Single-stream decode is often limited by memory bandwidth and communication overhead, not raw compute alone.
Fix: Ask what bottleneck you are relieving. If decode is bandwidth-bound, look at HBM bandwidth, scheduling, quantization, or a different parallelism strategy before adding more devices.

Symptom: A bandwidth calculation divides one H100's bandwidth by a 140 GB weight tensor.
Cause: The estimate ignores capacity: 140 GB of weights cannot reside on one 80 GB device.
Fix: Choose a feasible parallel layout first, then estimate using each device's shard and account for interconnect overhead in benchmarks.

Symptom: Product plans assume every user can use the model's full architectural context without affecting concurrency.
Cause: The architectural limit and the production memory budget were treated as the same thing.
Fix: Turn context policy into a capacity calculation. Budget KV state per user, then cap prompt and output lengths to preserve concurrency headroom.

Symptom: The system optimizes first-token latency aggressively, but active streams become choppy.
Cause: Large prefills and smooth decode compete for the same GPU time, so improving one can hurt the other.
Fix: Measure TTFT and decode metrics separately. Then tune scheduling, such as chunked prefill, around the actual SLO you need to protect.

Symptom: Capacity plans are off by a large factor and the service either looks impossibly expensive or crashes under load.
Cause: The formula used num_attention_heads instead of num_key_value_heads on a GQA model.
Fix: Check the model config directly. KV memory uses the stored K/V head count, not the full query-head count.
You now have a concrete mental model for how LLM inference behaves in production. When a request arrives, the engine executes a usually compute-heavy prefill pass over the prompt. That prefill drives model-side TTFT. Then the system falls into an often memory-bandwidth-bound low-batch decode loop that generates one token at a time and drives streamed TPS. The KV cache is the bridge between those phases, and it grows with every token, which is why remaining memory capacity, not only compute, often limits concurrency.
If you can explain why a long prompt hurts TTFT more than TPS, why doubling decode speed requires more HBM bandwidth rather than more TFLOPS, and how to estimate whether 100 concurrent users fit on your GPU cluster, you are already ahead of most candidates in an AI infrastructure interview.
[1] Roofline: An Insightful Visual Performance Model for Multicore Architectures. Williams, S., Waterman, A., & Patterson, D. · 2009
[2] H100 GPU. NVIDIA · 2026
[3] Wide Open: NVIDIA Accelerates Inference on Meta Llama 3. NVIDIA · 2024
[4] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022
[5] Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP 2023
[6] Metrics. vLLM · 2026
[7] Orca: A Distributed Serving System for Transformer-Based Generative Models. Yu, G.-I., et al. · 2022 · OSDI 2022
[8] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Ainslie, J., et al. · 2023 · EMNLP 2023
[9] Optimization and Tuning. vLLM · 2026
[10] Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. Agrawal, A., et al. · 2023 · arXiv preprint
[11] Splitwise: Efficient Generative LLM Inference Using Phase Splitting. Patel, P., et al. · 2023
[12] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. Zhong, Y., et al. · 2024 · OSDI 2024
[13] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. Qin, Y., et al. · 2024
[14] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. Hooper, C., Kim, S., Gholami, A., et al. · 2024 · arXiv preprint
[15] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. Liu, Z., Chen, B., Hu, X., et al. · 2024 · arXiv preprint
[16] Quantized KV Cache. vLLM Team · 2026 · vLLM Documentation