Inference: TTFT, TPS & KV Cache

Understand the two-phase inference process (prefill vs decode), derive the KV cache memory formula, and learn production optimizations like chunked prefill and prefill/decode disaggregation.

Your production-agent capstone can fan out into planning, retrieval, tool calls, review, and several model calls. Zoom into one of those calls. Once a request reaches an inference engine, three terms matter immediately: time to first token (TTFT), tokens per second (TPS), and the key-value (KV) cache that grows with the conversation.

When you send a message to ChatGPT, you notice something peculiar about how the response appears. There's often a brief pause, then the first visible text appears, followed by the rest streaming out in small chunks. Why that initial pause? Why does it stream token by token instead of appearing instantly? And why do longer conversations eventually feel slower?

This behavior isn't a quirk of the interface. It's the physics of LLM inference: the process of running a trained model to generate text. Production systems tune the same mechanics you see here: two generation phases, memory bottlenecks, and performance counters. Longer conversations often feel slower for a concrete reason: each decode step has to read a growing KV cache on top of the model weights.

Prefill versus decode: Autoregressive LLM generation has two distinct phases: prefill and decode. They often emphasize different hardware bottlenecks. Knowing which phase is the bottleneck, and why, is the foundation for major optimizations in this space, from Key-Value (KV) cache management to continuous batching.

The two phases of LLM inference

Every autoregressive text-generation request has two computational phases. Their dominant bottlenecks depend on workload shape, but a useful baseline is:

Mental model (request trace): Prefill reads the whole prompt in one wide pass and writes the first cache state. It's intense upfront work but highly parallelizable. Decode then runs a narrow loop: one new token enters, cached prefix state is read, one next-token distribution comes out, and the cache grows by one step.

Two-panel inference pipeline showing prefill then decode with KV cache reuse. — Inference often flips regimes after token 1: long-prompt prefill is parallel and compute-heavy, while low-batch decode becomes a serial loop that rereads weights and cached state.

This diagram traces the sequential dependency between the highly parallel prefill phase and the autoregressive decode phase:

Diagram showing Phase 1: Prefill (process entire prompt in parallel, write prompt K/V into KV cache) followed by Phase 2: Decode.

Phase 1: Prefill (processing the prompt)

In an unchunked baseline, the model processes the input prompt in parallel during prefill. Prompt tokens participate in large matrix operations, so the GPU can run dense matrix multiplies efficiently. For long prompts on modern accelerators, this phase is usually compute-bound, limited more by available FLOPs than by memory bandwidth. Later, chunked prefill deliberately divides that work for scheduling reasons. A toy trace looks like this:

text

Input: "Explain why decode slows as context grows"
→ Tokenize prompt
→ Process all prompt tokens in one forward pass
→ Produce KV cache entries for the prompt
→ Produce logits (unnormalized probability scores) for the FIRST output token

The time from request to the first output token is called TTFT (Time to First Token). For a user staring at a chat interface, this is the most noticeable latency. Don't copy one generic latency target into every product; establish a service-level objective (SLO) from the interaction mode and measured user tolerance.

Use Case	Primary pressure	What to measure
Real-time voice	Turn-taking delay	End-to-end TTFT and audio pipeline overhead
Code completion	Interruption to typing	Tail TTFT for short prompts
Chat/conversational	Visible waiting	TTFT plus streamed ITL
Batch processing	Job completion	Throughput and cost before TTFT

Phase 2: Decode (generating output tokens)

After producing the first token, the model generates subsequent tokens one at a time, autoregressively. Each new token requires a forward pass through the entire model, but only processes that single new token (reusing cached K/V from all previous tokens). The new token is appended to the running sequence, added to the KV cache, and fed back into the model to predict the next token. The trace below shows how the response grows:

text

Step 1: Output so far: "Decode" → add to KV cache → forward pass → next token: "reads"
Step 2: Output so far: "Decode reads" → add to KV cache → forward pass → next token: "cached"
Step 3: Output so far: "Decode reads cached" → add to KV cache → forward pass → next token: "state"
...

For a single decode stream, this phase is usually memory-bandwidth bound, not compute-bound. The bottleneck is reading the model weights and KV cache from GPU high-bandwidth memory (HBM) for each token. The matrix multiplications are thin relative to the amount of data that must be moved, so the GPU's arithmetic units often spend more time waiting for bytes than doing math. As the response grows, the attention kernel also has to read a larger cached prefix, so per-token latency tends to rise with sequence length even when the model weights stay fixed.

This small timeline separates first-token latency from decode cadence. It's intentionally a measurement exercise, not a model benchmark.

measure-prefill-and-decode.py

arrival_ms = 0
token_times_ms = [320, 355, 392, 428]

ttft_ms = token_times_ms[0] - arrival_ms
itls_ms = [
    current - previous
    for previous, current in zip(token_times_ms, token_times_ms[1:])
]
mean_itl_ms = sum(itls_ms) / len(itls_ms)
tps = 1000 / mean_itl_ms

print("TTFT:", ttft_ms, "ms")
print("decode ITLs:", itls_ms, "ms")
print(f"decode TPS: {tps:.1f}")

Output

TTFT: 320 ms
decode ITLs: [35, 37, 36] ms
decode TPS: 27.8

The arithmetic intensity explanation

The key difference between the two phases is arithmetic intensity: the number of floating-point operations (FLOPs) a kernel performs per byte of data it moves to or from high-bandwidth memory (HBM).

Why prefill has high arithmetic intensity. During prefill the model loads each weight matrix once and reuses it across all $N$ prompt tokens. For a current dense model such as Qwen3.6-27B, BF16 weights are roughly 54 GB before KV cache or runtime buffers [1] (Qwen3.6-27B, https://huggingface.co/Qwen/Qwen3.6-27B). That expensive memory traffic is amortized over a large number of multiply-accumulate operations, so the GPU's tensor cores stay busy with useful work and the bottleneck becomes raw compute throughput (TFLOPS, trillions of floating-point operations per second).

Why decode has low arithmetic intensity. For each new output token, the model streams through all of its weights to perform what is close to a matrix-vector product when the effective batch is small. The number of FLOPs per byte loaded collapses, and the GPU's arithmetic units spend much of their time waiting for the next wave of weights and KV cache entries to arrive from HBM. Memory bandwidth (TB/s) therefore becomes the limiting factor.

Phase	Tokens Processed	Effective Batch	Arithmetic Intensity	Bottleneck
Prefill	$N$ (prompt)	Large (full prompt)	High (many FLOPs/byte)	Often compute (TFLOPS)
Decode	1 at a time	1	Low (few FLOPs/byte)	Often memory bandwidth (TB/s)
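The contrast in the table can be sketched for a single weight-matrix multiply. This is a rough illustration, not a profile of any real kernel: the 5,120 × 5,120 projection shape is a hypothetical choice, values are assumed to be BF16 (2 bytes), and only weight, input, and output traffic is counted.

```python
def matmul_arithmetic_intensity(
    tokens: int, d_in: int, d_out: int, bytes_per_value: int = 2
) -> float:
    """FLOPs per byte for Y = X @ W, with X of shape (tokens, d_in) and W of shape (d_in, d_out)."""
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_value  # W is read once
    activation_bytes = tokens * (d_in + d_out) * bytes_per_value  # read X, write Y
    return flops / (weight_bytes + activation_bytes)

# Hypothetical projection shape; not taken from any model config.
decode_ai = matmul_arithmetic_intensity(tokens=1, d_in=5120, d_out=5120)
prefill_ai = matmul_arithmetic_intensity(tokens=4096, d_in=5120, d_out=5120)

print(f"decode (1 token): {decode_ai:.2f} FLOPs/byte")
print(f"prefill (4096 tokens): {prefill_ai:.0f} FLOPs/byte")
```

Under these assumptions the single-token case lands near 1 FLOP per byte, while the 4,096-token case lands above a thousand, because the same weight bytes are reused for every prompt token.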

The roofline model [2] (Roofline: An Insightful Visual Performance Model for Multicore Architectures, https://doi.org/10.1145/1498765.1498785) makes this concrete: a kernel's achievable throughput is capped by either peak compute or by (memory bandwidth × arithmetic intensity), whichever is lower. Below a hardware-specific intensity threshold (the "ridge point"), you're bandwidth-bound; above it, compute-bound. Prefill often sits above that threshold, while low-batch decode often sits below it. That asymmetry is why input/output-aware (IO-aware) kernels like FlashAttention [4] (FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, https://arxiv.org/abs/2205.14135) matter for long prefills, while PagedAttention [5] (Efficient Memory Management for Large Language Model Serving with PagedAttention, https://arxiv.org/abs/2309.06180) focuses on fitting and reusing the KV cache efficiently during serving.

For a concrete hardware reference, an H100 SXM GPU has 80 GB of HBM and peak HBM bandwidth of 3.35 TB/s [3] (H100 GPU, https://www.nvidia.com/en-us/data-center/h100/). Qwen3.6-27B in BF16 is about 54 GB for dense weights, so raw weights fit on one H100-80GB, but the official 262K native context makes KV-cache policy and runtime headroom part of the real fit decision [1] (Qwen3.6-27B, https://huggingface.co/Qwen/Qwen3.6-27B).
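The roofline cap can be written in a few lines. The sketch below uses the 3.35 TB/s bandwidth quoted above; the peak compute figure of 989 TFLOP/s is an assumption (an approximate dense BF16 tensor-core peak for H100 SXM) and should be checked against the datasheet for your part and precision.

```python
def roofline_tflops(
    intensity_flops_per_byte: float, peak_tflops: float, bandwidth_tb_per_s: float
) -> float:
    """Attainable throughput = min(peak compute, bandwidth x arithmetic intensity)."""
    return min(peak_tflops, bandwidth_tb_per_s * intensity_flops_per_byte)

PEAK_TFLOPS = 989.0  # assumption: approximate dense BF16 peak; confirm on the datasheet
BANDWIDTH_TB_PER_S = 3.35  # HBM bandwidth quoted in the text

# The ridge point is the intensity at which the two caps meet.
ridge_point = PEAK_TFLOPS / BANDWIDTH_TB_PER_S
print(f"ridge point: {ridge_point:.0f} FLOPs/byte")

for intensity in (1, 64, 2048):
    attainable = roofline_tflops(intensity, PEAK_TFLOPS, BANDWIDTH_TB_PER_S)
    regime = "bandwidth-bound" if intensity < ridge_point else "compute-bound"
    print(f"intensity {intensity}: {attainable:.1f} TFLOP/s ({regime})")
```

With these inputs, a kernel at about 1 FLOP per byte is capped at a few TFLOP/s regardless of how much compute the chip has, which is the situation low-batch decode is in.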

Key performance metrics

To evaluate and optimize an inference system, engineers rely on four standard metrics that capture different parts of the user experience and system throughput. Balancing these metrics often involves direct tradeoffs.

Inference request timeline showing queue, tokenize, prefill, token 1, and repeated decode gaps labeled ITL, with TTFT ending at the first output token and streaming metrics after it. — The timeline separates first-token latency from streaming cadence. TTFT stops at token 1, while ITL, TPOT, and TPS describe the repeated decode gaps that follow.

TTFT (time to first token)

TTFT measures how long it takes before the first output token appears [6] (Metrics, https://docs.vllm.ai/en/stable/design/metrics/). At the model-kernel level, TTFT is dominated by the prefill phase. In a real serving stack, end-to-end TTFT also includes tokenization, queueing, scheduling, and network overhead. Once those are under control, TTFT grows roughly linearly with prompt length over moderate ranges; for very long prompts on full-attention layers, the attention term, which is quadratic in prompt length, takes a growing share.

This metric is highly visible to users. It matters most for interactive applications like chat interfaces, voice assistants, and real-time code completion, where a delay of even a second can feel sluggish.
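A rough way to see how prompt length feeds TTFT is the common approximation that a dense decoder forward pass costs about 2 FLOPs per parameter per token. The sketch below applies it as a compute-only floor for prefill. Everything numeric here is an assumption: the 27e9 parameter count, the 989 TFLOP/s peak, and the 50% utilization figure. It also ignores the attention term, queueing, tokenization, and network time.

```python
def prefill_seconds_lower_bound(
    params: float, prompt_tokens: int, peak_flops_per_s: float, utilization: float = 1.0
) -> float:
    """Compute-only floor for prefill: about 2 FLOPs per parameter per prompt token."""
    flops = 2 * params * prompt_tokens
    return flops / (peak_flops_per_s * utilization)

PARAMS = 27e9  # assumption: parameter count implied by the model name
PEAK_FLOPS = 989e12  # assumption: approximate dense BF16 peak, check the datasheet

for prompt_tokens in (1_000, 8_000, 32_000):
    ideal = prefill_seconds_lower_bound(PARAMS, prompt_tokens, PEAK_FLOPS)
    derated = prefill_seconds_lower_bound(PARAMS, prompt_tokens, PEAK_FLOPS, utilization=0.5)
    print(f"{prompt_tokens:>6,} tokens: ideal {ideal * 1000:.0f} ms, at 50% utilization {derated * 1000:.0f} ms")
```

The estimate is only a floor for the prefill kernel time. Measured TTFT on a loaded server can be far higher because of the other components listed above.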

TPS (tokens per second)

TPS (also called "decode throughput") measures how fast the model generates subsequent output tokens after the first token is produced. For low-batch decode, memory bandwidth is often the dominant constraint. Exact results depend on model architecture, quantization, parallelism, engine, context length, and batch shape; measure the deployed configuration rather than using a generic tokens-per-second claim. Production systems use batched inference, often with continuous batching [7] (Orca: A Distributed Serving System for Transformer-Based Generative Models, https://www.usenix.org/conference/osdi22/presentation/yu), to improve aggregate throughput across requests.

ITL (inter-token latency)

ITL is the time elapsed between consecutive output tokens [6] (Metrics, https://docs.vllm.ai/en/stable/design/metrics/). For a single request running in isolation, mean ITL is the inverse of decode TPS ( $ITL = 1/TPS$ , in seconds per token), so 25 tokens per second corresponds to 40 ms between tokens.

Higher ITL makes streamed output feel choppier. Under heavy batching, ITL can increase because more requests are contending for the same GPU resources. Set alert thresholds from product measurements rather than copying one generic value.

TPOT (time per output token)

While ITL is the individual gap between adjacent streamed tokens, TPOT is the average time per generated output token after the first token. In practice, TPOT is often computed as the mean of a request's ITL values or as an aggregate benchmark statistic across requests [6] (Metrics, https://docs.vllm.ai/en/stable/design/metrics/).

Monitoring both metrics matters: ITL exposes jitter and stalls in the stream, while TPOT summarizes overall decode pacing. Under load, prefill interruptions, scheduling, and batching contention can make both worse.
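The difference between the two metrics shows up as soon as a stream stalls once. The sketch below uses made-up timestamps with a single long gap; the helper name and the numbers are illustrative only.

```python
def itl_and_tpot(token_times_ms: list[float]) -> tuple[list[float], float]:
    """Return the per-gap ITLs and their mean (one common way to compute TPOT)."""
    itls = [later - earlier for earlier, later in zip(token_times_ms, token_times_ms[1:])]
    tpot = sum(itls) / len(itls)
    return itls, tpot

# Hypothetical stream: steady 35 ms gaps with one 210 ms stall in the middle.
token_times_ms = [320, 355, 390, 600, 635, 670]
itls, tpot = itl_and_tpot(token_times_ms)

print("ITLs:", itls, "ms")
print(f"TPOT: {tpot:.0f} ms")
print("worst ITL:", max(itls), "ms")
```

Here the mean doubles because of one gap, while the maximum ITL points directly at the stall. That is why a distribution or tail statistic for ITL is worth tracking next to TPOT.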

The four metrics are summarized below:

Metric	What it measures	Phase	What drives it	Useful aggregation
TTFT	Time until first output token appears	Prefill path	Prompt length, model size, queueing	Median and tail latency
TPS	Speed of token generation after the first	Decode	Memory bandwidth, batching	Per-request and aggregate rate
ITL	Time between consecutive tokens	Decode	Scheduling and contention	Distribution of token gaps
TPOT	Average time per output token after first	Decode	Scheduling, batching, contention	Request or benchmark mean

An alert can route to the relevant investigation path without pretending to diagnose root cause by itself:

route-latency-investigation.py

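def investigate(ttft_p95_ms: int, itl_p95_ms: int) -> str:
    # Check the stream first so a choppy decode always routes to the decode path.
    if itl_p95_ms > 120:
        return "inspect decode scheduling and memory pressure"
    # A slow first token routes to the prefill path even when ITL is only mildly elevated.
    if ttft_p95_ms > 900:
        return "inspect queueing and prefill"
    return "within example thresholds"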

print("long initial pause:", investigate(ttft_p95_ms=1100, itl_p95_ms=60))
print("choppy stream:", investigate(ttft_p95_ms=350, itl_p95_ms=160))

Output

long initial pause: inspect queueing and prefill
choppy stream: inspect decode scheduling and memory pressure

Back-of-the-envelope: a bandwidth upper bound

For a single low-batch decode stream, TPS is often constrained by memory bandwidth. To generate one token, the GPU roughly has to stream the model weights from HBM. A rough estimate:

$\text{Max TPS} \approx \frac{\text{HBM Bandwidth}}{\text{Model size (bytes)}}$

The BF16 dense weights of Qwen3.6-27B are about 54 GB, so they fit on one H100-80GB before KV cache, runtime buffers, and allocator headroom [1] (Qwen3.6-27B, https://huggingface.co/Qwen/Qwen3.6-27B). On one H100 SXM GPU, the idealized weight-read ceiling uses the card's 3.35 TB/s HBM bandwidth before runtime losses [3] (H100 GPU, https://www.nvidia.com/en-us/data-center/h100/).

$\text{Ideal weight-read bound} \approx \frac{3{,}350 \text{ GB/s}}{54 \text{ GB}} \approx 62 \text{ tokens/sec}$

This is an optimistic bandwidth bound, not a throughput promise. It omits KV cache reads during attention, activation memory, kernel overhead, less-than-peak sustained bandwidth, scheduling, and any interconnect communication if the model is sharded. Benchmark the selected engine and parallelism layout to get real TPS. Batched inference can improve aggregate throughput because multiple sequences share weight reads across concurrent decode work.

estimate-bandwidth-bound.py

def ideal_weight_stream_tps(
    model_gb: float, bandwidth_gb_per_s: float, tensor_parallel_gpus: int
) -> float:
    aggregate_bandwidth = bandwidth_gb_per_s * tensor_parallel_gpus
    return aggregate_bandwidth / model_gb

model_gb = 54
h100_capacity_gb = 80
tensor_parallel_gpus = 1

print("fits on one H100-80GB:", model_gb <= h100_capacity_gb)
bound = ideal_weight_stream_tps(model_gb, 3350, tensor_parallel_gpus)
print(f"single-GPU ideal weight-read bound: {bound:.1f} tokens/s")

Output

fits on one H100-80GB: True
single-GPU ideal weight-read bound: 62.0 tokens/s
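The same bound can be extended to show why batching raises aggregate throughput. This sketch assumes each decode step reads the weights once and every active sequence's KV cache once. The 0.27 GB per sequence is the 4,096-token GQA figure worked out in the KV cache section; the function name is hypothetical, and compute limits and memory capacity are ignored.

```python
def batched_decode_bound_tps(
    weights_gb: float, kv_gb_per_sequence: float, batch_size: int, bandwidth_gb_per_s: float
) -> float:
    """Idealized aggregate tokens/s when one step reads the weights once plus each sequence's KV cache."""
    gb_per_step = weights_gb + batch_size * kv_gb_per_sequence
    steps_per_second = bandwidth_gb_per_s / gb_per_step
    return steps_per_second * batch_size  # each step emits one token per sequence

for batch_size in (1, 8, 32):
    aggregate = batched_decode_bound_tps(54, 0.27, batch_size, 3350)
    per_stream = aggregate / batch_size
    print(f"batch {batch_size}: aggregate {aggregate:.0f} tokens/s, per stream {per_stream:.1f} tokens/s")
```

Aggregate throughput rises with batch size because the weight read is shared, while the per-stream rate falls slowly as KV reads accumulate. At large enough batches the workload stops being bandwidth-bound, so this model no longer applies.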

Research note: For low-batch decode, quantization can move the weight-streaming bound by reducing bytes read per token. PagedAttention [5] (Efficient Memory Management for Large Language Model Serving with PagedAttention, https://arxiv.org/abs/2309.06180) improves KV-cache packing and sharing, while continuous batching [7] (Orca: A Distributed Serving System for Transformer-Based Generative Models, https://www.usenix.org/conference/osdi22/presentation/yu) improves utilization across requests. These techniques address related serving constraints, but they aren't interchangeable.
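As a quick illustration of how weight precision moves the idealized bound, the sketch below recomputes it for 16-, 8-, and 4-bit weights. It assumes roughly 27 billion parameters (implied by the 54 GB BF16 figure) and ignores quantization scales, dequantization cost, and KV reads.

```python
def weight_read_bound_tps(
    params_billion: float, bits_per_weight: int, bandwidth_gb_per_s: float
) -> float:
    """Idealized single-stream bound: bandwidth divided by bytes of weights read per token."""
    weights_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_per_s / weights_gb

for bits in (16, 8, 4):
    bound = weight_read_bound_tps(27, bits, 3350)  # 27 is an assumed parameter count in billions
    print(f"{bits}-bit weights: {bound:.0f} tokens/s ideal bound")
```

The bound scales inversely with bytes per weight, which is why weight quantization helps low-batch decode specifically. Whether quality holds at a given precision is a separate, model-specific question.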

The KV cache: the dynamic capacity bottleneck

After weights and runtime buffers are resident, the limit on how many active sequences a deployment can admit is often remaining memory capacity, specifically the memory required for their KV cache.

What is the KV cache?

During attention, each layer computes Key and Value projections for every token. Without caching, each decode step would have to rerun the full prefix through the model and recompute old K/V tensors again and again. That repeated work gets expensive fast as the sequence grows.

Mental model (attention state): The KV cache is the reusable attention state for tokens already seen. For each processed token, the model stores key routing vectors (K) and value content vectors (V). When the next token attends to earlier context, the model reads those tensors instead of re-deriving the whole prefix from scratch. The cache grows with each token, and its size determines how many concurrent sequences fit in GPU memory.

The KV cache stores these K and V tensors. As the sequence grows, the KV cache accumulates data for each token to avoid redundant calculations. This step-by-step accumulation allows the model to compute attention for only the newest token against the historical cache. The trace below shows how the cache expands with each generated token:

text

Token 1: Compute K₁, V₁ → Store in cache
Token 2: Compute K₂, V₂ → Store; Attend to [K₁,K₂], [V₁,V₂]
Token 3: Compute K₃, V₃ → Store; Attend to [K₁,K₂,K₃], [V₁,V₂,V₃]
...
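The trace above can be checked with a toy single-head attention in pure Python. This is a minimal sketch, not a model layer: `project` is a hypothetical stand-in for learned K, V, and Q projections, and there is only one layer and one head. It shows that attending over cached K/V gives the same outputs as recomputing the whole prefix at every step.

```python
import math

def attend(query: list[float], keys: list[list[float]], values: list[list[float]]) -> list[float]:
    """Scaled dot-product attention for one query over all stored keys and values."""
    scale = 1 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    return [
        sum(w * value[i] for w, value in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]

def project(token: int, offset: float) -> list[float]:
    """Toy deterministic projection (hypothetical; stands in for learned weights)."""
    return [math.sin(token * (i + 1) + offset) for i in range(4)]

tokens = [3, 1, 4, 1, 5]

# With a cache: project each token once, append, then attend over everything stored.
k_cache, v_cache, cached_outputs = [], [], []
for token in tokens:
    k_cache.append(project(token, offset=0.1))
    v_cache.append(project(token, offset=0.2))
    cached_outputs.append(attend(project(token, offset=0.3), k_cache, v_cache))

# Without a cache: rebuild K and V for the whole prefix at every step.
recomputed_outputs = []
for step in range(1, len(tokens) + 1):
    prefix = tokens[:step]
    keys = [project(t, offset=0.1) for t in prefix]
    values = [project(t, offset=0.2) for t in prefix]
    recomputed_outputs.append(attend(project(prefix[-1], offset=0.3), keys, values))

print("same outputs:", cached_outputs == recomputed_outputs)
print("K/V projections with cache:", 2 * len(tokens))
print("K/V projections without cache:", sum(2 * step for step in range(1, len(tokens) + 1)))
```

The outputs match, but the uncached path performs a number of K/V projections that grows quadratically with sequence length, while the cached path grows linearly and pays for it in stored memory.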

To manage this growing memory efficiently, systems like vLLM use PagedAttention [5] (Efficient Memory Management for Large Language Model Serving with PagedAttention, https://arxiv.org/abs/2309.06180), which divides the KV cache into fixed-size blocks (pages) similar to operating system virtual memory. Live sequences no longer need one long contiguous reservation each. This reduces fragmentation and over-reservation; it doesn't eliminate partially filled tail blocks or metadata overhead.
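The block accounting can be sketched without any engine code. The block size of 16 tokens and the 4,096-token maximum below are illustrative assumptions, and the helper names are hypothetical; this models only slot counts, not a real allocator.

```python
import math

BLOCK_TOKENS = 16  # assumption: illustrative block size; engines make this configurable

def blocks_needed(tokens: int, block_tokens: int = BLOCK_TOKENS) -> int:
    """Fixed-size blocks required to hold a sequence's KV entries."""
    return math.ceil(tokens / block_tokens)

def tail_waste_tokens(tokens: int, block_tokens: int = BLOCK_TOKENS) -> int:
    """Unused token slots in the last, partially filled block."""
    return blocks_needed(tokens, block_tokens) * block_tokens - tokens

for tokens in (1, 100, 4096):
    print(f"{tokens} tokens: {blocks_needed(tokens)} blocks, {tail_waste_tokens(tokens)} unused tail slots")

# Compare with reserving a contiguous region sized for a hypothetical 4,096-token maximum.
max_context = 4096
print("contiguous reservation, unused slots for a 100-token sequence:", max_context - 100)
```

Paging bounds the waste per sequence to less than one block, whereas a contiguous reservation sized for the maximum context can leave most of its slots unused for short sequences.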

KV cache memory formula

For a single sequence:

$\text{KV Cache} = 2 \times L \times n_{kv} \times d_h \times s \times b$

Reading the formula: for every layer ( $L$ ), every KV head ( $n_{kv}$ ), and every position in the sequence ( $s$ ), we store a Key vector and a Value vector (the "2") of dimension $d_h$ , with each element taking $b$ bytes. Multiply it all together and this cache can easily reach gigabytes for long sequences.

Where:

$L$ = number of layers
$n_{kv}$ = number of KV heads, reduced with GQA/MQA (Grouped-Query/Multi-Query Attention) [8] (GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, https://arxiv.org/abs/2305.13245)
$d_h$ = head dimension
$s$ = sequence length
$b$ = bytes per element (2 for FP16 or BF16, 1 for an 8-bit cache such as INT8 or FP8)

Concrete example: Qwen3.6-27B-style GQA sizing

Parameter	Value
Full-attention layers ( $L$ )	16 (64 total; 48 linear-attention layers use a different state)
KV heads ( $n_{kv}$ )	4 (GQA, not 24 query heads!)
Head dim ( $d_h$ )	256
Sequence length ( $s$ )	4,096
Dtype	FP16 (2 bytes)

$\begin{aligned} \text{KV Cache} &= 2 \times 16 \times 4 \times 256 \times 4096 \times 2 \text{ bytes} \\ &= 268{,}435{,}456 \text{ bytes} \\ &\approx \textbf{0.27 GB} \; (\textbf{0.25 GiB}) \textbf{ per sequence} \end{aligned}$

The formula multiplies $2$ (for K and V) by $16$ full-attention layers, $4$ KV heads, a $256$ head dimension, a $4096$ sequence length, and $2$ bytes per value (for FP16). Qwen3.6-27B has 64 total layers, but only its 16 Gated Attention blocks store standard growing K/V tensors; the other 48 linear-attention layers keep separate fixed-size state [1] (Qwen3.6-27B, https://huggingface.co/Qwen/Qwen3.6-27B). With GQA, this is 6× smaller than it would be if every one of the 24 query heads stored separate K/V tensors in those same 16 layers. Without GQA, the same geometry would need about 1.61 GB (1.5 GiB) per sequence.

calculate-kv-cache-footprint.py

def kv_cache_bytes(
    layers: int, kv_heads: int, head_dim: int, tokens: int, bytes_per_value: int
) -> int:
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

gqa_bytes = kv_cache_bytes(16, 4, 256, 4096, 2)
mha_bytes = kv_cache_bytes(16, 24, 256, 4096, 2)

print(f"GQA cache: {gqa_bytes / 1e9:.2f} GB")
print(f"MHA cache: {mha_bytes / 1e9:.2f} GB")
print("MHA / GQA:", mha_bytes // gqa_bytes)

Output

GQA cache: 0.27 GB
MHA cache: 1.61 GB
MHA / GQA: 6

KV-cache charts show memory growth with context length and KV head count. — KV memory grows linearly with context length. GQA helps because the cache stores K/V heads, not every query head.

Production note: Serving Qwen3.6-27B to many concurrent users still requires precise KV-cache budgeting. The BF16 weights are about 54 GB, so one H100-80GB can hold raw weights, but that leaves limited room for KV cache, runtime buffers, prefix cache, and allocator overhead. Long native context support is useful only if the serving policy reserves enough memory for the active prompt-plus-output tokens [1] (Qwen3.6-27B, https://huggingface.co/Qwen/Qwen3.6-27B).

Try it yourself: A colleague says you can serve a 7B model (32 layers, 8 KV heads, 128 head dimension, FP16) to 200 concurrent users on a single 80 GB GPU. The model weights take about 14 GB. Assume each user holds 2,048 active tokens, then use the formula to see why summing raw weights and KV memory isn't enough for an admission decision.

check-capacity-with-runtime-headroom.py

def kv_gb_per_sequence(tokens: int) -> float:
    values = 2 * 32 * 8 * 128 * tokens
    return values * 2 / 1e9

users = 200
raw_total_gb = 14 + users * kv_gb_per_sequence(tokens=2048)
print(f"raw weights plus KV: {raw_total_gb:.2f} GB")
for reserve_gb in (8, 16):
    admitted = raw_total_gb + reserve_gb <= 80
    print(f"with {reserve_gb} GB runtime reserve: {admitted}")

Output

raw weights plus KV: 67.69 GB
with 8 GB runtime reserve: True
with 16 GB runtime reserve: False

The raw calculation leaves only a narrow margin. Whether 200 active sequences fit depends on measured activation, workspace, allocator, and fragmentation headroom for the actual serving engine; don't turn an unmeasured reserve into a promised concurrency count.

Dynamic token budgeting

In production environments, context length isn't a static limit determined solely by the model's architecture. It's also a dynamic memory budget that dictates how many concurrent users your system can support. Every additional active token required by one user reduces the available GPU memory (VRAM, Video RAM) for everyone else.

To serve models at scale, inference engines have to enforce these budgets strictly. When a request comes in, the system checks available GPU memory. If the required KV cache for the new request plus existing work exceeds remaining capacity, the scheduler must queue, reject, or preempt work according to policy rather than overcommit GPU memory. That's why schedulers track memory pressure right alongside latency metrics.

To calculate a maximum affordable active-sequence length, write a simple capacity-planning function. The function takes total GPU memory, the model's static weight footprint, a runtime reserve, architectural parameters (layers, KV heads, and head dimension), and the number of concurrent users as inputs. It subtracts weights and reserve from total memory, splits the remainder evenly across users, and divides by per-token KV-cache size, returning an upper bound for each user's prompt plus output tokens:

dynamic-token-budgeting.py

def max_context_for_budget(
    gpu_memory_gb: float,
    model_memory_gb: float,
    runtime_reserve_gb: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # FP16
    num_concurrent: int = 1,
) -> int:
    """Quick planning estimate using decimal GB for consistency with GPU datasheets."""
    available_memory = (gpu_memory_gb - model_memory_gb - runtime_reserve_gb) * 1e9

    # Memory per token in KV cache
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

    # Divide by concurrent users
    budget_per_user = available_memory / num_concurrent

    return int(budget_per_user / bytes_per_token)

# Example: Qwen3.6-27B-style dimensions on one H100-80GB
max_tokens = max_context_for_budget(
    gpu_memory_gb=80,
    model_memory_gb=54,  # BF16 dense weights
    runtime_reserve_gb=8,  # engine buffers, workspaces, and allocator margin
    num_layers=16,  # Qwen3.6-27B: 16 full-attention layers
    num_kv_heads=4,  # GQA: 4 KV heads (not 24 query heads!)
    head_dim=256,
    num_concurrent=50,
)
print(f"max context per user: {max_tokens:,} tokens")

Output

max context per user: 5,493 tokens

This calculation sets deployment limits. The example gives 50 users about 5,493 tokens each; doubling to 100 users on the same GPU halves that to about 2,746 tokens each. If that's too little for your workload, you might need to add another GPU node, reduce the model precision using quantization, or implement stricter context window limits at the application layer. Treat the result as an upper bound, not a safe production limit: the 8 GB runtime reserve is a placeholder, and the real headroom for activations, communication buffers, allocator slack, and the serving runtime itself has to be measured.

Common mistake: Using the query head count instead of the KV head count. Qwen3.6-27B lists 24 query heads but 4 KV heads for its full-attention path [1] (Qwen3.6-27B, https://huggingface.co/Qwen/Qwen3.6-27B). If you plug 24 into the formula, you get a 6× memory overestimate and a pessimistic concurrency plan. Check the model config for num_key_value_heads, not num_attention_heads alone.
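One way to avoid this mistake is to read the head count through a small helper. The sketch below uses a hypothetical config dictionary built from the numbers quoted above; the `head_dim` key and the helper names are assumptions for illustration, and real config layouts vary by model family.

```python
def kv_heads_from_config(config: dict) -> int:
    """Use the KV head count when present; fall back to query heads for plain multi-head attention."""
    return config.get("num_key_value_heads") or config["num_attention_heads"]

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Per-token KV footprint: K and V for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

# Hypothetical config excerpt using the head counts quoted in the text.
config = {"num_attention_heads": 24, "num_key_value_heads": 4, "head_dim": 256}
full_attention_layers = 16  # taken from the sizing example, not read from the config

correct = kv_bytes_per_token(full_attention_layers, kv_heads_from_config(config), config["head_dim"])
mistaken = kv_bytes_per_token(full_attention_layers, config["num_attention_heads"], config["head_dim"])

print(f"correct: {correct:,} bytes/token")
print(f"mistaken: {mistaken:,} bytes/token ({mistaken // correct}x too large)")
```

The fallback to `num_attention_heads` only applies when a config has no separate KV head field, which is the plain multi-head case where the two counts are equal.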

An online admission check can reserve KV capacity using each request's prompt plus output token budget rather than admitting every request at the maximum architectural context:

admit-request-by-token-budget.py

BYTES_PER_TOKEN = 2 * 16 * 4 * 256 * 2

def kv_gb(tokens: int) -> float:
    return tokens * BYTES_PER_TOKEN / 1e9

def admit(existing_tokens: list[int], new_tokens: int, kv_budget_gb: float) -> bool:
    needed = sum(kv_gb(tokens) for tokens in existing_tokens) + kv_gb(new_tokens)
    return needed <= kv_budget_gb

active = [4096] * 40
print("admit 16K request:", admit(active, 16_384, kv_budget_gb=12))
print("admit 64K request:", admit(active, 65_536, kv_budget_gb=12))

Output

admit 16K request: True
admit 64K request: False

Production optimizations

Once you understand the two-phase bottleneck, the next question is how production systems work around it. Three practical techniques smooth tradeoffs between prefill and decode or increase useful serving capacity.

Chunked prefill

On shared serving hardware, large prefills can delay decode operations, creating a TTFT-TPS tradeoff: prioritizing new prefills can stall existing decode streams.

Chunked prefill splits long prompts into smaller chunks and interleaves them with decode steps.

Analogy (factory assembly line): Without chunked prefill, it's like shutting down the entire factory assembly line to set up for a new product. All existing products stop moving while you retool. With chunked prefill, you retool one station at a time while the rest of the line keeps running. Existing requests keep flowing (decode continues) while the new request is gradually set up (prefilled in chunks).

The timeline below shows chunked prefill avoiding decode stalls by breaking up the massive prefill block. By interleaving smaller prefill chunks with ongoing decode steps, the system maintains a steady flow of output tokens for existing users while gradually processing the new prompt:

text

Without chunked prefill:
  [Prefill 10K tokens ===========================] [Decode...Decode...Decode...]
  ↑ All decode requests stall during this prefill

With chunked prefill (chunk=2048):
  [Prefill chunk1][Decode][Prefill chunk2][Decode][Prefill chunk3][Decode]...
  ↑ Decode requests continue between chunks

Benefits

Chunked scheduling can improve GPU utilization by mixing compute-heavy prefill with memory-heavy decode, while protecting streaming cadence and tail latency. vLLM documents chunked prefill as a scheduling optimization [9] (Optimization and Tuning, https://docs.vllm.ai/en/latest/configuration/optimization.html), and systems like Sarathi-Serve [10] (Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills, https://arxiv.org/abs/2308.16369) study it explicitly.

interleave-prefill-with-active-decode.py

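def chunked_schedule(prompt_tokens: int, chunk_tokens: int) -> list[str]:
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")  # otherwise the loop never advances
    actions: list[str] = []
    remaining = prompt_tokens
    while remaining > 0:
        processed = min(chunk_tokens, remaining)
        actions.append(f"prefill {processed}")
        remaining -= processed
        # Give already-running streams one decode step between prefill chunks.
        actions.append("decode active streams")
    return actions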

for action in chunked_schedule(prompt_tokens=6144, chunk_tokens=2048):
    print(action)

Output

prefill 2048
decode active streams
prefill 2048
decode active streams
prefill 2048
decode active streams

Prefill-decode disaggregation

In a standard colocated setup, the same serving worker or GPU pool handles both prefill and decode for its assigned requests. Because long-prompt prefill is often compute-bound while low-batch decode is often memory-bandwidth bound, colocating both can create head-of-line blocking and couple hardware sizing decisions. Modern systems (Splitwise [11] (Splitwise: Efficient Generative LLM Inference Using Phase Splitting, https://arxiv.org/abs/2311.18677), DistServe [12] (DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving, https://arxiv.org/abs/2401.09670), Mooncake [13] (Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving, https://arxiv.org/abs/2407.00079)) explore separating prefill and decode onto different GPU pools when that isolation benefit outweighs the KV-transfer cost:

Diagram showing prefill GPUs (often compute-bound; need high tensor-core throughput; scale with prompt load), a KV cache transfer, and decode GPUs (often memory-bandwidth-bound; need high HBM bandwidth and memory; scale with active users).

Architectural benefits

Less cross-phase interference: prefill bursts are less likely to stall decode
Independent scaling: add prefill GPUs for prompt-heavy workloads, decode GPUs for concurrent users
Hardware matching: use compute-optimized GPUs for prefill, high-bandwidth GPUs for decode

You do pay for moving KV state across the interconnect, so disaggregation is most attractive when prompts are long, traffic is bursty, or TTFT/ITL isolation matters more than the extra transfer overhead.

compare-disaggregation-overhead.py

def choose_layout(shared_phase_interference_ms: int, kv_transfer_ms: int) -> str:
    if kv_transfer_ms < shared_phase_interference_ms:
        return "separate prefill and decode pools"
    return "keep phases colocated"

print("bursty workload:", choose_layout(shared_phase_interference_ms=95, kv_transfer_ms=20))
print("small prompts:", choose_layout(shared_phase_interference_ms=8, kv_transfer_ms=20))

Output

bursty workload: separate prefill and decode pools
small prompts: keep phases colocated
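The `kv_transfer_ms` input above can itself be estimated from prompt length and link speed. The sketch below reuses the 65,536 bytes-per-token GQA geometry from the sizing example; the two link bandwidths are hypothetical placeholders, and the estimate ignores latency, protocol overhead, and any overlap of transfer with compute.

```python
BYTES_PER_TOKEN = 2 * 16 * 4 * 256 * 2  # K and V, 16 layers, 4 KV heads, 256 dims, 2 bytes

def kv_transfer_ms(prompt_tokens: int, bytes_per_token: int, link_gb_per_s: float) -> float:
    """Time to move one prompt's KV cache over a link, at full nominal bandwidth."""
    kv_bytes = prompt_tokens * bytes_per_token
    return kv_bytes / (link_gb_per_s * 1e9) * 1000

for prompt_tokens in (2_048, 32_768):
    for link_gb_per_s in (25, 400):  # hypothetical slow and fast interconnects
        ms = kv_transfer_ms(prompt_tokens, BYTES_PER_TOKEN, link_gb_per_s)
        print(f"{prompt_tokens:>6,} tokens over {link_gb_per_s} GB/s: {ms:.2f} ms")
```

Transfer time grows linearly with prompt length and shrinks with link bandwidth, so the same prompt can sit on either side of the layout decision depending on the interconnect.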

KV cache quantization

Store KV cache in an 8-bit format such as FP8 instead of FP16/BF16 to roughly halve the cache footprint. Research systems have demonstrated sub-8-bit KV cache quantization, including 3-bit and 2-bit methods, with model-specific quality evaluation required before deployment [14] (KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization, https://arxiv.org/abs/2401.18079) [15] (KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache, https://arxiv.org/abs/2402.02750). vLLM documents FP8 KV-cache support and scaling configuration; support and accuracy trade-offs depend on the engine and hardware [16] (Quantized KV Cache, https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache/). See our model quantization deep-dive for the techniques behind weight and activation quantization. For the raw payload, the 8-bit saving is:

$\text{KV Cache (8-bit)} = \frac{\text{KV Cache (FP16/BF16)}}{2}$

Reading the formula: an 8-bit K/V tensor payload uses 1 byte per value instead of 2 bytes (FP16/BF16), so its raw payload is halved for the same sequence length and concurrency. Scaling metadata and runtime buffers mean allocated memory savings may differ slightly.
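
To see why allocated savings can differ from the ideal figure, here is a minimal sketch that adds scale metadata to the payload. It assumes a hypothetical layout in which every group of `group_size` values shares one 2-byte scale. That layout is an illustration only, not a description of any particular engine's format.

```python
def effective_bytes_per_value(bits: int, group_size: int, scale_bytes: int = 2) -> float:
    """Payload bytes per value plus the amortized share of one scale per group."""
    return bits / 8 + scale_bytes / group_size


baseline = 2.0  # FP16/BF16 bytes per value
# Bit widths and group sizes below are assumed example settings.
for bits, group_size in ((8, 128), (4, 64), (2, 32)):
    per_value = effective_bytes_per_value(bits, group_size)
    print(f"{bits}-bit, group {group_size}: {baseline / per_value:.2f}x smaller than FP16")
```

With these assumed settings the 8-bit case lands just under 2×, and the gap to the ideal ratio widens as the bit width shrinks, because the fixed-size scale becomes a larger share of each group.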

While quantizing weights reduces the static memory footprint of the model, quantizing the KV cache specifically attacks the dynamic memory bottleneck that limits concurrency. Some serving engines now support KV cache quantization directly. If KV memory is the dominant constraint, moving from 16-bit to 8-bit caching can come close to doubling concurrency on the same hardware. In practice, the gain is smaller once you account for model weights, allocator overhead, and other runtime buffers.

The snippet below reuses the Qwen3.6-27B-style GQA dimensions from earlier and assumes each active user reserves 5,120 prompt-plus-output tokens of FP16 KV state:

estimate-kv-quantization-gain.py

def users_from_kv_budget(kv_budget_gb: float, gb_per_user: float) -> int:
    return int(kv_budget_gb / gb_per_user)

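tokens_per_user = 5120
# K and V x 16 full-attention layers x 4 KV heads x head dim 256 x tokens x 2 bytes
fp16_gb_per_user = 2 * 16 * 4 * 256 * tokens_per_user * 2 / 1e9
fp8_gb_per_user = fp16_gb_per_user / 2
kv_budget_gb = 80  # KV-only budget; weights and runtime buffers are extra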

print("FP16 users from KV budget:", users_from_kv_budget(kv_budget_gb, fp16_gb_per_user))
print("FP8 users from KV budget:", users_from_kv_budget(kv_budget_gb, fp8_gb_per_user))
print("runtime headroom still required:", True)

Output

FP16 users from KV budget: 238
FP8 users from KV budget: 476
runtime headroom still required: True

Mastery check

Key concepts

Prefill vs decode as two separate inference phases
TTFT, TPS, inter-token latency (ITL), and time per output token (TPOT)
Arithmetic intensity and why the bottleneck often flips after token 1
KV-cache memory formula and why GQA changes the head count
Context length as a concurrency budget, not a model-card limit alone
Chunked prefill, prefill-decode disaggregation, and KV-cache quantization

Evaluation rubric

Foundational: Explains why TTFT ends at token 1 and why decode remains sequential after that.
Intermediate: Diagnoses whether a user complaint points to prefill latency or decode pacing.
Intermediate: Derives the KV-cache formula and uses num_key_value_heads rather than the full query-head count.
Advanced: Explains why prefill is usually compute-bound while single-stream decode is usually memory-bandwidth-bound.
Advanced: Estimates whether a deployment fits by combining model weights, KV cache, and runtime headroom.
Advanced: Chooses among chunked prefill, disaggregation, or KV-cache quantization based on the real bottleneck.

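Follow-up questions

Why is decode memory-bandwidth bound rather than compute-bound for a single active stream?
How does GQA reduce KV-cache requirements compared with full multi-head attention?
How would you estimate hardware for Qwen3.6-27B serving many concurrent users at long context?
When is prefill-decode disaggregation worth its extra KV-transfer cost?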

Common pitfalls

"LLMs generate tokens one at a time"

Symptom: You describe the whole request as sequential and then can't explain the large pause before token 1.
Cause: Decode is sequential, but prefill processes the full prompt in parallel before streaming begins.
Fix: Split the request into two phases every time you reason about latency: prefill for first-token delay, decode for streaming cadence.

"More GPUs always means faster generation"

Symptom: A team adds more GPUs and expects single-request TPS to scale linearly.
Cause: Single-stream decode is often limited by memory bandwidth and communication overhead, not raw compute alone.
Fix: Ask what bottleneck you're relieving. If decode is bandwidth-bound, look at HBM bandwidth, scheduling, quantization, or a different parallelism strategy before adding more devices.

"A bandwidth estimate ignores fit and headroom"

Symptom: A bandwidth calculation divides one H100's bandwidth by a model's weight footprint and stops there.
Cause: The estimate ignores capacity headroom: weights may fit, but KV cache, runtime buffers, prefix cache, and fragmentation still share the device.
Fix: Choose a feasible memory plan first, then estimate bandwidth and account for cache traffic, interconnect overhead, and measured kernel performance.

"Context length is only a model limitation"

Symptom: Product plans assume every user can use the model's full architectural context without affecting concurrency.
Cause: The architectural limit and the production memory budget were treated as the same thing.
Fix: Turn context policy into a capacity calculation. Budget KV state per user, then cap prompt and output lengths to preserve concurrency headroom.

"TTFT and TPS improve together"

Symptom: The system optimizes first-token latency aggressively, but active streams become choppy.
Cause: Large prefills and smooth decode compete for the same GPU time, so improving one can hurt the other.
Fix: Measure TTFT and decode metrics separately. Then tune scheduling, such as chunked prefill, around the actual SLO you need to protect.

"Using query heads for KV-cache sizing"

Symptom: Capacity plans are off by a large factor and the service either looks impossibly expensive or crashes under load.
Cause: The formula used num_attention_heads instead of num_key_value_heads on a GQA model.
Fix: Check the model config directly. KV memory uses the stored K/V head count, not the full query-head count.

"Using total layer count on hybrid models"

Symptom: KV-cache estimates for Qwen3.6-27B are about 4× too high and concurrency plans look unrealistically tight.
Cause: The model has 64 total layers, but only 16 Gated Attention layers store standard growing K/V tensors.
Fix: Count only layers that actually materialize sequence-length KV cache in your serving runtime, not every block in the stack.
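
The two sizing pitfalls above compound. A short sketch, reusing the KV-cache formula with the Qwen3.6-27B-style dimensions cited in this article (the helper name `kv_cache_gb` and the 4,096-token sequence are illustrative choices):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int = 256,
                tokens: int = 4096, bytes_per_value: int = 2) -> float:
    """KV cache for one sequence: K and V, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 1e9


correct = kv_cache_gb(layers=16, kv_heads=4)      # full-attention layers, KV heads
all_layers = kv_cache_gb(layers=64, kv_heads=4)   # pitfall: total layer count
both_wrong = kv_cache_gb(layers=64, kv_heads=24)  # pitfall: query heads as well

print(f"correct: {correct:.2f} GB")
print(f"total layers: {all_layers:.2f} GB ({all_layers / correct:.0f}x)")
print(f"total layers + query heads: {both_wrong:.2f} GB ({both_wrong / correct:.0f}x)")
```

Because the formula is a product, the errors multiply: a 4× layer mistake and a 6× head mistake together inflate the estimate 24×.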

Bringing it together

Put the request path together. When a request arrives, the engine runs a prefill pass over the prompt that is usually compute-heavy and drives model-side TTFT. The system then settles into a decode loop that generates one token at a time, is often memory-bandwidth-bound at low batch sizes, and drives streamed TPS. The KV cache bridges the two phases and grows with every token, so remaining memory capacity often limits concurrency before raw compute does.

If you can explain why a long prompt hurts TTFT more than TPS, why a low-batch bandwidth-bound decode path needs more effective HBM bandwidth rather than more TFLOPS, and how to estimate whether 100 concurrent users fit on your GPU cluster, you're already ahead of most candidates in an AI infrastructure interview.
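
As a closing sketch of that last estimate, the check below combines weights, per-user KV state, and a runtime reserve. The 4,096-token per-user budget, the 8 GB per-GPU reserve, and the assumption that weights shard evenly across GPUs with no replication are illustrative planning assumptions, not measured values.

```python
def fits(users: int, tokens_per_user: int, gpus: int,
         gpu_gb: float = 80, weights_gb: float = 54,
         reserve_gb_per_gpu: float = 8) -> bool:
    """Capacity check: do weights, KV cache, and reserve fit in pooled memory?"""
    # Qwen3.6-27B-style KV bytes per token: K and V, 16 layers, 4 KV heads,
    # head dim 256, 2 bytes per value.
    kv_gb = users * tokens_per_user * (2 * 16 * 4 * 256 * 2) / 1e9
    needed = weights_gb + kv_gb + reserve_gb_per_gpu * gpus
    return needed <= gpu_gb * gpus


for gpus in (1, 2):
    print(f"{gpus} x H100-80GB, 100 users at 4,096 tokens:", fits(100, 4096, gpus))
```

Under these assumptions one GPU falls short (about 89 GB needed against 80 GB) while two have room to spare. The check covers capacity only; it says nothing about whether decode bandwidth meets the latency target.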

Next Step

Continue to Multi-Query & Grouped-Query Attention

The KV cache analysis you just did explains why reducing the number of key/value heads saves so much memory at scale. The next article covers MQA and GQA, techniques that cut memory usage while preserving model quality.


References

[1] Qwen Team (2026). Qwen3.6-27B. https://huggingface.co/Qwen/Qwen3.6-27B

[2] Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures.

[3] NVIDIA (2026). H100 GPU.

[4] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.

[5] Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. https://arxiv.org/abs/2309.06180

[6] vLLM (2026). Metrics.

[7] Yu, G.-I., et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022. https://www.usenix.org/conference/osdi22/presentation/yu

[8] Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023. https://arxiv.org/abs/2305.13245

[9] vLLM (2026). Optimization and Tuning.

[10] Agrawal, A., et al. (2023). Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv preprint.

[11] Patel, P., et al. (2023). Splitwise: Efficient Generative LLM Inference Using Phase Splitting.

[12] Zhong, Y., et al. (2024). DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. OSDI 2024.

[13] Qin, Y., et al. (2024). Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving.

[14] Hooper, C., Kim, S., Gholami, A., et al. (2024). KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. arXiv preprint. https://arxiv.org/abs/2401.18079

[15] Liu, Z., Chen, B., Hu, X., et al. (2024). KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. arXiv preprint. https://arxiv.org/abs/2402.02750

[16] vLLM Team (2026). Quantized KV Cache. vLLM Documentation. https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache/
