
Inference: TTFT, TPS & KV Cache

Understand the two-phase inference process (prefill vs decode), derive the KV cache memory formula, and learn production optimizations like chunked prefill and disaggregation.


When you send a prompt to ChatGPT and see the response appear word by word, you're watching inference: the process of a trained model generating output. But why does it stream one word at a time? Why is the first word slow and the rest faster? And why does a longer conversation make everything slower? Understanding these mechanics (the two-phase process, the memory bottleneck, and the key performance metrics) is essential for anyone building or optimizing production LLM systems.

💡 Key insight: LLM inference has two distinct phases (prefill and decode) with fundamentally different hardware bottlenecks. Knowing which phase is the bottleneck, and why, is the foundation for every optimization in this space, from KV cache management to continuous batching.


The two phases of LLM inference

Every LLM request goes through two distinct computational phases with fundamentally different hardware bottlenecks:

๐Ÿฝ๏ธ Analogy, restaurant kitchen: Prefill is like a chef reading the entire recipe at once, gathering all ingredients, and prepping the mise en place. It's intense upfront work but highly parallelizable (multiple sous chefs can chop simultaneously). Decode is like plating one dish at a time in a specific order: each plate (token) must wait for the previous one, and the bottleneck is how fast you can carry ingredients from the fridge (memory bandwidth), not how fast you can cook (compute).

Diagram: LLM inference two-phase pipeline. Prefill (compute-bound, processes all input tokens in parallel to build the KV cache) followed by Decode (memory-bandwidth-bound, generates tokens one at a time). A KV Cache Handoff bridge connects the two phases. Key metrics shown: TTFT, TPS, KV cache size, and HBM bandwidth bottleneck.

Phase 1: Prefill (processing the prompt)

The model processes your entire input prompt in parallel. Every token is attended to simultaneously in a single forward pass. This is compute-bound, limited by GPU FLOPS, not memory bandwidth. For example, when starting generation, the prefill phase processes the input text to produce the first token:

text
Input: "Explain quantum computing in simple terms"
→ 6 tokens processed simultaneously
→ Produces KV cache entries for all 6 tokens
→ Produces logits (unnormalized probability scores) for the FIRST output token

The time from request to the first output token is called TTFT (Time to First Token). For a user staring at a chat interface, this is the most noticeable latency.

Use Case | TTFT Target | Why
Real-time voice | < 150 ms | Conversational flow
Code completion | < 200 ms | Developer productivity
Chat/conversational | < 500 ms | User patience
Batch processing | < 2 s | Background job

Phase 2: Decode (generating output tokens)

After producing the first token, the model generates subsequent tokens one at a time, autoregressively. Each new token requires a forward pass through the entire model, but only processes that single new token (reusing cached K/V from all previous tokens). This step-by-step process builds the response iteratively:

text
Step 1: "Quantum" → add to KV cache → forward pass → "computing"
Step 2: "computing" → add to KV cache → forward pass → "is"
Step 3: "is" → add to KV cache → forward pass → "like"
...

This phase is memory-bandwidth bound, not compute-bound. The bottleneck is reading the model weights and KV cache from GPU HBM (High Bandwidth Memory) for each token. The matrix multiplications are small (batch=1), so the GPU's arithmetic units are mostly idle, waiting for data to arrive.
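The cache-reuse loop can be sketched with a toy model. This is plain Python with no tensors; the `attention` helper and the scalar "projections" in `decode_step` are stand-ins invented for illustration, not a real architecture:

```python
import math

def attention(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the cached value vectors
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

def decode_step(x, k_cache, v_cache):
    """One decode step: project the new token, extend the cache, attend."""
    k = [xi * 0.5 for xi in x]  # stand-in for a learned K projection
    v = [xi * 2.0 for xi in x]  # stand-in for a learned V projection
    k_cache.append(k)           # cache grows by exactly one entry per token
    v_cache.append(v)
    return attention(x, k_cache, v_cache)

k_cache, v_cache = [], []
for token_embedding in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    out = decode_step(token_embedding, k_cache, v_cache)

print(len(k_cache))  # 3: one cached K per token, none ever recomputed
```

The point of the sketch: each step does O(1) new projection work and only the attention over the cached prefix grows with sequence length.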

The arithmetic intensity explanation

The key difference is arithmetic intensity (FLOPs per byte loaded from memory):

Phase | Tokens Processed | Matrix Size | Arithmetic Intensity | Bottleneck
Prefill | N (prompt) | Large batch matmul | High (many FLOPs/byte) | Compute (TFLOPS)
Decode | 1 at a time | Thin matmul (batch=1) | Low (few FLOPs/byte) | Memory bandwidth (TB/s)

On an H100 GPU: ~990 TFLOPS (Tera Floating Point Operations Per Second) compute, ~3.35 TB/s HBM bandwidth. During decode, the GPU is reading ~140GB of model weights to process a single token. The computation itself takes microseconds, but loading the weights takes milliseconds. This memory-bound nature is what motivates architectural changes like FlashAttention[1] to minimize HBM access.
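These numbers give a back-of-envelope roofline bound (an estimate under the stated assumptions, not a benchmark): since decode must stream every weight from HBM for each token, bandwidth alone caps single-request throughput.

```python
weights_gb = 140.0           # ~70B params in FP16 (2 bytes each)
hbm_bandwidth_gbps = 3350.0  # H100 HBM3, ~3.35 TB/s

# Lower bound on per-token latency: time to read all weights once
seconds_per_token = weights_gb / hbm_bandwidth_gbps
max_tps = 1.0 / seconds_per_token

print(f"{seconds_per_token * 1000:.1f} ms/token, ~{max_tps:.0f} TPS upper bound")
# ≈ 41.8 ms/token → ~24 TPS at batch=1, ignoring compute and KV cache reads
```

This assumes a hypothetical single GPU holding all 140 GB of weights; in practice tensor parallelism spreads the weights across GPUs and aggregates their bandwidth, which is how real deployments land in the 30–80 TPS range quoted below.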


Key performance metrics

Diagram: Latency breakdown of LLM inference. TTFT dominated by prefill compute; inter-token latency dominated by memory bandwidth for KV cache reads during decode.

TTFT (Time to First Token)

  • Measures prefill latency, how quickly the model starts responding
  • Dominated by prompt length and model size
  • Scales roughly linearly with prompt token count (for long prompts)
  • Critical for: interactive applications, voice assistants, code completion

TPS (Tokens Per Second), also called "decode throughput"

  • Measures how fast output tokens are generated after the first
  • Single request: typically 30–80 TPS for large models on high-end GPUs
  • Batched inference (often using continuous batching[2]): hundreds or thousands of aggregate TPS
  • Determined by memory bandwidth, not compute

ITL (Inter-Token Latency)

  • Time between consecutive output tokens: ITL = 1/TPS for a single request
  • Users perceive stuttering when ITL exceeds ~100 ms
  • Under batching, ITL increases as more requests share the GPU

TPOT (Time Per Output Token)

  • System-level metric accounting for batching overhead and scheduling
  • TPOT ≥ ITL due to scheduling delays and contention
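All four metrics fall out of per-token arrival timestamps. A minimal sketch with fabricated timings (note that exact TPOT conventions vary across serving stacks; this version divides total latency by the output token count):

```python
request_sent = 0.0
token_times = [0.45, 0.48, 0.51, 0.55, 0.58]  # seconds; first entry = first token

ttft = token_times[0] - request_sent
itls = [b - a for a, b in zip(token_times, token_times[1:])]
avg_itl = sum(itls) / len(itls)
decode_tps = 1.0 / avg_itl
# One TPOT convention: total generation time over all output tokens
tpot = (token_times[-1] - request_sent) / len(token_times)

print(f"TTFT={ttft:.2f}s  avg ITL={avg_itl * 1000:.1f}ms  TPS={decode_tps:.1f}")
```

With these fabricated timings, TTFT dominates (450 ms of prefill) while decode streams at roughly 30 TPS, which is the shape of a typical single-request trace.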

The KV cache: the dominant bottleneck

What is the KV cache?

During attention, each layer computes Key and Value projections for every token. Without caching, generating token N would require recomputing attention over all N−1 previous tokens from scratch, which is quadratic in sequence length.

📋 Analogy, exam reference sheet: The KV cache is like a reference sheet you build during an exam. For each question (token) you've already answered, you write down the key facts (K) and your reasoning (V) on the sheet. When the next question references a previous one, you glance at your reference sheet instead of re-deriving everything from scratch. The sheet grows with each question, and its size determines how many questions you can handle before running out of paper (GPU memory).

The KV cache stores these K and V tensors. As the sequence grows, the KV cache accumulates data for each token to avoid redundant calculations:

text
Token 1: Compute K₁, V₁ → Store in cache
Token 2: Compute K₂, V₂ → Store; Attend to [K₁, K₂], [V₁, V₂]
Token 3: Compute K₃, V₃ → Store; Attend to [K₁, K₂, K₃], [V₁, V₂, V₃]
...

KV cache memory formula

For a single sequence:

KV Cache = 2 × L × n_kv × d_h × s × b

Reading the formula: for every layer (L), every KV head (n_kv), and every position in the sequence (s), we store a Key vector and a Value vector (the factor of 2) of dimension d_h, with each element taking b bytes. Multiply it all together and this cache can easily reach gigabytes for long sequences.

Where:

  • L = number of layers
  • n_kv = number of KV heads (reduced with GQA/MQA (Grouped-Query/Multi-Query Attention)[3])
  • d_h = head dimension
  • s = sequence length
  • b = bytes per element (2 for FP16, 1 for INT8)

Concrete Example: Llama 3.1 70B

Parameter | Value
Layers (L) | 80
KV heads (n_kv) | 8 (GQA, not the 64 query heads!)
Head dim (d_h) | 128
Sequence length (s) | 4,096
Dtype | FP16 (2 bytes)

KV Cache = 2 × 80 × 8 × 128 × 4096 × 2 bytes ≈ 1.34 GB per sequence

Reading the formula: 2 is for K and V, 80 is the number of layers, 8 is the KV head count, 128 is the head dimension, 4096 is the sequence length, and the final 2 is FP16 bytes per value.

With GQA (8 KV heads instead of 64), this is 8× smaller than it would be with MHA. Without GQA, the same model would need 10.7 GB per sequence.
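The arithmetic can be sanity-checked in a few lines; the function below simply encodes the formula from this section:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size for one sequence: 2 (K and V) × L × n_kv × d_h × s × b."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Llama 3.1 70B at 4K context: GQA (8 KV heads) vs hypothetical MHA (64)
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)

print(f"GQA: {gqa / 1e9:.2f} GB, MHA: {mha / 1e9:.2f} GB, ratio: {mha // gqa}x")
# GQA: 1.34 GB, MHA: 10.74 GB, ratio: 8x
```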

🎯 Production tip: Serving a 70B model to 100 concurrent users requires precise KV cache budgeting: model weights (140 GB FP16) + KV cache (1.34 GB × 100 users = 134 GB) = 274 GB total. Feasible on 4× H100-80GB with tensor parallelism (splitting the model across multiple GPUs).


Dynamic token budgeting

In production environments, context length is not a static limit determined solely by the model's architecture. Instead, it is a dynamic memory budget that dictates how many concurrent users your system can support. Every additional token of context required by one user reduces the available VRAM for everyone else.

To serve models at scale, inference engines must strictly enforce these budgets. When a request comes in, the system checks the available GPU memory. If the required KV cache for the new request (plus existing ones) exceeds the remaining capacity, the request must wait in a queue. This is why managing the TTFT-TPS tradeoff is so important for overall throughput.

You can dynamically calculate the maximum affordable context length by subtracting the model weights' footprint from total GPU memory, then dividing the remainder by the number of concurrent users and the per-token KV cache size:

python
def max_context_for_budget(
    gpu_memory_gb: float,
    model_memory_gb: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # FP16
    num_concurrent: int = 1,
) -> int:
    """Calculate maximum context length given memory constraints."""
    available_memory = (gpu_memory_gb - model_memory_gb) * 1e9

    # Memory per token in KV cache
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

    # Divide by concurrent users
    budget_per_user = available_memory / num_concurrent

    return int(budget_per_user / bytes_per_token)

# Example: Llama 3.1 70B on 4×H100-80GB (320 GB total)
max_tokens = max_context_for_budget(
    gpu_memory_gb=320,
    model_memory_gb=140,  # FP16 weights
    num_layers=80,
    num_kv_heads=8,       # GQA: 8 KV heads (not 64!)
    head_dim=128,
    num_concurrent=50,
)
# Result: ~11,000 tokens per user with 50 concurrent users

This calculation drives critical deployment decisions. If you need to support 100 concurrent users but only have the budget for 5,000 tokens each, you might need to add another GPU node, reduce the model precision using quantization, or implement stricter context window limits at the application layer.


Production optimizations

Chunked prefill

Large prefills block decode operations, creating a TTFT-TPS tradeoff: prioritizing new prefills stalls existing decode streams.

Chunked prefill splits long prompts into smaller chunks, interleaving them with decode steps:

๐Ÿญ Analogy, factory assembly line: Without chunked prefill, it's like shutting down the entire factory assembly line to set up for a new product. All existing products stop moving while you retool. With chunked prefill, you retool one station at a time while the rest of the line keeps running. Existing requests keep flowing (decode continues) while the new request is gradually set up (prefilled in chunks).

The timeline below illustrates how chunked prefill avoids stalling decodes:

text
Without chunked prefill:
  [Prefill 10K tokens ===========================] [Decode...Decode...Decode...]
   ↑ All decode requests stall during this prefill

With chunked prefill (chunk=2048):
  [Prefill chunk1][Decode][Prefill chunk2][Decode][Prefill chunk3][Decode]...
   ↑ Decode requests continue between chunks

Benefits: Better GPU utilization (mixing compute-bound prefill with memory-bound decode), smoother token streaming, reduced tail latency. Enabled by default in vLLM V1 and discussed in the context of efficient memory management systems like PagedAttention[4].
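The interleaving can be sketched as a toy scheduler. This is purely illustrative; real engines such as vLLM enforce per-step token budgets across many requests rather than this simple alternation:

```python
def schedule(prompt_tokens, chunk_size, decode_queue_steps):
    """Interleave prefill chunks of a new request with pending decode steps."""
    events = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        events.append(f"prefill({chunk})")
        remaining -= chunk
        if decode_queue_steps > 0:  # existing requests keep streaming tokens
            events.append("decode")
            decode_queue_steps -= 1
    return events

print(schedule(prompt_tokens=5000, chunk_size=2048, decode_queue_steps=10))
# ['prefill(2048)', 'decode', 'prefill(2048)', 'decode', 'prefill(904)', 'decode']
```

The key property to notice: no decode step ever waits longer than one chunk's worth of prefill, which bounds the ITL spike seen by existing users.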

Prefill-decode disaggregation

Modern systems (Splitwise[5], DistServe[6], Mooncake[7]) separate prefill and decode onto different GPU pools:


Why this works:

  • No cross-phase interference: prefill never stalls decode
  • Independent scaling: add prefill GPUs for prompt-heavy workloads, decode GPUs for concurrent users
  • Hardware matching: use compute-optimized GPUs for prefill, high-bandwidth GPUs for decode

KV cache quantization

Store KV cache in INT8 or even INT4 instead of FP16 to halve/quarter memory with minimal quality loss (see our model quantization deep-dive for the techniques behind weight and activation quantization):

KV Cache (INT8) = KV Cache (FP16) / 2

Reading the formula: INT8 uses 1 byte per value instead of 2 bytes (FP16), so KV memory is cut in half for the same sequence length and concurrency.

This doubles the number of concurrent users you can serve on the same hardware.
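A quick check with the numbers used earlier in this article (the 4×H100 node and the 1.34 GB per 4K-token sequence from the Llama 3.1 70B example are carried over as assumptions):

```python
total_gb, weights_gb = 320, 140
kv_fp16_gb = 1.34  # per 4K-token sequence (Llama 3.1 70B example above)

for name, bytes_per_elem in [("FP16", 2), ("INT8", 1)]:
    kv_per_seq = kv_fp16_gb * bytes_per_elem / 2  # scale cache by dtype width
    users = int((total_gb - weights_gb) / kv_per_seq)
    print(f"{name}: {users} concurrent 4K-token users")
# FP16: 134 concurrent 4K-token users
# INT8: 268 concurrent 4K-token users
```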


Common misconceptions

โŒ "LLMs generate tokens one at a time" Only true for the decode phase. Prefill processes the entire prompt in parallel.

โŒ "More GPUs = faster generation" For a single request, model parallelism adds communication latency. Faster generation requires higher memory bandwidth, not more compute.

โŒ "Context length is just a model limitation" In production, context length is a memory budget decision. Shorter contexts = more concurrent users.

โŒ "TTFT and TPS improve together" They're often in tension. Optimizing TTFT (prioritizing prefills) hurts TPS (stalling decodes), and vice versa.


Key takeaways

  1. Two-phase inference: Prefill (compute-bound, parallel) → Decode (memory-bandwidth-bound, sequential)
  2. TTFT measures prefill speed; TPS measures decode throughput. They're independent and often in tension.
  3. KV cache is the primary memory bottleneck for concurrent serving; derive the formula from model architecture
  4. GQA/MQA reduces KV cache proportionally to the head-count reduction (8× for Llama 3.1 70B)
  5. Chunked prefill interleaves prompt processing with decode to avoid stalling
  6. Disaggregation separates prefill and decode onto different GPU pools for independent scaling
  7. Dynamic token budgeting calculates max context from GPU memory, model size, and concurrency

Evaluation Rubric
  1. Explains prefill vs decode with correct bottleneck identification
  2. Derives KV cache memory formula from architecture parameters
  3. Accounts for GQA in KV cache calculation (not full query heads)
  4. Discusses arithmetic intensity to explain compute vs memory-BW bound
  5. Explains chunked prefill and why it helps
  6. Explains prefill-decode disaggregation and when to use it
  7. Calculates max concurrent users from GPU memory budget
Common Pitfalls
  • Saying LLMs generate tokens one at a time (prefill is parallel)
  • Thinking more GPUs always means faster single-request generation
  • Using full query head count instead of KV head count for cache sizing
  • Not knowing that TTFT and TPS are often in tension
  • Treating context length as only a model limitation, not a memory budget

Key Concepts Tested
  • Two-phase inference: prefill (compute-bound) vs decode (memory-BW-bound)
  • TTFT, TPS, ITL, TPOT: what each measures and what drives it
  • KV cache memory formula derivation from model architecture
  • GQA/MQA effect on KV cache size
  • Arithmetic intensity: why prefill is compute-bound and decode is memory-bound
  • Chunked prefill for smoothing the TTFT-TPS tradeoff
  • Prefill-decode disaggregation onto separate GPU pools
  • Dynamic token budgeting for concurrent serving
  • KV cache quantization (INT8/INT4) to double concurrency
References

[1] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.

[2] Yu, G.-I., et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022.

[3] Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023.

[4] Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.

[5] Patel, P., et al. (2023). Splitwise: Efficient Generative LLM Inference Using Phase Splitting.

[6] Zhong, Y., et al. (2024). DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. OSDI 2024.

[7] Qin, Y., et al. (2024). Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving.
