LeetLLM
© 2026 LeetLLM. All rights reserved.

🚀 Hard · Inference Optimization

Inference: TTFT, TPS & KV Cache

Understand the two-phase inference process (prefill vs decode), derive the KV cache memory formula, and learn production optimizations like chunked prefill and prefill/decode disaggregation.


The previous chapter showed how multi-agent systems turn one product action into planning, retrieval, tool calls, review, and several model calls. This chapter zooms into one of those calls. Once a request reaches an inference engine, three terms matter immediately: time to first token (TTFT), tokens per second (TPS), and the key-value (KV) cache that grows with the conversation.

When you send a message to ChatGPT, you notice something peculiar about how the response appears. There's often a brief pause, then the first visible text appears, followed by the rest streaming out in small chunks. Why that initial pause? Why does it stream token by token instead of appearing instantly? And why do longer conversations eventually feel slower?

This behavior isn't a quirk of the interface. It's the physics of LLM inference: the process of running a trained model to generate text. Understanding these mechanics (the two-phase nature of generation, the memory bottlenecks, and how engineers measure performance) is important for anyone building or optimizing production AI systems. Longer conversations often feel slower for a concrete reason: each decode step has to read a growing KV cache on top of the model weights.

Key insight: LLM inference has two distinct phases (prefill and decode) that usually emphasize different hardware bottlenecks. Knowing which phase is the bottleneck, and why, is the foundation for major optimizations in this space, from Key-Value (KV) cache management to continuous batching.


The two phases of LLM inference

Every LLM request goes through two distinct computational phases with fundamentally different hardware bottlenecks:

Analogy (fulfillment desk): Prefill is like reading the entire order history, return policy, and carrier notes at once before the first reply. It's intense upfront work but highly parallelizable. Decode is like scanning one outbound box at a time in strict order: each new token waits for the previous token, and the bottleneck is how fast the system can fetch stored state from memory.

Figure: Two-panel inference pipeline showing prefill processing the full prompt in parallel, then decode generating one token at a time while reusing the KV cache.
Inference flips regimes after token 1: prefill is parallel and compute-heavy, while decode becomes a serial loop that rereads weights and cached state.

The diagram below illustrates the sequential dependency between the highly parallel prefill phase and the autoregressive decode phase:

Figure: Phase 1 (Prefill): process the entire prompt in parallel and write prompt K/V into the KV cache. Phase 2 (Decode) follows.

Phase 1: Prefill (processing the prompt)

In an unchunked baseline, the model processes the input prompt in parallel during prefill. Prompt tokens participate in large matrix operations, so the GPU can run dense matrix multiplies efficiently. For long prompts on modern accelerators, this phase is usually compute-bound, limited more by available FLOPs than by memory bandwidth. Later in this lesson, chunked prefill deliberately divides that work for scheduling reasons. A toy trace looks like this:

```text
Input: "Explain why order 102 is delayed"
→ Tokenize prompt
→ Process all prompt tokens in one forward pass
→ Produce KV cache entries for the prompt
→ Produce logits (unnormalized probability scores) for the FIRST output token
```

The time from request to the first output token is called TTFT (Time to First Token). For a user staring at a chat interface, this is the most noticeable latency. Do not copy one generic latency target into every product; establish a service-level objective (SLO) from the interaction mode and measured user tolerance.

| Use case | Primary pressure | What to measure |
| --- | --- | --- |
| Real-time voice | Turn-taking delay | End-to-end TTFT and audio pipeline overhead |
| Code completion | Interruption to typing | Tail TTFT for short prompts |
| Chat/conversational | Visible waiting | TTFT plus streamed ITL |
| Batch processing | Job completion | Throughput and cost before TTFT |

Phase 2: Decode (generating output tokens)

After producing the first token, the model generates subsequent tokens one at a time, autoregressively. Each new token requires a forward pass through the entire model, but only processes that single new token (reusing cached K/V from all previous tokens). The new token is appended to the running sequence, added to the KV cache, and fed back into the model to predict the next token. The trace below shows how the response grows:

```text
Step 1: Output so far: "Order" → add to KV cache → forward pass → next token: "102"
Step 2: Output so far: "Order 102" → add to KV cache → forward pass → next token: "is"
Step 3: Output so far: "Order 102 is" → add to KV cache → forward pass → next token: "delayed"
...
```

For a single decode stream, this phase is usually memory-bandwidth bound, not compute-bound. The bottleneck is reading the model weights and KV cache from GPU HBM (High Bandwidth Memory) for each token. The matrix multiplications are thin relative to the amount of data that must be moved, so the GPU's arithmetic units often spend more time waiting for bytes than doing math. As the response grows, the attention kernel also has to read a larger cached prefix, so per-token latency tends to rise with sequence length even when the model weights stay fixed.

The following small timeline separates first-token latency from decode cadence. It is intentionally a measurement exercise, not a model benchmark.

measure-prefill-and-decode.py
```python
arrival_ms = 0
token_times_ms = [320, 355, 392, 428]

# TTFT: arrival to first token. ITL: gap between adjacent tokens.
ttft_ms = token_times_ms[0] - arrival_ms
itls_ms = [
    current - previous
    for previous, current in zip(token_times_ms, token_times_ms[1:])
]
mean_itl_ms = sum(itls_ms) / len(itls_ms)
tps = 1000 / mean_itl_ms

print("TTFT:", ttft_ms, "ms")
print("decode ITLs:", itls_ms, "ms")
print(f"decode TPS: {tps:.1f}")
```
Output
```text
TTFT: 320 ms
decode ITLs: [35, 37, 36] ms
decode TPS: 27.8
```

The arithmetic intensity explanation

The key difference between the two phases is arithmetic intensity: the number of floating-point operations (FLOPs) the GPU can perform per byte of data it must fetch from high-bandwidth memory (HBM).

Why prefill has high arithmetic intensity. During prefill the model loads each weight matrix once and reuses it across all $N$ prompt tokens (for a 70B-class BF16/FP16 model, the weights total about 140 GB). The expensive memory traffic is amortized over $N$ tokens' worth of multiply-accumulate work, so the GPU's tensor cores stay busy and the bottleneck becomes raw compute throughput (TFLOPS, trillions of floating-point operations per second).

Why decode has low arithmetic intensity. For each new output token, the model effectively streams through the weight tensor (~140 GB for a 70B-class BF16/FP16 model) to perform what is close to a matrix-vector product when the effective batch is small. The number of FLOPs per byte loaded collapses. The GPU's arithmetic units spend much of their time waiting for the next wave of weights and KV cache entries to arrive from HBM. Memory bandwidth (TB/s) therefore becomes the limiting factor.

| Phase | Tokens processed | Effective batch | Arithmetic intensity | Bottleneck |
| --- | --- | --- | --- | --- |
| Prefill | $N$ (prompt) | Large (full prompt) | High (many FLOPs/byte) | Compute (TFLOPS) |
| Decode | 1 at a time | 1 | Low (few FLOPs/byte) | Memory bandwidth (TB/s) |
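To put rough numbers on the gap in the table, the sketch below counts FLOPs and bytes for a single weight matrix. It is an illustration under stated assumptions, not a profiler: it assumes 2-byte weights and activations, counts each weight as read once, and ignores the KV cache and on-chip caching. The function name `linear_layer_intensity` and the 8192×8192 shape are hypothetical choices for the example.

```python
def linear_layer_intensity(
    tokens: int, d_in: int, d_out: int, bytes_per_value: int = 2
) -> float:
    """Approximate FLOPs per byte for y = x @ W with x of shape (tokens, d_in)."""
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_value  # weights are read once
    activation_bytes = tokens * (d_in + d_out) * bytes_per_value  # read x, write y
    return flops / (weight_bytes + activation_bytes)

for tokens in (1, 64, 2048):
    intensity = linear_layer_intensity(tokens, 8192, 8192)
    print(f"{tokens:>5} tokens: {intensity:.1f} FLOPs/byte")
```

With one token the ratio sits at about 1 FLOP per byte, which is the matrix-vector regime of decode. It climbs as more tokens share the same weight read, which is the prefill regime.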

The roofline model[1] makes this concrete: a kernel's achievable throughput is capped by either peak compute or by (memory bandwidth × arithmetic intensity), whichever is lower. Below a hardware-specific intensity threshold (the "ridge point"), you are bandwidth-bound; above it, compute-bound. Prefill often sits above that threshold, while low-batch decode often sits below it. An H100 SXM GPU has 80 GB of HBM and peak HBM bandwidth of 3.35 TB/s.[2] A 70B-class model in BF16/FP16 (16-bit floating-point formats) needs roughly 138-140 GB for weights alone, so it cannot be served unsharded on one H100-80GB.[3] That asymmetry is why input/output-aware (IO-aware) kernels like FlashAttention[4] matter for long prefills, while PagedAttention[5] focuses on fitting and reusing the KV cache efficiently during serving.
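The roofline cap described above fits in a few lines. This is a minimal sketch: the 3.35 TB/s bandwidth comes from the text, while the 989 TFLOPS peak is an assumed approximate dense BF16 figure for the same GPU class that you should confirm against the datasheet. The function name `attainable_tflops` is illustrative.

```python
def attainable_tflops(
    intensity_flops_per_byte: float, peak_tflops: float, bandwidth_tb_per_s: float
) -> float:
    """Roofline cap: the lower of peak compute and bandwidth x intensity."""
    return min(peak_tflops, bandwidth_tb_per_s * intensity_flops_per_byte)

PEAK_TFLOPS = 989.0    # ASSUMPTION: approximate dense BF16 peak; check the datasheet
BANDWIDTH_TB_S = 3.35  # HBM bandwidth quoted in the text

# Intensity at which the two caps meet.
ridge_point = PEAK_TFLOPS / BANDWIDTH_TB_S
print(f"ridge point: {ridge_point:.0f} FLOPs/byte")

for intensity in (1, 64, 1024):
    cap = attainable_tflops(intensity, PEAK_TFLOPS, BANDWIDTH_TB_S)
    regime = "bandwidth-bound" if intensity < ridge_point else "compute-bound"
    print(f"intensity {intensity:>4}: cap {cap:.1f} TFLOPS ({regime})")
```

Under these assumed figures, a kernel needs roughly 300 FLOPs per byte before compute becomes the limit, which is far above what single-stream decode achieves.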


Key performance metrics

To evaluate and optimize an inference system, engineers rely on four standard metrics that capture different parts of the user experience and system throughput. Balancing these metrics often involves direct tradeoffs.

Figure: Request timeline showing TTFT spanning queue, tokenize, and prefill before token 1, followed by decode gaps that define inter-token latency and streaming speed.
TTFT ends at token 1. After that, decode cadence is about ITL, TPOT, and TPS rather than first-token latency.

TTFT (time to first token)

TTFT measures how long it takes before the first output token appears.[6] At the model-kernel level, TTFT is dominated by the prefill phase. In a real serving stack, end-to-end TTFT also includes tokenization, queueing, scheduling, and network overhead. Once those are under control, TTFT grows with prompt length: roughly linearly while the dense matrix multiplies dominate, and faster than linearly at very long contexts, where the quadratic cost of attention in sequence length becomes significant.
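For a first-order feel of the prefill share of TTFT, a common rule of thumb charges about 2 FLOPs per parameter per prompt token. The sketch below applies it and deliberately ignores the attention term, queueing, tokenization, and network time. The 70B size, the four GPUs at an assumed ~989 dense BF16 TFLOPS each, and the 40% utilization are all placeholder assumptions to replace with measured values.

```python
def prefill_seconds(
    params_billion: float, prompt_tokens: int, peak_tflops: float, utilization: float
) -> float:
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    flops = 2 * params_billion * 1e9 * prompt_tokens
    return flops / (peak_tflops * 1e12 * utilization)

# ASSUMPTIONS: 70B parameters, 4 GPUs at ~989 TFLOPS each, 40% sustained utilization.
for prompt_tokens in (512, 4096, 32768):
    seconds = prefill_seconds(70, prompt_tokens, peak_tflops=4 * 989, utilization=0.4)
    print(f"{prompt_tokens:>6} prompt tokens: ~{seconds * 1000:.0f} ms of prefill compute")
```

Because the estimate is linear in prompt length, it will understate very long prompts. Treat it as a sanity check against measurements, not a replacement for them.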

This metric is highly visible to users. It's critical for interactive applications like chat interfaces, voice assistants, and real-time code completion, where a delay of even a second can feel sluggish.

TPS (tokens per second)

TPS (also called "decode throughput") measures how fast the model generates subsequent output tokens after the first token is produced. For low-batch decode, memory bandwidth is often the dominant constraint. Exact results depend on model architecture, quantization, parallelism, engine, context length, and batch shape; measure the deployed configuration rather than using a generic tokens-per-second claim. Production systems use batched inference, often with continuous batching,[7] to improve aggregate throughput across requests.

ITL (inter-token latency)

ITL represents the time elapsed between generating consecutive output tokens.[6] For a single request running in isolation, ITL is simply the inverse of TPS ($\text{ITL} = 1/\text{TPS}$).

In interactive UIs, ITL above roughly 100 milliseconds often starts to feel choppy. Under heavy batching, ITL can increase because more requests are contending for the same GPU resources.

TPOT (time per output token)

While ITL is the individual gap between adjacent streamed tokens, TPOT is the average time per generated output token after the first token. In practice, TPOT is often computed as the mean of a request's ITL values or as an aggregate benchmark statistic across requests.[6]

Monitoring both metrics matters: ITL exposes jitter and stalls in the stream, while TPOT summarizes overall decode pacing. Under load, prefill interruptions, scheduling, and batching contention can make both worse.
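A small example shows why the two metrics can disagree. The timestamps below are hypothetical and include one deliberate stall. TPOT is computed here as the mean of the request's ITL values, which is one of the conventions mentioned above.

```python
from statistics import mean

# Hypothetical token arrival times (ms) with a single stall before the fifth token.
token_times_ms = [300, 335, 370, 405, 640, 675, 710]

itls_ms = [later - earlier for earlier, later in zip(token_times_ms, token_times_ms[1:])]
tpot_ms = mean(itls_ms)     # summarizes overall decode pacing
worst_itl_ms = max(itls_ms)  # exposes the stall the mean smooths over

print("ITLs:", itls_ms)
print(f"TPOT: {tpot_ms:.1f} ms")
print(f"worst ITL: {worst_itl_ms} ms")
```

The mean lands near 68 ms, which looks acceptable on a dashboard, while the 235 ms gap is what a user actually perceives as a freeze.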

The four metrics are summarized below:

| Metric | What it measures | Phase | What drives it | Useful aggregation |
| --- | --- | --- | --- | --- |
| TTFT | Time until first output token appears | Prefill path | Prompt length, model size, queueing | Median and tail latency |
| TPS | Speed of token generation after the first | Decode | Memory bandwidth, batching | Per-request and aggregate rate |
| ITL | Time between consecutive tokens | Decode | Scheduling and contention | Distribution of token gaps |
| TPOT | Average time per output token after first | Decode | Scheduling, batching, contention | Request or benchmark mean |

An alert can route to the relevant investigation path without pretending to diagnose root cause by itself:

route-latency-investigation.py
```python
def investigate(ttft_p95_ms: int, itl_p95_ms: int) -> str:
    # Example thresholds only; derive real ones from your own SLOs.
    # Check decode first so a high TTFT is never reported as healthy
    # just because ITL falls between the two ITL cutoffs.
    if itl_p95_ms > 120:
        return "inspect decode scheduling and memory pressure"
    if ttft_p95_ms > 900:
        return "inspect queueing and prefill"
    return "within example thresholds"

print("long initial pause:", investigate(ttft_p95_ms=1100, itl_p95_ms=60))
print("choppy stream:", investigate(ttft_p95_ms=350, itl_p95_ms=160))
```
Output
```text
long initial pause: inspect queueing and prefill
choppy stream: inspect decode scheduling and memory pressure
```

Back-of-the-envelope: a bandwidth upper bound

For a single decode stream, TPS is governed by memory bandwidth. To generate one token, the GPU roughly has to stream the model weights from HBM. A rough estimate:

$$\text{Max TPS} \approx \frac{\text{HBM Bandwidth}}{\text{Model size (bytes)}}$$

A 70B BF16/FP16 weight tensor does not fit on one H100-80GB. Suppose it is tensor-parallel sharded across four H100 SXM GPUs. Each GPU holds about one quarter of the 140 GB weight tensor and can read its shard at up to 3.35 TB/s; equivalently, the four devices offer an idealized aggregate 13.4 TB/s before communication and runtime losses.[2][3]

$$\text{Ideal shard-read bound} \approx \frac{4 \times 3{,}350 \text{ GB/s}}{140 \text{ GB}} \approx 96 \text{ tokens/sec}$$

This is an optimistic bandwidth bound, not a throughput promise. It omits tensor-parallel communication, KV cache reads during attention, activation memory, kernel overhead, less-than-peak sustained bandwidth, and scheduling. Benchmark the selected engine and parallelism layout to get real TPS. Batched inference can improve aggregate throughput because multiple sequences share weight reads across concurrent decode work.

estimate-sharded-bandwidth-bound.py
```python
def ideal_weight_stream_tps(
    model_gb: float, bandwidth_gb_per_s: float, tensor_parallel_gpus: int
) -> float:
    # Idealized: every GPU streams its shard at peak bandwidth in parallel.
    aggregate_bandwidth = bandwidth_gb_per_s * tensor_parallel_gpus
    return aggregate_bandwidth / model_gb

model_gb = 140
h100_capacity_gb = 80
tensor_parallel_gpus = 4

print("fits on one H100-80GB:", model_gb <= h100_capacity_gb)
bound = ideal_weight_stream_tps(model_gb, 3350, tensor_parallel_gpus)
print(f"four-GPU ideal shard-read bound: {bound:.1f} tokens/s")
```
Output
```text
fits on one H100-80GB: False
four-GPU ideal shard-read bound: 95.7 tokens/s
```

Research note: This memory-bandwidth bound is fundamental to transformer inference and is why many serving optimizations (PagedAttention[5], continuous batching[7], quantization) attack the same bottleneck: reducing bytes moved per token or increasing effective HBM bandwidth.
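The same bound can be extended to show why batching helps. The sketch below assumes each decode step reads the weights once for the whole batch and each sequence's KV cache once, and that the step stays bandwidth-bound. That assumption stops holding at large batch sizes, where compute or communication takes over. The 1.34 GB per-sequence figure assumes a 4,096-token FP16 cache for a 70B-class GQA model, and `decode_step_bound` is a hypothetical helper name.

```python
def decode_step_bound(
    weight_gb: float, kv_gb_per_sequence: float, batch_size: int, bandwidth_gb_per_s: float
) -> tuple[float, float]:
    """Return (step_ms, aggregate_tps) for an idealized bandwidth-bound decode step."""
    # Weights are streamed once per step; KV cache is streamed once per sequence.
    gb_moved = weight_gb + batch_size * kv_gb_per_sequence
    step_s = gb_moved / bandwidth_gb_per_s
    return step_s * 1000, batch_size / step_s

for batch_size in (1, 8, 64):
    step_ms, aggregate_tps = decode_step_bound(140, 1.34, batch_size, 4 * 3350)
    print(f"batch {batch_size:>2}: step {step_ms:.1f} ms, aggregate {aggregate_tps:.0f} tokens/s")
```

In this model each step gets slower as the batch grows, so per-request ITL rises, while aggregate tokens per second rises much faster. That is the throughput-versus-latency tradeoff in miniature.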


The KV cache: the dynamic capacity bottleneck

After weights and runtime buffers are resident, the limit on how many active sequences a deployment can admit is often remaining memory capacity, specifically the memory required for their KV cache.

What is the KV cache?

During attention, each layer computes Key and Value projections for every token. Without caching, each decode step would have to rerun the full prefix through the model and recompute old K/V tensors again and again. That repeated work gets expensive fast as the sequence grows.

Analogy (shipment trace): The KV cache is like a shipment trace you build while processing an order conversation. For each token already processed, the model stores key routing facts (K) and useful details (V). When the next token references earlier context, the model reads that trace instead of re-deriving the whole prompt from scratch. The trace grows with each token, and its size determines how many concurrent conversations fit in GPU memory.

The KV cache stores these K and V tensors. As the sequence grows, the KV cache accumulates data for each token to avoid redundant calculations. This step-by-step accumulation allows the model to compute attention for only the newest token against the historical cache. The trace below shows how the cache expands with each generated token:

```text
Token 1: Compute K₁, V₁ → Store in cache
Token 2: Compute K₂, V₂ → Store; Attend to [K₁,K₂], [V₁,V₂]
Token 3: Compute K₃, V₃ → Store; Attend to [K₁,K₂,K₃], [V₁,V₂,V₃]
...
```
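The trace can be reproduced with a toy single-head attention in plain Python. This is a sketch for intuition only: the per-token query, key, and value vectors are hypothetical hand-written numbers, whereas a real model produces them by running each token through its layers. It checks that attending over an incrementally built cache gives the same outputs as rebuilding the K/V lists at every step.

```python
import math

def attend(query: list[float], keys: list[list[float]], values: list[list[float]]) -> list[float]:
    """Scaled dot-product attention for one query over all available positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(score - top) for score in scores]
    total = sum(weights)
    return [
        sum(w / total * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]

# Hypothetical (q, k, v) projections for three tokens, head dimension 2.
projections = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

k_cache: list[list[float]] = []
v_cache: list[list[float]] = []
cached_outputs = []
for q, k, v in projections:
    k_cache.append(k)  # only the newest token's K/V is added each step
    v_cache.append(v)
    cached_outputs.append(attend(q, k_cache, v_cache))

# Baseline without a cache: rebuild all earlier K/V at every step.
recomputed_outputs = [
    attend(
        projections[t][0],
        [p[1] for p in projections[: t + 1]],
        [p[2] for p in projections[: t + 1]],
    )
    for t in range(len(projections))
]

print("outputs match:", cached_outputs == recomputed_outputs)
print("cached positions:", len(k_cache))
```

The outputs are identical; the cache only changes how much work is repeated, and it pays for that with memory that grows by one K/V pair per token per layer.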

To manage this growing memory dynamically without fragmentation, systems like vLLM use PagedAttention[5], which divides the KV cache into fixed-size blocks (pages) similar to operating system virtual memory.
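The paging idea can be sketched with a toy block allocator. This is not vLLM's implementation: the class name `PagedKVAllocator`, its methods, and the block sizes are hypothetical, and a real engine also handles copy-on-write, preemption, and GPU memory layout. The sketch only shows the core bookkeeping, in which a sequence maps to a table of fixed-size physical blocks that need not be contiguous.

```python
import math

class PagedKVAllocator:
    """Toy block allocator; real serving engines are considerably more involved."""

    def __init__(self, total_blocks: int, block_tokens: int) -> None:
        self.block_tokens = block_tokens
        self.free_blocks = list(range(total_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.token_counts: dict[str, int] = {}

    def append_tokens(self, sequence_id: str, new_tokens: int) -> bool:
        """Grow a sequence, taking new physical blocks only when needed."""
        table = self.block_tables.setdefault(sequence_id, [])
        total_tokens = self.token_counts.get(sequence_id, 0) + new_tokens
        blocks_needed = math.ceil(total_tokens / self.block_tokens) - len(table)
        if blocks_needed > len(self.free_blocks):
            return False  # a real scheduler would queue or preempt here
        for _ in range(blocks_needed):
            table.append(self.free_blocks.pop())
        self.token_counts[sequence_id] = total_tokens
        return True

    def release(self, sequence_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(sequence_id, []))
        self.token_counts.pop(sequence_id, None)

allocator = PagedKVAllocator(total_blocks=8, block_tokens=16)
print("a gets 40 tokens:", allocator.append_tokens("a", 40))
print("b gets 70 tokens:", allocator.append_tokens("b", 70))
print("a grows by 10:", allocator.append_tokens("a", 10))
allocator.release("b")
print("a grows by 10 after b finishes:", allocator.append_tokens("a", 10))
print("a's block table:", allocator.block_tables["a"])
```

Because blocks are handed out one at a time, a sequence wastes at most one partially filled block, and freed blocks can be reused by any other sequence regardless of where they sit in memory.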

KV cache memory formula

For a single sequence:

$$\text{KV Cache} = 2 \times L \times n_{kv} \times d_h \times s \times b$$

Reading the formula: for every layer ($L$), every KV head ($n_{kv}$), and every position in the sequence ($s$), we store a Key vector and a Value vector (the "2") of dimension $d_h$, with each element taking $b$ bytes. Multiply it all together and this cache can easily reach gigabytes for long sequences.

Where:

  • $L$ = number of layers
  • $n_{kv}$ = number of KV heads (reduced with GQA/MQA (Grouped-Query/Multi-Query Attention)[8])
  • $d_h$ = head dimension
  • $s$ = sequence length
  • $b$ = bytes per element (2 for FP16, 1 for an 8-bit cache such as INT8 or FP8)

Concrete example: 70B-class GQA model

| Parameter | Value |
| --- | --- |
| Layers ($L$) | 80 |
| KV heads ($n_{kv}$) | 8 (GQA, not 64 query heads!) |
| Head dim ($d_h$) | 128 |
| Sequence length ($s$) | 4,096 |
| Dtype | FP16 (2 bytes) |

$$\begin{aligned} \text{KV Cache} &= 2 \times 80 \times 8 \times 128 \times 4096 \times 2 \\ &\approx \textbf{1.34 GB} \; (\approx \textbf{1.25 GiB}) \textbf{ per sequence} \end{aligned}$$

Here's the breakdown: the formula multiplies $2$ (for K and V) by $80$ layers, $8$ KV heads, a $128$ head dimension, a $4096$ sequence length, and $2$ bytes per value (for FP16). This matches the common 64-query-head / 8-KV-head GQA geometry used by many 70B-class models. With GQA, this is 8× smaller than it would be with standard Multi-Head Attention (MHA). Without GQA, the same geometry would need about 10.74 GB (10.0 GiB) per sequence.

calculate-kv-cache-footprint.py
```python
def kv_cache_bytes(
    layers: int, kv_heads: int, head_dim: int, tokens: int, bytes_per_value: int
) -> int:
    # Leading 2 accounts for storing both K and V.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

gqa_bytes = kv_cache_bytes(80, 8, 128, 4096, 2)
mha_bytes = kv_cache_bytes(80, 64, 128, 4096, 2)

print(f"GQA cache: {gqa_bytes / 1e9:.2f} GB")
print(f"MHA cache: {mha_bytes / 1e9:.2f} GB")
print("MHA / GQA:", mha_bytes // gqa_bytes)
```
Output
```text
GQA cache: 1.34 GB
MHA cache: 10.74 GB
MHA / GQA: 8
```
Figure: Charts showing KV-cache memory growing linearly with context length, and showing how GQA uses far less per-sequence memory than full multi-head attention.
KV memory grows linearly with context length. GQA helps because the cache stores K/V heads, not every query head.
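The linear growth is easy to tabulate. The sketch below reuses the 70B-class GQA geometry from this section as default arguments and compares a 2-byte cache with a 1-byte cache. The context lengths are illustrative, and `kv_cache_gb` is a hypothetical helper name.

```python
def kv_cache_gb(
    tokens: int, bytes_per_value: int, layers: int = 80, kv_heads: int = 8, head_dim: int = 128
) -> float:
    """Per-sequence KV cache in decimal GB; the leading 2 covers K and V."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value / 1e9

for tokens in (4_096, 32_768, 131_072):
    fp16_gb = kv_cache_gb(tokens, bytes_per_value=2)
    eight_bit_gb = kv_cache_gb(tokens, bytes_per_value=1)
    print(f"{tokens:>7} tokens: FP16 {fp16_gb:6.2f} GB, 8-bit {eight_bit_gb:6.2f} GB")
```

Doubling the context doubles the cache, and halving the bytes per element halves it. Both levers scale the same product, so they compose.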

Production note: Serving a 70B-class model to 100 concurrent users, each holding a 4,096-token context, requires precise KV cache budgeting: model weights (~138-140 GB BF16/FP16) + KV cache (1.34 GB × 100 users = 134 GB) = about 274 GB of raw footprint before allocator overhead, runtime buffers, and communication memory. On paper that fits across 4× H100-80GB (320 GB total) with tensor parallelism, but it leaves only about 46 GB, roughly 14% of total memory, for everything else.
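The same budget can be turned around to ask how many 4,096-token sequences fit. This is a sketch: the reserve values are hypothetical placeholders for engine buffers and allocator slack, and `max_sequences` is an illustrative helper, not part of any serving engine.

```python
def max_sequences(
    total_gb: float, weights_gb: float, reserve_gb: float, kv_gb_per_sequence: float
) -> int:
    """Whole sequences that fit after weights and a runtime reserve are set aside."""
    return int((total_gb - weights_gb - reserve_gb) // kv_gb_per_sequence)

# 70B-class GQA geometry at 4,096 tokens, FP16: about 1.34 GB per sequence.
KV_GB_AT_4K = 2 * 80 * 8 * 128 * 4096 * 2 / 1e9

for reserve_gb in (0, 20, 40):  # ASSUMPTION: placeholder reserves; measure your engine
    fits = max_sequences(320, 140, reserve_gb, KV_GB_AT_4K)
    print(f"reserve {reserve_gb:>2} GB: {fits} sequences at 4K tokens")
```

With a 40 GB reserve the count drops to just over 100, which shows how quickly the on-paper margin disappears once runtime overhead is included.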

Try it yourself: A colleague says you can serve a 7B model (32 layers, 8 KV heads, 128 head dimension, FP16) to 200 concurrent users, each with a 2,048-token context, on a single 80 GB GPU. The model weights take about 14 GB. Use the formula to see why raw weights plus KV memory are insufficient for an admission decision.

check-capacity-with-runtime-headroom.py
```python
def kv_gb_per_sequence(tokens: int) -> float:
    # 2 (K and V) x 32 layers x 8 KV heads x 128 head dim x tokens
    values = 2 * 32 * 8 * 128 * tokens
    return values * 2 / 1e9  # 2 bytes per FP16 value

users = 200
raw_total_gb = 14 + users * kv_gb_per_sequence(tokens=2048)
print(f"raw weights plus KV: {raw_total_gb:.2f} GB")
for reserve_gb in (8, 16):
    admitted = raw_total_gb + reserve_gb <= 80
    print(f"with {reserve_gb} GB runtime reserve: {admitted}")
```
Output
```text
raw weights plus KV: 67.69 GB
with 8 GB runtime reserve: True
with 16 GB runtime reserve: False
```

The raw calculation leaves only a narrow margin. Whether 200 active sequences fit depends on measured activation, workspace, allocator, and fragmentation headroom for the actual serving engine; do not turn an unmeasured reserve into a promised concurrency count.


Dynamic token budgeting

In production environments, context length isn't a static limit determined solely by the model's architecture. Instead, it's a dynamic memory budget that dictates how many concurrent users your system can support. Every additional token of context required by one user reduces the available GPU memory (VRAM, Video RAM) for everyone else.

To serve models at scale, inference engines have to enforce these budgets strictly. When a request comes in, the system checks the available GPU memory. If the required KV cache for the new request (plus existing ones) exceeds the remaining capacity, the request waits in a queue. That's why schedulers track memory pressure right alongside latency metrics.
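The queueing behavior can be sketched as a first-in, first-out admission loop over a KV token budget. This is a simplified illustration: the request IDs, token counts, and capacity are hypothetical, and production schedulers also handle completion, preemption, and priorities.

```python
from collections import deque

def run_admission(
    requests: list[tuple[str, int]], capacity_tokens: int
) -> tuple[list[str], list[str]]:
    """Admit requests in arrival order until the next one would exceed the budget."""
    waiting = deque(requests)
    admitted: list[str] = []
    used_tokens = 0
    while waiting and used_tokens + waiting[0][1] <= capacity_tokens:
        request_id, tokens = waiting.popleft()
        used_tokens += tokens
        admitted.append(request_id)
    # Everything left stays queued until running requests free their KV memory.
    return admitted, [request_id for request_id, _ in waiting]

admitted, queued = run_admission(
    [("r1", 4096), ("r2", 8192), ("r3", 16384), ("r4", 1024)],
    capacity_tokens=20_000,
)
print("admitted:", admitted)
print("queued:", queued)
```

Strict arrival order keeps the policy fair but means the small request `r4` waits behind the large `r3` even though it would fit. Whether to let small requests skip ahead is a scheduling policy choice.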

To calculate the maximum affordable context length, we can write a simple capacity planning function. The function takes the total GPU memory, the model's static weight footprint, a runtime reserve, the number of concurrent users, and the model's architectural parameters (layers, KV heads, and head dimension) as inputs. It computes the available memory per user and divides it by the per-token KV cache size, returning the maximum number of context tokens (prompt plus generated output) each user can hold:

dynamic-token-budgeting.py
```python
def max_context_for_budget(
    gpu_memory_gb: float,
    model_memory_gb: float,
    runtime_reserve_gb: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # FP16
    num_concurrent: int = 1,
) -> int:
    """Quick planning estimate using decimal GB for consistency with GPU datasheets."""
    available_memory = (gpu_memory_gb - model_memory_gb - runtime_reserve_gb) * 1e9

    # Memory per token in KV cache
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

    # Divide by concurrent users
    budget_per_user = available_memory / num_concurrent

    return int(budget_per_user / bytes_per_token)

# Example: 70B-class model on 4×H100-80GB (320 GB raw)
max_tokens = max_context_for_budget(
    gpu_memory_gb=320,
    model_memory_gb=140,  # FP16 weights
    runtime_reserve_gb=40,  # engine buffers, workspaces, and allocator margin
    num_layers=80,
    num_kv_heads=8,  # GQA: 8 KV heads (not 64!)
    head_dim=128,
    num_concurrent=50,
)
print(f"max context per user: {max_tokens:,} tokens")
```
Output
```text
max context per user: 8,544 tokens
```

This calculation drives critical deployment decisions. If you need to support 100 concurrent users but only have the budget for 5,000 tokens each, you might need to add another GPU node, reduce the model precision using quantization, or implement stricter context window limits at the application layer. Treat the result as an upper bound, not a safe production limit: you still need headroom for activations, communication buffers, allocator slack, and the serving runtime itself.

Common mistake: Using the query head count instead of the KV head count. A 70B-class model might have 64 query heads but only 8 KV heads thanks to GQA. If you plug 64 into the formula, you get an 8× memory overestimate and a pessimistic concurrency plan. Check the model card for num_key_value_heads, not num_attention_heads.
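One way to avoid the mix-up is to derive the geometry from the model configuration instead of typing it by hand. The sketch below works on a plain dictionary with Hugging Face-style keys. The values are illustrative numbers matching this lesson's 70B-class example, and treating a missing `num_key_value_heads` as plain MHA is an assumption of this helper. The optional `head_dim` key appears in some configs but not all.

```python
def kv_geometry(config: dict) -> tuple[int, int, int]:
    """Return (layers, kv_heads, head_dim) from a Hugging Face-style config dict."""
    query_heads = config["num_attention_heads"]
    # ASSUMPTION: when the KV-head key is absent, fall back to MHA (one KV head per query head).
    kv_heads = config.get("num_key_value_heads", query_heads)
    head_dim = config.get("head_dim", config["hidden_size"] // query_heads)
    return config["num_hidden_layers"], kv_heads, head_dim

# Illustrative config values for a 70B-class GQA model.
config = {
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "hidden_size": 8192,
}

layers, kv_heads, head_dim = kv_geometry(config)
bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K and V, FP16
print(f"layers={layers}, kv_heads={kv_heads}, head_dim={head_dim}")
print(f"KV bytes per token: {bytes_per_token:,}")
```

Note that the head dimension is derived from the query head count, while the cache size uses the KV head count. Those are the two places where the 64 and the 8 each belong.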

An online admission check can reserve KV capacity using each request's prompt plus output budget rather than admitting every request at the maximum architectural context:

admit-request-by-token-budget.py
```python
# K and V x 80 layers x 8 KV heads x 128 head dim x 2 bytes (FP16)
BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2

def kv_gb(tokens: int) -> float:
    return tokens * BYTES_PER_TOKEN / 1e9

def admit(existing_tokens: list[int], new_tokens: int, kv_budget_gb: float) -> bool:
    needed = sum(kv_gb(tokens) for tokens in existing_tokens) + kv_gb(new_tokens)
    return needed <= kv_budget_gb

active = [4096] * 40
print("admit 16K request:", admit(active, 16_384, kv_budget_gb=60))
print("admit 64K request:", admit(active, 65_536, kv_budget_gb=60))
```
Output
```text
admit 16K request: True
admit 64K request: False
```


Production optimizations

Once you understand the two-phase bottleneck, the next question is how production systems work around it. Engineering teams use three major techniques to smooth the tradeoffs between prefill and decode and maximize hardware utilization.

Chunked prefill

On shared serving hardware, large prefills can delay decode operations, creating a TTFT-TPS tradeoff: prioritizing new prefills can stall existing decode streams.

Chunked prefill splits long prompts into smaller chunks, interleaving them with decode steps:

Analogy (factory assembly line): Without chunked prefill, it's like shutting down the entire factory assembly line to set up for a new product. All existing products stop moving while you retool. With chunked prefill, you retool one station at a time while the rest of the line keeps running. Existing requests keep flowing (decode continues) while the new request is gradually set up (prefilled in chunks).

The timeline below illustrates how chunked prefill avoids stalling decodes by breaking up the massive prefill block. By interleaving smaller prefill chunks with ongoing decode steps, the system maintains a steady flow of output tokens for existing users while gradually processing the new prompt:

```text
Without chunked prefill:
  [Prefill 10K tokens ===========================] [Decode...Decode...Decode...]
  ↑ All decode requests stall during this prefill

With chunked prefill (chunk=2048):
  [Prefill chunk1][Decode][Prefill chunk2][Decode][Prefill chunk3][Decode]...
  ↑ Decode requests continue between chunks
```

Benefits

Chunked scheduling can improve GPU utilization by mixing compute-heavy prefill with memory-heavy decode, while protecting streaming cadence and tail latency. vLLM documents chunked prefill as a scheduling optimization,[9] and systems like Sarathi-Serve[10] study it explicitly.

interleave-prefill-with-active-decode.py
1def chunked_schedule(prompt_tokens: int, chunk_tokens: int) -> list[str]: 2 actions: list[str] = [] 3 remaining = prompt_tokens 4 while remaining: 5 processed = min(chunk_tokens, remaining) 6 actions.append(f"prefill {processed}") 7 remaining -= processed 8 actions.append("decode active streams") 9 return actions 10 11for action in chunked_schedule(prompt_tokens=6144, chunk_tokens=2048): 12 print(action)
Output
prefill 2048
decode active streams
prefill 2048
decode active streams
prefill 2048
decode active streams
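
The schedule above shows the ordering of steps, but not how long active streams wait. As a rough sketch, the longest gap a decode stream sees behind one prefill step can be bounded by the number of prompt tokens processed in that step. The prefill rate used here is a hypothetical placeholder rather than a measured figure, and the model ignores the way attention cost grows with context length.

```python
def worst_case_decode_gap_ms(prompt_tokens: int, chunk_tokens: int,
                             prefill_tokens_per_s: float) -> float:
    """Longest wait an active decode stream sees behind a single prefill step."""
    # Only one chunk (or the whole prompt, if it is smaller) runs before decode resumes.
    tokens_in_step = min(prompt_tokens, chunk_tokens)
    return 1000.0 * tokens_in_step / prefill_tokens_per_s


# Assumption: 20,000 prefill tokens/s is an illustrative rate, not a benchmark.
RATE = 20_000.0
unchunked = worst_case_decode_gap_ms(10_000, chunk_tokens=10_000, prefill_tokens_per_s=RATE)
chunked = worst_case_decode_gap_ms(10_000, chunk_tokens=2_048, prefill_tokens_per_s=RATE)
print(f"unchunked gap: {unchunked:.1f} ms")  # 500.0 ms
print(f"chunked gap:   {chunked:.1f} ms")    # 102.4 ms
```

Under these assumptions a smaller chunk shortens the stall for existing streams, at the price of more scheduling steps before the new request reaches its first token.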

Prefill-decode disaggregation

In a standard setup, a single GPU handles both prefill and decode phases for its assigned requests. However, because prefill is compute-bound and decode is memory-bandwidth bound, using the same hardware for both can create head-of-line blocking and muddle hardware sizing. Modern systems (Splitwise[11], DistServe[12], Mooncake[13]) explore separating prefill and decode onto different GPU pools when that isolation benefit outweighs the KV-transfer cost:

[Diagram: Prefill GPUs (compute-bound; need high tensor-core throughput; scale with prompt load) pass state through a KV cache transfer to Decode GPUs (memory-bandwidth-bound; need high HBM bandwidth and memory; scale with active users).]

Architectural benefits

  • Less cross-phase interference: prefill bursts are less likely to stall decode
  • Independent scaling: add prefill GPUs for prompt-heavy workloads, decode GPUs for concurrent users
  • Hardware matching: use compute-optimized GPUs for prefill, high-bandwidth GPUs for decode

You do pay for moving KV state across the interconnect, so disaggregation is most attractive when prompts are long, traffic is bursty, or TTFT/ITL isolation matters more than the extra transfer overhead.

compare-disaggregation-overhead.py
def choose_layout(shared_phase_interference_ms: int, kv_transfer_ms: int) -> str:
    # Disaggregate only when moving KV state costs less than the interference it removes.
    if kv_transfer_ms < shared_phase_interference_ms:
        return "separate prefill and decode pools"
    return "keep phases colocated"


print("bursty workload:", choose_layout(shared_phase_interference_ms=95, kv_transfer_ms=20))
print("small prompts:", choose_layout(shared_phase_interference_ms=8, kv_transfer_ms=20))
Output
bursty workload: separate prefill and decode pools
small prompts: keep phases colocated
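
The KV-transfer cost in that comparison can itself be estimated. The sketch below sizes the K/V payload for one sequence and divides it by a link bandwidth. The model shape (80 layers, 8 KV heads, head dimension 128, BF16 cache) and the 50 GB/s effective link are assumptions chosen for illustration, and the estimate ignores link latency, protocol overhead, and any overlap of transfer with compute.

```python
def kv_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
             bytes_per_value: int) -> int:
    """Raw K and V payload for one sequence (the factor 2 covers K and V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_ms(num_bytes: int, link_gb_per_s: float) -> float:
    """Time to move the payload over a link, ignoring latency and protocol overhead."""
    return 1000.0 * num_bytes / (link_gb_per_s * 1e9)


# Assumptions: hypothetical GQA model shape and a hypothetical 50 GB/s effective link.
payload = kv_bytes(tokens=8_192, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
print(f"KV payload: {payload / 1e9:.2f} GB")                            # 2.68 GB
print(f"transfer:   {transfer_ms(payload, link_gb_per_s=50.0):.1f} ms")  # 53.7 ms
```

Because the payload grows linearly with prompt length, the same link that is cheap for short prompts can become a meaningful part of TTFT for long ones.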

KV cache quantization

Store the KV cache in an 8-bit format such as FP8 instead of FP16/BF16 to roughly halve the cache footprint. Research systems have demonstrated sub-8-bit KV cache quantization, including 3-bit and 2-bit methods, with model-specific quality evaluation required before deployment.[14][15] vLLM documents FP8 KV-cache support and scaling configuration; support and accuracy trade-offs depend on the engine and hardware.[16] See our model quantization deep-dive for the techniques behind weight and activation quantization. For the raw cache payload, the 8-bit saving is:

$$\text{KV Cache (8-bit)} = \frac{\text{KV Cache (FP16/BF16)}}{2}$$

Reading the formula: an 8-bit K/V tensor payload uses 1 byte per value instead of 2 bytes (FP16/BF16), so its raw payload is halved for the same sequence length and concurrency. Scaling metadata and runtime buffers mean allocated memory savings may differ slightly.
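
To make the effect of scaling metadata concrete, here is a minimal sketch that adds one scale factor per group of quantized values. The layout (one FP32 scale per 128 values) is a hypothetical choice for illustration; real engines pick their own scale granularity, and some use a single per-tensor scale with negligible overhead.

```python
def cache_bytes(values: int, bytes_per_value: int, group_size: int = 0,
                scale_bytes: int = 0) -> int:
    """Payload bytes plus optional per-group scale metadata."""
    payload = values * bytes_per_value
    # One scale per group of values; no metadata when group_size is 0.
    metadata = (values // group_size) * scale_bytes if group_size else 0
    return payload + metadata


values = 1_048_576  # number of K/V elements, an arbitrary illustrative count
bf16 = cache_bytes(values, bytes_per_value=2)
# Assumption: one FP32 (4-byte) scale per 128 values.
q8 = cache_bytes(values, bytes_per_value=1, group_size=128, scale_bytes=4)
print(f"8-bit / 16-bit size ratio: {q8 / bf16:.4f}")  # 0.5156
```

With this assumed layout the quantized cache lands slightly above half the 16-bit size, which is the gap between the payload formula and allocated memory.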

While quantizing weights reduces the static memory footprint of the model, quantizing the KV cache specifically attacks the dynamic memory bottleneck that limits concurrency. Some serving engines now support KV cache quantization directly. If KV memory is the dominant constraint, moving from 16-bit to 8-bit caching can come close to doubling concurrency on the same hardware. In practice, the gain is smaller once you account for model weights, allocator overhead, and other runtime buffers.

estimate-kv-quantization-gain.py
def users_from_kv_budget(kv_budget_gb: float, gb_per_user: float) -> int:
    # Round down: a partially cached user does not fit.
    return int(kv_budget_gb / gb_per_user)


fp16_gb_per_user = 1.342
fp8_gb_per_user = fp16_gb_per_user / 2  # 1 byte per value instead of 2
kv_budget_gb = 80

print("FP16 users from KV budget:", users_from_kv_budget(kv_budget_gb, fp16_gb_per_user))
print("FP8 users from KV budget:", users_from_kv_budget(kv_budget_gb, fp8_gb_per_user))
print("runtime headroom still required:", True)
Output
FP16 users from KV budget: 59
FP8 users from KV budget: 119
runtime headroom still required: True
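
The estimate above treats the whole budget as KV state, which is why it doubles almost exactly. One way to see why real gains come in lower is to charge each user a fixed amount of non-KV runtime memory that quantization does not shrink. The 0.1 GB per-user figure below is a hypothetical value, not a measurement from any engine.

```python
def users_that_fit(budget_gb: float, kv_gb_per_user: float,
                   other_gb_per_user: float) -> int:
    """Users supported when each one also needs non-KV runtime memory."""
    return int(budget_gb / (kv_gb_per_user + other_gb_per_user))


# Assumption: 0.1 GB of non-KV runtime memory per user (hypothetical).
OTHER_GB = 0.1
fp16_users = users_that_fit(80, 1.342, OTHER_GB)
fp8_users = users_that_fit(80, 1.342 / 2, OTHER_GB)
print("FP16 users:", fp16_users)                    # 55
print("FP8 users:", fp8_users)                      # 103
print("gain:", round(fp8_users / fp16_users, 2))    # 1.87
```

The larger the share of per-user memory that is not KV state, the further the gain falls below 2x.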


Mastery check

Key concepts

  • Prefill vs decode as two separate inference phases
  • TTFT, TPS, inter-token latency (ITL), and time per output token (TPOT)
  • Arithmetic intensity and why the bottleneck flips after token 1
  • KV-cache memory formula and why GQA changes the head count
  • Context length as a concurrency budget, not only a model-card limit
  • Chunked prefill, prefill-decode disaggregation, and KV-cache quantization

Evaluation rubric

  • Foundational: Explains why TTFT ends at token 1 and why decode remains sequential after that.
  • Intermediate: Diagnoses whether a user complaint points to prefill latency or decode pacing.
  • Intermediate: Derives the KV-cache formula and uses num_key_value_heads rather than the full query-head count.
  • Advanced: Explains why prefill is usually compute-bound while single-stream decode is usually memory-bandwidth-bound.
  • Advanced: Estimates whether a deployment fits by combining model weights, KV cache, and runtime headroom.
  • Advanced: Chooses among chunked prefill, disaggregation, or KV-cache quantization based on the real bottleneck.

Common pitfalls

"LLMs generate tokens one at a time"

Symptom: You describe the whole request as sequential and then cannot explain the large pause before token 1.
Cause: Decode is sequential, but prefill processes the full prompt in parallel before streaming begins.
Fix: Split the request into two phases every time you reason about latency: prefill for first-token delay, decode for streaming cadence.

"More GPUs always means faster generation"

Symptom: A team adds more GPUs and expects single-request TPS to scale linearly.
Cause: Single-stream decode is often limited by memory bandwidth and communication overhead, not raw compute alone.
Fix: Ask what bottleneck you are relieving. If decode is bandwidth-bound, look at HBM bandwidth, scheduling, quantization, or a different parallelism strategy before adding more devices.

"A 70B BF16 throughput estimate uses one H100-80GB"

Symptom: A bandwidth calculation divides one H100's bandwidth by a 140 GB weight tensor.
Cause: The estimate ignores capacity: 140 GB of weights cannot reside on one 80 GB device.
Fix: Choose a feasible parallel layout first, then estimate using each device's shard and account for interconnect overhead in benchmarks.
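
A minimal sketch of that fix, assuming tensor parallelism where every device streams its own weight shard once per generated token. The candidate parallel degrees, the 20% memory reserve, and the 3,350 GB/s HBM figure (taken as the vendor's peak number for an H100 SXM part) are assumptions for illustration. The result is a bandwidth-only ceiling that ignores KV-cache reads and interconnect time, so measured throughput will be lower.

```python
def smallest_feasible_degree(weight_gb: float, device_gb: float,
                             reserve_fraction: float,
                             candidates: tuple[int, ...] = (1, 2, 4, 8)) -> int:
    """Smallest parallel degree whose per-device shard fits in usable memory."""
    usable_gb = device_gb * (1.0 - reserve_fraction)
    for degree in candidates:
        if weight_gb / degree <= usable_gb:
            return degree
    raise ValueError("weights do not fit at any candidate degree")


def decode_ceiling_tokens_per_s(weight_gb: float, degree: int,
                                hbm_gb_per_s: float) -> float:
    """Bandwidth-only upper bound for single-stream decode."""
    shard_gb = weight_gb / degree
    return hbm_gb_per_s / shard_gb


WEIGHT_GB = 140.0       # 70B parameters x 2 bytes (BF16)
HBM_GB_PER_S = 3350.0   # assumption: vendor peak HBM bandwidth per device
degree = smallest_feasible_degree(WEIGHT_GB, device_gb=80.0, reserve_fraction=0.2)
ceiling = decode_ceiling_tokens_per_s(WEIGHT_GB, degree, HBM_GB_PER_S)
print("parallel degree:", degree)              # 4
print(f"ceiling: {ceiling:.1f} tokens/s")      # 95.7
```

With the assumed reserve, a degree of 2 leaves 70 GB per device and fails the capacity check, so the estimate starts from 4 devices rather than 1.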

"Context length is only a model limitation"

Symptom: Product plans assume every user can use the model's full architectural context without affecting concurrency.
Cause: The architectural limit and the production memory budget were treated as the same thing.
Fix: Turn context policy into a capacity calculation. Budget KV state per user, then cap prompt and output lengths to preserve concurrency headroom.
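
One way to turn that fix into numbers is to start from a target concurrency and solve for the context cap. The sketch below assumes a hypothetical GQA model shape (80 layers, 8 KV heads, head dimension 128, BF16 cache) and an 80 GB KV budget, and it counts raw K/V payload only.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """K and V payload added to the cache by one token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_tokens_per_user(kv_budget_gb: float, target_users: int,
                        per_token_bytes: int) -> int:
    """Largest prompt-plus-output length each user can hold at the target concurrency."""
    per_user_bytes = kv_budget_gb * 1e9 / target_users
    return int(per_user_bytes // per_token_bytes)


# Assumptions: hypothetical model shape and KV budget.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
cap = max_tokens_per_user(kv_budget_gb=80, target_users=100, per_token_bytes=per_token)
print("bytes per token:", per_token)   # 327680
print("token cap per user:", cap)      # 2441
```

Under these assumptions, serving 100 users from the same budget means capping each conversation at roughly 2.4K tokens, regardless of what the model card allows.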

"TTFT and TPS improve together"

Symptom: The system optimizes first-token latency aggressively, but active streams become choppy.
Cause: Large prefills and smooth decode compete for the same GPU time, so improving one can hurt the other.
Fix: Measure TTFT and decode metrics separately. Then tune scheduling, such as chunked prefill, around the actual SLO you need to protect.
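
Measuring the two separately needs only the request start time and the arrival time of each streamed token. The helper below is a minimal sketch with made-up timestamps; the function name and the returned keys are illustrative, not taken from any serving engine.

```python
from statistics import mean


def stream_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """TTFT and decode pacing from token arrival timestamps (seconds)."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure inter-token latency")
    ttft = token_times[0] - request_start
    # Gaps between consecutive tokens describe decode pacing, independent of TTFT.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {"ttft_s": ttft, "mean_itl_s": mean(gaps), "max_itl_s": max(gaps)}


# Hypothetical timestamps: the third gap is a stall, e.g. behind another request's prefill.
metrics = stream_metrics(0.0, [0.5, 0.75, 1.0, 2.0, 2.25])
print(metrics)  # {'ttft_s': 0.5, 'mean_itl_s': 0.4375, 'max_itl_s': 1.0}
```

Tracking the maximum gap alongside the mean matters here: a single long stall is what users perceive as choppy output, and it can hide inside a healthy average.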

"Using query heads for KV-cache sizing"

Symptom: Capacity plans are off by a large factor and the service either looks impossibly expensive or crashes under load.
Cause: The formula used num_attention_heads instead of num_key_value_heads on a GQA model.
Fix: Check the model config directly. KV memory uses the stored K/V head count, not the full query-head count.
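
The size of that error is easy to demonstrate. The sketch below uses a hypothetical config dictionary with field names in the style of a Hugging Face `config.json`; the specific values (64 query heads, 8 KV heads, 80 layers, head dimension 128) are assumptions for illustration. Falling back to the query-head count when `num_key_value_heads` is absent is also an assumption, reflecting plain multi-head attention.

```python
def kv_gb(tokens: int, layers: int, heads: int, head_dim: int,
          bytes_per_value: int) -> float:
    """K and V payload in GB for one sequence."""
    return 2 * layers * heads * head_dim * bytes_per_value * tokens / 1e9


# Hypothetical GQA model config (illustrative values).
config = {"num_hidden_layers": 80, "num_attention_heads": 64,
          "num_key_value_heads": 8, "head_dim": 128}

# Use the stored K/V head count; fall back to query heads only if the field is absent.
kv_heads = config.get("num_key_value_heads", config["num_attention_heads"])

right = kv_gb(4096, config["num_hidden_layers"], kv_heads, config["head_dim"], 2)
wrong = kv_gb(4096, config["num_hidden_layers"], config["num_attention_heads"],
              config["head_dim"], 2)
print(f"with KV heads:    {right:.3f} GB")        # 1.342 GB
print(f"with query heads: {wrong:.3f} GB")        # 10.737 GB
print(f"overestimate:     {wrong / right:.0f}x")  # 8x
```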


Bringing it together

You now have a concrete mental model for how LLM inference behaves in production. When a request arrives, the engine runs a prefill pass over the prompt that is usually compute-heavy and drives model-side TTFT. The system then settles into a decode loop that generates one token at a time, is often memory-bandwidth-bound at low batch sizes, and drives streamed TPS. The KV cache is the bridge between those phases. It grows with every token, which is why remaining memory capacity, not only compute, often limits concurrency.

If you can explain why a long prompt hurts TTFT more than TPS, why doubling decode speed requires more HBM bandwidth rather than more TFLOPS, and how to estimate whether 100 concurrent users fit on your GPU cluster, you are already ahead of most candidates in an AI infrastructure interview.
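
The concurrency question in particular reduces to a short calculation. This sketch reuses two figures that appear earlier in the article (140 GB of BF16 weights for a 70B model and 1.342 GB of KV state per user) and assumes a hypothetical 10% runtime headroom. It treats cluster memory as one pool, which ignores per-device fragmentation and any replication of weights.

```python
def deployment_fits(num_gpus: int, gpu_gb: float, weight_gb: float,
                    kv_gb_per_user: float, users: int,
                    headroom_fraction: float) -> bool:
    """True when weights plus per-user KV state fit inside usable cluster memory."""
    usable_gb = num_gpus * gpu_gb * (1.0 - headroom_fraction)
    needed_gb = weight_gb + kv_gb_per_user * users
    return needed_gb <= usable_gb


# Assumption: 10% of memory is held back as runtime headroom (hypothetical).
for gpus in (2, 4):
    ok = deployment_fits(gpus, gpu_gb=80, weight_gb=140, kv_gb_per_user=1.342,
                         users=100, headroom_fraction=0.1)
    print(f"{gpus} x 80 GB, 100 users:", "fits" if ok else "does not fit")
```

With these inputs the 100-user load needs about 274 GB, so two 80 GB devices fall short while four leave a small margin.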

Next Step
Continue to Multi-Query & Grouped-Query Attention

The KV cache analysis you just did explains why reducing the number of key/value heads is so valuable at scale. The next article covers MQA and GQA, techniques that significantly reduce memory usage while preserving model quality.

References

[1] Roofline: An Insightful Visual Performance Model for Multicore Architectures. Williams, S., Waterman, A., & Patterson, D. · 2009

[2] H100 GPU. NVIDIA · 2026

[3] Wide Open: NVIDIA Accelerates Inference on Meta Llama 3. NVIDIA · 2024

[4] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022

[5] Efficient Memory Management for Large Language Model Serving with PagedAttention. Kwon, W., et al. · 2023 · SOSP 2023

[6] Metrics. vLLM · 2026

[7] Orca: A Distributed Serving System for Transformer-Based Generative Models. Yu, G.-I., et al. · 2022 · OSDI 2022

[8] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Ainslie, J., et al. · 2023 · EMNLP 2023

[9] Optimization and Tuning. vLLM · 2026

[10] Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. Agrawal, A., et al. · 2023 · arXiv preprint

[11] Splitwise: Efficient Generative LLM Inference Using Phase Splitting. Patel, P., et al. · 2023

[12] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. Zhong, Y., et al. · 2024 · OSDI 2024

[13] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. Qin, Y., et al. · 2024

[14] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. Hooper, C., Kim, S., Gholami, A., et al. · 2024 · arXiv preprint

[15] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. Liu, Z., Chen, B., Hu, X., et al. · 2024 · arXiv preprint

[16] Quantized KV Cache. vLLM Team · 2026 · vLLM Documentation