LeetLLM

Your go-to resource for mastering AI & LLM systems.

© 2026 LeetLLM. All rights reserved.

🚀 Hard · Inference Optimization

Inference: TTFT, TPS & KV Cache

Understand the two-phase inference process (prefill vs decode), derive the KV cache memory formula, and learn production optimizations like chunked prefill and disaggregation.

30 min read

When you send a message to ChatGPT, you notice something peculiar about how the response appears. There's often a brief pause, then the first word appears, followed by the rest streaming out word by word. Why that initial pause? Why does it stream one word at a time instead of appearing instantly? And why do longer conversations eventually feel slower?

This behavior isn't a quirk of the interface. It's the physics of LLM inference: the process of running a trained model to generate text. Understanding these mechanics (two-phase nature of generation, memory bottlenecks, and how engineers measure performance) is essential for anyone building or optimizing production AI systems.

Key insight: LLM inference has two distinct phases (prefill and decode) with fundamentally different hardware bottlenecks. Knowing which phase is the bottleneck, and why, is the foundation for every optimization in this space, from Key-Value (KV) cache management to continuous batching.


The two phases of LLM inference

Every LLM request goes through two distinct computational phases with fundamentally different hardware bottlenecks:

Analogy (restaurant kitchen): Prefill is like a chef reading the entire recipe at once, gathering all ingredients, and prepping the mise en place. It's intense upfront work but highly parallelizable (multiple sous chefs can chop simultaneously). Decode is like plating one dish at a time in a specific order: each plate (token) must wait for the previous one, and the bottleneck is how fast you can carry ingredients from the fridge (memory bandwidth), not how fast you can cook (compute).

LLM inference two-phase pipeline: Prefill (compute-bound, processes all input tokens in parallel to build the KV cache) followed by Decode (memory-bandwidth-bound, generates tokens one at a time). A KV Cache Handoff bridge connects the two phases. Key metrics shown: TTFT, TPS, KV cache size, and HBM bandwidth bottleneck.

The diagram below illustrates the sequential dependency between the highly parallel prefill phase and the autoregressive decode phase:

[Diagram: parallel prefill feeding the KV cache into the autoregressive decode loop]

Phase 1: Prefill (processing the prompt)

The model processes your entire input prompt in parallel. Every token is attended to simultaneously in a single forward pass. This is compute-bound, limited by GPU Floating Point Operations Per Second (FLOPS), not memory bandwidth. For example, when starting generation, the prefill phase processes the input text to produce the first token. The following trace shows how a short input sentence is transformed into initial states and the first prediction:

```text
Input: "Explain quantum computing in simple terms"
→ 6 tokens processed simultaneously
→ Produces KV cache entries for all 6 tokens
→ Produces logits (unnormalized probability scores) for the FIRST output token
```

The time from request to the first output token is called TTFT (Time to First Token). For a user staring at a chat interface, this is the most noticeable latency.

| Use Case | TTFT Target | Why |
|---|---|---|
| Real-time voice | < 150ms | Conversational flow |
| Code completion | < 200ms | Developer productivity |
| Chat/conversational | < 500ms | User patience |
| Batch processing | < 2s | Background job |

Phase 2: Decode (generating output tokens)

After producing the first token, the model generates subsequent tokens one at a time, autoregressively. Each new token requires a forward pass through the entire model, but only processes that single new token (reusing cached K/V from all previous tokens). This step-by-step process builds the response iteratively by taking the previously generated token, adding it to the ongoing cache, and predicting the next word. The sequence below demonstrates how the model constructs a phrase one token at a time:

```text
Step 1: "Quantum"   → add to KV cache → forward pass → "computing"
Step 2: "computing" → add to KV cache → forward pass → "is"
Step 3: "is"        → add to KV cache → forward pass → "like"
...
```

This phase is memory-bandwidth bound, not compute-bound. The bottleneck is reading the model weights and KV cache from GPU HBM (High Bandwidth Memory) for each token. The matrix multiplications are small (batch=1), so the GPU's arithmetic units are mostly idle, waiting for data to arrive.

The arithmetic intensity explanation

The key difference is arithmetic intensity (FLOPs per byte loaded from memory):

| Phase | Tokens Processed | Matrix Size | Arithmetic Intensity | Bottleneck |
|---|---|---|---|---|
| Prefill | N (prompt) | Large batch matmul | High (many FLOPs/byte) | Compute (TFLOPS) |
| Decode | 1 at a time | Thin matmul (batch=1) | Low (few FLOPs/byte) | Memory bandwidth (TB/s) |

On an H100 GPU: ~990 TFLOPS (TeraFLOPS) compute, ~3.35 TB/s HBM bandwidth. During decode, the GPU is reading ~140GB of model weights to process a single token. The computation itself takes microseconds, but loading the weights takes milliseconds. This memory-bound nature is what motivates architectural changes like FlashAttention[1] to minimize intermediate HBM access during prefill, and PagedAttention[2] for efficient KV cache management during decode.
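The roofline logic above can be sketched in a few lines. The figures are illustrative approximations (roughly 2 FLOPs per parameter per token, weights streamed from HBM once per step), and `bottleneck` is a hypothetical helper, not a profiler:

```python
def bottleneck(flops: float, bytes_moved: float,
               peak_tflops: float = 990.0, peak_bw_tbs: float = 3.35) -> str:
    """Roofline check: compare arithmetic intensity (FLOPs per byte loaded)
    against the hardware ridge point (peak compute / peak bandwidth)."""
    intensity = flops / bytes_moved
    ridge = (peak_tflops * 1e12) / (peak_bw_tbs * 1e12)  # ~295 FLOPs/byte on H100
    return "compute-bound" if intensity > ridge else "memory-bandwidth-bound"

# Decode: ~2 FLOPs per weight, each FP16 weight (2 bytes) loaded once
# → intensity ≈ 1 FLOP/byte, far below the ridge point
print(bottleneck(flops=2 * 70e9, bytes_moved=140e9))

# Prefill over a 4096-token prompt amortizes each weight load across all tokens
print(bottleneck(flops=4096 * 2 * 70e9, bytes_moved=140e9))
```

The same weights are loaded either way; only the work done per byte changes, which is why batching helps decode so much.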


Key performance metrics

To evaluate and optimize an inference system, engineers rely on four standard metrics that capture different parts of the user experience and system throughput. Balancing these metrics often involves direct tradeoffs.

Latency breakdown of LLM inference: TTFT dominated by prefill compute, inter-token latency dominated by memory bandwidth for KV cache reads during decode.

TTFT (time to first token)

TTFT measures the prefill latency: specifically, how quickly the model processes the initial prompt and generates the very first output token. Because the entire prompt is processed in parallel, TTFT is primarily dominated by the prompt's length and the model's overall size. For long prompts, TTFT scales roughly linearly with the token count.

This metric is highly visible to users. It's critical for interactive applications like chat interfaces, voice assistants, and real-time code completion, where a delay of even a second can feel sluggish.

  • Measures prefill latency, how quickly the model starts responding
  • Dominated by prompt length and model size
  • Scales roughly linearly with prompt token count (for long prompts)
  • Critical for: interactive applications, voice assistants, code completion

TPS (tokens per second)

TPS (also called "decode throughput") measures how fast the model generates subsequent output tokens after the first token is produced. This metric is constrained by memory bandwidth rather than compute power.

For a single request on a high-end GPU, TPS typically lands around 10-25 tokens per second for large (70B+) models, and higher for smaller ones. However, in production environments, systems use batched inference (often using continuous batching[3]) to process multiple requests concurrently, yielding hundreds or thousands of aggregate TPS across the system.

  • Measures how fast output tokens are generated after the first
  • Single request: typically 10-25 TPS for large models (70B+), up to 100+ TPS for smaller models (7B) on high-end GPUs
  • Batched inference (often using continuous batching[3]): hundreds or thousands of aggregate TPS
  • Determined by: memory bandwidth, not compute

ITL (inter-token latency)

ITL represents the time elapsed between generating consecutive output tokens. For a single request running in isolation, ITL is simply the inverse of TPS (ITL = 1/TPS).

In real-world applications, human readers perceive stuttering or unnatural pauses when ITL exceeds roughly 100 milliseconds. Under heavy batching, ITL can increase because more requests are contending for the same GPU resources.

  • Time between consecutive output tokens: ITL = 1/TPS for a single request
  • Users perceive stuttering when ITL exceeds ~100ms
  • Under batching, ITL increases as more requests share the GPU

TPOT (time per output token)

While ITL measures the raw time between tokens from the user's perspective, TPOT is a system-level metric that accounts for scheduling overhead, queueing delays, and batching contention.

In an actively serving system, TPOT is typically greater than or equal to ITL. Monitoring TPOT helps engineers understand when the inference engine is overloaded and scheduling delays are impacting the overall output rate.

  • System-level metric accounting for batching overhead and scheduling
  • TPOT ≥ ITL due to scheduling delays and contention
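All four metrics can be derived from a single request trace. A minimal sketch, assuming we have the request arrival time and per-token emission timestamps (the `latency_metrics` helper and the trace values are illustrative, not a serving-framework API):

```python
def latency_metrics(request_time: float, token_times: list[float]) -> dict:
    """Derive TTFT, mean ITL, and single-stream decode TPS from a
    request's arrival time and each output token's emission timestamp."""
    ttft = token_times[0] - request_time
    # Inter-token latencies: gaps between consecutive output tokens
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(itls) / len(itls)
    return {"ttft": ttft, "mean_itl": mean_itl, "tps": 1.0 / mean_itl}

# Hypothetical trace: request at t=0.0s, first token at 0.4s, then 50 ms/token
metrics = latency_metrics(0.0, [0.4, 0.45, 0.50, 0.55, 0.60])
# → TTFT 0.4s (prefill), mean ITL 50 ms, 20 TPS for this stream
```

TPOT would be measured the same way but averaged across all requests in the serving system, which is why it absorbs queueing and scheduling delays that a single isolated trace never shows.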

Back-of-the-envelope: maximum TPS on H100

For a single decode stream, TPS is governed by memory bandwidth. To generate one token, the GPU must load the full model weights from HBM. A rough estimate:

Max TPS ≈ HBM Bandwidth / Model size (bytes)

For a 70B model in FP16 on an H100 (3.35 TB/s HBM bandwidth, ~140 GB weights):

Max TPS ≈ 3,350 GB/s / 140 GB ≈ 24 tokens/sec

This is the theoretical ceiling. Real-world single-stream throughput lands around 10-25 tokens/sec for large models because the simple calculation omits KV cache reads during attention, activation memory, kernel launch overhead, and because the full model weights can't be fully cached in on-chip SRAM between steps. With aggressive batching and TensorRT-LLM optimizations, single-stream throughput can reach the upper end of this range. Batched inference on the same hardware yields much higher aggregate throughput by sharing weight loads across concurrent requests.
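The ceiling calculation is simple enough to encode directly; `max_decode_tps` is an illustrative helper that deliberately ignores the KV cache reads and overhead terms listed above:

```python
def max_decode_tps(hbm_bandwidth_gbs: float, model_size_gb: float) -> float:
    """Bandwidth ceiling for single-stream decode: every generated token
    requires streaming the full model weights from HBM at least once."""
    return hbm_bandwidth_gbs / model_size_gb

# 70B model in FP16 (~140 GB) on an H100 (~3.35 TB/s HBM bandwidth)
ceiling = max_decode_tps(3350, 140)  # ≈ 24 tokens/sec theoretical ceiling
```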

🔬 Research insight: This memory-bandwidth bound is fundamental to transformer inference and is why every major serving optimization (PagedAttention[2], continuous batching[3], quantization) ultimately attacks the same bottleneck: reducing bytes moved per token or increasing effective HBM bandwidth.


The KV cache: the dominant bottleneck

In production deployments, the true limit on how many users a system can serve concurrently is almost always memory capacity, specifically the memory required to store the KV cache.

What is the KV cache?

During attention, each layer computes Key and Value projections for every token. Without caching, generating token N would require recomputing attention over all N−1 previous tokens from scratch, which is quadratic in sequence length.

💡 Analogy (exam reference sheet): The KV cache is like a reference sheet you build during an exam. For each question (token) you've already answered, you write down the key facts (K) and your reasoning (V) on the sheet. When the next question references a previous one, you glance at your reference sheet instead of re-deriving everything from scratch. The sheet grows with each question, and its size determines how many questions you can handle before running out of paper (GPU memory).

The KV cache stores these K and V tensors. As the sequence grows, the KV cache accumulates data for each token to avoid redundant calculations. This step-by-step accumulation allows the model to compute attention for only the newest token against the historical cache. The trace below shows how the cache expands with each generated token:

```text
Token 1: Compute K₁, V₁ → Store in cache
Token 2: Compute K₂, V₂ → Store; Attend to [K₁,K₂], [V₁,V₂]
Token 3: Compute K₃, V₃ → Store; Attend to [K₁,K₂,K₃], [V₁,V₂,V₃]
...
```

To manage this growing memory dynamically without fragmentation, systems like vLLM use PagedAttention[2], which divides the KV cache into fixed-size blocks (pages) similar to operating system virtual memory.
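A toy sketch of the paging idea, assuming fixed-size blocks and a simple free list. vLLM's real allocator also manages GPU tensors, copy-on-write prefix sharing, and eviction, none of which are modeled here:

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator: each sequence's KV cache lives in
    fixed-size blocks, so memory grows in page-sized steps and freed blocks
    are reusable by any sequence (no external fragmentation)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical blocks
        self.seq_lens: dict[int, int] = {}

    def append_token(self, seq_id: int) -> None:
        """Reserve room for one more token, grabbing a new block on a boundary."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # current block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: request must queue")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):  # 20 tokens → ceil(20/16) = 2 blocks allocated
    cache.append_token(seq_id=0)
```

The `MemoryError` branch is exactly the queueing behavior described later under dynamic token budgeting: when no block is free, the new token (and hence the request) must wait.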

KV cache memory formula

For a single sequence:

KV Cache = 2 × L × n_kv × d_h × s × b

Reading the formula: for every layer (L), every KV head (n_kv), and every position in the sequence (s), we store a Key vector and a Value vector (the "2") of dimension d_h, each element taking b bytes. Multiply it all together and this cache can easily reach gigabytes for long sequences.

Where:

  • L = number of layers
  • n_kv = number of KV heads (reduced with GQA/MQA (Grouped-Query/Multi-Query Attention)[4])
  • d_h = head dimension
  • s = sequence length
  • b = bytes per element (2 for FP16, 1 for INT8)

Concrete example: Qwen3.5 72B

| Parameter | Value |
|---|---|
| Layers (L) | 80 |
| KV heads (n_kv) | 8 (GQA, not 64 query heads!) |
| Head dim (d_h) | 128 |
| Sequence length (s) | 4,096 |
| Dtype | FP16 (2 bytes) |

KV Cache = 2 × 80 × 8 × 128 × 4096 × 2 = 1.34 GB per sequence

Here's the breakdown: the formula multiplies 2 (for K and V) by 80 layers, 8 KV heads, a 128 head dimension, a 4096 sequence length, and 2 bytes per value (for FP16). With GQA (8 KV heads instead of 64), this is 8× smaller than it would be with Multi-Head Attention (MHA). Without GQA, the same model would need 10.7 GB per sequence.
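The worked example can be checked with a short helper (a direct transcription of the formula, not a library API):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Per-sequence KV cache: 2 (K and V) x layers x KV heads
    x head dim x sequence length x bytes per element."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)
print(gqa / 1e9)  # ≈ 1.34 GB per sequence with GQA
print(mha / gqa)  # 8.0 — full MHA would need 8x more (≈ 10.7 GB)
```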

🎯 Production tip: Serving a 70B model to 100 concurrent users requires precise KV cache budgeting: Model weights (140 GB FP16) + KV cache (1.34 GB × 100 users = 134 GB) = 274 GB total. Feasible on 4× H100-80GB with tensor parallelism (splitting the model across multiple GPUs).


Dynamic token budgeting

In production environments, context length isn't a static limit determined solely by the model's architecture. Instead, it's a dynamic memory budget that dictates how many concurrent users your system can support. Every additional token of context required by one user reduces the available GPU Video Random Access Memory (VRAM) for everyone else.

To serve models at scale, inference engines must strictly enforce these budgets. When a request comes in, the system checks the available GPU memory. If the required KV cache for the new request (plus existing ones) exceeds the remaining capacity, the request must wait in a queue. This is why managing the TTFT-TPS tradeoff is so important for overall throughput.

To calculate the maximum affordable context length, we can write a simple capacity planning function. The function takes the total GPU memory, the model's static weight footprint, and its architectural parameters (layers, KV heads, and dimension) as inputs. It computes the available memory per user and divides it by the per-token KV cache size, returning the maximum number of tokens each user can generate:

```python
def max_context_for_budget(
    gpu_memory_gb: float,
    model_memory_gb: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # FP16
    num_concurrent: int = 1,
) -> int:
    """Calculate maximum context length given memory constraints."""
    available_memory = (gpu_memory_gb - model_memory_gb) * 1e9

    # Memory per token in KV cache
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

    # Divide by concurrent users
    budget_per_user = available_memory / num_concurrent

    return int(budget_per_user / bytes_per_token)

# Example: Qwen3.5 72B on 4×H100-80GB (320GB total)
max_tokens = max_context_for_budget(
    gpu_memory_gb=320,
    model_memory_gb=140,  # FP16 weights
    num_layers=80,
    num_kv_heads=8,  # GQA: 8 KV heads (not 64!)
    head_dim=128,
    num_concurrent=50,
)
# Result: ~11,000 tokens per user with 50 concurrent users
```

This calculation drives critical deployment decisions. If you need to support 100 concurrent users but only have the budget for 5,000 tokens each, you might need to add another GPU node, reduce the model precision using quantization, or implement stricter context window limits at the application layer.


Production optimizations

As the mechanics of inference reveal, balancing compute-bound prefilling and memory-bound decoding is challenging. Engineering teams use advanced serving techniques to smooth out these tradeoffs and maximize hardware utilization.

Chunked prefill

Large prefills block decode operations, causing a TTFT-TPS tradeoff: prioritizing new prefills stalls existing decode streams.

Chunked prefill splits long prompts into smaller chunks, interleaving them with decode steps:

💡 Analogy (factory assembly line): Without chunked prefill, it's like shutting down the entire factory assembly line to set up for a new product. All existing products stop moving while you retool. With chunked prefill, you retool one station at a time while the rest of the line keeps running. Existing requests keep flowing (decode continues) while the new request is gradually set up (prefilled in chunks).

The timeline below illustrates how chunked prefill avoids stalling decodes by breaking up the massive prefill block. By interleaving smaller prefill chunks with ongoing decode steps, the system maintains a steady flow of output tokens for existing users while gradually processing the new prompt:

```text
Without chunked prefill:
  [Prefill 10K tokens ===========================] [Decode...Decode...Decode...]
  ↑ All decode requests stall during this prefill

With chunked prefill (chunk=2048):
  [Prefill chunk1][Decode][Prefill chunk2][Decode][Prefill chunk3][Decode]...
  ↑ Decode requests continue between chunks
```

Benefits

Better GPU utilization (mixing compute-bound prefill with memory-bound decode), smoother token streaming, reduced tail latency. Enabled by default in vLLM V1 and formalized in systems like Sarathi-Serve[5].
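The interleaving pattern can be illustrated with a toy scheduler. This is a sketch of the ordering only; real engines like Sarathi-Serve budget chunk sizes against a per-iteration token budget and batch the phases together, which this ignores:

```python
def schedule(prefill_tokens: int, decode_steps: int, chunk: int = 2048) -> list[str]:
    """Toy chunked-prefill schedule: emit one prefill chunk, then one decode
    step, so existing streams never wait for the whole prompt to finish."""
    ops, done = [], 0
    while done < prefill_tokens or decode_steps > 0:
        if done < prefill_tokens:
            ops.append(f"prefill[{min(chunk, prefill_tokens - done)}]")
            done += chunk
        if decode_steps > 0:
            ops.append("decode")
            decode_steps -= 1
    return ops

# A 10K-token prompt no longer monopolizes the GPU: decode steps for
# existing users run between its five 2048-token chunks.
print(schedule(prefill_tokens=10_000, decode_steps=8))
```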

Prefill-decode disaggregation

In a standard setup, a single GPU handles both prefill and decode phases for its assigned requests. However, because prefill is compute-bound and decode is memory-bandwidth bound, using the same hardware for both leads to resource underutilization. Modern systems (Splitwise[6], DistServe[7], Mooncake[8]) solve this by separating prefill and decode onto different GPU pools:

[Diagram: prefill GPU pool builds the KV cache, then hands it off to a separate decode GPU pool]

Architectural benefits

  • No cross-phase interference: prefill never stalls decode
  • Independent scaling: add prefill GPUs for prompt-heavy workloads, decode GPUs for concurrent users
  • Hardware matching: use compute-optimized GPUs for prefill, high-bandwidth GPUs for decode

KV cache quantization

Store KV cache in INT8 or even INT4 instead of FP16 to halve/quarter memory with minimal quality loss (see our model quantization deep-dive for the techniques behind weight and activation quantization):

KV Cache (INT8) = KV Cache (FP16) / 2

Reading the formula: INT8 uses 1 byte per value instead of 2 bytes (FP16), so KV memory is cut in half for the same sequence length and concurrency.

While quantizing weights reduces the static memory footprint of the model, quantizing the KV cache specifically attacks the dynamic memory bottleneck that limits concurrency. Modern serving engines apply this quantization on-the-fly, converting activations to INT8 before storing them in the cache and dequantizing them back to higher precision during the attention computation. This immediately doubles the number of concurrent users you can serve on the same hardware.
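A minimal sketch of the round trip, assuming symmetric per-tensor INT8 in pure Python; production engines use per-channel or per-block scales and fused GPU kernels, which this does not model:

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor INT8: store one FP scale plus 1-byte values,
    halving KV cache memory versus 2-byte FP16."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid div-by-zero scale
    return [round(v / scale) for v in values], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Restore approximate FP values before the attention matmul."""
    return [x * scale for x in q]

kv = [0.12, -1.7, 0.003, 2.5, -0.9]
q, scale = quantize_int8(kv)
restored = dequantize(q, scale)
# Worst-case rounding error is half a quantization step (scale / 2)
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(kv, restored))
```

The error bound is why KV cache quantization is usually low-risk at INT8: each stored value moves by at most half a step, while memory (and hence maximum concurrency) doubles.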


Common misconceptions

When working with LLM inference, it's easy to misunderstand where the bottlenecks actually lie. Here are a few common pitfalls to avoid.

❌ "LLMs generate tokens one at a time"

This is only true for the decode phase. During the prefill phase, the entire prompt is processed in parallel. All input tokens are passed through the transformer layers simultaneously to compute their initial Key and Value states, which constructs the starting KV cache. Assuming generation is entirely sequential ignores the massive burst of parallel compute that happens before the first output token is ever produced.

❌ "More GPUs = faster generation"

Throwing more compute at the problem doesn't linearly speed up token generation for a single request. During decode, the bottleneck is loading weights from memory (memory bandwidth), not arithmetic operations. In fact, for a single request, excessive model parallelism across multiple GPUs can actually slow down generation because the communication overhead (passing intermediate tensors between GPUs via NVLink) outpaces the minor compute gains. Faster single-stream generation requires GPUs with higher memory bandwidth (e.g., H200 vs H100), not simply adding more GPUs.

❌ "Context length is just a model limitation"

While models are trained with a maximum context window (e.g., 1M+ for GPT-5.4), in production, context length is primarily a memory budget decision. Every token held in the context window consumes physical VRAM on the GPU in the form of the KV cache. If you allow users to use the full 1M+ context, you can serve drastically fewer concurrent users on the same hardware. Engineering teams often cap context lengths well below the model's theoretical maximum to ensure sufficient memory is available for high concurrency.

❌ "TTFT and TPS improve together"

These two metrics are often in direct tension. Optimizing for TTFT generally means dedicating large, uninterrupted blocks of GPU time to process incoming prefill requests as fast as possible. However, doing so stalls any ongoing decode requests, causing the system's overall TPS (and the user's inter-token latency) to suffer. Conversely, prioritizing smooth token streaming (high TPS) means delaying new prefills. System designers must carefully tune scheduling algorithms like chunked prefill to balance these competing requirements.


Key takeaways

  1. Two-phase inference: Prefill (compute-bound, parallel) → Decode (memory-bandwidth-bound, sequential)
  2. TTFT measures prefill speed; TPS measures decode throughput. They are driven by different bottlenecks and are often in tension.
  3. KV cache is the primary memory bottleneck for concurrent serving; derive the formula from model architecture
  4. GQA/MQA reduces KV cache proportionally to head count reduction (8× for Qwen3.5 72B)
  5. Chunked prefill interleaves prompt processing with decode to avoid stalling
  6. Disaggregation separates prefill and decode onto different GPU pools for independent scaling
  7. Dynamic token budgeting calculates max context from GPU memory, model size, and concurrency

Evaluation Rubric
  1. Explains prefill vs decode with correct bottleneck identification
  2. Derives KV cache memory formula from architecture parameters
  3. Accounts for GQA in KV cache calculation (not full query heads)
  4. Discusses arithmetic intensity to explain compute vs memory-BW bound
  5. Explains chunked prefill and why it helps
  6. Explains prefill-decode disaggregation and when to use it
  7. Calculates max concurrent users from GPU memory budget
Common Pitfalls
  • Saying LLMs generate tokens one at a time (prefill is parallel)
  • Thinking more GPUs always means faster single-request generation
  • Using full query head count instead of KV head count for cache sizing
  • Not knowing that TTFT and TPS are often in tension
  • Treating context length as only a model limitation, not a memory budget
Key Concepts Tested
  • Two-phase inference: prefill (compute-bound) vs decode (memory-BW-bound)
  • TTFT, TPS, ITL, TPOT: what each measures and what drives it
  • KV cache memory formula derivation from model architecture
  • GQA/MQA effect on KV cache size
  • Arithmetic intensity: why prefill is compute-bound and decode is memory-bound
  • Chunked prefill for smoothing the TTFT-TPS tradeoff
  • Prefill-decode disaggregation onto separate GPU pools
  • Dynamic token budgeting for concurrent serving
  • KV cache quantization (INT8/INT4) to double concurrency
References

1. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.
2. Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
3. Yu, G.-I., et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022.
4. Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023.
5. Agrawal, A., et al. (2023). Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv preprint.
6. Patel, P., et al. (2023). Splitwise: Efficient Generative LLM Inference Using Phase Splitting.
7. Zhong, Y., et al. (2024). DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. OSDI 2024.
8. Qin, Y., et al. (2024). Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving.
