LeetLLM

Scaling LLM Inference

Explains why decode-heavy LLM serving is often memory-bound and how KV-cache design, batching, PagedAttention, and speculative decoding improve scale.

The previous chapter showed how continuous batching keeps decode slots useful. This chapter asks the capacity question for scaling large language model (LLM) inference: when requests, model weights, and KV state all compete for HBM, what actually limits serving concurrency?

Imagine you run an online store and you want a chatbot that answers "Where's my order?" Every time a customer asks, the model has to generate a response one token (roughly, one word) at a time. The delay isn't because the model is "thinking." During decode, the serving stack keeps rereading billions of model weights from GPU memory and consulting a growing KV cache, and that memory movement takes time. Picture a fulfillment line where the routing map has to be reloaded for every single item. Reading the pick list is fast. Reopening the same giant routing map one item at a time is painfully slow.

This article ties together prefill, decode, batching, KV-cache memory, PagedAttention, disaggregation, speculative decoding, and quantization so serving bottlenecks become measurable instead of mysterious. The thread running through all of them is one decision: where to sit on the throughput, latency, and cost triangle for your workload. For the scheduling loop behind this chapter, see continuous batching.

The two phases of generation

LLM inference is distinct from training because it consists of two radically different computational phases: Prefill and Decode. Understanding this distinction is the first step to optimization.

Think of our order-tracking bot. When a customer sends "Where is order 48291?", the system first has to read and understand that entire sentence. That's the prefill phase. Then it starts answering, generating one word at a time: "Your", "order", "is", "in", "transit." That's the decode phase.

Prefill: reading the prompt in one go

In the prefill phase, the model processes the entire user prompt in parallel. This is similar to training: the GPU receives a matrix of shape [batch_size, prompt_len, hidden_dim] and computes attention for all tokens simultaneously.

Because all the input tokens are known upfront, the attention mechanism can compute the interactions between every token in the prompt at once. This parallel processing allows the GPU to use its massive matrix multiplication engines efficiently. A long prefill usually dominates Time To First Token (TTFT), though TTFT also includes queueing and scheduling delay before the first output token is emitted. The figure below shows how all tokens in the prompt are processed simultaneously to generate the first output token.

Figure: Prefill phase diagram showing known prompt tokens processed in parallel by large matrix operations before the first output token.
Prefill is the parallel prompt-processing phase. It usually dominates TTFT for long prompts because the first output token can't be emitted until the prompt has been processed.

Key characteristics

  • Often compute-heavy: Prefill exposes large matrix multiplications. FlashAttention keeps attention exact while reducing HBM traffic relative to materializing the full attention matrix; it does not make all attention IO linear in sequence length.[1]
  • Parallel-friendly: Processing many prompt positions together can drive much higher tensor-core utilization than one-token decode. Whether it saturates compute depends on sequence shape, kernel, and hardware.
  • Latency: Time usually grows with prompt length, and long prompts often dominate TTFT.

Decode: answering one word at a time

Once the first token is generated, the model switches to autoregressive generation. It generates one token at a time, feeding it back as input for the next step.

Unlike the prefill phase, decoding can't be parallelized across tokens because each new token depends on the previous ones. The system is locked into a sequential, step-by-step loop. The speed of this phase is measured as Time Per Output Token (TPOT); its inverse, tokens per second (TPS), describes how fast the text streams to the user. While TTFT affects perceived responsiveness, TPOT determines the "reading speed" of the generation. The following figure shows this autoregressive process, where each generated token is fed back as input for the next step.

Figure: Decode loop diagram showing each output token rereading model weights and KV state, computing logits, sampling one token, and appending new KV before the next step.
Decode is a sequential memory loop. Batching can amortize repeated weight reads across requests, but each request still advances one generated token at a time.
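To make the two latency metrics concrete, here is a minimal sketch that derives TTFT, mean TPOT, and decode TPS from per-token emission timestamps. The function name `latency_metrics` and the timestamps are hypothetical and chosen only for illustration; a real serving stack would report these as percentiles across many requests.

```python
def latency_metrics(request_arrival_s: float, token_times_s: list[float]) -> dict[str, float]:
    """Derive TTFT, mean TPOT, and decode TPS from token emission timestamps (seconds)."""
    if not token_times_s:
        raise ValueError("need at least one emitted token")
    # TTFT covers queueing, scheduling, and prefill up to the first output token.
    ttft = token_times_s[0] - request_arrival_s
    decode_tokens = len(token_times_s) - 1
    if decode_tokens == 0:
        return {"ttft_s": ttft, "tpot_s": float("nan"), "tps": float("nan")}
    # TPOT is the mean gap between consecutive output tokens after the first one.
    tpot = (token_times_s[-1] - token_times_s[0]) / decode_tokens
    return {"ttft_s": ttft, "tpot_s": tpot, "tps": 1.0 / tpot}


# Hypothetical trace: request arrives at t=0, first token at 0.40 s, then one token every 25 ms.
times = [0.40 + 0.025 * i for i in range(5)]
metrics = latency_metrics(0.0, times)
print(f"TTFT: {metrics['ttft_s'] * 1000:.0f} ms")
print(f"TPOT: {metrics['tpot_s'] * 1000:.1f} ms/token")
print(f"TPS:  {metrics['tps']:.1f} tokens/s")
```

TPOT and TPS are reciprocals for a single stream, so reporting either one is enough as long as the unit is stated.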

Key characteristics

  • Often memory-bound: A decode step needs model weights and the KV state used by attention. At small or latency-sensitive batches, repeated reads commonly make HBM bandwidth the ceiling; batching can raise arithmetic intensity by sharing weight reads across active requests.
  • Low arithmetic intensity at small batches: The arithmetic intensity (FLOPs/byte, i.e., Floating Point Operations per byte of data loaded) can be low because the runtime moves large tensors for only one new position per sequence.

Why decode is memory-bound

Decode-heavy LLM serving is often memory-bandwidth bound, not compute-bound. In a compute-bound operation, the system is bottlenecked by the mathematical calculations it must perform. Training and long-prefill workloads typically expose much larger matrix operations than interactive decode, so they can drive compute hardware more effectively.

During token generation, the bottleneck often shifts toward memory movement. Each new token needs model weights and attention state, but contributes only one new position per active request. At modest decode batches, this produces low arithmetic intensity and makes HBM traffic a central constraint.
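The effect of batch size on arithmetic intensity can be estimated for a single linear layer. The sketch below is a simplified model, not a profiler: it assumes FP16 storage, a hypothetical 4096 x 4096 weight matrix, weights read once per step, and no KV-cache traffic or on-chip cache reuse.

```python
def linear_layer_intensity(tokens: int, d_in: int, d_out: int, dtype_bytes: int = 2) -> float:
    """Approximate FLOPs per byte for one matmul of `tokens` positions against a [d_in, d_out] weight."""
    flops = 2 * tokens * d_in * d_out                         # one multiply and one add per weight per position
    weight_bytes = d_in * d_out * dtype_bytes                 # weights are read once, however many positions share them
    activation_bytes = tokens * (d_in + d_out) * dtype_bytes  # read inputs, write outputs
    return flops / (weight_bytes + activation_bytes)


# `tokens` is the number of positions processed in one pass:
# 1 for single-request decode, the decode batch size for batched decode,
# or the prompt length for prefill.
for tokens in (1, 8, 64, 512):
    intensity = linear_layer_intensity(tokens, d_in=4096, d_out=4096)
    print(f"{tokens:>4} positions per step: {intensity:7.1f} FLOPs/byte")
```

Under these assumptions a single decode position does roughly one FLOP per byte moved, while hundreds of positions per pass reuse the same weight read and push intensity up by two orders of magnitude.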

To make this concrete, imagine a model with 7 billion parameters stored in 16-bit precision. Its weights occupy about 14 GB in decimal units. If one uncached decode step had to read that full weight footprint for one active token, the weight-read lower bound alone would be about 14 GB per step. Real kernels, cache reuse, batch size, tensor parallelism, and KV traffic determine the observed bandwidth cost.

When profiling confirms this bandwidth ceiling, serving work should focus on bytes moved, cache residency, batch policy, and queueing behavior rather than only raw floating-point throughput.

Figure: Roofline-style utilization comparison showing prefill leaning toward compute while small-batch decode can lean toward HBM bandwidth.
Prefill and decode hit different ceilings. Decode often saturates memory bandwidth while tensor cores wait, so bandwidth-reducing optimizations matter more than peak FLOPs.

decode-bandwidth-lower-bound.py
parameters = 7_000_000_000
bytes_per_parameter = 2  # FP16
ideal_hbm_bandwidth_gb_s = 2_000

weight_bytes = parameters * bytes_per_parameter
ideal_steps_per_second = ideal_hbm_bandwidth_gb_s * 1_000_000_000 / weight_bytes

print(f"FP16 weight footprint: {weight_bytes / 1_000_000_000:.2f} GB")
print(f"ideal weight-read upper bound: {ideal_steps_per_second:.1f} single-token steps/s")
print("Observed TPS is lower once KV reads and runtime overhead are included.")
Output
FP16 weight footprint: 14.00 GB
ideal weight-read upper bound: 142.9 single-token steps/s
Observed TPS is lower once KV reads and runtime overhead are included.

The KV cache: saving state so you don't restart

Without caching, every decode step would recompute the keys and values for all previous tokens before attending over them. The KV cache stores the Key and Value tensors for all past tokens, so each step only needs to compute them for the new token.

Think of it like a shift handoff log. Without it, our order-tracking bot would have to reread the entire customer conversation from the beginning every time it wanted to say the next word. With the KV cache, it remembers what it already understood and only processes the newest token.

Figure: KV cache packing comparison showing contiguous reservations wasting memory while paged block allocation packs active request blocks into a shared reusable pool.
Once KV state persists across decode steps, memory packing becomes a scheduling constraint. Paged blocks make more HBM usable for active requests.

The illustration here zooms in on a different but equally important serving concern: once you keep KV states around, you need to pack them efficiently in GPU memory instead of reserving one giant contiguous region per request.

Figure: KV cache append diagram showing prefix keys and values reused across decode steps while the newest token adds one fresh KV pair.
The KV cache keeps old keys and values alive across decode steps. Each new token adds one fresh KV pair instead of rebuilding the whole prefix.
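The saving from caching can be counted directly. The sketch below tallies how many token positions need their keys and values computed with and without a KV cache; the function names and the 1,000-token prompt with 100 output tokens are hypothetical values for illustration, and the count ignores the attention reads themselves.

```python
def kv_token_passes_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Token positions whose K and V are computed when nothing is cached."""
    # Producing output token i (0-based) reprocesses the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))


def kv_token_passes_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Token positions whose K and V are computed when past K and V are kept."""
    # Prefill computes K and V for the prompt once and yields the first output token;
    # every later decode step adds K and V for exactly one new token.
    return prompt_len + max(new_tokens - 1, 0)


prompt_len, new_tokens = 1_000, 100
uncached = kv_token_passes_without_cache(prompt_len, new_tokens)
cached = kv_token_passes_with_cache(prompt_len, new_tokens)
print(f"without cache: {uncached:,} K/V computations")
print(f"with cache:    {cached:,} K/V computations")
print(f"reduction:     {uncached / cached:.1f}x")
```

The cache trades that recomputation for memory that must stay resident for the life of the request.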

Memory cost of KV cache

The KV cache is often the largest consumer of GPU memory during inference, sometimes exceeding the model weights themselves for long contexts. This is crucial for capacity planning and determining the maximum batch size a given GPU can support.

Let's work through a concrete example by hand before showing the code. Suppose we're serving our order-tracking bot with a model that has 80 layers, uses Grouped Query Attention with 8 KV heads, and each head has dimension 128. For one request with a sequence length of 8,192 tokens, stored in FP16 (2 bytes per element):

  • We need both K and V: that's a factor of 2
  • One request, 8,192 tokens, 80 layers, 8 heads, head size 128, 2 bytes each
  • Total bytes = 2 * 1 * 8,192 * 80 * 8 * 128 * 2 = 2,684,354,560 bytes
  • Divide by 1024^3: that's about 2.5 GiB per request

Now scale that up. For a production batch of 64 concurrent requests at an 8K context window, that's about 160 GiB of KV cache alone. This is why techniques like Grouped Query Attention (GQA), which reduces the number of KV heads from num_heads to num_kv_heads, are standard in modern models.[2]

The following Python function generalizes that exact calculation. It takes the model's architectural parameters and returns the KV cache memory in GiB.

Figure: KV cache capacity chart showing memory growing linearly with sequence length and batch size, with an 8K-context 64-request batch reaching about 160 GiB in the worked example.
KV memory grows linearly with sequence length and batch size. For long contexts, KV cache alone can cap concurrency before model weights do.

memory-cost-of-kv-cache.py
def kv_cache_memory(
    batch_size: int,
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2  # FP16
) -> float:
    """Calculate KV cache memory in GiB."""
    # 2 for K and V, per layer, per head
    total_bytes = (
        2 * batch_size * seq_len * num_layers * num_kv_heads * head_dim * dtype_bytes
    )
    return total_bytes / (1024 ** 3)

# Example model: 80 layers, 8 KV heads (GQA), head_dim=128
# Batch=1, seq_len=8192, FP16 (2 bytes):
# 2 * 1 * 8192 * 80 * 8 * 128 * 2 = ~2.5 GiB per request
one_request = kv_cache_memory(
    batch_size=1,
    seq_len=8192,
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
)
production_batch = kv_cache_memory(
    batch_size=64,
    seq_len=8192,
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
)

print(f"one 8K request: {one_request:.1f} GiB")
print(f"64 active 8K requests: {production_batch:.0f} GiB")
print("single-request estimate correct:", one_request == 2.5)
print("64-request estimate correct:", production_batch == 160.0)
Output
one 8K request: 2.5 GiB
64 active 8K requests: 160 GiB
single-request estimate correct: True
64-request estimate correct: True
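The same formula can be inverted to answer the capacity question: how many full-length requests fit once weights are resident? The sketch below is a rough planning estimate under loudly stated assumptions: the 640 GiB pooled HBM figure, the 130 GiB weight footprint, and the 30 GiB reserve for activations and runtime overhead are hypothetical, and the estimate ignores parallelism overhead and block-level fragmentation.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of K and V state stored per token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_requests(
    hbm_gib: float,
    weights_gib: float,
    reserve_gib: float,
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
) -> int:
    """Upper bound on full-length requests whose KV cache fits in the leftover HBM."""
    kv_budget_bytes = (hbm_gib - weights_gib - reserve_gib) * 1024**3
    per_request_bytes = kv_bytes_per_token(num_layers, num_kv_heads, head_dim) * seq_len
    return max(0, int(kv_budget_bytes // per_request_bytes))


# Hypothetical deployment: 640 GiB pooled HBM, 130 GiB of weights, 30 GiB reserved.
shared = dict(hbm_gib=640, weights_gib=130, reserve_gib=30, seq_len=8192, num_layers=80, head_dim=128)
gqa_limit = max_concurrent_requests(num_kv_heads=8, **shared)   # GQA with 8 KV heads
mha_limit = max_concurrent_requests(num_kv_heads=64, **shared)  # same model if all 64 heads kept their own K and V

print(f"GQA (8 KV heads):  {gqa_limit} concurrent 8K requests")
print(f"MHA (64 KV heads): {mha_limit} concurrent 8K requests")
```

With these inputs the KV budget, not the weights, sets the concurrency ceiling, and cutting KV heads by 8x raises that ceiling by the same factor.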

Throughput vs. latency trade-off

There's an inherent tension between maximizing system throughput and minimizing per-request latency.

Figure: Chart showing throughput rising with batch pressure while p99 latency rises faster once batch size gets large.
Batching improves aggregate tokens per second, but user-facing latency can worsen once large batches push harder on shared memory bandwidth. Track throughput and TTFT or TPOT together.
| Metric | Optimized by | Trade-off |
| --- | --- | --- |
| Throughput (tokens/sec) | Larger effective batches | Can increase TTFT or inter-token latency once shared resources are pressured. |
| Latency (ms/token) | Smaller admitted batches | Can leave throughput unused and raise cost per token. |

Production tip: Monitor GPU KV-cache usage, prefill backlog, and decode queue depth together. High KV usage plus rising TTFT usually means memory pressure is capping concurrency. Low KV usage with idle compute means you're leaving throughput on the table.

The throughput, latency, cost triangle

Throughput and latency are two corners of a triangle. The third corner is the constraint the business actually cares about: cost per token. These three pull against each other, and picking where to sit on that triangle is the central job of an inference engineer.

Cost per token is simpler than it looks. If you rent a GPU at a fixed hourly rate and it sustains some number of tokens per second, then:

$$\text{cost per token} = \frac{\text{GPU \$ per hour}}{\text{sustained tokens per second} \times 3600}$$

Sustained throughput, not the sticker hourly rate, dominates the answer. A faster, pricier GPU can still be cheaper per token if its throughput rises faster than its price. Let's work an example by hand. Suppose one GPU costs $3.00/hour and a well-batched deployment sustains 2,500 decode tokens/second across all active requests:

  • Tokens per hour = 2,500 * 3,600 = 9,000,000
  • Cost per token = $3.00 / 9,000,000 = $0.00000033
  • Cost per million tokens = about $0.33

Now starve the batch. If under-configured batching or idle capacity drops sustained throughput to 250 tokens/second, the same GPU-hour spreads over one-tenth the tokens, so cost per million jumps to about $3.33. A 10x drop in sustained throughput is a direct 10x multiplier on cost. This is why batching is not only a latency knob; it moves the cost corner of the triangle.

cost-per-million-tokens.py
def cost_per_million(hourly_cost: float, sustained_tps: int) -> float:
    return hourly_cost / (sustained_tps * 3600) * 1_000_000

well_batched = cost_per_million(3.00, 2_500)
starved = cost_per_million(3.00, 250)

print(f"2,500 tokens/s: ${well_batched:.2f} per million tokens")
print(f"250 tokens/s: ${starved:.2f} per million tokens")
print(f"cost multiplier: {starved / well_batched:.0f}x")
Output
2,500 tokens/s: $0.33 per million tokens
250 tokens/s: $3.33 per million tokens
cost multiplier: 10x

The triangle has a simple rule: you can usually optimize two corners hard, but the third drifts. Push batch size for throughput and cost, and tail latency rises. Cap batch size for tight latency SLOs, and your cost per token climbs because the GPU is underused. There is no single best operating point, only the one that fits your product's latency SLO at acceptable cost.

| Operating point | Batch size | Cost per token | Latency (TTFT/TPOT) | Typical fit |
| --- | --- | --- | --- | --- |
| Latency-first | Small | High | Low | Interactive chat, code completion |
| Balanced | Medium | Medium | Medium | General chat assistants |
| Throughput-first | Large | Low | High | Offline batch jobs, summarization, evals |

Production tip: Pick the operating point from the product SLO, then size hardware to it. An interactive assistant with a 500 ms TTFT budget can't run the same batch size as an overnight document-summarization job, even on identical GPUs. The summarization job can push batch size until cost per token bottoms out because no human is waiting on each token.
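One way to turn that advice into a procedure is to benchmark a few batch sizes, discard those that violate the latency SLO, and keep the cheapest survivor. The sketch below assumes you already have such measurements; every number in `measured` is hypothetical, as are the `OperatingPoint` and `pick_operating_point` names, and the $3.00/hour rate is reused from the worked example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OperatingPoint:
    batch_size: int
    sustained_tps: float  # aggregate decode tokens/second across the batch
    p99_tpot_ms: float    # tail per-token latency seen by one stream


def cost_per_million(hourly_cost: float, sustained_tps: float) -> float:
    return hourly_cost / (sustained_tps * 3600) * 1_000_000


def pick_operating_point(points: list[OperatingPoint], tpot_slo_ms: float) -> Optional[OperatingPoint]:
    """Return the highest-throughput point that still meets the TPOT SLO, or None if none does."""
    feasible = [p for p in points if p.p99_tpot_ms <= tpot_slo_ms]
    return max(feasible, key=lambda p: p.sustained_tps) if feasible else None


# Hypothetical benchmark results for one GPU at $3.00/hour.
measured = [
    OperatingPoint(batch_size=1, sustained_tps=60, p99_tpot_ms=18),
    OperatingPoint(batch_size=8, sustained_tps=420, p99_tpot_ms=24),
    OperatingPoint(batch_size=32, sustained_tps=1_400, p99_tpot_ms=45),
    OperatingPoint(batch_size=128, sustained_tps=2_500, p99_tpot_ms=140),
]

interactive = pick_operating_point(measured, tpot_slo_ms=50)
offline = pick_operating_point(measured, tpot_slo_ms=200)
for label, point in (("interactive", interactive), ("offline", offline)):
    print(f"{label}: batch {point.batch_size}, ${cost_per_million(3.00, point.sustained_tps):.2f} per million tokens")
```

At a fixed hourly rate, the highest feasible throughput is also the lowest cost per token, so the SLO is the only thing keeping the interactive workload from the cheaper point.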

Chunked prefills

Long prefills can delay decode turns on a shared worker. Retrieval-Augmented Generation (RAG) is a common source: retrieved documents are appended to the user's prompt, creating contexts of around 10,000 tokens. How long the delay lasts depends on model, kernel, hardware, and prompt length, but enough long prompts can noticeably worsen active streams in a multi-tenant service.

To reduce that interference, engineers can break large prefills into smaller, fixed-size chunks (e.g., 512 tokens). The system admits a chunk, gives active decodes another scheduling opportunity, and later processes the next chunk. This bounds admitted prefill work per turn, but does not guarantee a latency SLO when queues or kernels are already overloaded. It often trades a higher TTFT for the long request against improved TPOT for existing streams.[3]

Figure: Chunked prefill timeline showing a long prompt split into chunks that provide decode scheduling opportunities between admitted prefill slices.
Chunked prefill slices a large prompt into smaller GPU turns. The long request may wait longer for first token, while active streams get more opportunities to decode.

prefill-chunk-budget.py
import math

prompt_tokens = 10_000
chunk_tokens = 512
turns = math.ceil(prompt_tokens / chunk_tokens)
last_chunk = prompt_tokens - chunk_tokens * (turns - 1)

print(f"prefill turns: {turns}")
print(f"largest admitted prefill slice: {chunk_tokens} tokens")
print(f"last slice: {last_chunk} tokens")
Output
prefill turns: 20
largest admitted prefill slice: 512 tokens
last slice: 272 tokens
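The chunk count shows how much prefill work each turn admits, but not how decode gets a turn in between. The toy schedule below alternates one prefill chunk with one decode step for the active streams. It is a deliberate simplification: the function name `interleaved_turns` is hypothetical, and production schedulers commonly fuse a prefill chunk and the pending decode tokens into a single batch under a token budget rather than strictly alternating.

```python
def interleaved_turns(prompt_tokens: int, chunk_tokens: int, active_decodes: int) -> list[tuple[str, int]]:
    """Build a toy GPU schedule: one prefill chunk, then one decode step for active streams."""
    turns: list[tuple[str, int]] = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(chunk_tokens, remaining)
        turns.append(("prefill", chunk))  # admit at most chunk_tokens of prompt work
        remaining -= chunk
        if active_decodes > 0:
            # Each active stream advances by one token before the next chunk is admitted.
            turns.append(("decode", active_decodes))
    return turns


schedule = interleaved_turns(prompt_tokens=10_000, chunk_tokens=512, active_decodes=12)
prefill_chunks = [size for kind, size in schedule if kind == "prefill"]
decode_steps = sum(1 for kind, _ in schedule if kind == "decode")

print(f"prefill chunks: {len(prefill_chunks)}, last chunk: {prefill_chunks[-1]} tokens")
print(f"decode steps granted while the long prompt prefills: {decode_steps}")
print(f"first four turns: {schedule[:4]}")
```

In this toy model an active stream never waits behind more than one 512-token chunk, whereas an unchunked prefill would hold it behind all 10,000 tokens.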

Disaggregated inference

Disaggregated inference separates prefill and decode across worker pools rather than running both phases on one worker. Systems such as Splitwise and DistServe show when this can improve goodput: avoided interference must repay KV-transfer and coordination overhead.[4][5]

The problem: conflicting optimization targets

Prefill and decode phases have opposite hardware needs:

  • Prefill is often compute-heavy: Long prompts create large matrix operations over many positions
  • Decode is often bandwidth-heavy: Small-batch token generation repeatedly reads weights and KV state

When both phases run on the same GPU, they interfere. A long prefill can "block" decode requests, causing head-of-line blocking where a massive prompt stalls generation for all other users.

Prefill-decode disaggregation

The solution is architectural separation:

  1. Prefill cluster (compute-optimized): Dedicated prefill workers process incoming prompts in parallel. These nodes excel at the compute-heavy attention operations.

  2. Decode cluster (bandwidth-optimized): Separate workers handle token generation. These nodes are tuned for the memory-bound sequential decoding loop and steady high-concurrency decode traffic.

  3. KV cache handoff: After prefill completes, the KV cache is transferred over a fast interconnect from the prefill worker to a decode worker, which continues generation.

Figure: Disaggregated inference architecture showing compute-optimized prefill workers handing KV cache over a fast interconnect to bandwidth-optimized decode workers.
Prefill-decode disaggregation separates the compute-heavy prompt phase from the bandwidth-heavy streaming phase. It helps when avoided queueing costs exceed KV-transfer overhead.

Benefits of disaggregation

  • Can reduce head-of-line blocking: Long prefills no longer directly occupy decode workers
  • Potentially right-sized hardware: Each phase can run on workers chosen for its measured bottleneck
  • Separate scaling knobs: Prefill and decode clusters can scale independently based on workload patterns
  • Potential efficiency gain: Separate pools can better match the two workload profiles when transfer overhead is acceptable

Disaggregation is a design pattern, not a mandatory default. The KV-transfer cost has to be lower than the queueing and interference it removes, which is why it becomes more attractive as prompts get longer and decode traffic gets denser.[4][5]

Disaggregation also changes how you autoscale. Because prefill load tracks incoming prompt tokens and decode load tracks active generation, the two pools scale on different signals. A burst of long RAG prompts points toward more prefill capacity; a surge in concurrent streaming conversations points toward more decode capacity. Useful signals include queue depth, KV-cache utilization, and TTFT/TPOT percentiles, with cold-start time included because loading model weights onto a new worker is not instant.

kv-handoff-lower-bound.py
kv_cache_gib = 2.5
interconnect_gb_s = 200

transfer_bytes = kv_cache_gib * 1024**3
ideal_transfer_ms = transfer_bytes / (interconnect_gb_s * 1_000_000_000) * 1000

print(f"KV state to transfer: {kv_cache_gib:.1f} GiB")
print(f"ideal one-way transfer floor at {interconnect_gb_s} GB/s: {ideal_transfer_ms:.1f} ms")
print("Queueing saved must exceed transfer plus scheduling overhead.")
Output
KV state to transfer: 2.5 GiB
ideal one-way transfer floor at 200 GB/s: 13.4 ms
Queueing saved must exceed transfer plus scheduling overhead.
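The break-even condition can be written as a one-line check. The sketch below is an idealized comparison, not a model of any particular system: the 5 ms coordination overhead, the queueing-saved figures, and the 25 GB/s slower link are hypothetical inputs, and the transfer time is a bandwidth floor that ignores protocol overhead and contention.

```python
def kv_transfer_ms(kv_cache_gib: float, interconnect_gb_s: float) -> float:
    """Ideal one-way transfer time for the KV cache, in milliseconds."""
    return kv_cache_gib * 1024**3 / (interconnect_gb_s * 1_000_000_000) * 1000


def disaggregation_pays_off(
    kv_cache_gib: float,
    interconnect_gb_s: float,
    coordination_ms: float,
    queueing_saved_ms: float,
) -> bool:
    """True when the queueing delay removed exceeds the transfer plus coordination cost added."""
    added_ms = kv_transfer_ms(kv_cache_gib, interconnect_gb_s) + coordination_ms
    return queueing_saved_ms > added_ms


# Hypothetical scenarios for the 2.5 GiB request from the KV-cache example.
scenarios = [
    ("fast link, heavy interference", 200, 40.0),
    ("fast link, light interference", 200, 10.0),
    ("slow link, heavy interference", 25, 40.0),
]
for label, link_gb_s, saved_ms in scenarios:
    worthwhile = disaggregation_pays_off(2.5, link_gb_s, coordination_ms=5.0, queueing_saved_ms=saved_ms)
    print(f"{label}: transfer {kv_transfer_ms(2.5, link_gb_s):.1f} ms, worthwhile={worthwhile}")
```

The same request flips from worthwhile to not worthwhile when either the interconnect slows down or the interference being removed is small, which is why the decision has to be measured per workload.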

Batching strategies: the loading-dock analogy

Naive batching strategies lead to significant inefficiency due to the variable length of text. Picture a warehouse loading dock handling shipments that take different amounts of time to prepare.

Static batching: waiting for the whole pallet

In static batching, the dock waits for 4 parcels before releasing the pallet. If one parcel needs 10 extra minutes of labeling, the other 3 sit ready but blocked. In serving terms, we group requests into a batch and pad them to the length of the longest active sequence. The batch membership stays fixed for that run, so when one request finishes early its slot often turns into padding or sits idle until the longest request finishes. The timeline below illustrates how shorter requests waste compute cycles while the batch waits on the longest request.

Figure: Static versus continuous batching timeline showing static batches wasting finished slots as padding while continuous batching admits new requests after each decode step.
Static batching holds slots until the batch cycle ends. Continuous batching changes membership at token-step boundaries so finished requests leave and queued requests enter.

The problem with static batching

Static batching creates two significant inefficiencies. First, the GPU is forced to process "padding tokens" that don't contribute to the final output, wasting valuable compute cycles and memory bandwidth. Second, once shorter requests finish, their batch slots usually can't be reused until the scheduler rebuilds the batch around the longest surviving sequence. This degrades both latency and overall throughput, especially when request lengths vary widely.
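The size of that waste is easy to estimate. The sketch below counts slot-steps in a static batch that runs until its longest request finishes; the function name and the four output lengths are hypothetical, and the model ignores prompt padding and counts only decode steps.

```python
def static_batch_waste(output_lengths: list[int]) -> float:
    """Fraction of slot-steps spent idle or on padding when a fixed batch runs to its longest request."""
    longest = max(output_lengths)
    total_slot_steps = longest * len(output_lengths)  # every slot is held for the full batch run
    useful_slot_steps = sum(output_lengths)           # steps that actually produced a wanted token
    return 1 - useful_slot_steps / total_slot_steps


# Hypothetical batch of four requests with very different output lengths.
lengths = [20, 35, 50, 400]
waste = static_batch_waste(lengths)
print(f"slot-steps held: {max(lengths) * len(lengths)}, useful: {sum(lengths)}")
print(f"wasted fraction: {waste:.1%}")
```

One long request dominates the result: the wider the spread in output lengths, the more of the batch sits idle.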

Continuous batching: filling open dock slots

Continuous batching (introduced by Orca) operates at the iteration level.[6] The dock releases one finished parcel, can pull the next queued parcel into the open slot, and keeps useful work flowing when demand exists.

In serving terms, instead of waiting for a whole batch cycle to finish, the scheduler can eject completed requests and insert new ones after every token generation step. See our continuous batching deep-dive for scheduling algorithms and preemption strategies. The timeline above shows slot reuse; its actual benefit depends on queued work, KV capacity, and latency policy.

Benefits of continuous batching

Continuous batching provides three useful advantages under mixed, queued workloads:

  • Less slot waste: Finished requests can leave at iteration boundaries rather than remaining as padding or idle slots.
  • Lower completion delay for short requests: A completed request need not wait for the longest request in a fixed batch, though queueing and large active batches can still worsen latency.
  • Policy control: The scheduler can decide how to admit prefills alongside ongoing decode while respecting KV memory and latency SLOs.

To implement continuous batching, systems use a scheduling loop that manages active requests dynamically. The following sketch shows the shape of a continuous batcher. It takes a queue of incoming requests, processes prefill for new requests up to the maximum batch size, and then runs a single decoding step for all active requests.

benefits-of-continuous-batching.py
from collections import deque

import torch


# Minimal request stub for illustration only
class Request:
    def is_done(self) -> bool:
        return False

    def get_next_token(self) -> torch.Tensor:
        return torch.tensor([0])

    def update(self, logits: torch.Tensor) -> None:
        pass


class ContinuousBatcher:
    def __init__(self, model: torch.nn.Module, max_batch_size: int = 64):
        # Assumption: `model` exposes a batched `decode` method.
        # torch.nn.Module does not define one; it stands in for the serving runtime.
        self.model = model
        self.max_batch = max_batch_size
        self.active_requests: list[Request] = []
        self.queue: deque[Request] = deque()  # O(1) pops from the front

    def step(self) -> None:
        """Execute a single generation step for the current batch."""
        # 1. Remove completed requests
        self.active_requests = [
            req for req in self.active_requests if not req.is_done()
        ]

        # 2. Add new requests from queue (up to max batch size)
        while self.queue and len(self.active_requests) < self.max_batch:
            new_req = self.queue.popleft()
            # Run prefill for the new request (often done in parallel or on a separate stream)
            # pseudo-code: new_req.run_prefill(self.model)
            self.active_requests.append(new_req)

        # 3. Run one decode step for all active requests
        if self.active_requests:
            # Gather current input tokens from all requests
            input_tokens = torch.stack(
                [req.get_next_token() for req in self.active_requests]
            )

            # Forward pass (batched)
            logits = self.model.decode(input_tokens)

            # Update requests with new tokens
            for i, req in enumerate(self.active_requests):
                req.update(logits[i])

Memory management: PagedAttention (vLLM)

Traditional KV-cache allocation is like reserving a full pallet position for every request, even if it only needs a small bin. PagedAttention is like a shared bin system that assigns fixed-size slots on demand: Request A gets slots 7, 2, and 5 (non-contiguous, but tracked by a block table). When a request finishes, its slots become available again. Paging sharply reduces worst-case reservation waste, but a partially filled final block and bookkeeping still consume memory.

The problem

KV cache is allocated per-request, but request lengths vary. Pre-allocating the maximum possible sequence length for every request wastes a massive amount of memory. For example, if the system allocates 4096 tokens per request by default:

| Request | Tokens Needed | Tokens Allocated | Memory Wasted |
| --- | --- | --- | --- |
| Request A | 100 | 4096 | 97.6% |
| Request B | 3000 | 4096 | 26.8% |
PagedAttention solution

PagedAttention applies the operating system concept of virtual memory to KV cache management.[7] Instead of contiguous physical memory, we divide the KV cache into fixed-size "blocks" (pages). The following figure illustrates how logical blocks map to non-contiguous physical GPU memory via a block table.

Figure: PagedAttention block-table diagram mapping logical KV blocks to non-contiguous physical GPU memory pages while preserving sequence order.
PagedAttention separates logical token order from physical HBM placement. A block table lets the runtime use non-contiguous pages while attention still sees the right sequence.
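
To make the block-table idea concrete, here is a minimal sketch of an on-demand block allocator. The class `PagedKVAllocator` and its methods are hypothetical names for illustration only, not vLLM's API, and the sketch tracks block ids without storing any actual key or value tensors.

```python
import math


class PagedKVAllocator:
    """Toy block-table allocator (illustrative; not vLLM's actual API)."""

    def __init__(self, num_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free_blocks = list(range(num_blocks))    # physical block ids
        self.block_tables: dict[str, list[int]] = {}  # request -> physical ids
        self.lengths: dict[str, int] = {}             # request -> tokens stored

    def append_tokens(self, request_id: str, new_tokens: int) -> None:
        """Grow a request, taking physical blocks only when one fills up."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0) + new_tokens
        needed = math.ceil(length / self.block_tokens)
        if needed - len(table) > len(self.free_blocks):
            raise MemoryError("out of KV blocks; a real scheduler would preempt")
        while len(table) < needed:
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length

    def locate(self, request_id: str, token_index: int) -> tuple[int, int]:
        """Map a logical token position to (physical block, offset)."""
        block = self.block_tables[request_id][token_index // self.block_tokens]
        return block, token_index % self.block_tokens

    def free(self, request_id: str) -> None:
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id))
        self.lengths.pop(request_id)


allocator = PagedKVAllocator(num_blocks=8, block_tokens=16)
allocator.append_tokens("A", 40)  # 40 tokens need 3 blocks
allocator.append_tokens("B", 10)  # 10 tokens need 1 block
allocator.append_tokens("A", 10)  # 50 tokens now need a 4th block
print("block tables:", allocator.block_tables)
print("token 17 of A lives at (block, offset):", allocator.locate("A", 17))
allocator.free("A")
print("free blocks after A finishes:", len(allocator.free_blocks))
```

Request A ends up with physical blocks that are not adjacent, because request B claimed a block in between. The block table is what keeps A's logical token order intact.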

Impact of PagedAttention

By avoiding large contiguous reservations and allocating fixed-size blocks on demand, PagedAttention lets the runtime fit more useful KV state into the same HBM budget.[7] It does not eliminate all slack: each live request can still leave a partially filled last block, and the block table has overhead. In practice, it substantially reduces memory lost to worst-case preallocation.

paged-kv-slack.py
import math

requests = [100, 3_000]
max_context = 4_096
block_tokens = 16
reserved_tokens = len(requests) * max_context
paged_tokens = sum(math.ceil(tokens / block_tokens) * block_tokens for tokens in requests)

print(f"max-context reservation: {reserved_tokens} token slots")
print(f"paged allocation: {paged_tokens} token slots")
print(f"remaining final-block slack: {paged_tokens - sum(requests)} token slots")
Output
max-context reservation: 8192 token slots
paged allocation: 3120 token slots
remaining final-block slack: 20 token slots

Copy-on-write for shared blocks

PagedAttention's copy-on-write mechanism matters whenever multiple active continuations share the same prompt prefix. In the original vLLM setting, this is especially important for beam search and parallel sampling, where several continuations reuse the same prompt blocks before they diverge.[7] The figure below shows two continuations initially pointing to the same shared prefix blocks before branching.

Figure: Copy-on-write KV cache diagram showing two continuations sharing prefix blocks until one branch diverges and allocates a private block.
Copy-on-write shares immutable prefix KV blocks across continuations, then clones only the block that must diverge. That saves memory without corrupting another branch's view.

Initially, the shared prefix blocks have a reference count greater than one. Appending new tokens usually allocates fresh blocks for each continuation. If a continuation needs to write into a block that's still shared, the runtime first clones that block so the other continuations keep their original view. That's what copy-on-write means here: share immutable prefix state aggressively, then split only when sequences diverge.[7]
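
The reference-counting behavior described above can be sketched in a few lines. `SharedBlockPool` is a hypothetical helper, not a real runtime API, and it ignores block capacity so that the clone-before-write step stays easy to follow.

```python
class SharedBlockPool:
    """Toy ref-counted KV blocks (illustrative names, not a real runtime API)."""

    def __init__(self) -> None:
        self.ref_counts: dict[int, int] = {}
        self.contents: dict[int, list[int]] = {}  # block id -> token ids
        self.next_id = 0

    def allocate(self, tokens: list[int]) -> int:
        block_id = self.next_id
        self.next_id += 1
        self.ref_counts[block_id] = 1
        self.contents[block_id] = list(tokens)
        return block_id

    def fork(self, table: list[int]) -> list[int]:
        """A new continuation starts by sharing every block of its parent."""
        for block_id in table:
            self.ref_counts[block_id] += 1
        return list(table)

    def append_token(self, table: list[int], token: int) -> None:
        """Write into the last block, cloning it first if it is still shared."""
        last = table[-1]
        if self.ref_counts[last] > 1:
            self.ref_counts[last] -= 1
            last = self.allocate(self.contents[last])  # private copy
            table[-1] = last
        self.contents[last].append(token)


pool = SharedBlockPool()
parent = [pool.allocate([11, 12, 13, 14]), pool.allocate([15, 16])]
child = pool.fork(parent)

pool.append_token(child, 99)   # child diverges: its last block is cloned
pool.append_token(parent, 42)  # parent is now the sole owner, so it writes in place

print("parent table:", parent, "child table:", child)
print("ref counts:", pool.ref_counts)
```

Only the block that receives a write is copied. The first block keeps a reference count of two, so the prompt prefix is still stored once.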

Context parallelism and long-context serving

As context windows grow into the hundreds of thousands or millions of tokens, a single GPU often can't hold the full KV cache or attention working set for one request. Context Parallelism (CP) addresses this by splitting the input sequence itself across multiple GPUs.[8]

How context parallelism works

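Instead of splitting weight matrices inside each layer (tensor parallelism), assigning whole layers to different GPUs (pipeline parallelism), or splitting batches (data parallelism), CP splits the sequence dimension: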

  1. A 1M token sequence is divided into N chunks (e.g., 250K tokens per GPU on 4 GPUs)
  2. Each GPU runs the token-local work for its own chunk (projections and MLP layers) during the prefill phase
  3. Attention is computed using ring-style communication patterns to handle cross-chunk dependencies
  4. The KV cache is distributed across the GPU cluster

This approach becomes useful once a single request's context no longer fits comfortably on one accelerator.

Ring attention for context parallelism

Modern implementations use ring attention (Liu et al., 2024) or similar distributed attention algorithms that overlap communication with computation. GPUs form a logical ring: each device keeps the queries for its own chunk and passes key-value blocks to its neighbor, accumulating blockwise attention until every query has seen the full context. This can extend supported context length roughly linearly with the number of devices, at least until communication becomes the next bottleneck.[8]

context-parallel-kv-shards.py
def kv_gib(sequence_tokens: int) -> float:
    # 2 (K and V) * tokens * 80 layers * 8 KV heads * 128 head dim * 2 bytes (FP16)
    total_bytes = 2 * sequence_tokens * 80 * 8 * 128 * 2
    return total_bytes / 1024**3

total_kv = kv_gib(1_000_000)
devices = 4
print(f"one 1M-token request KV: {total_kv:.1f} GiB")
print(f"evenly sharded over {devices} devices: {total_kv / devices:.1f} GiB/device")
print("Communication and runtime buffers still add overhead.")
Output
one 1M-token request KV: 305.2 GiB
evenly sharded over 4 devices: 76.3 GiB/device
Communication and runtime buffers still add overhead.
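
The ring pattern itself can be simulated on one machine to check that sharded attention matches ordinary attention. The sketch below is a simplification under stated assumptions: single-head, non-causal attention with no 1/sqrt(d) scaling, plain Python lists instead of tensors, and "devices" that run one after another. A real implementation overlaps the key-value transfer with compute. The function names are illustrative.

```python
import math


def full_attention(q, k, v):
    """Reference: softmax over all keys, then a weighted sum of values."""
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def ring_attention(q_shards, k_shards, v_shards):
    """Each 'device' keeps its queries; K/V shards rotate around the ring."""
    n = len(q_shards)
    dim = len(v_shards[0][0])
    # Per-query running state for a numerically stable streaming softmax.
    state = [
        [{"max": -math.inf, "denom": 0.0, "acc": [0.0] * dim} for _ in shard]
        for shard in q_shards
    ]
    kv = list(zip(k_shards, v_shards))  # kv[d] is the block currently on device d
    for _ in range(n):
        for device, (k_block, v_block) in enumerate(kv):
            for qi, st in zip(q_shards[device], state[device]):
                for kj, vj in zip(k_block, v_block):
                    score = sum(a * b for a, b in zip(qi, kj))
                    new_max = max(st["max"], score)
                    scale = math.exp(st["max"] - new_max)  # rescale old sums
                    weight = math.exp(score - new_max)
                    st["denom"] = st["denom"] * scale + weight
                    st["acc"] = [a * scale + weight * x for a, x in zip(st["acc"], vj)]
                    st["max"] = new_max
        kv = kv[-1:] + kv[:-1]  # pass each K/V block to the next device
    return [
        [[a / st["denom"] for a in st["acc"]] for st in shard_state]
        for shard_state in state
    ]


def shard(rows, devices):
    chunk = len(rows) // devices
    return [rows[i * chunk:(i + 1) * chunk] for i in range(devices)]


tokens, devices = 8, 4
q = [[math.sin(i + d) for d in range(4)] for i in range(tokens)]
k = [[math.cos(i * d + 1) for d in range(4)] for i in range(tokens)]
v = [[float(i), float(i % 3)] for i in range(tokens)]

ring_out = [
    row
    for part in ring_attention(shard(q, devices), shard(k, devices), shard(v, devices))
    for row in part
]
ref_out = full_attention(q, k, v)
max_err = max(abs(a - b) for r, f in zip(ring_out, ref_out) for a, b in zip(r, f))
print(f"max abs difference vs full attention: {max_err:.2e}")
```

No device ever holds more than one key-value shard at a time, which is why per-device KV memory shrinks with the number of devices while the result stays exact up to floating-point error.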

Speculative decoding: the smart assistant

Speculative decoding works like a fulfillment shortcut. A small draft model proposes the next few support-message tokens. The large target model checks the proposed span in a verification pass. If enough draft tokens survive acceptance, this can replace several target decode passes; if not, draft and verification work can lose to ordinary decoding.

Think of it as a senior routing checker (the large target model) and a fast draft scanner (the small draft model). The scanner guesses the next 5 tokens. The checker looks at the span together. If the first 3 are accepted, one target verification pass can emit those accepted tokens plus a correction or continuation token. The accounting must still include the scanner's draft work.

To visualize this, consider how speculative decoding coordinates the interaction between the two models.[9] We use a fast, small draft model to propose a sequence of multiple tokens. The large target model then verifies these proposed tokens in parallel, accepting correct ones and correcting any mistakes.

Figure: Speculative decoding diagram showing a small draft model proposing multiple tokens, a large target model verifying them in one pass, and accepted tokens reducing target decode passes.
Speculative decoding is only faster when draft work is cheap and acceptance rate is high. The target model still preserves correctness through accept/reject correction.

The key insight of speculative decoding is the asymmetric cost of the two paths. Each draft token is cheap because the draft model is small, while the target model can score all k draft positions in a single pass. If acceptance is high, one target pass can replace several ordinary target decode passes. Total latency still includes draft generation, verification, sampling, rejected-token correction, and kernel overhead.

The function below shows one exact speculative step for a single sequence. It first samples k draft tokens from the small model, then runs the large target model once on [prompt + draft_tokens], and finally performs the accept/reject test from Leviathan et al. with residual resampling on the first mismatch.[9]

speculative-decoding-the-smart-assistant.py
import torch
import torch.nn.functional as F


def next_token_probs(model, input_ids: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    return F.softmax(logits, dim=-1)


def speculative_step(
    draft_model,
    target_model,
    input_ids: torch.Tensor,
    k: int = 4,
) -> list[int]:
    """
    Return one speculative chunk for a single sequence.
    Assumes input_ids has shape [1, seq_len].
    """
    assert input_ids.shape[0] == 1, "single-sequence example"

    draft_tokens: list[int] = []
    draft_dists: list[torch.Tensor] = []
    draft_input = input_ids

    for _ in range(k):
        q = next_token_probs(draft_model, draft_input)
        token = torch.multinomial(q, num_samples=1).item()
        draft_tokens.append(token)
        draft_dists.append(q)
        token_tensor = torch.tensor([[token]], device=input_ids.device)
        draft_input = torch.cat([draft_input, token_tensor], dim=1)

    # One target-model pass verifies all draft positions at once.
    with torch.no_grad():
        target_logits = target_model(draft_input).logits[0]

    target_dists = F.softmax(
        target_logits[input_ids.size(1) - 1 : input_ids.size(1) + k],
        dim=-1,
    )

    accepted: list[int] = []
    for i, token in enumerate(draft_tokens):
        p = target_dists[i]
        q = draft_dists[i]
        acceptance = min(1.0, (p[token] / q[token]).item())

        if torch.rand(()) < acceptance:
            accepted.append(token)
            continue

        residual = torch.clamp(p - q, min=0)
        if residual.sum() <= 0:
            replacement = torch.argmax(p).item()
        else:
            residual = residual / residual.sum()
            replacement = torch.multinomial(residual, num_samples=1).item()
        return accepted + [replacement]

    # If all k draft tokens are accepted, sample one extra token from p.
    extra = torch.multinomial(target_dists[k], num_samples=1).item()
    return accepted + [extra]

Follow-on work such as EAGLE uses the target model's hidden states to predict future tokens instead of relying on a separately trained small draft model.[10] The important takeaway isn't that one speculative variant always wins. It's that these methods trade extra compute for fewer expensive target-model decode passes, so the real payoff depends on acceptance rate, hardware, and implementation overhead.

speculative-pass-accounting.py
draft_proposals = 4
accepted_prefix = 3
target_verification_passes = 1
emitted_tokens = accepted_prefix + 1
# Ordinary decoding needs one target pass per emitted token.
ordinary_target_passes = emitted_tokens

print(f"ordinary target passes for {emitted_tokens} tokens: {ordinary_target_passes}")
print(f"one verification emits in this example: {emitted_tokens} tokens")
print(f"extra draft passes paid: {draft_proposals}")
print("Speedup requires cheap drafting and high acceptance.")
Output
ordinary target passes for 4 tokens: 4
one verification emits in this example: 4 tokens
extra draft passes paid: 4
Speedup requires cheap drafting and high acceptance.
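
The dependence on acceptance rate can be estimated in closed form. The sketch below follows the analysis in Leviathan et al.[9] under a simplifying assumption: every draft token is accepted independently with the same probability `alpha`. The parameter `draft_cost` (one draft pass relative to one target pass) is an illustrative input, and the estimate ignores sampling and kernel overhead.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha; the +1 is the correction or bonus token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Rough wall-clock speedup; draft_cost = one draft pass / one target pass."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1)


for alpha in (0.5, 0.8, 0.9):
    print(
        f"alpha={alpha}: {expected_tokens_per_pass(alpha, 4):.2f} tokens/pass, "
        f"speedup ~{estimated_speedup(alpha, 4, draft_cost=0.05):.2f}x"
    )

# A weak, relatively expensive draft model can make things slower.
print(f"alpha=0.2, draft_cost=0.3: ~{estimated_speedup(0.2, 4, 0.3):.2f}x")
```

The last line prints a value below 1.0, which is the regime where draft and verification work loses to ordinary decoding.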

Hardware-aware optimization: quantization and precision

Inference performance isn't just about algorithms. Modern GPUs and specialized accelerators provide hardware-level features, such as low-precision formats, that change how many bytes each decode step has to move.

Low-precision inference and quantization

Many serving stacks still use BF16/FP16 as a baseline, but lower-precision modes are increasingly common because they cut model-weight bandwidth and, in some systems, shrink the KV footprint enough to raise concurrency.[11][12][13]

  • FP8 (8-bit floating point): Useful when your hardware and kernels support it, because it lowers bandwidth pressure while preserving more dynamic range than integer-only formats.[11]
  • INT8/INT4-style weight-only quantization: Compresses model weights to 4-8 bits while keeping activations in higher precision (BF16/FP16), so fewer bytes are reread on every decode step without quantizing the rest of the runtime.[12]
  • KV-cache compression or quantization: Reduces KV bytes, or compresses less useful KV state, which matters once the KV cache rather than the weights caps batch size and residency during long-context serving.[13]

Figure: Quantization bandwidth chart showing FP16, FP8, INT8, and INT4 reducing bytes per weight and increasing effective decode bandwidth for memory-bound serving.
For memory-bound decode, lower precision mainly helps by reducing bytes moved per token. Actual speedup depends on kernel support, dequantization overhead, and quality constraints.
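
As a rough illustration of what weight-only quantization stores, the sketch below rounds a small group of weights to signed integers with one shared scale, then dequantizes them. This is plain absmax rounding chosen for brevity, not GPTQ or any production scheme, and the weight values are made up.

```python
def quantize_absmax(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Symmetric per-group quantization: integer codes plus one float scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate weights, as a kernel would before the matmul."""
    return [c * scale for c in codes]


group = [0.12, -0.40, 0.03, 0.27, -0.08, 0.35, -0.19, 0.01]  # made-up weights
for bits in (8, 4):
    codes, scale = quantize_absmax(group, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print(f"INT{bits}: codes={codes} max error={worst:.4f}")
```

The INT4 codes take a quarter of the bytes of FP16 weights, at the price of a larger rounding error and a dequantization step on every read.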

Common mistake: Beginners often think 4-bit quantization is just about saving disk space. The real win is reducing bytes moved during decode. Whether that turns into a large latency win still depends on kernels, dequant overhead, and quality constraints.

weight-bytes-by-precision.py
parameters = 7_000_000_000
bytes_per_weight = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, width in bytes_per_weight.items():
    traffic_gb = parameters * width / 1_000_000_000
    print(f"{precision}: {traffic_gb:.1f} GB of weights per full read")

print("INT4 weight traffic is 0.25x FP16 before kernel overhead.")
Output
FP16: 14.0 GB of weights per full read
INT8: 7.0 GB of weights per full read
INT4: 3.5 GB of weights per full read
INT4 weight traffic is 0.25x FP16 before kernel overhead.

Inference-first silicon beyond GPUs

GPUs still dominate general-purpose LLM serving, but some deployments use inference-first accelerators with larger on-chip memory, dataflow execution, or more deterministic scheduling. The trade-off is usually a narrower software stack: you may get excellent latency or cost for a specific serving pattern, but at the price of custom compilation, fewer kernels, and less ecosystem flexibility.

Common pitfalls

The following symptoms show up in production logs and profiling traces. If you see them, here's what they mean and how to fix them.

"Training optimizations don't speed up my inference"

Symptom: You upgrade math kernels or buy more peak FLOPs, but decode barely speeds up.
Cause: Large-batch training is usually compute-bound, while decode is often memory-bound. If the GPU is already waiting on weight and KV reads, more math capacity does little.
Fix: Profile bandwidth first. If HBM bandwidth is near its ceiling, prioritize batching, quantization, GQA, or KV-cache management before chasing more tensor-core throughput.
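
A back-of-envelope ceiling shows why extra FLOPs do not help here. The sketch assumes every decode step must read all weights plus the active KV cache from HBM. The 2 TB/s bandwidth and 2 GB of KV reads per step are hypothetical inputs, not measurements from any specific accelerator.

```python
def decode_steps_per_second_ceiling(
    weight_bytes: float, kv_bytes_per_step: float, hbm_bytes_per_second: float
) -> float:
    """Upper bound on decode steps per second when HBM reads are the limit."""
    return hbm_bytes_per_second / (weight_bytes + kv_bytes_per_step)


HBM_BANDWIDTH = 2.0e12  # hypothetical 2 TB/s accelerator
KV_PER_STEP = 2.0e9     # hypothetical KV bytes read per decode step

fp16_ceiling = decode_steps_per_second_ceiling(7e9 * 2.0, KV_PER_STEP, HBM_BANDWIDTH)
int4_ceiling = decode_steps_per_second_ceiling(7e9 * 0.5, KV_PER_STEP, HBM_BANDWIDTH)

print(f"FP16 weights: at most {fp16_ceiling:.0f} decode steps/s")
print(f"INT4 weights: at most {int4_ceiling:.0f} decode steps/s")
```

Peak FLOPs does not appear anywhere in this bound. Only moving fewer bytes per step, or emitting more tokens per step through batching, raises it.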

"Short requests are as slow as long ones"

Symptom: Small chats wait almost as long as large ones even when the queue looks short.
Cause: Static batching pads to the longest request and cannot reuse finished slots until the batch cycle ends.
Fix: Switch to continuous batching so completed requests leave immediately and queued work fills open slots on the next decode step.
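
A toy simulation makes the difference visible. It counts decode steps only and ignores prefill cost, padding, and memory limits. The output lengths and the two-slot batch are hypothetical.

```python
from collections import deque


def static_finish_steps(lengths: list[int], slots: int) -> list[int]:
    """Static batching: each batch runs until its longest request finishes."""
    step, finish = 0, []
    for start in range(0, len(lengths), slots):
        batch = lengths[start:start + slots]
        step += max(batch)
        finish.extend([step] * len(batch))
    return finish


def continuous_finish_steps(lengths: list[int], slots: int) -> list[int]:
    """Continuous batching: a freed slot is refilled on the next decode step."""
    queue = deque(enumerate(lengths))
    active: dict[int, int] = {}  # request index -> tokens still to generate
    finish = [0] * len(lengths)
    step = 0
    while queue or active:
        while queue and len(active) < slots:
            idx, remaining = queue.popleft()
            active[idx] = remaining
        step += 1
        for idx in list(active):
            active[idx] -= 1
            if active[idx] <= 0:
                finish[idx] = step
                del active[idx]
    return finish


lengths = [5, 40, 8, 6, 7, 30]  # hypothetical output tokens per request
static = static_finish_steps(lengths, slots=2)
continuous = continuous_finish_steps(lengths, slots=2)
print("static finish steps:    ", static)
print("continuous finish steps:", continuous)
```

Under static batching the 5-token request is held until step 40. Under continuous batching it leaves at step 5 and its slot is reused immediately.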

"Speculative decoding made latency worse"

Symptom: Draft-model serving added overhead, but target-model passes did not fall enough to repay it.
Cause: Acceptance rate is too low or implementation overhead is too high. Wrong draft tokens force extra verification and correction work.
Fix: Measure accepted tokens per draft span before rollout. If most proposals are rejected, keep standard decoding or use a stronger draft path.

"HBM usage grows almost linearly even when prefixes are shared"

Symptom: HBM usage spikes even though many sessions start from the same long system prompt or retrieved prefix.
Cause: The runtime is not reusing cached prefix KV state across independent requests, or its isolation policy does not allow that reuse.
Fix: Enable an explicit prefix-caching feature with an appropriate tenant and privacy policy. Copy-on-write can preserve shared blocks after reuse is established; it is not by itself a cross-request cache.
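
One common way to implement cross-request reuse is to key each full KV block by a hash of its tokens chained with the hash of everything before it. The sketch below shows that idea with a tenant-scoped lookup. `PrefixCache`, `block_keys`, and the four-token block size are assumptions for illustration, not any particular runtime's API.

```python
import hashlib

BLOCK_TOKENS = 4  # tiny block size so the example is easy to read


def block_keys(token_ids: list[int]) -> list[str]:
    """One key per full block; each key also commits to every earlier block."""
    keys, parent = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_TOKENS
    for start in range(0, full, BLOCK_TOKENS):
        block = token_ids[start:start + BLOCK_TOKENS]
        parent = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        keys.append(parent)
    return keys


class PrefixCache:
    """Maps prefix keys to physical block ids (hypothetical helper)."""

    def __init__(self) -> None:
        self.blocks: dict[str, int] = {}
        self.next_block = 0

    def admit(self, token_ids: list[int], tenant: str) -> tuple[int, int]:
        """Return (reused_blocks, newly_allocated_blocks) for one request."""
        reused = allocated = 0
        for key in block_keys(token_ids):
            scoped = f"{tenant}:{key}"  # isolation policy: no cross-tenant reuse
            if scoped in self.blocks:
                reused += 1
            else:
                self.blocks[scoped] = self.next_block
                self.next_block += 1
                allocated += 1
        return reused, allocated


cache = PrefixCache()
system_prompt = list(range(100, 112))  # 12 tokens = 3 full blocks

first = cache.admit(system_prompt + [1, 2, 3, 4, 5], tenant="acme")
second = cache.admit(system_prompt + [9, 8, 7, 6], tenant="acme")
third = cache.admit(system_prompt + [9, 8, 7, 6], tenant="other")
print("first request  (reused, new):", first)
print("second request (reused, new):", second)
print("other tenant   (reused, new):", third)
```

The second request reuses the three system-prompt blocks, while the other tenant gets no hits because the keys are scoped. That scoping is the policy decision the fix refers to.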

"Throughput keeps rising but users complain about slowness"

Symptom: Aggregate TPS looks better, but TTFT or TPOT percentiles get worse and chat feels sluggish.
Cause: Larger batches improve throughput while making active requests compete harder for bandwidth.
Fix: Use SLO-aware scheduling. Cap batch size or split latency-sensitive traffic from background work, and watch throughput together with p95 or p99 latency.

Try it yourself: the VRAM calculator

Here's a practical check you can do with pen and paper or a short Python script.

Problem: You want to serve a 7B-class model on a single GPU with 80 GiB of HBM. The model has 28 layers, uses Grouped Query Attention with 4 KV heads, head dimension 128, and you plan to use FP16 (2 bytes per element). Your average customer conversation has a 4,096-token context. What's the maximum batch size you can support before the KV cache fills the memory left over after reserving 20 GiB for weights and runtime overhead?

Hint: Start by computing the KV cache for one request, then see how many fit in the remaining 60 GiB.

Worked solution

For one request at 4,096 tokens:

KV bytes = 2 * batch * seq_len * layers * kv_heads * head_dim * dtype_bytes
         = 2 * 1 * 4096 * 28 * 4 * 128 * 2
         = 234,881,024 bytes
         ≈ 0.22 GiB per request

Available memory for KV cache: 80 GiB total - 20 GiB reserved = 60 GiB

Maximum batch size = 60 GiB / 0.21875 GiB per request ≈ 274 requests. Rounding the per-request figure to 0.22 GiB first gives about 273, so treat the result as a ballpark rather than a precise count.

In practice, you'd run at a lower batch size to leave headroom for activation buffers, temporary tensors, allocator slack, and bursty long-context requests. A production engineer might cap this far below the arithmetic ceiling and monitor actual HBM usage.

vram-capacity-headroom.py
import math

kv_bytes = 2 * 1 * 4_096 * 28 * 4 * 128 * 2
kv_per_request_gib = kv_bytes / 1024**3
kv_budget_gib = 80 - 20
arithmetic_ceiling = math.floor(kv_budget_gib / kv_per_request_gib)

print(f"KV per request: {kv_per_request_gib:.5f} GiB")
print(f"KV budget after reserved memory: {kv_budget_gib} GiB")
print(f"arithmetic-only batch ceiling: {arithmetic_ceiling}")
Output
KV per request: 0.21875 GiB
KV budget after reserved memory: 60 GiB
arithmetic-only batch ceiling: 274

Mastery check

Evaluation rubric

  • Foundational: Why decode-heavy LLM serving is often memory-bandwidth bound even when peak FLOPs look huge.
  • Intermediate: How prefill and decode differ, and why TTFT and TPOT need separate dashboards.
  • Advanced: How KV-cache size scales with batch size, context length, layers, KV heads, head dimension, and precision.
  • Advanced: Why static batching wastes slots, while continuous batching uses iteration-level admission to keep decode work flowing.
  • Advanced: How PagedAttention uses virtual-memory-style blocks to pack KV state, and why cross-request prefix reuse requires explicit caching policy in addition to copy-on-write.
  • Advanced: When speculative decoding, disaggregated inference, context parallelism, and quantization help, and when their overheads can erase the win.
  • Advanced: How throughput, latency, and cost per token trade off, and how to pick an operating point from a product SLO instead of chasing one metric.

Follow-up questions

How does KV-cache size scale with sequence length and batch size?

KV cache stores key and value tensors from previous tokens across all layers, so memory grows linearly with sequence length, batch size, number of layers, number of KV heads, head dimension, and bytes per element. A useful formula is 2 * batch * seq_len * layers * kv_heads * head_dim * bytes. GQA reduces the kv_heads term, and KV-cache quantization reduces bytes.
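
The formula translates directly into a small helper. The model shape below (32 layers, head dimension 128, batch of 32 at 4,096 tokens) is a hypothetical configuration chosen so the numbers come out round.

```python
def kv_cache_gib(
    batch: int, seq_len: int, layers: int, kv_heads: int, head_dim: int, dtype_bytes: int
) -> float:
    """2 (K and V) * batch * seq_len * layers * kv_heads * head_dim * bytes, in GiB."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * dtype_bytes / 1024**3


base = dict(batch=32, seq_len=4096, layers=32, head_dim=128)  # hypothetical model

mha = kv_cache_gib(**base, kv_heads=32, dtype_bytes=2)      # full multi-head KV
gqa = kv_cache_gib(**base, kv_heads=8, dtype_bytes=2)       # GQA with 8 KV heads
gqa_8bit = kv_cache_gib(**base, kv_heads=8, dtype_bytes=1)  # GQA plus 8-bit KV

print(f"MHA, FP16 KV:  {mha:.0f} GiB")
print(f"GQA, FP16 KV:  {gqa:.0f} GiB")
print(f"GQA, 8-bit KV: {gqa_8bit:.0f} GiB")
```

Each term is a linear multiplier, so cutting KV heads by 4x and bytes per element by 2x shrinks the cache 8x for the same batch and context.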

When does speculative decoding hurt performance?

Speculative decoding hurts when the draft path is inaccurate or expensive enough that extra draft work and rejection handling cost more than the target-model passes saved. Low acceptance, poor kernel fusion, or tiny latency-sensitive batches can erase the win.

How does tensor parallelism interact with batching?

Tensor parallelism splits layer math across multiple GPUs, so every decode step includes collective communication such as all-reduce or all-gather. Larger batches can amortize that communication overhead, but small interactive batches may become communication-bound before they become compute-bound. Batching and tensor parallelism need to be tuned together.

What are the tradeoffs between throughput and latency?

Serving systems balance aggregate throughput against per-request latency. Continuous batching can increase total tokens per second, but larger batches also make active requests compete for the same memory bandwidth. Production schedulers need latency SLOs, not only raw tokens/sec.

How do you estimate cost per token for a self-hosted model?

Take the GPU hourly rate, divide by sustained tokens per second times 3,600, then multiply by one million for cost per million tokens. The dominant variables are real-world sustained throughput (driven by batching, quantization, and model size) and utilization. A GPU sitting idle multiplies cost per token directly, so a 5x cheaper sticker price can still lose if it delivers less than one-fifth the throughput.
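
The arithmetic above fits in one function. The hourly prices and throughput figures below are hypothetical, chosen only to show the utilization and sticker-price effects.

```python
def cost_per_million_tokens(
    gpu_hourly_usd: float, sustained_tokens_per_second: float, utilization: float = 1.0
) -> float:
    """Hourly rate divided by tokens actually served per hour, scaled to 1M tokens."""
    tokens_per_hour = sustained_tokens_per_second * 3_600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


busy = cost_per_million_tokens(4.00, 2_500)                        # fully utilized
half_idle = cost_per_million_tokens(4.00, 2_500, utilization=0.5)  # idle half the time
cheap_gpu = cost_per_million_tokens(0.80, 400)                     # 5x cheaper, 6.25x slower

print(f"fast GPU, fully used: ${busy:.3f} per 1M tokens")
print(f"fast GPU, 50% idle:   ${half_idle:.3f} per 1M tokens")
print(f"cheap GPU:            ${cheap_gpu:.3f} per 1M tokens")
```

The cheaper GPU costs more per token here because its throughput dropped by more than its price did.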

A long RAG prefill is stalling active chat streams. Do you try chunked prefill or full disaggregation first?

Usually try chunked prefill first if the main problem is one large prompt monopolizing a shared worker for too long. Move to prefill-decode disaggregation when that interference remains large enough that separate pools and KV handoff beat their transfer and coordination overhead.
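
Chunked prefill can be pictured as a per-iteration token budget: active decode streams get their one token each first, and the long prompt receives whatever budget is left. The function name, the 2,048-token budget, and the request counts below are assumptions for illustration, not settings from a specific scheduler.

```python
def plan_iteration(
    decode_requests: int, prefill_remaining: int, token_budget: int
) -> tuple[int, int]:
    """Return (decode_tokens, prefill_chunk) for one scheduler iteration."""
    decode_tokens = min(decode_requests, token_budget)  # one token per active stream
    prefill_chunk = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, prefill_chunk


remaining = 10_000  # hypothetical long RAG prompt, in tokens
iterations = 0
while remaining > 0:
    _, chunk = plan_iteration(
        decode_requests=48, prefill_remaining=remaining, token_budget=2_048
    )
    remaining -= chunk
    iterations += 1

print(f"prefill finished after {iterations} iterations; decode ran in every one")
```

The prompt takes several iterations instead of one, but the 48 chat streams keep emitting a token on each of them rather than stalling behind a single large prefill.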

Course handoff

You now understand why serving capacity depends on phase behavior, memory bandwidth, KV-cache residency, scheduler policy, and precision choices. These tools let you read a slow serving trace and separate prefill queueing, decode bandwidth pressure, fragmented KV memory, and speculative-decoding overhead.

Next Step
Continue to Model Parallelism for LLM Inference

Batching and KV-cache planning show where serving memory goes; model parallelism teaches what changes when one production model must be split across several GPUs.

References

[1] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.

[2] Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023.

[3] Agrawal, A., et al. (2023). Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv preprint.

[4] Patel, P., et al. (2023). Splitwise: Efficient Generative LLM Inference Using Phase Splitting.

[5] Zhong, Y., et al. (2024). DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. OSDI 2024.

[6] Yu, G.-I., et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022.

[7] Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.

[8] Liu, H., et al. (2024). Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv preprint.

[9] Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023.

[10] Li, Y., et al. (2024). EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. ICML 2024.

[11] Micikevicius, P., et al. (2022). FP8 Formats for Deep Learning.

[12] Frantar, E., et al. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers. ICLR 2023.

[13] Li, Y., et al. (2024). SnapKV: Compressing KV Cache by Selecting Global Attention Patterns.