LeetLLM
LearnFeaturesBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Features
  • Blog

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 155 articles completed

🛠️Computing Foundations0/6
NumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models & Image GenerationReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnInference & Production ScaleContinuous Batching & Scheduling
🚀HardInference Optimization

Continuous Batching & Scheduling

Understand how LLM schedulers use continuous batching, chunked prefill, and prefill-decode disaggregation to improve throughput without violating TTFT or inter-token latency targets.

34 min read
Learning path
Step 128 of 155 in the full curriculum
FlashAttention & Memory EfficiencyScaling LLM Inference

Continuous Batching & Scheduling

The previous chapter showed how FlashAttention reduces memory traffic inside one attention kernel. This chapter moves up one layer: how does an inference server decide which requests share each forward pass?

Continuous batching is a serving scheduler problem: keep the GPU busy while requests arrive with different prompt lengths, output lengths, and deadlines. This chapter shows how production LLM servers batch work without waiting for every request to finish at the same time.

Imagine an order-support chatbot on a busy shopping holiday. Some customers ask for a quick refund status (two sentences). Others paste their entire order history and ask for a detailed analysis (ten paragraphs). Hundreds of these requests hit the server at once.

A simple request-level baseline handles this with static batching. It fills a batch of requests, runs them together, and waits for the longest one to finish before starting the next batch. If one customer asks for a ten-paragraph analysis, completed short answers leave idle batch slots until the long generation ends.

Continuous batching[1][2] fixes this waste by letting the server reuse a slot at the next token-generation step (the next iteration) after a request finishes. Under sustained mixed-length workloads, that can increase useful GPU work per step without changing the model itself. It pairs naturally with flexible Key-Value (KV) cache management, because admission still depends on available cache memory.

Static batching leaves finished request slots idle until the longest request ends, while continuous batching reuses slots at decode boundaries. Static batching leaves finished request slots idle until the longest request ends, while continuous batching reuses slots at decode boundaries.
Continuous batching improves serving throughput by reusing finished request slots between decode steps.

Why Batching Is Tricky for LLMs

To understand why static batching is so wasteful, recall two facts about how LLMs generate text.

First, LLMs are autoregressive: they produce text one token at a time. Processing the input prompt (the prefill phase) can be done in parallel, but generating each new token (the decode phase) happens sequentially. A short refund question might need only ten decode steps. A long order analysis might need two hundred.

Second, the output token count usually isn't known ahead of time. The engine stops when it hits a stop token or a length limit. That means the scheduler can't know in advance when a request will finish.

These two facts make LLM serving fundamentally different from image-classification serving, where every request has the same fixed shape and finishes in a single forward pass.

The Problem with Static Batching

In fixed-shape model serving, batching is simpler because requests may complete in one forward pass. In LLM serving, requests have varying prompt and output lengths. Because output token count usually isn't known until generation stops (either at a stop token or a length limit), the engine can't know in advance when a request will finish.

With static batching, all requests in a batch must wait for the longest request to complete before the next batch can start. This synchronous execution model forces the engine to maintain the batch shape until each sequence is fully generated. The result is large idle capacity and, in pipelined deployments, pipeline bubbles:

RequestTokensStatus
Request 1100 tokensDone at step 100, waits 300 more steps
Request 2400 tokensFinishes at step 400
Request 350 tokensDone at step 50, waits 350 more steps

Most slots become idle as soon as their request finishes, so utilization can collapse in heterogeneous workloads. Requests 1 and 3 are done early, but their GPU memory stays reserved until Request 2 completes. No new requests can join because the batch shape is fixed. Throughput falls even though the hardware still has work it could be doing.


Continuous Batching (Iteration-Level Scheduling)

Continuous batching (also known as iteration-level scheduling) moves scheduling to the decode-step boundary. After each generation step (or after each prefill chunk in engines that chunk prefills), the server checks which requests finished, frees their slots, and admits new work if memory and token-budget constraints allow.

Static batching is like a delivery truck that waits until every warehouse order is packed before leaving the dock. Even if the first parcel was ready ten minutes ago, that slot stays blocked until the slowest parcel is ready. Continuous batching is like a cross-dock lane with fixed loading slots: as soon as one parcel leaves the belt, the next queued parcel takes its place. The lane stays full, maximizing useful work per step.

Completed requests leave immediately, and queued requests can join on the next iteration. The scheduler effectively rebuilds the active running set for the next forward pass.

StepBatch CompositionEvent
0[Req1, Req2, Req3]Initial batch
50[Req1, Req2, Req4 (new)]Req3 done, Req4 joins
100[Req5 (new), Req2, Req4]Req1 done, Req5 joins
400[Req5, Req6 (new), Req4]Req2 done, Req6 joins

A Worked Example: Three Support Tickets

Make the waste concrete with numbers. Suppose the GPU can run three requests at once, and three support tickets arrive:

RequestTaskOutput needed
Ticket ARefund status10 tokens
Ticket BShipping delay explanation50 tokens
Ticket CFull order-history analysis200 tokens

Static batching runs all three together. Ticket A finishes after 10 iterations, Ticket B after 50, and Ticket C after 200. For iterations 11 through 200, the slots for A and B sit idle. Total GPU iterations for this batch: 200. Wasted slot-iterations: 190 for A's slot plus 150 for B's slot = 340 idle slots.

Continuous batching starts with [A, B, C]. At iteration 10, Ticket A finishes and a new ticket (Ticket D) joins immediately. At iteration 50, Ticket B finishes and Ticket E joins. The GPU keeps all three slots busy until the queue empties. The only idle iterations happen when the waiting queue is empty.

In this tiny example, continuous batching fills slots that static batching would leave idle, so the server can answer substantially more customer questions per GPU hour without a hardware upgrade.

Under sustained load, slot reuse avoids idle work from completed short requests. Actual utilization still depends on token budgets, memory pressure, phase mix, and kernel efficiency.

This first calculation isolates slot reuse from GPU details. It treats each active output token as one useful slot-iteration:

static-versus-continuous-slot-use.py
1output_lengths = [10, 50, 200] 2batch_slots = len(output_lengths) 3static_capacity = max(output_lengths) * batch_slots 4useful_work = sum(output_lengths) 5idle_work = static_capacity - useful_work 6 7print(f"useful slot-iterations: {useful_work}") 8print(f"static idle slot-iterations: {idle_work}") 9print(f"static utilization: {useful_work / static_capacity:.1%}")
Output
1useful slot-iterations: 260 2static idle slot-iterations: 340 3static utilization: 43.3%
Continuous batching timeline showing requests entering and leaving active decode slots at iteration boundaries. Continuous batching timeline showing requests entering and leaving active decode slots at iteration boundaries.
At each iteration boundary, finished requests leave and queued requests can enter under the scheduler's memory and token budgets.

One subtle but important detail: modern schedulers are usually limited by a token budget or KV-cache budget, rather than a raw "number of requests" cap. One long prefill can consume more scheduler budget than many short decode-only requests.

Continuous batching pairs naturally with a flexible memory manager. Because requests come and go, their Key-Value (KV) caches cannot be stored efficiently in contiguous tensors. This is why PagedAttention (block-based memory management) is such a practical companion to continuous batching.


Memory Management: The Hidden Enabler

To understand why continuous batching was difficult to implement initially, we need to look at memory management. Recall that during the decode phase, the model stores a Key-Value (KV) cache: a running record of attention states for every token generated so far. Because requests leave and join the batch at unpredictable times, their KV caches can't be stored efficiently in contiguous tensors.

The Fragmentation Problem

In standard Transformer implementations, the KV cache for a request is often treated like a growing contiguous buffer sized for a maximum sequence length.

When requests have different lengths and arrive at different times:

  1. Internal Fragmentation: You have to pre-allocate memory for max_seq_len (e.g., 4096 tokens) even if the request generates 50 tokens.
  2. External Fragmentation: As requests finish and leave, they create variable-sized "holes" in GPU memory that new requests might not fit into cleanly.

The KV cache for a single sequence grows with every decode step. For a 70B-class model with GQA (80 layers, 8 KV heads, head dimension 128, FP16), each new token adds about 320 KiB of KV state. A 4,096-token sequence therefore needs roughly 1.25 GiB of cache. If the same model stored a full KV head set instead of GQA, the cache would balloon to roughly 10 GiB at 4K tokens. That's why fragmentation hurts so much: even modest over-allocation can strand gigabytes of VRAM that could have admitted other requests.

A useful size estimate follows this formula:

MKV=2×L×s×Hkv×d×bM_{KV} = 2 \times L \times s \times H_{kv} \times d \times bMKV​=2×L×s×Hkv​×d×b

Where LLL is the number of layers, sss is the sequence length, HkvH_{kv}Hkv​ is the number of KV heads, ddd is the head dimension, and bbb is bytes per element (2 for FP16). The leading 2 accounts for both key and value tensors.

Use the formula directly before choosing a concurrency target:

kv-cache-budget.py
1layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2 2tokens = 4096 3bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value 4cache_bytes = bytes_per_token * tokens 5 6print(f"KV state per token: {bytes_per_token / 1024:.0f} KiB") 7print(f"KV state at {tokens:,} tokens: {cache_bytes / 1024**3:.2f} GiB")
Output
1KV state per token: 320 KiB 2KV state at 4,096 tokens: 1.25 GiB

PagedAttention Solution

PagedAttention[2] solves this by borrowing the concept of virtual memory from operating systems, as illustrated below.

PagedAttention maps logical KV-cache blocks to non-contiguous physical GPU blocks through a small block table. PagedAttention maps logical KV-cache blocks to non-contiguous physical GPU blocks through a small block table.
PagedAttention lets the scheduler allocate fixed-size KV blocks wherever space is available. Logical sequence order stays intact through the block table.
  1. KV Blocks: The KV cache is divided into fixed-size blocks (e.g., 16 or 32 tokens).
  2. Non-Contiguous Allocation: Blocks can be stored anywhere in GPU memory (VRAM). They don't need to be contiguous.
  3. Block Table: Each request keeps a mapping table from logical page to physical block that tells the runtime where its KV blocks live.

This allows the scheduler to allocate memory dynamically as tokens are generated. The vLLM paper reports less than 4% waste in its evaluated allocation design, in contrast with substantially higher reservation waste in its baseline systems.[2] Fixed blocks make dynamic admission practical without requiring large contiguous allocations.

This toy allocation computes the last-block slack for a few active sequences:

paged-kv-slack.py
1block_size = 16 2sequence_lengths = [17, 31, 48, 65] 3reserved = sum(((length + block_size - 1) // block_size) * block_size for length in sequence_lengths) 4used = sum(sequence_lengths) 5slack = reserved - used 6 7print(f"active tokens: {used}") 8print(f"reserved block slots: {reserved}") 9print(f"last-block slack: {slack} tokens ({slack / reserved:.1%})")
Output
1active tokens: 161 2reserved block slots: 192 3last-block slack: 31 tokens (16.1%)

PagedAttention is a memory management technique for non-contiguous KV blocks. FlashAttention is a compute optimization that tiles attention to reduce High Bandwidth Memory (HBM) I/O. They are orthogonal techniques, but systems such as vLLM often combine them to improve both memory capacity and compute speed.


Prefill vs. Decode Phases

LLM serving has two distinct computational phases with opposing characteristics:

Prefill (Prompt Processing)

  • Processes all input tokens in parallel (like training).
  • Often compute-heavy because it processes many prompt tokens in matrix multiplications.
  • Runs once per request at the start.
  • Latency goal: Minimize Time To First Token (TTFT).

Decode (Token Generation)

  • Generates tokens one at a time (autoregressive).
  • Often memory-bandwidth limited at practical batch sizes because each step processes few new tokens while reading model and KV state.
  • Runs many times per request (once per output token).
  • Latency goal: Minimize inter-token latency. Many dashboards report this as TBT or TPOT, while papers such as DistServe define TPOT as the average time per output token after the first.[3]

The following table summarizes the key differences between the prefill and decode phases:

PhaseArithmetic IntensityBottleneckBatching Strategy
PrefillOften highFrequently compute throughputBatch prompt tokens
DecodeOften lowerFrequently memory bandwidthBatch active streams

The Interference Problem

Blindly mixing a full long prefill with decode work can harm inter-token latency. A large prefill request (for example, processing a 4K-token document) can delay decode steps for concurrent streams. Later in the article we'll look at chunked prefill, which bounds how much prefill work competes in one scheduling step.

Token-budget scheduling makes that bound visible:

decode-first-token-budget.py
1max_num_batched_tokens = 1024 2decode_requests = 96 3pending_prefill_tokens = 1800 4prefill_budget = max_num_batched_tokens - decode_requests 5scheduled_prefill = min(pending_prefill_tokens, prefill_budget) 6 7print(f"decode tokens scheduled first: {decode_requests}") 8print(f"prefill tokens scheduled now: {scheduled_prefill}") 9print(f"prefill tokens left for later: {pending_prefill_tokens - scheduled_prefill}")
Output
1decode tokens scheduled first: 96 2prefill tokens scheduled now: 928 3prefill tokens left for later: 872

Solutions include:

  1. Chunked prefill: Break long prefills into chunks, interleave with decode steps.
  2. Disaggregated prefill: Separate prefill and decode onto different GPU pools.[3]
  3. Priority scheduling: Prioritize decode steps to minimize latency.

Scheduling Policies

The scheduler's job is to decide which requests to run in the next iteration.

The scheduler is like a warehouse dispatch desk. First-Come-First-Served (FCFS) loads orders in arrival order, which is fair, but a one-item replacement waits behind a huge wholesale pallet. Shortest-Job-First (SJF) prioritizes quick orders first, which improves average wait time. Preemption is like pausing a low-priority batch to ship an urgent replacement, then resuming later. Each scheduling policy is a tradeoff between fairness, throughput, and latency.

First-Come-First-Served (FCFS)

It's the most straightforward approach and simple to reason about. Requests are admitted in arrival order, which is usually fair, but it isn't optimal for overall system throughput. A long-running request at the front of the queue can block many shorter requests behind it, increasing the average latency across the system.

Shortest-Job-First (SJF)

The scheduler prioritizes requests that are expected to require fewer output tokens. In theory, this can improve average latency because short responses (like a simple "Yes" or "No") don't get stuck behind a long refund-policy explanation. In practice, output length is unknown ahead of time, so pure SJF is rare in LLM serving. Real systems approximate it with heuristics such as prompt-length bucketing, tenant priorities, or token-budget limits.

This queue calculation shows the benefit SJF would have if output lengths were known. Production engines need estimates, priorities, or fairness controls because they don't know these lengths in advance:

shortest-job-first-bound.py
1jobs = [("long", 50), ("quick", 5), ("medium", 20)] 2 3def average_completion(order): 4 elapsed = 0 5 completion_times = [] 6 for _, tokens in order: 7 elapsed += tokens 8 completion_times.append(elapsed) 9 return sum(completion_times) / len(completion_times) 10 11fcfs = average_completion(jobs) 12oracle_sjf = average_completion(sorted(jobs, key=lambda job: job[1])) 13print(f"FCFS average completion: {fcfs:.1f} steps") 14print(f"oracle SJF average completion: {oracle_sjf:.1f} steps") 15print("production caveat: output length is not known up front")
Output
1FCFS average completion: 60.0 steps 2oracle SJF average completion: 35.0 steps 3production caveat: output length is not known up front

Priority and Decode-First Scheduling

Many production schedulers care more about keeping existing streams smooth than about starting every new request immediately. A user notices TPOT jitter right away, while a modest TTFT increase on a newly arrived long prompt is often less damaging. That leads to policies such as decode-first scheduling, per-tenant priorities, and SLO-aware admission control.[3]

Preemption Strategies

When GPU memory is full (KV cache blocks are exhausted), the scheduler must either pause work, evict state, or stop admitting new requests. Two common recovery strategies are:

  1. Swap Out (CPU Offload): Move the victim request's KV cache from GPU VRAM to CPU RAM.
    • Pros: Saves progress; cheap to resume if PCIe bandwidth is available.
    • Cons: High latency penalty for the swapped-out request.
  2. Recomputation: Drop the KV cache entirely and re-run the prefill phase when the request is rescheduled.
    • Pros: Zero memory overhead on CPU; simple to implement.
    • Cons: Wastes GPU compute; works best when the prompt is short (low prefill cost).

These are design options, not universal defaults. The cheaper choice depends on prompt length, host-device bandwidth, and how expensive replaying prefill would be. Recomputation can be better when prompts are short enough that rebuilding state costs less than moving KV blocks off the GPU. Swapping matters when discarded state would be expensive to rebuild. If you tune a particular engine, check its current scheduler documentation and trace preemption events rather than assuming one policy is active.

Here is a simplified Python simulation of a continuous batch scheduler. It demonstrates the core iteration loop, including consistent KV-block accounting as requests grow token by token and a swap-out decision when the next step can't fit. The schedule_step method is called before every forward pass, and finish_decode_step is called after that pass emits one token for each active request.

preemption-strategies.py
1from dataclasses import dataclass 2 3@dataclass 4class Request: 5 id: str 6 prompt_len: int 7 max_new_tokens: int 8 priority: int = 0 # Higher value = higher priority 9 generated_tokens: int = 0 10 allocated_blocks: int = 0 11 swapped_to_cpu: bool = False 12 13 @property 14 def total_tokens(self) -> int: 15 return self.prompt_len + self.generated_tokens 16 17 def is_finished(self) -> bool: 18 """Check if request has reached EOS or max_new_tokens.""" 19 return self.generated_tokens >= self.max_new_tokens 20 21class ContinuousBatchScheduler: 22 """ 23 Simplified iteration-level scheduler. 24 25 Real engines also track per-request token budgets, prefix-cache hits, 26 EOS detection, and beam-search groups. This example focuses on the core 27 loop: retire finished requests, resume paused work, admit new work, then 28 preempt if the next decode step would overflow KV-cache capacity. 29 """ 30 BLOCK_SIZE = 16 31 32 def __init__(self, max_batch_size: int, max_memory_blocks: int): 33 self.max_batch_size = max_batch_size 34 self.max_memory_blocks = max_memory_blocks 35 self.running: list[Request] = [] # Currently generating on GPU 36 self.waiting: list[Request] = [] # Queued requests 37 self.swapped: list[Request] = [] # Preempted to CPU RAM 38 self.used_memory_blocks = 0 39 self.preemptions = 0 40 41 def schedule_step(self) -> list[Request]: 42 # 1. Retire finished requests and free their KV cache. 43 active = [] 44 for req in self.running: 45 if req.is_finished(): 46 self._free_kv_cache(req) 47 else: 48 active.append(req) 49 self.running = active 50 51 # 2. Resume preempted requests first so they don't starve. 52 self._admit(self.swapped, from_cpu=True) 53 54 # 3. Admit fresh waiting requests. 55 self._admit(self.waiting, from_cpu=False) 56 57 # 4. Preempt only if the next decode step would exceed the safe budget. 58 while self.running and not self._can_run_next_decode_step(self.running): 59 victim = min(self.running, key=lambda r: (r.priority, r.generated_tokens)) 60 self.running.remove(victim) 61 self._swap_out_to_cpu(victim) 62 victim.swapped_to_cpu = True 63 self.swapped.append(victim) 64 self.preemptions += 1 65 66 return self.running 67 68 def finish_decode_step(self) -> None: 69 # Call this after the model emits one token for each running request. 70 for req in self.running: 71 req.generated_tokens += 1 72 required_blocks = self._blocks_for_tokens(req.total_tokens) 73 growth = required_blocks - req.allocated_blocks 74 if growth > 0: 75 self.used_memory_blocks += growth 76 req.allocated_blocks = required_blocks 77 78 def _admit(self, queue: list[Request], from_cpu: bool) -> None: 79 queue.sort(key=lambda r: r.priority, reverse=True) 80 81 while ( 82 queue 83 and len(self.running) < self.max_batch_size 84 and self._can_allocate_current_state(queue[0]) 85 ): 86 req = queue.pop(0) 87 if from_cpu: 88 self._swap_in_from_cpu(req) 89 else: 90 self._allocate_kv_cache(req) 91 req.swapped_to_cpu = False 92 self.running.append(req) 93 94 def _blocks_for_tokens(self, token_count: int) -> int: 95 return (token_count + self.BLOCK_SIZE - 1) // self.BLOCK_SIZE 96 97 def _can_allocate_current_state(self, req: Request) -> bool: 98 return self.used_memory_blocks + self._blocks_for_tokens(req.total_tokens) <= self.max_memory_blocks 99 100 def _allocate_kv_cache(self, req: Request) -> None: 101 req.allocated_blocks = self._blocks_for_tokens(req.total_tokens) 102 self.used_memory_blocks += req.allocated_blocks 103 104 def _free_kv_cache(self, req: Request) -> None: 105 self.used_memory_blocks -= req.allocated_blocks 106 req.allocated_blocks = 0 107 108 def _swap_out_to_cpu(self, req: Request) -> None: 109 self._free_kv_cache(req) 110 111 def _swap_in_from_cpu(self, req: Request) -> None: 112 self._allocate_kv_cache(req) 113 114 def _projected_growth_for_next_token(self, req: Request) -> int: 115 next_blocks = self._blocks_for_tokens(req.total_tokens + 1) 116 return max(0, next_blocks - req.allocated_blocks) 117 118 def _can_run_next_decode_step(self, batch: list[Request]) -> bool: 119 growth = sum(self._projected_growth_for_next_token(req) for req in batch) 120 return self.used_memory_blocks + growth <= self.max_memory_blocks 121 122scheduler = ContinuousBatchScheduler(max_batch_size=3, max_memory_blocks=16) 123scheduler.waiting.extend([ 124 Request("req_A", prompt_len=8, max_new_tokens=3), 125 Request("req_B", prompt_len=8, max_new_tokens=1), 126 Request("req_C", prompt_len=8, max_new_tokens=2), 127 Request("req_D", prompt_len=8, max_new_tokens=2), 128 Request("req_E", prompt_len=8, max_new_tokens=1), 129]) 130 131timeline: list[list[str]] = [] 132peak_blocks = 0 133for _ in range(6): 134 active = scheduler.schedule_step() 135 timeline.append([req.id for req in active]) 136 peak_blocks = max(peak_blocks, scheduler.used_memory_blocks) 137 scheduler.finish_decode_step() 138 peak_blocks = max(peak_blocks, scheduler.used_memory_blocks) 139 140print(timeline[:3]) 141print(f"peak blocks: {peak_blocks}/{scheduler.max_memory_blocks}") 142 143assert timeline[0] == ["req_A", "req_B", "req_C"] 144assert timeline[1] == ["req_A", "req_C", "req_D"] 145assert "req_E" in timeline[2] 146assert peak_blocks <= scheduler.max_memory_blocks 147 148# A tight KV-block budget forces an actual preemption before decode. 149pressure = ContinuousBatchScheduler(max_batch_size=2, max_memory_blocks=2) 150pressure.waiting.extend([ 151 Request("urgent", prompt_len=16, max_new_tokens=2, priority=1), 152 Request("background", prompt_len=16, max_new_tokens=2, priority=0), 153]) 154active_under_pressure = pressure.schedule_step() 155print([req.id for req in active_under_pressure], f"preemptions={pressure.preemptions}") 156assert [req.id for req in active_under_pressure] == ["urgent"] 157assert pressure.preemptions == 1
Output
1[['req_A', 'req_B', 'req_C'], ['req_A', 'req_C', 'req_D'], ['req_A', 'req_D', 'req_E']] 2peak blocks: 3/16 3['urgent'] preemptions=1

The key observation is that schedule_step() rebuilds the active set at every iteration, scanning the running requests and admitting queued work under the batch and KV-cache budgets.

In a real server, the control loop alternates between schedule_step() and finish_decode_step(): schedule the next forward pass, run it once, record the extra token and any newly needed KV block, then repeat. Real schedulers add more machinery on top of this loop, such as prefix-cache reuse, cancellation handling, speculative decoding, and token-budget accounting. A recomputation-based preemption path would drop the victim's KV cache and put it back into waiting instead of parking it in swapped. The tight-budget assertion proves that this example's preemption branch runs rather than merely existing in the code.


Serving Engines and System Designs

The ideas in this article show up across most modern inference stacks, but some entries below are production engines while others are research systems that shaped later engine design:

FrameworkWhat it emphasizesWhere it shines
vLLM[2]PagedAttention and continuous batching for high-throughput servingGeneral-purpose high-throughput serving
TensorRT-LLM[4]NVIDIA-optimized kernels, in-flight batching, paged KV cacheTeams standardized on the NVIDIA serving stack
SGLang[5]Continuous batching plus aggressive prefix reuse (RadixAttention)Workloads with repeated prefixes or structured generation
DistServe[3]Explicit prefill/decode disaggregation and SLO-aware schedulingDeployments where TTFT and TPOT targets dominate raw TPS

Other servers implement many of the same ideas too. Feature matrices change quickly, so treat engine comparisons as moving targets and verify the current docs before making a production choice.


Throughput vs. Goodput

Benchmark numbers in this space are highly workload-dependent. The meaningful comparison isn't "Which server is fastest?" but "Which system stays within TTFT and TPOT targets for my traffic mix?"

The literature still gives useful anchor points:

SourceWhat it measuredWhy it matters
Orca[1]Iteration-level scheduling against request-level baselinesShowed that per-iteration scheduling can improve throughput and latency together
vLLM[2]PagedAttention plus continuous batching against prior serving systemsReported 2-4x higher throughput at the same latency level compared with systems such as FasterTransformer and Orca in the paper's evaluated setups
DistServe[3]Phase-disaggregated serving under TTFT/TPOT SLOsReported up to 7.4x more request capacity or 12.6x tighter SLOs while keeping more than 90% of requests within latency constraints

Goodput: The Production Metric

Raw throughput can be misleading if a large fraction of requests miss their latency target. DistServe frames goodput as the maximum request rate that can be served while still meeting TTFT and TPOT objectives at a chosen SLO-attainment target.[3] A simple mental model is:

Goodput=requests processed within SLAtotal time\text{Goodput} = \frac{\text{requests processed within SLA}}{\text{total time}}Goodput=total timerequests processed within SLA​

Unlike raw throughput (which counts all completed requests), goodput only counts requests that satisfy the latency target. In interview settings, this is the right framing: optimize tokens/sec only after you define the TTFT, TPOT, and end-to-end latency targets that matter for users.

Calculate goodput from an explicit latency objective rather than reporting only completions:

goodput-under-slo.py
1window_seconds = 10 2latencies_ms = [ 3 {"ttft": 120, "tpot": 25}, 4 {"ttft": 450, "tpot": 22}, 5 {"ttft": 180, "tpot": 42}, 6 {"ttft": 190, "tpot": 27}, 7] 8ttft_target, tpot_target = 300, 30 9within_slo = [ 10 item for item in latencies_ms 11 if item["ttft"] <= ttft_target and item["tpot"] <= tpot_target 12] 13 14print(f"raw throughput: {len(latencies_ms) / window_seconds:.2f} req/s") 15print(f"goodput: {len(within_slo) / window_seconds:.2f} req/s") 16print(f"SLO-compliant requests: {len(within_slo)}/{len(latencies_ms)}")
Output
1raw throughput: 0.40 req/s 2goodput: 0.20 req/s 3SLO-compliant requests: 2/4


Advanced Scheduling: Chunked Prefill

One remaining bottleneck in naive continuous batching is prefill/decode interference. Sarathi-Serve addresses this with chunked-prefills plus decode-maximal batching.[6] A single long prefill can delay other requests waiting for a decode step, producing visible gaps in token output when the phase mix is poorly scheduled.

Mechanism

Chunked prefill splits a long prefill phase into smaller blocks (for example, 512-token or 1,024-token chunks). Sarathi then uses decode-maximal batching to pair one prefill chunk with as many decode requests as possible, so the chunk keeps the GPU compute units busy while the decodes "piggyback" on the same batch. This flow shows how chunking prevents latency spikes:

Chunked prefill splits a long prompt into smaller chunks so decode streams can run between chunks instead of stalling behind one long prefill. Chunked prefill splits a long prompt into smaller chunks so decode streams can run between chunks instead of stalling behind one long prefill.
Chunked prefill trades one large latency spike for smaller prefill chunks interleaved with decode work. Existing streams keep receiving tokens.

Why This Matters

Without chunked prefill, a long prefill can occupy a long scheduling interval. During this window, concurrent decode requests wait for their next token. Users can see irregular token output even if average TBT looks acceptable.

With chunked prefill, the same 4,000-token document can be split into smaller pieces. With a 512-token chunk size, for example, the scheduler processes about 8 chunks and gains scheduling boundaries at which decode work can run. The long prompt may pay overhead from additional scheduling and kernel boundaries, but one admitted prefill chunk no longer consumes the entire prompt in one step.

Current vLLM V1 documentation describes chunked prefill as enabled by default: the scheduler prioritizes decode requests and uses the remaining max_num_batched_tokens budget for prefills, chunking a prefill that doesn't fit. Its tuning guidance notes the tradeoff: larger token budgets tend to improve TTFT and throughput, while smaller budgets can improve inter-token latency.[7]

Here is the same budget tradeoff as a small schedule:

chunked-prefill-plan.py
1prompt_tokens = 4000 2chunk_size = 512 3chunks = [] 4remaining = prompt_tokens 5while remaining: 6 current = min(chunk_size, remaining) 7 chunks.append(current) 8 remaining -= current 9 10print(f"chunk count: {len(chunks)}") 11print(f"first/last chunk: {chunks[0]}/{chunks[-1]} tokens") 12print(f"decode scheduling boundaries: {len(chunks) - 1}")
Output
1chunk count: 8 2first/last chunk: 512/416 tokens 3decode scheduling boundaries: 7

Disaggregated Serving

A more advanced approach, disaggregated serving, splits prefill and decode across separate worker pools. Splitwise[8] and DistServe[3] study this design explicitly. The motivation is simple: prefill often wants dense matrix-math throughput, while decode often wants KV-cache capacity and memory bandwidth.

Disaggregation isn't the right default for every deployment. For short prompts, small clusters, or workloads with high prefix-cache hit rates, running prefill on the decode worker can be simpler than paying a KV-transfer hop. The right comparison is goodput under a defined TTFT and TPOT objective, including transfer overhead.

Architecture

By decoupling these two phases, we can construct a specialized pipeline where requests are routed to the most appropriate hardware pool for their current state. The following diagram illustrates this disaggregated architecture:

Disaggregated serving routes prompts to prefill workers, transfers KV state, then streams tokens from decode workers tuned for memory bandwidth. Disaggregated serving routes prompts to prefill workers, transfers KV state, then streams tokens from decode workers tuned for memory bandwidth.
Disaggregation splits serving by phase. Prefill workers optimize dense prompt processing; decode workers optimize KV-cache capacity and token streaming.

Benefits

  • Prefill workers: Can be sized and tuned for high matrix-math throughput.

  • Decode workers: Can be sized for KV-cache capacity and memory bandwidth without worrying about prefill interference.

  • Scaling: You can scale prefill and decode pools independently based on whether your traffic skews toward long prompts or long generations.

Cost

The KV cache (or equivalent prefill state) must be transferred between pools. Whether that transfer is worth it depends on prompt length, interconnect speed, and your TTFT/TPOT SLOs. DistServe is a good reminder that once you optimize for goodput instead of raw TPS, paying that transfer cost can still be the right tradeoff.[3]

Estimate the handoff bytes before assuming split pools improve latency:

kv-handoff-cost.py
1layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2 2prompt_tokens = 4096 3link_gib_per_second = 50 4kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * prompt_tokens 5transfer_ms = kv_bytes / (link_gib_per_second * 1024**3) * 1000 6 7print(f"KV handoff size: {kv_bytes / 1024**3:.2f} GiB") 8print(f"idealized transfer time at {link_gib_per_second} GiB/s: {transfer_ms:.1f} ms") 9print("production check: add protocol, queueing, and synchronization overhead")
Output
1KV handoff size: 1.25 GiB 2idealized transfer time at 50 GiB/s: 25.0 ms 3production check: add protocol, queueing, and synchronization overhead

Common Mistakes in Production

Even experienced engineers trip over a few predictable traps when tuning a continuous batching setup.

Static Batching Mindset

Symptom: You design the scheduler as if all requests in a batch must start and finish together.

Cause: Traditional model-serving systems often batch fixed-shape requests, so the batch is treated as one request-level unit of work.

Fix: Treat decoding as an iteration-level loop. Finished requests leave at the next boundary, cancellations free their KV blocks, and waiting requests enter when batch and token budgets allow.

The OOM Oversimplification

Symptom: You increase max_batch_size and the server crashes with an out-of-memory error mid-generation.

Cause: You treated memory as a fixed cost per request, ignoring that the KV cache grows with every output token. A batch of ten short prompts fits easily, but if those prompts turn into long generations, the cache doubles or triples in size.

Fix: Tune by KV cache blocks or token budget, not by raw request count. Monitor peak memory during the longest generations in your workload, and leave headroom for bursts.

Ignoring Time to First Token (TTFT)

Symptom: Throughput benchmarks look great, but users complain that new requests take forever to start responding.

Cause: The scheduler is so focused on keeping the decode batch full that it delays admitting new prefills. A long queue of decode work starves the waiting queue.

Fix: Set a maximum decode-only iteration budget or a waiting-timeout. If a request sits in the queue longer than your TTFT target, preempt a low-priority decode stream and admit the new prefill. The right balance depends on your product: a chat app needs low TTFT, while a background summarization pipeline can tolerate a longer wait.


Try It Yourself: The Manual Scheduler

Here's a concrete exercise to test whether the concept has stuck.

Scenario: You have a GPU batch size of 3. Five support tickets arrive:

TicketPrefill lengthExpected output
T110 tokens20 tokens
T25 tokens40 tokens
T38 tokens15 tokens
T412 tokens30 tokens
T56 tokens10 tokens

Task: Walk through the first 50 iterations by hand.

  1. Which three tickets start in the initial batch?
  2. At which iteration does the first ticket finish?
  3. Which ticket replaces it?
  4. How many total decode iterations would static batching require if the initial batch were [T1, T2, T3]?
  5. How many iterations does continuous batching save?

Answer sketch: T3 finishes first, at iteration 15, making room for T4. T1 finishes at iteration 20, making room for T5. T2 finishes at iteration 40. T5, which started at iteration 20, finishes at iteration 30. T4, which started at iteration 15, finishes at iteration 45. Static batching needs 40 iterations for the first batch and 30 for the second, totaling 70. Continuous batching finishes all five tickets in 45 iterations, saving 25 iterations (roughly 35% faster).

Check the hand calculation with a three-slot event simulation:

manual-scheduler-answer.py
1import heapq 2 3durations = [20, 40, 15, 30, 10] 4slots = 3 5static_iterations = sum(max(durations[i:i + slots]) for i in range(0, len(durations), slots)) 6 7active = durations[:slots] 8heapq.heapify(active) 9for duration in durations[slots:]: 10 available_at = heapq.heappop(active) 11 heapq.heappush(active, available_at + duration) 12continuous_iterations = max(active) 13 14print(f"static iterations: {static_iterations}") 15print(f"continuous iterations: {continuous_iterations}") 16print(f"iterations saved: {static_iterations - continuous_iterations}")
Output
1static iterations: 70 2continuous iterations: 45 3iterations saved: 25

Key takeaways

  1. Static batching wastes capacity when request lengths differ. Finished requests leave idle holes until the slowest request completes.

  2. Continuous batching is common in high-throughput LLM serving. It schedules at the iteration level, adding and removing requests dynamically.

  3. Memory management and scheduling are inseparable. Dynamic admission only works well if KV-cache allocation is flexible.

  4. Prefill/decode interference is often the next bottleneck. Chunked prefill and disaggregation both exist to keep decode latency smooth.

  5. Metrics matter. Don't optimize only for tokens/sec. Optimize for Goodput, TTFT, and TPOT under your actual workload.

The evolution of LLM serving is a move toward finer-grained control: batch-level static scheduling, then iteration-level continuous scheduling, then chunked prefill to reduce interference, then phase-level disaggregation when prefill and decode need different hardware pools.


What you should be able to defend

After working through this chapter, you should be able to:

  1. Explain static-batch waste. Show why heterogeneous output lengths create idle slots and, in pipelined deployments, pipeline bubbles.
  2. Describe iteration-level scheduling. Walk through finished-request removal, queued-request admission, and active-set rebuilding at decode boundaries.
  3. Analyze prefill/decode interference. Explain why compute-heavy prefills can stall memory-bound decode streams.
  4. Connect scheduling to memory. Explain why PagedAttention and fixed-size KV blocks make continuous admission practical.
  5. Compare policies. Discuss FCFS, priority scheduling, decode-first scheduling, swapping, and recomputation.
  6. Use goodput correctly. Define goodput as request throughput under TTFT and TPOT/SLA targets, not raw tokens per second.

Production questions

How does Sarathi-Serve reduce prefill-decode interference?

Sarathi-Serve splits long prefills into smaller chunks, then builds decode-maximal batches around those chunks. One prefill chunk can run with as many decode requests as possible, so a single long prompt does not block every active stream for a large uninterrupted prefill window. The result is smoother TBT/TPOT and lower tail latency for concurrent users.[6]

What is the optimal batch size for a given GPU?

There is no universal batch size. The practical limit depends on weight memory, KV-cache growth, prompt length distribution, output length distribution, and scheduler token budget. Profile on target hardware and choose the largest request or token budget that improves throughput without missing TTFT, TPOT, or end-to-end latency SLOs.

How should cancellations work mid-generation?

A canceled request should leave the running set at the next scheduler boundary. Its KV-cache pages should be returned to the memory pool immediately, so the scheduler can admit other work instead of carrying a dead sequence until the original batch would have finished.

How does disaggregated inference separate prefill and decode?

Disaggregated inference routes prompt processing to a prefill pool and token generation to a decode pool. After prefill finishes, the KV state moves to a decode worker. That lets prefill workers optimize for dense matrix math and decode workers optimize for KV-cache capacity and memory bandwidth, at the cost of an extra state-transfer hop.[3]

Why is PagedAttention closely tied to continuous batching?

Continuous batching works much better when KV caches are allocated in fixed-size blocks instead of large contiguous tensors. Requests start, grow, finish, and cancel at unpredictable times. PagedAttention maps logical token blocks to non-contiguous physical blocks, so the scheduler can admit new work without needing large contiguous VRAM regions.[2]

Course handoff

You now understand why LLM serving is fundamentally an iteration-level scheduling problem, not a request-level one. You can explain the difference between static and continuous batching, calculate the throughput gain from slot reuse, and identify why flexible memory management and careful prefill/decode scheduling are needed to keep latency smooth.

The next article, Scaling LLM Inference, builds directly on this foundation. It explains why decode is memory-bandwidth bound, how KV cache design and PagedAttention shape the batch size limit, and how speculative decoding adds another layer of throughput optimization. You'll need the mental model of the iteration loop and the token budget you built here to understand why speculative decoding does more than speed up individual requests; it changes the math for the whole batch.

Next Step
Continue to Scaling LLM Inference

Scheduling explains how requests share GPU time; scaling inference combines that with KV-cache memory, batching policy, and deployment topology to handle real traffic.

PreviousFlashAttention & Memory Efficiency
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

Orca: A Distributed Serving System for Transformer-Based Generative Models.

Yu, G.-I., et al. · 2022 · OSDI 2022

Efficient Memory Management for Large Language Model Serving with PagedAttention.

Kwon, W., et al. · 2023 · SOSP 2023

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.

Zhong, Y., et al. · 2024 · OSDI 2024

TensorRT-LLM: A High-Performance Inference Framework for LLMs.

NVIDIA · 2024

SGLang: Efficient Execution of Structured Language Model Programs.

Zheng, L., et al. · 2023

Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.

Agrawal, A., et al. · 2023 · arXiv preprint

Optimization and Tuning.

vLLM · 2026

Splitwise: Efficient Generative LLM Inference Using Phase Splitting.

Patel, P., et al. · 2023