Understand how LLM schedulers use continuous batching, chunked prefill, and prefill-decode disaggregation to improve throughput without violating TTFT or inter-token latency targets.
The previous chapter showed how FlashAttention reduces memory traffic inside one attention kernel. This chapter moves up one layer: how does an inference server decide which requests share each forward pass?
Continuous batching is a serving scheduler problem: keep the GPU busy while requests arrive with different prompt lengths, output lengths, and deadlines. This chapter shows how production LLM servers batch work without waiting for every request to finish at the same time.
Imagine an order-support chatbot on a busy shopping holiday. Some customers ask for a quick refund status (two sentences). Others paste their entire order history and ask for a detailed analysis (ten paragraphs). Hundreds of these requests hit the server at once.
A simple request-level baseline handles this with static batching. It fills a batch of requests, runs them together, and waits for the longest one to finish before starting the next batch. If one customer asks for a ten-paragraph analysis, completed short answers leave idle batch slots until the long generation ends.
Continuous batching[1][2] fixes this waste by letting the server reuse a slot at the next token-generation step (the next iteration) after a request finishes. Under sustained mixed-length workloads, that can increase useful GPU work per step without changing the model itself. It pairs naturally with flexible Key-Value (KV) cache management, because admission still depends on available cache memory.
To understand why static batching is so wasteful, recall two facts about how LLMs generate text.
First, LLMs are autoregressive: they produce text one token at a time. Processing the input prompt (the prefill phase) can be done in parallel, but generating each new token (the decode phase) happens sequentially. A short refund question might need only ten decode steps. A long order analysis might need two hundred.
Second, the output token count usually isn't known ahead of time. The engine stops when it hits a stop token or a length limit. That means the scheduler can't know in advance when a request will finish.
These two facts make LLM serving fundamentally different from image-classification serving, where every request has the same fixed shape and finishes in a single forward pass.
In fixed-shape model serving, batching is simpler because requests may complete in one forward pass. In LLM serving, requests have varying prompt and output lengths. Because output token count usually isn't known until generation stops (either at a stop token or a length limit), the engine can't know in advance when a request will finish.
With static batching, all requests in a batch must wait for the longest request to complete before the next batch can start. This synchronous execution model forces the engine to maintain the batch shape until each sequence is fully generated. The result is large idle capacity and, in pipelined deployments, pipeline bubbles:
| Request | Tokens | Status |
|---|---|---|
| Request 1 | 100 tokens | Done at step 100, waits 300 more steps |
| Request 2 | 400 tokens | Finishes at step 400 |
| Request 3 | 50 tokens | Done at step 50, waits 350 more steps |
Most slots become idle as soon as their request finishes, so utilization can collapse in heterogeneous workloads. Requests 1 and 3 are done early, but their GPU memory stays reserved until Request 2 completes. No new requests can join because the batch shape is fixed. Throughput falls even though the hardware still has work it could be doing.
Continuous batching (also known as iteration-level scheduling) moves scheduling to the decode-step boundary. After each generation step (or after each prefill chunk in engines that chunk prefills), the server checks which requests finished, frees their slots, and admits new work if memory and token-budget constraints allow.
Static batching is like a delivery truck that waits until every warehouse order is packed before leaving the dock. Even if the first parcel was ready ten minutes ago, that slot stays blocked until the slowest parcel is ready. Continuous batching is like a cross-dock lane with fixed loading slots: as soon as one parcel leaves the belt, the next queued parcel takes its place. The lane stays full, maximizing useful work per step.
Completed requests leave immediately, and queued requests can join on the next iteration. The scheduler effectively rebuilds the active running set for the next forward pass.
| Step | Batch Composition | Event |
|---|---|---|
| 0 | [Req1, Req2, Req3] | Initial batch |
| 50 | [Req1, Req2, Req4 (new)] | Req3 done, Req4 joins |
| 100 | [Req5 (new), Req2, Req4] | Req1 done, Req5 joins |
| 400 | [Req5, Req6 (new), Req4] | Req2 done, Req6 joins |
Make the waste concrete with numbers. Suppose the GPU can run three requests at once, and three support tickets arrive:
| Request | Task | Output needed |
|---|---|---|
| Ticket A | Refund status | 10 tokens |
| Ticket B | Shipping delay explanation | 50 tokens |
| Ticket C | Full order-history analysis | 200 tokens |
Static batching runs all three together. Ticket A finishes after 10 iterations, Ticket B after 50, and Ticket C after 200. For iterations 11 through 200, the slots for A and B sit idle. Total GPU iterations for this batch: 200. Wasted slot-iterations: 190 for A's slot plus 150 for B's slot = 340 idle slots.
Continuous batching starts with [A, B, C]. At iteration 10, Ticket A finishes and a new ticket (Ticket D) joins immediately. At iteration 50, Ticket B finishes and Ticket E joins. The GPU keeps all three slots busy until the queue empties. The only idle iterations happen when the waiting queue is empty.
In this tiny example, continuous batching fills slots that static batching would leave idle, so the server can answer substantially more customer questions per GPU hour without a hardware upgrade.
Under sustained load, slot reuse avoids idle work from completed short requests. Actual utilization still depends on token budgets, memory pressure, phase mix, and kernel efficiency.
This first calculation isolates slot reuse from GPU details. It treats each active output token as one useful slot-iteration:
1output_lengths = [10, 50, 200]
2batch_slots = len(output_lengths)
3static_capacity = max(output_lengths) * batch_slots
4useful_work = sum(output_lengths)
5idle_work = static_capacity - useful_work
6
7print(f"useful slot-iterations: {useful_work}")
8print(f"static idle slot-iterations: {idle_work}")
9print(f"static utilization: {useful_work / static_capacity:.1%}")1useful slot-iterations: 260
2static idle slot-iterations: 340
3static utilization: 43.3%
One subtle but important detail: modern schedulers are usually limited by a token budget or KV-cache budget, rather than a raw "number of requests" cap. One long prefill can consume more scheduler budget than many short decode-only requests.
Continuous batching pairs naturally with a flexible memory manager. Because requests come and go, their Key-Value (KV) caches cannot be stored efficiently in contiguous tensors. This is why PagedAttention (block-based memory management) is such a practical companion to continuous batching.
To understand why continuous batching was difficult to implement initially, we need to look at memory management. Recall that during the decode phase, the model stores a Key-Value (KV) cache: a running record of attention states for every token generated so far. Because requests leave and join the batch at unpredictable times, their KV caches can't be stored efficiently in contiguous tensors.
In standard Transformer implementations, the KV cache for a request is often treated like a growing contiguous buffer sized for a maximum sequence length.
When requests have different lengths and arrive at different times:
max_seq_len (e.g., 4096 tokens) even if the request generates 50 tokens.The KV cache for a single sequence grows with every decode step. For a 70B-class model with GQA (80 layers, 8 KV heads, head dimension 128, FP16), each new token adds about 320 KiB of KV state. A 4,096-token sequence therefore needs roughly 1.25 GiB of cache. If the same model stored a full KV head set instead of GQA, the cache would balloon to roughly 10 GiB at 4K tokens. That's why fragmentation hurts so much: even modest over-allocation can strand gigabytes of VRAM that could have admitted other requests.
A useful size estimate follows this formula:
Where is the number of layers, is the sequence length, is the number of KV heads, is the head dimension, and is bytes per element (2 for FP16). The leading 2 accounts for both key and value tensors.
Use the formula directly before choosing a concurrency target:
1layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2
2tokens = 4096
3bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
4cache_bytes = bytes_per_token * tokens
5
6print(f"KV state per token: {bytes_per_token / 1024:.0f} KiB")
7print(f"KV state at {tokens:,} tokens: {cache_bytes / 1024**3:.2f} GiB")1KV state per token: 320 KiB
2KV state at 4,096 tokens: 1.25 GiBPagedAttention[2] solves this by borrowing the concept of virtual memory from operating systems, as illustrated below.
This allows the scheduler to allocate memory dynamically as tokens are generated. The vLLM paper reports less than 4% waste in its evaluated allocation design, in contrast with substantially higher reservation waste in its baseline systems.[2] Fixed blocks make dynamic admission practical without requiring large contiguous allocations.
This toy allocation computes the last-block slack for a few active sequences:
1block_size = 16
2sequence_lengths = [17, 31, 48, 65]
3reserved = sum(((length + block_size - 1) // block_size) * block_size for length in sequence_lengths)
4used = sum(sequence_lengths)
5slack = reserved - used
6
7print(f"active tokens: {used}")
8print(f"reserved block slots: {reserved}")
9print(f"last-block slack: {slack} tokens ({slack / reserved:.1%})")1active tokens: 161
2reserved block slots: 192
3last-block slack: 31 tokens (16.1%)PagedAttention is a memory management technique for non-contiguous KV blocks. FlashAttention is a compute optimization that tiles attention to reduce High Bandwidth Memory (HBM) I/O. They are orthogonal techniques, but systems such as vLLM often combine them to improve both memory capacity and compute speed.
LLM serving has two distinct computational phases with opposing characteristics:
The following table summarizes the key differences between the prefill and decode phases:
| Phase | Arithmetic Intensity | Bottleneck | Batching Strategy |
|---|---|---|---|
| Prefill | Often high | Frequently compute throughput | Batch prompt tokens |
| Decode | Often lower | Frequently memory bandwidth | Batch active streams |
Blindly mixing a full long prefill with decode work can harm inter-token latency. A large prefill request (for example, processing a 4K-token document) can delay decode steps for concurrent streams. Later in the article we'll look at chunked prefill, which bounds how much prefill work competes in one scheduling step.
Token-budget scheduling makes that bound visible:
1max_num_batched_tokens = 1024
2decode_requests = 96
3pending_prefill_tokens = 1800
4prefill_budget = max_num_batched_tokens - decode_requests
5scheduled_prefill = min(pending_prefill_tokens, prefill_budget)
6
7print(f"decode tokens scheduled first: {decode_requests}")
8print(f"prefill tokens scheduled now: {scheduled_prefill}")
9print(f"prefill tokens left for later: {pending_prefill_tokens - scheduled_prefill}")1decode tokens scheduled first: 96
2prefill tokens scheduled now: 928
3prefill tokens left for later: 872Solutions include:
The scheduler's job is to decide which requests to run in the next iteration.
The scheduler is like a warehouse dispatch desk. First-Come-First-Served (FCFS) loads orders in arrival order, which is fair, but a one-item replacement waits behind a huge wholesale pallet. Shortest-Job-First (SJF) prioritizes quick orders first, which improves average wait time. Preemption is like pausing a low-priority batch to ship an urgent replacement, then resuming later. Each scheduling policy is a tradeoff between fairness, throughput, and latency.
It's the most straightforward approach and simple to reason about. Requests are admitted in arrival order, which is usually fair, but it isn't optimal for overall system throughput. A long-running request at the front of the queue can block many shorter requests behind it, increasing the average latency across the system.
The scheduler prioritizes requests that are expected to require fewer output tokens. In theory, this can improve average latency because short responses (like a simple "Yes" or "No") don't get stuck behind a long refund-policy explanation. In practice, output length is unknown ahead of time, so pure SJF is rare in LLM serving. Real systems approximate it with heuristics such as prompt-length bucketing, tenant priorities, or token-budget limits.
This queue calculation shows the benefit SJF would have if output lengths were known. Production engines need estimates, priorities, or fairness controls because they don't know these lengths in advance:
1jobs = [("long", 50), ("quick", 5), ("medium", 20)]
2
3def average_completion(order):
4 elapsed = 0
5 completion_times = []
6 for _, tokens in order:
7 elapsed += tokens
8 completion_times.append(elapsed)
9 return sum(completion_times) / len(completion_times)
10
11fcfs = average_completion(jobs)
12oracle_sjf = average_completion(sorted(jobs, key=lambda job: job[1]))
13print(f"FCFS average completion: {fcfs:.1f} steps")
14print(f"oracle SJF average completion: {oracle_sjf:.1f} steps")
15print("production caveat: output length is not known up front")1FCFS average completion: 60.0 steps
2oracle SJF average completion: 35.0 steps
3production caveat: output length is not known up frontMany production schedulers care more about keeping existing streams smooth than about starting every new request immediately. A user notices TPOT jitter right away, while a modest TTFT increase on a newly arrived long prompt is often less damaging. That leads to policies such as decode-first scheduling, per-tenant priorities, and SLO-aware admission control.[3]
When GPU memory is full (KV cache blocks are exhausted), the scheduler must either pause work, evict state, or stop admitting new requests. Two common recovery strategies are:
These are design options, not universal defaults. The cheaper choice depends on prompt length, host-device bandwidth, and how expensive replaying prefill would be. Recomputation can be better when prompts are short enough that rebuilding state costs less than moving KV blocks off the GPU. Swapping matters when discarded state would be expensive to rebuild. If you tune a particular engine, check its current scheduler documentation and trace preemption events rather than assuming one policy is active.
Here is a simplified Python simulation of a continuous batch scheduler. It demonstrates the core iteration loop, including consistent KV-block accounting as requests grow token by token and a swap-out decision when the next step can't fit. The schedule_step method is called before every forward pass, and finish_decode_step is called after that pass emits one token for each active request.
1from dataclasses import dataclass
2
3@dataclass
4class Request:
5 id: str
6 prompt_len: int
7 max_new_tokens: int
8 priority: int = 0 # Higher value = higher priority
9 generated_tokens: int = 0
10 allocated_blocks: int = 0
11 swapped_to_cpu: bool = False
12
13 @property
14 def total_tokens(self) -> int:
15 return self.prompt_len + self.generated_tokens
16
17 def is_finished(self) -> bool:
18 """Check if request has reached EOS or max_new_tokens."""
19 return self.generated_tokens >= self.max_new_tokens
20
21class ContinuousBatchScheduler:
22 """
23 Simplified iteration-level scheduler.
24
25 Real engines also track per-request token budgets, prefix-cache hits,
26 EOS detection, and beam-search groups. This example focuses on the core
27 loop: retire finished requests, resume paused work, admit new work, then
28 preempt if the next decode step would overflow KV-cache capacity.
29 """
30 BLOCK_SIZE = 16
31
32 def __init__(self, max_batch_size: int, max_memory_blocks: int):
33 self.max_batch_size = max_batch_size
34 self.max_memory_blocks = max_memory_blocks
35 self.running: list[Request] = [] # Currently generating on GPU
36 self.waiting: list[Request] = [] # Queued requests
37 self.swapped: list[Request] = [] # Preempted to CPU RAM
38 self.used_memory_blocks = 0
39 self.preemptions = 0
40
41 def schedule_step(self) -> list[Request]:
42 # 1. Retire finished requests and free their KV cache.
43 active = []
44 for req in self.running:
45 if req.is_finished():
46 self._free_kv_cache(req)
47 else:
48 active.append(req)
49 self.running = active
50
51 # 2. Resume preempted requests first so they don't starve.
52 self._admit(self.swapped, from_cpu=True)
53
54 # 3. Admit fresh waiting requests.
55 self._admit(self.waiting, from_cpu=False)
56
57 # 4. Preempt only if the next decode step would exceed the safe budget.
58 while self.running and not self._can_run_next_decode_step(self.running):
59 victim = min(self.running, key=lambda r: (r.priority, r.generated_tokens))
60 self.running.remove(victim)
61 self._swap_out_to_cpu(victim)
62 victim.swapped_to_cpu = True
63 self.swapped.append(victim)
64 self.preemptions += 1
65
66 return self.running
67
68 def finish_decode_step(self) -> None:
69 # Call this after the model emits one token for each running request.
70 for req in self.running:
71 req.generated_tokens += 1
72 required_blocks = self._blocks_for_tokens(req.total_tokens)
73 growth = required_blocks - req.allocated_blocks
74 if growth > 0:
75 self.used_memory_blocks += growth
76 req.allocated_blocks = required_blocks
77
78 def _admit(self, queue: list[Request], from_cpu: bool) -> None:
79 queue.sort(key=lambda r: r.priority, reverse=True)
80
81 while (
82 queue
83 and len(self.running) < self.max_batch_size
84 and self._can_allocate_current_state(queue[0])
85 ):
86 req = queue.pop(0)
87 if from_cpu:
88 self._swap_in_from_cpu(req)
89 else:
90 self._allocate_kv_cache(req)
91 req.swapped_to_cpu = False
92 self.running.append(req)
93
94 def _blocks_for_tokens(self, token_count: int) -> int:
95 return (token_count + self.BLOCK_SIZE - 1) // self.BLOCK_SIZE
96
97 def _can_allocate_current_state(self, req: Request) -> bool:
98 return self.used_memory_blocks + self._blocks_for_tokens(req.total_tokens) <= self.max_memory_blocks
99
100 def _allocate_kv_cache(self, req: Request) -> None:
101 req.allocated_blocks = self._blocks_for_tokens(req.total_tokens)
102 self.used_memory_blocks += req.allocated_blocks
103
104 def _free_kv_cache(self, req: Request) -> None:
105 self.used_memory_blocks -= req.allocated_blocks
106 req.allocated_blocks = 0
107
108 def _swap_out_to_cpu(self, req: Request) -> None:
109 self._free_kv_cache(req)
110
111 def _swap_in_from_cpu(self, req: Request) -> None:
112 self._allocate_kv_cache(req)
113
114 def _projected_growth_for_next_token(self, req: Request) -> int:
115 next_blocks = self._blocks_for_tokens(req.total_tokens + 1)
116 return max(0, next_blocks - req.allocated_blocks)
117
118 def _can_run_next_decode_step(self, batch: list[Request]) -> bool:
119 growth = sum(self._projected_growth_for_next_token(req) for req in batch)
120 return self.used_memory_blocks + growth <= self.max_memory_blocks
121
122scheduler = ContinuousBatchScheduler(max_batch_size=3, max_memory_blocks=16)
123scheduler.waiting.extend([
124 Request("req_A", prompt_len=8, max_new_tokens=3),
125 Request("req_B", prompt_len=8, max_new_tokens=1),
126 Request("req_C", prompt_len=8, max_new_tokens=2),
127 Request("req_D", prompt_len=8, max_new_tokens=2),
128 Request("req_E", prompt_len=8, max_new_tokens=1),
129])
130
131timeline: list[list[str]] = []
132peak_blocks = 0
133for _ in range(6):
134 active = scheduler.schedule_step()
135 timeline.append([req.id for req in active])
136 peak_blocks = max(peak_blocks, scheduler.used_memory_blocks)
137 scheduler.finish_decode_step()
138 peak_blocks = max(peak_blocks, scheduler.used_memory_blocks)
139
140print(timeline[:3])
141print(f"peak blocks: {peak_blocks}/{scheduler.max_memory_blocks}")
142
143assert timeline[0] == ["req_A", "req_B", "req_C"]
144assert timeline[1] == ["req_A", "req_C", "req_D"]
145assert "req_E" in timeline[2]
146assert peak_blocks <= scheduler.max_memory_blocks
147
148# A tight KV-block budget forces an actual preemption before decode.
149pressure = ContinuousBatchScheduler(max_batch_size=2, max_memory_blocks=2)
150pressure.waiting.extend([
151 Request("urgent", prompt_len=16, max_new_tokens=2, priority=1),
152 Request("background", prompt_len=16, max_new_tokens=2, priority=0),
153])
154active_under_pressure = pressure.schedule_step()
155print([req.id for req in active_under_pressure], f"preemptions={pressure.preemptions}")
156assert [req.id for req in active_under_pressure] == ["urgent"]
157assert pressure.preemptions == 11[['req_A', 'req_B', 'req_C'], ['req_A', 'req_C', 'req_D'], ['req_A', 'req_D', 'req_E']]
2peak blocks: 3/16
3['urgent'] preemptions=1The key observation is that schedule_step() rebuilds the active set at every iteration, scanning the running requests and admitting queued work under the batch and KV-cache budgets.
In a real server, the control loop alternates between schedule_step() and finish_decode_step(): schedule the next forward pass, run it once, record the extra token and any newly needed KV block, then repeat. Real schedulers add more machinery on top of this loop, such as prefix-cache reuse, cancellation handling, speculative decoding, and token-budget accounting. A recomputation-based preemption path would drop the victim's KV cache and put it back into waiting instead of parking it in swapped. The tight-budget assertion proves that this example's preemption branch runs rather than merely existing in the code.
The ideas in this article show up across most modern inference stacks, but some entries below are production engines while others are research systems that shaped later engine design:
| Framework | What it emphasizes | Where it shines |
|---|---|---|
| vLLM[2] | PagedAttention and continuous batching for high-throughput serving | General-purpose high-throughput serving |
| TensorRT-LLM[4] | NVIDIA-optimized kernels, in-flight batching, paged KV cache | Teams standardized on the NVIDIA serving stack |
| SGLang[5] | Continuous batching plus aggressive prefix reuse (RadixAttention) | Workloads with repeated prefixes or structured generation |
| DistServe[3] | Explicit prefill/decode disaggregation and SLO-aware scheduling | Deployments where TTFT and TPOT targets dominate raw TPS |
Other servers implement many of the same ideas too. Feature matrices change quickly, so treat engine comparisons as moving targets and verify the current docs before making a production choice.
Benchmark numbers in this space are highly workload-dependent. The meaningful comparison isn't "Which server is fastest?" but "Which system stays within TTFT and TPOT targets for my traffic mix?"
The literature still gives useful anchor points:
| Source | What it measured | Why it matters |
|---|---|---|
| Orca[1] | Iteration-level scheduling against request-level baselines | Showed that per-iteration scheduling can improve throughput and latency together |
| vLLM[2] | PagedAttention plus continuous batching against prior serving systems | Reported 2-4x higher throughput at the same latency level compared with systems such as FasterTransformer and Orca in the paper's evaluated setups |
| DistServe[3] | Phase-disaggregated serving under TTFT/TPOT SLOs | Reported up to 7.4x more request capacity or 12.6x tighter SLOs while keeping more than 90% of requests within latency constraints |
Raw throughput can be misleading if a large fraction of requests miss their latency target. DistServe frames goodput as the maximum request rate that can be served while still meeting TTFT and TPOT objectives at a chosen SLO-attainment target.[3] A simple mental model is:
Unlike raw throughput (which counts all completed requests), goodput only counts requests that satisfy the latency target. In interview settings, this is the right framing: optimize tokens/sec only after you define the TTFT, TPOT, and end-to-end latency targets that matter for users.
Calculate goodput from an explicit latency objective rather than reporting only completions:
1window_seconds = 10
2latencies_ms = [
3 {"ttft": 120, "tpot": 25},
4 {"ttft": 450, "tpot": 22},
5 {"ttft": 180, "tpot": 42},
6 {"ttft": 190, "tpot": 27},
7]
8ttft_target, tpot_target = 300, 30
9within_slo = [
10 item for item in latencies_ms
11 if item["ttft"] <= ttft_target and item["tpot"] <= tpot_target
12]
13
14print(f"raw throughput: {len(latencies_ms) / window_seconds:.2f} req/s")
15print(f"goodput: {len(within_slo) / window_seconds:.2f} req/s")
16print(f"SLO-compliant requests: {len(within_slo)}/{len(latencies_ms)}")1raw throughput: 0.40 req/s
2goodput: 0.20 req/s
3SLO-compliant requests: 2/4One remaining bottleneck in naive continuous batching is prefill/decode interference. Sarathi-Serve addresses this with chunked-prefills plus decode-maximal batching.[6] A single long prefill can delay other requests waiting for a decode step, producing visible gaps in token output when the phase mix is poorly scheduled.
Chunked prefill splits a long prefill phase into smaller blocks (for example, 512-token or 1,024-token chunks). Sarathi then uses decode-maximal batching to pair one prefill chunk with as many decode requests as possible, so the chunk keeps the GPU compute units busy while the decodes "piggyback" on the same batch. This flow shows how chunking prevents latency spikes:
Without chunked prefill, a long prefill can occupy a long scheduling interval. During this window, concurrent decode requests wait for their next token. Users can see irregular token output even if average TBT looks acceptable.
With chunked prefill, the same 4,000-token document can be split into smaller pieces. With a 512-token chunk size, for example, the scheduler processes about 8 chunks and gains scheduling boundaries at which decode work can run. The long prompt may pay overhead from additional scheduling and kernel boundaries, but one admitted prefill chunk no longer consumes the entire prompt in one step.
Current vLLM V1 documentation describes chunked prefill as enabled by default: the scheduler prioritizes decode requests and uses the remaining max_num_batched_tokens budget for prefills, chunking a prefill that doesn't fit. Its tuning guidance notes the tradeoff: larger token budgets tend to improve TTFT and throughput, while smaller budgets can improve inter-token latency.[7]
Here is the same budget tradeoff as a small schedule:
1prompt_tokens = 4000
2chunk_size = 512
3chunks = []
4remaining = prompt_tokens
5while remaining:
6 current = min(chunk_size, remaining)
7 chunks.append(current)
8 remaining -= current
9
10print(f"chunk count: {len(chunks)}")
11print(f"first/last chunk: {chunks[0]}/{chunks[-1]} tokens")
12print(f"decode scheduling boundaries: {len(chunks) - 1}")1chunk count: 8
2first/last chunk: 512/416 tokens
3decode scheduling boundaries: 7A more advanced approach, disaggregated serving, splits prefill and decode across separate worker pools. Splitwise[8] and DistServe[3] study this design explicitly. The motivation is simple: prefill often wants dense matrix-math throughput, while decode often wants KV-cache capacity and memory bandwidth.
Disaggregation isn't the right default for every deployment. For short prompts, small clusters, or workloads with high prefix-cache hit rates, running prefill on the decode worker can be simpler than paying a KV-transfer hop. The right comparison is goodput under a defined TTFT and TPOT objective, including transfer overhead.
By decoupling these two phases, we can construct a specialized pipeline where requests are routed to the most appropriate hardware pool for their current state. The following diagram illustrates this disaggregated architecture:
Prefill workers: Can be sized and tuned for high matrix-math throughput.
Decode workers: Can be sized for KV-cache capacity and memory bandwidth without worrying about prefill interference.
Scaling: You can scale prefill and decode pools independently based on whether your traffic skews toward long prompts or long generations.
The KV cache (or equivalent prefill state) must be transferred between pools. Whether that transfer is worth it depends on prompt length, interconnect speed, and your TTFT/TPOT SLOs. DistServe is a good reminder that once you optimize for goodput instead of raw TPS, paying that transfer cost can still be the right tradeoff.[3]
Estimate the handoff bytes before assuming split pools improve latency:
1layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2
2prompt_tokens = 4096
3link_gib_per_second = 50
4kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * prompt_tokens
5transfer_ms = kv_bytes / (link_gib_per_second * 1024**3) * 1000
6
7print(f"KV handoff size: {kv_bytes / 1024**3:.2f} GiB")
8print(f"idealized transfer time at {link_gib_per_second} GiB/s: {transfer_ms:.1f} ms")
9print("production check: add protocol, queueing, and synchronization overhead")1KV handoff size: 1.25 GiB
2idealized transfer time at 50 GiB/s: 25.0 ms
3production check: add protocol, queueing, and synchronization overheadEven experienced engineers trip over a few predictable traps when tuning a continuous batching setup.
Symptom: You design the scheduler as if all requests in a batch must start and finish together.
Cause: Traditional model-serving systems often batch fixed-shape requests, so the batch is treated as one request-level unit of work.
Fix: Treat decoding as an iteration-level loop. Finished requests leave at the next boundary, cancellations free their KV blocks, and waiting requests enter when batch and token budgets allow.
Symptom: You increase max_batch_size and the server crashes with an out-of-memory error mid-generation.
Cause: You treated memory as a fixed cost per request, ignoring that the KV cache grows with every output token. A batch of ten short prompts fits easily, but if those prompts turn into long generations, the cache doubles or triples in size.
Fix: Tune by KV cache blocks or token budget, not by raw request count. Monitor peak memory during the longest generations in your workload, and leave headroom for bursts.
Symptom: Throughput benchmarks look great, but users complain that new requests take forever to start responding.
Cause: The scheduler is so focused on keeping the decode batch full that it delays admitting new prefills. A long queue of decode work starves the waiting queue.
Fix: Set a maximum decode-only iteration budget or a waiting-timeout. If a request sits in the queue longer than your TTFT target, preempt a low-priority decode stream and admit the new prefill. The right balance depends on your product: a chat app needs low TTFT, while a background summarization pipeline can tolerate a longer wait.
Here's a concrete exercise to test whether the concept has stuck.
Scenario: You have a GPU batch size of 3. Five support tickets arrive:
| Ticket | Prefill length | Expected output |
|---|---|---|
| T1 | 10 tokens | 20 tokens |
| T2 | 5 tokens | 40 tokens |
| T3 | 8 tokens | 15 tokens |
| T4 | 12 tokens | 30 tokens |
| T5 | 6 tokens | 10 tokens |
Task: Walk through the first 50 iterations by hand.
Answer sketch: T3 finishes first, at iteration 15, making room for T4. T1 finishes at iteration 20, making room for T5. T2 finishes at iteration 40. T5, which started at iteration 20, finishes at iteration 30. T4, which started at iteration 15, finishes at iteration 45. Static batching needs 40 iterations for the first batch and 30 for the second, totaling 70. Continuous batching finishes all five tickets in 45 iterations, saving 25 iterations (roughly 35% faster).
Check the hand calculation with a three-slot event simulation:
1import heapq
2
3durations = [20, 40, 15, 30, 10]
4slots = 3
5static_iterations = sum(max(durations[i:i + slots]) for i in range(0, len(durations), slots))
6
7active = durations[:slots]
8heapq.heapify(active)
9for duration in durations[slots:]:
10 available_at = heapq.heappop(active)
11 heapq.heappush(active, available_at + duration)
12continuous_iterations = max(active)
13
14print(f"static iterations: {static_iterations}")
15print(f"continuous iterations: {continuous_iterations}")
16print(f"iterations saved: {static_iterations - continuous_iterations}")1static iterations: 70
2continuous iterations: 45
3iterations saved: 25Static batching wastes capacity when request lengths differ. Finished requests leave idle holes until the slowest request completes.
Continuous batching is common in high-throughput LLM serving. It schedules at the iteration level, adding and removing requests dynamically.
Memory management and scheduling are inseparable. Dynamic admission only works well if KV-cache allocation is flexible.
Prefill/decode interference is often the next bottleneck. Chunked prefill and disaggregation both exist to keep decode latency smooth.
Metrics matter. Don't optimize only for tokens/sec. Optimize for Goodput, TTFT, and TPOT under your actual workload.
The evolution of LLM serving is a move toward finer-grained control: batch-level static scheduling, then iteration-level continuous scheduling, then chunked prefill to reduce interference, then phase-level disaggregation when prefill and decode need different hardware pools.
After working through this chapter, you should be able to:
Sarathi-Serve splits long prefills into smaller chunks, then builds decode-maximal batches around those chunks. One prefill chunk can run with as many decode requests as possible, so a single long prompt does not block every active stream for a large uninterrupted prefill window. The result is smoother TBT/TPOT and lower tail latency for concurrent users.[6]
There is no universal batch size. The practical limit depends on weight memory, KV-cache growth, prompt length distribution, output length distribution, and scheduler token budget. Profile on target hardware and choose the largest request or token budget that improves throughput without missing TTFT, TPOT, or end-to-end latency SLOs.
A canceled request should leave the running set at the next scheduler boundary. Its KV-cache pages should be returned to the memory pool immediately, so the scheduler can admit other work instead of carrying a dead sequence until the original batch would have finished.
Disaggregated inference routes prompt processing to a prefill pool and token generation to a decode pool. After prefill finishes, the KV state moves to a decode worker. That lets prefill workers optimize for dense matrix math and decode workers optimize for KV-cache capacity and memory bandwidth, at the cost of an extra state-transfer hop.[3]
Continuous batching works much better when KV caches are allocated in fixed-size blocks instead of large contiguous tensors. Requests start, grow, finish, and cancel at unpredictable times. PagedAttention maps logical token blocks to non-contiguous physical blocks, so the scheduler can admit new work without needing large contiguous VRAM regions.[2]
You now understand why LLM serving is fundamentally an iteration-level scheduling problem, not a request-level one. You can explain the difference between static and continuous batching, calculate the throughput gain from slot reuse, and identify why flexible memory management and careful prefill/decode scheduling are needed to keep latency smooth.
The next article, Scaling LLM Inference, builds directly on this foundation. It explains why decode is memory-bandwidth bound, how KV cache design and PagedAttention shape the batch size limit, and how speculative decoding adds another layer of throughput optimization. You'll need the mental model of the iteration loop and the token budget you built here to understand why speculative decoding does more than speed up individual requests; it changes the math for the whole batch.
Orca: A Distributed Serving System for Transformer-Based Generative Models.
Yu, G.-I., et al. · 2022 · OSDI 2022
Efficient Memory Management for Large Language Model Serving with PagedAttention.
Kwon, W., et al. · 2023 · SOSP 2023
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.
Zhong, Y., et al. · 2024 · OSDI 2024
TensorRT-LLM: A High-Performance Inference Framework for LLMs.
NVIDIA · 2024
SGLang: Efficient Execution of Structured Language Model Programs.
Zheng, L., et al. · 2023
Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.
Agrawal, A., et al. · 2023 · arXiv preprint
Optimization and Tuning.
vLLM · 2026
Splitwise: Efficient Generative LLM Inference Using Phase Splitting.
Patel, P., et al. · 2023