LearnInference & Production ScaleContinuous Batching & Scheduling

🚀HardInference Optimization

Continuous Batching & Scheduling

Understand how LLM schedulers use continuous batching, chunked prefill, and prefill-decode disaggregation to improve throughput without violating TTFT or inter-token latency targets.

34 min read

Learning path

Step 128 of 155 in the full curriculum

FlashAttention & Memory Efficiency Scaling LLM Inference

Continuous Batching & Scheduling

The previous chapter showed how FlashAttention reduces memory traffic inside one attention kernel. This chapter moves up one layer: how does an inference server decide which requests share each forward pass?

Continuous batching is a serving scheduler problem: keep the GPU busy while requests arrive with different prompt lengths, output lengths, and deadlines. This chapter shows how production LLM servers batch work without waiting for every request to finish at the same time.

Imagine an order-support chatbot on a busy shopping holiday. Some customers ask for a quick refund status (two sentences). Others paste their entire order history and ask for a detailed analysis (ten paragraphs). Hundreds of these requests hit the server at once.

A simple request-level baseline handles this with static batching. It fills a batch of requests, runs them together, and waits for the longest one to finish before starting the next batch. If one customer asks for a ten-paragraph analysis, completed short answers leave idle batch slots until the long generation ends.

Continuous batching^[1]^[2] fixes this waste by letting the server reuse a slot at the next token-generation step (the next iteration) after a request finishes. Under sustained mixed-length workloads, that can increase useful GPU work per step without changing the model itself. It pairs naturally with flexible Key-Value (KV) cache management, because admission still depends on available cache memory.

Static batching leaves finished request slots idle until the longest request ends, while continuous batching reuses slots at decode boundaries. — Continuous batching improves serving throughput by reusing finished request slots between decode steps.

Why Batching Is Tricky for LLMs

To understand why static batching is so wasteful, recall two facts about how LLMs generate text.

First, LLMs are autoregressive: they produce text one token at a time. Processing the input prompt (the prefill phase) can be done in parallel, but generating each new token (the decode phase) happens sequentially. A short refund question might need only ten decode steps. A long order analysis might need two hundred.

Second, the output token count usually isn't known ahead of time. The engine stops when it hits a stop token or a length limit. That means the scheduler can't know in advance when a request will finish.

These two facts make LLM serving fundamentally different from image-classification serving, where every request has the same fixed shape and finishes in a single forward pass.

The Problem with Static Batching

In fixed-shape model serving, batching is simpler because requests may complete in one forward pass. In LLM serving, requests have varying prompt and output lengths. Because output token count usually isn't known until generation stops (either at a stop token or a length limit), the engine can't know in advance when a request will finish.

With static batching, all requests in a batch must wait for the longest request to complete before the next batch can start. This synchronous execution model forces the engine to maintain the batch shape until each sequence is fully generated. The result is large idle capacity and, in pipelined deployments, pipeline bubbles:

Request	Tokens	Status
Request 1	100 tokens	Done at step 100, waits 300 more steps
Request 2	400 tokens	Finishes at step 400
Request 3	50 tokens	Done at step 50, waits 350 more steps

Most slots become idle as soon as their request finishes, so utilization can collapse in heterogeneous workloads. Requests 1 and 3 are done early, but their GPU memory stays reserved until Request 2 completes. No new requests can join because the batch shape is fixed. Throughput falls even though the hardware still has work it could be doing.

Continuous Batching (Iteration-Level Scheduling)

Continuous batching (also known as iteration-level scheduling) moves scheduling to the decode-step boundary. After each generation step (or after each prefill chunk in engines that chunk prefills), the server checks which requests finished, frees their slots, and admits new work if memory and token-budget constraints allow.

Static batching is like a delivery truck that waits until every warehouse order is packed before leaving the dock. Even if the first parcel was ready ten minutes ago, that slot stays blocked until the slowest parcel is ready. Continuous batching is like a cross-dock lane with fixed loading slots: as soon as one parcel leaves the belt, the next queued parcel takes its place. The lane stays full, maximizing useful work per step.

Completed requests leave immediately, and queued requests can join on the next iteration. The scheduler effectively rebuilds the active running set for the next forward pass.

Step	Batch Composition	Event
0	[Req1, Req2, Req3]	Initial batch
50	[Req1, Req2, Req4 (new)]	Req3 done, Req4 joins
100	[Req5 (new), Req2, Req4]	Req1 done, Req5 joins
400	[Req5, Req6 (new), Req4]	Req2 done, Req6 joins

A Worked Example: Three Support Tickets

Make the waste concrete with numbers. Suppose the GPU can run three requests at once, and three support tickets arrive:

Request	Task	Output needed
Ticket A	Refund status	10 tokens
Ticket B	Shipping delay explanation	50 tokens
Ticket C	Full order-history analysis	200 tokens

Static batching runs all three together. Ticket A finishes after 10 iterations, Ticket B after 50, and Ticket C after 200. For iterations 11 through 200, the slots for A and B sit idle. Total GPU iterations for this batch: 200. Wasted slot-iterations: 190 for A's slot plus 150 for B's slot = 340 idle slots.

Continuous batching starts with [A, B, C]. At iteration 10, Ticket A finishes and a new ticket (Ticket D) joins immediately. At iteration 50, Ticket B finishes and Ticket E joins. The GPU keeps all three slots busy until the queue empties. The only idle iterations happen when the waiting queue is empty.

In this tiny example, continuous batching fills slots that static batching would leave idle, so the server can answer substantially more customer questions per GPU hour without a hardware upgrade.

Under sustained load, slot reuse avoids idle work from completed short requests. Actual utilization still depends on token budgets, memory pressure, phase mix, and kernel efficiency.

This first calculation isolates slot reuse from GPU details. It treats each active output token as one useful slot-iteration:

static-versus-continuous-slot-use.py

output_lengths = [10, 50, 200]
batch_slots = len(output_lengths)
static_capacity = max(output_lengths) * batch_slots
useful_work = sum(output_lengths)
idle_work = static_capacity - useful_work

print(f"useful slot-iterations: {useful_work}")
print(f"static idle slot-iterations: {idle_work}")
print(f"static utilization: {useful_work / static_capacity:.1%}")

Output

useful slot-iterations: 260
static idle slot-iterations: 340
static utilization: 43.3%

Continuous batching timeline showing requests entering and leaving active decode slots at iteration boundaries. — At each iteration boundary, finished requests leave and queued requests can enter under the scheduler's memory and token budgets.

One subtle but important detail: modern schedulers are usually limited by a token budget or KV-cache budget, rather than a raw "number of requests" cap. One long prefill can consume more scheduler budget than many short decode-only requests.

Continuous batching pairs naturally with a flexible memory manager. Because requests come and go, their Key-Value (KV) caches cannot be stored efficiently in contiguous tensors. This is why PagedAttention (block-based memory management) is such a practical companion to continuous batching.

Memory Management: The Hidden Enabler

To understand why continuous batching was difficult to implement initially, we need to look at memory management. Recall that during the decode phase, the model stores a Key-Value (KV) cache: a running record of attention states for every token generated so far. Because requests leave and join the batch at unpredictable times, their KV caches can't be stored efficiently in contiguous tensors.

The Fragmentation Problem

In standard Transformer implementations, the KV cache for a request is often treated like a growing contiguous buffer sized for a maximum sequence length.

When requests have different lengths and arrive at different times:

Internal Fragmentation: You have to pre-allocate memory for max_seq_len (e.g., 4096 tokens) even if the request generates 50 tokens.
External Fragmentation: As requests finish and leave, they create variable-sized "holes" in GPU memory that new requests might not fit into cleanly.

The KV cache for a single sequence grows with every decode step. For a 70B-class model with GQA (80 layers, 8 KV heads, head dimension 128, FP16), each new token adds about 320 KiB of KV state. A 4,096-token sequence therefore needs roughly 1.25 GiB of cache. If the same model stored a full KV head set instead of GQA, the cache would balloon to roughly 10 GiB at 4K tokens. That's why fragmentation hurts so much: even modest over-allocation can strand gigabytes of VRAM that could have admitted other requests.

A useful size estimate follows this formula:

M_{KV} = 2 \times L \times s \times H_{kv} \times d \times b

Where $L$ is the number of layers, $s$ is the sequence length, $H_{kv}$ is the number of KV heads, $d$ is the head dimension, and $b$ is bytes per element (2 for FP16). The leading 2 accounts for both key and value tensors.

Use the formula directly before choosing a concurrency target:

kv-cache-budget.py

layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2
tokens = 4096
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
cache_bytes = bytes_per_token * tokens

print(f"KV state per token: {bytes_per_token / 1024:.0f} KiB")
print(f"KV state at {tokens:,} tokens: {cache_bytes / 1024**3:.2f} GiB")

Output

KV state per token: 320 KiB
KV state at 4,096 tokens: 1.25 GiB

PagedAttention Solution

PagedAttention^[2] solves this by borrowing the concept of virtual memory from operating systems, as illustrated below.

PagedAttention maps logical KV-cache blocks to non-contiguous physical GPU blocks through a small block table. — PagedAttention lets the scheduler allocate fixed-size KV blocks wherever space is available. Logical sequence order stays intact through the block table.

KV Blocks: The KV cache is divided into fixed-size blocks (e.g., 16 or 32 tokens).
Non-Contiguous Allocation: Blocks can be stored anywhere in GPU memory (VRAM). They don't need to be contiguous.
Block Table: Each request keeps a mapping table from logical page to physical block that tells the runtime where its KV blocks live.

This allows the scheduler to allocate memory dynamically as tokens are generated. The vLLM paper reports less than 4% waste in its evaluated allocation design, in contrast with substantially higher reservation waste in its baseline systems.^[2] Fixed blocks make dynamic admission practical without requiring large contiguous allocations.

This toy allocation computes the last-block slack for a few active sequences:

paged-kv-slack.py

block_size = 16
sequence_lengths = [17, 31, 48, 65]
reserved = sum(((length + block_size - 1) // block_size) * block_size for length in sequence_lengths)
used = sum(sequence_lengths)
slack = reserved - used

print(f"active tokens: {used}")
print(f"reserved block slots: {reserved}")
print(f"last-block slack: {slack} tokens ({slack / reserved:.1%})")

Output

active tokens: 161
reserved block slots: 192
last-block slack: 31 tokens (16.1%)

PagedAttention is a memory management technique for non-contiguous KV blocks. FlashAttention is a compute optimization that tiles attention to reduce High Bandwidth Memory (HBM) I/O. They are orthogonal techniques, but systems such as vLLM often combine them to improve both memory capacity and compute speed.

Prefill vs. Decode Phases

LLM serving has two distinct computational phases with opposing characteristics:

Prefill (Prompt Processing)

Processes all input tokens in parallel (like training).
Often compute-heavy because it processes many prompt tokens in matrix multiplications.
Runs once per request at the start.
Latency goal: Minimize Time To First Token (TTFT).

Decode (Token Generation)

Generates tokens one at a time (autoregressive).
Often memory-bandwidth limited at practical batch sizes because each step processes few new tokens while reading model and KV state.
Runs many times per request (once per output token).
Latency goal: Minimize inter-token latency. Many dashboards report this as TBT or TPOT, while papers such as DistServe define TPOT as the average time per output token after the first.^[3]

The following table summarizes the key differences between the prefill and decode phases:

Phase	Arithmetic Intensity	Bottleneck	Batching Strategy
Prefill	Often high	Frequently compute throughput	Batch prompt tokens
Decode	Often lower	Frequently memory bandwidth	Batch active streams

The Interference Problem

Blindly mixing a full long prefill with decode work can harm inter-token latency. A large prefill request (for example, processing a 4K-token document) can delay decode steps for concurrent streams. Later in the article we'll look at chunked prefill, which bounds how much prefill work competes in one scheduling step.

Token-budget scheduling makes that bound visible:

decode-first-token-budget.py

max_num_batched_tokens = 1024
decode_requests = 96
pending_prefill_tokens = 1800
prefill_budget = max_num_batched_tokens - decode_requests
scheduled_prefill = min(pending_prefill_tokens, prefill_budget)

print(f"decode tokens scheduled first: {decode_requests}")
print(f"prefill tokens scheduled now: {scheduled_prefill}")
print(f"prefill tokens left for later: {pending_prefill_tokens - scheduled_prefill}")

Output

decode tokens scheduled first: 96
prefill tokens scheduled now: 928
prefill tokens left for later: 872

Solutions include:

Chunked prefill: Break long prefills into chunks, interleave with decode steps.
Disaggregated prefill: Separate prefill and decode onto different GPU pools.^[3]
Priority scheduling: Prioritize decode steps to minimize latency.

Scheduling Policies

The scheduler's job is to decide which requests to run in the next iteration.

The scheduler is like a warehouse dispatch desk. First-Come-First-Served (FCFS) loads orders in arrival order, which is fair, but a one-item replacement waits behind a huge wholesale pallet. Shortest-Job-First (SJF) prioritizes quick orders first, which improves average wait time. Preemption is like pausing a low-priority batch to ship an urgent replacement, then resuming later. Each scheduling policy is a tradeoff between fairness, throughput, and latency.

First-Come-First-Served (FCFS)

It's the most straightforward approach and simple to reason about. Requests are admitted in arrival order, which is usually fair, but it isn't optimal for overall system throughput. A long-running request at the front of the queue can block many shorter requests behind it, increasing the average latency across the system.

Shortest-Job-First (SJF)

The scheduler prioritizes requests that are expected to require fewer output tokens. In theory, this can improve average latency because short responses (like a simple "Yes" or "No") don't get stuck behind a long refund-policy explanation. In practice, output length is unknown ahead of time, so pure SJF is rare in LLM serving. Real systems approximate it with heuristics such as prompt-length bucketing, tenant priorities, or token-budget limits.

This queue calculation shows the benefit SJF would have if output lengths were known. Production engines need estimates, priorities, or fairness controls because they don't know these lengths in advance:

shortest-job-first-bound.py

jobs = [("long", 50), ("quick", 5), ("medium", 20)]

def average_completion(order):
    elapsed = 0
    completion_times = []
    for _, tokens in order:
        elapsed += tokens
        completion_times.append(elapsed)
    return sum(completion_times) / len(completion_times)

fcfs = average_completion(jobs)
oracle_sjf = average_completion(sorted(jobs, key=lambda job: job[1]))
print(f"FCFS average completion: {fcfs:.1f} steps")
print(f"oracle SJF average completion: {oracle_sjf:.1f} steps")
print("production caveat: output length is not known up front")

Output

FCFS average completion: 60.0 steps
oracle SJF average completion: 35.0 steps
production caveat: output length is not known up front

Priority and Decode-First Scheduling

Many production schedulers care more about keeping existing streams smooth than about starting every new request immediately. A user notices TPOT jitter right away, while a modest TTFT increase on a newly arrived long prompt is often less damaging. That leads to policies such as decode-first scheduling, per-tenant priorities, and SLO-aware admission control.^[3]

Preemption Strategies

When GPU memory is full (KV cache blocks are exhausted), the scheduler must either pause work, evict state, or stop admitting new requests. Two common recovery strategies are:

Swap Out (CPU Offload): Move the victim request's KV cache from GPU VRAM to CPU RAM.
- Pros: Saves progress; cheap to resume if PCIe bandwidth is available.
- Cons: High latency penalty for the swapped-out request.
Recomputation: Drop the KV cache entirely and re-run the prefill phase when the request is rescheduled.
- Pros: Zero memory overhead on CPU; simple to implement.
- Cons: Wastes GPU compute; works best when the prompt is short (low prefill cost).

These are design options, not universal defaults. The cheaper choice depends on prompt length, host-device bandwidth, and how expensive replaying prefill would be. Recomputation can be better when prompts are short enough that rebuilding state costs less than moving KV blocks off the GPU. Swapping matters when discarded state would be expensive to rebuild. If you tune a particular engine, check its current scheduler documentation and trace preemption events rather than assuming one policy is active.

Here is a simplified Python simulation of a continuous batch scheduler. It demonstrates the core iteration loop, including consistent KV-block accounting as requests grow token by token and a swap-out decision when the next step can't fit. The schedule_step method is called before every forward pass, and finish_decode_step is called after that pass emits one token for each active request.

preemption-strategies.py

from dataclasses import dataclass

@dataclass
class Request:
    id: str
    prompt_len: int
    max_new_tokens: int
    priority: int = 0  # Higher value = higher priority
    generated_tokens: int = 0
    allocated_blocks: int = 0
    swapped_to_cpu: bool = False

    @property
    def total_tokens(self) -> int:
        return self.prompt_len + self.generated_tokens

    def is_finished(self) -> bool:
        """Check if request has reached EOS or max_new_tokens."""
        return self.generated_tokens >= self.max_new_tokens

class ContinuousBatchScheduler:
    """
    Simplified iteration-level scheduler.

    Real engines also track per-request token budgets, prefix-cache hits,
    EOS detection, and beam-search groups. This example focuses on the core
    loop: retire finished requests, resume paused work, admit new work, then
    preempt if the next decode step would overflow KV-cache capacity.
    """
    BLOCK_SIZE = 16

    def __init__(self, max_batch_size: int, max_memory_blocks: int):
        self.max_batch_size = max_batch_size
        self.max_memory_blocks = max_memory_blocks
        self.running: list[Request] = []  # Currently generating on GPU
        self.waiting: list[Request] = []  # Queued requests
        self.swapped: list[Request] = []  # Preempted to CPU RAM
        self.used_memory_blocks = 0
        self.preemptions = 0

    def schedule_step(self) -> list[Request]:
        # 1. Retire finished requests and free their KV cache.
        active = []
        for req in self.running:
            if req.is_finished():
                self._free_kv_cache(req)
            else:
                active.append(req)
        self.running = active

        # 2. Resume preempted requests first so they don't starve.
        self._admit(self.swapped, from_cpu=True)

        # 3. Admit fresh waiting requests.
        self._admit(self.waiting, from_cpu=False)

        # 4. Preempt only if the next decode step would exceed the safe budget.
        while self.running and not self._can_run_next_decode_step(self.running):
            victim = min(self.running, key=lambda r: (r.priority, r.generated_tokens))
            self.running.remove(victim)
            self._swap_out_to_cpu(victim)
            victim.swapped_to_cpu = True
            self.swapped.append(victim)
            self.preemptions += 1

        return self.running

    def finish_decode_step(self) -> None:
        # Call this after the model emits one token for each running request.
        for req in self.running:
            req.generated_tokens += 1
            required_blocks = self._blocks_for_tokens(req.total_tokens)
            growth = required_blocks - req.allocated_blocks
            if growth > 0:
                self.used_memory_blocks += growth
                req.allocated_blocks = required_blocks

    def _admit(self, queue: list[Request], from_cpu: bool) -> None:
        queue.sort(key=lambda r: r.priority, reverse=True)

        while (
            queue
            and len(self.running) < self.max_batch_size
            and self._can_allocate_current_state(queue[0])
        ):
            req = queue.pop(0)
            if from_cpu:
                self._swap_in_from_cpu(req)
            else:
                self._allocate_kv_cache(req)
            req.swapped_to_cpu = False
            self.running.append(req)

    def _blocks_for_tokens(self, token_count: int) -> int:
        return (token_count + self.BLOCK_SIZE - 1) // self.BLOCK_SIZE

    def _can_allocate_current_state(self, req: Request) -> bool:
        return self.used_memory_blocks + self._blocks_for_tokens(req.total_tokens) <= self.max_memory_blocks

    def _allocate_kv_cache(self, req: Request) -> None:
        req.allocated_blocks = self._blocks_for_tokens(req.total_tokens)
        self.used_memory_blocks += req.allocated_blocks

    def _free_kv_cache(self, req: Request) -> None:
        self.used_memory_blocks -= req.allocated_blocks
        req.allocated_blocks = 0

    def _swap_out_to_cpu(self, req: Request) -> None:
        self._free_kv_cache(req)

    def _swap_in_from_cpu(self, req: Request) -> None:
        self._allocate_kv_cache(req)

    def _projected_growth_for_next_token(self, req: Request) -> int:
        next_blocks = self._blocks_for_tokens(req.total_tokens + 1)
        return max(0, next_blocks - req.allocated_blocks)

    def _can_run_next_decode_step(self, batch: list[Request]) -> bool:
        growth = sum(self._projected_growth_for_next_token(req) for req in batch)
        return self.used_memory_blocks + growth <= self.max_memory_blocks

scheduler = ContinuousBatchScheduler(max_batch_size=3, max_memory_blocks=16)
scheduler.waiting.extend([
    Request("req_A", prompt_len=8, max_new_tokens=3),
    Request("req_B", prompt_len=8, max_new_tokens=1),
    Request("req_C", prompt_len=8, max_new_tokens=2),
    Request("req_D", prompt_len=8, max_new_tokens=2),
    Request("req_E", prompt_len=8, max_new_tokens=1),
])

timeline: list[list[str]] = []
peak_blocks = 0
for _ in range(6):
    active = scheduler.schedule_step()
    timeline.append([req.id for req in active])
    peak_blocks = max(peak_blocks, scheduler.used_memory_blocks)
    scheduler.finish_decode_step()
    peak_blocks = max(peak_blocks, scheduler.used_memory_blocks)

print(timeline[:3])
print(f"peak blocks: {peak_blocks}/{scheduler.max_memory_blocks}")

assert timeline[0] == ["req_A", "req_B", "req_C"]
assert timeline[1] == ["req_A", "req_C", "req_D"]
assert "req_E" in timeline[2]
assert peak_blocks <= scheduler.max_memory_blocks

# A tight KV-block budget forces an actual preemption before decode.
pressure = ContinuousBatchScheduler(max_batch_size=2, max_memory_blocks=2)
pressure.waiting.extend([
    Request("urgent", prompt_len=16, max_new_tokens=2, priority=1),
    Request("background", prompt_len=16, max_new_tokens=2, priority=0),
])
active_under_pressure = pressure.schedule_step()
print([req.id for req in active_under_pressure], f"preemptions={pressure.preemptions}")
assert [req.id for req in active_under_pressure] == ["urgent"]
assert pressure.preemptions == 1

Output

[['req_A', 'req_B', 'req_C'], ['req_A', 'req_C', 'req_D'], ['req_A', 'req_D', 'req_E']]
peak blocks: 3/16
['urgent'] preemptions=1

The key observation is that schedule_step() rebuilds the active set at every iteration, scanning the running requests and admitting queued work under the batch and KV-cache budgets.

In a real server, the control loop alternates between schedule_step() and finish_decode_step(): schedule the next forward pass, run it once, record the extra token and any newly needed KV block, then repeat. Real schedulers add more machinery on top of this loop, such as prefix-cache reuse, cancellation handling, speculative decoding, and token-budget accounting. A recomputation-based preemption path would drop the victim's KV cache and put it back into waiting instead of parking it in swapped. The tight-budget assertion proves that this example's preemption branch runs rather than merely existing in the code.

Serving Engines and System Designs

The ideas in this article show up across most modern inference stacks, but some entries below are production engines while others are research systems that shaped later engine design:

Framework	What it emphasizes	Where it shines
vLLM^[2]	PagedAttention and continuous batching for high-throughput serving	General-purpose high-throughput serving
TensorRT-LLM^[4]	NVIDIA-optimized kernels, in-flight batching, paged KV cache	Teams standardized on the NVIDIA serving stack
SGLang^[5]	Continuous batching plus aggressive prefix reuse (RadixAttention)	Workloads with repeated prefixes or structured generation
DistServe^[3]	Explicit prefill/decode disaggregation and SLO-aware scheduling	Deployments where TTFT and TPOT targets dominate raw TPS

Other servers implement many of the same ideas too. Feature matrices change quickly, so treat engine comparisons as moving targets and verify the current docs before making a production choice.

Throughput vs. Goodput

Benchmark numbers in this space are highly workload-dependent. The meaningful comparison isn't "Which server is fastest?" but "Which system stays within TTFT and TPOT targets for my traffic mix?"

The literature still gives useful anchor points:

Source	What it measured	Why it matters
Orca^[1]	Iteration-level scheduling against request-level baselines	Showed that per-iteration scheduling can improve throughput and latency together
vLLM^[2]	PagedAttention plus continuous batching against prior serving systems	Reported 2-4x higher throughput at the same latency level compared with systems such as FasterTransformer and Orca in the paper's evaluated setups
DistServe^[3]	Phase-disaggregated serving under TTFT/TPOT SLOs	Reported up to 7.4x more request capacity or 12.6x tighter SLOs while keeping more than 90% of requests within latency constraints

Goodput: The Production Metric

Raw throughput can be misleading if a large fraction of requests miss their latency target. DistServe frames goodput as the maximum request rate that can be served while still meeting TTFT and TPOT objectives at a chosen SLO-attainment target.^[3] A simple mental model is:

\text{Goodput} = \frac{\text{requests processed within SLA}}{\text{total time}}

Unlike raw throughput (which counts all completed requests), goodput only counts requests that satisfy the latency target. In interview settings, this is the right framing: optimize tokens/sec only after you define the TTFT, TPOT, and end-to-end latency targets that matter for users.

Calculate goodput from an explicit latency objective rather than reporting only completions:

goodput-under-slo.py

window_seconds = 10
latencies_ms = [
    {"ttft": 120, "tpot": 25},
    {"ttft": 450, "tpot": 22},
    {"ttft": 180, "tpot": 42},
    {"ttft": 190, "tpot": 27},
]
ttft_target, tpot_target = 300, 30
within_slo = [
    item for item in latencies_ms
    if item["ttft"] <= ttft_target and item["tpot"] <= tpot_target
]

print(f"raw throughput: {len(latencies_ms) / window_seconds:.2f} req/s")
print(f"goodput: {len(within_slo) / window_seconds:.2f} req/s")
print(f"SLO-compliant requests: {len(within_slo)}/{len(latencies_ms)}")

Output

raw throughput: 0.40 req/s
goodput: 0.20 req/s
SLO-compliant requests: 2/4

Advanced Scheduling: Chunked Prefill

One remaining bottleneck in naive continuous batching is prefill/decode interference. Sarathi-Serve addresses this with chunked-prefills plus decode-maximal batching.^[6] A single long prefill can delay other requests waiting for a decode step, producing visible gaps in token output when the phase mix is poorly scheduled.

Mechanism

Chunked prefill splits a long prefill phase into smaller blocks (for example, 512-token or 1,024-token chunks). Sarathi then uses decode-maximal batching to pair one prefill chunk with as many decode requests as possible, so the chunk keeps the GPU compute units busy while the decodes "piggyback" on the same batch. This flow shows how chunking prevents latency spikes:

Chunked prefill splits a long prompt into smaller chunks so decode streams can run between chunks instead of stalling behind one long prefill. — Chunked prefill trades one large latency spike for smaller prefill chunks interleaved with decode work. Existing streams keep receiving tokens.

Why This Matters

Without chunked prefill, a long prefill can occupy a long scheduling interval. During this window, concurrent decode requests wait for their next token. Users can see irregular token output even if average TBT looks acceptable.

With chunked prefill, the same 4,000-token document can be split into smaller pieces. With a 512-token chunk size, for example, the scheduler processes about 8 chunks and gains scheduling boundaries at which decode work can run. The long prompt may pay overhead from additional scheduling and kernel boundaries, but one admitted prefill chunk no longer consumes the entire prompt in one step.

Current vLLM V1 documentation describes chunked prefill as enabled by default: the scheduler prioritizes decode requests and uses the remaining max_num_batched_tokens budget for prefills, chunking a prefill that doesn't fit. Its tuning guidance notes the tradeoff: larger token budgets tend to improve TTFT and throughput, while smaller budgets can improve inter-token latency.^[7]

Here is the same budget tradeoff as a small schedule:

chunked-prefill-plan.py

prompt_tokens = 4000
chunk_size = 512
chunks = []
remaining = prompt_tokens
while remaining:
    current = min(chunk_size, remaining)
    chunks.append(current)
    remaining -= current

print(f"chunk count: {len(chunks)}")
print(f"first/last chunk: {chunks[0]}/{chunks[-1]} tokens")
print(f"decode scheduling boundaries: {len(chunks) - 1}")

Output

chunk count: 8
first/last chunk: 512/416 tokens
decode scheduling boundaries: 7

Disaggregated Serving

A more advanced approach, disaggregated serving, splits prefill and decode across separate worker pools. Splitwise^[8] and DistServe^[3] study this design explicitly. The motivation is simple: prefill often wants dense matrix-math throughput, while decode often wants KV-cache capacity and memory bandwidth.

Disaggregation isn't the right default for every deployment. For short prompts, small clusters, or workloads with high prefix-cache hit rates, running prefill on the decode worker can be simpler than paying a KV-transfer hop. The right comparison is goodput under a defined TTFT and TPOT objective, including transfer overhead.

Architecture

By decoupling these two phases, we can construct a specialized pipeline where requests are routed to the most appropriate hardware pool for their current state. The following diagram illustrates this disaggregated architecture:

Disaggregated serving routes prompts to prefill workers, transfers KV state, then streams tokens from decode workers tuned for memory bandwidth. — Disaggregation splits serving by phase. Prefill workers optimize dense prompt processing; decode workers optimize KV-cache capacity and token streaming.

Benefits

Prefill workers: Can be sized and tuned for high matrix-math throughput.
Decode workers: Can be sized for KV-cache capacity and memory bandwidth without worrying about prefill interference.
Scaling: You can scale prefill and decode pools independently based on whether your traffic skews toward long prompts or long generations.

Cost

The KV cache (or equivalent prefill state) must be transferred between pools. Whether that transfer is worth it depends on prompt length, interconnect speed, and your TTFT/TPOT SLOs. DistServe is a good reminder that once you optimize for goodput instead of raw TPS, paying that transfer cost can still be the right tradeoff.^[3]

Estimate the handoff bytes before assuming split pools improve latency:

kv-handoff-cost.py

layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2
prompt_tokens = 4096
link_gib_per_second = 50
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * prompt_tokens
transfer_ms = kv_bytes / (link_gib_per_second * 1024**3) * 1000

print(f"KV handoff size: {kv_bytes / 1024**3:.2f} GiB")
print(f"idealized transfer time at {link_gib_per_second} GiB/s: {transfer_ms:.1f} ms")
print("production check: add protocol, queueing, and synchronization overhead")

Output

KV handoff size: 1.25 GiB
idealized transfer time at 50 GiB/s: 25.0 ms
production check: add protocol, queueing, and synchronization overhead

Common Mistakes in Production

Even experienced engineers trip over a few predictable traps when tuning a continuous batching setup.

Static Batching Mindset

Symptom: You design the scheduler as if all requests in a batch must start and finish together.

Cause: Traditional model-serving systems often batch fixed-shape requests, so the batch is treated as one request-level unit of work.

Fix: Treat decoding as an iteration-level loop. Finished requests leave at the next boundary, cancellations free their KV blocks, and waiting requests enter when batch and token budgets allow.

The OOM Oversimplification

Symptom: You increase max_batch_size and the server crashes with an out-of-memory error mid-generation.

Cause: You treated memory as a fixed cost per request, ignoring that the KV cache grows with every output token. A batch of ten short prompts fits easily, but if those prompts turn into long generations, the cache doubles or triples in size.

Fix: Tune by KV cache blocks or token budget, not by raw request count. Monitor peak memory during the longest generations in your workload, and leave headroom for bursts.

Ignoring Time to First Token (TTFT)

Symptom: Throughput benchmarks look great, but users complain that new requests take forever to start responding.

Cause: The scheduler is so focused on keeping the decode batch full that it delays admitting new prefills. A long queue of decode work starves the waiting queue.

Fix: Set a maximum decode-only iteration budget or a waiting-timeout. If a request sits in the queue longer than your TTFT target, preempt a low-priority decode stream and admit the new prefill. The right balance depends on your product: a chat app needs low TTFT, while a background summarization pipeline can tolerate a longer wait.

Try It Yourself: The Manual Scheduler

Here's a concrete exercise to test whether the concept has stuck.

Scenario: You have a GPU batch size of 3. Five support tickets arrive:

Ticket	Prefill length	Expected output
T1	10 tokens	20 tokens
T2	5 tokens	40 tokens
T3	8 tokens	15 tokens
T4	12 tokens	30 tokens
T5	6 tokens	10 tokens

Task: Walk through the first 50 iterations by hand.

Which three tickets start in the initial batch?
At which iteration does the first ticket finish?
Which ticket replaces it?
How many total decode iterations would static batching require if the initial batch were [T1, T2, T3]?
How many iterations does continuous batching save?

Answer sketch: T3 finishes first, at iteration 15, making room for T4. T1 finishes at iteration 20, making room for T5. T2 finishes at iteration 40. T5, which started at iteration 20, finishes at iteration 30. T4, which started at iteration 15, finishes at iteration 45. Static batching needs 40 iterations for the first batch and 30 for the second, totaling 70. Continuous batching finishes all five tickets in 45 iterations, saving 25 iterations (roughly 35% faster).

Check the hand calculation with a three-slot event simulation:

manual-scheduler-answer.py

import heapq

durations = [20, 40, 15, 30, 10]
slots = 3
static_iterations = sum(max(durations[i:i + slots]) for i in range(0, len(durations), slots))

active = durations[:slots]
heapq.heapify(active)
for duration in durations[slots:]:
    available_at = heapq.heappop(active)
    heapq.heappush(active, available_at + duration)
continuous_iterations = max(active)

print(f"static iterations: {static_iterations}")
print(f"continuous iterations: {continuous_iterations}")
print(f"iterations saved: {static_iterations - continuous_iterations}")

Output

static iterations: 70
continuous iterations: 45
iterations saved: 25

Key takeaways

Static batching wastes capacity when request lengths differ. Finished requests leave idle holes until the slowest request completes.
Continuous batching is common in high-throughput LLM serving. It schedules at the iteration level, adding and removing requests dynamically.
Memory management and scheduling are inseparable. Dynamic admission only works well if KV-cache allocation is flexible.
Prefill/decode interference is often the next bottleneck. Chunked prefill and disaggregation both exist to keep decode latency smooth.
Metrics matter. Don't optimize only for tokens/sec. Optimize for Goodput, TTFT, and TPOT under your actual workload.

The evolution of LLM serving is a move toward finer-grained control: batch-level static scheduling, then iteration-level continuous scheduling, then chunked prefill to reduce interference, then phase-level disaggregation when prefill and decode need different hardware pools.

What you should be able to defend

After working through this chapter, you should be able to:

Explain static-batch waste. Show why heterogeneous output lengths create idle slots and, in pipelined deployments, pipeline bubbles.
Describe iteration-level scheduling. Walk through finished-request removal, queued-request admission, and active-set rebuilding at decode boundaries.
Analyze prefill/decode interference. Explain why compute-heavy prefills can stall memory-bound decode streams.
Connect scheduling to memory. Explain why PagedAttention and fixed-size KV blocks make continuous admission practical.
Compare policies. Discuss FCFS, priority scheduling, decode-first scheduling, swapping, and recomputation.
Use goodput correctly. Define goodput as request throughput under TTFT and TPOT/SLA targets, not raw tokens per second.

Production questions

How does Sarathi-Serve reduce prefill-decode interference?

Sarathi-Serve splits long prefills into smaller chunks, then builds decode-maximal batches around those chunks. One prefill chunk can run with as many decode requests as possible, so a single long prompt does not block every active stream for a large uninterrupted prefill window. The result is smoother TBT/TPOT and lower tail latency for concurrent users.^[6]

What is the optimal batch size for a given GPU?

There is no universal batch size. The practical limit depends on weight memory, KV-cache growth, prompt length distribution, output length distribution, and scheduler token budget. Profile on target hardware and choose the largest request or token budget that improves throughput without missing TTFT, TPOT, or end-to-end latency SLOs.

How should cancellations work mid-generation?

A canceled request should leave the running set at the next scheduler boundary. Its KV-cache pages should be returned to the memory pool immediately, so the scheduler can admit other work instead of carrying a dead sequence until the original batch would have finished.

How does disaggregated inference separate prefill and decode?

Disaggregated inference routes prompt processing to a prefill pool and token generation to a decode pool. After prefill finishes, the KV state moves to a decode worker. That lets prefill workers optimize for dense matrix math and decode workers optimize for KV-cache capacity and memory bandwidth, at the cost of an extra state-transfer hop.^[3]

Why is PagedAttention closely tied to continuous batching?

Continuous batching works much better when KV caches are allocated in fixed-size blocks instead of large contiguous tensors. Requests start, grow, finish, and cancel at unpredictable times. PagedAttention maps logical token blocks to non-contiguous physical blocks, so the scheduler can admit new work without needing large contiguous VRAM regions.^[2]

Course handoff

You now understand why LLM serving is fundamentally an iteration-level scheduling problem, not a request-level one. You can explain the difference between static and continuous batching, calculate the throughput gain from slot reuse, and identify why flexible memory management and careful prefill/decode scheduling are needed to keep latency smooth.

The next article, Scaling LLM Inference, builds directly on this foundation. It explains why decode is memory-bandwidth bound, how KV cache design and PagedAttention shape the batch size limit, and how speculative decoding adds another layer of throughput optimization. You'll need the mental model of the iteration loop and the token budget you built here to understand why speculative decoding does more than speed up individual requests; it changes the math for the whole batch.

Next Step

Continue to Scaling LLM Inference

Scheduling explains how requests share GPU time; scaling inference combines that with KV-cache memory, batching policy, and deployment topology to handle real traffic.

PreviousFlashAttention & Memory Efficiency

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

Orca: A Distributed Serving System for Transformer-Based Generative Models.

Yu, G.-I., et al. · 2022 · OSDI 2022

Efficient Memory Management for Large Language Model Serving with PagedAttention.

Kwon, W., et al. · 2023 · SOSP 2023

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.

Zhong, Y., et al. · 2024 · OSDI 2024

TensorRT-LLM: A High-Performance Inference Framework for LLMs.

NVIDIA · 2024

SGLang: Efficient Execution of Structured Language Model Programs.

Zheng, L., et al. · 2023

Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.

Agrawal, A., et al. · 2023 · arXiv preprint

Optimization and Tuning.

vLLM · 2026

Splitwise: Efficient Generative LLM Inference Using Phase Splitting.

Patel, P., et al. · 2023

Back to Topics

LearnInference & Production ScaleContinuous Batching & Scheduling

🚀HardInference Optimization

Continuous Batching & Scheduling

Understand how LLM schedulers use continuous batching, chunked prefill, and prefill-decode disaggregation to improve throughput without violating TTFT or inter-token latency targets.

34 min read

Learning path

Step 128 of 155 in the full curriculum

FlashAttention & Memory Efficiency Scaling LLM Inference

Continuous Batching & Scheduling

Why Batching Is Tricky for LLMs

To understand why static batching is so wasteful, recall two facts about how LLMs generate text.

These two facts make LLM serving fundamentally different from image-classification serving, where every request has the same fixed shape and finishes in a single forward pass.

The Problem with Static Batching

Request	Tokens	Status
Request 1	100 tokens	Done at step 100, waits 300 more steps
Request 2	400 tokens	Finishes at step 400
Request 3	50 tokens	Done at step 50, waits 350 more steps

Continuous Batching (Iteration-Level Scheduling)

Completed requests leave immediately, and queued requests can join on the next iteration. The scheduler effectively rebuilds the active running set for the next forward pass.

Step	Batch Composition	Event
0	[Req1, Req2, Req3]	Initial batch
50	[Req1, Req2, Req4 (new)]	Req3 done, Req4 joins
100	[Req5 (new), Req2, Req4]	Req1 done, Req5 joins
400	[Req5, Req6 (new), Req4]	Req2 done, Req6 joins

A Worked Example: Three Support Tickets

Make the waste concrete with numbers. Suppose the GPU can run three requests at once, and three support tickets arrive:

Request	Task	Output needed
Ticket A	Refund status	10 tokens
Ticket B	Shipping delay explanation	50 tokens
Ticket C	Full order-history analysis	200 tokens

In this tiny example, continuous batching fills slots that static batching would leave idle, so the server can answer substantially more customer questions per GPU hour without a hardware upgrade.

Under sustained load, slot reuse avoids idle work from completed short requests. Actual utilization still depends on token budgets, memory pressure, phase mix, and kernel efficiency.

This first calculation isolates slot reuse from GPU details. It treats each active output token as one useful slot-iteration:

static-versus-continuous-slot-use.py

output_lengths = [10, 50, 200]
batch_slots = len(output_lengths)
static_capacity = max(output_lengths) * batch_slots
useful_work = sum(output_lengths)
idle_work = static_capacity - useful_work

print(f"useful slot-iterations: {useful_work}")
print(f"static idle slot-iterations: {idle_work}")
print(f"static utilization: {useful_work / static_capacity:.1%}")

Output

useful slot-iterations: 260
static idle slot-iterations: 340
static utilization: 43.3%

Memory Management: The Hidden Enabler

The Fragmentation Problem

In standard Transformer implementations, the KV cache for a request is often treated like a growing contiguous buffer sized for a maximum sequence length.

When requests have different lengths and arrive at different times:

Internal Fragmentation: You have to pre-allocate memory for max_seq_len (e.g., 4096 tokens) even if the request generates 50 tokens.
External Fragmentation: As requests finish and leave, they create variable-sized "holes" in GPU memory that new requests might not fit into cleanly.

A useful size estimate follows this formula:

M_{KV} = 2 \times L \times s \times H_{kv} \times d \times b

Use the formula directly before choosing a concurrency target:

kv-cache-budget.py

layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2
tokens = 4096
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
cache_bytes = bytes_per_token * tokens

print(f"KV state per token: {bytes_per_token / 1024:.0f} KiB")
print(f"KV state at {tokens:,} tokens: {cache_bytes / 1024**3:.2f} GiB")

Output

KV state per token: 320 KiB
KV state at 4,096 tokens: 1.25 GiB

PagedAttention Solution

PagedAttention^[2] solves this by borrowing the concept of virtual memory from operating systems, as illustrated below.

KV Blocks: The KV cache is divided into fixed-size blocks (e.g., 16 or 32 tokens).
Non-Contiguous Allocation: Blocks can be stored anywhere in GPU memory (VRAM). They don't need to be contiguous.
Block Table: Each request keeps a mapping table from logical page to physical block that tells the runtime where its KV blocks live.

This toy allocation computes the last-block slack for a few active sequences:

paged-kv-slack.py

block_size = 16
sequence_lengths = [17, 31, 48, 65]
reserved = sum(((length + block_size - 1) // block_size) * block_size for length in sequence_lengths)
used = sum(sequence_lengths)
slack = reserved - used

print(f"active tokens: {used}")
print(f"reserved block slots: {reserved}")
print(f"last-block slack: {slack} tokens ({slack / reserved:.1%})")

Output

active tokens: 161
reserved block slots: 192
last-block slack: 31 tokens (16.1%)

Prefill vs. Decode Phases

LLM serving has two distinct computational phases with opposing characteristics:

Prefill (Prompt Processing)

Processes all input tokens in parallel (like training).
Often compute-heavy because it processes many prompt tokens in matrix multiplications.
Runs once per request at the start.
Latency goal: Minimize Time To First Token (TTFT).

Decode (Token Generation)

Generates tokens one at a time (autoregressive).
Often memory-bandwidth limited at practical batch sizes because each step processes few new tokens while reading model and KV state.
Runs many times per request (once per output token).
Latency goal: Minimize inter-token latency. Many dashboards report this as TBT or TPOT, while papers such as DistServe define TPOT as the average time per output token after the first.^[3]

The following table summarizes the key differences between the prefill and decode phases:

Phase	Arithmetic Intensity	Bottleneck	Batching Strategy
Prefill	Often high	Frequently compute throughput	Batch prompt tokens
Decode	Often lower	Frequently memory bandwidth	Batch active streams

The Interference Problem

Token-budget scheduling makes that bound visible:

decode-first-token-budget.py

max_num_batched_tokens = 1024
decode_requests = 96
pending_prefill_tokens = 1800
prefill_budget = max_num_batched_tokens - decode_requests
scheduled_prefill = min(pending_prefill_tokens, prefill_budget)

print(f"decode tokens scheduled first: {decode_requests}")
print(f"prefill tokens scheduled now: {scheduled_prefill}")
print(f"prefill tokens left for later: {pending_prefill_tokens - scheduled_prefill}")

Output

decode tokens scheduled first: 96
prefill tokens scheduled now: 928
prefill tokens left for later: 872

Solutions include:

Chunked prefill: Break long prefills into chunks, interleave with decode steps.
Disaggregated prefill: Separate prefill and decode onto different GPU pools.^[3]
Priority scheduling: Prioritize decode steps to minimize latency.

Scheduling Policies

The scheduler's job is to decide which requests to run in the next iteration.

First-Come-First-Served (FCFS)

Shortest-Job-First (SJF)

shortest-job-first-bound.py

jobs = [("long", 50), ("quick", 5), ("medium", 20)]

def average_completion(order):
    elapsed = 0
    completion_times = []
    for _, tokens in order:
        elapsed += tokens
        completion_times.append(elapsed)
    return sum(completion_times) / len(completion_times)

fcfs = average_completion(jobs)
oracle_sjf = average_completion(sorted(jobs, key=lambda job: job[1]))
print(f"FCFS average completion: {fcfs:.1f} steps")
print(f"oracle SJF average completion: {oracle_sjf:.1f} steps")
print("production caveat: output length is not known up front")

Output

FCFS average completion: 60.0 steps
oracle SJF average completion: 35.0 steps
production caveat: output length is not known up front

Priority and Decode-First Scheduling

Preemption Strategies

When GPU memory is full (KV cache blocks are exhausted), the scheduler must either pause work, evict state, or stop admitting new requests. Two common recovery strategies are:

Swap Out (CPU Offload): Move the victim request's KV cache from GPU VRAM to CPU RAM.
- Pros: Saves progress; cheap to resume if PCIe bandwidth is available.
- Cons: High latency penalty for the swapped-out request.
Recomputation: Drop the KV cache entirely and re-run the prefill phase when the request is rescheduled.
- Pros: Zero memory overhead on CPU; simple to implement.
- Cons: Wastes GPU compute; works best when the prompt is short (low prefill cost).

preemption-strategies.py

from dataclasses import dataclass

@dataclass
class Request:
    id: str
    prompt_len: int
    max_new_tokens: int
    priority: int = 0  # Higher value = higher priority
    generated_tokens: int = 0
    allocated_blocks: int = 0
    swapped_to_cpu: bool = False

    @property
    def total_tokens(self) -> int:
        return self.prompt_len + self.generated_tokens

    def is_finished(self) -> bool:
        """Check if request has reached EOS or max_new_tokens."""
        return self.generated_tokens >= self.max_new_tokens

class ContinuousBatchScheduler:
    """
    Simplified iteration-level scheduler.

    Real engines also track per-request token budgets, prefix-cache hits,
    EOS detection, and beam-search groups. This example focuses on the core
    loop: retire finished requests, resume paused work, admit new work, then
    preempt if the next decode step would overflow KV-cache capacity.
    """
    BLOCK_SIZE = 16

    def __init__(self, max_batch_size: int, max_memory_blocks: int):
        self.max_batch_size = max_batch_size
        self.max_memory_blocks = max_memory_blocks
        self.running: list[Request] = []  # Currently generating on GPU
        self.waiting: list[Request] = []  # Queued requests
        self.swapped: list[Request] = []  # Preempted to CPU RAM
        self.used_memory_blocks = 0
        self.preemptions = 0

    def schedule_step(self) -> list[Request]:
        # 1. Retire finished requests and free their KV cache.
        active = []
        for req in self.running:
            if req.is_finished():
                self._free_kv_cache(req)
            else:
                active.append(req)
        self.running = active

        # 2. Resume preempted requests first so they don't starve.
        self._admit(self.swapped, from_cpu=True)

        # 3. Admit fresh waiting requests.
        self._admit(self.waiting, from_cpu=False)

        # 4. Preempt only if the next decode step would exceed the safe budget.
        while self.running and not self._can_run_next_decode_step(self.running):
            victim = min(self.running, key=lambda r: (r.priority, r.generated_tokens))
            self.running.remove(victim)
            self._swap_out_to_cpu(victim)
            victim.swapped_to_cpu = True
            self.swapped.append(victim)
            self.preemptions += 1

        return self.running

    def finish_decode_step(self) -> None:
        # Call this after the model emits one token for each running request.
        for req in self.running:
            req.generated_tokens += 1
            required_blocks = self._blocks_for_tokens(req.total_tokens)
            growth = required_blocks - req.allocated_blocks
            if growth > 0:
                self.used_memory_blocks += growth
                req.allocated_blocks = required_blocks

    def _admit(self, queue: list[Request], from_cpu: bool) -> None:
        queue.sort(key=lambda r: r.priority, reverse=True)

        while (
            queue
            and len(self.running) < self.max_batch_size
            and self._can_allocate_current_state(queue[0])
        ):
            req = queue.pop(0)
            if from_cpu:
                self._swap_in_from_cpu(req)
            else:
                self._allocate_kv_cache(req)
            req.swapped_to_cpu = False
            self.running.append(req)

    def _blocks_for_tokens(self, token_count: int) -> int:
        return (token_count + self.BLOCK_SIZE - 1) // self.BLOCK_SIZE

    def _can_allocate_current_state(self, req: Request) -> bool:
        return self.used_memory_blocks + self._blocks_for_tokens(req.total_tokens) <= self.max_memory_blocks

    def _allocate_kv_cache(self, req: Request) -> None:
        req.allocated_blocks = self._blocks_for_tokens(req.total_tokens)
        self.used_memory_blocks += req.allocated_blocks

    def _free_kv_cache(self, req: Request) -> None:
        self.used_memory_blocks -= req.allocated_blocks
        req.allocated_blocks = 0

    def _swap_out_to_cpu(self, req: Request) -> None:
        self._free_kv_cache(req)

    def _swap_in_from_cpu(self, req: Request) -> None:
        self._allocate_kv_cache(req)

    def _projected_growth_for_next_token(self, req: Request) -> int:
        next_blocks = self._blocks_for_tokens(req.total_tokens + 1)
        return max(0, next_blocks - req.allocated_blocks)

    def _can_run_next_decode_step(self, batch: list[Request]) -> bool:
        growth = sum(self._projected_growth_for_next_token(req) for req in batch)
        return self.used_memory_blocks + growth <= self.max_memory_blocks

scheduler = ContinuousBatchScheduler(max_batch_size=3, max_memory_blocks=16)
scheduler.waiting.extend([
    Request("req_A", prompt_len=8, max_new_tokens=3),
    Request("req_B", prompt_len=8, max_new_tokens=1),
    Request("req_C", prompt_len=8, max_new_tokens=2),
    Request("req_D", prompt_len=8, max_new_tokens=2),
    Request("req_E", prompt_len=8, max_new_tokens=1),
])

timeline: list[list[str]] = []
peak_blocks = 0
for _ in range(6):
    active = scheduler.schedule_step()
    timeline.append([req.id for req in active])
    peak_blocks = max(peak_blocks, scheduler.used_memory_blocks)
    scheduler.finish_decode_step()
    peak_blocks = max(peak_blocks, scheduler.used_memory_blocks)

print(timeline[:3])
print(f"peak blocks: {peak_blocks}/{scheduler.max_memory_blocks}")

assert timeline[0] == ["req_A", "req_B", "req_C"]
assert timeline[1] == ["req_A", "req_C", "req_D"]
assert "req_E" in timeline[2]
assert peak_blocks <= scheduler.max_memory_blocks

# A tight KV-block budget forces an actual preemption before decode.
pressure = ContinuousBatchScheduler(max_batch_size=2, max_memory_blocks=2)
pressure.waiting.extend([
    Request("urgent", prompt_len=16, max_new_tokens=2, priority=1),
    Request("background", prompt_len=16, max_new_tokens=2, priority=0),
])
active_under_pressure = pressure.schedule_step()
print([req.id for req in active_under_pressure], f"preemptions={pressure.preemptions}")
assert [req.id for req in active_under_pressure] == ["urgent"]
assert pressure.preemptions == 1

Output

[['req_A', 'req_B', 'req_C'], ['req_A', 'req_C', 'req_D'], ['req_A', 'req_D', 'req_E']]
peak blocks: 3/16
['urgent'] preemptions=1

The key observation is that schedule_step() rebuilds the active set at every iteration, scanning the running requests and admitting queued work under the batch and KV-cache budgets.

Serving Engines and System Designs

The ideas in this article show up across most modern inference stacks, but some entries below are production engines while others are research systems that shaped later engine design:

Framework	What it emphasizes	Where it shines
vLLM^[2]	PagedAttention and continuous batching for high-throughput serving	General-purpose high-throughput serving
TensorRT-LLM^[4]	NVIDIA-optimized kernels, in-flight batching, paged KV cache	Teams standardized on the NVIDIA serving stack
SGLang^[5]	Continuous batching plus aggressive prefix reuse (RadixAttention)	Workloads with repeated prefixes or structured generation
DistServe^[3]	Explicit prefill/decode disaggregation and SLO-aware scheduling	Deployments where TTFT and TPOT targets dominate raw TPS

Other servers implement many of the same ideas too. Feature matrices change quickly, so treat engine comparisons as moving targets and verify the current docs before making a production choice.

Throughput vs. Goodput

Benchmark numbers in this space are highly workload-dependent. The meaningful comparison isn't "Which server is fastest?" but "Which system stays within TTFT and TPOT targets for my traffic mix?"

The literature still gives useful anchor points:

Source	What it measured	Why it matters
Orca^[1]	Iteration-level scheduling against request-level baselines	Showed that per-iteration scheduling can improve throughput and latency together
vLLM^[2]	PagedAttention plus continuous batching against prior serving systems	Reported 2-4x higher throughput at the same latency level compared with systems such as FasterTransformer and Orca in the paper's evaluated setups
DistServe^[3]	Phase-disaggregated serving under TTFT/TPOT SLOs	Reported up to 7.4x more request capacity or 12.6x tighter SLOs while keeping more than 90% of requests within latency constraints

Goodput: The Production Metric

\text{Goodput} = \frac{\text{requests processed within SLA}}{\text{total time}}

Calculate goodput from an explicit latency objective rather than reporting only completions:

goodput-under-slo.py

window_seconds = 10
latencies_ms = [
    {"ttft": 120, "tpot": 25},
    {"ttft": 450, "tpot": 22},
    {"ttft": 180, "tpot": 42},
    {"ttft": 190, "tpot": 27},
]
ttft_target, tpot_target = 300, 30
within_slo = [
    item for item in latencies_ms
    if item["ttft"] <= ttft_target and item["tpot"] <= tpot_target
]

print(f"raw throughput: {len(latencies_ms) / window_seconds:.2f} req/s")
print(f"goodput: {len(within_slo) / window_seconds:.2f} req/s")
print(f"SLO-compliant requests: {len(within_slo)}/{len(latencies_ms)}")

Output

raw throughput: 0.40 req/s
goodput: 0.20 req/s
SLO-compliant requests: 2/4

Advanced Scheduling: Chunked Prefill

Mechanism

Why This Matters

Here is the same budget tradeoff as a small schedule:

chunked-prefill-plan.py

prompt_tokens = 4000
chunk_size = 512
chunks = []
remaining = prompt_tokens
while remaining:
    current = min(chunk_size, remaining)
    chunks.append(current)
    remaining -= current

print(f"chunk count: {len(chunks)}")
print(f"first/last chunk: {chunks[0]}/{chunks[-1]} tokens")
print(f"decode scheduling boundaries: {len(chunks) - 1}")

Output

chunk count: 8
first/last chunk: 512/416 tokens
decode scheduling boundaries: 7

Disaggregated Serving

Architecture

Benefits

Prefill workers: Can be sized and tuned for high matrix-math throughput.
Decode workers: Can be sized for KV-cache capacity and memory bandwidth without worrying about prefill interference.
Scaling: You can scale prefill and decode pools independently based on whether your traffic skews toward long prompts or long generations.

Cost

Estimate the handoff bytes before assuming split pools improve latency:

kv-handoff-cost.py

layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2
prompt_tokens = 4096
link_gib_per_second = 50
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * prompt_tokens
transfer_ms = kv_bytes / (link_gib_per_second * 1024**3) * 1000

print(f"KV handoff size: {kv_bytes / 1024**3:.2f} GiB")
print(f"idealized transfer time at {link_gib_per_second} GiB/s: {transfer_ms:.1f} ms")
print("production check: add protocol, queueing, and synchronization overhead")

Output

KV handoff size: 1.25 GiB
idealized transfer time at 50 GiB/s: 25.0 ms
production check: add protocol, queueing, and synchronization overhead

Common Mistakes in Production

Even experienced engineers trip over a few predictable traps when tuning a continuous batching setup.

Static Batching Mindset

Symptom: You design the scheduler as if all requests in a batch must start and finish together.

Cause: Traditional model-serving systems often batch fixed-shape requests, so the batch is treated as one request-level unit of work.

Fix: Treat decoding as an iteration-level loop. Finished requests leave at the next boundary, cancellations free their KV blocks, and waiting requests enter when batch and token budgets allow.

The OOM Oversimplification

Symptom: You increase max_batch_size and the server crashes with an out-of-memory error mid-generation.

Fix: Tune by KV cache blocks or token budget, not by raw request count. Monitor peak memory during the longest generations in your workload, and leave headroom for bursts.

Ignoring Time to First Token (TTFT)

Symptom: Throughput benchmarks look great, but users complain that new requests take forever to start responding.

Cause: The scheduler is so focused on keeping the decode batch full that it delays admitting new prefills. A long queue of decode work starves the waiting queue.

Try It Yourself: The Manual Scheduler

Here's a concrete exercise to test whether the concept has stuck.

Scenario: You have a GPU batch size of 3. Five support tickets arrive:

Ticket	Prefill length	Expected output
T1	10 tokens	20 tokens
T2	5 tokens	40 tokens
T3	8 tokens	15 tokens
T4	12 tokens	30 tokens
T5	6 tokens	10 tokens

Task: Walk through the first 50 iterations by hand.

Which three tickets start in the initial batch?
At which iteration does the first ticket finish?
Which ticket replaces it?
How many total decode iterations would static batching require if the initial batch were [T1, T2, T3]?
How many iterations does continuous batching save?

Check the hand calculation with a three-slot event simulation:

manual-scheduler-answer.py

import heapq

durations = [20, 40, 15, 30, 10]
slots = 3
static_iterations = sum(max(durations[i:i + slots]) for i in range(0, len(durations), slots))

active = durations[:slots]
heapq.heapify(active)
for duration in durations[slots:]:
    available_at = heapq.heappop(active)
    heapq.heappush(active, available_at + duration)
continuous_iterations = max(active)

print(f"static iterations: {static_iterations}")
print(f"continuous iterations: {continuous_iterations}")
print(f"iterations saved: {static_iterations - continuous_iterations}")

Output

static iterations: 70
continuous iterations: 45
iterations saved: 25

Key takeaways

Static batching wastes capacity when request lengths differ. Finished requests leave idle holes until the slowest request completes.
Continuous batching is common in high-throughput LLM serving. It schedules at the iteration level, adding and removing requests dynamically.
Memory management and scheduling are inseparable. Dynamic admission only works well if KV-cache allocation is flexible.
Prefill/decode interference is often the next bottleneck. Chunked prefill and disaggregation both exist to keep decode latency smooth.
Metrics matter. Don't optimize only for tokens/sec. Optimize for Goodput, TTFT, and TPOT under your actual workload.

What you should be able to defend

After working through this chapter, you should be able to:

Explain static-batch waste. Show why heterogeneous output lengths create idle slots and, in pipelined deployments, pipeline bubbles.
Describe iteration-level scheduling. Walk through finished-request removal, queued-request admission, and active-set rebuilding at decode boundaries.
Analyze prefill/decode interference. Explain why compute-heavy prefills can stall memory-bound decode streams.
Connect scheduling to memory. Explain why PagedAttention and fixed-size KV blocks make continuous admission practical.
Compare policies. Discuss FCFS, priority scheduling, decode-first scheduling, swapping, and recomputation.
Use goodput correctly. Define goodput as request throughput under TTFT and TPOT/SLA targets, not raw tokens per second.

Production questions

How does Sarathi-Serve reduce prefill-decode interference?

What is the optimal batch size for a given GPU?

How should cancellations work mid-generation?

How does disaggregated inference separate prefill and decode?

Why is PagedAttention closely tied to continuous batching?

Course handoff

Next Step

Continue to Scaling LLM Inference

Scheduling explains how requests share GPU time; scaling inference combines that with KV-cache memory, batching policy, and deployment topology to handle real traffic.

PreviousFlashAttention & Memory Efficiency

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

Orca: A Distributed Serving System for Transformer-Based Generative Models.

Yu, G.-I., et al. · 2022 · OSDI 2022

Efficient Memory Management for Large Language Model Serving with PagedAttention.

Kwon, W., et al. · 2023 · SOSP 2023

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.

Zhong, Y., et al. · 2024 · OSDI 2024

TensorRT-LLM: A High-Performance Inference Framework for LLMs.

NVIDIA · 2024

SGLang: Efficient Execution of Structured Language Model Programs.

Zheng, L., et al. · 2023

Sarathi-Serve: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.

Agrawal, A., et al. · 2023 · arXiv preprint

Optimization and Tuning.

vLLM · 2026

Splitwise: Efficient Generative LLM Inference Using Phase Splitting.

Patel, P., et al. · 2023

Continuous Batching & Scheduling

Continuous Batching & Scheduling

Why Batching Is Tricky for LLMs

The Problem with Static Batching

Continuous Batching (Iteration-Level Scheduling)

A Worked Example: Three Support Tickets

Memory Management: The Hidden Enabler

The Fragmentation Problem

Which term makes KV-cache memory grow during decoding?

PagedAttention Solution

Prefill vs. Decode Phases

Prefill (Prompt Processing)

Decode (Token Generation)

The Interference Problem

Scheduling Policies

First-Come-First-Served (FCFS)

Shortest-Job-First (SJF)

Priority and Decode-First Scheduling

Preemption Strategies

Why does continuous batching usually schedule by token or KV-cache budget instead of only request count?

Serving Engines and System Designs

Throughput vs. Goodput

Goodput: The Production Metric

Why can higher throughput still be worse goodput?

Advanced Scheduling: Chunked Prefill

Mechanism

Why This Matters

Disaggregated Serving

Architecture

Benefits

Cost

Common Mistakes in Production

Static Batching Mindset

The OOM Oversimplification

Ignoring Time to First Token (TTFT)

Try It Yourself: The Manual Scheduler

Key takeaways

What you should be able to defend

Production questions

How does Sarathi-Serve reduce prefill-decode interference?

What is the optimal batch size for a given GPU?

How should cancellations work mid-generation?

How does disaggregated inference separate prefill and decode?

Why is PagedAttention closely tied to continuous batching?

Course handoff

Continuous Batching & Scheduling

Continuous Batching & Scheduling

Why Batching Is Tricky for LLMs

The Problem with Static Batching

Continuous Batching (Iteration-Level Scheduling)

A Worked Example: Three Support Tickets

Memory Management: The Hidden Enabler

The Fragmentation Problem

Which term makes KV-cache memory grow during decoding?

PagedAttention Solution

Prefill vs. Decode Phases

Prefill (Prompt Processing)

Decode (Token Generation)

The Interference Problem

Scheduling Policies

First-Come-First-Served (FCFS)

Shortest-Job-First (SJF)

Priority and Decode-First Scheduling

Preemption Strategies

Why does continuous batching usually schedule by token or KV-cache budget instead of only request count?

Serving Engines and System Designs

Throughput vs. Goodput

Goodput: The Production Metric

Why can higher throughput still be worse goodput?

Advanced Scheduling: Chunked Prefill

Mechanism

Why This Matters

Disaggregated Serving

Architecture

Benefits

Cost

Common Mistakes in Production

Static Batching Mindset

The OOM Oversimplification

Ignoring Time to First Token (TTFT)