Scaling LLM Inference

Explains why decode-heavy LLM serving is often memory-bound and how KV-cache design, batching, PagedAttention, and speculative decoding improve scale.

The previous chapter showed how continuous batching keeps decode slots useful. This chapter asks the capacity question for scaling large language model (LLM) inference: when requests, model weights, and KV state all compete for HBM, what actually limits serving concurrency?

Imagine you run an online store and you want a chatbot that answers "Where's my order?" Every time a customer asks, the model has to generate a response one word at a time, and each word takes a noticeable moment. The delay isn't because the model is "thinking." During decode, the serving stack keeps rereading billions of model weights from GPU memory and consulting a growing KV cache, and that memory movement takes time. Picture a fulfillment line where the routing map has to be reloaded for every single item. Reading the pick list is fast. Reopening the same giant routing map one item at a time is painfully slow.

This article ties together prefill, decode, batching, KV-cache memory, PagedAttention, disaggregation, speculative decoding, and quantization so serving bottlenecks become measurable instead of mysterious. The thread running through all of them is one decision: where to sit on the throughput, latency, and cost triangle for your workload. For the scheduling loop behind this chapter, see continuous batching.

The two phases of generation

LLM inference is distinct from training because it consists of two radically different computational phases: Prefill and Decode. Understanding this distinction is the first step to optimization.

Think of our order-tracking bot. When a customer sends "Where is order 48291?", the system first has to read and understand that entire sentence. That's the prefill phase. Then it starts answering, generating one word at a time: "Your", "order", "is", "in", "transit." That's the decode phase.

Prefill: reading the prompt in one go

In the prefill phase, the model processes the entire user prompt in parallel. This is similar to training: the GPU receives a matrix of shape [batch_size, prompt_len, hidden_dim] and computes attention for all tokens simultaneously.

Because all the input tokens are known upfront, the attention mechanism can compute the interactions between every token in the prompt at once. This parallel processing allows the GPU to use its massive matrix multiplication engines efficiently. A long prefill usually dominates Time To First Token (TTFT), though TTFT also includes queueing and scheduling delay before the first output token is emitted. The figure below shows how all tokens in the prompt are processed simultaneously to generate the first output token.

Prefill phase diagram showing known prompt tokens processed in parallel by large matrix operations before the first output token. — Prefill is the parallel prompt-processing phase. It usually dominates TTFT for long prompts because the first output token can't be emitted until the prompt has been processed.

Key characteristics

Often compute-heavy: Prefill exposes large matrix multiplications. FlashAttention keeps attention exact while reducing HBM traffic relative to materializing the full attention matrix; it does not make all attention IO linear in sequence length.^[1]
Parallel-friendly: Processing many prompt positions together can drive much higher tensor-core utilization than one-token decode. Whether it saturates compute depends on sequence shape, kernel, and hardware.
Latency: Time usually grows with prompt length, and long prompts often dominate TTFT.

Decode: answering one word at a time

Once the first token is generated, the model switches to autoregressive generation. It generates one token at a time, feeding it back as input for the next step.

Unlike the prefill phase, decoding can't be parallelized across tokens because each new token depends on the previous ones. The system is locked into a sequential, step-by-step loop. The pace of this phase is measured as Time Per Output Token (TPOT); its reciprocal is often quoted as tokens per second (TPS). While TTFT affects perceived responsiveness, TPOT determines the "reading speed" of the generation: how fast the text streams to the user. The following figure shows this autoregressive process, where each generated token is fed back as input for the next step.

Decode loop diagram showing each output token rereading model weights and KV state, computing logits, sampling one token, and appending new KV before the next step. — Decode is a sequential memory loop. Batching can amortize repeated weight reads across requests, but each request still advances one generated token at a time.

Key characteristics

Often memory-bound: A decode step needs model weights and the KV state used by attention. At small or latency-sensitive batches, repeated reads commonly make HBM bandwidth the ceiling; batching can raise arithmetic intensity by sharing weight reads across active requests.
Low arithmetic intensity at small batches: The arithmetic intensity (FLOPs/byte, i.e., Floating Point Operations per byte of data loaded) can be low because the runtime moves large tensors for only one new position per sequence.
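To put rough numbers on that, the sketch below estimates weight-only arithmetic intensity for one decode step at several batch sizes. It assumes about 2 FLOPs per parameter per sequence, FP16 weights read from HBM once per step, and a hypothetical accelerator with 1,000 TFLOP/s of dense FP16 compute and 2,000 GB/s of HBM bandwidth. KV-cache traffic is ignored, so real intensity is lower than what this prints.

```python
# Hypothetical accelerator figures, chosen only to illustrate the roofline idea.
ASSUMED_PEAK_FLOPS = 1.0e15        # dense FP16 FLOP/s
ASSUMED_HBM_BYTES_PER_S = 2.0e12   # HBM bandwidth in bytes/s


def decode_weight_intensity(batch_size: int, dtype_bytes: int = 2) -> float:
    """FLOPs per byte of weight traffic for one decode step.

    Each parameter contributes roughly one multiply and one add per sequence,
    and is read from HBM once per step regardless of batch size.
    """
    flops_per_parameter = 2 * batch_size
    return flops_per_parameter / dtype_bytes


# Intensity at which the compute and bandwidth ceilings meet on this assumed device.
ridge_point = ASSUMED_PEAK_FLOPS / ASSUMED_HBM_BYTES_PER_S

for batch_size in (1, 8, 64, 512):
    intensity = decode_weight_intensity(batch_size)
    regime = "bandwidth-bound" if intensity < ridge_point else "compute-bound"
    print(f"batch {batch_size:>3}: {intensity:6.1f} FLOPs/byte -> {regime}")
```

The parameter count cancels out of the ratio, which is why batch size, not model size, decides which side of the ridge a weight-dominated decode step lands on.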

Why decode is memory-bound

Decode-heavy LLM serving is often memory-bandwidth bound, not compute-bound. In a compute-bound operation, the bottleneck is the arithmetic itself. In a memory-bandwidth-bound operation, the bottleneck is moving bytes between HBM and the compute units. Training and long-prefill workloads typically expose much larger matrix operations than interactive decode, so they can drive compute hardware more effectively.

During token generation, the bottleneck often shifts toward memory movement. Each new token needs model weights and attention state, but contributes only one new position per active request. At modest decode batches, this produces low arithmetic intensity and makes HBM traffic a central constraint.

To make this concrete, imagine a model with 7 billion parameters stored in 16-bit precision. Its weights occupy about 14 GB in decimal units. If one uncached decode step had to read that full weight footprint for one active token, the weight-read lower bound alone would be about 14 GB per step. Real kernels, cache reuse, batch size, tensor parallelism, and KV traffic determine the observed bandwidth cost.

When profiling confirms this bandwidth ceiling, serving work should focus on bytes moved, cache residency, batch policy, and queueing behavior rather than only raw floating-point throughput.

Roofline-style utilization comparison showing prefill leaning toward compute while small-batch decode can lean toward HBM bandwidth. — Prefill and decode hit different ceilings. Decode often saturates memory bandwidth while tensor cores wait, so bandwidth-reducing optimizations matter more than peak FLOPs.

decode-bandwidth-lower-bound.py

parameters = 7_000_000_000
bytes_per_parameter = 2  # FP16
ideal_hbm_bandwidth_gb_s = 2_000

weight_bytes = parameters * bytes_per_parameter
ideal_steps_per_second = ideal_hbm_bandwidth_gb_s * 1_000_000_000 / weight_bytes

print(f"FP16 weight footprint: {weight_bytes / 1_000_000_000:.2f} GB")
print(f"ideal weight-read upper bound: {ideal_steps_per_second:.1f} single-token steps/s")
print("Observed TPS is lower once KV reads and runtime overhead are included.")

Output

FP16 weight footprint: 14.00 GB
ideal weight-read upper bound: 142.9 single-token steps/s
Observed TPS is lower once KV reads and runtime overhead are included.

The KV cache: saving state so you don't restart

Without caching, every new token requires recomputing attention over all previous tokens. The KV cache stores the Key and Value matrices for all past tokens, so we only need to compute them for the new token.

Think of it like a shift handoff log. Without it, our order-tracking bot would have to reread the entire customer conversation from the beginning every time it wanted to say the next word. With the KV cache, it remembers what it already understood and only processes the newest token.

KV cache packing comparison showing contiguous reservations wasting memory while paged block allocation packs active request blocks into a shared reusable pool. — Once KV state persists across decode steps, memory packing becomes a scheduling constraint. Paged blocks make more HBM usable for active requests.

The illustration here zooms in on a different but equally important serving concern: once you keep KV states around, you need to pack them efficiently in GPU memory instead of reserving one giant contiguous region per request.

KV cache append diagram showing prefix keys and values reused across decode steps while the newest token adds one fresh KV pair. — The KV cache keeps old keys and values alive across decode steps. Each new token adds one fresh KV pair instead of rebuilding the whole prefix.

Memory cost of KV cache

The KV cache is often the largest consumer of GPU memory during inference, sometimes exceeding the model weights themselves for long contexts. This is crucial for capacity planning and determining the maximum batch size a given GPU can support.

Let's work through a concrete example by hand before showing the code. Suppose we're serving our order-tracking bot with a model that has 80 layers, uses Grouped Query Attention with 8 KV heads, and each head has dimension 128. For one request with a sequence length of 8,192 tokens, stored in FP16 (2 bytes per element):

We need both K and V: that's a factor of 2
One request, 8,192 tokens, 80 layers, 8 heads, head size 128, 2 bytes each
Total bytes = 2 * 1 * 8,192 * 80 * 8 * 128 * 2 = 2,684,354,560 bytes
Divide by 1024^3: that's about 2.5 GiB per request

Now scale that up. For a production batch of 64 concurrent requests at an 8K context window, that's about 160 GiB of KV cache alone. This is why techniques like Grouped Query Attention (GQA), which reduces the number of KV heads from num_heads to num_kv_heads, are standard in modern models.^[2]

The following Python function generalizes that exact calculation. It takes the model's architectural parameters and returns the KV cache memory in GiB.

KV cache capacity chart showing memory growing linearly with sequence length and batch size, with an 8K-context 64-request batch reaching about 160 GiB in the worked example. — KV memory grows linearly with sequence length and batch size. For long contexts, KV cache alone can cap concurrency before model weights do.

memory-cost-of-kv-cache.py

def kv_cache_memory(
    batch_size: int,
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2  # FP16
) -> float:
    """Calculate KV cache memory in GiB."""
    # 2 for K and V, per layer, per head
    total_bytes = (
        2 * batch_size * seq_len * num_layers * num_kv_heads * head_dim * dtype_bytes
    )
    return total_bytes / (1024 ** 3)

# Example model: 80 layers, 8 KV heads (GQA), head_dim=128
# Batch=1, seq_len=8192, FP16 (2 bytes):
# 2 * 1 * 8192 * 80 * 8 * 128 * 2 = ~2.5 GiB per request
one_request = kv_cache_memory(
    batch_size=1,
    seq_len=8192,
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
)
production_batch = kv_cache_memory(
    batch_size=64,
    seq_len=8192,
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
)

print(f"one 8K request: {one_request:.1f} GiB")
print(f"64 active 8K requests: {production_batch:.0f} GiB")
# Compare floats with a tolerance instead of exact equality
print("single-request estimate correct:", abs(one_request - 2.5) < 1e-9)
print("64-request estimate correct:", abs(production_batch - 160.0) < 1e-9)

Output

one 8K request: 2.5 GiB
64 active 8K requests: 160 GiB
single-request estimate correct: True
64-request estimate correct: True
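As a rough follow-up, the sketch below compares the same model shape with and without GQA and estimates how many 8K requests fit in a fixed KV budget. The 64 query heads and the 40 GiB KV budget are assumptions made for illustration; the worked example above only specifies 8 KV heads.

```python
ASSUMED_QUERY_HEADS = 64       # assumption: KV heads if the model used full multi-head attention
ASSUMED_KV_BUDGET_GIB = 40.0   # assumption: HBM left for KV after weights and buffers


def kv_gib_per_request(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,
) -> float:
    """KV cache size in GiB for one request (factor 2 covers K and V)."""
    total_bytes = 2 * seq_len * num_layers * num_kv_heads * head_dim * dtype_bytes
    return total_bytes / (1024 ** 3)


gqa = kv_gib_per_request(8192, 80, 8, 128)
mha = kv_gib_per_request(8192, 80, ASSUMED_QUERY_HEADS, 128)

for name, per_request in (("GQA, 8 KV heads", gqa), ("MHA, 64 KV heads", mha)):
    # Whole requests only: a partially resident KV cache cannot serve a request.
    fits = int(ASSUMED_KV_BUDGET_GIB // per_request)
    print(f"{name}: {per_request:.1f} GiB/request, {fits} concurrent 8K requests")
```

Under these assumptions the 8x reduction in KV heads turns directly into 8x more concurrent long-context requests for the same memory budget.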

Throughput vs. latency trade-off

There's an inherent tension between maximizing system throughput and minimizing per-request latency.

Chart showing throughput rising with batch pressure while p99 latency rises faster once batch size gets large. — Batching improves aggregate tokens per second, but user-facing latency can worsen once large batches push harder on shared memory bandwidth. Track throughput and TTFT or TPOT together.

Metric	Optimized By	Trade-off
Throughput (tokens/sec)	Larger effective batches	Can increase TTFT or inter-token latency once shared resources are pressured.
Latency (ms/token)	Smaller admitted batches	Can leave throughput unused and raise cost per token.

Production tip: Monitor GPU KV-cache usage, prefill backlog, and decode queue depth together. High KV usage plus rising TTFT usually means memory pressure is capping concurrency. Low KV usage with idle compute means you're leaving throughput on the table.

The throughput, latency, cost triangle

Throughput and latency are two corners of a triangle. The third corner is the one the business actually cares about: cost per token. These three pull against each other, and picking where to sit on that triangle is the central job of an inference engineer.

Cost per token is simpler than it looks. If you rent a GPU at a fixed hourly rate and it sustains some number of tokens per second, then:

$\text{cost per token} = \frac{\text{GPU \$ per hour}}{\text{sustained tokens per second} \times 3600}$

Sustained throughput, not the sticker hourly rate, dominates the answer. A faster, pricier GPU can still be cheaper per token if its throughput rises faster than its price. Let's work an example by hand. Suppose one GPU costs $3.00/hour and a well-batched deployment sustains 2,500 decode tokens/second across all active requests:

Tokens per hour = 2,500 * 3,600 = 9,000,000
Cost per token = $3.00 / 9,000,000 = $0.00000033
Cost per million tokens = about $0.33

Now starve the batch. If under-configured batching or idle capacity drops sustained throughput to 250 tokens/second, the same GPU-hour spreads over one-tenth the tokens, so cost per million jumps to about $3.33. A 10x drop in sustained throughput is a 10x rise in cost per token. This is why batching is not only a latency knob; it moves the cost corner of the triangle.

cost-per-million-tokens.py

def cost_per_million(hourly_cost: float, sustained_tps: int) -> float:
    return hourly_cost / (sustained_tps * 3600) * 1_000_000

well_batched = cost_per_million(3.00, 2_500)
starved = cost_per_million(3.00, 250)

print(f"2,500 tokens/s: ${well_batched:.2f} per million tokens")
print(f"250 tokens/s: ${starved:.2f} per million tokens")
print(f"cost multiplier: {starved / well_batched:.0f}x")

Output

2,500 tokens/s: $0.33 per million tokens
250 tokens/s: $3.33 per million tokens
cost multiplier: 10x

The triangle has a simple rule: you can usually optimize two corners hard, but the third drifts. Push batch size for throughput and cost, and tail latency rises. Cap batch size for tight latency SLOs, and your cost per token climbs because the GPU is underused. There is no single best operating point, only the one that fits your product's latency SLO at acceptable cost.

Operating point	Batch size	Cost per token	Latency (TTFT/TPOT)	Typical fit
Latency-first	Small	High	Low	Interactive chat, code completion
Balanced	Medium	Medium	Medium	General chat assistants
Throughput-first	Large	Low	High	Offline batch jobs, summarization, evals

Production tip: Pick the operating point from the product SLO, then size hardware to it. An interactive assistant with a 500 ms TTFT budget can't run the same batch size as an overnight document-summarization job, even on identical GPUs. The summarization job can push batch size until cost per token bottoms out because no human is waiting on each token.
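One way to turn that tip into a procedure is to profile a few batch sizes and pick the highest-throughput point that still meets the latency SLO. The sketch below does that over a made-up profile. Every number in `HYPOTHETICAL_PROFILE`, the $3.00 hourly rate, and the two SLO values are illustrative assumptions, not measurements.

```python
from typing import NamedTuple, Optional


class ProfilePoint(NamedTuple):
    batch_size: int
    tpot_ms: float        # time per output token seen by each stream
    tokens_per_s: float   # aggregate decode throughput across the batch


# Hypothetical profiling results for one GPU; replace with real measurements.
HYPOTHETICAL_PROFILE = [
    ProfilePoint(1, 12.0, 83.0),
    ProfilePoint(8, 14.0, 570.0),
    ProfilePoint(32, 22.0, 1450.0),
    ProfilePoint(128, 51.0, 2500.0),
]


def pick_operating_point(
    profile: list[ProfilePoint], tpot_slo_ms: float
) -> Optional[ProfilePoint]:
    """Return the highest-throughput point that meets the TPOT SLO, if any."""
    eligible = [point for point in profile if point.tpot_ms <= tpot_slo_ms]
    return max(eligible, key=lambda point: point.tokens_per_s, default=None)


def cost_per_million(hourly_cost: float, tokens_per_s: float) -> float:
    return hourly_cost / (tokens_per_s * 3600) * 1_000_000


for label, slo_ms in (("interactive", 25.0), ("offline", 1000.0)):
    point = pick_operating_point(HYPOTHETICAL_PROFILE, slo_ms)
    if point is None:
        print(f"{label}: no profiled batch size meets a {slo_ms} ms TPOT SLO")
        continue
    cost = cost_per_million(3.00, point.tokens_per_s)
    print(f"{label}: batch {point.batch_size}, ${cost:.2f} per million tokens")
```

A real selection would also check TTFT and tail percentiles, not only a single TPOT figure per batch size.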

Chunked prefills

Long prefills (e.g., Retrieval-Augmented Generation (RAG), where retrieved documents are appended to the user's prompt, creating contexts of 10,000 tokens) can delay decode turns on a shared worker. The duration depends on model, kernel, hardware, and prompt length, but enough long prompts can noticeably worsen active streams in a multi-tenant service.

To reduce that interference, engineers can break large prefills into smaller, fixed-size chunks (e.g., 512 tokens). The system admits a chunk, gives active decodes another scheduling opportunity, and later processes the next chunk. This bounds admitted prefill work per turn, but does not guarantee a latency SLO when queues or kernels are already overloaded. It often trades a higher TTFT for the long request against improved TPOT for existing streams.^[3]

Chunked prefill timeline showing a long prompt split into chunks that provide decode scheduling opportunities between admitted prefill slices. — Chunked prefill slices a large prompt into smaller GPU turns. The long request may wait longer for first token, while active streams get more opportunities to decode.

prefill-chunk-budget.py

import math

prompt_tokens = 10_000
chunk_tokens = 512
turns = math.ceil(prompt_tokens / chunk_tokens)
last_chunk = prompt_tokens - chunk_tokens * (turns - 1)

print(f"prefill turns: {turns}")
print(f"largest admitted prefill slice: {chunk_tokens} tokens")
print(f"last slice: {last_chunk} tokens")

Output

prefill turns: 20
largest admitted prefill slice: 512 tokens
last slice: 272 tokens
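The chunk count changes once active decodes share the same per-turn budget. The sketch below plans turns under a simplified policy, assuming each active stream's decode token is admitted first and prefill fills whatever budget remains. The 32 active streams and the 512-token budget are illustrative values, and `plan_turns` is a hypothetical helper rather than any scheduler's real API.

```python
def plan_turns(
    prompt_tokens: int, active_decodes: int, token_budget: int
) -> list[tuple[int, int]]:
    """Return (decode_tokens, prefill_tokens) for each scheduling turn.

    Assumes one decode token per active stream per turn and a fixed set of
    active streams for the duration of the prefill.
    """
    if active_decodes >= token_budget:
        raise ValueError("no budget left for prefill once decodes are admitted")
    remaining = prompt_tokens
    turns: list[tuple[int, int]] = []
    while remaining > 0:
        prefill = min(remaining, token_budget - active_decodes)
        turns.append((active_decodes, prefill))
        remaining -= prefill
    return turns


turns = plan_turns(prompt_tokens=10_000, active_decodes=32, token_budget=512)
print(f"turns needed: {len(turns)}")
print(f"prefill tokens per full turn: {turns[0][1]}")
print(f"prefill tokens in last turn: {turns[-1][1]}")
```

Every turn in the plan still advances all 32 streams by one token, which is the point of chunking: the long prompt takes more turns, and no stream skips a turn while it is processed.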

Disaggregated inference

Disaggregated inference separates prefill and decode across worker pools rather than running both phases on one worker. Systems such as Splitwise and DistServe show when this can improve goodput: avoided interference must repay KV-transfer and coordination overhead.^[4]^[5]

The problem: conflicting optimization targets

Prefill and decode phases have opposite hardware needs:

Prefill is often compute-heavy: Long prompts create large matrix operations over many positions
Decode is often bandwidth-heavy: Small-batch token generation repeatedly reads weights and KV state

When both phases run on the same GPU, they interfere. A long prefill can "block" decode requests, causing head-of-line blocking where a massive prompt stalls generation for all other users.

Prefill-decode disaggregation

The solution is architectural separation:

Prefill cluster (compute-optimized): Dedicated prefill workers process incoming prompts in parallel. These nodes excel at the compute-heavy attention operations.
Decode cluster (bandwidth-optimized): Separate workers handle token generation. These nodes are tuned for the memory-bound sequential decoding loop and steady high-concurrency decode traffic.
KV cache handoff: After prefill completes, the KV cache is transferred over a fast interconnect from the prefill worker to a decode worker, which continues generation.

Disaggregated inference architecture showing compute-optimized prefill workers handing KV cache over a fast interconnect to bandwidth-optimized decode workers. — Prefill-decode disaggregation separates the compute-heavy prompt phase from the bandwidth-heavy streaming phase. It helps when avoided queueing costs exceed KV-transfer overhead.

Benefits of disaggregation

Can reduce head-of-line blocking: Long prefills no longer directly occupy decode workers
Potentially right-sized hardware: Each phase can run on workers chosen for its measured bottleneck
Separate scaling knobs: Prefill and decode clusters can scale independently based on workload patterns
Potential efficiency gain: Separate pools can better match the two workload profiles when transfer overhead is acceptable

Disaggregation is a design pattern, not a mandatory default. The KV-transfer cost has to be lower than the queueing and interference it removes, which is why it becomes more attractive as prompts get longer and decode traffic gets denser.^[4]^[5]

Disaggregation also changes how you autoscale. Because prefill load tracks incoming prompt tokens and decode load tracks active generation, the two pools scale on different signals. A burst of long RAG prompts points toward more prefill capacity; a surge in concurrent streaming conversations points toward more decode capacity. Useful signals include queue depth, KV-cache utilization, and TTFT/TPOT percentiles, with cold-start time included because loading model weights onto a new worker is not instant.

kv-handoff-lower-bound.py

kv_cache_gib = 2.5
interconnect_gb_s = 200

transfer_bytes = kv_cache_gib * 1024**3
ideal_transfer_ms = transfer_bytes / (interconnect_gb_s * 1_000_000_000) * 1000

print(f"KV state to transfer: {kv_cache_gib:.1f} GiB")
print(f"ideal one-way transfer floor at {interconnect_gb_s} GB/s: {ideal_transfer_ms:.1f} ms")
print("Queueing saved must exceed transfer plus scheduling overhead.")

Output

KV state to transfer: 2.5 GiB
ideal one-way transfer floor at 200 GB/s: 13.4 ms
Queueing saved must exceed transfer plus scheduling overhead.
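The same arithmetic can be turned into a break-even check across prompt lengths and link speeds. The sketch below reuses the 80-layer, 8-KV-head, FP16 model shape from the KV-cache example. The link bandwidths, the 40 ms of avoided queueing, and the 5 ms of scheduling overhead are assumptions picked for illustration.

```python
# K and V, 80 layers, 8 KV heads, head_dim 128, FP16 (2 bytes per element)
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2


def kv_transfer_floor_ms(prompt_tokens: int, link_gb_s: float) -> float:
    """Ideal one-way KV transfer time, ignoring protocol and launch overhead."""
    return prompt_tokens * KV_BYTES_PER_TOKEN / (link_gb_s * 1e9) * 1000


def handoff_pays_off(
    prompt_tokens: int,
    link_gb_s: float,
    queueing_saved_ms: float,
    overhead_ms: float,
) -> bool:
    """True when avoided queueing exceeds the transfer floor plus overhead."""
    cost_ms = kv_transfer_floor_ms(prompt_tokens, link_gb_s) + overhead_ms
    return queueing_saved_ms > cost_ms


for link_gb_s in (25, 200):
    for prompt_tokens in (1_024, 8_192, 32_768):
        floor_ms = kv_transfer_floor_ms(prompt_tokens, link_gb_s)
        worth_it = handoff_pays_off(prompt_tokens, link_gb_s, 40.0, 5.0)
        print(f"{link_gb_s:>3} GB/s, {prompt_tokens:>6} tokens: "
              f"{floor_ms:6.1f} ms floor, pays off: {worth_it}")
```

Real systems can overlap part of the transfer with prefill compute, so the floor here is a conservative way to frame the comparison rather than a prediction.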

Batching strategies: the loading-dock analogy

Naive batching strategies lead to significant inefficiency due to the variable length of text. Picture a warehouse loading dock handling shipments that take different amounts of time to prepare.

Static batching: waiting for the whole pallet

In static batching, the dock waits for 4 parcels before releasing the pallet. If one parcel needs 10 extra minutes of labeling, the other 3 sit ready but blocked. In serving terms, we group requests into a batch and pad them to the length of the longest active sequence. The batch membership stays fixed for that run, so when one request finishes early its slot often turns into padding or sits idle until the longest request finishes. The timeline below illustrates how shorter requests waste compute cycles while the batch waits on the longest request.

Static versus continuous batching timeline showing static batches wasting finished slots as padding while continuous batching admits new requests after each decode step. — Static batching holds slots until the batch cycle ends. Continuous batching changes membership at token-step boundaries so finished requests leave and queued requests enter.

The problem with static batching

Static batching creates two significant inefficiencies. First, the GPU is forced to process "padding tokens" that don't contribute to the final output, wasting valuable compute cycles and memory bandwidth. Second, once shorter requests finish, their batch slots usually can't be reused until the scheduler rebuilds the batch around the longest surviving sequence. This degrades both latency and overall throughput, especially when request lengths vary widely.
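A quick count makes the slot waste concrete. The sketch below uses four hypothetical output lengths and counts decode slot-steps only, assuming the static batch runs until its longest request finishes; prompt padding is ignored.

```python
# Hypothetical output lengths (in tokens) for four requests in one static batch.
output_lengths = [12, 40, 200, 35]

# Every slot stays occupied until the longest request finishes.
static_slot_steps = len(output_lengths) * max(output_lengths)
# Only steps that emit a real token for a request are useful.
useful_slot_steps = sum(output_lengths)
wasted_fraction = 1 - useful_slot_steps / static_slot_steps

print(f"slot-steps reserved: {static_slot_steps}")
print(f"slot-steps that emit a token: {useful_slot_steps}")
print(f"wasted on finished slots: {wasted_fraction:.1%}")
```

With these lengths, 800 slot-steps are reserved and 287 produce tokens, so about 64% of the batch's decode capacity is spent on slots whose requests have already finished.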

Continuous batching: filling open dock slots

Continuous batching (introduced by Orca) operates at the iteration level.^[6] The dock releases one finished parcel, can pull the next queued parcel into the open slot, and keeps useful work flowing when demand exists.

In serving terms, instead of waiting for a whole batch cycle to finish, the scheduler can eject completed requests and insert new ones after every token generation step. See our continuous batching deep-dive for scheduling algorithms and preemption strategies. The timeline above shows slot reuse; its actual benefit depends on queued work, KV capacity, and latency policy.

Benefits of continuous batching

Continuous batching provides three useful advantages under mixed, queued workloads:

Less slot waste: Finished requests can leave at iteration boundaries rather than remaining as padding or idle slots.
Lower completion delay for short requests: A completed request need not wait for the longest request in a fixed batch, though queueing and large active batches can still worsen latency.
Policy control: The scheduler can decide how to admit prefills alongside ongoing decode while respecting KV memory and latency SLOs.

To implement continuous batching, systems use a scheduling loop that manages active requests dynamically. The following sketch shows the shape of a continuous batcher. It takes a queue of incoming requests, processes prefill for new requests up to the maximum batch size, and then runs a single decoding step for all active requests.

benefits-of-continuous-batching.py

from collections import deque

import torch

# Minimal request stub for illustration
class Request:
    def is_done(self) -> bool: return False
    def get_next_token(self) -> torch.Tensor: return torch.tensor([0])
    def update(self, logits: torch.Tensor): pass

class ContinuousBatcher:
    def __init__(self, model, max_batch_size: int = 64):
        # `model` is assumed to expose a batched one-step `decode` method.
        # That method is illustrative; it is not part of torch.nn.Module.
        self.model = model
        self.max_batch = max_batch_size
        self.active_requests: list[Request] = []
        # deque gives O(1) pops from the front of the waiting queue
        self.queue: deque[Request] = deque()

    def step(self):
        """
        Executes a single generation step for the current batch.
        """
        # 1. Remove completed requests
        self.active_requests = [
            req for req in self.active_requests if not req.is_done()
        ]

        # 2. Add new requests from queue (up to max batch size)
        while self.queue and len(self.active_requests) < self.max_batch:
            new_req = self.queue.popleft()
            # Run prefill for the new request (often done in parallel or on a separate stream)
            # pseudo-code: new_req.run_prefill(self.model)
            self.active_requests.append(new_req)

        # 3. Run one decode step for all active requests
        if self.active_requests:
            # Gather current input tokens from all requests: shape [batch, 1]
            input_tokens = torch.stack([req.get_next_token() for req in self.active_requests])

            # Forward pass (batched); expected to return one row of logits per request
            logits = self.model.decode(input_tokens)

            # Update requests with new tokens
            for i, req in enumerate(self.active_requests):
                req.update(logits[i])

Memory management: PagedAttention (vLLM)

Traditional KV-cache allocation is like reserving a full pallet position for every request, even if it only needs a small bin. PagedAttention is like a shared bin system that assigns fixed-size slots on demand: Request A gets slots 7, 2, and 5 (non-contiguous, but tracked by a block table). When a request finishes, its slots become available again. Paging sharply reduces worst-case reservation waste, but a partially filled final block and bookkeeping still consume memory.

The problem

KV cache is allocated per-request, but request lengths vary. Pre-allocating the maximum possible sequence length for every request wastes a massive amount of memory. For example, if the system allocates 4096 tokens per request by default:

Request	Tokens Needed	Tokens Allocated	Memory Wasted
Request A	100	4096	97.6%
Request B	3000	4096	26.8%

PagedAttention solution

PagedAttention applies the operating system concept of virtual memory to KV cache management.^[7] Instead of contiguous physical memory, we divide the KV cache into fixed-size "blocks" (pages). The following figure illustrates how logical blocks map to non-contiguous physical GPU memory via a block table.

PagedAttention block-table diagram mapping logical KV blocks to non-contiguous physical GPU memory pages while preserving sequence order. — PagedAttention separates logical token order from physical HBM placement. A block table lets the runtime use non-contiguous pages while attention still sees the right sequence.
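The block-table idea can be sketched in a few lines. The toy allocator below hands out fixed-size blocks from a shared free list and records, per request, which physical block backs each logical block. `BlockAllocator`, `append_token`, and `release` are hypothetical names for illustration, not vLLM's API, and the 4-token block size is chosen only to keep the trace short.

```python
class BlockAllocator:
    """Toy KV block allocator: logical order lives in the block table."""

    def __init__(self, num_blocks: int, block_tokens: int = 16) -> None:
        self.block_tokens = block_tokens
        self.free: list[int] = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, request_id: str) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.block_tokens == 0:
            # No block yet, or the last block is full: take one from the pool.
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_tokens], length % self.block_tokens

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the shared pool."""
        self.free.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = BlockAllocator(num_blocks=8, block_tokens=4)
for _ in range(4):
    allocator.append_token("A")            # fills A's first block
allocator.append_token("B")                # B takes the next free block
slot = allocator.append_token("A")         # A's second block is not adjacent
print("A block table:", allocator.block_tables["A"])
print("A's newest token lives at (block, offset):", slot)
allocator.release("A")
print("free blocks after A finishes:", len(allocator.free))
```

Request A ends up with physical blocks 7 and 5 while B sits in block 6 between them, yet A's logical order is preserved because attention follows the table rather than physical adjacency.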

Impact of PagedAttention

By avoiding large contiguous reservations and allocating fixed-size blocks on demand, PagedAttention lets the runtime fit more useful KV state into the same HBM budget.^[7] It does not eliminate all slack: each live request can still leave a partially filled last block, and the block table has overhead. In practice, it substantially reduces memory lost to worst-case preallocation.

paged-kv-slack.py

import math

requests = [100, 3_000]
max_context = 4_096
block_tokens = 16
reserved_tokens = len(requests) * max_context
paged_tokens = sum(math.ceil(tokens / block_tokens) * block_tokens for tokens in requests)

print(f"max-context reservation: {reserved_tokens} token slots")
print(f"paged allocation: {paged_tokens} token slots")
print(f"remaining final-block slack: {paged_tokens - sum(requests)} token slots")

Output

max-context reservation: 8192 token slots
paged allocation: 3120 token slots
remaining final-block slack: 20 token slots

Copy-on-write for shared blocks

PagedAttention's copy-on-write mechanism matters whenever multiple active continuations share the same prompt prefix. In the original vLLM setting, this is especially important for beam search and parallel sampling, where several continuations reuse the same prompt blocks before they diverge.^[7] The figure below shows two continuations initially pointing to the same shared prefix blocks before branching.

Copy-on-write KV cache diagram showing two continuations sharing prefix blocks until one branch diverges and allocates a private block. — Copy-on-write shares immutable prefix KV blocks across continuations, then clones only the block that must diverge. That saves memory without corrupting another branch's view.

Initially, the shared prefix blocks have a reference count greater than one. Appending new tokens usually allocates fresh blocks for each continuation. If a continuation needs to write into a block that's still shared, the runtime first clones that block so the other continuations keep their original view. That's what copy-on-write means here: share immutable prefix state aggressively, then split only when sequences diverge.^[7]
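The reference-counting logic can be sketched as follows. `BlockPool` and its methods are hypothetical names, blocks hold plain token IDs instead of key/value tensors, and block capacity is ignored so the example stays focused on when a clone happens.

```python
class BlockPool:
    """Toy copy-on-write pool: blocks are shared until someone writes."""

    def __init__(self) -> None:
        self.blocks: dict[int, list[int]] = {}
        self.ref_count: dict[int, int] = {}
        self._next_id = 0

    def allocate(self, tokens: list[int]) -> int:
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = list(tokens)
        self.ref_count[block_id] = 1
        return block_id

    def fork(self, table: list[int]) -> list[int]:
        """Start a new continuation that shares every block of `table`."""
        for block_id in table:
            self.ref_count[block_id] += 1
        return list(table)

    def append(self, table: list[int], token: int) -> None:
        """Write into the last block, cloning it first if it is shared."""
        last = table[-1]
        if self.ref_count[last] > 1:
            self.ref_count[last] -= 1
            last = self.allocate(self.blocks[last])  # private copy
            table[-1] = last
        self.blocks[last].append(token)


pool = BlockPool()
parent = [pool.allocate([1, 2, 3, 4]), pool.allocate([5, 6])]
child = pool.fork(parent)
pool.append(child, 99)   # last block is shared, so the child clones it
pool.append(parent, 7)   # parent now owns its last block alone: no clone

print("parent table:", parent, "last block:", pool.blocks[parent[-1]])
print("child table: ", child, "last block:", pool.blocks[child[-1]])
print("first block still shared, ref count:", pool.ref_count[parent[0]])
```

Only the block being written is duplicated. The first block keeps a reference count of 2 and is never copied, which is where the memory saving comes from.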

Context parallelism and long-context serving

As context windows grow into the hundreds of thousands or millions of tokens, a single GPU often can't hold the full KV cache or attention working set for one request. Context Parallelism (CP) addresses this by splitting the input sequence itself across multiple GPUs.^[8]

How context parallelism works

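Instead of splitting weight matrices within each layer (tensor parallelism), assigning different layers to different devices (pipeline parallelism), or splitting batches (data parallelism), CP splits the sequence dimension: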

A 1M token sequence is divided into N chunks (e.g., 250K tokens per GPU on 4 GPUs)
Each GPU runs the per-token work (projections and MLP layers) on its own chunk during the prefill phase
Attention is computed using ring-style communication patterns to handle cross-chunk dependencies
The KV cache is distributed across the GPU cluster

This approach becomes useful once a single request's context no longer fits comfortably on one accelerator.

Ring attention for context parallelism

Modern implementations use ring attention (Liu et al., 2024) or similar distributed attention algorithms that overlap communication with computation. GPUs form a logical ring: each device keeps its own query block and passes key/value blocks to its neighbor, accumulating partial attention results until its queries have seen the full context. This can extend supported context length roughly linearly with the number of devices, at least until communication becomes the next bottleneck.^[8]
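The numerical core of this scheme is that softmax attention can be accumulated one KV shard at a time with a running maximum and running sum. The pure-Python sketch below simulates that for a single query and four shards, then checks the result against ordinary full attention. It is a toy under simplifying assumptions: non-causal attention, one query vector, and a sequential loop standing in for devices that would really run concurrently. All function names are illustrative.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def full_attention(query, keys, values) -> list[float]:
    """Reference: softmax(q.K) V computed over the whole sequence at once."""
    scores = [dot(query, key) for key in keys]
    top = max(scores)
    weights = [math.exp(score - top) for score in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def ring_attention(query, kv_shards, device: int) -> list[float]:
    """Attention for a query on `device`, folding in one KV shard per ring hop."""
    num_devices = len(kv_shards)
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(kv_shards[0][1][0])
    for hop in range(num_devices):
        # Hop 0 uses the local shard; later hops use shards passed along the ring.
        keys, values = kv_shards[(device - hop) % num_devices]
        scores = [dot(query, key) for key in keys]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first hop
        weights = [math.exp(score - new_max) for score in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [
            a * rescale + sum(w * v[d] for w, v in zip(weights, values))
            for d, a in enumerate(acc)
        ]
        running_max = new_max
    return [a / running_sum for a in acc]


keys = [[math.sin(i), math.cos(i)] for i in range(8)]
values = [[float(i), float(i * i)] for i in range(8)]
kv_shards = [(keys[i:i + 2], values[i:i + 2]) for i in range(0, 8, 2)]
query = [0.5, -1.0]

expected = full_attention(query, keys, values)
for device in range(len(kv_shards)):
    result = ring_attention(query, kv_shards, device)
    error = max(abs(r - e) for r, e in zip(result, expected))
    print(f"device {device}: max abs error vs full attention = {error:.2e}")
```

Each device visits the shards in a different order yet reaches the same answer, which is what lets a ring schedule spread the KV cache without changing the attention result.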

context-parallel-kv-shards.py

def kv_gib(sequence_tokens: int) -> float:
    total_bytes = 2 * sequence_tokens * 80 * 8 * 128 * 2
    return total_bytes / 1024**3

total_kv = kv_gib(1_000_000)
devices = 4
print(f"one 1M-token request KV: {total_kv:.1f} GiB")
print(f"evenly sharded over {devices} devices: {total_kv / devices:.1f} GiB/device")
print("Communication and runtime buffers still add overhead.")

Output

one 1M-token request KV: 305.2 GiB
evenly sharded over 4 devices: 76.3 GiB/device
Communication and runtime buffers still add overhead.

Speculative decoding: the smart assistant

Speculative decoding works like a fulfillment shortcut. A small draft model proposes the next few support-message tokens. The large target model checks the proposed span in a verification pass. If enough draft tokens survive acceptance, this can replace several target decode passes; if not, draft and verification work can lose to ordinary decoding.

Think of it as a senior routing checker (the large target model) and a fast draft scanner (the small draft model). The scanner guesses the next 4 tokens. The checker looks at the span together. If the first 3 are accepted, one target verification pass emits those 3 tokens plus one correction token, 4 in total. The accounting must still include the scanner's draft work.

The diagram below shows how the two models interact.^[9] The small draft model proposes a short run of tokens, and the large target model verifies them in parallel, keeping the accepted prefix and replacing the first rejected token.

Speculative decoding diagram showing a small draft model proposing multiple tokens, a large target model verifying them in one pass, and accepted tokens reducing target decode passes. — Speculative decoding is only faster when draft work is cheap and acceptance rate is high. The target model still preserves correctness through accept/reject correction.

The key insight of speculative decoding is the asymmetric cost of the two paths. Each draft step is cheap because the draft model is much smaller than the target, while verification scores all k draft positions in one target-model pass. If acceptance is high, one target pass can replace several ordinary target decode passes. Total latency still includes draft generation, verification, sampling, rejected-token correction, and kernel overhead.

The function below shows one exact speculative step for a single sequence. It first samples k draft tokens from the small model, then runs the large target model once on [prompt + draft_tokens], and finally applies the accept/reject test from Leviathan et al., resampling from the residual distribution at the first rejection.^[9] To stay short, it recomputes the full sequence on every model call instead of reusing a KV cache, and it assumes models whose outputs expose a .logits tensor.

speculative-decoding-the-smart-assistant.py

import torch
import torch.nn.functional as F

def next_token_probs(model, input_ids: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    return F.softmax(logits, dim=-1)

def speculative_step(
    draft_model,
    target_model,
    input_ids: torch.Tensor,
    k: int = 4,
) -> list[int]:
    """
    Return one speculative chunk for a single sequence.
    Assumes input_ids has shape [1, seq_len].
    """
    assert input_ids.shape[0] == 1, "single-sequence example"

    draft_tokens: list[int] = []
    draft_dists: list[torch.Tensor] = []
    draft_input = input_ids

    for _ in range(k):
        q = next_token_probs(draft_model, draft_input)
        token = torch.multinomial(q, num_samples=1).item()
        draft_tokens.append(token)
        draft_dists.append(q)
        token_tensor = torch.tensor([[token]], device=input_ids.device)
        draft_input = torch.cat([draft_input, token_tensor], dim=1)

    # One target-model pass verifies all draft positions at once.
    with torch.no_grad():
        target_logits = target_model(draft_input).logits[0]

    target_dists = F.softmax(
        target_logits[input_ids.size(1) - 1 : input_ids.size(1) + k],
        dim=-1,
    )

    accepted: list[int] = []
    for i, token in enumerate(draft_tokens):
        p = target_dists[i]
        q = draft_dists[i]
        acceptance = min(1.0, (p[token] / q[token]).item())

        if torch.rand(()) < acceptance:
            accepted.append(token)
            continue

        residual = torch.clamp(p - q, min=0)
        if residual.sum() <= 0:
            replacement = torch.argmax(p).item()
        else:
            residual = residual / residual.sum()
            replacement = torch.multinomial(residual, num_samples=1).item()
        return accepted + [replacement]

    # If all k draft tokens are accepted, sample one extra token from p.
    extra = torch.multinomial(target_dists[k], num_samples=1).item()
    return accepted + [extra]
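
One way to convince yourself that the accept/reject rule preserves the target distribution is to work out the emitted-token distribution for a single draft position in closed form. The sketch below is a small check under stated assumptions: `speculative_output_distribution` is a hypothetical helper, and `p` and `q` are made-up three-token distributions rather than real model outputs.

```python
def speculative_output_distribution(p: list[float], q: list[float]) -> list[float]:
    """Exact distribution of the emitted token for one draft position."""
    # Probability that the draft proposes token x and it is accepted:
    # q(x) * min(1, p(x) / q(x)).
    accepted = [qx * min(1.0, px / qx) if qx > 0 else 0.0 for px, qx in zip(p, q)]
    rejected_mass = 1.0 - sum(accepted)

    # On rejection, the replacement is drawn from the normalized residual max(p - q, 0).
    residual = [max(px - qx, 0.0) for px, qx in zip(p, q)]
    residual_total = sum(residual)
    if residual_total == 0.0:
        return accepted  # p equals q, so nothing is ever rejected
    return [a + rejected_mass * r / residual_total for a, r in zip(accepted, residual)]

p = [0.6, 0.3, 0.1]  # hypothetical target distribution
q = [0.2, 0.5, 0.3]  # hypothetical draft distribution
emitted = speculative_output_distribution(p, q)
acceptance_rate = sum(min(px, qx) for px, qx in zip(p, q))

print("emitted distribution:", [round(x, 6) for x in emitted])
print(f"probability the draft token is accepted: {acceptance_rate:.2f}")
```

The emitted distribution equals `p` even though the draft distribution `q` is quite different. A poor draft lowers the acceptance rate, which costs speed, but it does not change what the target model would have sampled.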

Follow-on work such as EAGLE trains a lightweight draft head on the target model's own hidden states instead of running a separate standalone draft model.^[10] The important takeaway isn't that one speculative variant always wins. It's that these methods trade extra compute for fewer expensive target-model decode passes, so the real payoff depends on acceptance rate, hardware, and implementation overhead.

speculative-pass-accounting.py

draft_proposals = 4
accepted_prefix = 3
target_verification_passes = 1
# The accepted prefix plus one correction token from the target model.
emitted_tokens = accepted_prefix + 1
# Ordinary decoding needs one target pass per emitted token.
ordinary_target_passes = emitted_tokens

print(f"ordinary target passes for {emitted_tokens} tokens: {ordinary_target_passes}")
print(f"speculative target passes for {emitted_tokens} tokens: {target_verification_passes}")
print(f"extra draft passes paid: {draft_proposals}")
print("Speedup requires cheap drafting and high acceptance.")

Output

ordinary target passes for 4 tokens: 4
speculative target passes for 4 tokens: 1
extra draft passes paid: 4
Speedup requires cheap drafting and high acceptance.
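
The single-case accounting generalizes to an expected value. The sketch below uses the closed form from Leviathan et al. under a simplifying assumption: every draft token is accepted independently with the same probability `alpha`.^[9] The function names and the example values are hypothetical. `draft_cost_ratio` is the cost of one draft step relative to one target pass, and the verification pass is assumed to cost about as much as one ordinary target decode pass.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification pass (accepted prefix + 1)."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per pass divided by the cost of k draft steps plus one target pass."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost_ratio + 1)

# Two hypothetical operating points: a good cheap draft and a weak expensive one.
for alpha, cost in [(0.8, 0.05), (0.3, 0.30)]:
    tokens = expected_tokens_per_pass(alpha, k=4)
    speedup = estimated_speedup(alpha, k=4, draft_cost_ratio=cost)
    verdict = "helps" if speedup > 1 else "hurts"
    print(f"alpha={alpha}, draft cost={cost}: {tokens:.2f} tokens/pass, {speedup:.2f}x ({verdict})")
```

An estimate below 1.0 means the draft and verification work costs more than the target passes it saves, which is the failure mode the accounting warns about. Measured acceptance is rarely independent across positions, so treat this as a planning estimate, not a prediction.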

Hardware-aware optimization: quantization and precision

Inference performance isn't just about algorithms. Modern GPUs and specialized accelerators provide low-precision data types and matching kernels that reduce the bytes moved per generated token.

Low-precision inference and quantization

Many serving stacks still use BF16/FP16 as a baseline, but lower-precision modes are increasingly common because they cut model-weight bandwidth and, in some systems, shrink the KV footprint enough to raise concurrency.^[11]^[12]^[13]

FP8 (8-bit floating point): Useful when your hardware and kernels support it, because it lowers bandwidth pressure while preserving more dynamic range than integer-only formats.^[11]
INT8/INT4-style weight-only quantization: Compresses model weights to 4-8 bits while keeping activations in higher precision (BF16/FP16), so fewer bytes are reread on every decode step without quantizing the rest of the runtime.^[12]
KV-cache compression or quantization: Reduces KV bytes, or compresses less useful KV state, which matters once the KV cache rather than the weights caps batch size and residency during long-context serving.^[13]
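
To make weight-only quantization concrete, here is a minimal sketch of symmetric absmax quantization for one group of weights. It is an illustration under stated assumptions, not GPTQ or any library's kernel: `quantize_group` and `dequantize_group` are hypothetical helpers, the weights are made up, and real serving stacks usually pack the codes and fuse dequantization into the matrix multiply.

```python
def quantize_group(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Symmetric absmax quantization of one weight group to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    # Round to the nearest code and clamp into the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes: list[int], scale: float) -> list[float]:
    """Recover approximate higher-precision weights from codes and the group scale."""
    return [c * scale for c in codes]

group = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.02, 0.45]  # made-up weights
codes, scale = quantize_group(group, bits=4)
restored = dequantize_group(codes, scale)
max_error = max(abs(w - r) for w, r in zip(group, restored))

print("4-bit codes:", codes)
print(f"scale: {scale:.4f}, max round-trip error: {max_error:.4f}")
```

The round-trip error is bounded by half the scale, which is why smaller groups (each with its own scale) tend to preserve quality better at the cost of storing more scale metadata.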

Quantization bandwidth chart showing FP16, FP8, INT8, and INT4 reducing bytes per weight and increasing effective decode bandwidth for memory-bound serving. — For memory-bound decode, lower precision mainly helps by reducing bytes moved per token. Actual speedup depends on kernel support, dequantization overhead, and quality constraints.

Common mistake: Beginners often think 4-bit quantization is just about saving disk space. The real win is reducing bytes moved during decode. Whether that turns into a large latency win still depends on kernels, dequant overhead, and quality constraints.

weight-bytes-by-precision.py

parameters = 7_000_000_000
bytes_per_weight = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, width in bytes_per_weight.items():
    traffic_gb = parameters * width / 1_000_000_000
    print(f"{precision}: {traffic_gb:.1f} GB of weights per full read")

print("INT4 weight traffic is 0.25x FP16 before kernel overhead.")

Output

FP16: 14.0 GB of weights per full read
INT8: 7.0 GB of weights per full read
INT4: 3.5 GB of weights per full read
INT4 weight traffic is 0.25x FP16 before kernel overhead.

Inference-first silicon beyond GPUs

GPUs still dominate general-purpose LLM serving, but some deployments use inference-first accelerators with larger on-chip memory, dataflow execution, or more deterministic scheduling. The trade-off is usually a narrower software stack: you may get excellent latency or cost for a specific serving pattern, but at the price of custom compilation, fewer kernels, and less ecosystem flexibility.

Common pitfalls

The following symptoms show up in production logs and profiling traces. If you see them, here's what they mean and how to fix them.

"Training optimizations don't speed up my inference"

Symptom: You upgrade math kernels or buy more peak FLOPs, but decode barely speeds up.
Cause: Training is usually compute-bound, while decode is often memory-bound. If the GPU is already waiting on weight and KV reads, more math capacity does little.
Fix: Profile bandwidth first. If HBM bandwidth is near its ceiling, prioritize batching, quantization, GQA, or KV-cache management before chasing more tensor-core throughput.
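
Before opening a profiler, a back-of-envelope roofline check can suggest which side of the ceiling a decode workload sits on. The sketch below is rough by design: the accelerator numbers are hypothetical, it counts about 2 FLOPs per weight per sequence, it assumes weights are read once per decode step, and it ignores KV reads (which push real workloads further toward memory-bound).

```python
def decode_arithmetic_intensity(batch_size: int, bytes_per_weight: float) -> float:
    """Approximate FLOPs per byte for one decode step, counting weight traffic only."""
    flops_per_weight = 2 * batch_size  # one multiply-add per weight per sequence
    return flops_per_weight / bytes_per_weight

peak_flops = 1.0e15     # hypothetical accelerator: 1,000 TFLOP/s
hbm_bandwidth = 2.0e12  # hypothetical accelerator: 2,000 GB/s
ridge_point = peak_flops / hbm_bandwidth  # FLOPs/byte needed to become compute-bound

for batch_size in (1, 16, 256, 1024):
    intensity = decode_arithmetic_intensity(batch_size, bytes_per_weight=2.0)
    regime = "memory-bound" if intensity < ridge_point else "compute-bound"
    print(f"batch {batch_size:>4}: {intensity:>6.0f} FLOPs/byte vs ridge {ridge_point:.0f} -> {regime}")
```

If the estimate says memory-bound and the profiler agrees, more peak FLOPs will not help; fewer bytes per token or more sequences per weight read will.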

"Short requests are as slow as long ones"

Symptom: Small chats wait almost as long as large ones even when the queue looks short.
Cause: Static batching pads to the longest request and cannot reuse finished slots until the batch cycle ends.
Fix: Switch to continuous batching so completed requests leave immediately and queued work fills open slots on the next decode step.
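
The cost of padding to the longest request is easy to quantify. The sketch below counts slot-steps for one static batch; the helper name and the decode lengths are made up for illustration.

```python
def static_batch_slot_steps(output_lengths: list[int]) -> tuple[int, int]:
    """Return (occupied slot-steps, useful slot-steps) for one static batch."""
    steps = max(output_lengths)             # the batch runs until its longest request ends
    occupied = steps * len(output_lengths)  # every slot is held for the whole batch
    useful = sum(output_lengths)            # steps that actually produced a token
    return occupied, useful

output_lengths = [20, 50, 400, 30]  # made-up decode lengths for one batch
occupied, useful = static_batch_slot_steps(output_lengths)
wasted_fraction = 1 - useful / occupied

print(f"occupied slot-steps: {occupied}, useful: {useful}")
print(f"idle or padded share under static batching: {wasted_fraction:.1%}")
```

With continuous batching, the slot-steps freed by the three short requests can go to queued work, subject to KV memory and the latency SLO.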

"Speculative decoding made latency worse"

Symptom: Draft-model serving added overhead, but target-model passes did not fall enough to repay it.
Cause: Acceptance rate is too low or implementation overhead is too high. Wrong draft tokens force extra verification and correction work.
Fix: Measure accepted tokens per draft span before rollout. If most proposals are rejected, keep standard decoding or use a stronger draft path.

"HBM usage grows almost linearly with session count even when prefixes are shared"

Symptom: HBM usage spikes even though many sessions start from the same long system prompt or retrieved prefix.
Cause: The runtime is not reusing cached prefix KV state across independent requests, or its isolation policy does not allow that reuse.
Fix: Enable an explicit prefix-caching feature with an appropriate tenant and privacy policy. Copy-on-write can preserve shared blocks after reuse is established; it is not by itself a cross-request cache.
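
One common way to implement cross-request reuse is to key each KV block by a hash of the entire prefix up to that block. The sketch below is a hypothetical illustration, not any particular runtime's implementation: `block_keys`, `reusable_blocks`, the 16-token block size, and the tenant-seeded hash chain are all assumptions chosen to show how an isolation policy can be built into the cache key.

```python
import hashlib

BLOCK_TOKENS = 16  # assumed block size

def block_keys(tenant: str, token_ids: list[int]) -> list[str]:
    """Chain-hash each full block so a key identifies the whole prefix up to that block."""
    keys, parent = [], tenant  # seeding with the tenant keeps caches isolated per tenant
    full = len(token_ids) - len(token_ids) % BLOCK_TOKENS
    for start in range(0, full, BLOCK_TOKENS):
        block = token_ids[start:start + BLOCK_TOKENS]
        parent = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        keys.append(parent)
    return keys

def reusable_blocks(cache: set[str], keys: list[str]) -> int:
    """Count leading blocks already cached; reuse must stop at the first miss."""
    hits = 0
    for key in keys:
        if key not in cache:
            break
        hits += 1
    return hits

system_prompt = list(range(64))  # stand-in for a shared 64-token prefix
request_a = system_prompt + [900, 901, 902]
request_b = system_prompt + [950] * 20

cache = set(block_keys("tenant-1", request_a))
same_tenant = reusable_blocks(cache, block_keys("tenant-1", request_b))
other_tenant = reusable_blocks(cache, block_keys("tenant-2", request_b))
print(f"same tenant reuses {same_tenant} blocks, other tenant reuses {other_tenant}")
```

Only full blocks are shared here, and a different tenant gets no hits even for an identical prompt. Whether that strict isolation is the right policy depends on your privacy requirements.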

"Throughput keeps rising but users complain about slowness"

Symptom: Aggregate TPS looks better, but TTFT or TPOT percentiles get worse and chat feels sluggish.
Cause: Larger batches improve throughput while making active requests compete harder for bandwidth.
Fix: Use SLO-aware scheduling. Cap batch size or split latency-sensitive traffic from background work, and watch throughput together with p95 or p99 latency.
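
A simple form of SLO-aware capping is to pick the largest batch size whose measured tail latency still meets the target. The sketch below assumes you already have load-test results; the numbers, the helper name, and the two SLO values are hypothetical.

```python
from typing import Optional

def largest_batch_within_slo(p95_by_batch: dict[int, float], tpot_slo_ms: float) -> Optional[int]:
    """Pick the largest measured batch size whose p95 TPOT still meets the SLO."""
    passing = [batch for batch, p95_ms in p95_by_batch.items() if p95_ms <= tpot_slo_ms]
    return max(passing) if passing else None

# Hypothetical load-test results: batch size -> (aggregate tokens/s, p95 TPOT in ms).
load_test = {
    8: (400, 18.0),
    32: (1_300, 24.0),
    64: (2_100, 31.0),
    128: (2_900, 47.0),
    256: (3_300, 82.0),
}
p95_by_batch = {batch: p95 for batch, (_, p95) in load_test.items()}

interactive_cap = largest_batch_within_slo(p95_by_batch, tpot_slo_ms=35.0)
background_cap = largest_batch_within_slo(p95_by_batch, tpot_slo_ms=100.0)
print(f"interactive pool cap: {interactive_cap}, background pool cap: {background_cap}")
```

The two caps differ by 4x in this made-up data, which is the argument for splitting latency-sensitive traffic from background work rather than tuning one shared batch size.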

Try it yourself: the VRAM calculator

Here's a practical check you can do with pen and paper or a short Python script.

Problem: You want to serve a 7B-class model on a single GPU with 80 GiB of HBM. The model has 28 layers, uses Grouped Query Attention with 4 KV heads, head dimension 128, and you plan to use FP16 (2 bytes per element). Your average customer conversation has a 4,096-token context. What's the maximum batch size you can support before the KV cache alone fills the GPU, assuming you need to leave 20 GiB for weights and runtime overhead?

Hint: Start by computing the KV cache for one request, then see how many fit in the remaining 60 GiB.

Worked solution

For one request at 4,096 tokens:

text

KV bytes = 2 * batch * seq_len * layers * kv_heads * head_dim * dtype_bytes
         = 2 * 1 * 4096 * 28 * 4 * 128 * 2
         = 234,881,024 bytes
         = 0.21875 GiB (about 0.22 GiB) per request

Available memory for KV cache: 80 GiB total - 20 GiB reserved = 60 GiB

Maximum batch size = floor(60 GiB / 0.21875 GiB per request) = 274 requests. Dividing by the rounded 0.22 GiB figure gives about 273, so carry the unrounded value through the calculation and treat the result as a ballpark either way.

In practice, you'd run at a lower batch size to leave headroom for activation buffers, temporary tensors, allocator slack, and bursty long-context requests. A production engineer might cap this far below the arithmetic ceiling and monitor actual HBM usage.

vram-capacity-headroom.py

import math

kv_bytes = 2 * 1 * 4_096 * 28 * 4 * 128 * 2
kv_per_request_gib = kv_bytes / 1024**3
kv_budget_gib = 80 - 20
arithmetic_ceiling = math.floor(kv_budget_gib / kv_per_request_gib)

print(f"KV per request: {kv_per_request_gib:.5f} GiB")
print(f"KV budget after reserved memory: {kv_budget_gib} GiB")
print(f"arithmetic-only batch ceiling: {arithmetic_ceiling}")

Output

KV per request: 0.21875 GiB
KV budget after reserved memory: 60 GiB
arithmetic-only batch ceiling: 274
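
The same arithmetic shows how sensitive the ceiling is to context length and KV precision. The sketch below reuses the exercise's model shape and 60 GiB budget; the 16,384-token case and the 8-bit KV case are added assumptions, and the 8-bit figure ignores any scale metadata a real KV quantization scheme would store.

```python
import math

def batch_ceiling(kv_budget_gib: float, seq_len: int, dtype_bytes: float,
                  layers: int = 28, kv_heads: int = 4, head_dim: int = 128) -> int:
    """Arithmetic-only request ceiling for a given context length and KV precision."""
    kv_bytes = 2 * seq_len * layers * kv_heads * head_dim * dtype_bytes
    return math.floor(kv_budget_gib * 1024**3 / kv_bytes)

for seq_len in (4_096, 16_384):
    for label, dtype_bytes in (("FP16 KV", 2), ("8-bit KV", 1)):
        ceiling = batch_ceiling(60, seq_len, dtype_bytes)
        print(f"{seq_len:>6} tokens, {label}: {ceiling} requests")
```

Quadrupling the context cuts the ceiling to roughly a quarter, and halving KV bytes roughly doubles it, so context-length limits and KV precision are capacity knobs as much as model choice is.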

Mastery check

Evaluation rubric

Foundational: Why decode-heavy LLM serving is often memory-bandwidth bound even when peak FLOPs look huge.
Intermediate: How prefill and decode differ, and why TTFT and TPOT need separate dashboards.
Advanced: How KV-cache size scales with batch size, context length, layers, KV heads, head dimension, and precision.
Advanced: Why static batching wastes slots, while continuous batching uses iteration-level admission to keep decode work flowing.
Advanced: How PagedAttention uses virtual-memory-style blocks to pack KV state, and why cross-request prefix reuse requires explicit caching policy in addition to copy-on-write.
Advanced: When speculative decoding, disaggregated inference, context parallelism, and quantization help, and when their overheads can erase the win.
Advanced: How throughput, latency, and cost per token trade off, and how to pick an operating point from a product SLO instead of chasing one metric.

Follow-up questions

How does KV-cache size scale with sequence length and batch size?

The KV cache stores key and value tensors from previous tokens across all layers, so memory grows linearly with sequence length, batch size, number of layers, number of KV heads, head dimension, and bytes per element. A useful formula is 2 * batch * seq_len * layers * kv_heads * head_dim * dtype_bytes. GQA reduces the kv_heads term, and KV-cache quantization reduces dtype_bytes.

When does speculative decoding hurt performance?

Speculative decoding hurts when the draft path is inaccurate or expensive enough that extra draft work and rejection handling cost more than the target-model passes saved. Low acceptance, poor kernel fusion, or batches large enough that decode is already compute-bound can erase the win, because verification then competes for compute that ordinary decoding would have used.

How does tensor parallelism interact with batching?

Tensor parallelism splits layer math across multiple GPUs, so every decode step includes collective communication such as all-reduce or all-gather. Larger batches can amortize that communication overhead, but small interactive batches may become communication-bound before they become compute-bound. Batching and tensor parallelism need to be tuned together.

What are the tradeoffs between throughput and latency?

Serving systems balance aggregate throughput against per-request latency. Continuous batching can increase total tokens per second, but larger batches also make active requests compete for the same memory bandwidth. Production schedulers need latency SLOs, not only raw tokens/sec.

How do you estimate cost per token for a self-hosted model?

Take the GPU hourly rate, divide it by tokens per hour (sustained tokens per second times 3,600), then multiply by one million for cost per million tokens. The dominant variables are real-world sustained throughput (driven by batching, quantization, and model size) and utilization. A GPU sitting idle raises cost per token directly, so a 5x cheaper sticker price can still lose if it delivers less than one-fifth the throughput.
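
The comparison can be checked with a few lines of arithmetic. The sketch below reuses the $3.00 per hour and 2,500 tokens per second figures from the cost example; the cheaper GPU's price and throughput and the 50% utilization case are made-up values for illustration.

```python
def cost_per_million_tokens(hourly_cost: float, sustained_tps: float,
                            utilization: float = 1.0) -> float:
    """Cost per million tokens when the GPU is busy for `utilization` of each hour."""
    tokens_per_hour = sustained_tps * 3_600 * utilization
    return hourly_cost / tokens_per_hour * 1_000_000

premium = cost_per_million_tokens(hourly_cost=3.00, sustained_tps=2_500)
# Hypothetical cheaper GPU: 5x lower price but 6.25x lower sustained throughput.
budget = cost_per_million_tokens(hourly_cost=0.60, sustained_tps=400)
half_idle = cost_per_million_tokens(hourly_cost=3.00, sustained_tps=2_500, utilization=0.5)

print(f"premium GPU: ${premium:.2f} per million tokens")
print(f"budget GPU: ${budget:.2f} per million tokens")
print(f"premium GPU at 50% utilization: ${half_idle:.2f} per million tokens")
```

In this example the cheaper GPU costs more per token than the premium one, and halving utilization doubles the premium GPU's cost per token.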

A long RAG prefill is stalling active chat streams. Do you try chunked prefill or full disaggregation first?

Usually try chunked prefill first if the main problem is one large prompt monopolizing a shared worker for too long. Move to prefill-decode disaggregation when that interference remains large enough that separate pools and KV handoff beat their transfer and coordination overhead.

Course handoff

You now understand why serving capacity depends on phase behavior, memory bandwidth, KV-cache residency, scheduler policy, and precision choices. These tools let you read a slow serving trace and separate prefill queueing, decode bandwidth pressure, fragmented KV memory, and speculative-decoding overhead.

Next Step

Continue to Model Parallelism for LLM Inference

Batching and KV-cache planning show where serving memory goes; model parallelism teaches what changes when one production model must be split across several GPUs.


References

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.

Ainslie, J., et al. · 2023 · EMNLP 2023

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.

Agrawal, A., et al. · 2023 · arXiv preprint

Splitwise: Efficient Generative LLM Inference Using Phase Splitting.

Patel, P., et al. · 2023

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.

Zhong, Y., et al. · 2024 · OSDI 2024

Orca: A Distributed Serving System for Transformer-Based Generative Models.

Yu, G.-I., et al. · 2022 · OSDI 2022

Efficient Memory Management for Large Language Model Serving with PagedAttention.

Kwon, W., et al. · 2023 · SOSP 2023

Ring Attention with Blockwise Transformers for Near-Infinite Context.

Liu, H., et al. · 2024 · arXiv preprint

Fast Inference from Transformers via Speculative Decoding.

Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty.

Li, Y., et al. · 2024 · ICML 2024

FP8 Formats for Deep Learning.

Micikevicius, P., et al. · 2022

GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.

Frantar, E., et al. · 2023 · ICLR 2023

SnapKV: LLM Knows What You are Looking for Before Generation.

Li, Y., et al. · 2024


Context parallelism and long-context serving

How context parallelism works

Instead of splitting layers (tensor parallelism) or batches (data parallelism), CP splits the sequence dimension:

A 1M token sequence is divided into N chunks (e.g., 250K tokens per GPU on 4 GPUs)
Each GPU processes its chunk independently during the prefill phase
Attention is computed using ring-style communication patterns to handle cross-chunk dependencies
The KV cache is distributed across the GPU cluster

This approach becomes useful once a single request's context no longer fits comfortably on one accelerator.

Ring attention for context parallelism

context-parallel-kv-shards.py

def kv_gib(sequence_tokens: int) -> float:
    total_bytes = 2 * sequence_tokens * 80 * 8 * 128 * 2
    return total_bytes / 1024**3

total_kv = kv_gib(1_000_000)
devices = 4
print(f"one 1M-token request KV: {total_kv:.1f} GiB")
print(f"evenly sharded over {devices} devices: {total_kv / devices:.1f} GiB/device")
print("Communication and runtime buffers still add overhead.")

Output

one 1M-token request KV: 305.2 GiB
evenly sharded over 4 devices: 76.3 GiB/device
Communication and runtime buffers still add overhead.

Speculative decoding: the smart assistant

speculative-decoding-the-smart-assistant.py

import torch
import torch.nn.functional as F

def next_token_probs(model, input_ids: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    return F.softmax(logits, dim=-1)

def speculative_step(
    draft_model,
    target_model,
    input_ids: torch.Tensor,
    k: int = 4,
) -> list[int]:
    """
    Return one speculative chunk for a single sequence.
    Assumes input_ids has shape [1, seq_len].
    """
    assert input_ids.shape[0] == 1, "single-sequence example"

    draft_tokens: list[int] = []
    draft_dists: list[torch.Tensor] = []
    draft_input = input_ids

    for _ in range(k):
        q = next_token_probs(draft_model, draft_input)
        token = torch.multinomial(q, num_samples=1).item()
        draft_tokens.append(token)
        draft_dists.append(q)
        token_tensor = torch.tensor([[token]], device=input_ids.device)
        draft_input = torch.cat([draft_input, token_tensor], dim=1)

    # One target-model pass verifies all draft positions at once.
    with torch.no_grad():
        target_logits = target_model(draft_input).logits[0]

    target_dists = F.softmax(
        target_logits[input_ids.size(1) - 1 : input_ids.size(1) + k],
        dim=-1,
    )

    accepted: list[int] = []
    for i, token in enumerate(draft_tokens):
        p = target_dists[i]
        q = draft_dists[i]
        acceptance = min(1.0, (p[token] / q[token]).item())

        if torch.rand(()) < acceptance:
            accepted.append(token)
            continue

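        # Rejected: resample from the normalized residual max(p - q, 0) so the
        # emitted token still follows the target distribution, then stop.
        residual = torch.clamp(p - q, min=0)
        if residual.sum() <= 0:
            replacement = torch.argmax(p).item()
        else:
            residual = residual / residual.sum()
            replacement = torch.multinomial(residual, num_samples=1).item()
        return accepted + [replacement]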

    # If all k draft tokens are accepted, sample one extra token from p.
    extra = torch.multinomial(target_dists[k], num_samples=1).item()
    return accepted + [extra]

speculative-pass-accounting.py

ordinary_target_passes = 5
draft_proposals = 4
accepted_prefix = 3
target_verification_passes = 1
emitted_tokens = accepted_prefix + 1

print(f"ordinary target passes for {ordinary_target_passes} tokens: {ordinary_target_passes}")
print(f"one verification emits in this example: {emitted_tokens} tokens")
print(f"extra draft passes paid: {draft_proposals}")
print("Speedup requires cheap drafting and high acceptance.")

Output

ordinary target passes for 5 tokens: 5
one verification emits in this example: 4 tokens
extra draft passes paid: 4
Speedup requires cheap drafting and high acceptance.
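
The accounting above uses one fixed outcome. A rough way to reason about the average case is the expected-tokens formula from the speculative decoding paper by Leviathan et al., which assumes each draft token is accepted independently with probability `alpha`. The sketch below is a back-of-the-envelope estimate under that assumption. The names `alpha`, `k`, and `draft_cost_ratio` (cost of one draft pass relative to one target pass) are this example's labels, and the sample values are made up rather than measured.

```python
def expected_tokens_per_verification(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k drafted tokens."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per round divided by the round's cost in target-pass units."""
    cost_per_round = k * draft_cost_ratio + 1.0  # k draft passes + 1 verification
    return expected_tokens_per_verification(alpha, k) / cost_per_round

# Hypothetical operating points: (acceptance rate, draft cost ratio).
for alpha, ratio in [(0.8, 0.05), (0.3, 0.30)]:
    tokens = expected_tokens_per_verification(alpha, k=4)
    speedup = estimated_speedup(alpha, k=4, draft_cost_ratio=ratio)
    print(f"alpha={alpha}, draft cost={ratio}: {tokens:.2f} tokens/round, {speedup:.2f}x")
```

An estimate below 1.0x means the draft passes cost more than the accepted tokens save, which is how speculative decoding can stay exact and still be slower than ordinary decoding.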

Hardware-aware optimization: quantization and precision

Inference performance isn't just about algorithms. Modern GPUs and specialized accelerators expose hardware features, such as native low-precision formats, that reduce how many bytes each decode step has to move.

Low-precision inference and quantization

FP8 (8-bit floating point): Useful when your hardware and kernels support it, because it lowers bandwidth pressure while preserving more dynamic range than integer-only formats.^[11]
INT8/INT4-style weight quantization: Common for weight-only inference, where the goal is to shrink the bytes reread on every decode step without fully quantizing the rest of the runtime.^[12]
KV-cache compression or quantization: Targets the other major memory consumer during long-context serving, which matters once the KV cache rather than weights caps concurrency.^[13]

Quantization approaches include:

Weight-only quantization: Keep activations in higher precision (BF16/FP16) but compress model weights to 4-8 bits
KV-cache compression/quantization: Reduce KV bytes, or compress less useful KV state, when long contexts would otherwise cap batch size and residency
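
As a rough illustration of weight-only quantization, the sketch below applies plain round-to-nearest symmetric quantization with one scale per output row. This is the simplest possible scheme, not GPTQ or any production kernel, and the helper names (`quantize_per_row`, `dequantize`) are invented for this example. Codes are stored in `int8` containers for readability, whereas real INT4 kernels typically pack two codes per byte.

```python
import numpy as np

def quantize_per_row(weights: np.ndarray, bits: int = 4):
    """Symmetric round-to-nearest quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    scales = np.abs(weights).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales).astype(np.float32)  # avoid divide-by-zero
    codes = np.clip(np.round(weights / scales), -qmax, qmax).astype(np.int8)
    return codes, scales

def dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Rebuild approximate higher-precision weights before the matmul."""
    return codes.astype(np.float32) * scales

rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 64)).astype(np.float32)
activations = rng.standard_normal((64,)).astype(np.float32)  # stays in float

codes, scales = quantize_per_row(weights, bits=4)
approx = dequantize(codes, scales) @ activations
exact = weights @ activations
print(f"max weight error: {np.abs(weights - dequantize(codes, scales)).max():.4f}")
print(f"max output error: {np.abs(approx - exact).max():.4f}")
```

The activations never leave floating point. Only the stored weights shrink, which is why the benefit shows up mainly as fewer bytes read per decode step.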

Common mistake: Beginners often think 4-bit quantization is just about saving disk space. The real win is reducing bytes moved during decode. Whether that turns into a large latency win still depends on kernels, dequant overhead, and quality constraints.

weight-bytes-by-precision.py

parameters = 7_000_000_000
bytes_per_weight = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, width in bytes_per_weight.items():
    traffic_gb = parameters * width / 1_000_000_000
    print(f"{precision}: {traffic_gb:.1f} GB of weights per full read")

print("INT4 weight traffic is 0.25x FP16 before kernel overhead.")

Output

FP16: 14.0 GB of weights per full read
INT8: 7.0 GB of weights per full read
INT4: 3.5 GB of weights per full read
INT4 weight traffic is 0.25x FP16 before kernel overhead.
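
To connect those byte counts to latency, one common back-of-the-envelope bound says a decode step cannot finish faster than the time needed to read the weights once from HBM. The sketch below computes that floor. The bandwidth figure is a hypothetical round number chosen for illustration, not a spec for any particular GPU, and the estimate ignores KV-cache reads, dequantization cost, and kernel overhead.

```python
ASSUMED_HBM_GB_PER_S = 2_000  # hypothetical bandwidth, for illustration only

def min_decode_ms_per_step(parameters: int, bytes_per_weight: float,
                           hbm_gb_per_s: float) -> float:
    """Lower bound on one decode step: time to stream all weights once."""
    weight_gb = parameters * bytes_per_weight / 1_000_000_000
    return weight_gb / hbm_gb_per_s * 1_000

parameters = 7_000_000_000
for precision, width in {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}.items():
    floor_ms = min_decode_ms_per_step(parameters, width, ASSUMED_HBM_GB_PER_S)
    print(f"{precision}: at least {floor_ms:.2f} ms per decode step")
```

Because every sequence in a batch shares the same weight read, batching spreads this floor across more tokens, while lower precision lowers the floor itself.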

Inference-first silicon beyond GPUs

Common pitfalls

The following symptoms show up in production logs and profiling traces. If you see them, here's what they mean and how to fix them.

"Training optimizations don't speed up my inference"

"Short requests are as slow as long ones"

"Speculative decoding made latency worse"

"HBM usage grows almost linearly even when prefixes are shared"

"Throughput keeps rising but users complain about slowness"

Try it yourself: the VRAM calculator

Here's a practical check you can do with pen and paper or a short Python script.

Hint: Start by computing the KV cache for one request, then see how many fit in the remaining 60 GiB.

Worked solution

For one request at 4,096 tokens:

text

KV bytes = 2 * batch * seq_len * layers * kv_heads * head_dim * dtype_bytes
         = 2 * 1 * 4096 * 28 * 4 * 128 * 2
         = 234,881,024 bytes
         = 0.21875 GiB (≈ 0.22 GiB) per request

Available memory for KV cache: 80 GiB total - 20 GiB reserved = 60 GiB

Maximum batch size = 60 GiB / 0.21875 GiB per request ≈ 274.3, which floors to 274 requests. Dividing by the rounded 0.22 GiB figure instead gives about 273, so treat the result as a ballpark ceiling rather than a precise count.

vram-capacity-headroom.py

import math

kv_bytes = 2 * 1 * 4_096 * 28 * 4 * 128 * 2
kv_per_request_gib = kv_bytes / 1024**3
kv_budget_gib = 80 - 20
arithmetic_ceiling = math.floor(kv_budget_gib / kv_per_request_gib)

print(f"KV per request: {kv_per_request_gib:.5f} GiB")
print(f"KV budget after reserved memory: {kv_budget_gib} GiB")
print(f"arithmetic-only batch ceiling: {arithmetic_ceiling}")

Output

KV per request: 0.21875 GiB
KV budget after reserved memory: 60 GiB
arithmetic-only batch ceiling: 274
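
The same arithmetic can be wrapped in a small helper so other model shapes and budgets are easy to try. The sketch below is one possible way to do that. The `headroom` parameter is an assumption of this example: a fraction of the KV budget you are willing to fill, leaving the rest for fragmentation, runtime buffers, and bursts. The 0.9 value is illustrative, not a recommendation.

```python
import math

def kv_bytes_per_request(seq_len: int, layers: int, kv_heads: int,
                         head_dim: int, dtype_bytes: int = 2) -> int:
    """KV bytes for one request: K and V for every layer, head, and token."""
    return 2 * seq_len * layers * kv_heads * head_dim * dtype_bytes

def batch_ceiling(kv_budget_gib: float, per_request_bytes: int,
                  headroom: float = 1.0) -> int:
    """Requests that fit in the KV budget after applying a headroom fraction."""
    usable_bytes = kv_budget_gib * 1024**3 * headroom
    return math.floor(usable_bytes / per_request_bytes)

per_request = kv_bytes_per_request(seq_len=4_096, layers=28, kv_heads=4, head_dim=128)
print(f"arithmetic ceiling: {batch_ceiling(60, per_request)}")
print(f"with 10% held back: {batch_ceiling(60, per_request, headroom=0.9)}")
```

With no headroom the helper reproduces the ceiling from the worked solution. Holding some memory back gives a lower, more defensible starting point to validate under real traffic.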

Mastery check

Evaluation rubric

Foundational: Why decode-heavy LLM serving is often memory-bandwidth bound even when peak FLOPs look huge.
Intermediate: How prefill and decode differ, and why TTFT and TPOT need separate dashboards.
Advanced: How KV-cache size scales with batch size, context length, layers, KV heads, head dimension, and precision.
Advanced: Why static batching wastes slots, while continuous batching uses iteration-level admission to keep decode work flowing.
Advanced: How PagedAttention uses virtual-memory-style blocks to pack KV state, and why cross-request prefix reuse requires explicit caching policy in addition to copy-on-write.
Advanced: When speculative decoding, disaggregated inference, context parallelism, and quantization help, and when their overheads can erase the win.
Advanced: How throughput, latency, and cost per token trade off, and how to pick an operating point from a product SLO instead of chasing one metric.

Follow-up questions

How does KV-cache size scale with sequence length and batch size?

When does speculative decoding hurt performance?

How does tensor parallelism interact with batching?

What are the tradeoffs between throughput and latency?

How do you estimate cost per token for a self-hosted model?

A long RAG prefill is stalling active chat streams. Do you try chunked prefill or full disaggregation first?

Course handoff

Next Step

Continue to Model Parallelism for LLM Inference

Batching and KV-cache planning show where serving memory goes; model parallelism teaches what changes when one production model must be split across several GPUs.

References

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.

Ainslie, J., et al. · 2023 · EMNLP 2023

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.

Agrawal, A., et al. · 2023 · arXiv preprint

Splitwise: Efficient Generative LLM Inference Using Phase Splitting.

Patel, P., et al. · 2023

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.

Zhong, Y., et al. · 2024 · OSDI 2024

Orca: A Distributed Serving System for Transformer-Based Generative Models.

Yu, G.-I., et al. · 2022 · OSDI 2022

Efficient Memory Management for Large Language Model Serving with PagedAttention.

Kwon, W., et al. · 2023 · SOSP 2023

Ring Attention with Blockwise Transformers for Near-Infinite Context.

Liu, H., et al. · 2024 · arXiv preprint

Fast Inference from Transformers via Speculative Decoding.

Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty.

Li, Y., et al. · 2024 · ICML 2024

FP8 Formats for Deep Learning.

Micikevicius, P., et al. · 2022

GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.

Frantar, E., et al. · 2023 · ICLR 2023

SnapKV: Compressing KV Cache by Selecting Global Attention Patterns.

Li, Y., et al. · 2024

Scaling LLM Inference

Scaling LLM Inference

Why does a serving engineer separate LLM generation into prefill and decode instead of treating inference as one uniform operation?

The two phases of generation

Prefill: reading the prompt in one go

Key characteristics

A user sends a 20,000-token RAG prompt and then receives a 50-token answer. Which phase likely dominates TTFT, and why?

Decode: answering one word at a time

Key characteristics

Why does batching help decode even though every request still needs its own next token?

Why decode is memory-bound

You profile decode and see tensor cores idle while HBM bandwidth is near saturation. Should you first buy more FLOPs or reduce memory traffic?

The KV cache: saving state so you don't restart

Memory cost of KV cache

In the KV-cache formula, which terms can the model architecture reduce without shortening user context?

Throughput vs. latency trade-off

If total tokens/sec improves after you raise batch size but users complain that streams feel slower, what metric did you optimize at the expense of what metric?

The throughput, latency, cost triangle

Two teams serve the same model on the same GPU but report very different cost per million tokens. The interactive team is 5x more expensive. Why is that not necessarily a misconfiguration?

Chunked prefills

What does chunked prefill sacrifice, and what does it protect?

Disaggregated inference

The problem: conflicting optimization targets

Prefill-decode disaggregation

Benefits of disaggregation

When can prefill-decode disaggregation make latency worse instead of better?

Batching strategies: the loading-dock analogy

Static batching: waiting for the whole pallet

The problem with static batching

Why is static batching especially painful when output lengths vary widely?

Continuous batching: filling open dock slots

Benefits of continuous batching

What does "iteration-level scheduling" mean in continuous batching?

Memory management: PagedAttention (vLLM)

What problem does PagedAttention solve that continuous batching alone doesn't solve?

The problem

PagedAttention solution

Impact of PagedAttention

Copy-on-write for shared blocks

Why does copy-on-write matter for beam search, and what else is required to share prefixes across independent requests?

Context parallelism and long-context serving

How context parallelism works

Ring attention for context parallelism

What bottleneck does context parallelism target, and what bottleneck can it introduce?

Speculative decoding: the smart assistant

Why can speculative decoding be exact yet still slower than normal decoding?

Hardware-aware optimization: quantization and precision

Low-precision inference and quantization

Why does weight-only INT4 often help decode latency more directly than prefill latency?

Inference-first silicon beyond GPUs

Common pitfalls

"Training optimizations don't speed up my inference"

"Short requests are as slow as long ones"

"Speculative decoding made latency worse"

"HBM usage grows almost linearly even when prefixes are shared"

"Throughput keeps rising but users complain about slowness"

A dashboard shows high TTFT, rising prefill backlog, stable TPOT, and moderate KV memory. Which subsystem should you inspect first?

Try it yourself: the VRAM calculator

Worked solution

Why is the arithmetic maximum batch size from a KV-cache formula not a safe production setting?

Mastery check

Evaluation rubric

Follow-up questions

How does KV-cache size scale with sequence length and batch size?

When does speculative decoding hurt performance?

How does tensor parallelism interact with batching?

What are the tradeoffs between throughput and latency?

How do you estimate cost per token for a self-hosted model?

A long RAG prefill is stalling active chat streams. Do you try chunked prefill or full disaggregation first?

Course handoff
