Explains why decode-heavy LLM serving is often memory-bound and how KV-cache design, batching, PagedAttention, and speculative decoding improve scale.
The previous chapter showed how continuous batching keeps decode slots useful. This chapter asks the capacity question for scaling large language model (LLM) inference: when requests, model weights, and KV state all compete for HBM, what actually limits serving concurrency?
Imagine you run an online store and you want a chatbot that answers "Where's my order?" Every time a customer asks, the model has to generate a response one word at a time, and that can feel slow. The delay isn't because the model is "thinking." During decode, the serving stack keeps rereading billions of model weights from GPU memory and consulting a growing KV cache, and that memory movement takes time. Picture a fulfillment line where the routing map has to be reloaded for every single item. Reading the pick list is fast. Reopening the same giant routing map one item at a time is painfully slow.
This chapter ties together prefill, decode, batching, KV-cache memory, PagedAttention, disaggregation, speculative decoding, and quantization so serving bottlenecks become measurable instead of mysterious. The thread running through all of them is one decision: where to sit on the throughput, latency, and cost triangle for your workload. For the scheduling loop behind this chapter, see continuous batching.
LLM inference is distinct from training because it consists of two radically different computational phases: Prefill and Decode. Understanding this distinction is the first step to optimization.
Think of our order-tracking bot. When a customer sends "Where is order 48291?", the system first has to read and understand that entire sentence. That's the prefill phase. Then it starts answering, generating one word at a time: "Your", "order", "is", "in", "transit." That's the decode phase.
In the prefill phase, the model processes the entire user prompt in parallel. This is similar to training: the GPU receives a matrix of shape [batch_size, prompt_len, hidden_dim] and computes attention for all tokens simultaneously.
Because all the input tokens are known upfront, the attention mechanism can compute the interactions between every token in the prompt at once. This parallel processing allows the GPU to use its massive matrix multiplication engines efficiently. A long prefill usually dominates Time To First Token (TTFT), though TTFT also includes queueing and scheduling delay before the first output token is emitted. The figure below shows how all tokens in the prompt are processed simultaneously to generate the first output token.
Once the first token is generated, the model switches to autoregressive generation. It generates one token at a time, feeding it back as input for the next step.
Unlike the prefill phase, decoding can't be parallelized across tokens because each new token depends on the previous ones. The system is locked into a sequential, step-by-step loop. The speed at which tokens are produced in this phase is measured as Time Per Output Token (TPOT). Its inverse, tokens per second (TPS) for a single stream, dictates how fast the text streams to the user. While TTFT affects perceived responsiveness, TPOT determines the "reading speed" of the generation. The following figure shows this autoregressive process, where each generated token is fed back as input for the next step.
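Both metrics can be computed from per-token emission timestamps. The sketch below is a minimal illustration, not a production metrics pipeline: the `latency_metrics` helper and the trace values are hypothetical, and it assumes timestamps are in seconds on one clock.

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, mean TPOT, and per-stream tokens/s from emission timestamps (seconds)."""
    if not token_times:
        raise ValueError("need at least one emitted token")
    # TTFT covers queueing, scheduling, and prefill up to the first output token.
    ttft = token_times[0] - request_start
    # TPOT averages the gaps between consecutive output tokens after the first.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    tokens_per_s = 1.0 / tpot if tpot > 0 else float("inf")
    return {"ttft_s": ttft, "tpot_s": tpot, "tokens_per_s": tokens_per_s}


# Hypothetical trace: request arrives at t=0, first token at 0.40 s, then one token every 25 ms.
trace = [0.40 + 0.025 * i for i in range(5)]
metrics = latency_metrics(0.0, trace)

print(f"TTFT: {metrics['ttft_s'] * 1000:.0f} ms")
print(f"TPOT: {metrics['tpot_s'] * 1000:.0f} ms")
print(f"per-stream speed: {metrics['tokens_per_s']:.0f} tokens/s")
```

In a real service you would track percentiles of these values across requests rather than a single trace.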
Decode-heavy LLM serving is often memory-bandwidth bound, not compute-bound. In a compute-bound operation, the system is bottlenecked by the mathematical calculations it must perform. Training and long-prefill workloads typically expose much larger matrix operations than interactive decode, so they can drive compute hardware more effectively.
During token generation, the bottleneck often shifts toward memory movement. Each new token needs model weights and attention state, but contributes only one new position per active request. At modest decode batches, this produces low arithmetic intensity and makes HBM traffic a central constraint.
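A rough roofline check makes "low arithmetic intensity" concrete. The sketch below counts only weight traffic and ignores KV reads and activations, so it is an optimistic estimate. The accelerator figures (1,000 TFLOP/s peak math, 2,000 GB/s HBM) are hypothetical round numbers, not a specific GPU.

```python
def decode_arithmetic_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for one decode step (weights only)."""
    # Each weight does one multiply-add (2 FLOPs) per active token,
    # but is read from HBM once per step no matter how many tokens share it.
    flops_per_weight = 2.0 * batch_size
    return flops_per_weight / bytes_per_weight


# Hypothetical accelerator, chosen only to illustrate the ratio.
peak_flops = 1_000e12
hbm_bytes_per_s = 2_000e9
ridge_point = peak_flops / hbm_bytes_per_s  # FLOPs/byte needed to keep math units busy

for batch in (1, 8, 64, 512):
    intensity = decode_arithmetic_intensity(batch)
    regime = "memory-bound" if intensity < ridge_point else "compute-bound"
    print(f"batch {batch:>3}: {intensity:>5.0f} FLOPs/byte vs ridge {ridge_point:.0f} -> {regime}")
```

Under these assumptions FP16 decode stays memory-bound until the batch reaches hundreds of active requests, which is why batching is the first lever for decode throughput.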
To make this concrete, imagine a model with 7 billion parameters stored in 16-bit precision. Its weights occupy about 14 GB in decimal units. If one uncached decode step had to read that full weight footprint for one active token, the weight-read lower bound alone would be about 14 GB per step. Real kernels, cache reuse, batch size, tensor parallelism, and KV traffic determine the observed bandwidth cost.
When profiling confirms this bandwidth ceiling, serving work should focus on bytes moved, cache residency, batch policy, and queueing behavior rather than only raw floating-point throughput.
```python
parameters = 7_000_000_000
bytes_per_parameter = 2  # FP16
ideal_hbm_bandwidth_gb_s = 2_000

weight_bytes = parameters * bytes_per_parameter
ideal_steps_per_second = ideal_hbm_bandwidth_gb_s * 1_000_000_000 / weight_bytes

print(f"FP16 weight footprint: {weight_bytes / 1_000_000_000:.2f} GB")
print(f"ideal weight-read upper bound: {ideal_steps_per_second:.1f} single-token steps/s")
print("Observed TPS is lower once KV reads and runtime overhead are included.")
```

```text
FP16 weight footprint: 14.00 GB
ideal weight-read upper bound: 142.9 single-token steps/s
Observed TPS is lower once KV reads and runtime overhead are included.
```

Without caching, every new token requires recomputing attention over all previous tokens. The KV cache stores the Key and Value matrices for all past tokens, so we only need to compute them for the new token.
Think of it like a shift handoff log. Without it, our order-tracking bot would have to reread the entire customer conversation from the beginning every time it wanted to say the next word. With the KV cache, it remembers what it already understood and only processes the newest token.
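To put a number on the handoff-log analogy, the sketch below counts how many token positions need their keys and values projected (per layer) during decode, with and without a cache. It is a counting model only: the `kv_projection_count` helper and the 50-token prompt are hypothetical, and prefill work is excluded from both totals.

```python
def kv_projection_count(prompt_len: int, num_generated: int, use_cache: bool) -> int:
    """Count per-layer K/V projections of token positions across all decode steps."""
    total = 0
    for step in range(1, num_generated + 1):
        visible_positions = prompt_len + step  # positions whose K/V this step attends to
        # Without a cache every visible position is re-projected;
        # with a cache only the newest position is projected and appended.
        total += 1 if use_cache else visible_positions
    return total


without_cache = kv_projection_count(prompt_len=50, num_generated=100, use_cache=False)
with_cache = kv_projection_count(prompt_len=50, num_generated=100, use_cache=True)

print(f"without cache: {without_cache} K/V projections per layer")
print(f"with cache:    {with_cache} K/V projections per layer")
print(f"work avoided:  {without_cache / with_cache:.1f}x")
```

The cache trades that recomputation for memory, which is why the next concern is how much HBM the stored keys and values occupy.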
The illustration here zooms in on a different but equally important serving concern: once you keep KV states around, you need to pack them efficiently in GPU memory instead of reserving one giant contiguous region per request.
The KV cache is often the largest consumer of GPU memory during inference, sometimes exceeding the model weights themselves for long contexts. This is crucial for capacity planning and determining the maximum batch size a given GPU can support.
Let's work through a concrete example by hand before showing the code. Suppose we're serving our order-tracking bot with a model that has 80 layers, uses Grouped Query Attention with 8 KV heads, and each head has dimension 128. For one request with a sequence length of 8,192 tokens, stored in FP16 (2 bytes per element), the cache holds 2 (K and V) × 8,192 tokens × 80 layers × 8 KV heads × 128 dimensions × 2 bytes = 2,684,354,560 bytes, which is exactly 2.5 GiB.
Now scale that up. For a production batch of 64 concurrent requests at an 8K context window, that's about 160 GiB of KV cache alone. This is why techniques like Grouped Query Attention (GQA), which reduces the number of KV heads from num_heads to num_kv_heads, are standard in modern models.[2]
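As a quick illustration of what GQA buys, the sketch below compares the same 8K request with and without it. The 64 query heads are an assumption made for the comparison (the example model above only specifies its 8 KV heads), and `kv_gib_per_request` is a hypothetical helper.

```python
def kv_gib_per_request(seq_len: int, num_layers: int, kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> float:
    """KV-cache size in GiB for one request (factor 2 covers keys and values)."""
    return 2 * seq_len * num_layers * kv_heads * head_dim * dtype_bytes / 1024**3


# Assumption: 64 query heads. Without GQA, each query head keeps its own K/V head.
full_heads = kv_gib_per_request(seq_len=8192, num_layers=80, kv_heads=64, head_dim=128)
gqa_heads = kv_gib_per_request(seq_len=8192, num_layers=80, kv_heads=8, head_dim=128)

print(f"64 KV heads (no GQA): {full_heads:.1f} GiB per request")
print(f" 8 KV heads (GQA):    {gqa_heads:.1f} GiB per request")
print(f"reduction: {full_heads / gqa_heads:.0f}x")
```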
The following Python function generalizes that exact calculation. It takes the model's architectural parameters and returns the KV cache memory in GiB.
```python
def kv_cache_memory(
    batch_size: int,
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # FP16
) -> float:
    """Calculate KV cache memory in GiB."""
    # 2 for K and V, per layer, per head
    total_bytes = (
        2 * batch_size * seq_len * num_layers * num_kv_heads * head_dim * dtype_bytes
    )
    return total_bytes / (1024 ** 3)

# Example model: 80 layers, 8 KV heads (GQA), head_dim=128
# Batch=1, seq_len=8192, FP16 (2 bytes):
# 2 * 1 * 8192 * 80 * 8 * 128 * 2 = ~2.5 GiB per request
one_request = kv_cache_memory(
    batch_size=1,
    seq_len=8192,
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
)
production_batch = kv_cache_memory(
    batch_size=64,
    seq_len=8192,
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
)

print(f"one 8K request: {one_request:.1f} GiB")
print(f"64 active 8K requests: {production_batch:.0f} GiB")
print("single-request estimate correct:", one_request == 2.5)
print("64-request estimate correct:", production_batch == 160.0)
```

```text
one 8K request: 2.5 GiB
64 active 8K requests: 160 GiB
single-request estimate correct: True
64-request estimate correct: True
```

There's an inherent tension between maximizing system throughput and minimizing per-request latency.
| Metric | Optimized By | Trade-off |
|---|---|---|
| Throughput (tokens/sec) | Larger effective batches | Can increase TTFT or inter-token latency once shared resources are pressured. |
| Latency (ms/token) | Smaller admitted batches | Can leave throughput unused and raise cost per token. |
Production tip: Monitor GPU KV-cache usage, prefill backlog, and decode queue depth together. High KV usage plus rising TTFT usually means memory pressure is capping concurrency. Low KV usage with idle compute means you're leaving throughput on the table.
Throughput and latency are two corners of a third constraint that the business actually cares about: cost per token. These three pull against each other, and picking where to sit on that triangle is the central job of an inference engineer.
Cost per token is simpler than it looks. If you rent a GPU at a fixed hourly rate and it sustains some number of tokens per second, then cost per million tokens = hourly cost / (tokens per second × 3,600) × 1,000,000.
Sustained throughput, not the sticker hourly rate, dominates the answer. A faster, pricier GPU can still be cheaper per token if its throughput rises faster than its price. Let's work an example by hand. Suppose one GPU costs $3.00/hour and a well-batched deployment sustains 2,500 decode tokens/second across all active requests. That is 2,500 × 3,600 = 9,000,000 tokens per hour, so the cost is $3.00 / 9,000,000 × 1,000,000 ≈ $0.33 per million tokens.
Now starve the batch. If under-configured batching or idle capacity drops sustained throughput to 250 tokens/second, the same GPU-hour spreads over one-tenth the tokens, so cost per million jumps to about $3.33. Utilization is a direct 10x multiplier on cost. This is why batching is not only a latency knob; it moves the cost corner of the triangle.
```python
def cost_per_million(hourly_cost: float, sustained_tps: int) -> float:
    return hourly_cost / (sustained_tps * 3600) * 1_000_000

well_batched = cost_per_million(3.00, 2_500)
starved = cost_per_million(3.00, 250)

print(f"2,500 tokens/s: ${well_batched:.2f} per million tokens")
print(f"250 tokens/s: ${starved:.2f} per million tokens")
print(f"cost multiplier: {starved / well_batched:.0f}x")
```

```text
2,500 tokens/s: $0.33 per million tokens
250 tokens/s: $3.33 per million tokens
cost multiplier: 10x
```

The triangle has a simple rule: you can usually optimize two corners hard, but the third drifts. Push batch size for throughput and cost, and tail latency rises. Cap batch size for tight latency SLOs, and your cost per token climbs because the GPU is underused. There is no single best operating point, only the one that fits your product's latency SLO at acceptable cost.
| Operating point | Batch size | Cost per token | Latency (TTFT/TPOT) | Typical fit |
|---|---|---|---|---|
| Latency-first | Small | High | Low | Interactive chat, code completion |
| Balanced | Medium | Medium | Medium | General chat assistants |
| Throughput-first | Large | Low | High | Offline batch jobs, summarization, evals |
Production tip: Pick the operating point from the product SLO, then size hardware to it. An interactive assistant with a 500 ms TTFT budget can't run the same batch size as an overnight document-summarization job, even on identical GPUs. The summarization job can push batch size until cost per token bottoms out because no human is waiting on each token.
Long prefills (e.g., Retrieval-Augmented Generation (RAG), where retrieved documents are appended to the user's prompt, creating contexts of 10,000 tokens) can delay decode turns on a shared worker. The duration depends on model, kernel, hardware, and prompt length, but enough long prompts can noticeably worsen active streams in a multi-tenant service.
To reduce that interference, engineers can break large prefills into smaller, fixed-size chunks (e.g., 512 tokens). The system admits a chunk, gives active decodes another scheduling opportunity, and later processes the next chunk. This bounds admitted prefill work per turn, but does not guarantee a latency SLO when queues or kernels are already overloaded. It often trades a higher TTFT for the long request against improved TPOT for existing streams.[3]
```python
import math

prompt_tokens = 10_000
chunk_tokens = 512
turns = math.ceil(prompt_tokens / chunk_tokens)
last_chunk = prompt_tokens - chunk_tokens * (turns - 1)

print(f"prefill turns: {turns}")
print(f"largest admitted prefill slice: {chunk_tokens} tokens")
print(f"last slice: {last_chunk} tokens")
```

```text
prefill turns: 20
largest admitted prefill slice: 512 tokens
last slice: 272 tokens
```

Disaggregated inference separates prefill and decode across worker pools rather than running both phases on one worker. Systems such as Splitwise and DistServe show when this can improve goodput: avoided interference must repay KV-transfer and coordination overhead.[4][5]
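Prefill and decode phases have opposite hardware needs: prefill is compute-heavy because it processes the whole prompt in parallel, while decode is memory-bandwidth-heavy because every step rereads weights and KV state to emit one token per active request.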
When both phases run on the same GPU, they interfere. A long prefill can "block" decode requests, causing head-of-line blocking where a massive prompt stalls generation for all other users.
The solution is architectural separation:
Prefill cluster (compute-optimized): Dedicated prefill workers process incoming prompts in parallel. These nodes excel at the compute-heavy attention operations.
Decode cluster (bandwidth-optimized): Separate workers handle token generation. These nodes are tuned for the memory-bound sequential decoding loop and steady high-concurrency decode traffic.
KV cache handoff: After prefill completes, the KV cache is transferred over a fast interconnect from the prefill worker to a decode worker, which continues generation.
Disaggregation is a design pattern, not a mandatory default. The KV-transfer cost has to be lower than the queueing and interference it removes, which is why it becomes more attractive as prompts get longer and decode traffic gets denser.[4][5]
Disaggregation also changes how you autoscale. Because prefill load tracks incoming prompt tokens and decode load tracks active generation, the two pools scale on different signals. A burst of long RAG prompts points toward more prefill capacity; a surge in concurrent streaming conversations points toward more decode capacity. Useful signals include queue depth, KV-cache utilization, and TTFT/TPOT percentiles, with cold-start time included because loading model weights onto a new worker is not instant.
```python
kv_cache_gib = 2.5
interconnect_gb_s = 200

transfer_bytes = kv_cache_gib * 1024**3
ideal_transfer_ms = transfer_bytes / (interconnect_gb_s * 1_000_000_000) * 1000

print(f"KV state to transfer: {kv_cache_gib:.1f} GiB")
print(f"ideal one-way transfer floor at {interconnect_gb_s} GB/s: {ideal_transfer_ms:.1f} ms")
print("Queueing saved must exceed transfer plus scheduling overhead.")
```

```text
KV state to transfer: 2.5 GiB
ideal one-way transfer floor at 200 GB/s: 13.4 ms
Queueing saved must exceed transfer plus scheduling overhead.
```

Naive batching strategies lead to significant inefficiency due to the variable length of text. Picture a warehouse loading dock handling shipments that take different amounts of time to prepare.
In static batching, the dock waits for 4 parcels before releasing the pallet. If one parcel needs 10 extra minutes of labeling, the other 3 sit ready but blocked. In serving terms, we group requests into a batch and pad them to the length of the longest active sequence. The batch membership stays fixed for that run, so when one request finishes early its slot often turns into padding or sits idle until the longest request finishes. The timeline below illustrates how shorter requests waste compute cycles while the batch waits on the longest request.
Static batching creates two significant inefficiencies. First, the GPU is forced to process "padding tokens" that don't contribute to the final output, wasting valuable compute cycles and memory bandwidth. Second, once shorter requests finish, their batch slots usually can't be reused until the scheduler rebuilds the batch around the longest surviving sequence. This degrades both latency and overall throughput, especially when request lengths vary widely.
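The idle-slot cost is easy to estimate. The sketch below assumes a static batch that keeps every slot occupied until its longest reply finishes. The `static_batch_waste` helper and the four reply lengths are hypothetical, and only decode steps are counted.

```python
def static_batch_waste(output_lengths: list[int]) -> dict[str, float]:
    """Estimate decode slots wasted when a static batch runs until its longest reply ends."""
    longest = max(output_lengths)
    total_slots = longest * len(output_lengths)  # every slot is held for `longest` steps
    useful_slots = sum(output_lengths)           # steps that produced a real output token
    wasted_slots = total_slots - useful_slots
    return {
        "total_slots": total_slots,
        "useful_slots": useful_slots,
        "wasted_fraction": wasted_slots / total_slots,
    }


# Hypothetical batch of four replies, measured in output tokens.
report = static_batch_waste([20, 50, 400, 60])

print(f"slot-steps reserved: {report['total_slots']}")
print(f"slot-steps producing tokens: {report['useful_slots']}")
print(f"wasted: {report['wasted_fraction']:.1%}")
```

One long reply in a batch of short ones is enough to leave most slot-steps idle, which is the gap iteration-level scheduling targets.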
Continuous batching (introduced by Orca) operates at the iteration level.[6] The dock releases one finished parcel, can pull the next queued parcel into the open slot, and keeps useful work flowing when demand exists.
In serving terms, instead of waiting for a whole batch cycle to finish, the scheduler can eject completed requests and insert new ones after every token generation step. See our continuous batching deep-dive for scheduling algorithms and preemption strategies. The timeline above shows slot reuse; its actual benefit depends on queued work, KV capacity, and latency policy.
Continuous batching provides three useful advantages under mixed, queued workloads:
To implement continuous batching, systems use a scheduling loop that manages active requests dynamically. The following sketch shows the shape of a continuous batcher. It takes a queue of incoming requests, processes prefill for new requests up to the maximum batch size, and then runs a single decoding step for all active requests.
```python
import torch


# Minimal request stub for illustration
class Request:
    def is_done(self) -> bool:
        return False

    def get_next_token(self) -> torch.Tensor:
        return torch.tensor([0])

    def update(self, logits: torch.Tensor) -> None:
        pass


class ContinuousBatcher:
    def __init__(self, model: torch.nn.Module, max_batch_size: int = 64):
        # Assumption: `model` is a serving wrapper that exposes a batched decode()
        # method; a plain torch.nn.Module does not define one.
        self.model = model
        self.max_batch = max_batch_size
        self.active_requests: list[Request] = []
        self.queue: list[Request] = []

    def step(self):
        """
        Executes a single generation step for the current batch.
        """
        # 1. Remove completed requests
        self.active_requests = [
            req for req in self.active_requests if not req.is_done()
        ]

        # 2. Add new requests from queue (up to max batch size)
        while self.queue and len(self.active_requests) < self.max_batch:
            new_req = self.queue.pop(0)
            # Run prefill for the new request (often done in parallel or on a separate stream)
            # pseudo-code: new_req.run_prefill(self.model)
            self.active_requests.append(new_req)

        # 3. Run one decode step for all active requests
        if self.active_requests:
            # Gather current input tokens from all requests
            input_tokens = torch.stack([req.get_next_token() for req in self.active_requests])

            # Forward pass (batched)
            logits = self.model.decode(input_tokens)

            # Update requests with new tokens
            for i, req in enumerate(self.active_requests):
                req.update(logits[i])
```

Traditional KV-cache allocation is like reserving a full pallet position for every request, even if it only needs a small bin. PagedAttention is like a shared bin system that assigns fixed-size slots on demand: Request A gets slots 7, 2, and 5 (non-contiguous, but tracked by a block table). When a request finishes, its slots become available again. Paging sharply reduces worst-case reservation waste, but a partially filled final block and bookkeeping still consume memory.
KV cache is allocated per-request, but request lengths vary. Pre-allocating the maximum possible sequence length for every request wastes a massive amount of memory. For example, if the system allocates 4096 tokens per request by default:
| Request | Tokens Needed | Tokens Allocated | Memory Wasted |
|---|---|---|---|
| Request A | 100 | 4096 | 97.6% |
| Request B | 3000 | 4096 | 26.8% |
PagedAttention applies the operating system concept of virtual memory to KV cache management.[7] Instead of contiguous physical memory, we divide the KV cache into fixed-size "blocks" (pages). The following figure illustrates how logical blocks map to non-contiguous physical GPU memory via a block table.
By avoiding large contiguous reservations and allocating fixed-size blocks on demand, PagedAttention lets the runtime fit more useful KV state into the same HBM budget.[7] It does not eliminate all slack: each live request can still leave a partially filled last block, and the block table has overhead. In practice, it substantially reduces memory lost to worst-case preallocation.
```python
import math

requests = [100, 3_000]
max_context = 4_096
block_tokens = 16
reserved_tokens = len(requests) * max_context
paged_tokens = sum(math.ceil(tokens / block_tokens) * block_tokens for tokens in requests)

print(f"max-context reservation: {reserved_tokens} token slots")
print(f"paged allocation: {paged_tokens} token slots")
print(f"remaining final-block slack: {paged_tokens - sum(requests)} token slots")
```

```text
max-context reservation: 8192 token slots
paged allocation: 3120 token slots
remaining final-block slack: 20 token slots
```

PagedAttention's copy-on-write mechanism matters whenever multiple active continuations share the same prompt prefix. In the original vLLM setting, this is especially important for beam search and parallel sampling, where several continuations reuse the same prompt blocks before they diverge.[7] The figure below shows two continuations initially pointing to the same shared prefix blocks before branching.
Initially, the shared prefix blocks have a reference count greater than one. Appending new tokens usually allocates fresh blocks for each continuation. If a continuation needs to write into a block that's still shared, the runtime first clones that block so the other continuations keep their original view. That's what copy-on-write means here: share immutable prefix state aggressively, then split only when sequences diverge.[7]
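The reference-counting logic can be sketched in a few lines. This is a toy model, not vLLM's implementation or API: `BlockPool` and its methods are hypothetical, block capacity is ignored, and "KV data" is just a list of token IDs.

```python
class BlockPool:
    """Toy physical-block pool with reference counts (hypothetical, for illustration)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount: dict[int, int] = {}
        self.data: dict[int, list[int]] = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        self.data[block] = []
        return block

    def share(self, block: int) -> int:
        """Let another sequence point at the same physical block."""
        self.refcount[block] += 1
        return block

    def write(self, block: int, token: int) -> int:
        """Append a token, cloning first if the block is shared. Returns the block written."""
        if self.refcount[block] > 1:
            self.refcount[block] -= 1
            clone = self.allocate()
            self.data[clone] = list(self.data[block])  # copy existing contents
            block = clone
        self.data[block].append(token)
        return block


pool = BlockPool(num_blocks=8)
prefix = pool.allocate()
for tok in (11, 12, 13):
    pool.write(prefix, tok)

# Two continuations start with block tables pointing at the same prefix block.
table_a = [prefix]
table_b = [pool.share(prefix)]

# Continuation B writes into the shared block, so the pool clones it first.
table_b[-1] = pool.write(table_b[-1], 99)
# Continuation A now holds the only reference and writes in place.
table_a[-1] = pool.write(table_a[-1], 42)

print("A:", pool.data[table_a[-1]])
print("B:", pool.data[table_b[-1]])
```

The writer pays for the copy only at the moment of divergence; until then both block tables cost one physical block.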
As context windows grow into the hundreds of thousands or millions of tokens, a single GPU often can't hold the full KV cache or attention working set for one request. Context Parallelism (CP) addresses this by splitting the input sequence itself across multiple GPUs.[8]
Instead of splitting layers (tensor parallelism) or batches (data parallelism), CP splits the sequence dimension:
This approach becomes useful once a single request's context no longer fits comfortably on one accelerator.
Modern implementations use ring attention (Liu et al., 2024) or similar distributed attention algorithms that minimize communication overhead. GPUs form a logical ring, passing partial attention results to their neighbors until the full context is covered. This can extend supported context length roughly linearly with the number of devices, at least until communication becomes the next bottleneck.[8]
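The reason partial results can be passed around a ring is that softmax attention can be accumulated shard by shard with a running maximum and running normalizer. The sketch below simulates that merge for a single query vector on one process. It is an illustration of the accumulation rule under simplifying assumptions (no `1/sqrt(d)` scaling, no masking, no real communication), and all function names and data are hypothetical.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference result: softmax(q . K) V computed over the whole context at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm
            for d in range(len(values[0]))]


def ring_attention(q, key_shards, value_shards):
    """Visit one KV shard per ring hop, merging partial sums with a running max."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(value_shards[0][0])
    for keys, values in zip(key_shards, value_shards):
        scores = [dot(q, k) for k in keys]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # rescale what earlier hops accumulated
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

reference = full_attention(q, keys, values)
sharded = ring_attention(q, [keys[:2], keys[2:]], [values[:2], values[2:]])
print("max abs difference:", max(abs(a - b) for a, b in zip(reference, sharded)))
```

Because the merged result matches the single-device result up to floating-point error, no device ever needs the full key and value set in memory at once.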
```python
def kv_gib(sequence_tokens: int) -> float:
    total_bytes = 2 * sequence_tokens * 80 * 8 * 128 * 2
    return total_bytes / 1024**3

total_kv = kv_gib(1_000_000)
devices = 4
print(f"one 1M-token request KV: {total_kv:.1f} GiB")
print(f"evenly sharded over {devices} devices: {total_kv / devices:.1f} GiB/device")
print("Communication and runtime buffers still add overhead.")
```

```text
one 1M-token request KV: 305.2 GiB
evenly sharded over 4 devices: 76.3 GiB/device
Communication and runtime buffers still add overhead.
```

Speculative decoding works like a fulfillment shortcut. A small draft model proposes the next few support-message tokens. The large target model checks the proposed span in a verification pass. If enough draft tokens survive acceptance, this can replace several target decode passes; if not, draft and verification work can lose to ordinary decoding.
Think of it as a senior routing checker (the large target model) and a fast draft scanner (the small draft model). The scanner guesses the next 5 tokens. The checker looks at the span together. If the first 3 are accepted, one target verification pass can emit those accepted tokens plus a correction or continuation token. The accounting must still include the scanner's draft work.
To visualize this, consider how speculative decoding coordinates the interaction between the two models.[9] We use a fast, small draft model to propose a sequence of multiple tokens. The large target model then verifies these proposed tokens in parallel, accepting correct ones and correcting any mistakes.
The key insight of speculative decoding is the asymmetric cost of the two paths. A draft pass is much cheaper than a target pass, while verification can score all k draft positions in one target-model pass. If acceptance is high, one target pass can replace several ordinary target decode passes. Total latency still includes draft generation, verification, sampling, rejected-token correction, and kernel overhead.
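A back-of-the-envelope model shows when that asymmetry pays off. The sketch below uses the expected-token analysis from Leviathan et al.[9] and assumes each draft token is accepted independently with the same probability, which real workloads only approximate. The function names, acceptance rates, and cost ratios are hypothetical.

```python
def expected_tokens_per_pass(acceptance_rate: float, k: int) -> float:
    """Expected tokens emitted per target verification pass with k draft tokens."""
    if acceptance_rate >= 1.0:
        return float(k + 1)  # all drafts accepted plus one extra target token
    return (1 - acceptance_rate ** (k + 1)) / (1 - acceptance_rate)


def estimated_speedup(acceptance_rate: float, k: int, draft_cost_ratio: float) -> float:
    """Estimated speedup; draft_cost_ratio = cost of one draft pass / one target pass."""
    cost_per_step = k * draft_cost_ratio + 1  # k draft passes plus one target pass
    return expected_tokens_per_pass(acceptance_rate, k) / cost_per_step


good = estimated_speedup(acceptance_rate=0.8, k=4, draft_cost_ratio=0.05)
poor = estimated_speedup(acceptance_rate=0.3, k=4, draft_cost_ratio=0.30)

print(f"tokens per pass at 80% acceptance: {expected_tokens_per_pass(0.8, 4):.2f}")
print(f"cheap, accurate draft: {good:.2f}x")
print(f"slow, inaccurate draft: {poor:.2f}x")
```

A result below 1.0x means speculation is slower than ordinary decoding under these assumptions, before counting kernel and scheduling overhead.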
The function below shows one exact speculative step for a single sequence. It first samples k draft tokens from the small model, then runs the large target model once on [prompt + draft_tokens], and finally performs the accept/reject test from Leviathan et al. with residual resampling on the first mismatch.[9]
```python
import torch
import torch.nn.functional as F

def next_token_probs(model, input_ids: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    return F.softmax(logits, dim=-1)

def speculative_step(
    draft_model,
    target_model,
    input_ids: torch.Tensor,
    k: int = 4,
) -> list[int]:
    """
    Return one speculative chunk for a single sequence.
    Assumes input_ids has shape [1, seq_len].
    """
    assert input_ids.shape[0] == 1, "single-sequence example"

    draft_tokens: list[int] = []
    draft_dists: list[torch.Tensor] = []
    draft_input = input_ids

    for _ in range(k):
        q = next_token_probs(draft_model, draft_input)
        token = torch.multinomial(q, num_samples=1).item()
        draft_tokens.append(token)
        draft_dists.append(q)
        token_tensor = torch.tensor([[token]], device=input_ids.device)
        draft_input = torch.cat([draft_input, token_tensor], dim=1)

    # One target-model pass verifies all draft positions at once.
    with torch.no_grad():
        target_logits = target_model(draft_input).logits[0]

    target_dists = F.softmax(
        target_logits[input_ids.size(1) - 1 : input_ids.size(1) + k],
        dim=-1,
    )

    accepted: list[int] = []
    for i, token in enumerate(draft_tokens):
        p = target_dists[i]
        q = draft_dists[i]
        acceptance = min(1.0, (p[token] / q[token]).item())

        if torch.rand(()) < acceptance:
            accepted.append(token)
            continue

        residual = torch.clamp(p - q, min=0)
        if residual.sum() <= 0:
            replacement = torch.argmax(p).item()
        else:
            residual = residual / residual.sum()
            replacement = torch.multinomial(residual, num_samples=1).item()
        return accepted + [replacement]

    # If all k draft tokens are accepted, sample one extra token from p.
    extra = torch.multinomial(target_dists[k], num_samples=1).item()
    return accepted + [extra]
```

Follow-on work such as EAGLE uses the target model's hidden states to predict future tokens instead of relying on a separately trained small draft model.[10] The important takeaway isn't that one speculative variant always wins. It's that these methods trade extra compute for fewer expensive target-model decode passes, so the real payoff depends on acceptance rate, hardware, and implementation overhead.
```python
ordinary_target_passes = 5
draft_proposals = 4
accepted_prefix = 3
target_verification_passes = 1
emitted_tokens = accepted_prefix + 1

print(f"ordinary target passes for {ordinary_target_passes} tokens: {ordinary_target_passes}")
print(f"one verification emits in this example: {emitted_tokens} tokens")
print(f"extra draft passes paid: {draft_proposals}")
print("Speedup requires cheap drafting and high acceptance.")
```

```text
ordinary target passes for 5 tokens: 5
one verification emits in this example: 4 tokens
extra draft passes paid: 4
Speedup requires cheap drafting and high acceptance.
```

Inference performance isn't just about algorithms. Modern GPUs and specialized accelerators provide hardware-level features that fundamentally change the efficiency equation.
Many serving stacks still use BF16/FP16 as a baseline, but lower-precision modes are increasingly common because they cut model-weight bandwidth and, in some systems, shrink the KV footprint enough to raise concurrency.[11][12][13]
Quantization approaches include:
Common mistake: Beginners often think 4-bit quantization is just about saving disk space. The real win is reducing bytes moved during decode. Whether that turns into a large latency win still depends on kernels, dequant overhead, and quality constraints.
```python
parameters = 7_000_000_000
bytes_per_weight = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, width in bytes_per_weight.items():
    traffic_gb = parameters * width / 1_000_000_000
    print(f"{precision}: {traffic_gb:.1f} GB of weights per full read")

print("INT4 weight traffic is 0.25x FP16 before kernel overhead.")
```

```text
FP16: 14.0 GB of weights per full read
INT8: 7.0 GB of weights per full read
INT4: 3.5 GB of weights per full read
INT4 weight traffic is 0.25x FP16 before kernel overhead.
```

GPUs still dominate general-purpose LLM serving, but some deployments use inference-first accelerators with larger on-chip memory, dataflow execution, or more deterministic scheduling. The trade-off is usually a narrower software stack: you may get excellent latency or cost for a specific serving pattern, but at the price of custom compilation, fewer kernels, and less ecosystem flexibility.
The following symptoms show up in production logs and profiling traces. If you see them, here's what they mean and how to fix them.
Symptom: You upgrade math kernels or buy more peak FLOPs, but decode barely speeds up. Cause: Training is compute-bound, while decode is often memory-bound. If the GPU is already waiting on weights and KV reads, more math capacity does little. Fix: Profile bandwidth first. If HBM is near ceiling, prioritize batching, quantization, GQA, or KV-cache management before chasing more tensor-core throughput.
Symptom: Small chats wait almost as long as large ones even when the queue looks short. Cause: Static batching pads to the longest request and cannot reuse finished slots until the batch cycle ends. Fix: Switch to continuous batching so completed requests leave immediately and queued work fills open slots on the next decode step.
Symptom: Draft-model serving added overhead, but target-model passes did not fall enough to repay it. Cause: Acceptance rate is too low or implementation overhead is too high. Wrong draft tokens force extra verification and correction work. Fix: Measure accepted tokens per draft span before rollout. If most proposals are rejected, keep standard decoding or use a stronger draft path.
Symptom: HBM usage spikes even though many sessions start from the same long system prompt or retrieved prefix. Cause: The runtime is not reusing cached prefix KV state across independent requests, or its isolation policy does not allow that reuse. Fix: Enable an explicit prefix-caching feature with an appropriate tenant and privacy policy. Copy-on-write can preserve shared blocks after reuse is established; it is not by itself a cross-request cache.
Symptom: Aggregate TPS looks better, but TTFT or TPOT percentiles get worse and chat feels sluggish. Cause: Larger batches improve throughput while making active requests compete harder for bandwidth. Fix: Use SLO-aware scheduling. Cap batch size or split latency-sensitive traffic from background work, and watch throughput together with p95 or p99 latency.
Here's a practical check you can do with pen and paper or a short Python script.
Problem: You want to serve a 7B-class model on a single GPU with 80 GiB of HBM. The model has 28 layers, uses Grouped Query Attention with 4 KV heads, head dimension 128, and you plan to use FP16 (2 bytes per element). Your average customer conversation has a 4,096-token context. What's the maximum batch size you can support before the KV cache alone fills the GPU, assuming you need to leave 20 GiB for weights and runtime overhead?
Hint: Start by computing the KV cache for one request, then see how many fit in the remaining 60 GiB.
For one request at 4,096 tokens:
```text
KV bytes = 2 * batch * seq_len * layers * kv_heads * head_dim * dtype_bytes
         = 2 * 1 * 4096 * 28 * 4 * 128 * 2
         = 234,881,024 bytes
         ≈ 0.22 GiB per request
```

Available memory for KV cache: 80 GiB total - 20 GiB reserved = 60 GiB
Maximum batch size = 60 GiB / 0.21875 GiB per request ≈ 274 requests (rounding the per-request figure to 0.22 GiB gives about 273, so treat this as a ballpark, not a precise count)
In practice, you'd run at a lower batch size to leave headroom for activation buffers, temporary tensors, allocator slack, and bursty long-context requests. A production engineer might cap this far below the arithmetic ceiling and monitor actual HBM usage.
```python
import math

# 2 (K and V) * batch * seq_len * layers * kv_heads * head_dim * dtype_bytes
kv_bytes = 2 * 1 * 4_096 * 28 * 4 * 128 * 2
kv_per_request_gib = kv_bytes / 1024**3
kv_budget_gib = 80 - 20
arithmetic_ceiling = math.floor(kv_budget_gib / kv_per_request_gib)

print(f"KV per request: {kv_per_request_gib:.5f} GiB")
print(f"KV budget after reserved memory: {kv_budget_gib} GiB")
print(f"arithmetic-only batch ceiling: {arithmetic_ceiling}")
```

```text
KV per request: 0.21875 GiB
KV budget after reserved memory: 60 GiB
arithmetic-only batch ceiling: 274
```

KV cache stores key and value tensors from previous tokens across all layers, so memory grows linearly with sequence length, batch size, number of layers, number of KV heads, head dimension, and bytes per element. A useful formula is 2 * batch * seq_len * layers * kv_heads * head_dim * bytes. GQA reduces the kv_heads term, and KV-cache quantization reduces bytes.
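To see how the kv_heads and bytes terms move the total, here is a minimal sketch that wraps the formula in a function and compares three configurations for one 4,096-token request. The function name `kv_cache_bytes` is illustrative, and the full multi-head case assumes 28 query heads, which the exercise does not state.

```python
GIB = 1024**3

def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, dtype_bytes):
    """KV-cache size in bytes; the leading 2 covers the K and V tensors."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * dtype_bytes

# Shape taken from the exercise above: one request, 4,096 tokens, 28 layers.
base = dict(batch=1, seq_len=4096, layers=28, head_dim=128)

gqa_fp16 = kv_cache_bytes(kv_heads=4, dtype_bytes=2, **base)
# Assumption: 28 query heads, so full multi-head attention keeps 28 KV heads.
mha_fp16 = kv_cache_bytes(kv_heads=28, dtype_bytes=2, **base)
# Assumption: a 1-byte KV format such as FP8 for the cache.
gqa_fp8 = kv_cache_bytes(kv_heads=4, dtype_bytes=1, **base)

for name, size in [("MHA FP16", mha_fp16), ("GQA FP16", gqa_fp16), ("GQA FP8", gqa_fp8)]:
    print(f"{name}: {size / GIB:.4f} GiB per request")
```

Under these assumptions, moving from 28 to 4 KV heads cuts the per-request cache by 7x, and halving the element size halves it again, so both levers raise the arithmetic batch ceiling by the same factors.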
Speculative decoding hurts when the draft path is inaccurate or expensive enough that extra draft work and rejection handling cost more than the target-model passes saved. Low acceptance, poor kernel fusion, or tiny latency-sensitive batches can erase the win.
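A rough way to check this before rollout is the expected-speedup estimate from Leviathan et al. (2023). The sketch below assumes each draft token is accepted independently with probability `alpha`, that the draft proposes `gamma` tokens per step, and that one draft pass costs `draft_cost_ratio` of a target pass. The function names and example numbers are illustrative, and the model ignores implementation overhead such as kernel launch or scheduling costs.

```python
def expected_tokens_per_target_pass(alpha, gamma):
    """Expected tokens produced per target-model pass with gamma draft tokens."""
    if alpha >= 1.0:
        return gamma + 1  # every draft token accepted, plus one target token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def estimated_speedup(alpha, gamma, draft_cost_ratio):
    """Speedup over standard decoding; values below 1.0 mean a slowdown."""
    step_cost = gamma * draft_cost_ratio + 1  # gamma draft passes + 1 target pass
    return expected_tokens_per_target_pass(alpha, gamma) / step_cost

# Hypothetical scenarios: a cheap, accurate draft versus a costly, inaccurate one.
good = estimated_speedup(alpha=0.8, gamma=4, draft_cost_ratio=0.05)
bad = estimated_speedup(alpha=0.3, gamma=4, draft_cost_ratio=0.3)
print(f"cheap accurate draft:     {good:.2f}x")
print(f"costly inaccurate draft:  {bad:.2f}x")
```

In the second scenario the estimate falls below 1.0, which is the arithmetic version of the failure described above: rejected proposals and draft cost outweigh the target passes saved.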
Tensor parallelism splits layer math across multiple GPUs, so every decode step includes collective communication such as all-reduce or all-gather. Larger batches can amortize that communication overhead, but small interactive batches may become communication-bound before they become compute-bound. Batching and tensor parallelism need to be tuned together.
Serving systems balance aggregate throughput against per-request latency. Continuous batching can increase total tokens per second, but larger batches also make active requests compete for the same memory bandwidth. Production schedulers need latency SLOs, not only raw tokens/sec.
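One simple way to turn that into a scheduler setting is to load-test several batch caps and keep the largest one that still meets the latency target. The sketch below uses made-up measurements and a hypothetical helper, `largest_batch_within_slo`; real numbers would come from your own load tests.

```python
# Hypothetical load-test results: batch cap -> (aggregate tokens/sec, p95 TPOT in ms).
measurements = {
    8: (900, 22.0),
    16: (1600, 27.0),
    32: (2700, 38.0),
    64: (3900, 61.0),
    128: (4600, 115.0),
}

def largest_batch_within_slo(results, p95_tpot_slo_ms):
    """Return the largest batch cap whose p95 TPOT meets the SLO, or None."""
    within = [batch for batch, (_, p95) in results.items() if p95 <= p95_tpot_slo_ms]
    return max(within) if within else None

cap = largest_batch_within_slo(measurements, p95_tpot_slo_ms=50.0)
print(f"batch cap under a 50 ms p95 TPOT SLO: {cap}")
```

With these illustrative numbers, the highest-throughput setting (128) is rejected because its tail latency breaks the SLO, even though its raw tokens/sec is the best in the table.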
To estimate cost per million tokens, divide the GPU hourly rate by the tokens produced per hour (sustained tokens per second times 3,600), then multiply by one million. The dominant variables are real-world sustained throughput (driven by batching, quantization, and model size) and utilization. Idle time raises cost per token directly, because the hourly rate is paid whether or not tokens are produced, so a 5x cheaper sticker price can still lose if it delivers less than one-fifth the throughput.
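The same arithmetic as a short sketch. The prices and throughput figures are hypothetical and chosen only to illustrate the comparison; `utilization` is an assumed parameter for the fraction of each paid hour spent serving tokens.

```python
def cost_per_million_tokens(gpu_hourly_usd, sustained_tps, utilization=1.0):
    """USD per one million generated tokens for a single GPU."""
    tokens_per_hour = sustained_tps * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Hypothetical GPUs: B is 5x cheaper per hour but has under one-fifth the throughput.
gpu_a = cost_per_million_tokens(gpu_hourly_usd=4.00, sustained_tps=2500)
gpu_b = cost_per_million_tokens(gpu_hourly_usd=0.80, sustained_tps=400)
gpu_a_half_idle = cost_per_million_tokens(4.00, 2500, utilization=0.5)

print(f"GPU A:            ${gpu_a:.3f} per 1M tokens")
print(f"GPU B:            ${gpu_b:.3f} per 1M tokens")
print(f"GPU A, 50% idle:  ${gpu_a_half_idle:.3f} per 1M tokens")
```

In this example the cheaper GPU ends up costing more per token, and running the faster GPU at 50% utilization doubles its cost per token.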
Usually try chunked prefill first if the main problem is one large prompt monopolizing a shared worker for too long. Move to prefill-decode disaggregation when that interference remains large enough that separate pools and KV handoff beat their transfer and coordination overhead.
You now understand why serving capacity depends on phase behavior, memory bandwidth, KV-cache residency, scheduler policy, and precision choices. These tools let you read a slow serving trace and separate prefill queueing, decode bandwidth pressure, fragmented KV memory, and speculative-decoding overhead.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. · 2022 · NeurIPS 2022
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.
Ainslie, J., et al. · 2023 · EMNLP 2023
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.
Agrawal, A., et al. · 2023 · arXiv preprint
Splitwise: Efficient Generative LLM Inference Using Phase Splitting.
Patel, P., et al. · 2023
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.
Zhong, Y., et al. · 2024 · OSDI 2024
Orca: A Distributed Serving System for Transformer-Based Generative Models.
Yu, G.-I., et al. · 2022 · OSDI 2022
Efficient Memory Management for Large Language Model Serving with PagedAttention.
Kwon, W., et al. · 2023 · SOSP 2023
Ring Attention with Blockwise Transformers for Near-Infinite Context.
Liu, H., et al. · 2024 · arXiv preprint
Fast Inference from Transformers via Speculative Decoding.
Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty.
Li, Y., et al. · 2024 · ICML 2024
FP8 Formats for Deep Learning.
Micikevicius, P., et al. · 2022
GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers.
Frantar, E., et al. · 2023 · ICLR 2023
SnapKV: LLM Knows What You are Looking for Before Generation.
Li, Y., et al. · 2024