Reduce LLM inter-token latency by pairing cheap drafting with target-model verification. Learn the rejection-sampling proof, speedup model, method choices, and production rollout gates.
SLM deployment showed how a small model can fit on constrained hardware. Speculative decoding uses a small or cheap draft path differently: not as the final model, but as a proposal engine that a larger target model verifies.
Speculative decoding can speed up generation by letting a cheap draft model propose tokens that a larger target model verifies. This chapter explains the target-distribution proof and the measurements needed before claiming a latency improvement.
Imagine your warehouse uses a slow senior routing checker to write daily shipping reports. The checker reviews every sentence one at a time. Now imagine adding a fast draft scanner that proposes several sentences ahead, and the checker reviews the batch in one pass. If the drafts are good, the checker approves them. If not, it fixes the first mistake and continues. This is the core idea behind speculative decoding: use a fast draft process so a target model can preserve its output distribution in theory while reducing latency when the workload and implementation fit.[1][2]
This can work because the expensive part of low-batch LLM generation is often repeatedly moving model weights and state, not only arithmetic. A small draft model proposes several tokens, and the large target model scores that proposed chunk in one verification pass instead of spending one full decode step per token.
Speculative decoding targets low-batch LLM inference that is memory-bandwidth bound. Verifying a short drafted chunk still costs extra work, and it helps only when accepted-token savings outweigh the draft and verification overhead.
To understand why speculative decoding works, it helps to first understand the fundamental limitations of standard autoregressive generation. In a standard setup, an LLM generates text one token at a time, and each new token depends on all the previously generated tokens. This sequential dependency creates a massive bottleneck.
Most engineers instinctively assume that LLM inference is slow because of the sheer volume of mathematical calculations required by billions of parameters. However, modern GPUs are incredibly fast at performing matrix multiplications. The real bottleneck often lies elsewhere: moving data around. The processor is starved for data, waiting for the massive model weights to be fetched from memory before it can perform its rapid calculations.
Think of a fulfillment line. Arithmetic intensity measures how much work gets done for each delivery of input data. If one data load lets the GPU verify 100 candidate tokens, that's high intensity, and the GPU is keeping busy. But if each data load verifies only 1 token, the compute units are mostly waiting for the next memory transfer.
In GPU terms, this ratio is the number of FLOPs (floating-point operations, a count of arithmetic work) performed per byte of data loaded from memory:

$$\text{Arithmetic intensity} = \frac{\text{FLOPs performed}}{\text{bytes moved from memory}}$$
This ratio helps diagnose whether hardware is spending time computing or waiting for data. Under a weight-only back-of-the-envelope model, single-token decode performs about 1 FLOP per byte of weights moved.[2]
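During autoregressive decoding, generating one token with a model of $N$ parameters in FP16 (16-bit floating-point) requires roughly:

$$\text{FLOPs} \approx 2N \ \text{(one multiply and one add per weight)}, \qquad \text{bytes moved} \approx 2N \ \text{(2 bytes per FP16 weight)}$$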
This gives an arithmetic intensity of ~1 FLOP/byte in the weight-only model. Real decode also pays for KV-cache traffic, activations, kernels, scheduling, and batching, so profile the actual engine before declaring a bottleneck.[2]
For an illustrative FP16 70B target, the weight footprint alone is about 140 GB. Insert a measured or documented device bandwidth into the simplified model before comparing serial decode with verification:

```python
params_b = 70                  # target size in billions of parameters
bytes_per_weight = 2           # FP16
example_bandwidth_gbs = 3_350  # example input; use the deployed accelerator specification
weights_gb = params_b * bytes_per_weight
weight_stream_ms = weights_gb / example_bandwidth_gbs * 1_000

print(f"FP16 weight footprint: {weights_gb} GB")
print(f"weight-only read time at {example_bandwidth_gbs} GB/s: {weight_stream_ms:.1f} ms")
print("This is a diagnostic lower bound, not measured request latency.")
```

```text
FP16 weight footprint: 140 GB
weight-only read time at 3350 GB/s: 41.8 ms
This is a diagnostic lower bound, not measured request latency.
```

| Phase | Simplified expectation | Bottleneck to measure | Intuition |
|---|---|---|---|
| Prefill (many tokens together) | Higher intensity | Often compute or mixed | Large matrix work amortizes weight loads |
| Low-batch decode (1 token) | ~1 FLOP/byte in FP16 weight-only model | Often memory bandwidth | Move active weights to emit one new token |
Do not assume low-batch decode is compute-bound without measuring it. Weight traffic is often a dominant cost when the target emits one token at a time.[2]
Here's what makes speculative decoding so clever. Imagine a slow quality-control station where every package must be scanned one at a time. The bottleneck isn't checking the barcode (that's fast); it's moving each package through the scanner separately. Now imagine if the station could inspect 5 proposed packages in one batch and reject only the first bad one. The line moves much faster, even though the station does slightly more work per batch.
That's exactly what happens here: because weight movement dominates, verifying a short candidate chunk can be much closer to one target-model pass than to separate decode passes. We use a cheap draft model to propose a sequence of candidate tokens. The target model then uses teacher forcing, meaning it scores a known candidate sequence in parallel instead of generating those tokens one by one. The weight-loading cost is paid once for the whole drafted chunk, which raises arithmetic intensity and improves throughput when the draft is accurate enough.
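The effect on arithmetic intensity can be sketched with the same weight-only simplification. The helper `weight_only_intensity` is a hypothetical name, and it assumes about 2 FLOPs per parameter per scored token and 2 bytes per FP16 weight; it ignores KV-cache traffic, activations, and kernel overhead, so treat the numbers as a diagnostic rather than a measurement.

```python
def weight_only_intensity(params: float, tokens_per_pass: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte when one pass over the weights scores `tokens_per_pass` positions."""
    flops = 2 * params * tokens_per_pass     # assumption: ~2 FLOPs per parameter per token
    bytes_moved = bytes_per_weight * params  # assumption: weights are streamed once per pass
    return flops / bytes_moved


for chunk in (1, 5, 8):
    intensity = weight_only_intensity(70e9, chunk)
    print(f"{chunk} token(s) per pass: ~{intensity:.0f} FLOP/byte")
```

Under these assumptions intensity grows linearly with the number of positions scored per pass, which is why a verified chunk can cost close to one decode step while the workload stays bandwidth-bound.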
The figure is small enough to trace by hand. The draft model proposes "order ships from refund." The target accepts the first three tokens, rejects "refund," and samples a correction from the residual target distribution.
Think of the target model as a warehouse verifier checking a draft pick list. For each proposed token, the verifier asks: "Would I have chosen this token too?" If the target model likes it even more than the draft model did, instant approval. If the target model likes it less, it might still keep it (proportional to how close the preferences are), or reject it and write its own token instead.
Let's walk through a concrete example. Suppose the prefix so far is "The order" and the draft model proposes three tokens:
| Position | Draft token | Draft prob. | Target prob. | Verdict |
|---|---|---|---|---|
| 1 | ships | 0.40 | 0.60 | Accept (target likes it more) |
| 2 | from | 0.30 | 0.15 | Roll: accept with probability 0.50 |
| 3 | Seattle | 0.20 | 0.25 | Accept (target likes it more) |
For token 1, the target probability (0.60) is higher than the draft probability (0.40), so the verifier always accepts. For token 2, the target probability (0.15) is lower than the draft (0.30). The verifier flips a weighted coin: it accepts with probability 0.15 / 0.30 = 0.50. If the coin comes up reject, the verifier stops checking further tokens and samples a correction from the residual distribution. Token 3 only matters if token 2 survived.

```python
draft_probability = 0.30
target_probability = 0.15
accept_probability = min(1.0, target_probability / draft_probability)

print(f"accept probability for 'from': {accept_probability:.2f}")
print("Later drafted positions are discarded after a rejection.")
```

```text
accept probability for 'from': 0.50
Later drafted positions are discarded after a rejection.
```

This construction preserves the target distribution exactly. The accepted branch contributes the overlap between the two distributions, and the residual sampler contributes the missing mass. Added together, they reconstruct the target probability for every token.[1]
Mathematically, for each draft token $x_i$, both models assign a probability to that token conditioned on the same prefix. We compare the target model's probability $p(x_i)$ with the draft model's probability $q(x_i)$:

$$P(\text{accept } x_i) = \min\left(1, \frac{p(x_i)}{q(x_i)}\right)$$

Compute the ratio of the big model's probability to the draft model's probability for this token. If the big model likes it more (ratio >= 1), always accept. If the big model likes it less, accept randomly with probability equal to the ratio. The bigger the disagreement, the more likely rejection.
When a token is rejected at position $i$, we sample a correction token from the residual distribution:

$$p'(x) = \frac{\max(0,\, p(x) - q(x))}{Z}$$

where $Z = \sum_{x'} \max(0,\, p(x') - q(x'))$ is the normalizing constant.
The correction picks from tokens that the big model wanted more than the draft model predicted. If the draft said "the" had 20% probability but the big model wanted 35%, that extra 15% enters the residual pool. Tokens where the draft was already too generous (draft > target) get zero residual probability, since they were over-represented, not under-represented.
This accept/reject scheme is a modified rejection-sampling algorithm. The accepted branch contributes $\min(p(x), q(x))$, and the residual sampler contributes the missing mass $\max(0,\, p(x) - q(x))$. Add those two terms together and you recover $p(x)$ exactly.[1]
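The identity can be checked numerically on a toy vocabulary. The sketch below is only an illustration: the three-token distributions are made-up values, with `target` playing the role of the target distribution and `draft` the draft distribution. It adds the probability that a token is drafted and accepted to the probability that some draft is rejected and the token is then drawn from the residual.

```python
target = {"cat": 0.40, "dog": 0.50, "mouse": 0.10}  # illustrative target probabilities
draft = {"cat": 0.60, "dog": 0.30, "mouse": 0.10}   # illustrative draft probabilities

# Accepted branch: the draft proposes x with draft[x], and it survives
# the test with probability min(1, target[x] / draft[x]).
accepted = {x: draft[x] * min(1.0, target[x] / draft[x]) for x in target}
reject_prob = 1.0 - sum(accepted.values())

# Residual branch: positive part of (target - draft), normalized to sum to 1.
residual_mass = {x: max(0.0, target[x] - draft[x]) for x in target}
normalizer = sum(residual_mass.values())
residual = {x: (mass / normalizer if normalizer else 0.0) for x, mass in residual_mass.items()}

# Total probability of emitting x at this position.
emitted = {x: accepted[x] + reject_prob * residual[x] for x in target}
for x in target:
    print(f"{x}: emitted={emitted[x]:.2f} target={target[x]:.2f}")
```

The two printed columns should agree for every token, because the rejection probability equals the residual normalizer.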

```python
tokens = ["cat", "dog", "mouse"]
draft = [0.60, 0.30, 0.10]
target = [0.40, 0.50, 0.10]
# Keep only the mass the target wanted beyond what the draft proposed.
residual_mass = [max(0.0, p - q) for p, q in zip(target, draft)]
normalizer = sum(residual_mass)
residual = [value / normalizer if normalizer else 0.0 for value in residual_mass]

print(dict(zip(tokens, residual)))
print(f"positive correction mass: {normalizer:.2f}")
```

```text
{'cat': 0.0, 'dog': 1.0, 'mouse': 0.0}
positive correction mass: 0.20
```

The following code demonstrates a simplified version of the speculative decoding algorithm using PyTorch. It requires two components: a pre-trained target model and a smaller, computationally efficient draft model. The algorithm generates draft tokens autoregressively with the small model, concatenates them with the current context, and then validates the entire sequence in a single forward pass through the target model.
This version is intentionally pedagogical: it handles the accept/reject loop and residual sampling, but leaves out production concerns like KV-cache reuse, EOS handling, batching, and logits processors. It assumes Hugging Face-style causal-LM logits, where position $t$ predicts token $t + 1$. In a production sampler, the acceptance test has to use the same post-processed distributions you actually serve, not a different temperature or top-p configuration.[1][2]

```python
def top_k_normalize(probabilities, k):
    # Keep the k most likely tokens and renormalize; everything else gets zero.
    kept = sorted(range(len(probabilities)), key=probabilities.__getitem__, reverse=True)[:k]
    total = sum(probabilities[index] for index in kept)
    return [probabilities[index] / total if index in kept else 0.0 for index in range(len(probabilities))]

raw_target = [0.55, 0.30, 0.15]
raw_draft = [0.40, 0.35, 0.25]
served_target = top_k_normalize(raw_target, k=2)
served_draft = top_k_normalize(raw_draft, k=2)
token_id = 1

raw_accept = min(1.0, raw_target[token_id] / raw_draft[token_id])
served_accept = min(1.0, served_target[token_id] / served_draft[token_id])
print(f"raw acceptance: {raw_accept:.3f}")
print(f"served top-k acceptance: {served_accept:.3f}")
print("Verification must use served probabilities.")
```

```text
raw acceptance: 0.857
served top-k acceptance: 0.756
Verification must use served probabilities.
```

```python
import torch
import torch.nn.functional as F

def speculative_decode(
    target_model: torch.nn.Module,
    draft_model: torch.nn.Module,
    input_ids: torch.Tensor,  # (1, seq_len)
    K: int = 5,
    max_new_tokens: int = 100,
) -> torch.Tensor:
    """Pedagogical speculative decoding loop."""
    generated = input_ids.clone()

    tokens_generated = 0
    while tokens_generated < max_new_tokens:
        step_k = min(K, max_new_tokens - tokens_generated)
        prompt_len = generated.shape[1]

        # Step 1: Draft step_k tokens autoregressively with the small model.
        draft_tokens: list[int] = []
        draft_probs: list[torch.Tensor] = []
        draft_input = generated.clone()

        for _ in range(step_k):
            with torch.no_grad():
                logits = draft_model(draft_input).logits[:, -1, :]  # (1, vocab)
                probs = F.softmax(logits, dim=-1)
                token = torch.multinomial(probs, num_samples=1)

            draft_tokens.append(token.item())
            draft_probs.append(probs.squeeze(0))
            draft_input = torch.cat([draft_input, token], dim=-1)

        verify_suffix = torch.tensor(
            [draft_tokens],
            device=generated.device,
            dtype=generated.dtype,
        )
        verify_input = torch.cat([generated, verify_suffix], dim=-1)

        # Step 2: One target pass scores every drafted position at once.
        with torch.no_grad():
            target_logits = target_model(verify_input).logits

        # In Hugging Face causal LMs, position prompt_len - 1 predicts
        # the first drafted token.
        for i in range(step_k):
            pos = prompt_len - 1 + i
            target_p = F.softmax(target_logits[:, pos, :], dim=-1).squeeze(0)
            draft_p = draft_probs[i]
            token_id = draft_tokens[i]

            # Accept with probability min(1, p / q).
            ratio = (target_p[token_id] / draft_p[token_id]).item()
            if torch.rand(1).item() < min(1.0, ratio):
                continue

            # Rejected: sample a correction from the normalized residual.
            residual = torch.clamp(target_p - draft_p, min=0)
            residual = residual / residual.sum()
            correction = torch.multinomial(residual, num_samples=1)

            accepted_prefix = torch.tensor(
                [draft_tokens[:i]],
                device=generated.device,
                dtype=generated.dtype,
            )
            generated = torch.cat(
                [generated, accepted_prefix, correction.unsqueeze(0)],
                dim=-1,
            )
            tokens_generated += i + 1
            break
        else:
            # No rejection: keep every drafted token.
            generated = torch.cat([generated, verify_suffix], dim=-1)
            tokens_generated += step_k

            # The last logit also predicts one bonus token beyond the draft.
            if tokens_generated < max_new_tokens:
                bonus_pos = prompt_len - 1 + step_k
                bonus_probs = F.softmax(target_logits[:, bonus_pos, :], dim=-1)
                bonus = torch.multinomial(bonus_probs, num_samples=1)

                generated = torch.cat([generated, bonus], dim=-1)
                tokens_generated += 1

    return generated
```

Suppose `input_ids` currently contains the tokens for "Track order 42". The loop sets `step_k = 5` and the draft model autoregressively generates ["ships", "from", "Seattle", "on", "Friday"]. The target model then scores all five draft positions in a single forward pass on the concatenated sequence.
If the verifier accepts the first three tokens but rejects the fourth, the code appends ["ships", "from", "Seattle"] plus a correction token sampled from the residual distribution. The loop then resumes from the new prefix, generating another batch of five draft tokens. If all five tokens are accepted, the code appends all five and also samples a bonus token from the last target logit, yielding six new tokens for one target pass.
The expected number of tokens generated per verification step depends on the acceptance rate $\alpha$ and the speculation length $K$. A useful back-of-the-envelope model assumes each drafted token is accepted independently with probability $\alpha$ and that one verification pass costs about one normal target-model decode step. Think of this like drafting a pick list where each accepted item keeps the line moving, but the first wrong item forces a stop and correction.
Under that approximation, the expected tokens per verification round is given by the geometric series:

$$\mathbb{E}[\text{tokens per round}] = \sum_{i=0}^{K} \alpha^{i} = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

$\alpha$ is the per-token acceptance probability and $K$ is how many tokens the draft model proposes. When $\alpha$ is high, you get close to $K + 1$ tokens per round because many drafted tokens survive and you often collect the bonus token too. When $\alpha$ is low, most drafts get rejected early and you fall back toward 1 token per round.
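A quick simulation can sanity-check the geometric-series model. This is a sketch under the same independence assumption: each draft survives with a fixed probability, the round stops at the first rejection, and every round emits one extra target token (a correction after a rejection, or the bonus token when all drafts survive). The function names and the closed form $(1 - \alpha^{K+1}) / (1 - \alpha)$ used for comparison are stated here as assumptions of the sketch, and the closed form requires an acceptance probability below 1.

```python
import random


def closed_form_tokens(acceptance: float, depth: int) -> float:
    """Geometric-series estimate of tokens per round (requires acceptance < 1)."""
    return (1 - acceptance ** (depth + 1)) / (1 - acceptance)


def simulate_tokens(acceptance: float, depth: int, rounds: int, seed: int = 0) -> float:
    """Average tokens per round when drafts are accepted independently."""
    rng = random.Random(seed)
    total = 0
    for _ in range(rounds):
        accepted = 0
        # Stop at the first rejected draft or when the whole chunk survives.
        while accepted < depth and rng.random() < acceptance:
            accepted += 1
        # One extra target token per round: correction or bonus.
        total += accepted + 1
    return total / rounds


modeled = closed_form_tokens(0.8, 5)
simulated = simulate_tokens(0.8, 5, rounds=200_000)
print(f"closed form: {modeled:.3f}, simulated: {simulated:.3f}")
```

Real acceptance is rarely independent across positions, so a gap between this model and a benchmark is expected and worth measuring.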
Wall-clock speedup must also account for the cost ratio $c$ (the time it takes to run the draft model relative to the target model). This gives a useful approximation, not an exact production forecast, because real systems also pay for cache growth, kernel launches, and batching effects.

$$\text{Speedup} \approx \frac{1 - \alpha^{K+1}}{(1 - \alpha)(1 + cK)}$$

The numerator is how many tokens you get per verification round. The denominator is the modeled cost: 1 target-model pass plus $K$ draft-model passes, each costing fraction $c$ of a target pass. For example, if the draft model is 10x cheaper ($c = 0.1$) and you speculate $K = 5$ tokens, the denominator is $1 + 0.1 \times 5 = 1.5$.
Work through illustrative inputs. Suppose a candidate draft path costs 10% of a target pass ($c = 0.1$). You set $K = 5$ and observe an acceptance rate of $\alpha = 0.8$ in a benchmark.
Expected tokens per round = $(1 - 0.8^{6}) / (1 - 0.8) \approx 3.69$ tokens.
Cost denominator = $1 + 0.1 \times 5 = 1.5$ target-equivalent passes.
Speedup = $3.69 / 1.5 \approx 2.46$x.
The model predicts about 2.5x for those inputs. If acceptance changes to 0.6, the same $K = 5$ model predicts about 1.6x; at 0.9, it predicts about 3.1x. These are model outputs to compare against a benchmark, not promised throughput.
| Candidate setup | Assumed $\alpha$ | $K$ | $c$ | Approx. tokens/round | Approx. speedup |
|---|---|---|---|---|---|
| Measured path A | 0.6 | 5 | 0.1 | 2.4 | 1.6x |
| Measured path A | 0.7 | 5 | 0.1 | 2.9 | 2.0x |
| Measured path A | 0.8 | 5 | 0.1 | 3.7 | 2.5x |
| Measured path A | 0.9 | 5 | 0.1 | 4.7 | 3.1x |
| Measured path B | 0.85 | 8 | 0.1 | 5.1 | 2.8x |
| Measured path C | 0.90 | 10 | 0.1 | 6.9 | 3.4x |

```python
def expected_tokens(acceptance: float, depth: int) -> float:
    # Geometric series: 1 + a + a^2 + ... + a^depth.
    return sum(acceptance**step for step in range(depth + 1))

def modeled_speedup(acceptance: float, depth: int, draft_cost: float) -> float:
    # One target pass plus `depth` draft passes, in target-pass units.
    return expected_tokens(acceptance, depth) / (1 + depth * draft_cost)

for acceptance in (0.6, 0.8, 0.9):
    estimate = modeled_speedup(acceptance, depth=5, draft_cost=0.1)
    print(f"acceptance={acceptance:.1f}: modeled speedup={estimate:.2f}x")
```

```text
acceptance=0.6: modeled speedup=1.59x
acceptance=0.8: modeled speedup=2.46x
acceptance=0.9: modeled speedup=3.12x
```
The useful $K$ depends on measured acceptance, draft cost, and serving behavior. A small sweep can begin with single-digit depths, but select $K$ from route-specific benchmarks rather than a universal default.

```python
def modeled_speedup(acceptance, depth, draft_cost):
    expected = sum(acceptance**step for step in range(depth + 1))
    return expected / (1 + depth * draft_cost)

measurements = {"acceptance": 0.72, "draft_cost": 0.12}
# Sweep a few candidate depths under the measured acceptance and draft cost.
candidates = {
    depth: modeled_speedup(measurements["acceptance"], depth, measurements["draft_cost"])
    for depth in (1, 3, 5, 8)
}
best_depth = max(candidates, key=candidates.get)
print({depth: round(value, 3) for depth, value in candidates.items()})
print(f"model-selected K to benchmark: {best_depth}")
```

```text
{1: 1.536, 3: 1.92, 5: 1.921, 8: 1.727}
model-selected K to benchmark: 5
```

The draft mechanism determines your acceptance rate and your overall speedup. It requires a balance: if the draft is too weak, it gets rejected constantly. If it's too expensive, it erases the latency gains from verification. In practice, most systems use one of these families:
| Approach | Draft source | Main advantage | Main trade-off |
|---|---|---|---|
| Smaller same-family model | Separate assistant model with the same tokenizer | Simple exact speculative-decoding setup | Extra model to load and schedule |
| Medusa heads[3] | Extra heads attached to the target model | No separate model at inference time | Needs extra training and tree verification |
| EAGLE / EAGLE-3[4][5] | Target-coupled speculator over hidden states or direct token heads | Strong proposals without a full second model | More integration complexity |
| Prompt Lookup[6] | Reuse repeated n-grams from context | No extra model or training | Only helps when the context repeats itself |
In the classic direct-token draft setup, require the separate draft model to share the exact same tokenizer as the target model. If token IDs map to different subwords, the target is not verifying the intended candidate sequence; reject that pairing or use a method that explicitly supports different tokenizers.

```python
draft_vocab = {"return": 14, " label": 88, " expires": 103}
target_vocab = {"return": 14, " label": 88, " expires": 104}
required_pieces = ["return", " label", " expires"]

# A piece is a mismatch when the two tokenizers assign it different IDs.
mismatches = [
    piece for piece in required_pieces
    if draft_vocab.get(piece) != target_vocab.get(piece)
]
print(f"token-id mismatches: {mismatches}")
print(f"direct draft path allowed: {not mismatches}")
```

```text
token-id mismatches: [' expires']
direct draft path allowed: False
```

Medusa avoids the need for a separate draft model entirely. Instead, it adds extra prediction heads to the target model itself.
Each Medusa head predicts a different future position from the same hidden state. This design fundamentally changes the architecture from generating a single draft sequence to predicting a tree of possible continuations. Since each head can propose multiple candidates, the candidates form a tree structure rather than a single chain. A specialized tree attention mechanism then evaluates all these candidate paths simultaneously in a single forward pass, efficiently discarding the incorrect branches while keeping the longest accepted path. This approach avoids the overhead and complexity of loading and orchestrating a separate draft model altogether.
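The tree idea can be sketched without any model. Everything below is hypothetical: `head_candidates` stands in for the top-2 outputs of three heads, and `target_next` is a lookup table standing in for what the target would choose after each accepted prefix. The sketch uses a simple exact-match check against that table rather than Medusa's actual acceptance rule, and it enumerates the full Cartesian product, whereas real implementations typically score a pruned tree in one pass with tree attention.

```python
from itertools import product

# Hypothetical top-2 candidates from three future-token heads.
head_candidates = [["ships", "arrives"], ["from", "on"], ["Seattle", "Friday"]]

# Hypothetical target choices: prefix of accepted tokens -> target's next token.
target_next = {
    (): "ships",
    ("ships",): "on",
    ("ships", "on"): "Monday",
}


def accepted_length(path: tuple[str, ...]) -> int:
    """Count leading tokens of `path` that match the target's own choices."""
    prefix: tuple[str, ...] = ()
    for token in path:
        if target_next.get(prefix) != token:
            break
        prefix += (token,)
    return len(prefix)


paths = list(product(*head_candidates))
best = max(paths, key=accepted_length)
print(f"candidate paths: {len(paths)}")
print(f"longest accepted prefix: {best[:accepted_length(best)]}")
```

With these made-up values, two of the eight paths share the accepted prefix "ships on", and the third position fails for both because no head proposed the target's choice.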
The method map above shows that shape visually: hidden state fans out into several future-token heads, and the target verifies the resulting candidate tree.
EAGLE drafts from target-model internals instead of using a separate full assistant. Earlier EAGLE variants predict feature states, which can raise acceptance because those states carry more information than a plain token-only head.[4] EAGLE-3 moves further in that direction: the paper switches to direct token prediction, fuses low, middle, and high target-model layers, and trains the drafter with a training-time-test loop on its own outputs. The paper reports up to 6.5x speedup in its evaluation setup, but exact gains still depend on engine support, batch shape, and prompt mix.[5] Current serving docs such as vLLM expose EAGLE-family speculation, but the exact options and caveats change quickly.[7]
Prompt Lookup Decoding (PLD) takes a completely different approach. Instead of using any neural model for drafting, it searches the current context window for matching n-grams and reuses them as draft tokens. This non-neural method works surprisingly well for tasks with significant repetition or pattern matching.
| Aspect | How it works |
|---|---|
| Draft source | Match n-grams from the prompt/context window |
| Candidate workloads | Code completion, summarization, repetitive text |
| Memory overhead | No extra model weights |
| Main failure mode | Little benefit when the context has little repetition |
The algorithm scans the context window for n-grams (typically 3-5 tokens) that match the end of the currently generated sequence. When it finds a match, it looks at what token followed that n-gram earlier in the context window and uses that as the next draft token. For example, if the model has generated "return label" and the context contains "return label expires Friday," PLD proposes "expires" as the next draft token.
PLD is especially effective on code generation and other repetitive tasks because variable names, function calls, and boilerplate often reappear within the context window.[6]
The key advantage is simplicity: there's no model to load, no training required, and no memory overhead. PLD can be combined with other speculative methods or used as a fallback when no neural draft model is available.

```python
context = "return label expires Friday. damaged box needs review. return label"
tokens = context.split()
suffix = ["return", "label"]

proposal = None
# Search earlier positions only; the final occurrence is the suffix itself.
for index in range(len(tokens) - len(suffix)):
    if tokens[index:index + len(suffix)] == suffix:
        proposal = tokens[index + len(suffix)]
        break

print(f"matched suffix: {' '.join(suffix)}")
print(f"lookup proposal: {proposal}")
```

```text
matched suffix: return label
lookup proposal: expires
```

Adding speculative decoding to a production system isn't an automatic win. While it looks promising on paper, real-world performance depends heavily on workload characteristics. For the classic two-model draft-and-verify setup, the core trade-off is simple: you spend extra FLOPs on the draft path to save target-model memory bandwidth. Modern serving stacks now expose several speculation families, so the payoff is method-specific rather than a blanket yes-or-no.[7] Deploy blindly and you might actually decrease overall throughput and increase costs.
Before rolling out a separate draft model, you need to check if your workload will actually benefit from the draft-then-verify cycle. The technique is a trade-off: it burns additional compute (FLOPs) to save memory bandwidth. If your system is already compute-bound, this trade-off will usually backfire and reduce overall throughput.
| Scenario | Hypothesis before benchmark | Why test it |
|---|---|---|
| Single-user, low-batch inference | Strong candidate | Target decode may be memory-bandwidth bound |
| Throughput-maximized batching | Measure carefully | Extra draft work can compete with saturated compute |
| Long outputs | Candidate | More decode steps can amortize setup |
| Very short outputs | Weak candidate | Setup and drafting may dominate |
| Repetitive outputs (code, templates) | Candidate | Draft or lookup acceptance may be higher |
| Diverse outputs | Measure carefully | Acceptance may vary with sampling and prompt mix |
Current vLLM docs position separate draft-model speculation as an inter-token-latency optimization for medium-to-low QPS, memory-bound workloads. The same guide separates that setup from other methods such as EAGLE, MTP, n-gram, and suffix speculation, which have different latency-versus-throughput trade-offs.[7]
Serving-engine support changes quickly. For example, vLLM's current docs list several speculation families, but they also call out known feature incompatibilities and separate theoretical losslessness from what you can expect under real hardware numerics.[7] Treat framework support as an operational detail you must validate in your own stack, not as a timeless property of the algorithm.
The practical rollout loop is usually straightforward:

```python
baseline = {"p95_itl_ms": 46.0, "throughput_tps": 380, "sampler_parity": True}
canary = {"p95_itl_ms": 29.0, "throughput_tps": 372, "sampler_parity": True}
minimum_throughput_ratio = 0.95

# Promote only if the sampler matches, latency improves, and throughput holds.
promote = (
    canary["sampler_parity"]
    and canary["p95_itl_ms"] < baseline["p95_itl_ms"]
    and canary["throughput_tps"] >= baseline["throughput_tps"] * minimum_throughput_ratio
)
print(f"inter-token latency improved: {canary['p95_itl_ms'] < baseline['p95_itl_ms']}")
print(f"canary promoted: {promote}")
```

```text
inter-token latency improved: True
canary promoted: True
```

Speculative decoding isn't a universal speedup button. Here are three common failure modes, with symptoms and fixes.
| Symptom | Likely cause | Fix |
|---|---|---|
| Speedup is near 1x or negative | Draft model is too slow or too inaccurate (low acceptance rate) | Benchmark a smaller or better-aligned draft, or switch to Prompt Lookup for repetitive tasks |
| Correctness checks fail | Acceptance test uses different temperature or top-p than the served model | Ensure the verifier and the sampler share the exact same post-processed distribution |
| Memory usage spikes unexpectedly | KV cache wasn't truncated after a rejected token | Implement cache rewind so rejected draft tokens don't persist in the cache |
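The cache-rewind fix in the last row can be sketched with a plain Python list standing in for a per-position KV cache. The list layout and the `rewind_cache` helper are hypothetical; real engines manage paged or tensor-backed caches through their own interfaces, so this only illustrates the bookkeeping.

```python
def rewind_cache(cache: list[str], prompt_len: int, accepted: int) -> list[str]:
    """Keep entries for the committed prefix and drop rejected draft positions."""
    return cache[: prompt_len + accepted]


prompt_len, depth, accepted = 3, 5, 2

# After verification, the cache holds the prompt plus every drafted position.
cache = [f"kv[{position}]" for position in range(prompt_len + depth)]

# Two drafts were accepted, so positions past prompt_len + 2 must go.
cache = rewind_cache(cache, prompt_len, accepted)
print(f"entries kept: {len(cache)}")
```

In this sketch the correction token has no cache entry yet; it would be added when the next target pass processes it.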
Speculative decoding preserves the target distribution through modified rejection sampling. Accepted draft tokens contribute the overlap between draft and target distributions, while rejected tokens are replaced by samples from the residual target mass. This is exact in theory when verification uses the same post-processed distribution as serving. Real systems can still differ slightly because of finite-precision kernels and engine details.[1][7]
Modeled speedup depends on three quantities: acceptance probability $\alpha$, speculation depth $K$, and draft-to-target cost ratio $c$. A useful approximation is $\text{Speedup} \approx \frac{1 - \alpha^{K+1}}{(1 - \alpha)(1 + cK)}$, but production results also depend on batching, cache behavior, kernels, and workload mix.
Temperature changes the served distributions and therefore acceptance; its direction and magnitude depend on target, draft, and prompt mix. Measure acceptance under the actual sampler, and make the verifier use that same processed distribution.
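One way to explore the temperature effect offline is to compute the expected acceptance for a single position, which under the accept rule min(1, target/draft) equals the overlap between the two served distributions. The logits below are made-up values and the helper names are hypothetical; the sketch applies the same temperature to both models, as the verifier must.

```python
import math


def softmax(logits: list[float], temperature: float) -> list[float]:
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def expected_acceptance(target_probs: list[float], draft_probs: list[float]) -> float:
    """Probability that a token sampled from the draft passes the accept test."""
    return sum(min(p, q) for p, q in zip(target_probs, draft_probs))


target_logits = [2.0, 1.0, 0.2]  # illustrative values only
draft_logits = [1.5, 1.4, 0.1]   # illustrative values only

for temperature in (0.5, 1.0, 1.5):
    rate = expected_acceptance(
        softmax(target_logits, temperature),
        softmax(draft_logits, temperature),
    )
    print(f"temperature={temperature}: expected acceptance={rate:.3f}")
```

The trend for these toy logits says nothing about a real model pair; the point is that the measurement has to use the served distributions.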
A separate draft model proposes one chain with another model. Medusa attaches future-token heads to the target model, proposes a tree of candidates from target hidden states, and verifies those candidates with tree attention.[3]
Be cautious in high-QPS, batch-heavy, compute-saturated serving unless benchmarks show a win. Extra draft work can erase latency savings. Other speculation methods can still help, so test by method and workload class.[7]
Prompt Lookup Decoding drafts by reusing n-gram continuations already present in the context. It needs no extra model, so it works best for repetitive workloads such as code, templates, and some summarization, and it offers little when the context does not repeat.[6]
Try to answer these questions without looking back at the article.
1. Concrete arithmetic. A 70B model in FP16 has about 140 GB of weights. Under the simplified model, a measured bandwidth input makes one target weight read a large part of low-batch decode time. Why can verifying five draft tokens in one target pass save time?
Sketch: If measured target decode is bandwidth-limited, one verification pass can amortize much of the target weight traffic across several accepted positions. The benchmark still has to account for extra draft work, cache traffic, and rejection.
2. Acceptance probability. A draft token has draft probability 0.4 and target probability 0.2. What is the acceptance probability, and what happens if the token is rejected?
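Sketch: Acceptance probability = min(1, 0.2 / 0.4) = 0.5. If rejected, sample a correction from the residual distribution $\max(0,\, p(x) - q(x)) / Z$, where $p$ is the target distribution, $q$ is the draft distribution, and $Z$ is the normalizing constant. Tokens the target wanted more than the draft receive positive residual mass; tokens the draft already overestimated receive zero.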
3. Weak speedup. You deploy a 1B draft with a 70B target and see only 1.1x speedup. Your acceptance rate is 0.5 at your current $K$. Should you increase $K$ to 10, switch to a larger draft model, or investigate something else?
Sketch: Low acceptance means most drafts get rejected early. Increasing $K$ wastes effort on tokens that won't survive. A larger draft might raise $\alpha$ but costs more per token. First, check for tokenizer mismatch or temperature mismatch between draft and target. If those are correct, try a smaller, better-aligned draft or switch to Prompt Lookup for repetitive workloads.
You've now seen why speculative decoding works, how the acceptance logic preserves the target distribution, and how to estimate speedup from acceptance rate and draft cost. The core insight is simple: when memory bandwidth dominates low-batch autoregressive decode, parallel verification amortizes the weight-loading cost across multiple tokens.
[1] Fast Inference from Transformers via Speculative Decoding. Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023
[2] Accelerating Large Language Model Decoding with Speculative Sampling. Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., & Jumper, J. · 2023
[3] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai, T., et al. · 2024 · ICML 2024
[4] EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. Li, Y., et al. · 2024 · ICML 2024
[5] EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. Li, Y., Wei, F., Zhang, C., & Zhang, H. · 2025
[6] Accelerating LLM Inference by Reusing the Previous Context via Prompt Lookup Decoding. Spector, B. & Re, C. · 2024
[7] Speculative Decoding. vLLM Team · 2026 · vLLM Documentation