Speculative Decoding

Reduce LLM inter-token latency by pairing cheap drafting with target-model verification. Learn the rejection-sampling proof, speedup model, method choices, and production rollout gates.

SLM deployment showed how a small model can fit on constrained hardware. Speculative decoding uses a small or cheap draft path differently: not as the final model, but as a proposal engine that a larger target model verifies.

A large target model writing a deployment summary still emits one token at a time. Speculative decoding adds a smaller draft model that proposes the next few tokens, while the target scores that whole proposed span in one pass. If the draft matches what the target would likely say, several tokens survive. If it diverges, the target repairs the first mismatch and continues. This is the core idea behind speculative decoding: use a fast draft process so a target model can preserve its output distribution in theory while reducing latency when the workload and implementation fit.^{[1]Reference 1Fast Inference from Transformers via Speculative Decoding.https://arxiv.org/abs/2211.17192}^{[2]Reference 2Accelerating Large Language Model Decoding with Speculative Sampling.https://arxiv.org/abs/2302.01318}

The win is not automatic. Low-batch LLM generation is often memory-bandwidth bound: the expensive part is repeatedly moving model weights and state, not the arithmetic alone. Speculative decoding targets that regime. A small draft model proposes several tokens, and the large target model scores the proposed chunk in one verification pass instead of spending one full decode step per token. The technique pays off only when the accepted-token savings outweigh the draft and verification overhead.

The inference bottleneck

To understand why speculative decoding works, start with standard autoregressive generation. An LLM emits text one token at a time, and each new token depends on the tokens that came before it. That sequential dependency limits how much work a low-batch decode step can amortize.

It's tempting to blame decode latency only on the arithmetic required by billions of parameters. Modern GPUs are fast at matrix multiplication, but low-batch decode often leaves their compute units waiting for model weights and state to arrive from memory. The bottleneck depends on workload and hardware, so measure it before choosing an optimization.

Arithmetic intensity

Arithmetic intensity measures how much work gets done per byte moved from memory. If one weight load lets the GPU verify 100 candidate tokens, intensity is high and the compute units stay busy. But if each weight load verifies only 1 token, the compute units are mostly waiting for the next memory transfer.

In GPU terms, this ratio is the number of FLOPs (floating point operations, a measure of computational performance) performed per byte of data loaded from memory:

$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Transferred}}$

Reading the formula

This ratio helps diagnose whether hardware is spending time computing or waiting for data. Under a weight-only FP16 back-of-the-envelope model, single-token decode performs about 1 FLOP per byte of weights moved.^{[2]Reference 2Accelerating Large Language Model Decoding with Speculative Sampling.https://arxiv.org/abs/2302.01318}

During autoregressive decoding, generating one token with a model of $P$ parameters in FP16 (16-bit floating-point) requires:

Compute: $\approx 2P$ FLOPs (one multiply and one add per parameter in the matrix-vector products)
Memory: $\approx 2P$ bytes loaded (entire model weights in FP16)

This gives an arithmetic intensity of ~1 FLOP/byte in the weight-only model. Real decode also pays for KV-cache traffic, activations, kernels, scheduling, and batching, so profile the actual engine before declaring a bottleneck.^{[2]Reference 2Accelerating Large Language Model Decoding with Speculative Sampling.https://arxiv.org/abs/2302.01318}

For an illustrative Qwen3.6-27B BF16 target, the dense weight footprint alone is about 54 GB.^{[3]Reference 3Qwen3.6-27Bhttps://huggingface.co/Qwen/Qwen3.6-27B} Insert a measured or documented device bandwidth into the simplified model before comparing serial decode with verification:

weight-streaming-diagnostic.py

params_b = 27
bytes_per_weight = 2
example_bandwidth_gbs = 3_350  # example input; use the deployed accelerator specification
weights_gb = params_b * bytes_per_weight
weight_stream_ms = weights_gb / example_bandwidth_gbs * 1_000

print(f"BF16 weight footprint: {weights_gb} GB")
print(f"weight-only read time at {example_bandwidth_gbs} GB/s: {weight_stream_ms:.1f} ms")
print("This is a diagnostic lower bound, not measured request latency.")

Output

BF16 weight footprint: 54 GB
weight-only read time at 3350 GB/s: 16.1 ms
This is a diagnostic lower bound, not measured request latency.

Phase	Simplified expectation	Bottleneck to measure	Intuition
Prefill (many tokens together)	Higher intensity	Often compute or mixed	Large matrix work amortizes weight loads
Low-batch decode (1 token)	~1 FLOP/byte in FP16 weight-only model	Often memory bandwidth	Move active weights to emit one new token

One Weight Read, More Verified Tokens — Speculation helps when the target pass is memory-bound: the same weight read can verify several drafted positions, while ordinary decode pays the target pass once per emitted token.

Don't assume low-batch decode is compute-bound without measuring it. Weight traffic is often a dominant cost when the target emits one token at a time.^{[2]Reference 2Accelerating Large Language Model Decoding with Speculative Sampling.https://arxiv.org/abs/2302.01318}

Why verification can win

Speculative decoding is useful because the bottleneck in low-batch decode often isn't the softmax over one token; it's repeatedly moving weights and cache state for each separate decode step. If the target can verify 5 proposed tokens in one pass, keep the accepted prefix, and replace the first rejected token with a correction, the system emits several tokens per target-model call, even though each speculative round does somewhat more work.

That's exactly what happens here: because weight movement dominates, verifying a short candidate chunk can be much closer to one target-model pass than to $K$ separate decode passes. We use a cheap draft model to propose a sequence of candidate tokens. The target model then uses teacher forcing, meaning it scores a known candidate sequence in parallel instead of generating those tokens one by one. The weight-loading cost is paid once for the whole drafted chunk, which raises arithmetic intensity and improves throughput when the draft is accurate enough.
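
The intensity argument can be made concrete with the same weight-only model used earlier. This is a minimal sketch, not a profiler: the function name `weight_only_intensity` and the 27e9 parameter count are illustrative assumptions, and KV-cache and activation traffic are ignored.

```python
def weight_only_intensity(params: float, positions: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte when one weight read scores `positions` token positions.

    Weight-only assumption: 2 FLOPs (multiply + add) per parameter per position,
    and every parameter is read from memory once per forward pass.
    """
    flops = 2 * params * positions
    bytes_moved = bytes_per_weight * params
    return flops / bytes_moved

PARAMS = 27e9  # illustrative parameter count, not a measurement

for positions in (1, 5, 8):
    intensity = weight_only_intensity(PARAMS, positions)
    print(f"{positions} position(s) per pass: {intensity:.0f} FLOP/byte")
```

In this simplified model, intensity grows linearly with the number of positions scored per weight read. That is why a short verification chunk can cost close to one decode step while the pass remains memory-bound.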

The algorithm

Speculative decoding drafts several tokens with a small model, verifies them once with the target model, accepts the prefix, and samples a correction. — The draft model (small, fast) proposes a chain of candidate tokens. The target model (large, slow) scores every position in one forward pass, keeps the accepted prefix, and samples a correction at the first rejection.

The figure is small enough to trace by hand. The draft model proposes "model serves with cache." The target accepts the first three tokens, rejects "cache," and samples a correction from the residual target distribution.

Accept/reject criterion

For each proposed token, the target model asks: "How much probability did I assign to this proposal?" If the target model likes it even more than the draft model did, instant approval. If the target model likes it less, it might still keep it (proportional to how close the preferences are), or reject it and sample a correction instead.

A concrete example shows the verification step. Suppose the prefix so far is "The model" and the draft model proposes three tokens:

Position	Draft token	Draft prob.	Target prob.	Verdict
1	serves	0.40	0.60	Accept (target likes it more)
2	with	0.30	0.15	Roll: accept with probability 0.50
3	cache	0.20	0.25	Accept (target likes it more)

For token 1, the target probability (0.60) is higher than the draft probability (0.40), so the verifier always accepts. For token 2, the target probability (0.15) is lower than the draft (0.30). The verifier flips a weighted coin: it accepts with probability 0.15 / 0.30 = 0.50. If the coin comes up reject, the verifier stops checking further tokens and samples a correction from the residual distribution. Token 3 only matters if token 2 survived.

acceptance-probability.py

draft_probability = 0.30
target_probability = 0.15
accept_probability = min(1.0, target_probability / draft_probability)

print(f"accept probability for 'with': {accept_probability:.2f}")
print("Later drafted positions are discarded after a rejection.")

Output

accept probability for 'with': 0.50
Later drafted positions are discarded after a rejection.
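
To see how the per-token verdicts chain together, the sketch below walks the three-row table in order. It is illustrative only: it treats the table's probabilities as fixed for this one drafted sequence and computes how many drafted tokens are expected to survive, since a position counts only if every earlier position was accepted.

```python
# (draft token, draft probability, target probability) copied from the table
proposals = [("serves", 0.40, 0.60), ("with", 0.30, 0.15), ("cache", 0.20, 0.25)]

survive = 1.0            # probability that every token so far was accepted
expected_accepted = 0.0
for token, draft_p, target_p in proposals:
    accept = min(1.0, target_p / draft_p)
    survive *= accept    # this token is kept only if the chain is still alive
    expected_accepted += survive
    print(f"{token!r}: accept={accept:.2f}, reached and accepted={survive:.2f}")

print(f"expected accepted draft tokens: {expected_accepted:.2f}")
```

For this draft, "cache" is always accepted when reached, but it is reached only half the time because "with" is accepted only with probability 0.50.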

This construction preserves the target distribution exactly. The accepted branch contributes the overlap between the two distributions, and the residual sampler contributes the missing mass. Added together, they reconstruct the target probability for every token.^{[1]Reference 1Fast Inference from Transformers via Speculative Decoding.https://arxiv.org/abs/2211.17192}

Mathematically, for each draft token $t_i$, both models assign a probability to that token conditioned on the same prefix. We compare the target model's probability $p_{\text{target}}(t_i)$ with the draft model's probability $p_{\text{draft}}(t_i)$:

$P(\text{accept } t_i) = \min\left(1, \frac{p_{\text{target}}(t_i)}{p_{\text{draft}}(t_i)}\right)$

Reading the formula

Compute the ratio of the big model's probability to the draft model's probability for this token. If the big model likes it more (ratio >= 1), always accept. If the big model likes it less, accept randomly with probability equal to the ratio. The bigger the disagreement, the more likely rejection.

When a token is rejected at position $i$, we sample a correction token from the residual distribution:

$p_{\text{residual}}(t) = \frac{\max(0, \; p_{\text{target}}(t) - p_{\text{draft}}(t))}{Z}$

where $Z = \sum_t \max(0, \; p_{\text{target}}(t) - p_{\text{draft}}(t))$ is the normalizing constant.

In plain terms

The correction picks from tokens that the big model wanted more than the draft model predicted. If the draft said "the" had 20% probability but the big model wanted 35%, that extra 15% enters the residual pool. Tokens where the draft was already too generous (draft > target) get zero residual probability, since they were over-represented, not under-represented.

Verifier keeps draft-target overlap and samples residual correction mass. — The accept/reject test keeps the overlap between the draft and target distributions. When a token is rejected, the correction sampler draws only from probability mass the target wanted more than the draft did.

This accept/reject scheme is a modified rejection-sampling algorithm. The accepted branch contributes $\min(p_{\text{draft}}(t), p_{\text{target}}(t))$, and the residual sampler contributes the missing mass $\max(0, p_{\text{target}}(t) - p_{\text{draft}}(t))$. Add those two terms together and you recover $p_{\text{target}}(t)$ exactly.^{[1]Reference 1Fast Inference from Transformers via Speculative Decoding.https://arxiv.org/abs/2211.17192}

residual-correction.py

tokens = ["cache", "latency", "batch"]
draft = [0.60, 0.30, 0.10]
target = [0.40, 0.50, 0.10]
residual_mass = [max(0.0, p - q) for p, q in zip(target, draft)]
normalizer = sum(residual_mass)
residual = [value / normalizer if normalizer else 0.0 for value in residual_mass]

print(dict(zip(tokens, residual)))
print(f"positive correction mass: {normalizer:.2f}")

Output

{'cache': 0.0, 'latency': 1.0, 'batch': 0.0}
positive correction mass: 0.20
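
The "add the two terms" argument can be checked exactly on the same three-token distributions. This is a hedged sketch rather than a proof: it assumes every draft probability is positive and uses plain floating-point arithmetic, with no sampling.

```python
draft = [0.60, 0.30, 0.10]
target = [0.40, 0.50, 0.10]

# Accepted branch: the draft proposes token t with probability q and it is kept
# with probability min(1, p / q), which contributes min(q, p). Assumes q > 0.
accepted = [q * min(1.0, p / q) for p, q in zip(target, draft)]

# A rejection happens with the leftover probability; the correction is then
# drawn from the normalized residual max(0, p - q) / Z.
reject_probability = 1.0 - sum(accepted)
residual_mass = [max(0.0, p - q) for p, q in zip(target, draft)]
normalizer = sum(residual_mass)
corrected = [reject_probability * mass / normalizer for mass in residual_mass]

emitted = [a + c for a, c in zip(accepted, corrected)]
print([round(value, 6) for value in emitted])
```

The emitted probabilities match `target` up to floating-point error, which is the single-step version of the exactness claim.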

Implementation

This PyTorch version keeps speculative decoding small enough to inspect. It requires two components: a pre-trained target model and a smaller, computationally efficient draft model. The algorithm generates $K$ draft tokens autoregressively with the small model, concatenates them with the current context, and then validates the entire sequence in a single forward pass through the target model.

This version is intentionally pedagogical: it handles the accept/reject loop and residual sampling, but leaves out production concerns like KV-cache reuse, EOS handling, batching, and logits processors. It assumes Hugging Face-style causal-LM logits, with softmax turning each score vector into probabilities and position $j$ predicting token $j + 1$. In a production sampler, the acceptance test has to use the same post-processed distributions you serve, not a different temperature or top-p configuration.^{[1]Reference 1Fast Inference from Transformers via Speculative Decoding.https://arxiv.org/abs/2211.17192}^{[2]Reference 2Accelerating Large Language Model Decoding with Speculative Sampling.https://arxiv.org/abs/2302.01318}

served-distribution-parity.py

def top_k_normalize(probabilities, k):
    kept = sorted(range(len(probabilities)), key=probabilities.__getitem__, reverse=True)[:k]
    total = sum(probabilities[index] for index in kept)
    return [probabilities[index] / total if index in kept else 0.0 for index in range(len(probabilities))]

raw_target = [0.55, 0.30, 0.15]
raw_draft = [0.40, 0.35, 0.25]
served_target = top_k_normalize(raw_target, k=2)
served_draft = top_k_normalize(raw_draft, k=2)
token_id = 1

raw_accept = min(1.0, raw_target[token_id] / raw_draft[token_id])
served_accept = min(1.0, served_target[token_id] / served_draft[token_id])
print(f"raw acceptance: {raw_accept:.3f}")
print(f"served top-k acceptance: {served_accept:.3f}")
print("Verification must use served probabilities.")

Output

raw acceptance: 0.857
served top-k acceptance: 0.756
Verification must use served probabilities.

implementation.py

import torch
import torch.nn.functional as F

def speculative_decode(
    target_model: torch.nn.Module,
    draft_model: torch.nn.Module,
    input_ids: torch.Tensor,      # (1, seq_len)
    K: int = 5,
    max_new_tokens: int = 100,
) -> torch.Tensor:
    """Pedagogical speculative decoding loop."""
    generated = input_ids.clone()

    tokens_generated = 0
    while tokens_generated < max_new_tokens:
        step_k = min(K, max_new_tokens - tokens_generated)
        prompt_len = generated.shape[1]

        # Step 1: Draft step_k tokens autoregressively with the small model.
        draft_tokens: list[int] = []
        draft_probs: list[torch.Tensor] = []
        draft_input = generated.clone()

        for _ in range(step_k):
            with torch.no_grad():
                logits = draft_model(draft_input).logits[:, -1, :]  # (1, vocab)
            probs = F.softmax(logits, dim=-1)
            token = torch.multinomial(probs, num_samples=1)

            draft_tokens.append(token.item())
            draft_probs.append(probs.squeeze(0))
            draft_input = torch.cat([draft_input, token], dim=-1)

        verify_suffix = torch.tensor(
            [draft_tokens],
            device=generated.device,
            dtype=generated.dtype,
        )
        verify_input = torch.cat([generated, verify_suffix], dim=-1)

        # Step 2: One target pass scores every drafted position at once.
        with torch.no_grad():
            target_logits = target_model(verify_input).logits

        # In Hugging Face causal LMs, position prompt_len - 1 predicts
        # the first drafted token.
        for i in range(step_k):
            pos = prompt_len - 1 + i
            target_p = F.softmax(target_logits[:, pos, :], dim=-1).squeeze(0)
            draft_p = draft_probs[i]
            token_id = draft_tokens[i]

            ratio = (target_p[token_id] / draft_p[token_id]).item()
            if torch.rand(1).item() < min(1.0, ratio):
                continue

            residual = torch.clamp(target_p - draft_p, min=0)
            residual = residual / residual.sum()
            correction = torch.multinomial(residual, num_samples=1)

            accepted_prefix = torch.tensor(
                [draft_tokens[:i]],
                device=generated.device,
                dtype=generated.dtype,
            )
            generated = torch.cat(
                [generated, accepted_prefix, correction.unsqueeze(0)],
                dim=-1,
            )
            tokens_generated += i + 1
            break
        else:
            generated = torch.cat([generated, verify_suffix], dim=-1)
            tokens_generated += step_k

            # The last logit also predicts one bonus token beyond the draft.
            if tokens_generated < max_new_tokens:
                bonus_pos = prompt_len - 1 + step_k
                bonus_probs = F.softmax(target_logits[:, bonus_pos, :], dim=-1)
                bonus = torch.multinomial(bonus_probs, num_samples=1)

                generated = torch.cat([generated, bonus], dim=-1)
                tokens_generated += 1

    return generated

Tracing one step

Suppose input_ids currently contains the tokens for "Explain KV cache". The loop sets step_k = 5 and the draft model autoregressively generates ["reduces", "decode", "latency", "when", "batched"]. The target model then scores all five draft positions in a single forward pass on the concatenated sequence.

If the verifier accepts the first three tokens but rejects the fourth, the code appends ["reduces", "decode", "latency"] plus a correction token sampled from the residual distribution. The loop then resumes from the new prefix, generating another batch of five draft tokens. If all five tokens are accepted, the code appends all five and also samples a bonus token from the last target logit, yielding six new tokens for one target pass.

Speedup analysis

The expected number of tokens generated per verification step depends on the acceptance rate $\alpha$ and the speculation length $K$. A useful back-of-the-envelope model assumes each drafted token is accepted independently with probability $\alpha$ and that one verification pass costs about one normal target-model decode step. Intuitively, each accepted token keeps the round going, but the first rejection ends it with a single correction.

Under that approximation, the expected tokens per verification round is given by the geometric series:

$\mathbb{E}[\text{tokens per step}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$

Expected tokens per round

$\alpha$ is the per-token acceptance probability and $K$ is how many tokens the draft model proposes. When $\alpha$ is high, you get close to $K+1$ tokens per round because many drafted tokens survive and you often collect the bonus token too. When $\alpha$ is low, most drafts get rejected early and you fall back toward 1 token per round.
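
The closed form can be cross-checked by enumerating every way a round can end. This is a sketch under the same independence assumption: the helper names `closed_form` and `enumerated` are illustrative, and the $(\alpha, K)$ pairs are arbitrary test inputs rather than measurements.

```python
def closed_form(alpha: float, k: int) -> float:
    """Geometric-series expression for expected tokens per round."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def enumerated(alpha: float, k: int) -> float:
    """Sum over outcomes: probability of each outcome times tokens emitted."""
    total = 0.0
    # j drafts accepted, then one rejection: emits j drafts + 1 correction.
    for j in range(k):
        total += (alpha ** j) * (1 - alpha) * (j + 1)
    # All k drafts accepted: emits k drafts + 1 bonus token.
    total += (alpha ** k) * (k + 1)
    return total

for alpha, k in ((0.6, 5), (0.8, 5), (0.9, 10)):
    print(
        f"alpha={alpha}, K={k}: "
        f"closed={closed_form(alpha, k):.3f}, enumerated={enumerated(alpha, k):.3f}"
    )
```

The two columns agree, which confirms that the correction token and the bonus token are both counted by the $K+1$ exponent in the formula.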

Wall-clock speedup

Wall-clock speedup must also account for the cost ratio $c$ (the time it takes to run the draft model relative to the target model). This gives a useful approximation, not an exact production forecast, because real systems also pay for cache growth, kernel launches, and batching effects.

$\text{Speedup} = \frac{\mathbb{E}[\text{tokens per step}]}{1 + K \cdot c}$

The numerator is how many tokens you get per verification round. The denominator is the modeled cost: 1 target-model pass plus $K$ draft-model passes, each costing fraction $c$ of a target pass. For example, if the draft model is 10x cheaper ($c = 0.1$) and you speculate $K = 5$ tokens, the denominator is $1 + 5 \times 0.1 = 1.5$.

Worked example

Plug illustrative inputs into the model. Suppose a candidate draft path costs 10% of a target pass ($c = 0.1$). You set $K = 5$ and observe an acceptance rate of $\alpha = 0.8$ in a benchmark.

Expected tokens per round = $(1 - 0.8^6) / (1 - 0.8) = (1 - 0.262) / 0.2 \approx 3.69$ tokens.

Cost denominator = $1 + 5 \times 0.1 = 1.5$ target-equivalent passes.

Speedup = $3.69 / 1.5 \approx 2.46\times$.

The model predicts about 2.5x for those inputs. If acceptance changes to 0.6, the same K = 5 model predicts about 1.6x; at 0.9, it predicts about 3.1x. These are model outputs to compare against a benchmark, not promised throughput.

Candidate setup	$\alpha$	$K$	Assume $c$	Approx. tokens/round	Approx. speedup
Measured path A	0.6	5	0.1	2.4	1.6x
Measured path A	0.7	5	0.1	2.9	2.0x
Measured path A	0.8	5	0.1	3.7	2.5x
Measured path A	0.9	5	0.1	4.7	3.1x
Measured path B	0.85	8	0.1	5.1	2.8x
Measured path C	0.90	10	0.1	6.9	3.4x

modeled-speculation-speedup.py

def expected_tokens(acceptance: float, depth: int) -> float:
    return sum(acceptance**step for step in range(depth + 1))

def modeled_speedup(acceptance: float, depth: int, draft_cost: float) -> float:
    return expected_tokens(acceptance, depth) / (1 + depth * draft_cost)

for acceptance in (0.6, 0.8, 0.9):
    estimate = modeled_speedup(acceptance, depth=5, draft_cost=0.1)
    print(f"acceptance={acceptance:.1f}: modeled speedup={estimate:.2f}x")

Output

acceptance=0.6: modeled speedup=1.59x
acceptance=0.8: modeled speedup=2.46x
acceptance=0.9: modeled speedup=3.12x

Speculative decoding speedup chart showing speculation depth helping only when acceptance stays high, plus a bar chart showing higher acceptance rate producing larger speedup at the same draft cost. — The speedup curve isn't monotonic in practice. Larger speculation depth helps when acceptance is high, but it wastes draft work when the draft is weak or misconfigured.

The useful $K$ depends on measured acceptance, draft cost, and serving behavior. A small sweep can begin with single-digit depths, but select from route-specific benchmarks rather than a universal default.

choose-speculation-depth.py

def modeled_speedup(acceptance, depth, draft_cost):
    expected = sum(acceptance**step for step in range(depth + 1))
    return expected / (1 + depth * draft_cost)

measurements = {"acceptance": 0.72, "draft_cost": 0.12}
candidates = {
    depth: modeled_speedup(measurements["acceptance"], depth, measurements["draft_cost"])
    for depth in (1, 3, 5, 8)
}
best_depth = max(candidates, key=candidates.get)
print({depth: round(value, 3) for depth, value in candidates.items()})
print(f"model-selected K to benchmark: {best_depth}")

Output

{1: 1.536, 3: 1.92, 5: 1.921, 8: 1.727}
model-selected K to benchmark: 5

Draft model choices

The draft mechanism determines your acceptance rate and your overall speedup. It requires a balance: if the draft is too weak, it gets rejected constantly. If it's too expensive, it erases the latency gains from verification. In practice, most systems use one of these families:

Approach	Draft source	Main advantage	Main trade-off
Smaller same-family model	Separate assistant model with the same tokenizer	Simple exact speculative-decoding setup	Extra model to load and schedule
Medusa heads^{[4]Reference 4Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.https://arxiv.org/abs/2401.10774}	Extra heads attached to the target model	No separate model at inference time	Needs extra training and tree verification
EAGLE / EAGLE-3^{[5]Reference 5EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty.https://arxiv.org/abs/2401.15077}^{[6]Reference 6EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Testhttps://arxiv.org/abs/2503.01840}	Target-coupled speculator over hidden states or direct token heads	Strong proposals without a full second model	More integration complexity
MTP heads^{[7]Reference 7Speculative Decodinghttps://docs.vllm.ai/en/latest/features/speculative_decoding/}	Checkpoint-native multi-token prediction modules	No separate assistant when supported	Requires checkpoint and engine support
Prompt Lookup^{[7]Reference 7Speculative Decodinghttps://docs.vllm.ai/en/latest/features/speculative_decoding/}	Reuse repeated n-grams from context	No extra model or training	Only helps when the context repeats itself
Suffix decoding^{[7]Reference 7Speculative Decodinghttps://docs.vllm.ai/en/latest/features/speculative_decoding/}	Reuse matching suffixes from previous outputs	No extra model	Fit depends on reusable prior output patterns

Speculation methods by draft source — Classic draft-model speculation, Medusa-style heads, EAGLE, MTP, n-gram lookup, and suffix speculation all share the draft-then-verify idea, but their deployment costs and traffic profiles differ.

In the classic direct-token draft setup, require the separate draft model to share the exact same tokenizer as the target model. If token IDs map to different subwords, the target isn't verifying the intended candidate sequence; reject that pairing or use a method that explicitly supports different tokenizers.

tokenizer-compatibility-gate.py

draft_vocab = {"return": 14, " label": 88, " expires": 103}
target_vocab = {"return": 14, " label": 88, " expires": 104}
required_pieces = ["return", " label", " expires"]

mismatches = [
    piece for piece in required_pieces
    if draft_vocab.get(piece) != target_vocab.get(piece)
]
print(f"token-id mismatches: {mismatches}")
print(f"direct draft path allowed: {not mismatches}")

Output

token-id mismatches: [' expires']
direct draft path allowed: False

Medusa: multi-head speculative decoding

Medusa avoids the need for a separate draft model entirely. Instead, it adds extra prediction heads to the target model itself.

Each Medusa head predicts a different future position from the same hidden state. Because each head can propose several candidates, the draft is a tree of possible continuations rather than a single chain. A specialized tree attention mechanism then evaluates all candidate paths in a single forward pass, filters out rejected branches, and keeps the longest accepted path. The result avoids the overhead and complexity of loading and orchestrating a separate draft model.

The method map above shows that shape visually: hidden state fans out into several future-token heads, and the target verifies the resulting candidate tree.
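
The tree shape can be illustrated without any model. The sketch below is a toy under loud assumptions: `head_candidates` and `target_path` are made-up values, and exact token matching stands in for the target's acceptance rule. It does not reproduce tree attention or Medusa's actual acceptance scheme.

```python
from itertools import product

# Hypothetical top candidates from three heads (positions +1, +2, +3).
head_candidates = [["serves", "runs"], ["with", "on"], ["cache", "batching"]]

# Hypothetical stand-in for the target's verdict: the continuation it would keep.
target_path = ["serves", "with", "batching"]

def accepted_prefix_len(path: list[str], reference: list[str]) -> int:
    """Count leading tokens of `path` that agree with `reference`."""
    length = 0
    for proposed, wanted in zip(path, reference):
        if proposed != wanted:
            break
        length += 1
    return length

# Every root-to-leaf path of the candidate tree (2 * 2 * 2 = 8 paths).
paths = [list(path) for path in product(*head_candidates)]
best = max(paths, key=lambda path: accepted_prefix_len(path, target_path))
kept = best[:accepted_prefix_len(best, target_path)]

print(f"candidate paths: {len(paths)}")
print(f"longest accepted prefix: {kept}")
```

A real engine would not enumerate paths one by one. The point of tree attention is to score shared prefixes once within a single target pass.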

EAGLE and EAGLE-3

EAGLE drafts from target-model internals instead of using a separate full assistant. Earlier EAGLE variants predict feature states, which can raise acceptance because those states carry more information than a plain token-only head.^{[5]Reference 5EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty.https://arxiv.org/abs/2401.15077} EAGLE-3 changes that design: the paper drops feature prediction in favor of direct token prediction, fuses low, middle, and high target-model layers instead of relying on top-layer features alone, and trains the drafter with a training-time-test loop on its own outputs. The paper reports up to 6.5x speedup in its evaluation setup, but exact gains still depend on engine support, batch shape, and prompt mix.^{[6]Reference 6EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Testhttps://arxiv.org/abs/2503.01840} Current serving docs such as vLLM expose EAGLE-family speculation, but the exact options and caveats change quickly.^{[7]Reference 7Speculative Decodinghttps://docs.vllm.ai/en/latest/features/speculative_decoding/}

Prompt lookup decoding

Prompt Lookup Decoding (PLD) takes a completely different approach. Instead of using any neural model for drafting, it searches the current context window for matching n-grams and reuses them as draft tokens. This non-neural method works surprisingly well for tasks with repeated text or pattern matching.

Aspect	How it works
Draft source	Match n-grams from the prompt/context window
Candidate workloads	Code completion, summarization, repetitive text
Memory overhead	No extra model weights
Main failure mode	Little benefit when the context has little repetition

The algorithm scans the context window for n-grams (typically 3-5 tokens) that match the end of the currently generated sequence. When it finds a match, it looks at what token followed that n-gram earlier in the context window and uses that as the next draft token. For example, if the model has generated "timeout error" and the context contains "timeout error repeats Friday," PLD proposes "repeats" as the next draft token.

PLD is a candidate for code generation and other repetitive tasks because variable names, function calls, and boilerplate often reappear within the context window.^{[7]Reference 7Speculative Decodinghttps://docs.vllm.ai/en/latest/features/speculative_decoding/}

The key advantage is simplicity: there's no model to load, no training required, and no extra model weights. PLD can be combined with other speculative methods or used as a fallback when no neural draft model is available.

prompt-lookup-candidate.py

context = "timeout error repeats Friday. auth callback needs review. timeout error"
tokens = context.split()
suffix = ["timeout", "error"]

proposal = None
for index in range(len(tokens) - len(suffix)):
    if tokens[index:index + len(suffix)] == suffix:
        proposal = tokens[index + len(suffix)]
        break

print(f"matched suffix: {' '.join(suffix)}")
print(f"lookup proposal: {proposal}")

Output

matched suffix: timeout error
lookup proposal: repeats

Production deployment

Adding speculative decoding to a production system isn't an automatic win. On paper, the idea looks clean; in production, workload characteristics decide. For the classic two-model draft-and-verify setup, the core trade-off is simple: you spend extra FLOPs on the draft path to save target-model memory bandwidth. Modern serving stacks now expose several speculation families, so the payoff is method-specific rather than a blanket yes-or-no.^{[7]Reference 7Speculative Decodinghttps://docs.vllm.ai/en/latest/features/speculative_decoding/} Deploy blindly and you may decrease overall throughput and increase costs.

When the classic draft-model setup helps (and when it doesn't)

Before rolling out a separate draft model, check whether your workload benefits from the draft-then-verify cycle. Because the technique spends additional compute (FLOPs) to save memory bandwidth, a system that is already compute-bound will usually lose throughput.

Scenario	Hypothesis before benchmark	Why test it
Single-user, low-batch inference	Strong candidate	Target decode may be memory-bandwidth bound
Throughput-maximized batching	Measure carefully	Extra draft work can compete with saturated compute
Long outputs	Candidate	More decode steps can amortize setup
Very short outputs	Weak candidate	Setup and drafting may dominate
Repetitive outputs (code, templates)	Candidate	Draft or lookup acceptance may be higher
Diverse outputs	Measure carefully	Acceptance may vary with sampling and prompt mix
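
One hedged way to put a number on "weak candidate" is the break-even acceptance rate under this lesson's speedup model: the smallest $\alpha$ at which the modeled speedup reaches 1.0x for a given $K$ and $c$. The helper `break_even_acceptance` and the $(K, c)$ pairs below are illustrative assumptions. The result is a screening threshold to compare against measured acceptance, not a forecast.

```python
def modeled_speedup(acceptance: float, depth: int, draft_cost: float) -> float:
    expected = sum(acceptance ** step for step in range(depth + 1))
    return expected / (1 + depth * draft_cost)

def break_even_acceptance(depth: int, draft_cost: float, tolerance: float = 1e-6) -> float:
    """Smallest acceptance rate at which the modeled speedup reaches 1.0x.

    Bisection works because the modeled speedup increases with acceptance.
    Assumes draft_cost <= 1 so that full acceptance is at least break-even.
    """
    low, high = 0.0, 1.0
    while high - low > tolerance:
        middle = (low + high) / 2
        if modeled_speedup(middle, depth, draft_cost) >= 1.0:
            high = middle
        else:
            low = middle
    return high

for depth, draft_cost in ((5, 0.1), (5, 0.3), (8, 0.3)):
    threshold = break_even_acceptance(depth, draft_cost)
    print(f"K={depth}, c={draft_cost}: modeled break-even acceptance ~ {threshold:.2f}")
```

A route whose measured acceptance sits near or below its threshold is unlikely to benefit from that draft path. A more expensive draft raises the bar.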

Ship Speculation With Workload Gates — Speculation should be rolled out like any serving optimization: measure the baseline, sweep methods and depth, canary by route, and keep a non-speculative fallback for incompatible features or peak traffic.

Current vLLM docs frame speculation as an inter-token-latency optimization for medium-to-low QPS, memory-bound workloads. The same guide separates draft-model, EAGLE, MTP, n-gram, suffix, and other proposer paths, with method-specific latency-versus-throughput trade-offs.^{[7]Reference 7Speculative Decodinghttps://docs.vllm.ai/en/latest/features/speculative_decoding/}

Serving-engine reality

Serving-engine support changes quickly. For example, vLLM's current docs list several speculation families, but they also call out known feature incompatibilities and separate theoretical losslessness from what you can expect under real hardware numerics.^{[7]Reference 7Speculative Decodinghttps://docs.vllm.ai/en/latest/features/speculative_decoding/} Treat framework support as an operational detail you must validate in your own stack, not as a timeless property of the algorithm.

The practical rollout loop is usually straightforward:

Measure baseline TTFT, inter-token latency, and tokens-per-second throughput separately.
Sweep the draft mechanism and speculation depth on real prompts, not toy strings.
Track acceptance rate, output length, and p95 latency by workload class.
Keep a non-speculative fallback path for peak-QPS periods or incompatible features.

speculation-canary-gate.py

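# Illustrative measurements; replace them with numbers from your own benchmark runs.
baseline = {"p95_itl_ms": 46.0, "throughput_tps": 380, "sampler_parity": True}
canary = {"p95_itl_ms": 29.0, "throughput_tps": 372, "sampler_parity": True}
minimum_throughput_ratio = 0.95  # tolerate at most a 5% throughput drop

itl_improved = canary["p95_itl_ms"] < baseline["p95_itl_ms"]

# Promote only if sampler parity holds, p95 inter-token latency improves,
# and throughput stays within the allowed ratio of the baseline.
promote = (
    canary["sampler_parity"]
    and itl_improved
    and canary["throughput_tps"] >= baseline["throughput_tps"] * minimum_throughput_ratio
)
print(f"inter-token latency improved: {itl_improved}")
print(f"canary promoted: {promote}")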

Output

inter-token latency improved: True
canary promoted: True
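
The gate above compares global numbers, while the rollout loop asks for acceptance and p95 latency by workload class. A minimal sketch of that breakdown follows. The record fields, route names, and values are hypothetical, and the nearest-rank percentile is a simplification that is only meaningful with far more samples than shown here.

```python
from collections import defaultdict
import math

# Hypothetical per-request records; field names and values are illustrative.
records = [
    {"route": "code", "proposed": 40, "accepted": 31, "itl_ms": 21.0},
    {"route": "code", "proposed": 36, "accepted": 27, "itl_ms": 24.0},
    {"route": "chat", "proposed": 40, "accepted": 14, "itl_ms": 44.0},
    {"route": "chat", "proposed": 44, "accepted": 16, "itl_ms": 49.0},
]


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_by_route(rows):
    """Aggregate draft acceptance and p95 inter-token latency per route."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["route"]].append(row)
    summary = {}
    for route, group in grouped.items():
        proposed = sum(r["proposed"] for r in group)
        accepted = sum(r["accepted"] for r in group)
        summary[route] = {
            "acceptance_rate": accepted / proposed,
            "p95_itl_ms": nearest_rank_percentile([r["itl_ms"] for r in group], 95),
        }
    return summary


summary = summarize_by_route(records)
for route, stats in summary.items():
    print(route, stats)
```

Running the same canary gate once per route, instead of once globally, keeps a strong route from hiding a weak one.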

When speculation backfires

Speculative decoding isn't a universal speedup button. These three failure modes are common enough to check explicitly.

Symptom	Likely cause	Fix
Speedup is near 1x or below 1x (a slowdown)	Draft model is too slow or too inaccurate (low acceptance rate)	Benchmark a smaller or better-aligned draft, or switch to Prompt Lookup for repetitive tasks
Correctness checks fail	Acceptance test uses different temperature or top-p than the served model	Ensure the verifier and the sampler share the exact same post-processed distribution
Memory usage spikes unexpectedly	KV cache wasn't truncated after a rejected token	Implement cache rewind so rejected draft tokens don't persist in the cache
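
The cache-rewind fix in the last row can be illustrated with a toy cache. This is a sketch, not any engine's API: `ToyKVCache` is a hypothetical stand-in that stores token IDs where a real cache stores key/value tensors, and the token values are arbitrary.

```python
class ToyKVCache:
    """Stand-in for a per-sequence KV cache: one entry per cached position."""

    def __init__(self):
        self.entries = []

    def append(self, token_id):
        # A real cache stores key/value tensors; the token ID stands in for them.
        self.entries.append(token_id)

    def rewind(self, length):
        """Drop every cached position at index >= length."""
        if not 0 <= length <= len(self.entries):
            raise ValueError("rewind length out of range")
        del self.entries[length:]


cache = ToyKVCache()
for token_id in [11, 12, 13]:   # committed prefix
    cache.append(token_id)
prefix_len = len(cache.entries)

draft = [21, 22, 23, 24]        # K = 4 draft positions scored by the target
for token_id in draft:
    cache.append(token_id)

num_accepted = 2                # first rejection happens at draft index 2
correction = 99                 # token sampled from the residual distribution

# Discard the rejected position and every draft position after it.
cache.rewind(prefix_len + num_accepted)
cache.append(correction)
print(cache.entries)
```

The key point is that the rewind length is the committed prefix plus the accepted draft tokens; everything past that was conditioned on a token the target rejected.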

Mastery check

Evaluation rubric

Explain why low-batch autoregressive decode is often memory-bandwidth bound, not math-bound.
Describe the draft-then-verify loop and why teacher-forced verification improves arithmetic intensity.
Implement accept/reject logic with probability ratios and residual sampling.
Analyze speedup as a function of acceptance rate, speculation depth $K$ , and draft-to-target cost ratio $c$ .
Compare standard draft-model speculation with Medusa, EAGLE, Prompt Lookup, and serving-engine variants.
Identify when speculation helps production serving and when it can reduce throughput.
Debug tokenizer mismatch, sampler mismatch, KV-cache rewind, and "lossless" implementation caveats.

Follow-up questions

How does speculative decoding preserve the target distribution?

It uses modified rejection sampling. Accepted draft tokens contribute the overlap between draft and target distributions, while rejected tokens are replaced by samples from the residual target mass. This is exact in theory when verification uses the same post-processed distribution as serving. Real systems can still differ slightly because of finite-precision kernels and engine details.^{[1]Reference 1Fast Inference from Transformers via Speculative Decoding.https://arxiv.org/abs/2211.17192}^{[7]Reference 7Speculative Decodinghttps://docs.vllm.ai/en/latest/features/speculative_decoding/}
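
The claim can be checked analytically for a single draft position. The sketch below assumes a hypothetical three-token vocabulary and made-up probabilities; it computes the distribution of the emitted token under the accept-or-resample rule and compares it with the target.

```python
import math


def output_distribution(p_target, p_draft):
    """Distribution of the emitted token for one draft position.

    The draft token t is sampled from p_draft and accepted with probability
    min(1, p_target[t] / p_draft[t]). On rejection, a replacement is drawn
    from the normalized residual max(0, p_target - p_draft).
    """
    # P(draft proposes t and it is accepted) = q * min(1, p / q) = min(p, q)
    accepted_mass = [min(p, q) for p, q in zip(p_target, p_draft)]
    reject_prob = 1.0 - sum(accepted_mass)
    residual = [max(0.0, p - q) for p, q in zip(p_target, p_draft)]
    z = sum(residual)
    if z == 0.0:  # draft equals target, so nothing is ever rejected
        return accepted_mass
    return [a + reject_prob * r / z for a, r in zip(accepted_mass, residual)]


p_target = [0.2, 0.5, 0.3]  # hypothetical 3-token vocabulary
p_draft = [0.4, 0.4, 0.2]

emitted = output_distribution(p_target, p_draft)
print(emitted)
assert all(math.isclose(e, p) for e, p in zip(emitted, p_target))
```

The check works because the rejection probability equals the total residual mass, so the two terms sum to min(p, q) + max(0, p - q) = p for every token.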

What determines speedup?

Acceptance probability $\alpha$ , speculation depth $K$ , and draft-to-target cost ratio $c$ . A useful approximation is $(1 - \alpha^{K+1}) / ((1 - \alpha)(1 + cK))$ , but production results also depend on batching, cache behavior, kernels, and workload mix.
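
A small sketch of that approximation makes the depth trade-off visible. It treats $\alpha$ as a constant per-token acceptance probability, and the cost ratio `c=0.1` is an assumed value chosen only for illustration.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-then-verify round."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, c: float) -> float:
    """Approximation from the text: tokens per round over relative round cost."""
    return expected_tokens_per_round(alpha, k) / (1 + c * k)


for alpha in (0.5, 0.8):
    for k in (3, 5, 10):
        speedup = estimated_speedup(alpha, k, c=0.1)
        print(f"alpha={alpha} K={k:>2} estimated speedup ~{speedup:.2f}x")
```

With the assumed $c = 0.1$, the estimate at $\alpha = 0.5$ falls from about 1.44x at $K = 3$ to just under 1x at $K = 10$, which is why raising depth is rarely the first fix for low acceptance.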

How does temperature affect performance?

Temperature changes the served distributions and therefore acceptance; its direction and magnitude depend on target, draft, and prompt mix. Measure acceptance under the actual sampler, and make the verifier use that same processed distribution.

How does Medusa differ from separate draft-model speculation?

A separate draft model proposes one chain with another model. Medusa attaches future-token heads to the target model, proposes a tree of candidates from target hidden states, and verifies those candidates with tree attention.^{[4]Reference 4Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.https://arxiv.org/abs/2401.10774}

When would you not use classic draft-model speculation?

Be cautious in high-QPS, batch-heavy, compute-saturated serving unless benchmarks show a win. Extra draft work can erase latency savings. Other speculation methods can still help, so test by method and workload class.^{[7]Reference 7Speculative Decodinghttps://docs.vllm.ai/en/latest/features/speculative_decoding/}

What is Prompt Lookup Decoding best for?

It drafts by reusing n-gram continuations already present in context. It's a candidate for repetitive workloads such as code, templates, and some summarization because it needs no extra model but depends on repeated context.^{[7]Reference 7Speculative Decodinghttps://docs.vllm.ai/en/latest/features/speculative_decoding/}
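
The lookup step can be sketched in a few lines. This is an illustrative simplification, not a serving engine's implementation: the function name `prompt_lookup_draft` and its parameters are hypothetical, and string tokens stand in for token IDs.

```python
def prompt_lookup_draft(context, ngram_size=2, max_draft=4):
    """Propose draft tokens by matching the trailing n-gram earlier in the context."""
    if len(context) <= ngram_size:
        return []
    tail = context[-ngram_size:]
    # Search backwards so the most recent earlier match wins.
    for start in range(len(context) - ngram_size - 1, -1, -1):
        if context[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return context[follow:follow + max_draft]
    return []  # no repeated n-gram: nothing to draft, fall back to plain decode


context = ["x", "=", "foo", "(", "a", ")", ";", "y", "=", "foo", "("]
print(prompt_lookup_draft(context))
```

The draft is only a proposal; the target still verifies every token, so a bad match costs time but does not change the output distribution.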

Common pitfalls

Output quality drifts only on sampled traffic. Cause: the acceptance test used different temperature, top-p, top-k, or logits processors than the served sampler. Fix: build the ratio and residual from the exact post-processed distribution you serve.
Acceptance collapses almost immediately. Cause: tokenizer, vocabulary, or prompt-format mismatch between draft and target paths. Fix: verify exact token IDs, chat template, and special-token handling before tuning $K$ .
Speedup stays near 1x or drops below 1x. Cause: the draft is too slow, too inaccurate, or both. Fix: inspect acceptance rate and draft cost before increasing depth; then shrink the drafter, pick a better-aligned method, or disable speculation on that workload.
Benchmarks look great on one route and bad on another. Cause: speculation depth was tuned on a global average instead of by workload class. Fix: split latency, acceptance, and cost by prompt family, output length, and QPS band.
Outputs break after the first rejection. Cause: rejected draft tokens were left in the target KV cache. Fix: rewind the target cache to the accepted prefix before appending the correction token.
"Lossless" is treated as bitwise-identical across every engine. Cause: the proof is distributional, while real systems still differ in kernels, numerics, and feature support. Fix: compare outputs and latency against a non-speculative baseline on real traffic, not theory alone.^{[7]Reference 7Speculative Decodinghttps://docs.vllm.ai/en/latest/features/speculative_decoding/}

Check your understanding

Try to answer these questions without looking back at the article.

Concrete arithmetic. Qwen3.6-27B in BF16 has about 54 GB of dense weights. Under the simplified model, streaming those weights from memory once at the measured bandwidth accounts for a large part of each low-batch decode step. Why can verifying five draft tokens in one target pass save time?

Sketch: If measured target decode is bandwidth-limited, one verification pass can amortize much of the target weight traffic across several accepted positions. The benchmark still has to account for extra draft work, cache traffic, and rejection.

Acceptance probability. A draft token has draft probability 0.4 and target probability 0.2. What is the acceptance probability, and what happens if the token is rejected?

Sketch: Acceptance probability = min(1, 0.2 / 0.4) = 0.5. If rejected, sample a correction from the residual distribution $p_{\text{residual}}(t) = \max(0, p_{\text{target}}(t) - p_{\text{draft}}(t)) / Z$ . Tokens the target wanted more than the draft receive positive residual mass; tokens the draft already overestimated receive zero.

Marginal speedup. You deploy a small draft with a Qwen3.6-27B target and see only 1.1x speedup. Your acceptance rate is 0.5 and $K = 3$ . Should you increase $K$ to 10, switch to a larger draft model, or investigate something else?

Sketch: Low acceptance means most drafts get rejected early. At $\alpha = 0.5$ , the expected tokens per round stays below $1 / (1 - \alpha) = 2$ however large $K$ gets, so increasing $K$ mostly adds draft cost for tokens that won't survive. A larger draft might raise $\alpha$ but costs more per token. First, check for tokenizer mismatch or sampler mismatch (temperature, top-p) between draft and target. If those are correct, try a smaller, better-aligned draft or switch to Prompt Lookup for repetitive workloads.

Speculative decoding decision points

Under the usual back-of-the-envelope model, autoregressive decode runs at about 1 FLOP/byte, so memory bandwidth, not raw math throughput, is the bottleneck.
Draft $K$ tokens with a cheap model, then verify all $K$ in one target model forward pass.
With the same served distributions and correct residual sampling, modified rejection sampling reconstructs the target distribution in theory; engine numerics and integration still require checks.
$\text{Speedup} = \frac{1 - \alpha^{K+1}}{(1 - \alpha)(1 + K \cdot c)}$ is a useful approximation, not an exact production forecast.
Smaller assistant models, Medusa, EAGLE, and Prompt Lookup Decoding all trade simplicity against proposal quality and integration cost.
Classic draft-model speculative decoding is most attractive for latency-sensitive, memory-bound workloads, while other speculation methods have more method-specific QPS trade-offs.

You've now seen why speculative decoding works, how the acceptance logic preserves the target distribution, and how to estimate speedup from acceptance rate and draft cost. The core insight is simple: memory bandwidth dominates autoregressive decode, so parallel verification amortizes the weight-loading cost across multiple tokens.

Practice drill

Build a speculation canary scorecard for one latency-sensitive route:

Log acceptance rate by route, speculation depth, draft cost, target verification latency, and tokenizer or prompt-format parity.
Compare p50 and p95 TPOT (time per output token, the per-request view of inter-token latency), throughput, and output-distribution parity against the non-speculative baseline.
Add one failure trace where a rejected token invalidates later draft tokens and show how the KV cache is rewound.
Write promotion and rollback gates for latency, throughput, acceptance, and exact sampler parity.

The scorecard should prove speculation is saving target passes without changing the served distribution or hurting capacity.

Next Step

Continue to Long Context Window Management

Speculative decoding exploits the memory wall in single-token decode; long contexts push that wall even harder because the KV cache grows with every token and attention costs rise. The next chapter covers that growth directly: KV-cache math, prefill-vs-decode trade-offs, RoPE scaling, and when to use long-context inference versus retrieval augmentation.


References

[1] Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023.

[2] Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., & Jumper, J. (2023). Accelerating Large Language Model Decoding with Speculative Sampling.

[3] Qwen Team (2026). Qwen3.6-27B.

[4] Cai, T., et al. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. ICML 2024.

[5] Li, Y., et al. (2024). EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. ICML 2024.

[6] Li, Y., Wei, F., Zhang, C., & Zhang, H. (2025). EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test.

[7] vLLM Team (2026). Speculative Decoding. vLLM Documentation.

