LeetLLM
© 2026 LeetLLM. All rights reserved.


Speculative Decoding

Reduce LLM inter-token latency by pairing cheap drafting with target-model verification. Learn the rejection-sampling proof, speedup model, method choices, and production rollout gates.


SLM deployment showed how a small model can fit on constrained hardware. Speculative decoding uses a small or cheap draft path differently: not as the final model, but as a proposal engine that a larger target model verifies.

Speculative decoding can speed up generation by letting a cheap draft model propose tokens that a larger target model verifies. This chapter explains the target-distribution proof and the measurements needed before claiming a latency improvement.

Imagine your warehouse uses a slow senior routing checker to write daily shipping reports. The checker reviews every sentence one at a time. Now imagine adding a fast draft scanner that proposes several sentences ahead, and the checker reviews the batch in one pass. If the drafts are good, the checker approves them. If not, it fixes the first mistake and continues. This is the core idea behind speculative decoding: use a fast draft process so a target model can preserve its output distribution in theory while reducing latency when the workload and implementation fit.[1][2]

This can work because the expensive part of low-batch LLM generation is often repeatedly moving model weights and state, not only arithmetic. A small draft model proposes several tokens, and the large target model scores that proposed chunk in one verification pass instead of spending one full decode step per token.

Speculative decoding targets low-batch LLM inference that is memory-bandwidth bound. Verifying a short drafted chunk still costs extra work, and it helps only when accepted-token savings outweigh the draft and verification overhead.


The inference bottleneck

To understand why speculative decoding works, it helps to first understand the fundamental limitations of standard autoregressive generation. In a standard setup, an LLM generates text one token at a time, and each new token depends on all the previously generated tokens. This sequential dependency creates a massive bottleneck.

Most engineers instinctively assume that the slow speed of LLM inference is due to the sheer volume of mathematical calculations required by billions of parameters. However, modern GPUs are incredibly fast at performing matrix multiplications. The real bottleneck lies elsewhere: moving data around. The processor is constantly starved for data, waiting for the massive model weights to be fetched from memory before it can perform its rapid calculations.

Arithmetic intensity

Think of a fulfillment line. Arithmetic intensity measures how much work gets done for each delivery of input data. If one data load lets the GPU verify 100 candidate tokens, that's high intensity, and the GPU is keeping busy. But if each data load verifies only 1 token, the compute units are mostly waiting for the next memory transfer.

In GPU terms, this ratio is the number of FLOPs (floating point operations, a measure of computational performance) performed per byte of data loaded from memory:

$$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Transferred}}$$

Reading the formula

This ratio helps diagnose whether hardware is spending time computing or waiting for data. Under a weight-only back-of-the-envelope model, single-token decode performs about 1 FLOP per byte of weights moved.[2]

During autoregressive decoding, generating one token with a model of $P$ parameters in FP16 (16-bit floating-point) requires:

  • Compute: $\approx 2P$ FLOPs (one matrix-vector multiplication per layer)
  • Memory: $\approx 2P$ bytes loaded (entire model weights in FP16)

This gives an arithmetic intensity of ~1 FLOP/byte in the weight-only model. Real decode also pays for KV-cache traffic, activations, kernels, scheduling, and batching, so profile the actual engine before declaring a bottleneck.[2]

For an illustrative FP16 70B target, the weight footprint alone is about 140 GB. Insert a measured or documented device bandwidth into the simplified model before comparing serial decode with verification:

weight-streaming-diagnostic.py
```python
params_b = 70
bytes_per_weight = 2
example_bandwidth_gbs = 3_350  # example input; use the deployed accelerator specification
weights_gb = params_b * bytes_per_weight
weight_stream_ms = weights_gb / example_bandwidth_gbs * 1_000

print(f"FP16 weight footprint: {weights_gb} GB")
print(f"weight-only read time at {example_bandwidth_gbs} GB/s: {weight_stream_ms:.1f} ms")
print("This is a diagnostic lower bound, not measured request latency.")
```
Output
```text
FP16 weight footprint: 140 GB
weight-only read time at 3350 GB/s: 41.8 ms
This is a diagnostic lower bound, not measured request latency.
```

| Phase | Simplified expectation | Bottleneck to measure | Intuition |
| --- | --- | --- | --- |
| Prefill (many tokens together) | Higher intensity | Often compute or mixed | Large matrix work amortizes weight loads |
| Low-batch decode (1 token) | ~1 FLOP/byte in FP16 weight-only model | Often memory bandwidth | Move active weights to emit one new token |

Figure: One Weight Read, More Verified Tokens
Speculation helps when the target pass is memory-bound: the same weight read can verify several drafted positions, while ordinary decode pays the target pass once per emitted token.

Do not assume low-batch decode is compute-bound without measuring it. Weight traffic is often a dominant cost when the target emits one token at a time.[2]

The key insight

Here's what makes speculative decoding so clever. Imagine a slow quality-control station where every package must be scanned one at a time. The bottleneck isn't checking the barcode (that's fast); it's moving each package through the scanner separately. Now imagine if the station could inspect 5 proposed packages in one batch and reject only the first bad one. The line moves much faster, even though the station does slightly more work per batch.

That's exactly what happens here: because weight movement dominates, verifying a short candidate chunk can be much closer to one target-model pass than to $K$ separate decode passes. We use a cheap draft model to propose a sequence of candidate tokens. The target model then uses teacher forcing, meaning it scores a known candidate sequence in parallel instead of generating those tokens one by one. The weight-loading cost is paid once for the whole drafted chunk, which raises arithmetic intensity and improves throughput when the draft is accurate enough.
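
The effect on arithmetic intensity can be sketched with the same weight-only model. This is a minimal illustration, assuming each scored position costs about 2P FLOPs and the FP16 weights are read once per target pass; it ignores KV-cache and activation traffic, so read the numbers as a trend, not a measurement. The function name `weight_only_intensity` is illustrative.

```python
# Weight-only arithmetic intensity when one target pass scores several positions.
# Assumption: each scored position costs ~2P FLOPs and the FP16 weights (~2P bytes)
# are read once per pass. KV-cache and activation traffic are ignored.
params = 70e9          # illustrative 70B-parameter target
bytes_per_weight = 2   # FP16


def weight_only_intensity(positions_scored: int) -> float:
    flops = 2 * params * positions_scored
    bytes_moved = bytes_per_weight * params
    return flops / bytes_moved


for positions in (1, 4, 6):
    intensity = weight_only_intensity(positions)
    print(f"{positions} position(s) per pass: ~{intensity:.0f} FLOP/byte")
```

Under these assumptions, intensity grows linearly with the number of positions scored per pass, which is why a verification pass over a short chunk can cost close to one ordinary decode step while the target stays memory-bound.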


The algorithm

Figure: accepted prefix
The draft model (small, fast) proposes a chain of candidate tokens. The target model (large, slow) scores every position in one forward pass, keeps the longest matching prefix, and corrects the first mismatch.

The figure is small enough to trace by hand. The draft model proposes "order ships from refund." The target accepts the first three tokens, rejects "refund," and samples a correction from the residual target distribution.

Accept/reject criterion

Think of the target model as a warehouse verifier checking a draft pick list. For each proposed token, the verifier asks: "Would I have chosen this token too?" If the target model likes it even more than the draft model did, instant approval. If the target model likes it less, it might still keep it (proportional to how close the preferences are), or reject it and write its own token instead.

Let's walk through a concrete example. Suppose the prefix so far is "The order" and the draft model proposes three tokens:

| Position | Draft token | Draft prob. | Target prob. | Verdict |
| --- | --- | --- | --- | --- |
| 1 | ships | 0.40 | 0.60 | Accept (target likes it more) |
| 2 | from | 0.30 | 0.15 | Roll: accept with probability 0.50 |
| 3 | Seattle | 0.20 | 0.25 | Accept (target likes it more) |

For token 1, the target probability (0.60) is higher than the draft probability (0.40), so the verifier always accepts. For token 2, the target probability (0.15) is lower than the draft (0.30). The verifier flips a weighted coin: it accepts with probability 0.15 / 0.30 = 0.50. If the coin comes up reject, the verifier stops checking further tokens and samples a correction from the residual distribution. Token 3 only matters if token 2 survived.

acceptance-probability.py
```python
draft_probability = 0.30
target_probability = 0.15
accept_probability = min(1.0, target_probability / draft_probability)

print(f"accept probability for 'from': {accept_probability:.2f}")
print("Later drafted positions are discarded after a rejection.")
```
Output
```text
accept probability for 'from': 0.50
Later drafted positions are discarded after a rejection.
```
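
The same three-token example can be extended to show why position 3 only matters if position 2 survived. A small sketch, using the probabilities from the table above; the variable names are illustrative.

```python
# Draft and target probabilities from the three-token example above.
draft_probs = [0.40, 0.30, 0.20]
target_probs = [0.60, 0.15, 0.25]

# Per-position acceptance probability: min(1, target / draft).
accept_probs = [min(1.0, p / q) for p, q in zip(target_probs, draft_probs)]

# A drafted token is kept only if every earlier token was also accepted.
survival = []
running = 1.0
for accept in accept_probs:
    running *= accept
    survival.append(running)

for position, (accept, reach) in enumerate(zip(accept_probs, survival), start=1):
    print(f"position {position}: accept={accept:.2f}, kept with probability {reach:.2f}")
```

Even though the target prefers "Seattle" more than the draft does, that token is kept only half the time, because it sits behind the coin flip at position 2.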

This construction preserves the target distribution exactly. The accepted branch contributes the overlap between the two distributions, and the residual sampler contributes the missing mass. Added together, they reconstruct the target probability for every token.[1]

Mathematically, for each draft token $t_i$, both models assign a probability to that token conditioned on the same prefix. We compare the target model's probability $p_{\text{target}}(t_i)$ with the draft model's probability $p_{\text{draft}}(t_i)$:

$$P(\text{accept } t_i) = \min\left(1, \frac{p_{\text{target}}(t_i)}{p_{\text{draft}}(t_i)}\right)$$

Reading the formula

Compute the ratio of the big model's probability to the draft model's probability for this token. If the big model likes it more (ratio >= 1), always accept. If the big model likes it less, accept randomly with probability equal to the ratio. The bigger the disagreement, the more likely rejection.

When a token is rejected at position $i$, we sample a correction token from the residual distribution:

$$p_{\text{residual}}(t) = \frac{\max(0, \; p_{\text{target}}(t) - p_{\text{draft}}(t))}{Z}$$

where $Z = \sum_t \max(0, \; p_{\text{target}}(t) - p_{\text{draft}}(t))$ is the normalizing constant.

In plain terms

The correction picks from tokens that the big model wanted more than the draft model predicted. If the draft said "the" had 20% probability but the big model wanted 35%, that extra 15% enters the residual pool. Tokens where the draft was already too generous (draft > target) get zero residual probability, since they were over-represented, not under-represented.

Figure: Acceptance Keeps Overlap, Residual Restores Missing Mass
The accept/reject test keeps the overlap between the draft and target distributions. When a token is rejected, the correction sampler draws only from probability mass the target wanted more than the draft did.

This accept/reject scheme is a modified rejection-sampling algorithm. The accepted branch contributes $\min(p_{\text{draft}}(t), p_{\text{target}}(t))$, and the residual sampler contributes the missing mass $\max(0, p_{\text{target}}(t) - p_{\text{draft}}(t))$. Add those two terms together and you recover $p_{\text{target}}(t)$ exactly.[1]

residual-correction.py
```python
tokens = ["cat", "dog", "mouse"]
draft = [0.60, 0.30, 0.10]
target = [0.40, 0.50, 0.10]
residual_mass = [max(0.0, p - q) for p, q in zip(target, draft)]
normalizer = sum(residual_mass)
residual = [value / normalizer if normalizer else 0.0 for value in residual_mass]

print(dict(zip(tokens, residual)))
print(f"positive correction mass: {normalizer:.2f}")
```
Output
```text
{'cat': 0.0, 'dog': 1.0, 'mouse': 0.0}
positive correction mass: 0.20
```
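
The "overlap plus missing mass" argument can be checked numerically on the same toy distributions. A minimal sketch, assuming a single position and a draft token sampled from the draft distribution; `accepted_mass` and `emitted` are illustrative names.

```python
# Toy distributions from the residual example above.
draft = [0.60, 0.30, 0.10]
target = [0.40, 0.50, 0.10]

# Probability of drafting token t and accepting it:
# q(t) * min(1, p(t) / q(t)) = min(p(t), q(t)).
accepted_mass = [min(p, q) for p, q in zip(target, draft)]
reject_probability = 1.0 - sum(accepted_mass)

residual_mass = [max(0.0, p - q) for p, q in zip(target, draft)]
normalizer = sum(residual_mass)
residual = [m / normalizer if normalizer else 0.0 for m in residual_mass]

# Emitted probability = accepted branch + rejection probability * residual sample.
emitted = [a + reject_probability * r for a, r in zip(accepted_mass, residual)]

print(f"overall acceptance probability: {sum(accepted_mass):.2f}")
print("emitted:", [round(value, 6) for value in emitted])
print("target: ", target)
```

The emitted distribution matches the target up to floating-point rounding. The overall acceptance probability is the summed overlap of the two distributions, so a draft that tracks the target more closely is accepted more often.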

Implementation

The following code demonstrates a simplified version of the speculative decoding algorithm using PyTorch. It requires two components: a pre-trained target model and a smaller, computationally efficient draft model. The algorithm generates $K$ draft tokens autoregressively with the small model, concatenates them with the current context, and then validates the entire sequence in a single forward pass through the target model.

This version is intentionally pedagogical: it handles the accept/reject loop and residual sampling, but leaves out production concerns like KV-cache reuse, EOS handling, batching, and logits processors. It assumes Hugging Face-style causal-LM logits, where position $j$ predicts token $j + 1$. In a production sampler, the acceptance test has to use the same post-processed distributions you actually serve, not a different temperature or top-p configuration.[1][2]

served-distribution-parity.py
```python
def top_k_normalize(probabilities, k):
    kept = sorted(range(len(probabilities)), key=probabilities.__getitem__, reverse=True)[:k]
    total = sum(probabilities[index] for index in kept)
    return [probabilities[index] / total if index in kept else 0.0 for index in range(len(probabilities))]

raw_target = [0.55, 0.30, 0.15]
raw_draft = [0.40, 0.35, 0.25]
served_target = top_k_normalize(raw_target, k=2)
served_draft = top_k_normalize(raw_draft, k=2)
token_id = 1

raw_accept = min(1.0, raw_target[token_id] / raw_draft[token_id])
served_accept = min(1.0, served_target[token_id] / served_draft[token_id])
print(f"raw acceptance: {raw_accept:.3f}")
print(f"served top-k acceptance: {served_accept:.3f}")
print("Verification must use served probabilities.")
```
Output
```text
raw acceptance: 0.857
served top-k acceptance: 0.756
Verification must use served probabilities.
```
implementation.py
```python
import torch
import torch.nn.functional as F

def speculative_decode(
    target_model: torch.nn.Module,
    draft_model: torch.nn.Module,
    input_ids: torch.Tensor,  # (1, seq_len)
    K: int = 5,
    max_new_tokens: int = 100,
) -> torch.Tensor:
    """Pedagogical speculative decoding loop."""
    generated = input_ids.clone()

    tokens_generated = 0
    while tokens_generated < max_new_tokens:
        step_k = min(K, max_new_tokens - tokens_generated)
        prompt_len = generated.shape[1]

        # Step 1: Draft step_k tokens autoregressively with the small model.
        draft_tokens: list[int] = []
        draft_probs: list[torch.Tensor] = []
        draft_input = generated.clone()

        for _ in range(step_k):
            with torch.no_grad():
                logits = draft_model(draft_input).logits[:, -1, :]  # (1, vocab)
                probs = F.softmax(logits, dim=-1)
                token = torch.multinomial(probs, num_samples=1)

            draft_tokens.append(token.item())
            draft_probs.append(probs.squeeze(0))
            draft_input = torch.cat([draft_input, token], dim=-1)

        verify_suffix = torch.tensor(
            [draft_tokens],
            device=generated.device,
            dtype=generated.dtype,
        )
        verify_input = torch.cat([generated, verify_suffix], dim=-1)

        # Step 2: One target pass scores every drafted position at once.
        with torch.no_grad():
            target_logits = target_model(verify_input).logits

        # In Hugging Face causal LMs, position prompt_len - 1 predicts
        # the first drafted token.
        for i in range(step_k):
            pos = prompt_len - 1 + i
            target_p = F.softmax(target_logits[:, pos, :], dim=-1).squeeze(0)
            draft_p = draft_probs[i]
            token_id = draft_tokens[i]

            ratio = (target_p[token_id] / draft_p[token_id]).item()
            if torch.rand(1).item() < min(1.0, ratio):
                continue

            residual = torch.clamp(target_p - draft_p, min=0)
            residual = residual / residual.sum()
            correction = torch.multinomial(residual, num_samples=1)

            accepted_prefix = torch.tensor(
                [draft_tokens[:i]],
                device=generated.device,
                dtype=generated.dtype,
            )
            generated = torch.cat(
                [generated, accepted_prefix, correction.unsqueeze(0)],
                dim=-1,
            )
            tokens_generated += i + 1
            break
        else:
            generated = torch.cat([generated, verify_suffix], dim=-1)
            tokens_generated += step_k

            # The last logit also predicts one bonus token beyond the draft.
            if tokens_generated < max_new_tokens:
                bonus_pos = prompt_len - 1 + step_k
                bonus_probs = F.softmax(target_logits[:, bonus_pos, :], dim=-1)
                bonus = torch.multinomial(bonus_probs, num_samples=1)

                generated = torch.cat([generated, bonus], dim=-1)
                tokens_generated += 1

    return generated
```

Tracing one step

Suppose input_ids currently contains the tokens for "Track order 42". The loop sets step_k = 5 and the draft model autoregressively generates ["ships", "from", "Seattle", "on", "Friday"]. The target model then scores all five draft positions in a single forward pass on the concatenated sequence.

If the verifier accepts the first three tokens but rejects the fourth, the code appends ["ships", "from", "Seattle"] plus a correction token sampled from the residual distribution. The loop then resumes from the new prefix, generating another batch of five draft tokens. If all five tokens are accepted, the code appends all five and also samples a bonus token from the last target logit, yielding six new tokens for one target pass.
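
The token bookkeeping in that trace can be isolated from the models. A small sketch, where `tokens_emitted` is a hypothetical helper that mirrors the loop's counting and assumes the `max_new_tokens` budget still has room for the bonus token.

```python
def tokens_emitted(accept_decisions: list[bool]) -> int:
    """Tokens appended in one round, mirroring the loop's bookkeeping.

    accept_decisions[i] is True when drafted token i passes verification.
    """
    for index, accepted in enumerate(accept_decisions):
        if not accepted:
            # Accepted prefix (index tokens) plus one correction token.
            return index + 1
    # Every draft accepted: all drafted tokens plus one bonus token.
    return len(accept_decisions) + 1


rejected_fourth = tokens_emitted([True, True, True, False, True])
all_accepted = tokens_emitted([True] * 5)
print(f"reject at fourth token: {rejected_fourth} new tokens")
print(f"all five accepted: {all_accepted} new tokens")
```

Every round emits at least one token, because a rejection at the first position still yields a correction sampled by the target.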


Speedup analysis

The expected number of tokens generated per verification step depends on the acceptance rate $\alpha$ and the speculation length $K$. A useful back-of-the-envelope model assumes each drafted token is accepted independently with probability $\alpha$ and that one verification pass costs about one normal target-model decode step. Think of this like drafting a pick list where each accepted item keeps the line moving, but the first wrong item forces a stop and correction.

Under that approximation, the expected tokens per verification round is given by the geometric series:

$$\mathbb{E}[\text{tokens per step}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

Expected tokens per round

$\alpha$ is the per-token acceptance probability and $K$ is how many tokens the draft model proposes. When $\alpha$ is high, you get close to $K+1$ tokens per round because many drafted tokens survive and you often collect the bonus token too. When $\alpha$ is low, most drafts get rejected early and you fall back toward 1 token per round.
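
The closed form can be cross-checked against the explicit outcome distribution of a round. A minimal sketch under the same independence assumption; the function names are illustrative.

```python
def closed_form(alpha: float, k: int) -> float:
    """Geometric-series formula for expected tokens per round."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expectation_from_distribution(alpha: float, k: int) -> float:
    # j accepted drafts followed by a rejection emits j + 1 tokens
    # (accepted prefix plus one correction).
    expected = sum((j + 1) * alpha**j * (1 - alpha) for j in range(k))
    # All k drafts accepted emits k + 1 tokens (drafts plus the bonus token).
    expected += (k + 1) * alpha**k
    return expected


for alpha in (0.6, 0.8, 0.9):
    print(
        f"alpha={alpha}: closed form={closed_form(alpha, 5):.3f}, "
        f"from distribution={expectation_from_distribution(alpha, 5):.3f}"
    )
```

Both computations agree because the probability that a round emits at least $j + 1$ tokens is $\alpha^j$, and summing those tail probabilities for $j = 0, \dots, K$ gives the geometric series.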

Wall-clock speedup

Wall-clock speedup must also account for the cost ratio $c$ (the time it takes to run the draft model relative to the target model). This gives a useful approximation, not an exact production forecast, because real systems also pay for cache growth, kernel launches, and batching effects.

$$\text{Speedup} = \frac{\mathbb{E}[\text{tokens per step}]}{1 + K \cdot c}$$

The numerator is how many tokens you get per verification round. The denominator is the modeled cost: 1 target-model pass plus $K$ draft-model passes, each costing fraction $c$ of a target pass. For example, if the draft model is 10x cheaper ($c = 0.1$) and you speculate $K = 5$ tokens, the denominator is $1 + 5 \times 0.1 = 1.5$.

Worked example

Use illustrative measured inputs. Suppose a candidate draft path costs 10% of a target pass ($c = 0.1$). You set $K = 5$ and observe an acceptance rate of $\alpha = 0.8$ in a benchmark.

Expected tokens per round = $(1 - 0.8^6) / (1 - 0.8) = (1 - 0.262) / 0.2 \approx 3.69$ tokens.

Cost denominator = $1 + 5 \times 0.1 = 1.5$ target-equivalent passes.

Speedup = $3.69 / 1.5 \approx 2.46$x.

The model predicts about 2.5x for those inputs. If acceptance changes to 0.6, the same K = 5 model predicts about 1.6x; at 0.9, it predicts about 3.1x. These are model outputs to compare against a benchmark, not promised throughput.

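| Candidate setup | $\alpha$ | $K$ | Assumed $c$ | Approx. tokens/round | Approx. speedup |
| --- | --- | --- | --- | --- | --- |
| Measured path A | 0.6 | 5 | 0.1 | 2.4 | 1.6x |
| Measured path A | 0.7 | 5 | 0.1 | 2.9 | 2.0x |
| Measured path A | 0.8 | 5 | 0.1 | 3.7 | 2.5x |
| Measured path A | 0.9 | 5 | 0.1 | 4.7 | 3.1x |
| Measured path B | 0.85 | 8 | 0.1 | 5.1 | 2.8x |
| Measured path C | 0.90 | 10 | 0.1 | 6.9 | 3.4x |
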
modeled-speculation-speedup.py
```python
def expected_tokens(acceptance: float, depth: int) -> float:
    return sum(acceptance**step for step in range(depth + 1))

def modeled_speedup(acceptance: float, depth: int, draft_cost: float) -> float:
    return expected_tokens(acceptance, depth) / (1 + depth * draft_cost)

for acceptance in (0.6, 0.8, 0.9):
    estimate = modeled_speedup(acceptance, depth=5, draft_cost=0.1)
    print(f"acceptance={acceptance:.1f}: modeled speedup={estimate:.2f}x")
```
Output
```text
acceptance=0.6: modeled speedup=1.59x
acceptance=0.8: modeled speedup=2.46x
acceptance=0.9: modeled speedup=3.12x
```

Figure: Speedup Comes From Acceptance Rate, Depth, and Draft Cost
The speedup curve is not monotonic in practice. Larger speculation depth helps when acceptance is high, but it wastes draft work when the draft is weak or misconfigured.

The useful $K$ depends on measured acceptance, draft cost, and serving behavior. A small sweep can begin with single-digit depths, but select from route-specific benchmarks rather than a universal default.

choose-speculation-depth.py
```python
def modeled_speedup(acceptance, depth, draft_cost):
    expected = sum(acceptance**step for step in range(depth + 1))
    return expected / (1 + depth * draft_cost)

measurements = {"acceptance": 0.72, "draft_cost": 0.12}
candidates = {
    depth: modeled_speedup(measurements["acceptance"], depth, measurements["draft_cost"])
    for depth in (1, 3, 5, 8)
}
best_depth = max(candidates, key=candidates.get)
print({depth: round(value, 3) for depth, value in candidates.items()})
print(f"model-selected K to benchmark: {best_depth}")
```
Output
```text
{1: 1.536, 3: 1.92, 5: 1.921, 8: 1.727}
model-selected K to benchmark: 5
```
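
The same model also gives a rough floor on how good the draft must be. The sketch below bisects for the acceptance rate where the modeled speedup equals 1.0; `break_even_acceptance` is a hypothetical helper, and the result inherits every simplification of the model (independent acceptance, verification priced as one target pass).

```python
def modeled_speedup(acceptance: float, depth: int, draft_cost: float) -> float:
    expected = sum(acceptance**step for step in range(depth + 1))
    return expected / (1 + depth * draft_cost)


def break_even_acceptance(depth: int, draft_cost: float, tolerance: float = 1e-6) -> float:
    """Bisect for the acceptance rate where modeled speedup equals 1.0.

    Assumes draft_cost < 1, so the modeled speedup crosses 1.0 inside (0, 1).
    """
    low, high = 0.0, 1.0
    while high - low > tolerance:
        mid = (low + high) / 2
        if modeled_speedup(mid, depth, draft_cost) < 1.0:
            low = mid  # still slower than plain decoding
        else:
            high = mid
    return high


for depth in (1, 3, 5, 8):
    floor = break_even_acceptance(depth, draft_cost=0.12)
    print(f"K={depth}: modeled break-even acceptance ~ {floor:.3f}")
```

If measured acceptance on a route sits near or below this floor, the model predicts that speculation costs more than it saves at that depth, which is a reason to shorten the draft or skip speculation for that route.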


Draft model choices

The draft mechanism determines your acceptance rate and your overall speedup. It requires a balance: if the draft is too weak, it gets rejected constantly. If it's too expensive, it erases the latency gains from verification. In practice, most systems use one of these families:

| Approach | Draft source | Main advantage | Main trade-off |
| --- | --- | --- | --- |
| Smaller same-family model | Separate assistant model with the same tokenizer | Simple exact speculative-decoding setup | Extra model to load and schedule |
| Medusa heads[3] | Extra heads attached to the target model | No separate model at inference time | Needs extra training and tree verification |
| EAGLE / EAGLE-3[4][5] | Target-coupled speculator over hidden states or direct token heads | Strong proposals without a full second model | More integration complexity |
| Prompt Lookup[6] | Reuse repeated n-grams from context | No extra model or training | Only helps when the context repeats itself |

Figure: Draft model
Classic draft-model speculation, Medusa-style heads, EAGLE, MTP, n-gram lookup, and suffix speculation all share the draft-then-verify idea, but their deployment costs and traffic profiles differ.

In the classic direct-token draft setup, require the separate draft model to share the exact same tokenizer as the target model. If token IDs map to different subwords, the target is not verifying the intended candidate sequence; reject that pairing or use a method that explicitly supports different tokenizers.

tokenizer-compatibility-gate.py
```python
draft_vocab = {"return": 14, " label": 88, " expires": 103}
target_vocab = {"return": 14, " label": 88, " expires": 104}
required_pieces = ["return", " label", " expires"]

mismatches = [
    piece for piece in required_pieces
    if draft_vocab.get(piece) != target_vocab.get(piece)
]
print(f"token-id mismatches: {mismatches}")
print(f"direct draft path allowed: {not mismatches}")
```
Output
```text
token-id mismatches: [' expires']
direct draft path allowed: False
```

Medusa: multi-head speculative decoding

Medusa avoids the need for a separate draft model entirely. Instead, it adds extra prediction heads to the target model itself.

Each Medusa head predicts a different future position from the same hidden state. This design fundamentally changes the architecture from generating a single draft sequence to predicting a tree of possible continuations. Since each head can propose multiple candidates, the candidates form a tree structure rather than a single chain. A specialized tree attention mechanism then evaluates all these candidate paths simultaneously in a single forward pass, efficiently discarding the incorrect branches while keeping the longest accepted path. This approach avoids the overhead and complexity of loading and orchestrating a separate draft model altogether.

The method map above shows that shape visually: hidden state fans out into several future-token heads, and the target verifies the resulting candidate tree.
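
The candidate tree can be sketched without any model. In this toy illustration, the per-head candidates and the `target_next` lookup are invented stand-ins, greedy prefix matching stands in for the acceptance rule, and the tree attention that scores all paths in one pass is replaced by a plain loop.

```python
from itertools import product

# Hypothetical top-2 candidates from three Medusa-style heads
# (future positions +1, +2, +3 from the same hidden state).
head_candidates = [
    ["ships", "arrives"],
    ["from", "on"],
    ["Seattle", "Friday"],
]

# Toy stand-in for the target model's greedy choice after each accepted prefix.
target_next = {
    (): "ships",
    ("ships",): "on",
    ("ships", "on"): "Friday",
}


def accepted_length(path: tuple) -> int:
    """Count leading tokens of `path` that match the target's greedy choice."""
    prefix = ()
    for token in path:
        if target_next.get(prefix) != token:
            break
        prefix += (token,)
    return len(prefix)


# Every combination of per-head candidates is one root-to-leaf path in the tree.
paths = list(product(*head_candidates))
best_path = max(paths, key=accepted_length)
print(f"candidate paths: {len(paths)}")
print(f"best path: {best_path}, accepted tokens: {accepted_length(best_path)}")
```

With two candidates per head, three heads already yield eight paths, which is why real implementations prune the tree and verify shared prefixes once instead of enumerating every path.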

EAGLE drafts from target-model internals instead of using a separate full assistant. Earlier EAGLE variants predict feature states, which can raise acceptance because those states carry more information than a plain token-only head.[4] EAGLE-3 moves further in that direction: the paper switches to direct token prediction, fuses low, middle, and high target-model layers, and trains the drafter with a training-time test procedure on its own outputs. The paper reports up to 6.5x speedup in its evaluation setup, but exact gains still depend on engine support, batch shape, and prompt mix.[5] Current serving docs such as vLLM expose EAGLE-family speculation, but the exact options and caveats change quickly.[7]

Prompt Lookup Decoding

Prompt Lookup Decoding (PLD) takes a completely different approach. Instead of using any neural model for drafting, it searches the current context window for matching n-grams and reuses them as draft tokens. This non-neural method works surprisingly well for tasks with significant repetition or pattern matching.

| Aspect | How it works |
| --- | --- |
| Draft source | Match n-grams from the prompt/context window |
| Candidate workloads | Code completion, summarization, repetitive text |
| Memory overhead | No extra model weights |
| Main failure mode | Little benefit when the context has little repetition |

The algorithm scans the context window for n-grams (typically 3-5 tokens) that match the end of the currently generated sequence. When it finds a match, it looks at what token followed that n-gram earlier in the context window and uses that as the next draft token. For example, if the model has generated "return label" and the context contains "return label expires Friday," PLD proposes "expires" as the next draft token.

PLD is especially effective on code generation and other repetitive tasks because variable names, function calls, and boilerplate often reappear within the context window.[6]

The key advantage is simplicity: there's no model to load, no training required, and no memory overhead. PLD can be combined with other speculative methods or used as a fallback when no neural draft model is available.

prompt-lookup-candidate.py
context = "return label expires Friday. damaged box needs review. return label"
tokens = context.split()
suffix = ["return", "label"]  # the most recently generated tokens

proposal = None
# Stop before the final occurrence so a match always has a following token.
for index in range(len(tokens) - len(suffix)):
    if tokens[index:index + len(suffix)] == suffix:
        proposal = tokens[index + len(suffix)]  # token that followed the earlier match
        break

print(f"matched suffix: {' '.join(suffix)}")
print(f"lookup proposal: {proposal}")
Output
matched suffix: return label
lookup proposal: expires
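The snippet above proposes a single token. A slightly fuller sketch, under assumptions of my own, tries the longest n-gram first, prefers the most recent earlier match, and returns several draft tokens at once. The function name `prompt_lookup_draft` and its parameters `max_ngram` and `num_draft` are hypothetical and do not mirror any particular library's API.

```python
def prompt_lookup_draft(tokens, max_ngram=3, num_draft=4):
    """Propose draft tokens by matching the sequence suffix earlier in the context.

    Tries the longest n-gram first, prefers the most recent earlier match,
    and returns up to num_draft tokens that followed it.
    """
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan right to left, skipping the suffix's own position at the end.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + num_draft]
    return []  # no match: fall back to ordinary decoding


context = "return label expires Friday . damaged box needs review . return label".split()
print(prompt_lookup_draft(context))
```

Every proposed token still goes through target verification, so a poor match costs some wasted verification work but does not change the output distribution.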


Production deployment

Adding speculative decoding to a production system isn't an automatic win. While it looks promising on paper, real-world performance depends heavily on workload characteristics. For the classic two-model draft-and-verify setup, the core trade-off is simple: you spend extra FLOPs on the draft path to save target-model memory bandwidth. Modern serving stacks now expose several speculation families, so the payoff is method-specific rather than a blanket yes-or-no.[7] Deploy blindly and you might actually decrease overall throughput and increase costs.

When the classic draft-model setup helps (and when it doesn't)

Before rolling out a separate draft model, you need to check if your workload will actually benefit from the draft-then-verify cycle. The technique is a trade-off: it burns additional compute (FLOPs) to save memory bandwidth. If your system is already compute-bound, this trade-off will usually backfire and reduce overall throughput.

| Scenario | Hypothesis before benchmark | Why test it |
| --- | --- | --- |
| Single-user, low-batch inference | Strong candidate | Target decode may be memory-bandwidth bound |
| Throughput-maximized batching | Measure carefully | Extra draft work can compete with saturated compute |
| Long outputs | Candidate | More decode steps can amortize setup |
| Very short outputs | Weak candidate | Setup and drafting may dominate |
| Repetitive outputs (code, templates) | Candidate | Draft or lookup acceptance may be higher |
| Diverse outputs | Measure carefully | Acceptance may vary with sampling and prompt mix |
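A rough way to form the hypothesis before benchmarking is to compare approximate arithmetic intensity against the hardware ridge point. The sketch below counts weight traffic only (it ignores KV-cache reads and attention), and the hardware figures are made-up placeholders, not a real spec sheet. `decode_intensity` is a hypothetical helper.

```python
def decode_intensity(batch_size, tokens_per_pass=1, bytes_per_weight=2):
    """Approximate FLOPs per weight byte read: 2 FLOPs (multiply + add) per weight per token."""
    return 2 * batch_size * tokens_per_pass / bytes_per_weight


# Hypothetical accelerator, chosen only to make the arithmetic concrete.
PEAK_FLOPS = 1.0e15        # FLOP/s (assumed)
MEM_BANDWIDTH = 3.0e12     # bytes/s (assumed)
ridge = PEAK_FLOPS / MEM_BANDWIDTH  # FLOP/byte where compute starts to bind

for batch in (1, 8, 64, 256):
    plain = decode_intensity(batch)
    verify = decode_intensity(batch, tokens_per_pass=5)  # 4 drafts + 1 position
    regime = "memory-bound" if verify < ridge else "compute-bound"
    print(f"batch={batch:>3} plain={plain:6.0f} verify={verify:6.0f} FLOP/byte -> {regime}")
```

Under these assumptions, verification stays below the ridge at small batch sizes and crosses it at large ones, which is where extra draft work starts competing for compute. Treat the result as a prompt for a benchmark, not a substitute for one.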
Figure: Ship Speculation With Workload Gates
Speculation should be rolled out like any serving optimization: measure the baseline, sweep methods and depth, canary by route, and keep a non-speculative fallback for incompatible features or peak traffic.

Current vLLM docs position separate draft-model speculation as an inter-token-latency optimization for medium-to-low QPS, memory-bound workloads. The same guide separates that setup from other methods such as EAGLE, MTP, n-gram, and suffix speculation, which have different latency-versus-throughput trade-offs.[7]

Serving-engine reality

Serving-engine support changes quickly. For example, vLLM's current docs list several speculation families, but they also call out known feature incompatibilities and separate theoretical losslessness from what you can expect under real hardware numerics.[7] Treat framework support as an operational detail you must validate in your own stack, not as a timeless property of the algorithm.

The practical rollout loop is usually straightforward:

  1. Measure baseline inter-token latency and throughput separately.
  2. Sweep the draft mechanism and speculation depth on real prompts, not toy strings.
  3. Track acceptance rate, output length, and p95 latency by workload class.
  4. Keep a non-speculative fallback path for peak-QPS periods or incompatible features.
speculation-canary-gate.py
baseline = {"p95_itl_ms": 46.0, "throughput_tps": 380, "sampler_parity": True}
canary = {"p95_itl_ms": 29.0, "throughput_tps": 372, "sampler_parity": True}
minimum_throughput_ratio = 0.95  # canary may lose at most 5% throughput

# Promote only if sampling matches, latency improves, and throughput holds.
promote = (
    canary["sampler_parity"]
    and canary["p95_itl_ms"] < baseline["p95_itl_ms"]
    and canary["throughput_tps"] >= baseline["throughput_tps"] * minimum_throughput_ratio
)
print(f"inter-token latency improved: {canary['p95_itl_ms'] < baseline['p95_itl_ms']}")
print(f"canary promoted: {promote}")
Output
inter-token latency improved: True
canary promoted: True
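Step 3 of the rollout loop asks for metrics split by workload class. A minimal sketch of that aggregation is below; the record layout, class names, and numbers are invented for illustration, and the p95 uses a simple nearest-rank rule rather than any particular monitoring system's definition.

```python
import math
from collections import defaultdict

# Hypothetical per-request records: (workload_class, drafted, accepted, itl_ms)
records = [
    ("code", 40, 31, 22.0), ("code", 36, 27, 25.0), ("code", 44, 35, 21.0),
    ("chat", 40, 18, 41.0), ("chat", 32, 13, 44.0), ("chat", 48, 20, 39.0),
]


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


by_class = defaultdict(lambda: {"drafted": 0, "accepted": 0, "itl": []})
for workload, drafted, accepted, itl_ms in records:
    stats = by_class[workload]
    stats["drafted"] += drafted
    stats["accepted"] += accepted
    stats["itl"].append(itl_ms)

summary = {}
for workload, stats in by_class.items():
    acceptance = stats["accepted"] / stats["drafted"]
    summary[workload] = (acceptance, p95(stats["itl"]))
    print(f"{workload}: acceptance={acceptance:.3f} p95_itl_ms={summary[workload][1]:.1f}")
```

Splitting this way makes it visible when one route benefits from speculation while another does not, which a single global average would hide.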

When speculation backfires

Speculative decoding isn't a universal speedup button. Here are three common failure modes, with symptoms and fixes.

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Speedup is near 1x or negative | Draft model is too slow or too inaccurate (low acceptance rate) | Benchmark a smaller or better-aligned draft, or switch to Prompt Lookup for repetitive tasks |
| Correctness checks fail | Acceptance test uses different temperature or top-p than the served model | Ensure the verifier and the sampler share the exact same post-processed distribution |
| Memory usage spikes unexpectedly | KV cache wasn't truncated after a rejected token | Implement cache rewind so rejected draft tokens don't persist in the cache |


Mastery check

Evaluation rubric

  • Explain why low-batch autoregressive decode is often memory-bandwidth bound, not math-bound.
  • Describe the draft-then-verify loop and why teacher-forced verification improves arithmetic intensity.
  • Implement accept/reject logic with probability ratios and residual sampling.
  • Analyze speedup as a function of acceptance rate, speculation depth $K$, and draft-to-target cost ratio $c$.
  • Compare standard draft-model speculation with Medusa, EAGLE, Prompt Lookup, and serving-engine variants.
  • Identify when speculation helps production serving and when it can reduce throughput.
  • Debug tokenizer mismatch, sampler mismatch, KV-cache rewind, and "lossless" implementation caveats.

Follow-up questions

How does speculative decoding preserve the target distribution?

It uses modified rejection sampling. Accepted draft tokens contribute the overlap between draft and target distributions, while rejected tokens are replaced by samples from the residual target mass. This is exact in theory when verification uses the same post-processed distribution as serving. Real systems can still differ slightly because of finite-precision kernels and engine details.[1][7]
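The claim can be checked exactly on a toy vocabulary. The sketch below computes the output distribution of one accept/reject step in closed form and compares it with the target. The distributions are made up, and `speculative_marginal` is a hypothetical helper name.

```python
def speculative_marginal(target, draft):
    """Exact output distribution of one accept/reject step over a shared vocabulary."""
    # P(draft proposes t and it is accepted) = q(t) * min(1, p(t)/q(t)) = min(p(t), q(t))
    accepted = {t: min(target[t], draft[t]) for t in target}
    reject_prob = 1.0 - sum(accepted.values())
    residual_mass = {t: max(0.0, target[t] - draft[t]) for t in target}
    z = sum(residual_mass.values())
    return {
        t: accepted[t] + (reject_prob * residual_mass[t] / z if z > 0 else 0.0)
        for t in target
    }


target = {"the": 0.5, "a": 0.3, "an": 0.2}
draft = {"the": 0.2, "a": 0.7, "an": 0.1}
marginal = speculative_marginal(target, draft)
for token in target:
    print(f"{token}: target={target[token]:.3f} speculative={marginal[token]:.3f}")
```

The two columns agree because the rejection probability equals the normalizer of the residual, so the accepted mass and the residual mass add back up to the target probability for every token.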

What determines speedup?

Acceptance probability $\alpha$, speculation depth $K$, and draft-to-target cost ratio $c$. A useful approximation is $(1 - \alpha^{K+1}) / ((1 - \alpha)(1 + cK))$, but production results also depend on batching, cache behavior, kernels, and workload mix.
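To get a feel for the approximation, it helps to sweep the depth for a few acceptance rates. The sketch below assumes a cost ratio of 0.05, which is an illustrative value and not a measurement; `estimated_speedup` is a hypothetical helper.

```python
def estimated_speedup(alpha, k, c):
    """(1 - alpha^(K+1)) / ((1 - alpha) * (1 + c*K)), valid for 0 <= alpha < 1."""
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    relative_cost = 1 + c * k
    return expected_tokens / relative_cost


c = 0.05  # assumed draft cost relative to one target pass
for alpha in (0.5, 0.7, 0.9):
    best_k = max(range(1, 17), key=lambda k: estimated_speedup(alpha, k, c))
    print(f"alpha={alpha}: best K={best_k}, "
          f"estimated speedup={estimated_speedup(alpha, best_k, c):.2f}x")
```

In this model, higher acceptance tends to favor deeper speculation, while low acceptance caps the benefit regardless of depth. Real engines add effects the formula does not capture, so use it to narrow the sweep rather than to pick a final setting.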

How does temperature affect performance?

Temperature changes the served distributions and therefore acceptance; its direction and magnitude depend on target, draft, and prompt mix. Measure acceptance under the actual sampler, and make the verifier use that same processed distribution.
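One way to see the effect is to compute the per-token acceptance probability, which for a draft sampled from $q$ and a target $p$ is the overlap $\sum_t \min(p(t), q(t))$, at several temperatures. The logits below are invented, and both models are assumed to share the same temperature.

```python
import math


def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_acceptance(target_probs, draft_probs):
    """Probability that one draft token is accepted: sum over tokens of min(p, q)."""
    return sum(min(p, q) for p, q in zip(target_probs, draft_probs))


# Hypothetical logits over a 4-token vocabulary.
target_logits = [4.0, 2.5, 1.0, 0.5]
draft_logits = [3.0, 3.2, 0.8, 0.1]

for temperature in (0.3, 0.7, 1.0, 1.5):
    p = softmax_with_temperature(target_logits, temperature)
    q = softmax_with_temperature(draft_logits, temperature)
    print(f"T={temperature}: expected acceptance={expected_acceptance(p, q):.3f}")
```

In this toy setup the two models disagree on the top token, so sharper distributions overlap less and acceptance rises with temperature. If the models agreed on the top token, the trend could reverse, which is why the direction has to be measured on the real sampler.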

How does Medusa differ from separate draft-model speculation?

A separate draft model proposes one chain with another model. Medusa attaches future-token heads to the target model, proposes a tree of candidates from target hidden states, and verifies those candidates with tree attention.[3]

When would you not use classic draft-model speculation?

Be cautious in high-QPS, batch-heavy, compute-saturated serving unless benchmarks show a win. Extra draft work can erase latency savings. Other speculation methods can still help, so test by method and workload class.[7]

What is Prompt Lookup Decoding best for?

It drafts by reusing n-gram continuations already present in context. It works best for repetitive workloads such as code, templates, and some summarization because it needs no extra model but depends on repeated context.[6]

Common pitfalls

  • Output quality drifts only on sampled traffic. Cause: the acceptance test used different temperature, top-p, top-k, or logits processors than the served sampler. Fix: build the ratio and residual from the exact post-processed distribution you serve.
  • Acceptance collapses almost immediately. Cause: tokenizer, vocabulary, or prompt-format mismatch between draft and target paths. Fix: verify exact token IDs, chat template, and special-token handling before tuning $K$.
  • Speedup stays near 1x or goes negative. Cause: the draft is too slow, too inaccurate, or both. Fix: inspect acceptance rate and draft cost before increasing depth; then shrink the drafter, pick a better-aligned method, or disable speculation on that workload.
  • Benchmarks look great on one route and bad on another. Cause: speculation depth was tuned on a global average instead of by workload class. Fix: split latency, acceptance, and cost by prompt family, output length, and QPS band.
  • Outputs break after the first rejection. Cause: rejected draft tokens were left in the target KV cache. Fix: rewind the target cache to the accepted prefix before appending the correction token.
  • "Lossless" is treated as bitwise-identical across every engine. Cause: the proof is distributional, while real systems still differ in kernels, numerics, and feature support. Fix: compare outputs and latency against a non-speculative baseline on real traffic, not only against theory.[7]
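The cache-rewind pitfall is easier to reason about with a toy model. The sketch below uses a plain list as a stand-in for a per-sequence KV cache, with one entry per token position. `ToyKVCache` and `commit_verification` are hypothetical names, and real engines manage paged or block-based caches with their own APIs.

```python
class ToyKVCache:
    """Stand-in for a per-sequence KV cache: one entry per cached token position."""

    def __init__(self):
        self.entries = []

    def append(self, token_ids):
        self.entries.extend(token_ids)

    def rewind(self, length):
        """Drop every position past `length` so rejected drafts cannot be attended to."""
        del self.entries[length:]


def commit_verification(cache, prefix_len, draft_tokens, num_accepted, correction):
    """Keep the accepted drafts, discard the rejected ones, then add the correction token."""
    cache.rewind(prefix_len + num_accepted)
    cache.append([correction])
    return draft_tokens[:num_accepted] + [correction]


cache = ToyKVCache()
cache.append([101, 102, 103])            # committed prefix
drafts = [7, 8, 9, 10]
cache.append(drafts)                     # verification pass wrote all draft positions
emitted = commit_verification(cache, prefix_len=3, draft_tokens=drafts,
                              num_accepted=2, correction=42)
print(emitted, cache.entries)
```

The invariant worth asserting in a real integration is the same one shown here: after each cycle, the cache length equals the committed sequence length, with no entries left over from rejected drafts.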

Check your understanding

Try to answer these questions without looking back at the article.

1. Concrete arithmetic. A 70B model in FP16 has about 140 GB of weights. Under the simplified bandwidth model, streaming those weights once per target pass accounts for a large part of low-batch decode time. Why can verifying five draft tokens in one target pass save time?

Sketch: If measured target decode is bandwidth-limited, one verification pass can amortize much of the target weight traffic across several accepted positions. The benchmark still has to account for extra draft work, cache traffic, and rejection.
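The amortization argument can be made concrete with a few lines of arithmetic. The bandwidth figure below is an assumption chosen for illustration, and the calculation counts weight reads only, ignoring draft cost, KV-cache traffic, and compute.

```python
WEIGHT_BYTES = 140e9        # 70B parameters x 2 bytes (FP16), from the question
BANDWIDTH = 3.0e12          # bytes/s, assumed for illustration only

pass_ms = WEIGHT_BYTES / BANDWIDTH * 1e3   # time to stream the weights once


def ms_per_token(tokens_per_pass):
    """Weight-read time charged to each emitted token when one pass yields several tokens."""
    return pass_ms / tokens_per_pass


for tokens in (1, 2, 3, 6):
    print(f"{tokens} token(s) per target pass: "
          f"{ms_per_token(tokens):.1f} ms of weight reads per token")
```

The per-token figure shrinks only as fast as the number of tokens that actually survive verification, so the realized saving depends on acceptance rather than on the number of drafts submitted.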

2. Acceptance probability. A draft token has draft probability 0.4 and target probability 0.2. What is the acceptance probability, and what happens if the token is rejected?

Sketch: Acceptance probability = min(1, 0.2 / 0.4) = 0.5. If rejected, sample a correction from the residual distribution $p_{\text{residual}}(t) = \max(0, p_{\text{target}}(t) - p_{\text{draft}}(t)) / Z$. Tokens the target wanted more than the draft receive positive residual mass; tokens the draft already overestimated receive zero.
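The same numbers can be run as a small simulation. The vocabulary and the probabilities for tokens other than the drafted one are invented to complete the distributions, and `verify_draft_token` is a hypothetical helper rather than any engine's API.

```python
import random


def verify_draft_token(draft_token, target, draft, rng):
    """Accept with probability min(1, p/q); otherwise sample from the residual distribution."""
    ratio = target[draft_token] / draft[draft_token]
    if rng.random() < min(1.0, ratio):
        return draft_token, True
    residual = {t: max(0.0, target[t] - draft[t]) for t in target}
    tokens = list(residual)
    correction = rng.choices(tokens, weights=[residual[t] for t in tokens])[0]
    return correction, False


# Toy vocabulary. "Friday" mirrors the question: draft 0.4, target 0.2.
target = {"Friday": 0.2, "Monday": 0.5, "soon": 0.3}
draft = {"Friday": 0.4, "Monday": 0.4, "soon": 0.2}

rng = random.Random(0)
trials = 20_000
results = [verify_draft_token("Friday", target, draft, rng) for _ in range(trials)]
acceptance = sum(ok for _, ok in results) / trials
print(f"empirical acceptance for 'Friday': {acceptance:.3f}")
```

The empirical rate should land close to 0.5, and a rejected "Friday" is never replaced by "Friday" again because its residual mass is zero.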

3. Weak speedup. You deploy a 1B draft with a 70B target and see only 1.1x speedup. Your acceptance rate is 0.5 and $K = 3$. Should you increase $K$ to 10, switch to a larger draft model, or investigate something else?

Sketch: Low acceptance means most drafts get rejected early. Increasing $K$ wastes effort on tokens that won't survive. A larger draft might raise $\alpha$ but costs more per token. First, check for tokenizer mismatch or temperature mismatch between draft and target. If those are correct, try a smaller, better-aligned draft or switch to Prompt Lookup for repetitive workloads.
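A quick calculation shows why deeper speculation does not help here. Assuming each draft position is accepted independently with probability 0.5 (a simplification), the expected number of tokens emitted per cycle is a geometric sum. `expected_tokens_per_cycle` is a hypothetical helper.

```python
def expected_tokens_per_cycle(alpha, k):
    """Expected tokens emitted per draft-then-verify cycle: 1 + alpha + ... + alpha^K."""
    return sum(alpha ** i for i in range(k + 1))


alpha = 0.5
for k in (3, 10):
    print(f"K={k}: expected tokens per cycle = {expected_tokens_per_cycle(alpha, k):.3f}")
```

Going from depth 3 to depth 10 moves the expectation from 1.875 to about 1.999 tokens per cycle while more than tripling the draft work, so the extra depth is almost entirely wasted at this acceptance rate.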


Key takeaways and next steps

  1. Why it works: under the usual back-of-the-envelope model, autoregressive decode runs at about 1 FLOP/byte, so memory bandwidth, not raw math throughput, is the bottleneck.
  2. Core algorithm: draft $K$ tokens with a cheap model, then verify all $K$ in one target-model forward pass.
  3. Mathematical guarantee: with the same served distributions and correct residual sampling, modified rejection sampling reconstructs the target distribution in theory; engine numerics and integration still require checks.
  4. Speedup model: $\text{Speedup} = \frac{1 - \alpha^{K+1}}{(1 - \alpha)(1 + K \cdot c)}$ is a useful approximation, not an exact production forecast.
  5. Draft choices: smaller assistant models, Medusa, EAGLE, and Prompt Lookup Decoding all trade simplicity against proposal quality and integration cost.
  6. Production: classic draft-model speculative decoding is most attractive for latency-sensitive, memory-bound workloads, while other speculation methods have more method-specific QPS trade-offs.

You've now seen why speculative decoding works, how the acceptance logic preserves the target distribution, and how to estimate speedup from acceptance rate and draft cost. The core insight is simple: memory bandwidth dominates autoregressive decode, so parallel verification amortizes the weight-loading cost across multiple tokens.

Next Step
Continue to Long Context Window Management

Speculative decoding exploits the memory wall in single-token decode; long contexts push that wall even harder because the KV cache grows with every token and attention costs rise. In the next chapter, you'll learn how to manage that growth: KV-cache math, prefill-vs-decode trade-offs, RoPE scaling, and when to use long-context inference versus retrieval augmentation.

References

[1] Fast Inference from Transformers via Speculative Decoding. Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023

[2] Accelerating Large Language Model Decoding with Speculative Sampling. Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., & Jumper, J. · 2023

[3] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Cai, T., et al. · 2024 · ICML 2024

[4] EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. Li, Y., et al. · 2024 · ICML 2024

[5] EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. Li, Y., Wei, F., Zhang, C., & Zhang, H. · 2025

[6] Accelerating LLM Inference by Reusing the Previous Context via Prompt Lookup Decoding. Spector, B. & Re, C. · 2024

[7] Speculative Decoding. vLLM Team · 2026 · vLLM Documentation