LeetLLM
LearnPracticeFeaturesBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Practice
  • Features
  • Blog

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 155 articles completed

🛠️Computing Foundations0/6
NumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models & Image GenerationReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnML Algorithms & EvaluationDecoding Algorithms
🚀MediumInference Optimization

Decoding Algorithms

Turn retrieved evidence into controlled text by implementing stable softmax, sampling filters, constrained decoding, beam search, and reproducible generation audits.

16 min read
Learning path
Step 35 of 155 in the full curriculum
Core Retrieval AlgorithmsExperiment Design and A/B Testing

The previous chapter put the right policy passage into a support assistant's context. That wasn't the final answer. A language model still has to turn its next-token scores into actual text, one token at a time.

Suppose retrieval found this evidence:

Annual-plan purchases are refundable within 30 days. This customer purchased 12 days ago.

The answer should communicate eligibility. Yet a model may assign some probability to pending, denied, or an irrelevant token such as coupon. Decoding is the algorithm that selects a continuation from those scores. It can make output deterministic, diverse, shorter, constrained, or faster to serve. It can't rescue missing evidence or guarantee that a fluent token is supported by the policy.

In this chapter, you'll build the critical pieces yourself: stable softmax, greedy generation, temperature and truncation, seeded sampling, output masks, beam search, length-aware scoring, and a decoder evaluation receipt. Those are the controls that must be understood before comparing an LLM product experiment.

Refund evidence places a purchase at day 12 inside a 30-day policy window, above a candidate matrix comparing softmax probability, greedy selection, schema allowance, and policy support for eligible, pending, denied, and coupon. Refund evidence places a purchase at day 12 inside a 30-day policy window, above a candidate matrix comparing softmax probability, greedy selection, schema allowance, and policy support for eligible, pending, denied, and coupon.
The same next-token distribution supports different questions. Greedy selects the largest probability, a schema removes invalid forms such as coupon, and the retrieved 30-day policy is still needed to establish that only eligible is supported for a purchase made 12 days ago.

Start With One Next-Token Distribution

For readable arithmetic, pretend each word below is a single token. The prompt ends with:

text
1Evidence: Annual-plan purchases are refundable within 30 days. Purchase age: 12 days. 2Answer: Your refund request is

Assume a model produces these raw scores, called logits, for the next token:

TokenLogitInterpretation
eligible3.0Supported by the retrieved policy.
pending2.5Plausible service language, but it doesn't answer eligibility.
denied1.0Conflicts with the evidence.
coupon-1.0Off-topic tail token.

Logits aren't probabilities: they can be negative and don't sum to one. Softmax turns them into a probability distribution:

pi=exp⁡(zi−m)∑jexp⁡(zj−m)wherem=max⁡jzjp_i = \frac{\exp(z_i - m)}{\sum_j \exp(z_j - m)} \qquad \text{where} \qquad m = \max_j z_jpi​=∑j​exp(zj​−m)exp(zi​−m)​wherem=jmax​zj​

Subtracting the maximum logit doesn't change the probabilities because it rescales every exponential by the same constant. It does prevent overflow when scores become large, which matters in real inference code.

stable-softmax.py
1from math import exp 2 3tokens = ["eligible", "pending", "denied", "coupon"] 4logits = [3.0, 2.5, 1.0, -1.0] 5 6def softmax(scores: list[float]) -> list[float]: 7 offset = max(scores) 8 weights = [exp(score - offset) for score in scores] 9 total = sum(weights) 10 return [weight / total for weight in weights] 11 12probabilities = softmax(logits) 13for token, score, probability in zip(tokens, logits, probabilities): 14 print(f"{token:8} logit={score:>4.1f} probability={probability:.3f}") 15print(f"sum={sum(probabilities):.3f}")
Output
1eligible logit= 3.0 probability=0.568 2pending logit= 2.5 probability=0.345 3denied logit= 1.0 probability=0.077 4coupon logit=-1.0 probability=0.010 5sum=1.000

The distribution tells a precise story: eligible is most likely, but the model hasn't assigned zero probability to unsupported text. Sampling policy therefore affects product behavior.

Numerical Stability Is Part of Correctness

The stable formula isn't an optional optimization. Directly calculating exp(1000) overflows ordinary floating-point arithmetic even though the correct softmax distribution is well behaved.

avoid-softmax-overflow.py
1from math import exp 2 3large_logits = [1000.0, 999.0, 998.0] 4 5try: 6 raw_weights = [exp(score) for score in large_logits] 7 naive = [weight / sum(raw_weights) for weight in raw_weights] 8 print("naive:", naive) 9except OverflowError: 10 print("naive softmax: overflow") 11 12offset = max(large_logits) 13stable_weights = [exp(score - offset) for score in large_logits] 14stable = [weight / sum(stable_weights) for weight in stable_weights] 15print("stable:", [round(probability, 3) for probability in stable])
Output
1naive softmax: overflow 2stable: [0.665, 0.245, 0.09]

A decoder with unstable arithmetic doesn't produce "slightly different" text. It produces invalid probabilities or fails to generate at all.

Stable softmax matrix tracing eligible, pending, denied, and coupon from raw logits through max-shifted scores, exponential weights, normalized probabilities, greedy selection, and the updated response prefix. Stable softmax matrix tracing eligible, pending, denied, and coupon from raw logits through max-shifted scores, exponential weights, normalized probabilities, greedy selection, and the updated response prefix.
For this captured step, subtracting the maximum logit produces shifted scores from 0.0 to -4.0. Their exponential weights sum to 1.760; normalization gives eligible probability 0.568, greedy selection appends it, and the updated prefix becomes the next model input.

Greedy Decoding Gives a Reproducible Baseline

Generation is autoregressive: after selecting one token, the system appends it to the sequence and asks the model for new logits for the next position. In a transformer service, a key-value (KV) cache stores attention state used by later steps; memory management for that growing cache is a central serving concern.[1]

Greedy decoding chooses the highest-logit token at every step. Here are two captured model steps for our policy answer. This is enough to implement the decoder without pretending we have a full model in the notebook.

greedy-generation-loop.py
1captured_logits = [ 2 {"eligible": 3.0, "pending": 2.5, "denied": 1.0, "coupon": -1.0}, 3 {".": 4.0, "today": 1.5, "after": 0.5}, 4] 5 6def greedy_token(scores: dict[str, float]) -> str: 7 return max(scores, key=scores.get) 8 9generated = [] 10for step, scores in enumerate(captured_logits, start=1): 11 chosen = greedy_token(scores) 12 generated.append(chosen) 13 print(f"step {step}: chose {chosen!r} from {len(scores)} candidates") 14 15answer = "Your refund request is " + " ".join(generated).replace(" .", ".") 16print(answer)
Output
1step 1: chose 'eligible' from 4 candidates 2step 2: chose '.' from 3 candidates 3Your refund request is eligible.

Greedy generation is valuable for tests and structured responses because the policy contains no random draw. Its limitation is just as important: picking the best next token doesn't explore a weaker first token that could lead to a better complete sequence.

Temperature Reshapes Probability Mass

Temperature divides every logit by a positive value TTT before softmax:

pi(T)=softmax⁡(ziT)p_i(T) = \operatorname{softmax}\left(\frac{z_i}{T}\right)pi​(T)=softmax(Tzi​​)
  • T < 1 sharpens the distribution, pushing more mass toward the current winner.
  • T > 1 flattens it, giving lower-scored tokens more opportunity to be sampled.
  • Temperature doesn't delete any candidate with finite logit.

The formula requires T > 0. For a deterministic baseline, use greedy selection rather than dividing logits by zero.

Entropy measures the uncertainty of a distribution. Higher entropy here means the decoder has more plausible ways to vary the next token, including harmful ones.

temperature-and-entropy.py
1from math import exp, log 2 3tokens = ["eligible", "pending", "denied", "coupon"] 4logits = [3.0, 2.5, 1.0, -1.0] 5 6def softmax(scores: list[float]) -> list[float]: 7 offset = max(scores) 8 weights = [exp(score - offset) for score in scores] 9 total = sum(weights) 10 return [weight / total for weight in weights] 11 12def entropy(probabilities: list[float]) -> float: 13 return -sum(p * log(p) for p in probabilities) 14 15for temperature in (0.5, 1.0, 2.0): 16 probabilities = softmax([score / temperature for score in logits]) 17 table = dict(zip(tokens, probabilities)) 18 print( 19 f"T={temperature:.1f}: eligible={table['eligible']:.3f} " 20 f"coupon={table['coupon']:.3f} entropy={entropy(probabilities):.3f}" 21 )
Output
1T=0.5: eligible=0.721 coupon=0.000 entropy=0.647 2T=1.0: eligible=0.568 coupon=0.010 entropy=0.933 3T=2.0: eligible=0.438 coupon=0.059 entropy=1.190

At higher temperature, the unsupported coupon continuation receives more probability along with genuinely useful variation. The printed 0.000 value at T=0.5 is rounded; the finite-logit token still has a small positive probability. Temperature is a diversity control, not a groundedness control.

Sampling Needs a Recorded Randomness Contract

A sampling decoder draws from probabilities rather than taking the maximum. This is useful when multiple outputs are acceptable, but it makes comparisons noisy unless you record the policy and the random seed used by your own experiment harness.

The next lab uses Python's random-number generator, so the same seed exactly replays the same draws in this local implementation.

seeded-sampling.py
1from random import Random 2 3tokens = ["eligible", "pending", "denied", "coupon"] 4probabilities = [0.568, 0.345, 0.077, 0.010] 5 6def draw_sequence(seed: int, length: int = 8) -> list[str]: 7 rng = Random(seed) 8 return rng.choices(tokens, weights=probabilities, k=length) 9 10for seed in (7, 7, 21): 11 print(f"seed={seed}: {' '.join(draw_sequence(seed))}")
Output
1seed=7: eligible eligible pending eligible eligible eligible eligible eligible 2seed=7: eligible eligible pending eligible eligible eligible eligible eligible 3seed=21: eligible pending pending eligible eligible pending pending eligible

Production backends may have their own reproducibility contract, and hardware or implementation changes can defeat byte-for-byte replay. At minimum, store model identifier, prompt version, decoder settings, token limit, and any supported seed alongside each evaluation output.

Top-k and Top-p Truncate the Tail

Temperature leaves every token in play. Top-k sampling keeps up to k highest-probability candidates, then renormalizes their probability mass before drawing. Top-k was used for neural story generation before nucleus sampling was proposed.[2]

top-k-filter.py
1tokens = ["eligible", "pending", "denied", "coupon"] 2probabilities = [0.568, 0.345, 0.077, 0.010] 3 4def top_k(items: list[str], probs: list[float], k: int) -> list[tuple[str, float]]: 5 ranked = sorted(zip(items, probs), key=lambda row: row[1], reverse=True)[:k] 6 retained_mass = sum(probability for _, probability in ranked) 7 return [(token, probability / retained_mass) for token, probability in ranked] 8 9for token, probability in top_k(tokens, probabilities, k=2): 10 print(f"{token:8} renormalized={probability:.3f}")
Output
1eligible renormalized=0.622 2pending renormalized=0.378

Top-p, also called nucleus sampling, keeps the smallest high-probability prefix whose cumulative mass reaches a threshold p, then renormalizes and samples from that nucleus. Holtzman et al. proposed it after finding that open-ended likelihood-maximizing generation can become repetitive, while unrestricted sampling can draw from an unreliable tail.[3]

top-p-adapts-to-the-distribution.py
1def nucleus(probabilities: dict[str, float], threshold: float) -> list[str]: 2 ranked = sorted(probabilities.items(), key=lambda row: row[1], reverse=True) 3 kept = [] 4 cumulative = 0.0 5 for token, probability in ranked: 6 kept.append(token) 7 cumulative += probability 8 if cumulative >= threshold: 9 break 10 return kept 11 12peaked = {"eligible": 0.568, "pending": 0.345, "denied": 0.077, "coupon": 0.010} 13flat = {"eligible": 0.28, "pending": 0.25, "denied": 0.24, "coupon": 0.23} 14 15print("peaked p=0.90:", nucleus(peaked, 0.90)) 16print("flat p=0.90:", nucleus(flat, 0.90))
Output
1peaked p=0.90: ['eligible', 'pending'] 2flat p=0.90: ['eligible', 'pending', 'denied', 'coupon']

This is the defining contrast:

Decoder controlChanges probabilities?Removes candidates?Candidate count
TemperatureYesNoUnchanged
Top-kRenormalizes retained tokensYesUp to k
Top-pRenormalizes retained tokensYesAdapts to each distribution

Constraints Enforce Form, Not Truth

An e-commerce assistant often needs structured output, such as a refund_status field with allowed values. A token mask can remove invalid values before selection. This is constrained decoding.

Make the distinction explicit:

  • A schema mask can stop the assistant from emitting coupon as a refund status.
  • Evidence validation is still needed to stop the assistant from choosing a valid but unsupported value such as denied.
mask-invalid-status-tokens.py
1logits = {"coupon": 6.0, "denied": 4.0, "eligible": 3.0, "pending": 2.0} 2schema_values = {"eligible", "pending", "denied"} 3supported_by_policy = {"eligible"} 4 5def constrained_greedy(scores: dict[str, float], allowed: set[str]) -> str: 6 candidates = {token: score for token, score in scores.items() if token in allowed} 7 if not candidates: 8 raise ValueError("constraint removed every token") 9 return max(candidates, key=candidates.get) 10 11print("unconstrained:", max(logits, key=logits.get)) 12print("schema-only:", constrained_greedy(logits, schema_values)) 13print("policy-supported:", constrained_greedy(logits, supported_by_policy))
Output
1unconstrained: coupon 2schema-only: denied 3policy-supported: eligible

The output shows why structured generation and grounded generation are different requirements. Good production evaluations test both.

This fixture makes each status one token. In a real tokenizer, a valid value may span several tokens. Production constrained decoding tracks allowed prefixes after every generated token rather than applying one static final-value mask.

Beam Search Compares Partial Sequences

Greedy selection can miss a better continuation at a later step. Suppose the response prefix is Your refund request, and the model has two ways to continue:

First tokenProbabilityStrongest next tokenProbabilityJoint probability after two steps
eligible0.40.0.200.080
is0.35eligible0.700.245
denied0.25.0.300.075

Greedy picks eligible immediately. At this fixed two-step horizon, beam search with width 2 retains both eligible and is, expands each path, then selects is eligible because that partial path scores higher.

Beam search keeps two paths and late score wins. Beam search keeps two paths and late score wins.
Path width follows probability. Greedy commits to eligible at 0.40, while beam width two preserves is at 0.35; its 0.70 continuation produces the winning 0.245 joint score.

This teaching fixture gives each retained prefix one captured second-token continuation. A real beam decoder expands every allowed next token for every retained prefix, combines the new log-probability scores, and prunes back to the configured beam width after each step. Unfinished prefixes remain candidates until an end-of-sequence token or another stop condition marks them complete.

Implement the fixture's two-step search in log space. Adding log probabilities is numerically safer than multiplying many tiny probabilities across a long sequence.

two-step-beam-search.py
1from math import exp, log 2 3first_step = {"eligible": 0.40, "is": 0.35, "denied": 0.25} 4second_step = { 5 "eligible": {".": 0.20}, 6 "is": {"eligible": 0.70}, 7 "denied": {".": 0.30}, 8} 9 10greedy_first = max(first_step, key=first_step.get) 11greedy_second = max(second_step[greedy_first], key=second_step[greedy_first].get) 12greedy_probability = first_step[greedy_first] * second_step[greedy_first][greedy_second] 13 14prefixes = sorted(first_step, key=first_step.get, reverse=True)[:2] 15candidates = [] 16for first in prefixes: 17 for second, p_second in second_step[first].items(): 18 candidates.append(((first, second), log(first_step[first]) + log(p_second))) 19 20winner, winner_log_probability = max(candidates, key=lambda row: row[1]) 21print("greedy:", f"{greedy_first} {greedy_second}", f"probability={greedy_probability:.3f}") 22print("beam:", " ".join(winner), f"probability={exp(winner_log_probability):.3f}")
Output
1greedy: eligible . probability=0.080 2beam: is eligible probability=0.245

Beam search is useful when complete-sequence likelihood matters, such as input-grounded transformation or tightly formatted output. For open-ended generation, Holtzman et al. found that maximization-based decoding, including beam search, can produce bland or repetitive text.[3]

Length Changes the Score You Are Optimizing

Raw sequence probability shrinks with each generated token because it multiplies values no larger than one. That creates a preference for short completions. Beam-search systems therefore define a length-scoring policy rather than comparing raw totals blindly. For example, Google's neural machine translation system used length normalization and a coverage penalty in its beam search.[4]

The simplified lab below uses average log probability to make the issue visible. It isn't a universal beam-search formula; a serving stack must document the formula it actually uses.

length-aware-sequence-score.py
1candidates = [ 2 {"text": "eligible", "log_probability": -0.25, "tokens": 1}, 3 {"text": "eligible within 30 days", "log_probability": -0.42, "tokens": 4}, 4] 5 6raw_winner = max(candidates, key=lambda row: row["log_probability"]) 7normalized_winner = max(candidates, key=lambda row: row["log_probability"] / row["tokens"]) 8 9print("raw winner:", raw_winner["text"]) 10print("length-normalized winner:", normalized_winner["text"]) 11for row in candidates: 12 average = row["log_probability"] / row["tokens"] 13 print(f"{row['text']:24} raw={row['log_probability']:.3f} average={average:.3f}")
Output
1raw winner: eligible 2length-normalized winner: eligible within 30 days 3eligible raw=-0.250 average=-0.250 4eligible within 30 days raw=-0.420 average=-0.105

Search scores optimize a measurable objective. They don't automatically optimize usefulness, groundedness, or customer clarity. Those need evaluations.

Speculative Decoding Is a Serving Optimization

Temperature, truncation, constraints, and beam search change how text is selected. Speculative decoding addresses a different bottleneck: autoregressive generation normally needs a costly target-model step for each accepted token.

Leviathan, Kalman, and Matias describe a draft-and-verify algorithm in which a faster approximation model proposes tokens and the target model checks several proposals in parallel. Their exact sampling procedure preserves the target model's output distribution while reducing latency in suitable workloads.[5]

Keep this boundary clear:

TechniquePrimary purposeCan it change selected-output distribution?
Temperature / top-k / top-pChange diversity and tail riskYes
Beam search / length scoringSearch for high-scoring sequencesYes
Token constraintsEnforce allowable formsYes
Exact speculative decodingProduce target distribution fasterNo

A shortcut that accepts draft tokens without the target algorithm's verification and correction isn't equivalent speculative decoding. It is a different generator and must be evaluated as such.

Evaluate Decoder Settings Before an A/B Test

Suppose a model change and a sampler change land together. If support resolution improves, you won't know which change caused it. Every generation evaluation should keep a decoder receipt beside each output:

  • model and prompt version
  • evidence or retrieval snapshot identifier
  • decoding mode, temperature, top-k, top-p, beam width, and length policy
  • seed when the stack supports one
  • maximum tokens, stop rules, and any output constraints
  • groundedness judgment and task outcome

Here is a tiny offline evaluation table. It treats eligible as the only response supported by the retrieved policy and compares three recorded decoder runs.

decoder-receipt-evaluation.py
1runs = { 2 "greedy T=1": ["eligible", "eligible", "eligible", "eligible"], 3 "sample T=2": ["eligible", "pending", "coupon", "denied"], 4 "top-p p=.9": ["eligible", "pending", "eligible", "eligible"], 5} 6supported = {"eligible"} 7 8for receipt, outputs in runs.items(): 9 grounded = sum(output in supported for output in outputs) 10 rate = grounded / len(outputs) 11 print(f"{receipt:12} grounded={grounded}/{len(outputs)} rate={rate:.2f}")
Output
1greedy T=1 grounded=4/4 rate=1.00 2sample T=2 grounded=1/4 rate=0.25 3top-p p=.9 grounded=3/4 rate=0.75

This isn't proof that greedy is always best. It proves the procedure: compare decoders on held-out evidence and preserve settings so the next experiment measures the product change you meant to test.

A Decoder Review Checklist

Before shipping a generated support answer, separate these questions:

Review questionFailure symptomEvidence to collect
Are logits converted stably?NaN, overflow, or invalid probabilities.Large-logit unit test.
Is diversity deliberate?Unsupported tail tokens appear unexpectedly.Temperature/top-p sweep with groundedness labels.
Are formats enforced?Invalid status or JSON value reaches application code.Constraint and schema tests.
Is sequence scoring appropriate?Replies are too short or oddly generic.Beam/length policy report.
Is evaluation reproducible?Two experiments can't be compared.Decoder receipts and fixed eval cases.
Is latency optimization equivalent?Faster serving silently changes answers.Target-distribution audit for speculative serving.

Practice Tasks

  1. Add logits [1000.0, 999.0, 998.0] to stable-softmax.py. Explain why subtracting the maximum changes numeric safety without changing the probabilities.
  2. Change top-p-adapts-to-the-distribution.py to use threshold=0.70. Predict which tokens remain for the peaked and flat distributions before running it.
  3. Extend mask-invalid-status-tokens.py with a two-token allowed value such as needs review. Sketch the prefix states a production decoder must allow.
  4. Add a second candidate continuation under each retained prefix in two-step-beam-search.py. Prune back to beam width 2 after expansion and confirm the winner.
  5. Write one decoder receipt for a refund-support evaluation run: model version, prompt version, evidence snapshot, selection settings, token limit, stop rules, constraints, seed support, groundedness label, and latency.

Practice guidance

  1. Stable probabilities stay [0.665, 0.245, 0.090]. Shifting logits changes their absolute values, not their differences.
  2. With p=0.70, the peaked distribution needs eligible and pending; the flat distribution needs eligible, pending, and denied.
  3. A static final-value set isn't enough. After generating needs, the decoder must still allow the continuation that completes needs review while rejecting prefixes that can never reach a valid value.
  4. Expand every retained prefix, add log probabilities, sort all new prefixes, and keep only the best two. That's the repeated prune step omitted from the tiny walkthrough.
  5. Keep the decoder receipt beside output and judgment so model, retrieval, decoder, and serving changes can be separated before online testing.

Key Concepts

  • logits and numerically stable softmax
  • greedy autoregressive decoding
  • temperature and entropy
  • top-k and nucleus (top-p) sampling
  • seeded local sampling and decoder receipts
  • constrained decoding versus groundedness
  • beam search in log-probability space
  • length-aware sequence scoring
  • speculative decoding as latency optimization

Evaluation rubric

  • Foundational: Converts refund-response logits into stable probabilities and explains why eligible is grounded while coupon isn't.
  • Intermediate: Implements temperature, top-k, top-p, constraints, and beam search and predicts each lab's result before running it.
  • Advanced: Designs an evaluation receipt that distinguishes model, retrieval, decoding, and serving changes before an online experiment.

Follow-up questions

Common pitfalls

SymptomCauseFix
Large logits lead to invalid probabilities.Softmax exponentiated raw values without shifting.Subtract the maximum logit before exp.
Output becomes more varied but also less supported.Temperature flattened probability mass into the tail.Evaluate groundedness, then use deliberate truncation or constraints.
Valid JSON contains a policy-wrong status.Schema enforcement was mistaken for factual validation.Test generated values against retrieved evidence.
Beam output is extremely short.Raw sequence scores favor fewer multiplied probabilities.Declare and evaluate a length-scoring policy.
A model experiment can't be reproduced.Decoder settings or retrieval snapshot wasn't logged.Store a decoder receipt with every evaluated output.
Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1.A decoder must convert next-token logits [1000.0, 999.0, 998.0] into a softmax distribution without overflowing exp. Softmax probabilities are unchanged if the same constant is subtracted from every logit. Which implementation is correct?
2.A generation loop has answer prefix Your refund request is. At step 1 the logits are eligible=3.0, pending=2.5, denied=1.0, and coupon=-1.0. At step 2, after appending the chosen token, the logits are .=4.0, today=1.5, and after=0.5. Greedy decoding chooses the highest-logit token at each step. What text is produced?
3.A sampler applies p_i(T)=softmax(z_i/T) with positive T to the same finite logits. Compared with T=1, what should T=2 do?
4.A local evaluation sampler calls Python's Random(seed).choices with fixed tokens, weights, and length. The same code is run twice with seed=7, then a later production backend changes hardware and sampling implementation. Which reproducibility claim is justified?
5.Top-k with k=2 keeps the two highest-probability tokens. Top-p with p=0.90 keeps the smallest ranked prefix whose cumulative mass reaches 0.90. For peaked probabilities {eligible: 0.568, pending: 0.345, denied: 0.077, coupon: 0.010} and flat probabilities {eligible: 0.28, pending: 0.25, denied: 0.24, coupon: 0.23}, which retained sets are correct?
6.A refund-status decoder has logits coupon = 6.0, denied = 4.0, eligible = 3.0, and pending = 2.0. The schema allows only {eligible, pending, denied}, while the retrieved policy supports only eligible. If greedy decoding masks only to the schema, what happens?
7.At a fixed two-step horizon, first-token probabilities are eligible=0.40, is=0.35, and denied=0.25. Captured continuations are eligible -> . with 0.20, is -> eligible with 0.70, and denied -> . with 0.30. Greedy commits to the highest first token. A width-2 beam keeps the two highest first prefixes, expands them, and compares added log probabilities. Which result is correct?
8.A beam-search report compares two candidates: eligible has log probability -0.25 over 1 token, and eligible within 30 days has log probability -0.42 over 4 tokens. Raw scoring chooses the larger log probability; average-log scoring uses log probability divided by token count. Which winners are correct?
9.A serving team adds a small draft model that proposes several tokens, and the target model verifies and corrects those proposals using exact speculative decoding. Compared with changing temperature or top-p, what should the team expect?
10.An offline refund-support evaluation is run before an A/B test. Reviewers want to separate model, retrieval, decoding, and serving effects for each generated answer. Which receipt should be stored per output?

10 questions remaining.

Next Step
Continue to Experiment Design and A/B Testing

You can now separate model scores, decoding policy, groundedness, and serving behavior. Next you will design a controlled experiment that can tell whether a retrieval or generation change actually improves the support product.

PreviousCore Retrieval Algorithms
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

Efficient Memory Management for Large Language Model Serving with PagedAttention.

Kwon, W., et al. · 2023 · SOSP 2023

Hierarchical Neural Story Generation.

Fan, A., Lewis, M., & Dauphin, Y. · 2018 · ACL 2018

The Curious Case of Neural Text Degeneration.

Holtzman, A., et al. · 2020 · ICLR 2020

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.

Wu, Y., et al. · 2016

Fast Inference from Transformers via Speculative Decoding.

Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023