Decoding Strategies: Greedy to Nucleus

Compare decoding strategies for text generation: greedy, beam search, top-k, nucleus (top-p), temperature, repetition controls, and newer variants like min-p.


The previous deep dives followed information through transformer internals until the model produced logits. Decoding strategy controls how that final probability distribution becomes one visible completion. This chapter compares greedy decoding, beam search, temperature sampling, top-k sampling, nucleus sampling, and min-p so you can tune output quality intentionally.

Imagine an order-status assistant that hasn't chosen its next token yet. Internally, it may assign a 70% chance to shipped, a 20% chance to delayed, and a 10% chance to refunded. Language models do this at every generation step: they produce raw scores for many possible next tokens, convert those scores into probabilities, then use a decoding policy to pick what appears on screen.

That final choice has real product consequences. A support bot usually needs stable, factual phrasing. A brainstorming tool needs more variety. A code generator often benefits from determinism, while a story generator may sound lifeless if it picks the highest-probability token every time. Decoding is the control surface that turns the same model distribution into those different behaviors.

For how decoding fits into the broader inference pipeline, and for related techniques such as speculative decoding, see the dedicated articles on those topics. Speculative decoding accelerates generation by asking a draft model to propose tokens that the target model can verify in parallel.[1]

Greedy decoding, beam search, and nucleus sampling compared on order-status completions.
Trace the three rows left to right. Greedy commits to one token at every step. Beam search keeps three candidate paths alive and prunes the worst. Nucleus sampling draws randomly from the high-probability pool, so the path can change across runs.

How LLMs Choose the Next Token

Before we compare strategies, we need to see what happens inside the model at each generation step.

Autoregressive generation

LLMs generate text one token at a time. They predict the next token, append it to the prompt, and repeat. This is called autoregressive generation (auto = self, regressive = predicting the next step from previous ones).

At every step the model outputs a vector of raw scores called logits, one score per token in the vocabulary. A function called softmax converts those logits into probabilities that sum to 1. The decoding strategy then decides which token to emit.
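The loop described above can be sketched without a real model. In this minimal sketch, `TOY_LOGITS` is a hypothetical lookup table standing in for the network, and `choose` is a placeholder for whichever decoding strategy is plugged in; neither name comes from a real library.

```python
from math import exp

# Hypothetical toy "model": maps the last token to logits for the next token.
TOY_LOGITS = {
    "is": {"delayed": 2.0, "in": 1.6, "on": 0.9},
    "delayed": {"<eos>": 3.0, "in": 0.5},
    "in": {"transit": 2.5, "<eos>": 0.1},
    "on": {"hold": 2.0, "<eos>": 0.2},
    "transit": {"<eos>": 3.0},
    "hold": {"<eos>": 3.0},
}

def softmax(logits):
    """Convert a {token: logit} dict into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    weights = {token: exp(value - peak) for token, value in logits.items()}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}

def generate(prompt, choose, max_new_tokens=5):
    """Autoregressive loop: predict, pick, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(TOY_LOGITS[tokens[-1]])
        next_token = choose(probs)  # the decoding strategy plugs in here
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

def pick_most_likely(probs):
    """Simplest possible strategy: take the highest-probability token."""
    return max(probs, key=probs.get)

completion = generate(["package", "is"], pick_most_likely)
print(completion)
```

Swapping `pick_most_likely` for a sampling function changes the completion without touching the loop, which is the separation between model and decoding policy that the rest of this article relies on.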

A concrete probability distribution

Suppose the prompt so far is "The carrier scan shows the package is". Here is a plausible probability distribution for the very next token:

| Token | Logit | Probability |
| --- | --- | --- |
| delayed | 2.00 | 0.45 |
| in | 1.60 | 0.30 |
| on | 0.90 | 0.15 |
| broken | -1.11 | 0.02 |
| ... (many tail tokens) | ... | 0.08 total |

These numbers are fabricated for clarity. Logits are shown only for named candidates; omitted tail tokens jointly carry the final probability mass. One token (delayed) leads the pack, a few others are plausible, and many individually weak tokens share the tail.

Greedy decoding is like picking the best-looking warehouse route right now. It might lead to a hub cutoff you can't meet. Beam search is like keeping several route plans alive in parallel and picking the one with the best total arrival time at the end. The locally optimal choice isn't necessarily the globally best route.

The question we'll answer in this article is: given this distribution, how do we pick the next token? And how does that choice change the whole sentence?


Greedy Decoding and Its Trap

Why the most likely next word isn't the best sentence

The simplest strategy is greedy decoding: at every step, pick the single most probable token and commit to it.

Using the table above, greedy decoding chooses delayed. The sentence becomes:

"The carrier scan shows the package is delayed"

That's a fine sentence. But greedy decoding can fail when the best next word leads to a worse overall sentence. Imagine a slightly different distribution where delayed is most likely, but in followed by transit produces a much more natural and informative completion. Greedy doesn't explore that possibility because it locks in delayed immediately.

This is the greedy trap: the locally best token doesn't imply the globally best sequence. It's like a delivery driver who repeatedly takes the fastest next street and ends up in a traffic jam, while a slightly slower first turn would have opened a clear highway.

Human-written text often doesn't follow the model's single most likely continuation. Holtzman et al. show that maximum-likelihood decoding can drift toward bland, repetitive text, which is why pure argmax-style decoding often doesn't sound human.[2]

greedy-can-miss-higher-scoring-sequence.py

    first_step = {"delayed": 0.55, "in": 0.45}
    continuations = {"delayed": {".": 0.40}, "in": {"transit": 0.90}}

    greedy_first = max(first_step, key=first_step.get)
    path_probabilities = {
        "delayed .": first_step["delayed"] * continuations["delayed"]["."],
        "in transit": first_step["in"] * continuations["in"]["transit"],
    }
    highest_sequence = max(path_probabilities, key=path_probabilities.get)

    print(f"greedy first token: {greedy_first}")
    print(f"sequence probabilities: {path_probabilities}")
    print(f"higher-probability sequence: {highest_sequence}")

    assert greedy_first == "delayed"
    assert highest_sequence == "in transit"

Local versus sequence score

    greedy first token: delayed
    sequence probabilities: {'delayed .': 0.22000000000000003, 'in transit': 0.405}
    higher-probability sequence: in transit

The algorithm

At each step, select the token with the highest probability:

$$w_t = \arg\max_{w} P(w \mid w_{<t})$$

Reading the formula

At each step, look at every possible next token, pick the one with the highest probability, and commit to it. Simple and fast, but myopic, similar to choosing the next warehouse hop without checking whether the full route can still meet the SLA.

Here is a basic implementation of greedy decoding in PyTorch. It assumes a single prompt in the batch, a model whose forward call returns .logits, and an integer EOS token ID. The loop repeatedly takes the argmax token and stops when it reaches EOS or the maximum number of decoding steps:

reading-the-formula.py

    import torch

    def greedy_decode(
        model: torch.nn.Module,
        prompt_ids: torch.Tensor,
        max_length: int,
        eos_token_id: int
    ) -> torch.Tensor:
        """Generate one sequence with greedy decoding (batch size 1)."""
        input_ids = prompt_ids
        with torch.no_grad():
            for _ in range(max_length):
                logits = model(input_ids).logits[:, -1, :]
                next_token = logits.argmax(dim=-1, keepdim=True)
                input_ids = torch.cat([input_ids, next_token], dim=-1)
                if next_token[0, 0].item() == eos_token_id:
                    break
        return input_ids

Strengths

  • Deterministic and fast: $O(V)$ per step (single argmax over vocabulary)
  • Useful baseline for constrained tasks: classification-style outputs or extraction when output variability is undesirable

Limitations

  • Suboptimal globally: The locally best token doesn't imply the best sequence
  • Repetitive in open-ended generation: Maximum-likelihood decoding can fall into loops or generic output in evaluated settings.[2]

Beam Search

Algorithm

Instead of keeping only the best token, maintain B (beam width) candidate sequences and expand each with its highest-scoring next tokens. This simplified PyTorch snippet assumes batch size 1 and a model whose forward call returns .logits, and it keeps the highest-scoring partial sequences at every step. For readability, it leaves out batching and length normalization, which we'll cover in the next subsection:

algorithm.py

    import torch

    def beam_search(
        model: torch.nn.Module,
        prompt_ids: torch.Tensor,
        beam_width: int = 5,
        max_length: int = 100,
        eos_token_id: int = 2
    ) -> torch.Tensor:
        """Generate one sequence with beam search (batch size 1)."""
        # Each beam: (sequence, cumulative_log_prob)
        beams = [(prompt_ids, 0.0)]
        completed = []

        with torch.no_grad():
            for _ in range(max_length):
                all_candidates = []
                for seq, score in beams:
                    # If this beam already ended, record it
                    if seq[0, -1].item() == eos_token_id:
                        completed.append((seq, score))
                        continue

                    # Each beam gets its own forward pass here for simplicity.
                    # In production you would use KV caching or padded batching instead.
                    logits = model(seq).logits[:, -1, :]
                    log_probs = torch.log_softmax(logits, dim=-1)
                    top_k = log_probs.topk(beam_width, dim=-1)

                    for i in range(beam_width):
                        new_seq = torch.cat([seq, top_k.indices[:, i:i+1]], dim=-1)
                        new_score = score + top_k.values[:, i].item()
                        all_candidates.append((new_seq, new_score))

                if not all_candidates:
                    break

                # Keep top beam_width candidates
                beams = sorted(all_candidates, key=lambda x: x[1],
                               reverse=True)[:beam_width]

        return max(completed or beams, key=lambda x: x[1])[0]
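To see the pruning behavior without a model, here is a minimal pure-Python sketch of the same idea. The `TOY_TABLE` lookup and its probabilities are hypothetical values invented for illustration; they stand in for the model's next-token distribution. A beam width of 1 behaves like greedy decoding, while a width of 2 keeps the second path alive long enough to win.

```python
from math import log

# Hypothetical next-token probabilities keyed by the tokens generated so far.
TOY_TABLE = {
    (): {"delayed": 0.55, "in": 0.45},
    ("delayed",): {"again": 0.60, ".": 0.40},
    ("in",): {"transit": 0.90, "limbo": 0.10},
}

def toy_beam_search(table, beam_width, steps):
    """Return the best sequence after `steps` expansions of a toy table."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for sequence, score in beams:
            # Expand every live beam by every known continuation.
            for token, probability in table.get(sequence, {}).items():
                candidates.append((sequence + (token,), score + log(probability)))
        if not candidates:
            break
        # Prune: keep only the beam_width highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda item: item[1], reverse=True)[:beam_width]
    return beams[0][0]

greedy_like = toy_beam_search(TOY_TABLE, beam_width=1, steps=2)
wider = toy_beam_search(TOY_TABLE, beam_width=2, steps=2)
print("beam width 1:", greedy_like)
print("beam width 2:", wider)
```

With width 1 the search commits to delayed at the first step and can never recover the in transit path, even though that path has the higher total probability.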

Length penalty

One common length penalty, used in the Google Neural Machine Translation system, is:[3]

$$\text{score}(Y) = \frac{\log P(Y)}{(5 + |Y|)^\alpha / 6^\alpha}$$

Reading the formula

Without correction, longer sequences receive lower unnormalized probability because each extra token multiplies the sequence probability by a value smaller than 1. The exponent $\alpha$ controls how strongly you compensate for that short-sequence bias. When $\alpha = 0$, there's no correction. Larger values reduce the bias toward short completions.

| $\alpha$ | Effect | Use Case |
| --- | --- | --- |
| 0 | No normalization (favors short) | Short outputs |
| 0.6 | Balanced (common neural machine translation setting) | Translation |
| 1.0 | Stronger bias toward longer outputs | Summarization |
beam-length-normalization.py

    def gnmt_score(log_probability, length, alpha=0.6):
        penalty = ((5 + length) / 6) ** alpha
        return log_probability / penalty

    short = {"text": "late", "log_probability": -1.00, "length": 2}
    complete = {"text": "delayed in transit", "log_probability": -1.08, "length": 5}

    raw_choice = max([short, complete], key=lambda item: item["log_probability"])["text"]
    normalized_choice = max(
        [short, complete],
        key=lambda item: gnmt_score(item["log_probability"], item["length"]),
    )["text"]
    print(f"raw sequence score chooses: {raw_choice}")
    print(f"length-normalized score chooses: {normalized_choice}")

    assert raw_choice == "late"
    assert normalized_choice == "delayed in transit"

Length-normalized ranking

    raw sequence score chooses: late
    length-normalized score chooses: delayed in transit
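The effect of the exponent can be checked by sweeping it over the three values in the table. This sketch reuses the same two illustrative candidates; the helper is redefined with `alpha` as a required argument so the block runs on its own.

```python
def gnmt_score(log_probability, length, alpha):
    """GNMT-style length-normalized score; alpha=0 leaves the score unchanged."""
    return log_probability / (((5 + length) / 6) ** alpha)

# Same illustrative candidates as the example above.
candidates = [
    {"text": "late", "log_probability": -1.00, "length": 2},
    {"text": "delayed in transit", "log_probability": -1.08, "length": 5},
]

choices = {}
for alpha in (0.0, 0.6, 1.0):
    best = max(
        candidates,
        key=lambda item: gnmt_score(item["log_probability"], item["length"], alpha),
    )
    choices[alpha] = best["text"]
    print(f"alpha={alpha}: {best['text']}")
```

At alpha 0 the ranking is the raw log-probability ranking, so the shorter candidate wins; at 0.6 and 1.0 the longer candidate overtakes it.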

When to use beam search

  • Good fit: machine translation. The output usually has one target meaning to preserve.
  • Good fit: structured summarization. The output should stay close to source evidence.
  • Poor fit: open-ended chat or creative writing. High-likelihood paths often become generic.

Counterintuitively, beam search with larger beams can produce more likely but less interesting text.[4] In open-ended generation, increasing beam width can reduce output quality because the most probable sequence is often generic and repetitive.


Temperature Scaling

What temperature does

Before we add randomness, we need a dial that controls how much randomness to allow. Temperature is that dial. It reshapes the probability distribution before we sample from it.

Return to our running example. The raw model gives us these probabilities for the next token after "The carrier scan shows the package is":

| Token | Original Probability |
| --- | --- |
| delayed | 0.45 |
| in | 0.30 |
| on | 0.15 |
| broken | 0.02 |
| tail | 0.08 |

Here is what happens when we apply different temperatures and then run softmax:

| Token | $T = 0.3$ (sharp) | $T = 1.0$ (original) | $T = 1.5$ (flat) |
| --- | --- | --- | --- |
| delayed | ~0.78 | 0.45 | ~0.37 |
| in | ~0.20 | 0.30 | ~0.28 |
| on | ~0.02 | 0.15 | ~0.18 |
| broken | ~0.00 | 0.02 | ~0.05 |
| tail | ~0.00 | 0.08 | ~0.12 |

For this worked table only, the tail is treated as one aggregated category and each original probability is reshaped proportionally to $P_i^{1/T}$ before renormalization. A real decoder divides each individual token logit by $T$ before softmax; it never aggregates the tail first. At $T = 0.3$, delayed is so dominant that sampling usually picks it. At $T = 1.5$, the probabilities spread out, and even broken becomes a possible sample. As $T \to 0$, decoding collapses toward greedy argmax. As $T \to \infty$, the distribution approaches uniform.
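The worked table can be reproduced in a few lines. This sketch applies the $P_i^{1/T}$ shortcut to the five aggregated categories and also checks that it matches dividing logits by $T$ before softmax, using log-probabilities as stand-in logits (an assumption that holds because softmax ignores any constant added to every logit).

```python
from math import exp, log

def reshape_probs(probs, temperature):
    """Raise each probability to 1/T and renormalize (the worked-table shortcut)."""
    weights = {token: p ** (1.0 / temperature) for token, p in probs.items()}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = {token: value / temperature for token, value in logits.items()}
    peak = max(scaled.values())
    weights = {token: exp(value - peak) for token, value in scaled.items()}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}

original = {"delayed": 0.45, "in": 0.30, "on": 0.15, "broken": 0.02, "tail": 0.08}
# Log-probabilities act as logits up to an additive constant.
stand_in_logits = {token: log(p) for token, p in original.items()}

for temperature in (0.3, 1.0, 1.5):
    via_probs = reshape_probs(original, temperature)
    via_logits = softmax_with_temperature(stand_in_logits, temperature)
    # Both routes should agree up to floating-point error.
    assert all(abs(via_probs[t] - via_logits[t]) < 1e-9 for t in original)
    print(temperature, {t: round(p, 2) for t, p in via_probs.items()})
```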

Temperature controls distribution sharpness like a routing strictness dial. Low temperature ($T < 1$) keeps the sampler focused on the top choices, like a fulfillment router that picks the safest carrier. High temperature ($T > 1$) spreads probability across more tokens, like allowing less common recovery actions when the situation is ambiguous.

The formula

Temperature modifies the logit distribution before softmax (the function that converts raw model scores into probabilities):

$$P(w_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Reading the formula

Divide the raw logits by temperature $T$ before applying softmax. When $T < 1$, the division amplifies differences between logits, making the top choice even more dominant. When $T > 1$, it compresses differences, spreading probability more evenly across options. An implementation should special-case temperature=0 as deterministic decoding rather than divide by zero.

The following function applies temperature scaling to raw logits. It divides logits by the temperature before they pass through softmax. A temperature below 1.0 sharpens the distribution; a value above 1.0 flattens it:

reading-the-formula-2.py

    import torch

    def apply_temperature(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
        """Scale logits by temperature. Handle temperature=0 as greedy decoding elsewhere."""
        if temperature <= 0:
            raise ValueError("temperature must be > 0; use greedy decoding for temperature=0")
        return logits / temperature

    scaled = apply_temperature(torch.tensor([2.0, 1.0]), temperature=0.5)
    print("scaled logits:", scaled.tolist())
    try:
        apply_temperature(torch.tensor([2.0, 1.0]), temperature=0.0)
    except ValueError as error:
        print("zero temperature:", error)

    assert scaled.tolist() == [4.0, 2.0]

Temperature guard

    scaled logits: [4.0, 2.0]
    zero temperature: temperature must be > 0; use greedy decoding for temperature=0
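One way a caller might route the zero-temperature case is sketched below in plain Python. The `sample_token` name and the dict-of-logits interface are assumptions made for this illustration, not part of any library; the point is only that temperature 0 takes the argmax branch and never reaches the division.

```python
import random
from math import exp

def sample_token(logits, temperature, rng):
    """Pick a token from {token: logit}; temperature 0 means deterministic argmax."""
    if temperature < 0:
        raise ValueError("temperature must be >= 0")
    tokens = list(logits)
    if temperature == 0:
        # Special case: skip the division entirely and take the top token.
        return max(tokens, key=lambda token: logits[token])
    scaled = [logits[token] / temperature for token in tokens]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [exp(value - peak) for value in scaled]
    # random.Random.choices normalizes the weights internally.
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = {"delayed": 2.0, "in": 1.6, "on": 0.9}
rng = random.Random(0)  # seeded so repeated runs draw the same samples
print("T=0:", sample_token(logits, 0, rng))
print("T=1:", [sample_token(logits, 1.0, rng) for _ in range(5)])
```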

Mathematical intuition

| Temperature $T$ | Effect on Distribution | Entropy |
| --- | --- | --- |
| $T \to 0$ | Approaches one-hot (greedy) | near 0 |
| $T = 1$ | Original model distribution | Baseline |
| $T > 1$ | Flattened (more uniform) | Increases |
| $T \to \infty$ | Approaches uniform random | near $\log V$ |
mathematical-intuition.py

    from math import exp

    def softmax(logits):
        shifted = [logit - max(logits) for logit in logits]
        weights = [exp(logit) for logit in shifted]
        total = sum(weights)
        return [weight / total for weight in weights]

    def apply_temperature(logits, temperature):
        if temperature <= 0:
            raise ValueError("temperature must be > 0; use greedy decoding for temperature=0")
        return [logit / temperature for logit in logits]

    logits = [2.0, 1.0, 0.0]
    low_t = softmax(apply_temperature(logits, 0.5))
    high_t = softmax(apply_temperature(logits, 2.0))

    assert low_t[0] > high_t[0]
    assert high_t[-1] > low_t[-1]
    assert round(sum(high_t), 12) == 1.0
Low, medium, and high temperature distributions across ranked token candidates.
Low temperature concentrates mass in top-ranked candidates. High temperature spreads probability across more survivors.

Why it works

Dividing logits by $T < 1$ amplifies differences between logits, making the distribution peakier. When $T > 1$, the same operation shrinks differences, making probabilities more similar. This controls the sharpness of the sampling distribution.
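The entropy column of the table above can be checked numerically. This is a small sketch on three illustrative logits, so the uniform limit is $\log 3$; the very low and very high temperatures are stand-ins for the two limits.

```python
from math import exp, log

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    weights = [exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.0]
entropies = {
    temperature: entropy(softmax_with_temperature(logits, temperature))
    for temperature in (0.1, 1.0, 100.0)
}
for temperature, value in entropies.items():
    print(f"T={temperature}: entropy={value:.4f} nats")
print(f"uniform limit: {log(len(logits)):.4f} nats")
```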

Starting sweep candidates

| Evaluation slice | Temperature candidates | What to measure |
| --- | --- | --- |
| Code generation | 0.0, 0.2, 0.5 | Tests passed, format validity, diversity |
| Factual QA / inventory lookup | 0.0, 0.2, 0.5 | Grounded accuracy and unsupported claims |
| General chat / customer support | 0.3, 0.7, 1.0 | Helpfulness, repetition, policy adherence |
| Creative writing / descriptions | 0.7, 1.0, 1.3 | Diversity and coherence |


Top-k Sampling

Algorithm

Top-k sampling was popularized in neural story generation as a way to restrict sampling to the top $k$ most probable tokens, then renormalize the distribution.[5] Let's walk through it with our running example.

Suppose the model gives these probabilities after the prompt "The carrier scan shows the package is":

| Token | Probability | Cumulative |
| --- | --- | --- |
| delayed | 0.45 | 0.45 |
| in | 0.30 | 0.75 |
| on | 0.15 | 0.90 |
| broken | 0.02 | 0.92 |
| omitted tail tokens, individually below 0.02 | 0.08 total | 1.00 |

With top-k where $k = 2$, we keep only delayed and in, renormalize their probabilities to sum to 1, and sample from that smaller set. on, broken, and the tail are locked out.
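The renormalization step is simple arithmetic: the two survivors hold 0.75 of the original mass, so each is divided by 0.75. A minimal sketch, using only the named tokens from the table (the aggregated tail is left out because it is not a single token):

```python
def top_k_renormalize(probs, k):
    """Keep the k most probable tokens and rescale them to sum to 1."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    mass = sum(probability for _, probability in kept)
    return {token: probability / mass for token, probability in kept}

distribution = {"delayed": 0.45, "in": 0.30, "on": 0.15, "broken": 0.02}
top_two = top_k_renormalize(distribution, k=2)
print({token: round(probability, 2) for token, probability in top_two.items()})
```

After truncation, delayed is sampled about 60% of the time and in about 40% of the time, so the relative odds between survivors are unchanged.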

This implementation of top-k sampling limits the probability distribution to only the $k$ most likely next tokens. It masks out all other tokens by setting their logits to negative infinity before applying softmax and sampling from the remaining probability mass:

algorithm-2.py

    import torch

    def top_k_sampling(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
        """Sample from the top-k most probable tokens."""
        if k < 1:
            raise ValueError("k must be >= 1")
        k = min(k, logits.shape[-1])
        top_k_values, top_k_indices = logits.topk(k, dim=-1)
        filtered = torch.full_like(logits, float('-inf'))
        filtered.scatter_(dim=-1, index=top_k_indices, src=top_k_values)
        probs = torch.softmax(filtered, dim=-1)
        return torch.multinomial(probs, 1)

The fixed-k problem

The primary limitation of top-k sampling is its rigidity. A useful value for $k$ varies sharply depending on the context of the generation:

| Context | Distribution Shape | Candidate $k$ values to evaluate |
| --- | --- | --- |
| "Your return policy allows" | Potentially peaked | 1, 3, 10 |
| "This product is great for" | Potentially flatter | 10, 30, 50 |
| "The warehouse" | Prompt-dependent | Measure rather than assume |

When the distribution is highly peaked, a large $k$ like 50 forces the sampler to include dozens of irrelevant tail tokens. If the temperature is high, those tail tokens can accumulate enough probability mass to be selected, causing off-topic generation. Conversely, when the distribution is flat, a small $k$ like 10 can cut off valid continuations and make the output too narrow. This inability to adapt to distribution shape motivated dynamic truncation approaches.

the-fixed-k-problem.py

    def top_k_tokens(tokens_and_probs, k):
        return [token for token, _ in sorted(tokens_and_probs, key=lambda item: item[1], reverse=True)[:k]]

    def top_p_tokens(tokens_and_probs, p):
        kept = []
        cumulative = 0.0
        for token, probability in sorted(tokens_and_probs, key=lambda item: item[1], reverse=True):
            kept.append(token)
            cumulative += probability
            if cumulative >= p:
                break
        return kept

    distribution = [("delayed", 0.45), ("in", 0.30), ("on", 0.15), ("broken", 0.02), ("tail", 0.08)]

    assert top_k_tokens(distribution, 2) == ["delayed", "in"]
    assert top_p_tokens(distribution, 0.8) == ["delayed", "in", "on"]

Nucleus (top-p) Sampling

Algorithm

Top-k keeps exactly 50 candidate tokens if $k = 50$, even when only two are plausible or hundreds deserve consideration. Top-p keeps adding candidate tokens until their cumulative probability covers the requested mass. On an easy order-status reply, only a few candidates qualify; on an ambiguous escalation, the candidate set is much longer. This dynamic threshold adapts naturally to each context.

Instead of fixing $k$, dynamically select the smallest high-probability prefix whose cumulative probability reaches the threshold $p$. This approach, also known as nucleus sampling, addresses the fixed candidate-count limitation of top-k.[2]

Using the same probability table:

| Token | Probability | Cumulative |
| --- | --- | --- |
| delayed | 0.45 | 0.45 |
| in | 0.30 | 0.75 |
| on | 0.15 | 0.90 |
| broken | 0.02 | 0.92 |
| omitted tail tokens, individually below 0.02 | 0.08 total | 1.00 |

With top-p where $p = 0.8$, we include tokens until the cumulative probability reaches 0.8. That means delayed, in, and on are in the nucleus. broken and the tail are excluded. If the distribution were more peaked and delayed had 0.85 probability, the nucleus would contain only that one token.

$$V_p = \{w_{(1)}, \ldots, w_{(m)}\}, \qquad m = \min \left\{j : \sum_{i=1}^{j} P(w_{(i)}) \ge p \right\}$$

where $w_{(i)}$ are sorted by decreasing probability.

nucleus-renormalization.py

    def nucleus_distribution(tokens_and_probs, threshold):
        kept = []
        mass = 0.0
        for token, probability in sorted(tokens_and_probs, key=lambda item: item[1], reverse=True):
            kept.append((token, probability))
            mass += probability
            if mass >= threshold:
                break
        return [(token, round(probability / mass, 3)) for token, probability in kept]

    tokens = [("delayed", 0.45), ("in", 0.30), ("on", 0.15), ("broken", 0.02)]
    nucleus = nucleus_distribution(tokens, threshold=0.8)
    print("renormalized nucleus:", nucleus)

    assert nucleus == [("delayed", 0.5), ("in", 0.333), ("on", 0.167)]
    assert round(sum(probability for _, probability in nucleus), 3) == 1.0

Nucleus renormalization

    renormalized nucleus: [('delayed', 0.5), ('in', 0.333), ('on', 0.167)]

Reading the formula

Sort all tokens by probability, then include tokens from the top until their cumulative probability reaches $p$ (e.g., 0.9). On easy predictions where one token has 95% probability, only that token qualifies. On hard predictions where many tokens share probability, a large set is included. The candidate set adapts to the distribution shape.

Here is how nucleus sampling is implemented in PyTorch. The function sorts the logits, computes cumulative probability mass, masks away tokens outside the nucleus, and then samples from the renormalized distribution:

reading-the-formula-3.py

    import torch

    def nucleus_sampling(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
        """Top-p (nucleus) sampling: dynamic vocabulary truncation."""
        if not 0.0 < p <= 1.0:
            raise ValueError("p must be in (0, 1]")

        sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
        sorted_probs = torch.softmax(sorted_logits, dim=-1)
        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

        # Remove tokens with cumulative probability above threshold
        sorted_indices_to_remove = cumulative_probs > p
        # Keep at least one token
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = False

        indices_to_remove = torch.zeros_like(logits, dtype=torch.bool)
        indices_to_remove.scatter_(dim=-1, index=sorted_indices, src=sorted_indices_to_remove)

        filtered_logits = logits.clone()
        filtered_logits[indices_to_remove] = float('-inf')
        probs = torch.softmax(filtered_logits, dim=-1)
        return torch.multinomial(probs, 1)

Why many systems start around $p = 0.9$

| Distribution Type | Illustrative Tokens Included | Behavior |
| --- | --- | --- |
| Peaked (factual context) | 2-3 tokens | Nearly deterministic |
| Moderate (general text) | 20-50 tokens | Balanced diversity |
| Flat (creative context) | 100+ tokens | Highly diverse |

Top-p adapts to the distribution shape: flatter distributions produce larger candidate sets.

Those counts are heuristics, not guarantees. The actual nucleus size depends on the model, tokenizer, prompt, and temperature.

Top-p's weakness: the long tail problem

Top-p has a failure mode to evaluate at high temperatures ($T > 1$): temperature flattening can cause many individually weak tokens to enter the cumulative-mass set. In those settings, a $p = 0.9$ nucleus can expand substantially. This is one motivation for newer confidence-scaled variants such as min-p.

temperature-expands-a-nucleus.py

    from math import exp

    def probabilities(logits, temperature):
        weights = [exp(logit / temperature) for logit in logits]
        total = sum(weights)
        return [weight / total for weight in weights]

    def nucleus_size(probs, threshold):
        mass = 0.0
        for index, probability in enumerate(sorted(probs, reverse=True), start=1):
            mass += probability
            if mass >= threshold:
                return index

    logits = [4.0, 2.0, 1.0, 0.0, -0.5, -1.0]
    focused = nucleus_size(probabilities(logits, 0.7), threshold=0.9)
    flattened = nucleus_size(probabilities(logits, 1.5), threshold=0.9)
    print(f"nucleus size at T=0.7: {focused}")
    print(f"nucleus size at T=1.5: {flattened}")

    assert flattened > focused

Temperature changes nucleus size

    nucleus size at T=0.7: 1
    nucleus size at T=1.5: 3

Beyond nucleus: min-p

The insight

Top-p ranks candidate next tokens and keeps enough of them to cover 90% of total probability mass. If probability is concentrated in a few actions, the shortlist stays small. If probability is spread out, the shortlist gets long. Min-p instead says "only keep candidates whose probability is at least 10% of the top-ranked candidate." If that candidate is dominant, the bar is high. If it is weak, the bar is lower. The threshold adapts to relative peak probability, not a fixed cumulative cutoff.

Min-p, proposed by Nguyen et al. and published as an ICLR 2025 oral paper, scales the cutoff by the top token's probability instead of using a cumulative-mass threshold:[6]

$$V_{\min\text{-}p} = \{w : P(w) \geq \rho \cdot \max_v P(v)\}$$

Reading the formula

Find the most likely token, then keep tokens whose probability is at least a fraction $\rho$ of that top token's probability. If the top token has 80% probability and $\rho = 0.1$, tokens at or above 8% are kept. If the top token has 5%, the bar drops to 0.5%, adapting to distribution shape.

What min-p changes relative to top-p

ScenarioTop-p (p=0.9p = 0.9p=0.9)Min-p (ρ=0.1\rho = 0.1ρ=0.1)
Model confident (Pmax⁡=0.8P_{\max} = 0.8Pmax​=0.8)Keeps enough tokens to reach 90% mass, which can still include a long tailKeeps only tokens with P≥0.08P \geq 0.08P≥0.08
Model uncertain (Pmax⁡=0.05P_{\max} = 0.05Pmax​=0.05)Still targets 90% cumulative massLowers the cutoff to P≥0.005P \geq 0.005P≥0.005, so more alternatives survive
High temperature (T=1.5T = 1.5T=1.5)Flattened tails can enter the nucleusRelative cutoff can trim more of that tail
Figure: Top-k, top-p, and min-p compared on peaked and flat token distributions, showing which candidates survive each truncation rule.
Read each row left to right. Top-k always keeps the same count. Top-p keeps enough tokens to hit cumulative mass. Min-p keeps tokens that stay close to the strongest token.
what-min-p-changes-relative-to-top-p.py
def min_p_tokens(tokens_and_probs, rho):
    max_probability = max(probability for _, probability in tokens_and_probs)
    threshold = rho * max_probability
    return [token for token, probability in tokens_and_probs if probability >= threshold]

confident = [("shipped", 0.80), ("delayed", 0.09), ("purple", 0.01)]
uncertain = [("shipped", 0.05), ("delayed", 0.04), ("in", 0.03)]

assert min_p_tokens(confident, rho=0.1) == ["shipped", "delayed"]
assert min_p_tokens(uncertain, rho=0.5) == ["shipped", "delayed", "in"]

The min-p sampling function dynamically calculates a threshold based on the maximum probability in the distribution. It takes raw logits, a min_p scaling factor, and a temperature as inputs. The function applies temperature, zeroes out token probabilities that fall below the dynamic threshold (calculated as min_p times the maximum probability), and renormalizes before returning a sampled token:

what-min-p-changes-relative-to-top-p-2.py
import torch

def min_p_sampling(logits: torch.Tensor, min_p: float = 0.1, temperature: float = 1.0) -> torch.Tensor:
    """Min-p sampling: confidence-scaled dynamic truncation."""
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be in [0, 1]")
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for temperature=0")

    # Apply temperature
    logits = logits / temperature
    probs = torch.softmax(logits, dim=-1)

    # Dynamic threshold: min_p * max probability
    max_prob = probs.max(dim=-1, keepdim=True).values
    threshold = min_p * max_prob

    # Zero out tokens below threshold
    filtered_probs = probs.clone()
    filtered_probs[probs < threshold] = 0.0

    # Renormalize and sample
    filtered_probs = filtered_probs / filtered_probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(filtered_probs, 1)

When to think about min-p

Min-p is a newer truncation heuristic to evaluate, rather than a universal successor to nucleus sampling. Nguyen et al. propose it for controlling low-probability candidates admitted by top-p at higher temperatures.[6] A subsequent critical reanalysis reports that min-p didn't reliably improve quality-diversity tradeoffs against commonly used samplers in its experiments and disputes broad adoption claims.[7] Meanwhile, DeepSeek-R1 reports temperature 0.6 with top-p 0.95 in one published evaluation configuration.[8] Treat any of these settings as experiment inputs, not presets for a different model or product.

| Method | Threshold | Adapts to distribution shape? | Main tradeoff |
| --- | --- | --- | --- |
| Top-k | Fixed $k$ tokens | No | Simple, but rigid |
| Top-p | Fixed $p$ cumulative mass | Partly | Dynamic, but can admit a long tail |
| Min-p | $\rho \times P_{\max}$ | Yes | Relative cutoff; compare empirically with top-p |
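To make the three truncation rules concrete, here is a minimal pure-Python sketch that applies each one to a peaked and a flat distribution. The helper names (`top_k_keep`, `top_p_keep`, `min_p_keep`) and both toy distributions are illustrative assumptions, not part of any library or of the cited papers.

```python
def top_k_keep(probs, k):
    """Keep the k most likely tokens (fixed count)."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]

def top_p_keep(probs, p):
    """Keep the smallest ranked prefix whose cumulative mass reaches p."""
    kept, mass = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        mass += probs[token]
        if mass >= p:
            break
    return kept

def min_p_keep(probs, rho):
    """Keep tokens whose probability is at least rho times the top probability."""
    threshold = rho * max(probs.values())
    ranked = sorted(probs, key=probs.get, reverse=True)
    return [token for token in ranked if probs[token] >= threshold]

# Hypothetical toy distributions, chosen only for illustration.
peaked = {"shipped": 0.80, "delayed": 0.09, "arrived": 0.05,
          "lost": 0.03, "purple": 0.02, "xq": 0.01}
flat = {"a": 0.20, "b": 0.18, "c": 0.17, "d": 0.16, "e": 0.15, "f": 0.14}

for name, dist in [("peaked", peaked), ("flat", flat)]:
    print(name,
          "top-k:", len(top_k_keep(dist, k=3)),
          "top-p:", len(top_p_keep(dist, p=0.9)),
          "min-p:", len(min_p_keep(dist, rho=0.1)))
```

On these toy inputs top-k keeps three tokens in both cases, top-p grows from three to six, and min-p grows from two to six. Real model distributions have far longer tails, so treat the counts as a shape illustration rather than a benchmark.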

Repetition Penalty

The problem

Autoregressive models can fall into degenerate repetition, where the generation process gets stuck in a loop. For example, a model might repeatedly generate the same phrase:

"The order shipped late. The order shipped late. The order shipped..."

How the penalty works

The repetition penalty originates in CTRL, which applies a penalty greater than 1 to previously generated tokens and uses 1.2 in its reported sampling setup.[9] A sign-aware multiplicative rule divides positive repeated-token logits and multiplies negative repeated-token logits, moving both downward in preference. The function below implements that rule without mutating its input tensor:

how-the-penalty-works.py
import torch

def apply_repetition_penalty(
    logits: torch.Tensor,
    generated_ids: torch.Tensor,
    penalty: float = 1.2
) -> torch.Tensor:
    """Penalize tokens that already appeared in the output (multiplicative)."""
    adjusted = logits.clone()

    for i in range(adjusted.shape[0]):
        unique_tokens = torch.unique(generated_ids[i])

        for token_id in unique_tokens:
            if adjusted[i, token_id] > 0:
                adjusted[i, token_id] /= penalty
            else:
                adjusted[i, token_id] *= penalty

    return adjusted

logits = torch.tensor([[2.4, -0.5, 0.7]])
penalized = apply_repetition_penalty(logits, torch.tensor([[0, 1]]), penalty=1.2)
print("original logits:", logits.tolist()[0])
print("penalized logits:", [round(value, 3) for value in penalized.tolist()[0]])

assert penalized[0, 0].item() < logits[0, 0].item()
assert penalized[0, 1].item() < logits[0, 1].item()
assert penalized[0, 2].item() == logits[0, 2].item()
Sign-aware repetition penalty
original logits: [2.4000000953674316, -0.5, 0.699999988079071]
penalized logits: [2.0, -0.6, 0.7]

Frequency and presence penalties

A common formulation is:

$$\text{logit}_{\text{adjusted}} = \text{logit} - \alpha \cdot \text{count}(\text{token}) - \beta \cdot \mathbb{1}[\text{count}(\text{token}) > 0]$$

Reading the formula

Two knobs discourage repetition. The frequency penalty $\alpha$ grows with each use: if a token appears 5 times, it gets 5 times the penalty. The presence penalty $\beta$ is a flat one-time penalty applied the moment a token is used at all, encouraging topic diversity. Exact formulas and defaults vary by stack, but this captures the core idea.
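As a rough illustration of the additive formula, the sketch below applies both penalties to a plain list of logits. The function name and the values `alpha=0.4` and `beta=0.3` are assumptions picked for the example; they are not defaults from any particular API.

```python
from collections import Counter

def apply_frequency_presence_penalties(logits, generated_ids, alpha=0.4, beta=0.3):
    """Subtract alpha * count + beta * [count > 0] from each used token's logit."""
    counts = Counter(generated_ids)
    adjusted = list(logits)  # copy so the caller's list is not mutated
    for token_id, count in counts.items():
        # Every token in `counts` has count >= 1, so the presence term applies once.
        adjusted[token_id] -= alpha * count + beta
    return adjusted

logits = [2.0, 1.0, 0.5]
# Token 0 was generated three times, token 1 once, token 2 never.
adjusted = apply_frequency_presence_penalties(logits, generated_ids=[0, 0, 0, 1])
print([round(value, 3) for value in adjusted])
```

Token 0 loses 3 × 0.4 + 0.3 = 1.5, token 1 loses 0.4 + 0.3 = 0.7, and the unused token 2 is untouched, which matches the formula term by term.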


Combining Strategies in Production

In practice, production systems stack several logit transformations together. The high-level mental model is stable:

  1. Start with raw logits.
  2. Apply logit processors such as penalties, masks, or forced-token constraints.
  3. Apply sampling warpers such as temperature, top-k, top-p, or min-p.
  4. Sample the next token, or take argmax for deterministic decoding.

The exact order is implementation-specific. Some stacks apply temperature before truncation, while others place temperature later in the sampler chain. In interviews, the important point isn't memorizing one canonical order. It's knowing that these controls are layered transformations of the logits, and you should check the implementation of the stack you're using.
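One way to see why order matters is to run the same logits through two orderings of temperature and top-p. This is a minimal sketch under assumed helper names (`softmax`, `top_p_indices`); it does not reproduce the sampler chain of any specific runtime.

```python
from math import exp

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    weights = [exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [weight / total for weight in weights]

def top_p_indices(logits, p):
    """Indices of the smallest ranked prefix whose cumulative probability reaches p."""
    probs = softmax(logits)
    ranked = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for index in ranked:
        kept.append(index)
        mass += probs[index]
        if mass >= p:
            break
    return sorted(kept)

logits = [4.0, 2.0, 1.0, 0.0, -0.5, -1.0]
temperature, p = 1.5, 0.9

# Order A: scale by temperature first, then truncate.
order_a = top_p_indices([logit / temperature for logit in logits], p)

# Order B: truncate on the unscaled logits. Temperature applied afterwards
# only reshapes the survivors, so the kept set is decided at T=1.
order_b = top_p_indices(logits, p)

print("temperature then top-p keeps:", order_a)
print("top-p then temperature keeps:", order_b)
```

With these logits, order A keeps three candidates and order B keeps two, even though both runs use the same `temperature` and `p` values. That gap is the kind of difference to look for when two stacks disagree.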

This concrete stack makes that layered mental model visible on one next-token step.

Figure: Production sampler stack showing logits moving through penalty, temperature, and truncation before final token selection.
Production decoders layer a few simple transforms: adjust logits, reshape the distribution, cut the tail, then choose argmax or sample.

The following tested mini-pipeline shows one common pattern: apply repetition penalty, special-case greedy decoding when temperature=0, otherwise scale logits and sample from a top-p nucleus. Min-p fits into the same slot as top-p in real systems.

combining-strategies-in-production.py
import torch

def apply_temperature(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for temperature=0")
    return logits / temperature

def nucleus_sampling(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")

    sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    sorted_indices_to_remove = cumulative_probs > p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    indices_to_remove = torch.zeros_like(logits, dtype=torch.bool)
    indices_to_remove.scatter_(dim=-1, index=sorted_indices, src=sorted_indices_to_remove)

    filtered_logits = logits.clone()
    filtered_logits[indices_to_remove] = float("-inf")
    probs = torch.softmax(filtered_logits, dim=-1)
    return torch.multinomial(probs, 1)

def apply_repetition_penalty(
    logits: torch.Tensor,
    generated_ids: torch.Tensor,
    penalty: float = 1.2,
) -> torch.Tensor:
    if penalty <= 0:
        raise ValueError("penalty must be positive")

    adjusted = logits.clone()
    for batch_index in range(adjusted.shape[0]):
        for token_id in torch.unique(generated_ids[batch_index]):
            if adjusted[batch_index, token_id] > 0:
                adjusted[batch_index, token_id] /= penalty
            else:
                adjusted[batch_index, token_id] *= penalty
    return adjusted

def sample_next_token(
    logits: torch.Tensor,
    generated_ids: torch.Tensor,
    temperature: float = 0.8,
    top_p: float = 0.9,
    repetition_penalty: float = 1.1,
) -> torch.Tensor:
    """One common pipeline: penalties -> temperature -> nucleus sampling."""
    logits = apply_repetition_penalty(logits, generated_ids, repetition_penalty)

    if temperature == 0:
        return logits.argmax(dim=-1, keepdim=True)

    logits = apply_temperature(logits, temperature)
    return nucleus_sampling(logits, p=top_p)

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.2, 0.2, -1.0]])
generated_ids = torch.tensor([[0, 1, 1]])

greedy_token = sample_next_token(logits, generated_ids, temperature=0.0)
sampled_token = sample_next_token(logits, generated_ids, temperature=0.8, top_p=0.9)

print(f"greedy token id: {greedy_token.item()}")
print(f"sampled token id: {sampled_token.item()}")
Output
greedy token id: 0
sampled token id: 1

Evaluation configurations

| Evaluation slice | Temperature candidates | Sampler candidates | Repetition-penalty candidates | Goal |
| --- | --- | --- | --- | --- |
| Code generation | 0.0, 0.2, 0.5 | Greedy, top-p=0.95 | 1.0, 1.05 | Tests and format validity |
| Factual QA | 0.0, 0.2, 0.5 | Greedy, top-p=0.9 | 1.0, 1.05 | Grounded accuracy |
| Chat | 0.3, 0.7, 1.0 | Top-p=0.9, min-p=0.1 | 1.0, 1.1 | Helpfulness and repetition |
| Creative writing | 0.7, 1.0, 1.3 | Top-p=0.95, min-p=0.05 | 1.0, 1.1 | Diversity and coherence |
| Example published eval setting (DeepSeek-R1) | 0.6 | Top-p=0.95 | Not reported | One published configuration[8] |

The first four rows are candidate sweeps, not recommended defaults. The right settings depend on the model, tokenizer, task, and measured failure costs.
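A sweep like this can be expanded into concrete run configurations with a small grid helper. The sketch below uses the candidate values from the Chat row; the dictionary layout and the `expand_sweep` name are hypothetical, and the evaluation harness that would consume the configs is out of scope.

```python
from itertools import product

# Candidate values copied from the "Chat" row; the key names are assumptions.
chat_sweep = {
    "temperature": [0.3, 0.7, 1.0],
    "sampler": [("top_p", 0.9), ("min_p", 0.1)],
    "repetition_penalty": [1.0, 1.1],
}

def expand_sweep(sweep):
    """Yield one config dict per combination of candidate values."""
    keys = list(sweep)
    for values in product(*(sweep[key] for key in keys)):
        yield dict(zip(keys, values))

configs = list(expand_sweep(chat_sweep))
print(f"{len(configs)} configurations to evaluate")
print(configs[0])
```

Three temperatures, two samplers, and two penalties give twelve runs per evaluation slice, which is a useful number to know before budgeting evaluation time.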


Practice: Debug This Output

Here is a short debugging exercise you can reason through without running code. Read the symptoms, identify the likely cause, and pick a fix.

Scenario 1: Repetitive boilerplate

Symptom: A customer-service chatbot keeps ending every reply with "Please let me know if you need anything else. Please let me know if you need anything else."

Likely cause: The repetition penalty is set to 1.0 (no penalty) or temperature is so low that the model keeps rediscovering the same high-probability closing phrase.

Experiment: Compare a mild repetition penalty against a slightly wider sampling distribution on held-out replies, tracking task correctness and loop rate.

Scenario 2: Gibberish at high creativity

Symptom: A brainstorming assistant set to temperature=1.5 and top_p=0.95 occasionally outputs tokens that don't form real words, like "flarble" or "xquz."

Likely cause: At high temperature the distribution flattens, and top-p's cumulative threshold pulls in a long tail of very low-probability tokens. Some of those are nonsense.

Experiment: Hold temperature fixed and compare a tighter top-p value against min-p; measure coherence and diversity instead of assuming one filter improves both.
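The tail effect in this scenario can be reproduced on a toy vocabulary. The sketch below assumes three plausible tokens followed by 50 equally weak tail tokens; the logit values and helper names are invented for illustration and say nothing about any real model.

```python
from math import exp

def softmax(logits, temperature):
    """Temperature-scaled softmax; assumes temperature > 0."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]

def top_p_keep(probs, p):
    """Smallest ranked prefix whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for index in ranked:
        kept.append(index)
        mass += probs[index]
        if mass >= p:
            break
    return kept

def min_p_keep(probs, rho):
    """Tokens whose probability is at least rho times the top probability."""
    threshold = rho * max(probs)
    return [index for index, probability in enumerate(probs) if probability >= threshold]

# Hypothetical logits: indices 0-2 are plausible tokens, indices 3+ are the tail.
logits = [6.0, 4.5, 4.0] + [0.0] * 50
probs = softmax(logits, temperature=1.5)

top_p_survivors = top_p_keep(probs, p=0.95)
min_p_survivors = min_p_keep(probs, rho=0.1)

print(f"top-p keeps {len(top_p_survivors)} tokens")
print(f"min-p keeps {len(min_p_survivors)} tokens")
```

In this toy setup top-p keeps 47 candidates, 44 of them from the tail, while min-p keeps only the three plausible tokens. Whether that trimming helps or hurts on a real task is exactly what the experiment above is meant to measure.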

Scenario 3: Factual QA drifts

Symptom: A support bot answering "What is your return policy?" sometimes says "No returns accepted" or "Returns are unlimited."

Likely cause: Temperature is too high for a factual task, or top-p is so wide that unlikely but wrong answers get sampled.

Fix: First improve grounding or output constraints, then compare conservative decoding settings on an accuracy evaluation. Lower temperature can't repair a wrong high-probability answer.


The Complete Comparison

When choosing a decoding strategy for a new application, first determine whether output variance is allowed and which failures matter. Then compare candidate algorithms on task metrics, repetition, format validity, and latency.

| Strategy | Deterministic? | Adapts to context? | Candidate evaluation fit |
| --- | --- | --- | --- |
| Greedy | Yes | No | Extraction, classification |
| Beam search | Yes | No | Translation, summarization |
| Top-k | No | No | Simple truncation baseline |
| Top-p | No | Yes, by cumulative mass | General generation |
| Min-p | No | Yes | Confidence-scaled truncation to compare with top-p |
| Temperature | Modifier | Modifier | Controls global diversity |
| Rep penalty | Modifier | Modifier | Suppresses repeated-token reuse |

Follow-up questions

A code assistant became repetitive after you lowered temperature from 0.8 to 0.2, but correctness stayed good. What sampler change comes first?

Add a mild repetition or frequency penalty before raising temperature again. The immediate problem is looped phrasing, not loss of task accuracy, so try the penalty first and raise temperature only if the penalty isn't enough.

A translation benchmark improves with beam width 5, but chat quality gets worse. Why is that not a contradiction?

Translation has a narrow target meaning, so keeping several high-probability candidates helps recover a better full sequence. Chat has many acceptable answers, so larger beams often over-optimize for generic high-likelihood continuations and drain personality from the response.[4]

High-temperature brainstorming still feels oddly narrow after you set top_k=10. What is wrong with that setup?

Top-k is a fixed cutoff. With a flat creative distribution, k=10 may still exclude many reasonable alternatives. Compare dynamic filters such as top-p or min-p on the target distribution rather than increasing temperature alone.

A new model starts producing nonsense tail tokens at temperature=1.4 and top_p=0.95. Why might min-p help?

At high temperature, many individually weak tokens can enter a top-p nucleus because their cumulative mass adds up. Min-p ties the cutoff to the top token's probability, so if the model still has a clear best candidate, more of the tail gets dropped.

Same sampler knobs, new runtime, different answers. What is your debugging plan?

Log the actual sampler path: penalties, masks, temperature, truncation, and final token choice. Verify operation order, default values, special handling of temperature=0, and any engine-specific processors. Treat the runtime as code to inspect, not a black box that must share another framework's semantics.

What You Should Be Able To Defend

| Decision | Pass bar |
| --- | --- |
| Greedy vs sampling | You can name one task where greedy is the right default and one where it will likely sound degenerate. |
| Beam search | You can explain why beam width can help translation yet hurt open-ended chat, and when length penalty matters. |
| Temperature | You can connect lower or higher temperature to the actual shape change in one worked probability distribution. |
| Top-k vs top-p vs min-p | You can choose which truncation rule better fits a peaked distribution and which better fits a flat one. |
| Penalty knobs | You can explain when repetition, frequency, and presence penalties solve style loops versus when they don't touch factual errors. |
| Production choice | You can defend one sampler stack for factual QA, one for code, and one for creative chat under a real latency or quality goal. |
| Implementation bugs | You can name two runtime mistakes that change outputs even when the visible knobs look the same. |

Key Takeaways

  • Decoding shapes quality: The choice of decoding strategy (deterministic vs. stochastic) changes whether the model output tends toward precision and repetition or diversity and creativity.
  • A useful mental progression: Start with static truncation (top-k), then dynamic mass-based truncation (nucleus/top-p), then newer confidence-scaled variants such as min-p.
  • Likelihood isn't generation quality: Holtzman et al. show that maximization-based decoding can produce repetitive, bland open-ended text even when likelihood is high.[2]
  • Temperature vs. Truncation: Temperature ($T$) modifies the shape of the distribution (sharpness), while top-p/min-p modify the tail (truncation). They are separate controls used together, and they interact: changing temperature changes which tokens survive truncation.
  • Sampler order is framework-specific: The right mental model is layered logit transformations, not one universal ordering rule. Check the implementation of the stack you're using.

Common pitfalls

Beam search made chat sound robotic

Symptom: Answers became safer and more repetitive after you increased beam width.

Likely cause: Beam search is pushing toward the most likely overall continuation, which is often generic in open-ended dialogue.

Fix: Use beam search for translation or tightly grounded summarization. For chat, switch back to a sampling strategy and tune temperature plus truncation instead.

Temperature tweak didn't fix hallucinations

Symptom: You lowered temperature, but the model still states wrong facts confidently.

Likely cause: Temperature changes distribution sharpness, not factual grounding. If the model lacks evidence, it can still choose a wrong but high-probability continuation.

Fix: Keep decoding conservative for factual tasks, but fix retrieval, prompt grounding, or output constraints rather than treating temperature as a truth knob.

High temperature plus top-p pulled in nonsense tail tokens

Symptom: Creative mode started producing weird fragments or off-topic words.

Likely cause: Higher temperature flattened the distribution, so top-p admitted a long tail of individually weak candidates.

Fix: Lower temperature, tighten top-p, or test min-p so the cutoff tracks the strength of the best token.

Fixed top-k clipped valid options

Symptom: Outputs stayed narrow in creative prompts and erratic in factual prompts with the same k.

Likely cause: Top-k doesn't adapt to distribution shape. The same cutoff is too small in flat contexts and too large in peaked ones.

Fix: Treat top-k as a simple baseline. Compare top-p or min-p when candidate-set size should react to distribution shape.

Same knobs changed behavior after a runtime swap

Symptom: Another engine gives different outputs even with matching documented settings.

Likely cause: Sampler order, default values, or special cases such as temperature=0 differ across implementations.

Fix: Inspect the actual sampler path, log intermediate logits if needed, and validate behavior on representative prompts before blaming the model.

temperature=0 was sent through the sampling math directly

Symptom: The runtime crashes, returns NaNs, or behaves inconsistently when someone sets temperature to zero.

Likely cause: The implementation divided logits by zero instead of treating temperature=0 as a greedy special case.

Fix: Handle temperature=0 explicitly as deterministic decoding, and keep the sampling path only for temperatures greater than zero.

Top-p masked tokens but didn't renormalize survivors

Symptom: Sampling looks biased or unstable even though the nucleus cutoff seems correct.

Likely cause: Tokens outside the nucleus were removed, but the remaining probabilities weren't renormalized before sampling.

Fix: After masking the tail, renormalize the surviving probabilities so the sampler draws from a valid distribution.
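A minimal sketch of that fix, assuming probabilities arrive as a plain Python list: keep the nucleus, then divide each survivor by the surviving mass. The function name and the example probabilities are illustrative assumptions.

```python
def top_p_renormalized(probs, p):
    """Return {token index: renormalized probability} for the top-p nucleus."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for index in ranked:
        kept.append(index)
        mass += probs[index]
        if mass >= p:
            break
    # Divide by the surviving mass so the kept probabilities sum to 1 again.
    return {index: probs[index] / mass for index in kept}

probs = [0.5, 0.3, 0.1, 0.06, 0.04]
nucleus = top_p_renormalized(probs, p=0.75)
print({index: round(value, 3) for index, value in nucleus.items()})
```

Here the two survivors hold 0.8 of the original mass, so 0.5 and 0.3 become 0.625 and 0.375. Skipping the division would hand the sampler weights that sum to 0.8; some samplers rescale silently and others do not, which is why the symptom can look inconsistent across runtimes.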


Going Deeper

The papers behind these techniques are cited throughout the article: Holtzman et al. (2020) on neural text degeneration,[2] Meister et al. (2020) on the beam search paradox,[4] Fan et al. (2018) on top-k sampling,[5] Wu et al. (2016) on GNMT beam-search length normalization,[3] and Nguyen et al. (2025) on min-p as a newer confidence-scaled variant,[6] along with a 2025 critical reanalysis that questions min-p's reported gains.[7]

Related Articles

  • Inference Mechanics: TTFT, TPS, and KV Cache - How decoding fits into the broader inference pipeline.
  • Speculative Decoding - Accelerating generation by drafting tokens with a smaller model.
  • Perplexity and Language Model Evaluation - Understanding why human text has high perplexity.
Next Step
Continue to Scaling Laws & Compute-Optimal Training

You can now trace how transformer internals produce logits and how decoding turns them into output tokens. Next, shift from running a trained model to deciding how to train one: scaling laws explain how parameters, data, and compute should be balanced before an expensive training run.

References

[1] Fast Inference from Transformers via Speculative Decoding. Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023

[2] The Curious Case of Neural Text Degeneration. Holtzman, A., et al. · 2020 · ICLR 2020

[3] Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Wu, Y., et al. · 2016

[4] If Beam Search is the Answer, What was the Question? Meister, C., Cotterell, R., & Vieira, T. · 2020 · EMNLP 2020

[5] Hierarchical Neural Story Generation. Fan, A., Lewis, M., & Dauphin, Y. · 2018 · ACL 2018

[6] Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs. Nguyen, et al. · 2025 · ICLR 2025 Oral

[7] Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models. Schaeffer, R., Kazdan, J., & Denisov-Blanch, Y. · 2025

[8] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek Team · 2025

[9] CTRL: A Conditional Transformer Language Model for Controllable Generation. Keskar, N. S., et al. · 2019