LeetLLM

Your go-to resource for mastering AI & LLM systems.

© 2026 LeetLLM. All rights reserved.

πŸ“HardNLP Fundamentals

Decoding Strategies: Greedy to Nucleus

Compare decoding strategies for text generation: greedy, beam search, top-k, nucleus (top-p), and min-p sampling, with temperature scaling and repetition penalty.

45 min read Β· Google, OpenAI, Meta +2 Β· 10 key concepts

When a language model generates text, it doesn't produce words directly. It produces a ranked list of probabilities for what the next word should be. "The" might have a 30% chance, "A" might have 15%, "In" might have 10%, and so on across the entire vocabulary. Decoding is the strategy for choosing which word to actually pick from that list, and it dramatically affects whether the output is creative, repetitive, coherent, or nonsensical. (For how decoding fits into the broader inference pipeline and techniques like speculative decoding that accelerate it, see those deep-dives.)

🎯 Core concept: Decoding is where the model meets the real world. Understanding the evolution from greedy β†’ beam β†’ sampling strategies, how temperature reshapes probabilities, and the differences between top-k, top-p, and min-p is essential for tuning LLM outputs in production.

Diagram: Three decoding strategies compared left to right: Greedy (always picks the top token), Beam Search (tracks k=3 candidates per step), and Nucleus Sampling (samples from the top-p=0.9 tokens). A gradient bar shows the tradeoff from deterministic to creative output.

Greedy decoding

Algorithm

πŸ—ΊοΈ Analogy, GPS navigation: Greedy decoding is like a GPS that always takes the best-looking turn right now. It might lead you down a beautiful street that dead-ends. Beam search is like having 5 GPS devices running in parallel, each exploring a different route, and you pick the one with the best total travel time at the end. The locally optimal choice isn't always globally optimal.

At each step, select the token with the highest probability:

w_t = \arg\max_{w} P(w \mid w_{<t})

Reading the formula: at each step, look at every possible next word, pick the one with the highest probability, and commit to it. Simple and fast, but myopic, like always turning onto the prettiest street without checking if it leads to your destination.

Here is a basic implementation of greedy decoding in PyTorch. It takes a model and prompt, then iteratively selects the highest probability token using argmax until the max length or end-of-sequence token is reached:

python
import torch

def greedy_decode(
    model: torch.nn.Module,
    prompt_ids: torch.Tensor,
    max_length: int,
    eos_token_id: int
) -> torch.Tensor:
    input_ids = prompt_ids
    for _ in range(max_length):
        logits = model(input_ids).logits[:, -1, :]
        next_token = logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == eos_token_id:
            break
    return input_ids

Strengths

  • β€’Deterministic and fast: O(V)O(V)O(V) per step (single argmax over vocabulary)
  • β€’Suitable for structured tasks: classification, extraction, factual Q&A with clear answers

Limitations

  • β€’Suboptimal globally: The locally best token doesn't guarantee the best sequence
  • β€’Repetitive: Tends to get stuck in loops ("I think that I think that I think...")
  • β€’Degenerate text: Produces generic, predictable output

πŸ’‘ Key insight from Holtzman et al. (2020):[1] Human-written text does NOT follow the most probable trajectory. The tokens humans choose often rank 10th–100th in the model's distribution. This means optimizing for maximum likelihood (greedy/beam) produces fundamentally un-human text.


Beam search

Algorithm

Instead of keeping only the best token, maintain B (beam width) candidate sequences, expanding each by its top B continuations. This PyTorch snippet demonstrates beam search by tracking the top B candidate sequences at each step. It calculates the log probabilities for all possible next tokens across all current beams, and retains only the B sequences with the highest cumulative scores:

python
def beam_search(
    model: torch.nn.Module,
    prompt_ids: torch.Tensor,
    beam_width: int = 5,
    max_length: int = 100,
    eos_token_id: int = 2
) -> torch.Tensor:
    # Each beam: (sequence, cumulative_log_prob); sequence shape (1, seq_len)
    beams = [(prompt_ids, 0.0)]
    completed = []

    for step in range(max_length):
        all_candidates = []
        for seq, score in beams:
            if seq[0, -1].item() == eos_token_id:
                completed.append((seq, score))
                continue

            logits = model(seq).logits[:, -1, :]
            log_probs = torch.log_softmax(logits, dim=-1)
            top_k = log_probs.topk(beam_width)

            for i in range(beam_width):
                new_seq = torch.cat([seq, top_k.indices[:, i:i+1]], dim=-1)
                new_score = score + top_k.values[:, i].item()
                all_candidates.append((new_seq, new_score))

        # Stop early once every beam has finished
        if not all_candidates:
            break

        # Keep top beam_width candidates
        beams = sorted(all_candidates, key=lambda x: x[1],
                       reverse=True)[:beam_width]

    return max(completed or beams, key=lambda x: x[1])[0]

Length Penalty

Raw log-probability scoring biases toward shorter sequences (each additional token multiplies probabilities < 1). The length penalty normalizes:

\text{score}(Y) = \frac{\log P(Y)}{|Y|^{\alpha}}

Reading the formula: without correction, longer sequences always score lower (each extra word multiplies by a probability < 1). Dividing by the length raised to Ξ± compensates: Ξ± = 0 means no correction (short wins), Ξ± = 1 fully normalizes for length (each word judged equally).

Ξ± | Effect | Use Case
0 | No normalization (favors short) | Short outputs
0.6 | Balanced (common default) | Translation
1.0 | Full normalization (favors long) | Summarization
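To make the normalization concrete, here is a minimal sketch. The function name and the example log-probabilities are illustrative, not from a real model:

```python
def length_normalized_score(log_prob_sum: float, length: int, alpha: float = 0.6) -> float:
    """Length-penalized beam score: log P(Y) / |Y|^alpha."""
    return log_prob_sum / (length ** alpha)

# Two hypothetical beams: a short hypothesis and a longer one.
short_raw = length_normalized_score(-4.0, length=5, alpha=0.0)   # no correction
long_raw = length_normalized_score(-6.0, length=12, alpha=0.0)

short_norm = length_normalized_score(-4.0, length=5, alpha=0.6)
long_norm = length_normalized_score(-6.0, length=12, alpha=0.6)

# With alpha = 0 the short beam always wins (-4.0 > -6.0); with alpha = 0.6
# the longer beam's better per-token quality lets it overtake the short one.
```

Tuning Ξ± per task (around 0.6 for translation, closer to 1.0 for summarization) follows the guidance in the table above.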

When to Use Beam Search

  • β€’βœ… Machine translation: One "correct" answer to find
  • β€’βœ… Summarization: Structured output with clear references
  • β€’βŒ NOT for: Open-ended generation, creative writing, chat

⚠️ Counterintuitive: Beam search with larger beams produces more likely but less interesting text.[2] In open-ended generation, increasing beam width actually decreases output quality because the most probable sequence is generic and repetitive.


Temperature scaling

Mechanism

🚿 Analogy, shower temperature dial: Temperature controls the model's "sharpness" the same way a shower dial controls water temperature. Low temperature (T < 1) makes the model laser-focused on its top choices, like ice-cold water: bracing and precise. High temperature (T > 1) spreads probability across many tokens, like turning the dial to hot: more relaxed and exploratory. At T = 0, you get exactly one answer (greedy). At T = ∞, it's completely random.

Temperature modifies the logit distribution before softmax:

P(w_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}

Reading the formula: divide the raw logits by temperature T before applying softmax. When T < 1, the division amplifies differences between logits, making the top choice even more dominant. When T > 1, it compresses differences, spreading probability more evenly across options. At T β†’ 0, it becomes greedy; at T β†’ ∞, it becomes random.

The following function applies temperature scaling by dividing the raw logits by the temperature parameter before they are passed to the softmax function. A temperature less than 1.0 makes the distribution sharper, while greater than 1.0 flattens it:

python
def apply_temperature(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Scale logits by temperature before softmax."""
    return logits / temperature

Mathematical Intuition

Temperature T | Effect on Distribution | Entropy
T β†’ 0 | Approaches one-hot (greedy) | β†’ 0
T = 1 | Original model distribution | Baseline
T > 1 | Flattened (more uniform) | Increases
T β†’ ∞ | Approaches uniform random | β†’ log V
Diagram: Token probability distribution showing temperature scaling: low temperature sharpens the distribution toward the top token, high temperature flattens it for more diverse sampling.

Why it works: Dividing logits by T < 1 amplifies differences between logits, making the distribution peakier. Dividing by T > 1 shrinks differences, making probabilities more similar. This controls the sharpness of the model's confidence.
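The direction of these entropy changes is easy to verify numerically. This self-contained sketch uses toy logits chosen purely for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, scaled by temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 2.0, 1.0, 0.5]   # toy next-token logits
low = entropy(softmax(logits, temperature=0.5))   # sharper distribution
base = entropy(softmax(logits, temperature=1.0))
high = entropy(softmax(logits, temperature=2.0))  # flatter distribution
# low < base < high: entropy rises monotonically with temperature here
```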

Practical Guidance

Use Case | Temperature | Rationale
Code generation | 0.0–0.3 | Correctness > creativity
Factual QA | 0.1–0.3 | Precision matters
General chat | 0.7–0.9 | Natural variety
Creative writing | 1.0–1.5 | Encourage diversity
Brainstorming | 1.2–2.0 | Maximum creativity

Top-k sampling

Algorithm[3]

Restrict sampling to the top k most probable tokens, then renormalize. This implementation of top-k sampling limits the probability distribution to only the k most likely next tokens. It masks out all other tokens by setting their logits to negative infinity before applying softmax and sampling from the remaining probability mass:

python
def top_k_sampling(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Sample from the top-k most probable tokens."""
    top_k_values, top_k_indices = logits.topk(k)
    filtered = torch.full_like(logits, float('-inf'))
    filtered.scatter_(1, top_k_indices, top_k_values)
    probs = torch.softmax(filtered, dim=-1)
    return torch.multinomial(probs, 1)

The Fixed-k Problem

The optimal k varies dramatically by context:

Context | Distribution Shape | Ideal k
"The capital of France is" | Very peaked | ~3
"I enjoy eating" | Moderately flat | ~50
"The" | Very flat | ~500

Fixed k either cuts valid tokens (too small) or includes garbage tokens (too large). This motivated dynamic approaches.
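The mismatch can be reproduced with synthetic distributions. This sketch (toy probabilities, not real model outputs) counts how many top tokens are needed to cover 90% of the probability mass, which is the number a fixed k would have to match in each context:

```python
def tokens_to_cover(probs, mass=0.9):
    """How many of the top tokens are needed to cover the given probability mass."""
    cumulative, count = 0.0, 0
    for prob in sorted(probs, reverse=True):
        cumulative += prob
        count += 1
        if cumulative >= mass:
            break
    return count

# Peaked context ("The capital of France is"): one dominant continuation.
peaked = [0.85, 0.10, 0.03, 0.02]
# Flat context ("The"): mass spread evenly across 500 plausible continuations.
flat = [1.0 / 500] * 500

tokens_to_cover(peaked)  # 2: a tiny k would suffice here
tokens_to_cover(flat)    # roughly 450: a small fixed k would cut valid tokens
```

No single k fits both: k = 50 includes garbage in the peaked case and truncates valid options in the flat case.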


Nucleus (top-p) sampling

Algorithm[1]

πŸŽ‰ Analogy, nightclub VIP list: Top-k is like a bouncer who always lets in exactly 50 people, even if the club is empty (too many) or packed (not enough). Top-p is like a bouncer who keeps admitting people until 90% of the night's expected crowd is inside. On a slow night, only a few VIPs get in; on a busy night, the guest list is much longer. This dynamic threshold adapts naturally to each context.

Instead of fixing k, dynamically select the smallest set of tokens whose cumulative probability exceeds threshold p:

V_p = \min\{V' \subseteq V : \sum_{w \in V'} P(w) \geq p\}

Reading the formula: sort all tokens by probability, then include tokens from the top until their cumulative probability reaches p (e.g., 0.9). On easy predictions where one token has 95% probability, only that token qualifies. On hard predictions where many tokens share probability, a large set is included. The vocabulary size adapts to the model's confidence.

Here is how nucleus sampling is implemented in PyTorch. It sorts the logits, calculates cumulative probabilities, and masks out any tokens beyond the cumulative threshold p before renormalizing and sampling:

python
def nucleus_sampling(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Top-p (nucleus) sampling: dynamic vocabulary truncation."""
    sorted_logits, sorted_indices = logits.sort(descending=True)
    cumulative_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)

    # Remove tokens with cumulative probability above threshold
    sorted_indices_to_remove = cumulative_probs > p
    # Shift the mask right so at least one token is always kept
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    indices_to_remove = sorted_indices_to_remove.scatter(
        1, sorted_indices, sorted_indices_to_remove
    )
    logits[indices_to_remove] = float('-inf')
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1)

Why p = 0.9 Works Well

Distribution Type | Tokens Included | Behavior
Peaked (factual context) | 2–3 tokens | Nearly deterministic
Moderate (general text) | 20–50 tokens | Balanced diversity
Flat (creative context) | 100+ tokens | Highly diverse

Top-p adapts automatically to the model's confidence: the more uncertain the model, the more tokens it considers.

Top-p's Weakness: The Long Tail Problem

Top-p has a subtle flaw at high temperatures (T > 1): temperature flattening causes many low-probability tokens to collectively exceed the threshold. At T = 1.5 with p = 0.9, you might include hundreds of tokens, many with negligible individual probability, leading to incoherent outputs. This is exactly the problem min-p solves.
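This failure mode is easy to demonstrate with a synthetic distribution: a few high-logit tokens plus a long tail of 1000 low-logit ones. The logit values below are illustrative, not from a real model:

```python
import math

def nucleus_size_at_temperature(logits, p=0.9, temperature=1.0):
    """Tokens needed to cover p cumulative mass after temperature scaling."""
    scaled = sorted((z / temperature for z in logits), reverse=True)
    m = scaled[0]
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    cumulative, count = 0.0, 0
    for e in exps:
        cumulative += e / total
        count += 1
        if cumulative >= p:
            break
    return count

# A peaked head plus a long tail of 1000 low-probability tokens.
logits = [10.0, 6.0, 5.0] + [0.0] * 1000

nucleus_size_at_temperature(logits, temperature=1.0)  # just the head qualifies
nucleus_size_at_temperature(logits, temperature=2.0)  # hundreds of tail tokens flood in
```

At T = 1 the top token alone covers over 90% of the mass; at T = 2 the flattened distribution pulls most of the tail into the nucleus.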


Min-p sampling: the modern standard

The Insight[4]

🎯 Analogy, hiring bar: Top-p is like hiring applicants in merit order until you've covered 90% of the pool's total qualifications. At a weak company (an uncertain model), that sweep still pulls in some terrible candidates. Min-p is like saying "only hire people at least 10% as qualified as the best applicant." If your top candidate is stellar, the bar is high. If your top candidate is mediocre, the bar is low. The threshold adapts to the quality of what's available, not to an arbitrary cutoff.

Min-p (ICLR 2025 Oral) uses the top token's probability as a scaling factor for the threshold:

V_{\text{min-p}} = \{w : P(w) \geq \rho \cdot \max_v P(v)\}

Reading the formula: find the most likely token, then keep only tokens whose probability is at least a fraction ρ of that top token's probability. If the best token has 80% probability and ρ = 0.1, only tokens above 8% are kept. If the best has only 5%, the bar drops to 0.5%, automatically adapting to context difficulty.

Why Min-p Beats Top-p

Scenario | Top-p (p = 0.9) | Min-p (ρ = 0.1)
Model confident (P_max = 0.8) | Includes many low-prob tokens in the 0.9 mass | Only tokens with P β‰₯ 0.08 βœ…
Model uncertain (P_max = 0.05) | Might cut off valid alternatives | Includes tokens with P β‰₯ 0.005 βœ…
High temperature (T = 1.5) | Long tail of garbage tokens | Threshold scales with confidence βœ…

The min-p sampling function dynamically calculates a threshold based on the maximum probability in the distribution multiplied by min_p. It then zeroes out any token probabilities that fall below this threshold and renormalizes the remaining distribution:

python
def min_p_sampling(logits: torch.Tensor, min_p: float = 0.1, temperature: float = 1.0) -> torch.Tensor:
    """Min-p sampling: confidence-scaled dynamic truncation."""
    # Apply temperature
    logits = logits / temperature
    probs = torch.softmax(logits, dim=-1)

    # Dynamic threshold: min_p Γ— max probability
    max_prob = probs.max(dim=-1, keepdim=True).values
    threshold = min_p * max_prob

    # Zero out tokens below threshold
    filtered_probs = probs.clone()
    filtered_probs[probs < threshold] = 0.0

    # Renormalize and sample
    filtered_probs = filtered_probs / filtered_probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(filtered_probs, 1)

Min-p Adoption

Min-p has been integrated into Hugging Face Transformers, vLLM, llama.cpp, and Ollama. It's the recommended sampling method for DeepSeek-R1 (ρ = 0.05) and widely adopted for creative text generation where top-p struggles at high temperatures.

Method | Threshold | Adapts to Confidence? | High-T Behavior
Top-k | Fixed k tokens | ❌ No | Includes garbage
Top-p | Fixed p cumulative mass | ⚠️ Partially | Long tail problem
Min-p | ρ Γ— P_max | βœ… Yes | Clean truncation

Repetition penalty

The Problem

Autoregressive models are prone to degenerate repetition:

text
"The cat sat on the mat. The cat sat on the mat. The cat sat on..."

Repetition Penalty[5]

Divide the logits of previously generated tokens by a penalty factor. This function applies a multiplicative repetition penalty to the logits of tokens that have already been generated. If a token is in the generated_ids, its logit is divided by the penalty factor (for positive logits) or multiplied (for negative logits) to reduce its likelihood of being selected again:

python
def apply_repetition_penalty(
    logits: torch.Tensor,
    generated_ids: torch.Tensor,
    penalty: float = 1.2
) -> torch.Tensor:
    """Penalize tokens that already appeared in the output (multiplicative)."""
    # Assumes logits shape: (batch_size, vocab_size)
    # Assumes generated_ids shape: (batch_size, seq_len)
    batch_size = logits.shape[0]

    for i in range(batch_size):
        # Get unique tokens generated for this sequence
        unique_tokens = torch.unique(generated_ids[i])

        for token_id in unique_tokens:
            if logits[i, token_id] > 0:
                logits[i, token_id] /= penalty
            else:
                logits[i, token_id] *= penalty

    return logits

Frequency and Presence Penalty (OpenAI Style)

OpenAI's API exposes two fine-grained controls:

\text{logit}_{\text{adjusted}} = \text{logit} - \alpha \cdot \text{count}(\text{token}) - \beta \cdot \mathbb{1}[\text{count}(\text{token}) > 0]

Reading the formula: two knobs to discourage repetition. The frequency penalty Ξ± grows with each use (say a word 5 times, it gets 5Γ— the penalty, discouraging overuse). The presence penalty Ξ² is a flat one-time penalty the moment a token is used at all (encouraging topic diversity). Together they let you tune how much repetition is acceptable.
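A plain-Python sketch of that formula. The function name and penalty values are illustrative (OpenAI applies this server-side, so you would normally just set the two API parameters):

```python
from collections import Counter

def apply_frequency_presence_penalty(
    logits: list[float],
    generated_ids: list[int],
    frequency_penalty: float = 0.5,   # alpha: per-occurrence penalty
    presence_penalty: float = 0.3,    # beta: flat one-time penalty
) -> list[float]:
    """Additive penalties: logit - alpha*count(token) - beta*[count > 0]."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= frequency_penalty * count + presence_penalty
    return adjusted

logits = [2.0, 1.0, 0.5]
# Token 0 was generated three times, token 1 once, token 2 never.
adjusted = apply_frequency_presence_penalty(logits, [0, 0, 1, 0])
# Token 0 pays 3x the frequency penalty plus presence; token 2 is untouched.
```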


Combining strategies in production

In practice, multiple strategies are combined in a specific order. The standard pipeline applies modifications that reshape the distribution before applying truncation.


The following pipeline function demonstrates the standard production order of operations for token generation. It first applies the repetition penalty, then scales the logits by temperature, and finally performs truncation sampling (using min-p in this example) to select the final token:

python
def generate_token(
    logits: torch.Tensor,
    generated_ids: torch.Tensor,
    temperature: float = 0.8,
    min_p: float = 0.1,
    repetition_penalty: float = 1.1,
) -> torch.Tensor:
    """Production-grade token generation pipeline."""
    # Step 1: Apply repetition penalty (before temperature)
    logits = apply_repetition_penalty(logits, generated_ids,
                                      repetition_penalty)

    # Step 2: Apply temperature
    logits = logits / temperature

    # Step 3: Apply min-p sampling
    # Note: We pass temperature=1.0 because we already applied it in Step 2.
    # Applying it again would double-scale (equivalent to T^2).
    next_token = min_p_sampling(logits, min_p=min_p, temperature=1.0)

    return next_token

Production Configurations

Use Case | Temperature | Sampler | Rep Penalty | Notes
Code generation | 0.0–0.2 | Greedy or top-p = 0.95 | 1.0 | Correctness is paramount
Factual QA | 0.1–0.3 | Top-p = 0.9 | 1.0 | Precision over diversity
Chat | 0.7–0.9 | Min-p = 0.1 | 1.1 | Natural, non-repetitive
Creative writing | 1.0–1.5 | Min-p = 0.05 | 1.2 | High diversity, coherent
Brainstorming | 1.2–2.0 | Min-p = 0.02 | 1.3 | Maximum exploration
Reasoning (DeepSeek-R1) | 0.6 | Min-p = 0.05 | 1.0 | Chain-of-thought quality
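In application code, these configurations are often captured as named presets. The dictionary below is an illustrative sketch, not any particular library's API; the keys mirror common sampling-parameter names:

```python
# Hypothetical preset table mirroring the configurations above.
SAMPLING_PRESETS = {
    "code": {"temperature": 0.2, "top_p": 0.95, "repetition_penalty": 1.0},
    "factual_qa": {"temperature": 0.2, "top_p": 0.9, "repetition_penalty": 1.0},
    "chat": {"temperature": 0.8, "min_p": 0.1, "repetition_penalty": 1.1},
    "creative": {"temperature": 1.2, "min_p": 0.05, "repetition_penalty": 1.2},
}

def get_sampling_params(use_case: str) -> dict:
    """Look up a preset, falling back to the chat defaults."""
    return SAMPLING_PRESETS.get(use_case, SAMPLING_PRESETS["chat"])
```

Centralizing presets like this keeps sampling behavior auditable and makes A/B testing a one-line change.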

The complete comparison

Strategy | Deterministic? | Adapts to Context? | Best For
Greedy | βœ… Yes | ❌ No | Extraction, classification
Beam search | βœ… Yes | ❌ No | Translation, summarization
Top-k | ❌ No | ❌ No | Legacy, mostly replaced
Top-p | ❌ No | ⚠️ Partially | General generation
Min-p | ❌ No | βœ… Yes | Creative + high-T generation
Temperature | Modifier | Modifier | Controls global diversity
Rep penalty | Modifier | Modifier | Prevents degeneration

Key Takeaways

Summary

  • β€’Decoding defines quality: The choice of decoding strategy (deterministic vs. stochastic) determines whether the model output is precise and repetitive or diverse and creative.
  • β€’Evolution of sampling: The field has moved from static truncation (top-k) to dynamic mass-based truncation (nucleus/top-p) to confidence-scaled truncation (min-p).
  • β€’Human text is high-perplexity: Optimizing for maximum likelihood (greedy/beam) produces unnatural, repetitive text because humans rarely pick the most probable token.
  • β€’Temperature vs. Truncation: Temperature (TTT) modifies the shape of the distribution (sharpness), while top-p/min-p modify the tail (truncation). They are orthogonal controls used together.
  • β€’Production pipelines: A robust generation pipeline applies repetition penalty first, then temperature scaling, and finally truncation sampling (min-p or top-p).

Common misconceptions

  • β€’"Beam search is always better": In open-ended generation, beam search often degrades quality by finding high-probability, generic loops. It is best for tasks with a single correct answer (translation).
  • β€’"Top-p fixes high temperature": At high TTT, the distribution flattens, causing top-p to include a long tail of irrelevant tokens. Min-p solves this by scaling the threshold relative to the top token's probability.
  • β€’"Top-k adapts to context": Top-k is rigid. It cuts off valid tokens in flat distributions and includes garbage in peaked ones. Only top-p and min-p adapt to the model's uncertainty.

Evaluation Rubric

  1. Explains greedy as argmax and its degeneration issues
  2. Shows beam search explores multiple hypotheses with length penalty
  3. Derives temperature's effect on softmax distribution sharpness
  4. Compares top-k vs top-p for dynamic vocabulary truncation
  5. Explains min-p's confidence-scaled threshold advantage over top-p
  6. Discusses top-p's failure at high temperatures (long tail)
  7. Knows production configurations for different use cases
  8. Discusses repetition penalty mechanisms
Common Pitfalls
  • Confusing temperature with top-p. They are independent controls
  • Thinking beam search is always better than greedy for all tasks
  • Not knowing top-p's failure at high temperatures (long tail problem)
  • Saying top-k adapts to context. It's fixed; only top-p and min-p adapt
  • Forgetting the order of operations in the sampling pipeline
  • Not knowing about min-p. ICLR 2025 Oral, adopted by major frameworks

Key Concepts Tested

  • Greedy decoding as argmax: deterministic but degenerate
  • Beam search with length penalty: structured tasks only
  • Temperature scaling: controls distribution sharpness mathematically
  • Top-k sampling: fixed vocabulary truncation
  • Top-p (nucleus) sampling: dynamic cumulative probability threshold
  • Min-p sampling: confidence-scaled threshold, robust at high temperature
  • Top-p failure at high temperatures (long tail problem)
  • Repetition and frequency/presence penalties
  • Production pipeline order: repetition penalty β†’ temperature β†’ truncation
  • Human text has high perplexity: maximum likelihood is un-human
References

[1] Holtzman, A., et al. (2020). The Curious Case of Neural Text Degeneration. ICLR 2020.

[2] Meister, C., Cotterell, R., & Vieira, T. (2020). If Beam Search is the Answer, What was the Question? EMNLP 2020.

[3] Fan, A., Lewis, M., & Dauphin, Y. (2018). Hierarchical Neural Story Generation. ACL 2018.

[4] Nguyen, et al. (2025). Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs. ICLR 2025 Oral.

[5] Keskar, N. S., et al. (2019). CTRL: A Conditional Transformer Language Model for Controllable Generation.
