Compare decoding strategies for text generation: greedy, beam search, top-k, nucleus (top-p), and min-p sampling, with temperature scaling and repetition penalty.
When a language model generates text, it doesn't produce words directly. It produces a ranked list of probabilities for what the next word should be. "The" might have a 30% chance, "A" might have 15%, "In" might have 10%, and so on across the entire vocabulary. Decoding is the strategy for choosing which word to actually pick from that list, and it dramatically affects whether the output is creative, repetitive, coherent, or nonsensical. (For how decoding fits into the broader inference pipeline and techniques like speculative decoding that accelerate it, see those deep-dives.)
🎯 Core concept: Decoding is where the model meets the real world. Understanding the evolution from greedy → beam → sampling strategies, how temperature reshapes probabilities, and the differences between top-k, top-p, and min-p is essential for tuning LLM outputs in production.
🗺️ Analogy, GPS navigation: Greedy decoding is like a GPS that always takes the best-looking turn right now. It might lead you down a beautiful street that dead-ends. Beam search is like having 5 GPS devices running in parallel, each exploring a different route, and you pick the one with the best total travel time at the end. The locally optimal choice isn't always globally optimal.
At each step, select the token with the highest probability:

$$y_t = \arg\max_{y \in V} P(y \mid y_{<t})$$
Reading the formula: at each step, look at every possible next word, pick the one with the highest probability, and commit to it. Simple and fast, but myopic, like always turning onto the prettiest street without checking if it leads to your destination.
Here is a basic implementation of greedy decoding in PyTorch. It takes a model and prompt, then iteratively selects the highest probability token using argmax until the max length or end-of-sequence token is reached:
```python
import torch

def greedy_decode(
    model: torch.nn.Module,
    prompt_ids: torch.Tensor,
    max_length: int,
    eos_token_id: int
) -> torch.Tensor:
    input_ids = prompt_ids
    for _ in range(max_length):
        logits = model(input_ids).logits[:, -1, :]
        next_token = logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == eos_token_id:
            break
    return input_ids
```
💡 Key insight from Holtzman et al. (2020):[1] Human-written text does NOT follow the most probable trajectory. The tokens humans choose often rank 10th–100th in the model's distribution. This means optimizing for maximum likelihood (greedy/beam) produces fundamentally un-human text.
Instead of keeping only the best token, maintain B (beam width) candidate sequences, expanding each by its top B continuations. This PyTorch snippet demonstrates beam search by tracking the top B candidate sequences at each step. It calculates the log probabilities for all possible next tokens across all current beams, and retains only the B sequences with the highest cumulative scores:
```python
def beam_search(
    model: torch.nn.Module,
    prompt_ids: torch.Tensor,
    beam_width: int = 5,
    max_length: int = 100,
    eos_token_id: int = 2
) -> torch.Tensor:
    # Each beam: (sequence, cumulative_log_prob)
    beams = [(prompt_ids, 0.0)]
    completed = []

    for step in range(max_length):
        all_candidates = []
        for seq, score in beams:
            if seq[0, -1].item() == eos_token_id:
                completed.append((seq, score))
                continue

            logits = model(seq).logits[:, -1, :]
            log_probs = torch.log_softmax(logits, dim=-1)
            top_k = log_probs.topk(beam_width)

            for i in range(beam_width):
                new_seq = torch.cat([seq, top_k.indices[:, i:i+1]], dim=-1)
                new_score = score + top_k.values[:, i].item()
                all_candidates.append((new_seq, new_score))

        # Stop early once every beam has emitted EOS
        if not all_candidates:
            break

        # Keep top beam_width candidates
        beams = sorted(all_candidates, key=lambda x: x[1],
                       reverse=True)[:beam_width]

    return max(completed or beams, key=lambda x: x[1])[0]
```
Raw log-probability scoring biases toward shorter sequences (each additional token multiplies probabilities < 1). The length penalty normalizes:

$$\text{score}(Y) = \frac{\log P(Y \mid X)}{|Y|^{\alpha}}$$
Reading the formula: without correction, longer sequences always score lower (each extra word multiplies by a probability < 1). Dividing by the length raised to the power α compensates: α = 0 means no correction (short wins), α = 1 fully normalizes for length (each word judged equally).
| α | Effect | Use Case |
|---|---|---|
| 0 | No normalization (favors short) | Short outputs |
| 0.6 | Balanced (common default) | Translation |
| 1.0 | Full normalization (favors long) | Summarization |
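The penalty itself is a one-liner. This sketch (with a hypothetical helper name) shows how α flips the ranking between a short, high-scoring beam and a longer one:

```python
def length_normalized_score(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """Length-penalized beam score: cumulative log-prob divided by length^alpha."""
    return log_prob / (length ** alpha)

# A 4-token beam with log-prob -4.0 vs. a 10-token beam with log-prob -7.0:
# at alpha=0 the short beam wins on raw score; at alpha=1 the per-token
# average favors the longer beam.
short_score = length_normalized_score(-4.0, length=4, alpha=1.0)   # -1.0
long_score = length_normalized_score(-7.0, length=10, alpha=1.0)   # -0.7
```

With α = 0.6, the common translation default, the outcome sits between these two extremes.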
⚠️ Counterintuitive: Beam search with larger beams produces more likely but less interesting text.[2] In open-ended generation, increasing beam width actually decreases output quality because the most probable sequence is generic and repetitive.
🚿 Analogy, shower temperature dial: Temperature controls the model's "sharpness" the same way a shower dial controls water temperature. Low temperature (T < 1) makes the model laser-focused on its top choices, like ice-cold water: bracing and precise. High temperature (T > 1) spreads probability across many tokens, like turning the dial to hot: more relaxed and exploratory. As T → 0, you get exactly one answer (greedy). As T → ∞, it's completely random.
Temperature modifies the logit distribution before softmax:

$$P(y_i) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}$$
Reading the formula: divide the raw logits by temperature T before applying softmax. When T < 1, the division amplifies differences between logits, making the top choice even more dominant. When T > 1, it compresses differences, spreading probability more evenly across options. As T → 0, it becomes greedy; as T → ∞, it becomes random.
The following function applies temperature scaling by dividing the raw logits by the temperature parameter before they are passed to the softmax function. A temperature less than 1.0 makes the distribution sharper, while greater than 1.0 flattens it:
```python
def apply_temperature(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Scale logits by temperature before softmax."""
    return logits / temperature
```
| Temperature | Effect on Distribution | Entropy |
|---|---|---|
| T → 0 | Approaches one-hot (greedy) | → 0 |
| T = 1 | Original model distribution | Baseline |
| T > 1 | Flattened (more uniform) | Increases |
| T → ∞ | Approaches uniform random | Maximal |
Why it works: Dividing logits by T < 1 amplifies differences between logits, making the distribution peakier. Dividing by T > 1 shrinks differences, making probabilities more similar. This controls the sharpness of the model's confidence.
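A quick sanity check of that sharpening effect (the `top_prob` helper name is illustrative):

```python
import torch

def top_prob(logits: torch.Tensor, temperature: float) -> float:
    """Probability assigned to the most likely token after temperature scaling."""
    return torch.softmax(logits / temperature, dim=-1).max().item()

logits = torch.tensor([2.0, 1.0, 0.0])
# Lower temperature concentrates mass on the top token; higher spreads it out.
cold = top_prob(logits, 0.5)
base = top_prob(logits, 1.0)
hot = top_prob(logits, 2.0)
```

The top token's probability rises monotonically as temperature falls, which is exactly the "peakier distribution" effect described above.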
| Use Case | Temperature | Rationale |
|---|---|---|
| Code generation | 0.0–0.3 | Correctness > creativity |
| Factual QA | 0.1–0.3 | Precision matters |
| General chat | 0.7–0.9 | Natural variety |
| Creative writing | 1.0–1.5 | Encourage diversity |
| Brainstorming | 1.2–2.0 | Maximum creativity |
Restrict sampling to the top k most probable tokens, then renormalize. This implementation of top-k sampling limits the probability distribution to the k most likely next tokens. It masks out all other tokens by setting their logits to negative infinity before applying softmax and sampling from the remaining probability mass:
```python
def top_k_sampling(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Sample from the top-k most probable tokens."""
    top_k_values, top_k_indices = logits.topk(k)
    filtered = torch.full_like(logits, float('-inf'))
    filtered.scatter_(1, top_k_indices, top_k_values)
    probs = torch.softmax(filtered, dim=-1)
    return torch.multinomial(probs, 1)
```
The optimal k varies dramatically by context:
| Context | Distribution Shape | Ideal k |
|---|---|---|
| "The capital of France is" | Very peaked | ~3 |
| "I enjoy eating" | Moderately flat | ~50 |
| "The" | Very flat | ~500 |
A fixed k either cuts valid tokens (too small) or includes garbage tokens (too large). This motivated dynamic approaches.
🚪 Analogy, nightclub VIP list: Top-k is like a bouncer who always lets in exactly 50 people, even if the club is empty (too many) or packed (not enough). Top-p is like a bouncer who keeps admitting people until 90% of the night's expected crowd is inside. On a slow night, only a few VIPs get in; on a busy night, the guest list is much longer. This dynamic threshold adapts naturally to each context.
Instead of fixing k, dynamically select the smallest set of tokens whose cumulative probability exceeds threshold p:

$$V_p = \text{smallest } V' \subseteq V \;\text{ such that }\; \sum_{y \in V'} P(y \mid y_{<t}) \ge p$$
Reading the formula: sort all tokens by probability, then include tokens from the top until their cumulative probability reaches p (e.g., p = 0.9). On easy predictions where one token has 95% probability, only that token qualifies. On hard predictions where many tokens share probability, a large set is included. The vocabulary size adapts to the model's confidence.
Here is how nucleus sampling is implemented in PyTorch. It sorts the logits, calculates cumulative probabilities, and zeroes out any tokens that exceed the target threshold before renormalizing and sampling:
```python
def nucleus_sampling(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Top-p (nucleus) sampling: dynamic vocabulary truncation."""
    sorted_logits, sorted_indices = logits.sort(descending=True)
    cumulative_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)

    # Remove tokens with cumulative probability above threshold
    sorted_indices_to_remove = cumulative_probs > p
    # Keep at least one token
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    indices_to_remove = sorted_indices_to_remove.scatter(
        1, sorted_indices, sorted_indices_to_remove
    )
    logits[indices_to_remove] = float('-inf')
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1)
```
| Distribution Type | Tokens Included | Behavior |
|---|---|---|
| Peaked (factual context) | 2–3 tokens | Nearly deterministic |
| Moderate (general text) | 20–50 tokens | Balanced diversity |
| Flat (creative context) | 100+ tokens | Highly diverse |
Top-p adapts automatically to the model's confidence: the more uncertain the model, the more tokens it considers.
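That adaptivity is easy to verify by counting the nucleus size on synthetic distributions (`nucleus_size` is an illustrative helper, not a library function):

```python
import torch

def nucleus_size(probs: torch.Tensor, p: float = 0.9) -> int:
    """Size of the smallest token set whose cumulative probability reaches p."""
    sorted_probs, _ = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    return int((cumulative < p).sum().item()) + 1

peaked = torch.tensor([0.95, 0.03, 0.01, 0.01])  # confident model: nucleus of 1
flat = torch.tensor([0.25, 0.25, 0.25, 0.25])    # uncertain model: nucleus of 4
```

The same p = 0.9 yields a one-token vocabulary on the peaked distribution and the full vocabulary on the flat one.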
Top-p has a subtle flaw at high temperatures (T > 1): temperature flattening causes many low-probability tokens to collectively fill the threshold mass. At high temperature with p = 0.9, you might include hundreds of tokens, many with negligible individual probability, leading to incoherent outputs. This is exactly the problem min-p solves.
🎯 Analogy, hiring bar: Top-p is like saying "hire the cheapest 90% of applicants." At a bad company (uncertain model), that includes some terrible candidates. Min-p is like saying "only hire people at least 10% as qualified as the best applicant." If your top candidate is stellar, the bar is high. If your top candidate is mediocre, the bar is low. The threshold adapts to the quality of what's available, not to an arbitrary cutoff.
Min-p (ICLR 2025 Oral) uses the top token's probability as a scaling factor for the threshold:

$$\text{keep } y \iff P(y \mid y_{<t}) \ge p_{\text{base}} \cdot \max_{y'} P(y' \mid y_{<t})$$
Reading the formula: find the most likely token, then keep only tokens whose probability is at least a fraction p_base of that top token's probability. If the best token has 80% probability and p_base = 0.1, only tokens above 8% are kept. If the best has only 5%, the bar drops to 0.5%, automatically adapting to context difficulty.
| Scenario | Top-p (p = 0.9) | Min-p (p_base = 0.1) |
|---|---|---|
| Model confident (peaked distribution) | Includes many low-prob tokens in the 0.9 mass | Only tokens with P ≥ 0.1 · p_max ✅ |
| Model uncertain (flat distribution) | Might cut off valid alternatives | Includes all tokens with P ≥ 0.1 · p_max ✅ |
| High temperature (T > 1) | Long tail of garbage tokens | Threshold scales with confidence ✅ |
The min-p sampling function dynamically calculates a threshold based on the maximum probability in the distribution multiplied by min_p. It then zeroes out any token probabilities that fall below this threshold and renormalizes the remaining distribution:
```python
def min_p_sampling(logits: torch.Tensor, min_p: float = 0.1, temperature: float = 1.0) -> torch.Tensor:
    """Min-p sampling: confidence-scaled dynamic truncation."""
    # Apply temperature
    logits = logits / temperature
    probs = torch.softmax(logits, dim=-1)

    # Dynamic threshold: min_p × max probability
    max_prob = probs.max(dim=-1, keepdim=True).values
    threshold = min_p * max_prob

    # Zero out tokens below threshold
    filtered_probs = probs.clone()
    filtered_probs[probs < threshold] = 0.0

    # Renormalize and sample
    filtered_probs = filtered_probs / filtered_probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(filtered_probs, 1)
```
Min-p has been integrated into Hugging Face Transformers, vLLM, llama.cpp, and Ollama. It's the recommended sampling method for DeepSeek-R1 and widely adopted for creative text generation where top-p struggles at high temperatures.
| Method | Threshold | Adapts to Confidence? | High-T Behavior |
|---|---|---|---|
| Top-k | Fixed k tokens | ❌ No | Includes garbage |
| Top-p | Fixed cumulative mass p | ⚠️ Partially | Long tail problem |
| Min-p | Scaled by top-token probability | ✅ Yes | Clean truncation |
Autoregressive models are prone to degenerate repetition:
```text
"The cat sat on the mat. The cat sat on the mat. The cat sat on..."
```
Divide the logits of previously generated tokens by a penalty factor. This function applies a multiplicative repetition penalty to the logits of tokens that have already been generated. If a token is in the generated_ids, its logit is divided by the penalty factor (for positive logits) or multiplied (for negative logits) to reduce its likelihood of being selected again:
```python
def apply_repetition_penalty(
    logits: torch.Tensor,
    generated_ids: torch.Tensor,
    penalty: float = 1.2
) -> torch.Tensor:
    """Penalize tokens that already appeared in the output (multiplicative)."""
    # Assumes logits shape: (batch_size, vocab_size)
    # Assumes generated_ids shape: (batch_size, seq_len)
    batch_size = logits.shape[0]

    for i in range(batch_size):
        # Get unique tokens generated for this sequence
        unique_tokens = torch.unique(generated_ids[i])

        for token_id in unique_tokens:
            if logits[i, token_id] > 0:
                logits[i, token_id] /= penalty
            else:
                logits[i, token_id] *= penalty

    return logits
```
OpenAI's API exposes two fine-grained controls:

$$z_i' = z_i - \alpha_{\text{frequency}} \cdot \text{count}(y_i) - \alpha_{\text{presence}} \cdot \mathbf{1}[\text{count}(y_i) > 0]$$
Reading the formula: two knobs to discourage repetition. The frequency penalty grows with each use (say a word 5 times, it gets 5× the penalty, discouraging overuse). The presence penalty is a flat one-time penalty the moment a token is used at all (encouraging topic diversity). Together they let you tune how much repetition is acceptable.
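These penalties are applied server-side by the API, but the additive formulation is simple to sketch locally. The function name and the unbatched logits shape below are illustrative assumptions:

```python
import torch
from collections import Counter

def apply_freq_presence_penalty(
    logits: torch.Tensor,
    generated_ids: list,
    frequency_penalty: float = 0.5,
    presence_penalty: float = 0.5,
) -> torch.Tensor:
    """Subtract a count-scaled and a flat one-time penalty from used tokens."""
    logits = logits.clone()
    for token_id, count in Counter(generated_ids).items():
        logits[token_id] -= frequency_penalty * count  # grows with each use
        logits[token_id] -= presence_penalty           # flat, once per token
    return logits

# A token used 3 times is penalized harder than one used once.
out = apply_freq_presence_penalty(torch.zeros(10), [3, 3, 3, 5])
```

Contrast this with the multiplicative repetition penalty above: additive penalties scale smoothly with usage count, while the multiplicative form applies the same factor regardless of how often a token repeats.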
In practice, multiple strategies are combined in a specific order. The standard pipeline applies modifications that reshape the distribution before applying truncation.
The following pipeline function demonstrates the standard production order of operations for token generation. It first applies the repetition penalty, then scales the logits by temperature, and finally performs truncation sampling (using min-p in this example) to select the final token:
```python
def generate_token(
    logits: torch.Tensor,
    generated_ids: torch.Tensor,
    temperature: float = 0.8,
    min_p: float = 0.1,
    repetition_penalty: float = 1.1,
) -> torch.Tensor:
    """Production-grade token generation pipeline."""
    # Step 1: Apply repetition penalty (before temperature)
    logits = apply_repetition_penalty(logits, generated_ids,
                                      repetition_penalty)

    # Step 2: Apply temperature
    logits = logits / temperature

    # Step 3: Apply min-p sampling
    # Note: We pass temperature=1.0 because we already applied it in Step 2.
    # Applying it again would double-scale (equivalent to T^2).
    next_token = min_p_sampling(logits, min_p=min_p, temperature=1.0)

    return next_token
```
| Use Case | Temperature | Sampler | Rep Penalty | Notes |
|---|---|---|---|---|
| Code generation | 0.0–0.2 | Greedy or top-p=0.95 | 1.0 | Correctness is paramount |
| Factual QA | 0.1–0.3 | Top-p=0.9 | 1.0 | Precision over diversity |
| Chat | 0.7–0.9 | Min-p=0.1 | 1.1 | Natural, non-repetitive |
| Creative writing | 1.0–1.5 | Min-p=0.05 | 1.2 | High diversity, coherent |
| Brainstorming | 1.2–2.0 | Min-p=0.02 | 1.3 | Maximum exploration |
| Reasoning (DeepSeek-R1) | 0.6 | Min-p=0.05 | 1.0 | Chain-of-thought quality |
| Strategy | Deterministic? | Adapts to Context? | Best For |
|---|---|---|---|
| Greedy | ✅ Yes | ❌ No | Extraction, classification |
| Beam search | ✅ Yes | ❌ No | Translation, summarization |
| Top-k | ❌ No | ❌ No | Legacy, mostly replaced |
| Top-p | ❌ No | ⚠️ Partially | General generation |
| Min-p | ❌ No | ✅ Yes | Creative + high-T generation |
| Temperature | Modifier | Modifier | Controls global diversity |
| Rep penalty | Modifier | Modifier | Prevents degeneration |
[1] Holtzman, A., et al. (2020). The Curious Case of Neural Text Degeneration. ICLR 2020.
[2] Keskar, N. S., et al. (2019). CTRL: A Conditional Transformer Language Model for Controllable Generation.
[3] Meister, C., Cotterell, R., & Vieira, T. (2020). If Beam Search is the Answer, What was the Question? EMNLP 2020.
[4] Nguyen, et al. (2025). Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs. ICLR 2025 Oral.
[5] Fan, A., Lewis, M., & Dauphin, Y. (2018). Hierarchical Neural Story Generation. ACL 2018.