Compare decoding strategies for text generation: greedy, beam search, top-k, nucleus (top-p), temperature, repetition controls, and newer variants like min-p.
The previous deep dives followed information through transformer internals until the model produced logits. Decoding strategy controls how that final probability distribution becomes one visible completion. This chapter compares greedy decoding, beam search, temperature sampling, top-k sampling, nucleus sampling, and min-p so you can tune output quality intentionally.
Imagine an order-status assistant that hasn't chosen its next word yet. Internally, it may assign a 70% chance to shipped, a 20% chance to delayed, and a 10% chance to refunded. Language models do this at every generation step: they produce raw scores for many possible next tokens, convert those scores into probabilities, then use a decoding policy to pick what appears on screen.
That final choice has real product consequences. A support bot usually needs stable, factual phrasing. A brainstorming tool needs more variety. A code generator often benefits from determinism, while a story generator may sound lifeless if it picks the highest-probability token every time. Decoding is the control surface that turns the same model distribution into those different behaviors.
Separate articles cover how decoding fits into the broader inference pipeline and related techniques such as speculative decoding. Speculative decoding accelerates generation by having a draft model propose tokens that the target model verifies in parallel.[1]
Before we compare strategies, we need to see what happens inside the model at each generation step.
LLMs generate text one token at a time. They predict the next token, append it to the prompt, and repeat. This is called autoregressive generation (auto = self, regressive = predicting the next step from previous ones).
At every step the model outputs a vector of raw scores called logits, one score per token in the vocabulary. A function called softmax converts those logits into probabilities that sum to 1. The decoding strategy then decides which token to emit.
Suppose the prompt so far is "The carrier scan shows the package is". Here is a plausible probability distribution for the very next token:
| Token | Logit | Probability |
|---|---|---|
| delayed | 2.00 | 0.45 |
| in | 1.60 | 0.30 |
| on | 0.90 | 0.15 |
| broken | -1.11 | 0.02 |
| ...(many tail tokens) | ... | 0.08 total |
These numbers are fabricated for clarity. Logits are shown only for named candidates; omitted tail tokens jointly carry the final probability mass. One token (delayed) leads the pack, a few others are plausible, and many individually weak tokens share the tail.
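To make the logits-to-probabilities step concrete, here is a minimal softmax sketch in plain Python. It uses only the four named logits from the table, so the printed probabilities come out slightly higher than the table's values because the omitted tail is not in the denominator. The ratios between named tokens are unchanged. The variable names are illustrative assumptions.

```python
from math import exp

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    weights = [exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [weight / total for weight in weights]

# Logits for the four named candidates only (the tail is omitted here).
named_logits = {"delayed": 2.00, "in": 1.60, "on": 0.90, "broken": -1.11}
named_probs = dict(zip(named_logits, softmax(list(named_logits.values()))))

for token, probability in named_probs.items():
    print(f"{token:>8}: {probability:.3f}")
```

Dividing each table probability by the 0.92 of mass held by the named tokens gives roughly the same numbers, which is a quick consistency check on the fabricated table.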
Greedy decoding is like picking the best-looking warehouse route right now. It might lead to a hub cutoff you can't meet. Beam search is like keeping several route plans alive in parallel and picking the one with the best total arrival time at the end. The locally optimal choice isn't necessarily the globally best route.
The question we'll answer in this article is: given this distribution, how do we pick the next token? And how does that choice change the whole sentence?
The simplest strategy is greedy decoding: at every step, pick the single most probable token and commit to it.
Using the table above, greedy decoding chooses delayed. The sentence becomes:
"The carrier scan shows the package is delayed"
That's a fine sentence. But greedy decoding can fail when the best next word leads to a worse overall sentence. Imagine a slightly different distribution where delayed is most likely, but in followed by transit produces a much more natural and informative completion. Greedy doesn't explore that possibility because it locks in delayed immediately.
This is the greedy trap: the locally best token doesn't imply the globally best sequence. It's like a delivery driver who repeatedly takes the fastest next street and ends up in a traffic jam, while a slightly slower first turn would have opened a clear highway.
Human-written text often doesn't follow the model's single most likely continuation. Holtzman et al. show that maximum-likelihood decoding can drift toward bland, repetitive text, which is why pure argmax-style decoding often doesn't sound human.[2]
```python
first_step = {"delayed": 0.55, "in": 0.45}
continuations = {"delayed": {".": 0.40}, "in": {"transit": 0.90}}

greedy_first = max(first_step, key=first_step.get)
path_probabilities = {
    "delayed .": first_step["delayed"] * continuations["delayed"]["."],
    "in transit": first_step["in"] * continuations["in"]["transit"],
}
highest_sequence = max(path_probabilities, key=path_probabilities.get)

print(f"greedy first token: {greedy_first}")
print(f"sequence probabilities: {path_probabilities}")
print(f"higher-probability sequence: {highest_sequence}")

assert greedy_first == "delayed"
assert highest_sequence == "in transit"
```

```text
greedy first token: delayed
sequence probabilities: {'delayed .': 0.22000000000000003, 'in transit': 0.405}
higher-probability sequence: in transit
```

At each step, select the token with the highest probability:

$$x_t = \arg\max_{x \in V} P(x \mid x_{<t})$$
At each step, look at every possible next token, pick the one with the highest probability, and commit to it. Simple and fast, but myopic, similar to choosing the next warehouse hop without checking whether the full route can still meet the SLA.
Here is a basic implementation of greedy decoding in PyTorch. It assumes a single prompt in the batch, a model whose forward call returns .logits, and an integer EOS token ID. The loop repeatedly takes the argmax token and stops when it reaches EOS or the maximum number of decoding steps:
```python
import torch

def greedy_decode(
    model: torch.nn.Module,
    prompt_ids: torch.Tensor,
    max_length: int,
    eos_token_id: int
) -> torch.Tensor:
    """Generate one sequence with greedy decoding (batch size 1)."""
    input_ids = prompt_ids
    with torch.no_grad():
        for _ in range(max_length):
            # Score only the last position: that is the next-token distribution.
            logits = model(input_ids).logits[:, -1, :]
            next_token = logits.argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            if next_token[0, 0].item() == eos_token_id:
                break
    return input_ids
```

Instead of keeping only the best token, maintain B (beam width) candidate sequences, expanding each by its top B tokens. This simplified PyTorch snippet assumes batch size 1 and a model whose forward call returns .logits, and it keeps the highest-scoring partial sequences at every step. For readability, it leaves out batching and length normalization, which we'll cover in the next subsection:
```python
import torch

def beam_search(
    model: torch.nn.Module,
    prompt_ids: torch.Tensor,
    beam_width: int = 5,
    max_length: int = 100,
    eos_token_id: int = 2
) -> torch.Tensor:
    """Generate one sequence with beam search (batch size 1)."""
    # Each beam: (sequence, cumulative_log_prob)
    beams = [(prompt_ids, 0.0)]
    completed = []

    with torch.no_grad():
        for _ in range(max_length):
            all_candidates = []
            for seq, score in beams:
                # If this beam already ended, record it
                if seq[0, -1].item() == eos_token_id:
                    completed.append((seq, score))
                    continue

                # After step 1, beams have different lengths, so process each separately.
                # In production you would use KV caching or padded batching instead.
                logits = model(seq).logits[:, -1, :]
                log_probs = torch.log_softmax(logits, dim=-1)
                top_k = log_probs.topk(beam_width, dim=-1)

                for i in range(beam_width):
                    new_seq = torch.cat([seq, top_k.indices[:, i:i+1]], dim=-1)
                    new_score = score + top_k.values[:, i].item()
                    all_candidates.append((new_seq, new_score))

            if not all_candidates:
                break

            # Keep top beam_width candidates
            beams = sorted(all_candidates, key=lambda x: x[1],
                           reverse=True)[:beam_width]

    return max(completed or beams, key=lambda x: x[1])[0]
```

One common length penalty, used in the Google Neural Machine Translation system, is:[3]

$$\mathrm{lp}(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}, \qquad \mathrm{score}(Y) = \frac{\log P(Y \mid X)}{\mathrm{lp}(Y)}$$
Without correction, longer sequences receive lower unnormalized probability because each extra token multiplies the sequence probability by a value smaller than 1. The exponent $\alpha$ controls how strongly you compensate for that short-sequence bias. When $\alpha = 0$, there's no correction. Larger values reduce the bias toward short completions.
| $\alpha$ | Effect | Use Case |
|---|---|---|
| 0 | No normalization (favors short) | Short outputs |
| 0.6 | Balanced (common neural machine translation setting) | Translation |
| 1.0 | Stronger bias toward longer outputs | Summarization |
```python
def gnmt_score(log_probability, length, alpha=0.6):
    penalty = ((5 + length) / 6) ** alpha
    return log_probability / penalty

short = {"text": "late", "log_probability": -1.00, "length": 2}
complete = {"text": "delayed in transit", "log_probability": -1.08, "length": 5}

raw_choice = max([short, complete], key=lambda item: item["log_probability"])["text"]
normalized_choice = max(
    [short, complete],
    key=lambda item: gnmt_score(item["log_probability"], item["length"]),
)["text"]
print(f"raw sequence score chooses: {raw_choice}")
print(f"length-normalized score chooses: {normalized_choice}")

assert raw_choice == "late"
assert normalized_choice == "delayed in transit"
```

```text
raw sequence score chooses: late
length-normalized score chooses: delayed in transit
```

Counterintuitively, beam search with larger beams can produce more likely but less interesting text.[4] In open-ended generation, increasing beam width can reduce output quality because the most probable sequence is often generic and repetitive.
Before we add randomness, we need a dial that controls how much randomness to allow. Temperature is that dial. It reshapes the probability distribution before we sample from it.
Return to our running example. The raw model gives us these probabilities for the next token after "The carrier scan shows the package is":
| Token | Original Probability |
|---|---|
| delayed | 0.45 |
| in | 0.30 |
| on | 0.15 |
| broken | 0.02 |
| tail | 0.08 |
Here is what happens when we apply different temperatures and then run softmax:
| Token | $T = 0.3$ (sharp) | $T = 1.0$ (original) | $T = 1.5$ (flat) |
|---|---|---|---|
| delayed | ~0.78 | 0.45 | ~0.37 |
| in | ~0.20 | 0.30 | ~0.28 |
| on | ~0.02 | 0.15 | ~0.18 |
| broken | ~0.00 | 0.02 | ~0.05 |
| tail | ~0.00 | 0.08 | ~0.12 |

For this worked table only, the tail is treated as one aggregated category and each original probability is reshaped proportionally to $p^{1/T}$ before renormalization. A real decoder divides each individual token logit by $T$ before softmax; it never aggregates the tail first. At $T = 0.3$, delayed is so dominant that sampling usually picks it. At $T = 1.5$, the probabilities spread out, and even broken becomes a possible sample. As $T \to 0$, decoding collapses toward greedy argmax. As $T \to \infty$, the distribution approaches uniform.
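The reshaping used for the worked table can be sketched in a few lines. This is an illustration under stated assumptions: the temperatures 0.3 and 1.5 are chosen because they reproduce the table's rounded values, and the function names are hypothetical. The second function shows that raising probabilities to the power $1/T$ and renormalizing matches treating log-probabilities as logits, dividing by $T$, and applying softmax.

```python
from math import exp, log

def reshape_with_temperature(probs, temperature):
    """Raise each probability to 1/T, then renormalize."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    weights = [p ** (1.0 / temperature) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_of_scaled_log_probs(probs, temperature):
    """Equivalent route: use log-probabilities as logits, divide by T, softmax."""
    scaled = [log(p) / temperature for p in probs]
    peak = max(scaled)
    weights = [exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

tokens = ["delayed", "in", "on", "broken", "tail"]
original = [0.45, 0.30, 0.15, 0.02, 0.08]

for temperature in (0.3, 1.0, 1.5):
    reshaped = reshape_with_temperature(original, temperature)
    row = ", ".join(f"{t}={p:.2f}" for t, p in zip(tokens, reshaped))
    print(f"T={temperature}: {row}")
```

The two routes agree because softmax only depends on logit differences, so any constant offset between the true logits and the log-probabilities cancels out.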
Temperature controls distribution sharpness like a routing strictness dial. Low temperature ($T < 1$) keeps the sampler focused on the top choices, like a fulfillment router that picks the safest carrier. High temperature ($T > 1$) spreads probability across more tokens, like allowing less common recovery actions when the situation is ambiguous.
Temperature modifies the logit distribution before softmax (the function that converts raw model scores into probabilities):

$$P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Divide the raw logits $z_i$ by temperature $T$ before applying softmax. When $T < 1$, the division amplifies differences between logits, making the top choice even more dominant. When $T > 1$, it compresses differences, spreading probability more evenly across options. An implementation should special-case temperature=0 as deterministic decoding rather than divide by zero.
The following function applies temperature scaling to raw logits. It divides logits by the temperature before they pass through softmax. A temperature below 1.0 sharpens the distribution; a value above 1.0 flattens it:
```python
import torch

def apply_temperature(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Scale logits by temperature. Handle temperature=0 as greedy decoding elsewhere."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for temperature=0")
    return logits / temperature

scaled = apply_temperature(torch.tensor([2.0, 1.0]), temperature=0.5)
print("scaled logits:", scaled.tolist())
try:
    apply_temperature(torch.tensor([2.0, 1.0]), temperature=0.0)
except ValueError as error:
    print("zero temperature:", error)

assert scaled.tolist() == [4.0, 2.0]
```

```text
scaled logits: [4.0, 2.0]
zero temperature: temperature must be > 0; use greedy decoding for temperature=0
```

| Temperature | Effect on Distribution | Entropy |
|---|---|---|
| $T \to 0$ | Approaches one-hot (greedy) | near 0 |
| $T = 1$ | Original model distribution | Baseline |
| $T > 1$ | Flattened (more uniform) | Increases |
| $T \to \infty$ | Approaches uniform random | near $\log \lvert V \rvert$ |
```python
from math import exp

def softmax(logits):
    shifted = [logit - max(logits) for logit in logits]
    weights = [exp(logit) for logit in shifted]
    total = sum(weights)
    return [weight / total for weight in weights]

def apply_temperature(logits, temperature):
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for temperature=0")
    return [logit / temperature for logit in logits]

logits = [2.0, 1.0, 0.0]
low_t = softmax(apply_temperature(logits, 0.5))
high_t = softmax(apply_temperature(logits, 2.0))

assert low_t[0] > high_t[0]
assert high_t[-1] > low_t[-1]
assert round(sum(high_t), 12) == 1.0
```
Dividing logits by $T < 1$ amplifies differences between logits, making the distribution peakier. When $T > 1$, the same operation shrinks differences, making probabilities more similar. This controls the sharpness of the sampling distribution.
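The link between temperature and entropy can be checked numerically. The sketch below is an illustration, not part of any library: it computes Shannon entropy (in nats) of a three-token softmax at several temperatures and compares it with the maximum possible value, the log of the vocabulary size. The toy logits and temperature grid are assumptions.

```python
from math import exp, log

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a positive temperature."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    weights = [exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.0]
max_entropy = log(len(logits))  # entropy of a uniform distribution over 3 tokens

for temperature in (0.1, 0.5, 1.0, 2.0, 100.0):
    h = entropy(softmax(logits, temperature))
    print(f"T={temperature:>5}: entropy={h:.3f} (max {max_entropy:.3f})")
```

Entropy rises with temperature: it sits near 0 when the distribution is almost one-hot and approaches the uniform limit as the logits are flattened.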
| Evaluation slice | Temperature candidates | What to measure |
|---|---|---|
| Code generation | 0.0, 0.2, 0.5 | Tests passed, format validity, diversity |
| Factual QA / inventory lookup | 0.0, 0.2, 0.5 | Grounded accuracy and unsupported claims |
| General chat / customer support | 0.3, 0.7, 1.0 | Helpfulness, repetition, policy adherence |
| Creative writing / descriptions | 0.7, 1.0, 1.3 | Diversity and coherence |
Top-k sampling was popularized in neural story generation as a way to restrict sampling to the top $k$ most probable tokens, then renormalize the distribution.[5] Let's walk through it with our running example.
Suppose the model gives these probabilities after the prompt "The carrier scan shows the package is":
| Token | Probability | Cumulative |
|---|---|---|
| delayed | 0.45 | 0.45 |
| in | 0.30 | 0.75 |
| on | 0.15 | 0.90 |
| broken | 0.02 | 0.92 |
| omitted tail tokens, individually below 0.02 | 0.08 total | 1.00 |

With top-k where $k = 2$, we keep only delayed and in, renormalize their probabilities to sum to 1, and sample from that smaller set. on, broken, and the tail are locked out.
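The renormalization step is plain arithmetic: the two kept tokens hold 0.75 of the mass, so they become 0.45 / 0.75 = 0.60 and 0.30 / 0.75 = 0.40. A minimal sketch, assuming the tail is represented as one aggregated entry (which does not matter here because it is cut anyway) and using a hypothetical helper name:

```python
def top_k_renormalize(tokens_and_probs, k):
    """Keep the k most probable tokens and rescale them to sum to 1."""
    if k < 1:
        raise ValueError("k must be >= 1")
    kept = sorted(tokens_and_probs, key=lambda item: item[1], reverse=True)[:k]
    mass = sum(probability for _, probability in kept)
    return {token: probability / mass for token, probability in kept}

distribution = [("delayed", 0.45), ("in", 0.30), ("on", 0.15), ("broken", 0.02), ("tail", 0.08)]
top_two = top_k_renormalize(distribution, k=2)
print({token: round(probability, 2) for token, probability in top_two.items()})
```

Sampling then draws from this two-token distribution, so delayed is picked about 60% of the time instead of 45%.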
This implementation of top-k sampling limits the probability distribution to only the most likely next tokens. It masks out all other tokens by setting their logits to negative infinity before applying softmax and sampling from the remaining probability mass:
```python
import torch

def top_k_sampling(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Sample from the top-k most probable tokens."""
    if k < 1:
        raise ValueError("k must be >= 1")
    k = min(k, logits.shape[-1])
    top_k_values, top_k_indices = logits.topk(k, dim=-1)
    # Start from -inf everywhere, then restore only the top-k logits.
    filtered = torch.full_like(logits, float('-inf'))
    filtered.scatter_(dim=-1, index=top_k_indices, src=top_k_values)
    probs = torch.softmax(filtered, dim=-1)
    return torch.multinomial(probs, 1)
```

The primary limitation of top-k sampling is its rigidity. A useful value for $k$ varies sharply depending on the context of the generation:
| Context | Distribution Shape | Candidate values to evaluate |
|---|---|---|
| "Your return policy allows" | Potentially peaked | 1, 3, 10 |
| "This product is great for" | Potentially flatter | 10, 30, 50 |
| "The warehouse" | Prompt-dependent | Measure rather than assume |
When the distribution is highly peaked, a large $k$ like 50 forces the sampler to include dozens of irrelevant tail tokens. If the temperature is high, those tail tokens can accumulate enough probability mass to be selected, causing off-topic generation. Conversely, when the distribution is flat, a small $k$ like 10 can cut off valid continuations and make the output too narrow. This inability to adapt to distribution shape motivated dynamic truncation approaches.
```python
def top_k_tokens(tokens_and_probs, k):
    return [token for token, _ in sorted(tokens_and_probs, key=lambda item: item[1], reverse=True)[:k]]

def top_p_tokens(tokens_and_probs, p):
    kept = []
    cumulative = 0.0
    for token, probability in sorted(tokens_and_probs, key=lambda item: item[1], reverse=True):
        kept.append(token)
        cumulative += probability
        if cumulative >= p:
            break
    return kept

distribution = [("delayed", 0.45), ("in", 0.30), ("on", 0.15), ("broken", 0.02), ("tail", 0.08)]

assert top_k_tokens(distribution, 2) == ["delayed", "in"]
assert top_p_tokens(distribution, 0.8) == ["delayed", "in", "on"]
```

Top-k keeps exactly 50 candidate tokens if $k = 50$, even when only two are plausible or hundreds deserve consideration. Top-p keeps adding candidate tokens until their cumulative probability covers the requested mass. On an easy order-status reply, only a few candidates qualify; on an ambiguous escalation, the candidate set is much longer. This dynamic threshold adapts naturally to each context.
Instead of fixing $k$, dynamically select the smallest high-probability prefix whose cumulative probability reaches or exceeds a threshold $p$. This approach, also known as nucleus sampling, addresses the fixed candidate-count limitation of top-k.[2]
Using the same probability table:
| Token | Probability | Cumulative |
|---|---|---|
| delayed | 0.45 | 0.45 |
| in | 0.30 | 0.75 |
| on | 0.15 | 0.90 |
| broken | 0.02 | 0.92 |
| omitted tail tokens, individually below 0.02 | 0.08 total | 1.00 |

With top-p where $p = 0.8$, we include tokens until the cumulative probability reaches 0.8. That means delayed, in, and on are in the nucleus. broken and the tail are excluded. If the distribution were more peaked and delayed had 0.85 probability, the nucleus would contain only that one token.

$$V^{(p)} = \{x_1, \dots, x_m\}, \qquad m = \min\Big\{ j : \sum_{i=1}^{j} P(x_i \mid x_{<t}) \ge p \Big\}$$

where $x_1, x_2, \dots$ are sorted by decreasing probability.
```python
def nucleus_distribution(tokens_and_probs, threshold):
    kept = []
    mass = 0.0
    for token, probability in sorted(tokens_and_probs, key=lambda item: item[1], reverse=True):
        kept.append((token, probability))
        mass += probability
        if mass >= threshold:
            break
    return [(token, round(probability / mass, 3)) for token, probability in kept]

tokens = [("delayed", 0.45), ("in", 0.30), ("on", 0.15), ("broken", 0.02)]
nucleus = nucleus_distribution(tokens, threshold=0.8)
print("renormalized nucleus:", nucleus)

assert nucleus == [("delayed", 0.5), ("in", 0.333), ("on", 0.167)]
assert round(sum(probability for _, probability in nucleus), 3) == 1.0
```

```text
renormalized nucleus: [('delayed', 0.5), ('in', 0.333), ('on', 0.167)]
```

Sort all tokens by probability, then include tokens from the top until their cumulative probability reaches $p$ (e.g., 0.9). On easy predictions where one token has 95% probability, only that token qualifies. On hard predictions where many tokens share probability, a large set is included. The candidate set adapts to the distribution shape.
Here is how nucleus sampling is implemented in PyTorch. The function sorts the logits, computes cumulative probability mass, masks away tokens outside the nucleus, and then samples from the renormalized distribution:
```python
import torch

def nucleus_sampling(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Top-p (nucleus) sampling: dynamic vocabulary truncation."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")

    sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Remove tokens with cumulative probability above threshold
    sorted_indices_to_remove = cumulative_probs > p
    # Shift right so the token that crosses the threshold is kept,
    # and always keep at least one token
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    # Map the removal mask from sorted order back to vocabulary order
    indices_to_remove = torch.zeros_like(logits, dtype=torch.bool)
    indices_to_remove.scatter_(dim=-1, index=sorted_indices, src=sorted_indices_to_remove)

    filtered_logits = logits.clone()
    filtered_logits[indices_to_remove] = float('-inf')
    probs = torch.softmax(filtered_logits, dim=-1)
    return torch.multinomial(probs, 1)
```

| Distribution Type | Illustrative Tokens Included | Behavior |
|---|---|---|
| Peaked (factual context) | 2-3 tokens | Nearly deterministic |
| Moderate (general text) | 20-50 tokens | Balanced diversity |
| Flat (creative context) | 100+ tokens | Highly diverse |
Top-p adapts to the distribution shape: flatter distributions produce larger candidate sets.
Those counts are heuristics, not guarantees. The actual nucleus size depends on the model, tokenizer, prompt, and temperature.
Top-p has a failure mode to evaluate at high temperatures ($T > 1$): temperature flattening can cause many individually weak tokens to enter the cumulative-mass set. In those settings, a nucleus can expand substantially. This is one motivation for newer confidence-scaled variants such as min-p.
```python
from math import exp

def probabilities(logits, temperature):
    weights = [exp(logit / temperature) for logit in logits]
    total = sum(weights)
    return [weight / total for weight in weights]

def nucleus_size(probs, threshold):
    mass = 0.0
    for index, probability in enumerate(sorted(probs, reverse=True), start=1):
        mass += probability
        if mass >= threshold:
            return index

logits = [4.0, 2.0, 1.0, 0.0, -0.5, -1.0]
focused = nucleus_size(probabilities(logits, 0.7), threshold=0.9)
flattened = nucleus_size(probabilities(logits, 1.5), threshold=0.9)
print(f"nucleus size at T=0.7: {focused}")
print(f"nucleus size at T=1.5: {flattened}")

assert flattened > focused
```

```text
nucleus size at T=0.7: 1
nucleus size at T=1.5: 3
```

Top-p ranks candidate next tokens and keeps enough of them to cover 90% of total probability mass. If probability is concentrated in a few actions, the shortlist stays small. If probability is spread out, the shortlist gets long. Min-p instead says "only keep candidates whose probability is at least 10% of the top-ranked candidate." If that candidate is dominant, the bar is high. If it is weak, the bar is lower. The threshold adapts to relative peak probability, not a fixed cumulative cutoff.
Min-p, proposed by Nguyen et al. and published as an ICLR 2025 oral paper, scales the cutoff by the top token's probability instead of using a cumulative-mass threshold:[6]

$$V_{\text{min-p}} = \{\, x : P(x \mid x_{<t}) \ge \rho \cdot p_{\max} \,\}, \qquad p_{\max} = \max_{x'} P(x' \mid x_{<t})$$

Find the most likely token, then keep tokens whose probability is at least a fraction $\rho$ of that top token's probability. If the top token has 80% probability and $\rho = 0.1$, tokens at or above 8% are kept. If the top token has 5%, the bar drops to 0.5%, adapting to distribution shape.
| Scenario | Top-p ($p = 0.9$) | Min-p ($\rho = 0.1$) |
|---|---|---|
| Model confident (high top-token probability) | Keeps enough tokens to reach 90% mass, which can still include a long tail | Keeps only tokens with probability at least $\rho$ times the top token's probability, a high bar |
| Model uncertain (low top-token probability) | Still targets 90% cumulative mass | Lowers the cutoff in step with the top token's probability, so more alternatives survive |
| High temperature ($T > 1$) | Flattened tails can enter the nucleus | Relative cutoff can trim more of that tail |
```python
def min_p_tokens(tokens_and_probs, rho):
    max_probability = max(probability for _, probability in tokens_and_probs)
    threshold = rho * max_probability
    return [token for token, probability in tokens_and_probs if probability >= threshold]

confident = [("shipped", 0.80), ("delayed", 0.09), ("purple", 0.01)]
uncertain = [("shipped", 0.05), ("delayed", 0.04), ("in", 0.03)]

assert min_p_tokens(confident, rho=0.1) == ["shipped", "delayed"]
assert min_p_tokens(uncertain, rho=0.5) == ["shipped", "delayed", "in"]
```

The min-p sampling function dynamically calculates a threshold based on the maximum probability in the distribution. It takes raw logits, a min_p scaling factor, and a temperature as inputs. The function applies temperature, zeroes out token probabilities that fall below the dynamic threshold (calculated as min_p times the maximum probability), and renormalizes before returning a sampled token:
```python
import torch

def min_p_sampling(logits: torch.Tensor, min_p: float = 0.1, temperature: float = 1.0) -> torch.Tensor:
    """Min-p sampling: confidence-scaled dynamic truncation."""
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be in [0, 1]")
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for temperature=0")

    # Apply temperature
    logits = logits / temperature
    probs = torch.softmax(logits, dim=-1)

    # Dynamic threshold: min_p * max probability
    max_prob = probs.max(dim=-1, keepdim=True).values
    threshold = min_p * max_prob

    # Zero out tokens below threshold
    filtered_probs = probs.clone()
    filtered_probs[probs < threshold] = 0.0

    # Renormalize and sample
    filtered_probs = filtered_probs / filtered_probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(filtered_probs, 1)
```

Min-p is a newer truncation heuristic to evaluate, rather than a universal successor to nucleus sampling. Nguyen et al. propose it for controlling low-probability candidates admitted by top-p at higher temperatures.[6] A subsequent critical reanalysis reports that min-p didn't reliably improve quality-diversity tradeoffs against commonly used samplers in its experiments and disputes broad adoption claims.[7] Meanwhile, DeepSeek-R1 reports temperature 0.6 with top-p 0.95 in one published evaluation configuration.[8] Treat any of these settings as experiment inputs, not presets for a different model or product.
| Method | Threshold | Adapts to distribution shape? | Main tradeoff |
|---|---|---|---|
| Top-k | Fixed $k$ tokens | No | Simple, but rigid |
| Top-p | Fixed cumulative mass $p$ | Partly | Dynamic, but can admit a long tail |
| Min-p | $\rho \times$ top-token probability | Yes | Relative cutoff; compare empirically with top-p |
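To see the three thresholds side by side, the sketch below counts how many candidates each filter keeps on two made-up ten-token distributions, one peaked and one nearly flat. The distributions, the helper names, and the settings (3 for top-k, 0.9 for top-p, 0.1 for min-p) are illustrative assumptions, not measurements from a real model.

```python
def top_k_size(probs, k):
    """Top-k always keeps k tokens (or the whole vocabulary if smaller)."""
    return min(k, len(probs))

def top_p_size(probs, p):
    """Size of the smallest sorted prefix whose cumulative mass reaches p."""
    mass = 0.0
    for count, probability in enumerate(sorted(probs, reverse=True), start=1):
        mass += probability
        if mass >= p:
            return count
    return len(probs)

def min_p_size(probs, rho):
    """Number of tokens at or above rho times the top probability."""
    threshold = rho * max(probs)
    return sum(1 for probability in probs if probability >= threshold)

# Fabricated distributions: each sums to 1.
peaked = [0.85, 0.03, 0.03, 0.02, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01]
flat = [0.12, 0.11, 0.11, 0.10, 0.10, 0.10, 0.09, 0.09, 0.09, 0.09]

for name, probs in (("peaked", peaked), ("flat", flat)):
    print(
        f"{name}: top-k={top_k_size(probs, 3)}, "
        f"top-p={top_p_size(probs, 0.9)}, min-p={min_p_size(probs, 0.1)}"
    )
```

On the peaked case, top-p still admits two weak tokens to reach 90% mass while min-p keeps only the dominant one; on the flat case both adaptive filters widen and top-k stays at 3. Whether that difference helps output quality is an empirical question for your own evaluation.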
Autoregressive models can fall into degenerate repetition, where the generation process gets stuck in a loop. For example, a model might repeatedly generate the same phrase:
"The order shipped late. The order shipped late. The order shipped..."
The repetition penalty originates in CTRL, which applies a penalty greater than 1 to previously generated tokens and uses 1.2 in its reported sampling setup.[9] A sign-aware multiplicative rule divides positive repeated-token logits and multiplies negative repeated-token logits, moving both downward in preference. The function below implements that rule without mutating its input tensor:
```python
import torch

def apply_repetition_penalty(
    logits: torch.Tensor,
    generated_ids: torch.Tensor,
    penalty: float = 1.2
) -> torch.Tensor:
    """Penalize tokens that already appeared in the output (multiplicative)."""
    adjusted = logits.clone()

    for i in range(adjusted.shape[0]):
        unique_tokens = torch.unique(generated_ids[i])

        for token_id in unique_tokens:
            # Sign-aware rule: both branches lower the token's preference.
            if adjusted[i, token_id] > 0:
                adjusted[i, token_id] /= penalty
            else:
                adjusted[i, token_id] *= penalty

    return adjusted

logits = torch.tensor([[2.4, -0.5, 0.7]])
penalized = apply_repetition_penalty(logits, torch.tensor([[0, 1]]), penalty=1.2)
print("original logits:", logits.tolist()[0])
print("penalized logits:", [round(value, 3) for value in penalized.tolist()[0]])

assert penalized[0, 0].item() < logits[0, 0].item()
assert penalized[0, 1].item() < logits[0, 1].item()
assert penalized[0, 2].item() == logits[0, 2].item()
```

```text
original logits: [2.4000000953674316, -0.5, 0.699999988079071]
penalized logits: [2.0, -0.6, 0.7]
```

A common formulation is:

$$z_i' = z_i - \lambda_{\text{freq}} \, c_i - \lambda_{\text{pres}} \, \mathbb{1}[c_i > 0]$$

where $z_i$ is the logit of token $i$ and $c_i$ counts how often that token has already been generated.
Two knobs discourage repetition. The frequency penalty grows with each use: if a token appears 5 times, it gets 5 times the penalty. The presence penalty is a flat one-time penalty the moment a token is used at all, encouraging topic diversity. Exact formulas and defaults vary by stack, but this captures the core idea.
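A minimal sketch of the two additive penalties, assuming the common form in which the frequency term scales with the token's count and the presence term is subtracted once for any token that has appeared. The function name, parameter names, and toy values are hypothetical, and real stacks may define these knobs differently.

```python
from collections import Counter

def apply_frequency_presence_penalties(
    logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0
):
    """Subtract a count-scaled and a one-time penalty from already-used tokens."""
    counts = Counter(generated_ids)
    adjusted = list(logits)  # copy so the caller's logits are not mutated
    for token_id, count in counts.items():
        adjusted[token_id] -= frequency_penalty * count + presence_penalty
    return adjusted

logits = [2.0, 1.5, 0.5, -1.0]
generated_ids = [0, 0, 0, 1]  # token 0 used three times, token 1 once

adjusted = apply_frequency_presence_penalties(
    logits, generated_ids, frequency_penalty=0.2, presence_penalty=0.5
)
print([round(value, 2) for value in adjusted])
```

Token 0 loses 0.2 × 3 + 0.5 = 1.1, token 1 loses 0.2 + 0.5 = 0.7, and unused tokens keep their logits, so the heavily repeated token falls furthest.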
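In practice, production systems stack several logit transformations together. The high-level mental model is stable:

raw logits -> repetition, frequency, and presence penalties -> temperature scaling -> truncation (top-k, top-p, or min-p) -> softmax -> sample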
The exact order is implementation-specific. Some stacks apply temperature before truncation, while others place temperature later in the sampler chain. In interviews, the important point isn't memorizing one canonical order. It's knowing that these controls are layered transformations of the logits, and you should check the implementation of the stack you're using.
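Order matters because temperature changes which tokens a cumulative-mass filter admits. The sketch below, under illustrative assumptions (toy logits, a temperature of 1.5, a top-p of 0.9, and hypothetical helper names), compares the candidate set when temperature is applied before top-p with the set when top-p runs on the unscaled distribution.

```python
from math import exp

def softmax(logits):
    peak = max(logits)
    weights = [exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [weight / total for weight in weights]

def nucleus_indices(probs, p):
    """Indices of the smallest high-probability prefix whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for index in order:
        kept.append(index)
        mass += probs[index]
        if mass >= p:
            break
    return kept

logits = [4.0, 2.0, 1.0, 0.0, -0.5, -1.0]
temperature, top_p = 1.5, 0.9

# Order A: temperature first, then top-p on the flattened distribution.
temperature_first = nucleus_indices(softmax([l / temperature for l in logits]), top_p)

# Order B: top-p on the unscaled distribution; temperature would then
# only reshape the probabilities of the surviving tokens.
truncation_first = nucleus_indices(softmax(logits), top_p)

print("temperature then top-p keeps:", temperature_first)
print("top-p then temperature keeps:", truncation_first)
```

With these numbers, flattening first lets a third token into the nucleus, while truncating first keeps only two. The same nominal settings can therefore behave differently across stacks.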
This concrete stack makes that layered mental model visible on one next-token step.
The following tested mini-pipeline shows one common pattern: apply repetition penalty, special-case greedy decoding when temperature=0, otherwise scale logits and sample from a top-p nucleus. Min-p fits into the same slot as top-p in real systems.
```python
import torch

def apply_temperature(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for temperature=0")
    return logits / temperature

def nucleus_sampling(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")

    sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    sorted_indices_to_remove = cumulative_probs > p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    indices_to_remove = torch.zeros_like(logits, dtype=torch.bool)
    indices_to_remove.scatter_(dim=-1, index=sorted_indices, src=sorted_indices_to_remove)

    filtered_logits = logits.clone()
    filtered_logits[indices_to_remove] = float("-inf")
    probs = torch.softmax(filtered_logits, dim=-1)
    return torch.multinomial(probs, 1)

def apply_repetition_penalty(
    logits: torch.Tensor,
    generated_ids: torch.Tensor,
    penalty: float = 1.2,
) -> torch.Tensor:
    if penalty <= 0:
        raise ValueError("penalty must be positive")

    adjusted = logits.clone()
    for batch_index in range(adjusted.shape[0]):
        for token_id in torch.unique(generated_ids[batch_index]):
            if adjusted[batch_index, token_id] > 0:
                adjusted[batch_index, token_id] /= penalty
            else:
                adjusted[batch_index, token_id] *= penalty
    return adjusted

def sample_next_token(
    logits: torch.Tensor,
    generated_ids: torch.Tensor,
    temperature: float = 0.8,
    top_p: float = 0.9,
    repetition_penalty: float = 1.1,
) -> torch.Tensor:
    """One common pipeline: penalties -> temperature -> nucleus sampling."""
    logits = apply_repetition_penalty(logits, generated_ids, repetition_penalty)

    # temperature=0 means deterministic decoding, so skip scaling and sampling.
    if temperature == 0:
        return logits.argmax(dim=-1, keepdim=True)

    logits = apply_temperature(logits, temperature)
    return nucleus_sampling(logits, p=top_p)

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.2, 0.2, -1.0]])
generated_ids = torch.tensor([[0, 1, 1]])

greedy_token = sample_next_token(logits, generated_ids, temperature=0.0)
sampled_token = sample_next_token(logits, generated_ids, temperature=0.8, top_p=0.9)

print(f"greedy token id: {greedy_token.item()}")
print(f"sampled token id: {sampled_token.item()}")
```

```text
greedy token id: 0
sampled token id: 1
```

| Evaluation slice | Temperature candidates | Sampler candidates | Repetition-penalty candidates | Goal |
|---|---|---|---|---|
| Code generation | 0.0, 0.2, 0.5 | Greedy, top-p=0.95 | 1.0, 1.05 | Tests and format validity |
| Factual QA | 0.0, 0.2, 0.5 | Greedy, top-p=0.9 | 1.0, 1.05 | Grounded accuracy |
| Chat | 0.3, 0.7, 1.0 | Top-p=0.9, min-p=0.1 | 1.0, 1.1 | Helpfulness and repetition |
| Creative writing | 0.7, 1.0, 1.3 | Top-p=0.95, min-p=0.05 | 1.0, 1.1 | Diversity and coherence |
| Example published eval setting (DeepSeek-R1) | 0.6 | Top-p=0.95 | Not reported | One published configuration[8] |
The first four rows are candidate sweeps, not recommended defaults. The right settings depend on the model, tokenizer, task, and measured failure costs.
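To make the sweep concrete, the candidate values in one row can be expanded into a grid of configurations to evaluate. The following is a minimal sketch rather than a recommended harness: `build_sweep` and the dictionary keys are hypothetical names, and pairing greedy decoding only with temperature 0.0 is an assumption about how the table's columns are meant to combine.

```python
from itertools import product


def build_sweep(temperatures, samplers, repetition_penalties):
    """Expand one table row into decoding configurations (hypothetical helper)."""
    configs = []
    for temperature, sampler, penalty in product(
        temperatures, samplers, repetition_penalties
    ):
        # Assumption: greedy runs only at temperature 0.0, and temperature 0.0
        # is only paired with greedy, since both describe the same behavior.
        if (sampler == "greedy") != (temperature == 0.0):
            continue
        configs.append(
            {
                "temperature": temperature,
                "sampler": sampler,
                "repetition_penalty": penalty,
            }
        )
    return configs


chat_sweep = build_sweep([0.3, 0.7, 1.0], ["top_p=0.9", "min_p=0.1"], [1.0, 1.1])
code_sweep = build_sweep([0.0, 0.2, 0.5], ["greedy", "top_p=0.95"], [1.0, 1.05])

print(len(chat_sweep))  # 3 temperatures x 2 samplers x 2 penalties = 12
print(len(code_sweep))  # 2 greedy configs + 4 top-p configs = 6
```

Each configuration would then be scored on the slice's own goal column, so the grid stays small enough to rerun whenever the model or prompt changes.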
Here is a short debugging exercise you can reason through without running code. Read the symptoms, identify the likely cause, and pick a fix.
Symptom: A customer-service chatbot keeps ending every reply with "Please let me know if you need anything else. Please let me know if you need anything else."
Likely cause: The repetition penalty is set to 1.0 (no penalty) or temperature is so low that the model keeps rediscovering the same high-probability closing phrase.
Experiment: Compare a mild repetition penalty against a slightly wider sampling distribution on held-out replies, tracking task correctness and loop rate.
Symptom: A brainstorming assistant set to temperature=1.5 and top_p=0.95 occasionally outputs tokens that don't form real words, like "flarble" or "xquz."
Likely cause: At high temperature the distribution flattens, and top-p's cumulative threshold pulls in a long tail of very low-probability tokens. Some of those are nonsense.
Experiment: Hold temperature fixed and compare a tighter top-p value against min-p; measure coherence and diversity instead of assuming one filter improves both.
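The tail effect in this scenario can be checked on a toy distribution. The sketch below uses only the standard library; the logits are invented for illustration (one strong token, three plausible ones, and 200 weak tail tokens), and `top_p_keep` and `min_p_keep` are hypothetical helper names, not functions from any sampling library.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_keep(probs, p):
    """Smallest set of highest-probability tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept


def min_p_keep(probs, min_p):
    """Tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return [i for i, q in enumerate(probs) if q >= threshold]


# Invented toy logits: 1 strong token, 3 plausible tokens, 200 tail tokens.
toy_logits = [6.0] + [4.0] * 3 + [0.0] * 200
probs = softmax(toy_logits, temperature=1.5)

top_p_kept = top_p_keep(probs, p=0.95)
min_p_kept = min_p_keep(probs, min_p=0.1)

print(f"top-p keeps {len(top_p_kept)} tokens")
print(f"min-p keeps {len(min_p_kept)} tokens")
```

In this toy case top-p admits most of the tail because the weak tokens add up to a large share of the mass, while min-p keeps only the four tokens that are within a factor of ten of the best one. Whether that tighter cutoff helps on a real model is exactly what the experiment should measure.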
Symptom: A support bot answering "What is your return policy?" sometimes says "No returns accepted" or "Returns are unlimited."
Likely cause: Temperature is too high for a factual task, or top-p is so wide that unlikely but wrong answers get sampled.
Fix: First improve grounding or output constraints, then compare conservative decoding settings on an accuracy evaluation. Lower temperature can't repair a wrong high-probability answer.
When choosing a decoding strategy for a new application, first determine whether output variance is allowed and which failures matter. Then compare candidate algorithms on task metrics, repetition, format validity, and latency.
| Strategy | Deterministic? | Adapts to context? | Candidate evaluation fit |
|---|---|---|---|
| Greedy | Yes | No | Extraction, classification |
| Beam search | Yes | No | Translation, summarization |
| Top-k | No | No | Simple truncation baseline |
| Top-p | No | Yes, by cumulative mass | General generation |
| Min-p | No | Yes | Confidence-scaled truncation to compare with top-p |
| Temperature | Modifier | Modifier | Controls global diversity |
| Repetition penalty | Modifier | Modifier | Suppresses repeated-token reuse |
0.8 to 0.2, but correctness stayed good. What sampler change comes first?
Add a mild repetition or frequency penalty before raising temperature again. The immediate problem is looped phrasing, not loss of task accuracy, so stay in the decoding stack first and only widen temperature if the penalty isn't enough.
Translation has a narrow target meaning, so keeping several high-probability candidates helps recover a better full sequence. Chat has many acceptable answers, so larger beams often over-optimize for generic high-likelihood continuations and drain personality from the response.[4]
top_k=10. What is wrong with that setup?
Top-k is a fixed cutoff. With a flat creative distribution, k=10 may still exclude many reasonable alternatives. Compare dynamic filters such as top-p or min-p on the target distribution rather than increasing temperature alone.
temperature=1.4 and top_p=0.95. Why might min-p help?
At high temperature, many individually weak tokens can enter a top-p nucleus because their cumulative mass adds up. Min-p ties the cutoff to the top token's probability, so if the model still has a clear best candidate, more of the tail gets dropped.
Log the actual sampler path: penalties, masks, temperature, truncation, and final token choice. Verify operation order, default values, special handling of temperature=0, and any engine-specific processors. Treat the runtime as code to inspect, not a black box that must share another framework's semantics.
| Decision | Pass bar |
|---|---|
| Greedy vs sampling | You can name one task where greedy is the right default and one where it will likely sound degenerate. |
| Beam search | You can explain why beam width can help translation yet hurt open-ended chat, and when length penalty matters. |
| Temperature | You can connect lower or higher temperature to the actual shape change in one worked probability distribution. |
| Top-k vs top-p vs min-p | You can choose which truncation rule better fits a peaked distribution and which better fits a flat one. |
| Penalty knobs | You can explain when repetition, frequency, and presence penalties solve style loops versus when they don't touch factual errors. |
| Production choice | You can defend one sampler stack for factual QA, one for code, and one for creative chat under a real latency or quality goal. |
| Implementation bugs | You can name two runtime mistakes that change outputs even when the visible knobs look the same. |
Symptom: Answers became safer and more repetitive after you increased beam width.
Likely cause: Beam search is pushing toward the most likely overall continuation, which is often generic in open-ended dialogue.
Fix: Use beam search for translation or tightly grounded summarization. For chat, switch back to a sampling strategy and tune temperature plus truncation instead.
Symptom: You lowered temperature, but the model still states wrong facts confidently.
Likely cause: Temperature changes distribution sharpness, not factual grounding. If the model lacks evidence, it can still choose a wrong but high-probability continuation.
Fix: Keep decoding conservative for factual tasks, but fix retrieval, prompt grounding, or output constraints rather than treating temperature as a truth knob.
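A small numeric check shows why temperature can't act as a truth knob: scaling logits by a positive temperature never changes which option ranks first. This is a sketch under stated assumptions; the answers and scores are hypothetical, "Returns within 30 days" is an invented stand-in for the correct policy, and whole answers are treated as single choices for simplicity.

```python
import math


def softmax(logits, temperature):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores where the model prefers a wrong answer.
answers = ["No returns accepted", "Returns within 30 days", "Returns are unlimited"]
logits = [2.0, 1.5, 0.5]

top_probability = {}
for temperature in (1.0, 0.5, 0.1):
    probs = softmax(logits, temperature)
    best = max(range(len(probs)), key=lambda i: probs[i])
    top_probability[temperature] = (best, probs[best])
    print(f"T={temperature}: top answer = {answers[best]!r}, p = {probs[best]:.3f}")
```

The wrong answer stays on top at every temperature, and lowering the temperature only makes the model commit to it more often.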
Symptom: Creative mode started producing weird fragments or off-topic words.
Likely cause: Higher temperature flattened the distribution, so top-p admitted a long tail of individually weak candidates.
Fix: Lower temperature, tighten top-p, or test min-p so the cutoff tracks the strength of the best token.
Symptom: Outputs stayed narrow in creative prompts and erratic in factual prompts with the same k.
Likely cause: Top-k doesn't adapt to distribution shape. The same cutoff is too small in flat contexts and too large in peaked ones.
Fix: Treat top-k as a simple baseline. Compare top-p or min-p when candidate-set size should react to distribution shape.
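The mismatch can be seen by measuring how much probability mass the same k covers in two differently shaped distributions. The distributions below are invented for illustration, and `top_k_mass` is a hypothetical helper name.

```python
def top_k_mass(probs, k):
    """Total probability covered by the k most likely tokens."""
    return sum(sorted(probs, reverse=True)[:k])


# Peaked: one dominant token, one runner-up, and 98 near-zero tail tokens.
peaked = [0.90, 0.05] + [0.05 / 98] * 98
# Flat: 100 equally plausible tokens.
flat = [0.01] * 100

k = 10
peaked_mass = top_k_mass(peaked, k)
flat_mass = top_k_mass(flat, k)

print(f"peaked: top-{k} covers {peaked_mass:.3f} of the mass")
print(f"flat:   top-{k} covers {flat_mass:.3f} of the mass")
```

With k=10, the peaked case keeps eight tail tokens that contribute almost nothing, while the flat case discards about 90% of equally reasonable options.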
Symptom: Another engine gives different outputs even with matching documented settings.
Likely cause: Sampler order, default values, or special cases such as temperature=0 differ across implementations.
Fix: Inspect the actual sampler path, log intermediate logits if needed, and validate behavior on representative prompts before blaming the model.
temperature=0 was sent through the sampling math directly
Symptom: The runtime crashes, returns NaNs, or behaves inconsistently when someone sets temperature to zero.
Likely cause: The implementation divided logits by zero instead of treating temperature=0 as a greedy special case.
Fix: Handle temperature=0 explicitly as deterministic decoding, and keep the sampling path only for temperatures greater than zero.
Symptom: Sampling looks biased or unstable even though the nucleus cutoff seems correct.
Likely cause: Tokens outside the nucleus were removed, but the remaining probabilities weren't renormalized before sampling.
Fix: After masking the tail, renormalize the surviving probabilities so the sampler draws from a valid distribution.
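The renormalization step can be sketched in a few lines. The probabilities are invented for illustration, and `renormalize` and `inverse_cdf_sample` are hypothetical helpers; the inverse-CDF sampler stands in for any hand-written sampler that assumes its input sums to 1.

```python
def renormalize(probs, keep):
    """Rescale the surviving probabilities so they sum to 1."""
    kept_mass = sum(probs[i] for i in keep)
    return {i: probs[i] / kept_mass for i in keep}


def inverse_cdf_sample(probs_by_token, u):
    """Walk the cumulative distribution; return None if u exceeds the total mass."""
    cumulative = 0.0
    for token_id, prob in probs_by_token.items():
        cumulative += prob
        if u < cumulative:
            return token_id
    return None


probs = [0.5, 0.3, 0.15, 0.05]
keep = [0, 1, 2]  # nucleus for p=0.9: 0.5 + 0.3 + 0.15 = 0.95 >= 0.9

masked_only = {i: probs[i] for i in keep}  # tail removed, mass is only 0.95
nucleus = renormalize(probs, keep)  # tail removed, mass is 1.0

u = 0.97  # a uniform draw that lands in the missing 5% of mass
print(inverse_cdf_sample(masked_only, u))  # no token is selected
print(inverse_cdf_sample(nucleus, u))  # a valid token is selected
```

Some library samplers normalize weights internally, so the bug tends to show up in custom sampling code or when masked probabilities are reused for logging and scoring.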
The papers behind these techniques are cited throughout the article: Holtzman et al. (2020) on neural text degeneration,[2] Meister et al. (2020) on the beam search paradox,[4] Fan et al. (2018) on top-k sampling,[5] Wu et al. (2016) on GNMT beam-search length normalization,[3] and Nguyen et al. (2025) on min-p as a newer confidence-scaled variant,[6] along with a 2025 critical reanalysis that questions min-p's reported gains.[7]
[1] Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023.
[2] Holtzman, A., et al. (2020). The Curious Case of Neural Text Degeneration. ICLR 2020.
[3] Wu, Y., et al. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
[4] Meister, C., Cotterell, R., & Vieira, T. (2020). If Beam Search is the Answer, What was the Question? EMNLP 2020.
[5] Fan, A., Lewis, M., & Dauphin, Y. (2018). Hierarchical Neural Story Generation. ACL 2018.
[6] Nguyen, et al. (2025). Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs. ICLR 2025 Oral.
[7] Schaeffer, R., Kazdan, J., & Denisov-Blanch, Y. (2025). Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models.
[8] DeepSeek Team (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
[9] Keskar, N. S., et al. (2019). CTRL: A Conditional Transformer Language Model for Controllable Generation.