Decoding Strategies: Greedy to Nucleus

Compare decoding strategies for text generation: greedy, beam search, top-k, nucleus (top-p), temperature, repetition controls, and newer variants like min-p.

Transformer internals eventually produce logits. Decoding strategy controls how that final probability distribution becomes one visible completion. Compare greedy decoding, beam search, temperature sampling, top-k, nucleus sampling, and min-p so you can tune output quality intentionally.

An incident-status assistant that hasn't chosen its next token yet may assign a 70% chance to stable, a 20% chance to degraded, and a 10% chance to rollback. Language models do this at every generation step: they produce raw scores for many possible next tokens, convert those scores into probabilities, then use a decoding policy to pick what appears on screen.

That final choice has real product consequences. Runbook assistants usually need stable, factual phrasing, while brainstorming tools need more variety. Code generators often benefit from determinism; story generators may sound lifeless if they always pick the highest-probability token. Decoding is the control surface that turns the same model distribution into those different behaviors.

Decoding is one stage of the broader inference pipeline, which also includes acceleration techniques such as speculative decoding. Speculative decoding accelerates generation by asking a draft model to propose tokens that the target model can verify in parallel.^{[1] Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192}

Figure: One next-token distribution feeds three policies. Greedy commits to the highest-probability token, beam search keeps multiple partial paths alive, and nucleus sampling draws from inside the cumulative top-p mass.

How LLMs Choose the Next Token

To compare strategies, first inspect what happens inside the model at each generation step.

Autoregressive generation

LLMs generate text one token at a time. They predict the next token, append it to the prompt, and repeat. This is called autoregressive generation (auto = self, regressive = predicting the next step from previous ones).

At every step the model outputs a vector of raw scores called logits, one score per token in the vocabulary. A function called softmax converts those logits into probabilities that sum to 1. The decoding strategy then decides which token to emit.
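The predict, append, repeat loop can be sketched without a real model. In this minimal example the vocabulary, the `TOY_LOGITS` lookup table, and the `generate` helper are hypothetical stand-ins for a trained network. Only the loop structure (logits, softmax, decoding policy, append) carries over to a real LLM.

```python
from math import exp

# Hypothetical toy "model": maps the last token to logits over a tiny vocabulary.
VOCAB = ["degraded", "stable", ".", "<eos>"]
TOY_LOGITS = {
    "is": [2.0, 1.0, -1.0, -2.0],
    "degraded": [-2.0, -2.0, 2.0, 0.0],
    "stable": [-2.0, -2.0, 2.0, 0.0],
    ".": [-3.0, -3.0, -3.0, 3.0],
}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    weights = [exp(value - peak) for value in logits]
    total = sum(weights)
    return [weight / total for weight in weights]

def generate(prompt_tokens, max_new_tokens=5):
    """Predict the next token, append it, and repeat until EOS or the step limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(TOY_LOGITS[tokens[-1]])
        # Decoding policy: plain argmax here; other policies swap in at this line.
        next_token = VOCAB[probs.index(max(probs))]
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens

print(generate(["The", "canary", "status", "is"]))
```

The decoding policy is the single line that picks `next_token`. Every strategy discussed in this article replaces that line and leaves the rest of the loop unchanged.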

A concrete probability distribution

Suppose the prompt so far is "The canary status is". Use this plausible probability distribution for the very next token:

Token	Logit	Probability
`degraded`	2.00	0.45
`in`	1.60	0.30
`on`	0.90	0.15
`broken`	-1.11	0.02
`...(many tail tokens)`	...	0.08 total

These numbers are fabricated for clarity. Logits are shown only for named candidates; omitted tail tokens jointly carry the final probability mass. One token (degraded) leads the pack, a few others are plausible, and many individually weak tokens share the tail.
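One way to sanity-check the table, assuming the listed logits and probabilities come from the same softmax: the ratio between two softmax probabilities depends only on the difference of their logits, $P_a / P_b = \exp(z_a - z_b)$, so the named rows can be compared even though the tail logits are omitted. The helper names below are illustrative.

```python
from math import exp

# (logit, probability) pairs copied from the table above
named = {
    "degraded": (2.00, 0.45),
    "in": (1.60, 0.30),
    "on": (0.90, 0.15),
    "broken": (-1.11, 0.02),
}

def logit_ratio(a, b):
    """Probability ratio implied by softmax: exp(z_a - z_b)."""
    return exp(named[a][0] - named[b][0])

def table_ratio(a, b):
    """Probability ratio read directly from the table."""
    return named[a][1] / named[b][1]

for other in ("in", "on", "broken"):
    implied = logit_ratio("degraded", other)
    listed = table_ratio("degraded", other)
    print(f"degraded/{other}: logits imply {implied:.2f}, table gives {listed:.2f}")
```

The two ratios agree to within rounding for every named pair, which is what a consistent logit column should produce.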

Greedy decoding is like picking the best-looking decode branch right now. It may lead into a dead end after the next token. Beam search keeps several branches alive in parallel and picks the sequence with the best total score at the end. The locally optimal choice isn't necessarily the globally best sequence.

The question is: given this distribution, how do we pick the next token? And how does that choice change the whole sentence?

Greedy Decoding and Its Trap

Why the most likely next word isn't the best sentence

The simplest strategy is greedy decoding: at every step, pick the single most probable token and commit to it.

Using the table above, greedy decoding chooses degraded. The sentence becomes:

"The canary status is degraded"

That's a fine sentence. But greedy decoding can fail when the best next word leads to a worse overall sentence. Imagine a slightly different distribution where degraded is most likely, but needs followed by review produces a more useful completion. Greedy doesn't explore that possibility because it locks in degraded immediately.

This is the greedy trap: the locally best token doesn't imply the globally best sequence. A first token with slightly lower probability can lead to a stronger full answer after the next step.

Human-written text often doesn't follow the model's single most likely continuation. Holtzman et al. show that maximum-likelihood decoding can drift toward bland, repetitive text, which is why pure argmax-style decoding often doesn't sound human.^{[2] The Curious Case of Neural Text Degeneration. https://arxiv.org/abs/1904.09751}

greedy-can-miss-higher-scoring-sequence.py

first_step = {"degraded": 0.55, "needs": 0.45}
continuations = {"degraded": {".": 0.40}, "needs": {"review": 0.90}}

greedy_first = max(first_step, key=first_step.get)
path_probabilities = {
    "degraded .": first_step["degraded"] * continuations["degraded"]["."],
    "needs review": first_step["needs"] * continuations["needs"]["review"],
}
highest_sequence = max(path_probabilities, key=path_probabilities.get)

print(f"greedy first token: {greedy_first}")
print(f"sequence probabilities: {path_probabilities}")
print(f"higher-probability sequence: {highest_sequence}")

assert greedy_first == "degraded"
assert highest_sequence == "needs review"

Local versus sequence score

greedy first token: degraded
sequence probabilities: {'degraded .': 0.22000000000000003, 'needs review': 0.405}
higher-probability sequence: needs review

The algorithm

At each step, select the token with the highest probability:

$w_t = \arg\max_{w} P(w | w_{<t})$

Reading the formula

At each step, look at every possible next token, pick the one with the highest probability, and commit to it. Simple and fast, but myopic: the next-token winner can still start a weak full sequence.

This basic PyTorch implementation assumes a single prompt in the batch, a model whose forward call returns .logits, and an integer EOS token ID. The loop repeatedly takes the argmax token and stops when it reaches EOS or the maximum number of decoding steps:

reading-the-formula.py

import torch

def greedy_decode(
    model: torch.nn.Module,
    prompt_ids: torch.Tensor,
    max_length: int,
    eos_token_id: int
) -> torch.Tensor:
    """Generate one sequence with greedy decoding (batch size 1)."""
    input_ids = prompt_ids
    with torch.no_grad():
        for _ in range(max_length):
            logits = model(input_ids).logits[:, -1, :]
            next_token = logits.argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            if next_token[0, 0].item() == eos_token_id:
                break
    return input_ids

Strengths

Deterministic and fast: $O(V)$ per step (single argmax over vocabulary)
Useful baseline for constrained tasks: classification-style outputs or extraction when output variability is undesirable

Limitations

Suboptimal globally: The locally best token doesn't imply the best sequence
Repetitive in open-ended generation: Maximum-likelihood decoding can fall into loops or generic output in evaluated settings.^{[2] The Curious Case of Neural Text Degeneration. https://arxiv.org/abs/1904.09751}

Beam search

Algorithm

Instead of keeping only the best token, maintain B (beam width) candidate sequences, expanding each by its top B tokens and keeping the B best expansions overall. This simplified PyTorch snippet assumes batch size 1 and a model whose forward call returns .logits, and it keeps the highest-scoring partial sequences at every step. For readability, it leaves out batching and length normalization; length normalization appears in the next subsection:

algorithm.py

import torch

def beam_search(
    model: torch.nn.Module,
    prompt_ids: torch.Tensor,
    beam_width: int = 5,
    max_length: int = 100,
    eos_token_id: int = 2
) -> torch.Tensor:
    """Generate one sequence with beam search (batch size 1)."""
    # Each beam: (sequence, cumulative_log_prob)
    beams = [(prompt_ids, 0.0)]
    completed = []

    with torch.no_grad():
        for _ in range(max_length):
            all_candidates = []
            for seq, score in beams:
                # If this beam already ended, record it
                if seq[0, -1].item() == eos_token_id:
                    completed.append((seq, score))
                    continue

                # After step 1, beams have different lengths, so process each separately.
                # In production you would use KV caching or padded batching instead.
                logits = model(seq).logits[:, -1, :]
                log_probs = torch.log_softmax(logits, dim=-1)
                top_k = log_probs.topk(beam_width, dim=-1)

                for i in range(beam_width):
                    new_seq = torch.cat([seq, top_k.indices[:, i:i+1]], dim=-1)
                    new_score = score + top_k.values[:, i].item()
                    all_candidates.append((new_seq, new_score))

            if not all_candidates:
                beams = []
                break

            # Keep top beam_width candidates
            beams = sorted(all_candidates, key=lambda x: x[1],
                           reverse=True)[:beam_width]

    # Active beams become truncated final candidates at the length cutoff.
    # Compare them with EOS-completed beams instead of discarding either set.
    final_candidates = completed + beams
    return max(final_candidates, key=lambda x: x[1])[0]

An EOS token finalizes a beam early. Reaching max_length also finalizes every beam that is still active; a full implementation would label those beams as truncated. Selection must compare both sets under the same scoring rule. Returning completed whenever that list is non-empty silently discards active candidates at the cutoff, even when one has the best score.
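The selection rule can be illustrated on a hand-written end-of-search state. The beam texts, scores, and the two helper functions below are hypothetical; the sketch only contrasts a "completed first" policy with pooled selection.

```python
# Hypothetical end-of-search state: (text, cumulative log-probability, status)
completed = [("degraded <eos>", -2.30, "eos")]
active_at_cutoff = [("needs review and", -1.10, "truncated")]

def completed_first(completed_beams, active_beams):
    """Flawed policy: ignore active beams whenever any beam reached EOS."""
    pool = completed_beams if completed_beams else active_beams
    return max(pool, key=lambda beam: beam[1])

def pooled(completed_beams, active_beams):
    """Rank EOS-completed and truncated beams under the same scoring rule."""
    return max(completed_beams + active_beams, key=lambda beam: beam[1])

print("completed-first picks:", completed_first(completed, active_at_cutoff))
print("pooled picks:", pooled(completed, active_at_cutoff))
```

Carrying the status label with each candidate lets the caller decide how to treat a truncated winner, for example by flagging it rather than presenting it as a finished answer.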

Length penalty

One common length penalty, used in the Google Neural Machine Translation system, is:^{[3] Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. https://arxiv.org/abs/1609.08144}

$\text{score}(Y) = \frac{\log P(Y)}{(5 + |Y|)^\alpha / 6^\alpha}$

Reading the formula

Without correction, longer sequences receive lower unnormalized probability because each extra token multiplies the sequence probability by a value smaller than 1. The exponent $\alpha$ controls how strongly you compensate for that short-sequence bias. When $\alpha = 0$, there's no correction. Larger values reduce the bias toward short completions.

$\alpha$	Effect	How to use it
0	No compensation for short-sequence bias	Useful baseline
0.6	Moderate compensation	Candidate value near the range GNMT often found best on its development sets
1.0	Stronger compensation for longer outputs	Compare on your task before adopting

Tune $\alpha$ on a validation set. It isn't a universal translation or summarization default: the right value depends on model behavior, stopping criteria, and the task metric.
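To see how $\alpha$ changes the short-sequence bias, the sketch below scores a short and a long hypothesis whose tokens are all equally likely. The per-token probability of 0.9 and the lengths 2 and 10 are assumptions chosen for illustration, not values from GNMT.

```python
from math import log

def gnmt_score(log_probability, length, alpha):
    """Length-normalized score: log P(Y) divided by ((5 + |Y|) / 6) ** alpha."""
    penalty = ((5 + length) / 6) ** alpha
    return log_probability / penalty

PER_TOKEN_LOG_PROB = log(0.9)  # assumed: every token has probability 0.9

def short_minus_long(alpha, short_len=2, long_len=10):
    """How far the short hypothesis outscores the long one at a given alpha."""
    short = gnmt_score(short_len * PER_TOKEN_LOG_PROB, short_len, alpha)
    long_score = gnmt_score(long_len * PER_TOKEN_LOG_PROB, long_len, alpha)
    return short - long_score

gaps = {alpha: short_minus_long(alpha) for alpha in (0.0, 0.6, 1.0)}
for alpha, gap in gaps.items():
    print(f"alpha={alpha}: short hypothesis leads by {gap:.3f}")
```

The lead of the short hypothesis shrinks as $\alpha$ grows but does not vanish in this setup, which is one reason to tune the exponent against a task metric instead of assuming a value.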

beam-length-normalization.py

def gnmt_score(log_probability, length, alpha=0.6):
    penalty = ((5 + length) / 6) ** alpha
    return log_probability / penalty

short = {"text": "late", "log_probability": -1.00, "length": 2}
complete = {"text": "needs review", "log_probability": -1.08, "length": 5}

raw_choice = max([short, complete], key=lambda item: item["log_probability"])["text"]
normalized_choice = max(
    [short, complete],
    key=lambda item: gnmt_score(item["log_probability"], item["length"]),
)["text"]
print(f"raw sequence score chooses: {raw_choice}")
print(f"length-normalized score chooses: {normalized_choice}")

assert raw_choice == "late"
assert normalized_choice == "needs review"

Length-normalized ranking

raw sequence score chooses: late
length-normalized score chooses: needs review

When to use beam search

Good fit: machine translation. The output usually has one target meaning to preserve.
Good fit: structured summarization. The output should stay close to source evidence.
Poor fit: open-ended chat or creative writing. High-likelihood paths often become generic.

Counterintuitively, beam search with larger beams can produce more likely but less interesting text.^{[4] If Beam Search is the Answer, What was the Question? https://arxiv.org/abs/2010.02650} In open-ended generation, increasing beam width can reduce output quality because the most probable sequence is often generic and repetitive.

Beam width also has a serving cost. A width-$B$ search keeps up to $B$ active continuations at each step. Efficient runtimes batch those hypotheses and reorder their KV-cache state, but larger beams still increase decoder work and memory pressure relative to greedy decoding.

Temperature scaling

What temperature does

Before adding randomness, you need a dial that controls how much randomness to allow. Temperature is that dial. It reshapes the probability distribution before we sample from it.

Return to our running example. The raw model gives us these probabilities for the next token after "The canary status is":

Token	Original Probability
`degraded`	0.45
`in`	0.30
`on`	0.15
`broken`	0.02
tail	0.08

This is what happens when we apply different temperatures and then run softmax:

Token	$T = 0.3$ (sharp)	$T = 1.0$ (original)	$T = 1.5$ (flat)
`degraded`	~0.78	0.45	~0.37
`in`	~0.20	0.30	~0.28
`on`	~0.02	0.15	~0.18
`broken`	~0.00	0.02	~0.05
tail	~0.00	0.08	~0.12

For this worked table only, the tail is treated as one aggregated category and each original probability is reshaped proportionally to $P_i^{1/T}$ before renormalization. A real decoder divides each individual token logit by $T$ before softmax; it never aggregates the tail first. At $T = 0.3$, degraded is so dominant that sampling usually picks it. At $T = 1.5$, the probabilities spread out, and even broken becomes a realistic sample. As $T \to 0$, decoding collapses toward greedy argmax. As $T \to \infty$, the distribution approaches uniform.
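The worked table can be reproduced in a few lines. Because softmax probabilities are proportional to $\exp(z_i)$, raising each probability to the power $1/T$ and renormalizing gives the same result as dividing logits by $T$ before softmax. The sketch keeps the article's simplification of treating the tail as one bucket; the `reshape` helper name is illustrative.

```python
def reshape(probabilities, temperature):
    """Reweight each probability as p ** (1 / T), then renormalize to sum to 1."""
    weights = [p ** (1.0 / temperature) for p in probabilities]
    total = sum(weights)
    return [weight / total for weight in weights]

buckets = ["degraded", "in", "on", "broken", "tail"]
original = [0.45, 0.30, 0.15, 0.02, 0.08]

sharp = [round(p, 2) for p in reshape(original, 0.3)]
flat = [round(p, 2) for p in reshape(original, 1.5)]

print("T=0.3:", dict(zip(buckets, sharp)))
print("T=1.5:", dict(zip(buckets, flat)))
# T=0.3: {'degraded': 0.78, 'in': 0.2, 'on': 0.02, 'broken': 0.0, 'tail': 0.0}
# T=1.5: {'degraded': 0.37, 'in': 0.28, 'on': 0.18, 'broken': 0.05, 'tail': 0.12}
```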

Temperature controls distribution sharpness like a sampler strictness dial. Low temperature ($T < 1$) keeps the model focused on the top choices. High temperature ($T > 1$) spreads probability across more tokens, allowing less common recovery actions when the situation is ambiguous.

The formula

Temperature modifies the logit distribution before softmax (the function that converts raw model scores into probabilities):

$P(w_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$

Reading the formula

Divide the raw logits by temperature $T$ before applying softmax. When $T < 1$, the division amplifies differences between logits, making the top choice even more dominant. When $T > 1$, it compresses differences, spreading probability more evenly across options. An implementation should special-case temperature=0 as deterministic decoding rather than divide by zero.

This function applies temperature scaling to raw logits. It divides logits by the temperature before they pass through softmax. A temperature below 1.0 sharpens the distribution; a value above 1.0 flattens it:

reading-the-formula-2.py

import torch

def apply_temperature(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Scale logits by temperature. Handle temperature=0 as greedy decoding elsewhere."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for temperature=0")
    return logits / temperature

scaled = apply_temperature(torch.tensor([2.0, 1.0]), temperature=0.5)
print("scaled logits:", scaled.tolist())
try:
    apply_temperature(torch.tensor([2.0, 1.0]), temperature=0.0)
except ValueError as error:
    print("zero temperature:", error)

assert scaled.tolist() == [4.0, 2.0]

Temperature guard

scaled logits: [4.0, 2.0]
zero temperature: temperature must be > 0; use greedy decoding for temperature=0

Mathematical intuition

Entropy measures how spread out a probability distribution is:

$H(P) = -\sum_i p_i \log_2 p_i$

Low entropy means most probability sits on a few candidates. Higher entropy means more candidates carry meaningful mass. The illustration computes entropy over the same five displayed buckets, including the aggregated tail bucket, so its values are directly comparable across temperatures.

Temperature $T$	Effect on Distribution	Entropy
$T \to 0$	Approaches one-hot (greedy)	near 0
$T = 1$	Original model distribution	Baseline
$T > 1$	Flattened (more uniform)	Increases
$T \to \infty$	Approaches uniform random	near $\log_2 V$
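The trend in the table can be checked numerically on the five buckets from the running example. This is a minimal sketch: the `reshape` helper applies the same $P_i^{1/T}$ shortcut as the worked temperature table, and the tail is again treated as a single bucket.

```python
from math import log2

def reshape(probabilities, temperature):
    """Reweight each probability as p ** (1 / T), then renormalize."""
    weights = [p ** (1.0 / temperature) for p in probabilities]
    total = sum(weights)
    return [weight / total for weight in weights]

def entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

original = [0.45, 0.30, 0.15, 0.02, 0.08]
entropies = {t: entropy_bits(reshape(original, t)) for t in (0.3, 1.0, 1.5)}

for temperature, bits in entropies.items():
    print(f"T={temperature}: {bits:.2f} bits")
print(f"uniform upper bound over 5 buckets: {log2(5):.2f} bits")
```

Entropy rises with temperature and stays below $\log_2 5$, the value a uniform distribution over the five buckets would reach.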

mathematical-intuition.py

from math import exp

def softmax(logits):
    shifted = [logit - max(logits) for logit in logits]
    weights = [exp(logit) for logit in shifted]
    total = sum(weights)
    return [weight / total for weight in weights]

def apply_temperature(logits, temperature):
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for temperature=0")
    return [logit / temperature for logit in logits]

logits = [2.0, 1.0, 0.0]
low_t = softmax(apply_temperature(logits, 0.5))
high_t = softmax(apply_temperature(logits, 2.0))

assert low_t[0] > high_t[0]
assert high_t[-1] > low_t[-1]
assert round(sum(high_t), 12) == 1.0

Figure: Lower temperature concentrates probability in one token, while higher temperature spreads mass across more tokens. As temperature rises, the same token set carries a flatter distribution, shown as both bars and stacked probability mass.

Why it works

Dividing logits by $T < 1$ amplifies differences between logits, making the distribution peakier. When $T > 1$, the same operation shrinks differences, making probabilities more similar. This controls the sharpness of the sampling distribution.

Starting sweep candidates

Evaluation slice	Temperature candidates	What to measure
Code generation	0.0, 0.2, 0.5	Tests passed, format validity, diversity
Factual QA / runbook lookup	0.0, 0.2, 0.5	Grounded accuracy and unsupported claims
General chat / runbook assistance	0.3, 0.7, 1.0	Helpfulness, repetition, policy adherence
Creative writing / descriptions	0.7, 1.0, 1.3	Diversity and coherence

Top-k Sampling

Algorithm

Top-k sampling was popularized in neural story generation as a way to restrict sampling to the top $k$ most probable tokens, then renormalize the distribution.^{[5] Hierarchical Neural Story Generation. https://arxiv.org/abs/1805.04833} The running example shows how that restriction changes the distribution.

Suppose the model gives these probabilities after the prompt "The canary status is":

Token	Probability	Cumulative
`degraded`	0.45	0.45
`in`	0.30	0.75
`on`	0.15	0.90
`broken`	0.02	0.92
omitted tail tokens, individually below 0.02	0.08 total	1.00

With top-k where $k = 2$, we keep only degraded and in, renormalize their probabilities to sum to 1, and sample from that smaller set. on, broken, and the tail are locked out.
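The renormalization step is simple arithmetic: divide each surviving probability by the total mass that survived. The sketch below works on the named tokens from the table; the `top_k_renormalize` helper is illustrative, and the omitted tail tokens are left out because each is below 0.02 and cannot reach the top 2.

```python
def top_k_renormalize(tokens_and_probs, k):
    """Keep the k most probable tokens and rescale them to sum to 1."""
    ranked = sorted(tokens_and_probs, key=lambda item: item[1], reverse=True)[:k]
    kept_mass = sum(probability for _, probability in ranked)
    return {token: round(probability / kept_mass, 3) for token, probability in ranked}

named = [("degraded", 0.45), ("in", 0.30), ("on", 0.15), ("broken", 0.02)]
print(top_k_renormalize(named, k=2))
# {'degraded': 0.6, 'in': 0.4}
```

The 0.25 of probability mass that belonged to on, broken, and the tail is redistributed proportionally, so degraded rises from 0.45 to 0.60 and in from 0.30 to 0.40.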

This implementation of top-k sampling limits the probability distribution to only the $k$ most likely next tokens. It masks out all other tokens by setting their logits to negative infinity before applying softmax and sampling from the remaining probability mass:

algorithm-2.py

import torch

def top_k_sampling(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Sample from the top-k most probable tokens."""
    if k < 1:
        raise ValueError("k must be >= 1")
    k = min(k, logits.shape[-1])
    top_k_values, top_k_indices = logits.topk(k, dim=-1)
    filtered = torch.full_like(logits, float('-inf'))
    filtered.scatter_(dim=-1, index=top_k_indices, src=top_k_values)
    probs = torch.softmax(filtered, dim=-1)
    return torch.multinomial(probs, 1)

The fixed-k problem

The primary limitation of top-k sampling is its rigidity. A useful value for $k$ varies sharply depending on the context of the generation:

Context	Distribution Shape	Candidate $k$ values to evaluate
"The rollback runbook says"	Potentially peaked	1, 3, 10
"This product is great for"	Potentially flatter	10, 30, 50
"The service"	Prompt-dependent	Measure rather than assume

When the distribution is highly peaked, a large $k$ like 50 forces the sampler to include dozens of irrelevant tail tokens. If the temperature is high, those tail tokens can accumulate enough probability mass to be selected, causing off-topic generation. Conversely, when the distribution is flat, a small $k$ like 10 can cut off valid continuations and make the output too narrow. This inability to adapt to distribution shape motivated dynamic truncation approaches.
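The interaction between a large fixed $k$ and a high temperature can be made concrete. The sketch assumes a deliberately peaked step with one strong token (logit 8) and 49 weak tokens (logit 0), so that $k = 50$ keeps everything; those logits are illustrative, not measured from a model.

```python
from math import exp

def softmax(logits, temperature):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]

# Assumed peaked step: one plausible token and 49 weak tail tokens
LOGITS = [8.0] + [0.0] * 49

def tail_share_inside_top_k(temperature, k=50):
    """Fraction of the kept probability mass that belongs to non-top tokens."""
    kept = sorted(softmax(LOGITS, temperature), reverse=True)[:k]
    return sum(kept[1:]) / sum(kept)

for temperature in (1.0, 2.0):
    share = tail_share_inside_top_k(temperature)
    print(f"T={temperature}: tail tokens hold {share:.1%} of the top-50 mass")
```

At $T = 1$ the weak tokens are nearly harmless, but at $T = 2$ they jointly hold close to half of the sampling mass even though no single one of them is plausible.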

the-fixed-k-problem.py

def top_k_tokens(tokens_and_probs, k):
    return [token for token, _ in sorted(tokens_and_probs, key=lambda item: item[1], reverse=True)[:k]]

def top_p_tokens(tokens_and_probs, p):
    kept = []
    cumulative = 0.0
    for token, probability in sorted(tokens_and_probs, key=lambda item: item[1], reverse=True):
        kept.append(token)
        cumulative += probability
        if cumulative >= p:
            break
    return kept

distribution = [
    ("degraded", 0.45),
    ("in", 0.30),
    ("on", 0.15),
    ("paused", 0.04),
    ("held", 0.03),
    ("broken", 0.02),
    ("rollback", 0.01),
]

assert top_k_tokens(distribution, 2) == ["degraded", "in"]
assert top_p_tokens(distribution, 0.8) == ["degraded", "in", "on"]

Nucleus (top-p) Sampling

Algorithm

Top-k keeps exactly 50 candidate tokens if $k = 50$, even when only two are plausible or hundreds deserve consideration. Top-p keeps adding candidate tokens until their cumulative probability covers the requested mass. On an easy incident-status reply, only a few candidates qualify; on an ambiguous escalation, the candidate set is much longer. This dynamic threshold adapts naturally to each context.

Instead of fixing $k$, dynamically select the smallest high-probability prefix whose cumulative probability reaches or exceeds threshold $p$. This approach, also known as nucleus sampling, addresses the fixed candidate-count limitation of top-k.^{[2] The Curious Case of Neural Text Degeneration. https://arxiv.org/abs/1904.09751}

Using the same probability table:

Token	Probability	Cumulative
`degraded`	0.45	0.45
`in`	0.30	0.75
`on`	0.15	0.90
`broken`	0.02	0.92
omitted tail tokens, individually below 0.02	0.08 total	1.00

With top-p where $p = 0.8$, we include tokens until the cumulative probability reaches 0.8. That means degraded, in, and on are in the nucleus. broken and the tail are excluded. If the distribution were more peaked and degraded had 0.85 probability, the nucleus would contain only that one token.

$V_p = \{w_{(1)}, \ldots, w_{(m)}\}, \qquad m = \min \left\{j : \sum_{i=1}^{j} P(w_{(i)}) \ge p \right\}$

where $w_{(i)}$ are sorted by decreasing probability.

nucleus-renormalization.py

def nucleus_distribution(tokens_and_probs, threshold):
    kept = []
    mass = 0.0
    for token, probability in sorted(tokens_and_probs, key=lambda item: item[1], reverse=True):
        kept.append((token, probability))
        mass += probability
        if mass >= threshold:
            break
    return [(token, round(probability / mass, 3)) for token, probability in kept]

tokens = [("degraded", 0.45), ("in", 0.30), ("on", 0.15), ("broken", 0.02)]
nucleus = nucleus_distribution(tokens, threshold=0.8)
print("renormalized nucleus:", nucleus)

assert nucleus == [("degraded", 0.5), ("in", 0.333), ("on", 0.167)]
assert round(sum(probability for _, probability in nucleus), 3) == 1.0

Nucleus renormalization

renormalized nucleus: [('degraded', 0.5), ('in', 0.333), ('on', 0.167)]

Reading the formula

Sort all tokens by probability, then include tokens from the top until their cumulative probability reaches $p$ (e.g., 0.9). On easy predictions where one token has 95% probability, only that token qualifies. On hard predictions where many tokens share probability, a large set is included. The candidate set adapts to the distribution shape.

Nucleus sampling is straightforward to implement in PyTorch. The function sorts the logits, computes cumulative probability mass, masks away tokens outside the nucleus, and then samples from the renormalized distribution:

reading-the-formula-3.py

import torch

def nucleus_sampling(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Top-p (nucleus) sampling: dynamic vocabulary truncation."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")

    sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Mark positions where the cumulative mass has already reached the threshold
    sorted_indices_to_remove = cumulative_probs >= p
    # Shift right so the token that first reaches p stays in the nucleus,
    # which also guarantees that at least one token survives
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    indices_to_remove = torch.zeros_like(logits, dtype=torch.bool)
    indices_to_remove.scatter_(dim=-1, index=sorted_indices, src=sorted_indices_to_remove)

    filtered_logits = logits.clone()
    filtered_logits[indices_to_remove] = float('-inf')
    probs = torch.softmax(filtered_logits, dim=-1)
    return torch.multinomial(probs, 1)

How distribution shape changes nucleus size

At the same $p = 0.9$ threshold, the number of survivors depends entirely on the probabilities presented at that generation step:

Sorted probabilities	Smallest prefix reaching 0.9	Survivors
0.92, 0.04, 0.02, 0.01, 0.01	0.92	1
0.45, 0.30, 0.15, 0.07, 0.03	0.45 + 0.30 + 0.15 = 0.90	3
0.22, 0.19, 0.17, 0.15, 0.13, 0.08, 0.06	0.22 + 0.19 + 0.17 + 0.15 + 0.13 + 0.08 = 0.94	6

The threshold stays fixed while the candidate count changes. Real nucleus size depends on the model, tokenizer, prompt, temperature, and exact probability shape.
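The three rows can be recomputed with a short helper. Probabilities are written as integer percentages here, an implementation choice made so that sums such as 45 + 30 + 15 compare exactly against the threshold instead of depending on floating-point rounding.

```python
def nucleus_size(sorted_percentages, threshold_percent):
    """Size of the smallest prefix whose cumulative mass reaches the threshold."""
    mass = 0
    for count, percent in enumerate(sorted_percentages, start=1):
        mass += percent
        if mass >= threshold_percent:
            return count
    return len(sorted_percentages)

rows = [
    [92, 4, 2, 1, 1],
    [45, 30, 15, 7, 3],
    [22, 19, 17, 15, 13, 8, 6],
]
sizes = [nucleus_size(row, threshold_percent=90) for row in rows]
print(sizes)
# [1, 3, 6]
```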

Top-p's weakness: the long tail problem

Top-p has a failure mode to evaluate at high temperatures ($T > 1$): temperature flattening can cause many individually weak tokens to enter the cumulative-mass set. In those settings, a $p = 0.9$ nucleus can expand substantially. This is one motivation for newer confidence-scaled variants such as min-p.

temperature-expands-a-nucleus.py

from math import exp

def probabilities(logits, temperature):
    weights = [exp(logit / temperature) for logit in logits]
    total = sum(weights)
    return [weight / total for weight in weights]

def nucleus_size(probs, threshold):
    mass = 0.0
    for index, probability in enumerate(sorted(probs, reverse=True), start=1):
        mass += probability
        if mass >= threshold:
            return index

logits = [4.0, 2.0, 1.0, 0.0, -0.5, -1.0]
focused = nucleus_size(probabilities(logits, 0.7), threshold=0.9)
flattened = nucleus_size(probabilities(logits, 1.5), threshold=0.9)
print(f"nucleus size at T=0.7: {focused}")
print(f"nucleus size at T=1.5: {flattened}")

assert flattened > focused

Temperature changes nucleus size

nucleus size at T=0.7: 1
nucleus size at T=1.5: 3

Beyond nucleus: min-p

The insight

Top-p ranks candidate next tokens and keeps enough of them to cover 90% of total probability mass. Concentrated probability produces a small shortlist; spread-out probability produces a long one. Min-p instead says "only keep candidates whose probability is at least 10% of the top-ranked candidate." A dominant top candidate sets a high bar. A weak top candidate lowers the bar. The threshold adapts to relative peak probability, not a fixed cumulative cutoff.

Min-p, proposed by Nguyen et al. and published as an ICLR 2025 oral paper, scales the cutoff by the top token's probability instead of using a cumulative-mass threshold:^{[6] Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs. https://arxiv.org/abs/2407.01082}

$V_{\min\text{-}p} = \{w : P(w) \geq \rho \cdot \max_v P(v)\}$

Reading the formula

Find the most likely token, then keep tokens whose probability is at least a fraction $\rho$ of that top token's probability. If the top token has 80% probability and $\rho = 0.1$, tokens with at least 8% probability are kept. If the top token has 5%, the bar drops to 0.5%, adapting to distribution shape.

What min-p changes relative to top-p

Scenario	Top-p ($p = 0.9$)	Min-p ($\rho = 0.1$)
Model confident ($P_{\max} = 0.8$)	Keeps enough tokens to reach 90% mass, which can still include a long tail	Keeps only tokens with $P \geq 0.08$
Model uncertain ($P_{\max} = 0.05$)	Still targets 90% cumulative mass	Lowers the cutoff to $P \geq 0.005$, so more alternatives survive
High temperature ($T = 1.5$)	Flattened tails can enter the nucleus	Relative cutoff can trim more of that tail
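The first two rows can be checked on synthetic distributions. The shapes below are assumptions chosen to match the $P_{\max}$ values in the table: a confident step with one 0.80 token plus twenty 0.01 tokens, and an uncertain step with twenty 0.05 tokens. Exact fractions are used so threshold comparisons don't depend on floating-point rounding.

```python
from fractions import Fraction

def top_p_count(probs, p):
    """Number of tokens in the smallest prefix whose mass reaches p."""
    mass = Fraction(0)
    for count, probability in enumerate(sorted(probs, reverse=True), start=1):
        mass += probability
        if mass >= p:
            return count
    return len(probs)

def min_p_count(probs, rho):
    """Number of tokens with probability at least rho times the maximum."""
    threshold = rho * max(probs)
    return sum(1 for probability in probs if probability >= threshold)

# Assumed illustrative distributions, each summing to 1
confident = [Fraction(80, 100)] + [Fraction(1, 100)] * 20
uncertain = [Fraction(5, 100)] * 20
p, rho = Fraction(9, 10), Fraction(1, 10)

print("confident: top-p keeps", top_p_count(confident, p), "| min-p keeps", min_p_count(confident, rho))
print("uncertain: top-p keeps", top_p_count(uncertain, p), "| min-p keeps", min_p_count(uncertain, rho))
# confident: top-p keeps 11 | min-p keeps 1
# uncertain: top-p keeps 18 | min-p keeps 20
```

On the confident step top-p admits ten tail tokens to reach 90% mass while min-p keeps only the leader. On the uncertain step min-p admits every candidate, slightly more than top-p.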

Figure: Peaked and flatter distributions show how top-k, top-p, and min-p keep different token sets. Checkmarks show surviving tokens. Top-k keeps fixed ranks, top-p stops at cumulative mass, and min-p uses a cutoff relative to the strongest token.

what-min-p-changes-relative-to-top-p.py

def min_p_tokens(tokens_and_probs, rho):
    max_probability = max(probability for _, probability in tokens_and_probs)
    threshold = rho * max_probability
    return [token for token, probability in tokens_and_probs if probability >= threshold]

confident = [("stable", 0.80), ("degraded", 0.09), ("purple", 0.01)]
uncertain = [("stable", 0.05), ("degraded", 0.04), ("in", 0.03)]

assert min_p_tokens(confident, rho=0.1) == ["stable", "degraded"]
assert min_p_tokens(uncertain, rho=0.5) == ["stable", "degraded", "in"]

The min-p sampling function dynamically calculates a threshold based on the maximum probability in the distribution. It takes raw logits, a min_p scaling factor, and a temperature as inputs. The function applies temperature, zeroes out token probabilities that fall below the dynamic threshold (calculated as min_p times the maximum probability), and renormalizes before returning a sampled token:

what-min-p-changes-relative-to-top-p-2.py

import torch

def min_p_sampling(logits: torch.Tensor, min_p: float = 0.1, temperature: float = 1.0) -> torch.Tensor:
    """Min-p sampling: confidence-scaled dynamic truncation."""
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be in [0, 1]")
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for temperature=0")

    # Apply temperature
    logits = logits / temperature
    probs = torch.softmax(logits, dim=-1)

    # Dynamic threshold: min_p * max probability
    max_prob = probs.max(dim=-1, keepdim=True).values
    threshold = min_p * max_prob

    # Zero out tokens below threshold
    filtered_probs = probs.clone()
    filtered_probs[probs < threshold] = 0.0

    # Renormalize and sample
    filtered_probs = filtered_probs / filtered_probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(filtered_probs, 1)

When to think about min-p

Min-p is a newer truncation heuristic to evaluate, rather than a universal successor to nucleus sampling. Nguyen et al. propose it for controlling low-probability candidates admitted by top-p at higher temperatures.^{[6] Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs. https://arxiv.org/abs/2407.01082} A subsequent critical reanalysis reports that min-p didn't reliably improve quality-diversity tradeoffs against commonly used samplers in its experiments and disputes broad adoption claims.^{[7] Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models. https://arxiv.org/abs/2506.13681} Meanwhile, DeepSeek-R1 reports temperature 0.6 with top-p 0.95 in one published evaluation configuration.^{[8] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://arxiv.org/abs/2501.12948} Treat any of these settings as experiment inputs, not presets for a different model or product.

Method	Threshold	Adapts to distribution shape?	Main tradeoff
Top-k	Fixed $k$ tokens	No	Simple, but rigid
Top-p	Fixed $p$ cumulative mass	Partly	Dynamic, but can admit a long tail
Min-p	$\rho \times P_{\max}$	Yes	Relative cutoff; compare empirically with top-p
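
To make the table concrete, the sketch below counts how many tokens each rule keeps on one peaked and one flat distribution. The logits, the helper names, and the settings k=4, p=0.9, and min_p=0.1 are illustrative assumptions, not values from any model, and plain Python lists stand in for tensors so the example runs without dependencies.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_count(probs, k):
    # Top-k always keeps k tokens (or the whole vocabulary if it is smaller).
    return min(k, len(probs))

def top_p_count(probs, p):
    # Smallest number of top-ranked tokens whose cumulative mass reaches p.
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

def min_p_count(probs, min_p):
    # Keep every token at least min_p times as likely as the best token.
    threshold = min_p * max(probs)
    return sum(1 for prob in probs if prob >= threshold)

# Hypothetical 8-token vocabularies: one confident step, one uncertain step.
peaked = softmax([8.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5])
flat = softmax([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3])

def kept_counts(probs):
    return {
        "top_k": top_k_count(probs, k=4),
        "top_p": top_p_count(probs, p=0.9),
        "min_p": min_p_count(probs, min_p=0.1),
    }

print("peaked:", kept_counts(peaked))  # {'top_k': 4, 'top_p': 1, 'min_p': 1}
print("flat:", kept_counts(flat))      # {'top_k': 4, 'top_p': 7, 'min_p': 8}
```

In this toy case top-k keeps four tokens regardless of shape, while top-p and min-p both shrink to one token on the peaked step and widen on the flat one. Whether that adaptivity improves outputs still has to be measured on the target task.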

Repetition penalty

The problem

Autoregressive models can fall into degenerate repetition, where the generation process gets stuck in a loop. For example, a model might repeatedly generate the same phrase:

"The deploy rolled back. The deploy rolled back. The deploy rolled..."

How the penalty works

CTRL introduced penalized sampling and reported that a penalty around 1.2 balanced truthful generation against repetition in its experiments.^{[9]Reference 9CTRL: A Conditional Transformer Language Model for Controllable Generation.https://arxiv.org/abs/1909.05858} Common runtimes such as Hugging Face Transformers expose a related sign-aware logit processor: for a penalty above 1, it divides positive repeated-token logits by the penalty and multiplies negative ones by it, so both move down in preference. A penalty of 1.0 leaves the logits unchanged. This function mirrors that runtime behavior without mutating its input tensor:

how-the-penalty-works.py

import torch

def apply_repetition_penalty(
    logits: torch.Tensor,
    generated_ids: torch.Tensor,
    penalty: float = 1.2
) -> torch.Tensor:
    """Penalize tokens that already appeared in the output (multiplicative)."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")

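    # Work on a copy so the caller's logits are left untouched.
    adjusted = logits.clone()

    for i in range(adjusted.shape[0]):
        # Each previously generated token is penalized once, however often it appeared.
        unique_tokens = torch.unique(generated_ids[i])

        for token_id in unique_tokens:
            if adjusted[i, token_id] > 0:
                # Positive logit: dividing by a penalty above 1 shrinks it toward zero.
                adjusted[i, token_id] /= penalty
            else:
                # Zero or negative logit: multiplying by a penalty above 1 pushes it lower.
                adjusted[i, token_id] *= penalty

    return adjusted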

logits = torch.tensor([[2.4, -0.5, 0.7]])
penalized = apply_repetition_penalty(logits, torch.tensor([[0, 1]]), penalty=1.2)
print("original logits:", logits.tolist()[0])
print("penalized logits:", [round(value, 3) for value in penalized.tolist()[0]])

assert penalized[0, 0].item() < logits[0, 0].item()
assert penalized[0, 1].item() < logits[0, 1].item()
assert penalized[0, 2].item() == logits[0, 2].item()

Sign-aware repetition penalty

original logits: [2.4000000953674316, -0.5, 0.699999988079071]
penalized logits: [2.0, -0.6, 0.7]

Frequency and presence penalties

A common formulation is:

$\text{logit}_{\text{adjusted}} = \text{logit} - \alpha \cdot \text{count}(\text{token}) - \beta \cdot \mathbb{1}[\text{count}(\text{token}) > 0]$

Reading the formula

Two knobs discourage repetition. The frequency penalty $\alpha$ grows with each use: a token that has appeared 5 times has $5\alpha$ subtracted from its logit. The presence penalty $\beta$ is a flat, one-time deduction applied as soon as a token has appeared at all, which encourages topic diversity. Exact formulas and defaults vary by stack, but this captures the core idea.
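
The formula can be sketched in a few lines. This is a minimal illustration of the subtraction above, not any provider's implementation: the function name, the example logits, and the penalty values 0.1 and 0.5 are assumptions, and plain lists stand in for tensors.

```python
from collections import Counter

def apply_frequency_presence_penalties(
    logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0
):
    """Subtract alpha * count plus a flat beta from every token already generated."""
    counts = Counter(generated_ids)
    adjusted = list(logits)  # copy so the caller's logits are not mutated
    for token_id, count in counts.items():
        adjusted[token_id] -= frequency_penalty * count + presence_penalty
    return adjusted

# Hypothetical 3-token vocabulary: token 0 appeared three times, token 1 once.
logits = [2.0, 1.0, 0.5]
history = [0, 0, 0, 1]
adjusted = apply_frequency_presence_penalties(
    logits, history, frequency_penalty=0.1, presence_penalty=0.5
)
print([round(value, 3) for value in adjusted])  # [1.2, 0.4, 0.5]
```

Token 0 loses 3 × 0.1 + 0.5 = 0.8, token 1 loses 0.1 + 0.5 = 0.6, and the unseen token 2 is untouched. Unlike the sign-aware multiplicative penalty, this additive form lowers a logit by the same amount whether it is positive or negative.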

Combining Strategies in Production

In practice, production systems stack several logit transformations together. The high-level mental model is stable:

Start with raw logits.
Apply logit processors such as penalties, masks, or forced-token constraints.
Branch by policy. Deterministic decoding takes argmax from the adjusted logits.
A sampling path applies warpers such as temperature, top-k, top-p, or min-p, then samples from the surviving distribution.

The exact order is implementation-specific. Some stacks apply temperature before truncation, while others place temperature later in the sampler chain. In interviews, don't memorize one canonical order. Know that these controls are layered transformations of the logits, and check the implementation of the stack you're using.

This concrete stack makes that layered mental model visible on one next-token step.

Production sampler stack where raw logits become adjusted logits after a repetition penalty, greedy decoding chooses B, and a top-p branch keeps B, A, and C before sampling C. — One sampler ordering: the penalty lowers repeated A below B, greedy chooses B, and top-p keeps B, A, and C before the illustrated draw lands on C.

This tested mini-pipeline shows one common pattern: apply repetition penalty, special-case greedy decoding when temperature=0, otherwise scale logits and sample from a top-p nucleus. Min-p fits into the same slot as top-p in real systems.

combining-strategies-in-production.py

import torch

def apply_temperature(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for temperature=0")
    return logits / temperature

def nucleus_sampling(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")

    # Rank tokens from most to least likely and accumulate their probabilities.
    sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Mark tokens past the point where cumulative mass exceeds p. Shifting the
    # mask right by one keeps the token that crosses the threshold, and the
    # top-ranked token is always kept.
    sorted_indices_to_remove = cumulative_probs > p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False

    # Map the mask from sorted order back to vocabulary order.
    indices_to_remove = torch.zeros_like(logits, dtype=torch.bool)
    indices_to_remove.scatter_(dim=-1, index=sorted_indices, src=sorted_indices_to_remove)

    # Mask removed tokens with -inf so softmax renormalizes over the survivors.
    filtered_logits = logits.clone()
    filtered_logits[indices_to_remove] = float("-inf")
    probs = torch.softmax(filtered_logits, dim=-1)
    return torch.multinomial(probs, 1)

def apply_repetition_penalty(
    logits: torch.Tensor,
    generated_ids: torch.Tensor,
    penalty: float = 1.2,
) -> torch.Tensor:
    if penalty <= 0:
        raise ValueError("penalty must be positive")

    adjusted = logits.clone()
    for batch_index in range(adjusted.shape[0]):
        for token_id in torch.unique(generated_ids[batch_index]):
            if adjusted[batch_index, token_id] > 0:
                adjusted[batch_index, token_id] /= penalty
            else:
                adjusted[batch_index, token_id] *= penalty
    return adjusted

def sample_next_token(
    logits: torch.Tensor,
    generated_ids: torch.Tensor,
    temperature: float = 0.8,
    top_p: float = 0.9,
    repetition_penalty: float = 1.1,
) -> torch.Tensor:
    """One common pipeline: penalties -> temperature -> nucleus sampling."""
    logits = apply_repetition_penalty(logits, generated_ids, repetition_penalty)

    if temperature == 0:
        return logits.argmax(dim=-1, keepdim=True)

    logits = apply_temperature(logits, temperature)
    return nucleus_sampling(logits, p=top_p)

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.2, 0.2, -1.0]])
generated_ids = torch.tensor([[0, 1, 1]])

greedy_token = sample_next_token(logits, generated_ids, temperature=0.0)
sampled_token = sample_next_token(logits, generated_ids, temperature=0.8, top_p=0.9)

print(f"greedy token id: {greedy_token.item()}")
print(f"sampled token id: {sampled_token.item()}")

Output

greedy token id: 0
sampled token id: 1

Evaluation configurations

Evaluation slice	Temperature candidates	Sampler candidates	Repetition-penalty candidates	Goal
Code generation	0.0, 0.2, 0.5	Greedy, top-p=0.95	1.0, 1.05	Tests and format validity
Factual QA	0.0, 0.2, 0.5	Greedy, top-p=0.9	1.0, 1.05	Grounded accuracy
Chat	0.3, 0.7, 1.0	Top-p=0.9, min-p=0.1	1.0, 1.1	Helpfulness and repetition
Creative writing	0.7, 1.0, 1.3	Top-p=0.95, min-p=0.05	1.0, 1.1	Diversity and coherence
Published DeepSeek-R1 pass@1 setting	0.6	Top-p=0.95	Not reported	One evaluation configuration^{[8]Reference 8DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.https://arxiv.org/abs/2501.12948}

The first four rows are candidate sweeps, not recommended defaults. The right settings depend on the model, tokenizer, task, and measured failure costs.
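
One way to turn a table row into runnable experiment inputs is to expand it into a grid of configurations. The sketch below does this for the Chat row only; the dictionary layout and the expand_sweep helper are hypothetical conventions, not part of any evaluation framework.

```python
from itertools import product

# Candidate values copied from the Chat row of the table above.
chat_sweep = {
    "temperature": [0.3, 0.7, 1.0],
    "sampler": [("top_p", 0.9), ("min_p", 0.1)],
    "repetition_penalty": [1.0, 1.1],
}

def expand_sweep(sweep):
    """Return one config dict per combination of candidate values."""
    keys = list(sweep)
    return [
        dict(zip(keys, values))
        for values in product(*(sweep[key] for key in keys))
    ]

configs = expand_sweep(chat_sweep)
print(len(configs))  # 12 combinations: 3 temperatures x 2 samplers x 2 penalties
print(configs[0])
```

Each resulting config would then be scored on the row's goal metrics, here helpfulness and repetition, before any value is promoted to a default.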

Practice: debug this output

Use this short debugging exercise without running code. Read the symptoms, identify the likely cause, and pick a fix.

Scenario 1: Repetitive boilerplate

Symptom: An incident chatbot keeps ending every reply with the same generic closing sentence, repeated twice.
Likely cause: The repetition penalty is set to 1.0 (no penalty) or temperature is so low that the model keeps rediscovering the same high-probability closing phrase.
Experiment: Compare a mild repetition penalty against a slightly wider sampling distribution on held-out replies, tracking task correctness and loop rate.

Scenario 2: Gibberish at high creativity

Symptom: A brainstorming assistant set to temperature=1.5 and top_p=0.95 occasionally outputs tokens that don't form real words, like "flarble" or "xquz."
Likely cause: At high temperature the distribution flattens, and top-p's cumulative threshold pulls in a long tail of very low-probability tokens. Some of those are nonsense.
Experiment: Hold temperature fixed and compare a tighter top-p value against min-p; measure coherence and diversity instead of assuming one filter improves both.

Scenario 3: Factual QA drifts

Symptom: A runbook assistant answering "What is the rollback policy?" sometimes says "Rollback blocked" or "Rollbacks are unlimited."
Likely cause: Temperature is too high for a factual task, or top-p is so wide that unlikely but wrong answers get sampled.
Fix: First improve grounding or output constraints, then compare conservative decoding settings on an accuracy evaluation. Lower temperature can't repair a wrong high-probability answer.

Choosing a policy by task

When choosing a decoding strategy for a new application, first determine whether output variance is allowed and which failures matter. Then compare candidate algorithms on task metrics, repetition, format validity, and latency.

Strategy	Deterministic?	Adapts to context?	Candidate evaluation fit
Greedy	Yes	No	Extraction, classification
Beam search	Yes	No	Translation, summarization
Top-k	No	No	Simple truncation baseline
Top-p	No	Partly, by cumulative mass	General generation
Min-p	No	Yes	Confidence-scaled truncation to compare with top-p
Temperature	Modifier	Modifier	Controls global diversity
Rep penalty	Modifier	Modifier	Suppresses repeated-token reuse

What to check before moving on

Decision	Pass bar
Greedy vs sampling	You can name one task where greedy is the right default and one where it will likely sound degenerate.
Beam search	You can explain why beam width can help translation yet hurt open-ended chat, and when length penalty matters.
Temperature	You can connect lower or higher temperature to the actual shape change in one worked probability distribution.
Top-k vs top-p vs min-p	You can choose which truncation rule better fits a peaked distribution and which better fits a flat one.
Penalty knobs	You can explain when repetition, frequency, and presence penalties solve style loops versus when they don't touch factual errors.
Production choice	You can defend one sampler stack for factual QA, one for code, and one for creative chat under a real latency or quality goal.
Implementation bugs	You can name two runtime mistakes that change outputs even when the visible knobs look the same.

What to remember

Decoding shapes quality: The choice of decoding strategy (deterministic vs. stochastic) changes whether the model output tends toward precision and repetition or diversity and creativity.
A useful mental progression: Start with static truncation (top-k), then dynamic mass-based truncation (nucleus/top-p), then newer confidence-scaled variants such as min-p.
Likelihood isn't generation quality: Holtzman et al. show that maximization-based decoding can produce repetitive, bland open-ended text even when likelihood is high.^{[2]Reference 2The Curious Case of Neural Text Degeneration.https://arxiv.org/abs/1904.09751}
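Temperature vs. Truncation: Temperature ($T$) reshapes the whole distribution (sharpness), while top-p and min-p cut off the tail (truncation). They are separate controls that are usually combined, but they interact: when temperature is applied first, it changes which tokens survive truncation.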
Sampler order is framework-specific: Treat decoding as layered logit transformations, not one universal ordering rule. Check the implementation of the stack you're using.

Common failures and fixes

Beam search made chat sound robotic

Symptom: Answers became safer and more repetitive after you increased beam width.
Likely cause: Beam search is pushing toward the most likely overall continuation, which is often generic in open-ended dialogue.
Fix: Use beam search for translation or tightly grounded summarization. For chat, switch back to a sampling strategy and tune temperature plus truncation instead.

Temperature tweak didn't fix hallucinations

Symptom: You lowered temperature, but the model still states wrong facts confidently.
Likely cause: Temperature changes distribution sharpness, not factual grounding. If the model lacks evidence, it can still choose a wrong but high-probability continuation.
Fix: Keep decoding conservative for factual tasks, but fix retrieval, prompt grounding, or output constraints rather than treating temperature as a truth knob.
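
A small numeric sketch shows why. The three logits below are invented for illustration: the wrong answer is assumed to be the model's top candidate, so lowering temperature only concentrates more mass on it.

```python
import math

def softmax_with_temperature(logits, temperature):
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits where the wrong answer already ranks first.
answers = ["unlimited (wrong)", "two per day (correct)", "other"]
logits = [2.0, 1.5, 0.0]

at_default = softmax_with_temperature(logits, temperature=1.0)
at_low = softmax_with_temperature(logits, temperature=0.2)

print(round(at_default[0], 3))  # 0.574
print(round(at_low[0], 3))      # 0.924
```

Under these assumed logits, dropping temperature from 1.0 to 0.2 raises the wrong answer's probability from about 0.57 to about 0.92. Sharpening helps only when the highest-probability continuation is already correct.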

High temperature plus top-p pulled in nonsense tail tokens

Symptom: Creative mode started producing weird fragments or off-topic words.
Likely cause: Higher temperature flattened the distribution, so top-p admitted a long tail of individually weak candidates.
Fix: Lower temperature, tighten top-p, or test min-p so the cutoff tracks the strength of the best token.
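
The effect is easy to reproduce on a toy vocabulary. In the sketch below, three strong candidates sit above 97 equally weak tail tokens; the logits, the temperatures, and the settings p=0.95 and min_p=0.1 are assumptions chosen to make the pattern visible, not measurements from a model.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_count(probs, p):
    # Smallest number of top-ranked tokens whose cumulative mass reaches p.
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

def min_p_count(probs, min_p):
    # Tokens at least min_p times as likely as the best token.
    threshold = min_p * max(probs)
    return sum(1 for prob in probs if prob >= threshold)

# Hypothetical 100-token vocabulary: 3 plausible tokens plus a weak tail.
logits = [8.0, 7.0, 6.0] + [0.0] * 97

for temperature in (1.0, 2.0):
    probs = softmax(logits, temperature)
    print(
        f"T={temperature}: top-p keeps {top_p_count(probs, 0.95)}, "
        f"min-p keeps {min_p_count(probs, 0.1)}"
    )
# T=1.0: top-p keeps 3, min-p keeps 3
# T=2.0: top-p keeps 90, min-p keeps 3
```

At the higher temperature the three strong tokens hold only about half of the mass, so top-p has to reach deep into the tail to accumulate 0.95, while the min-p cutoff still scales with the best token. This is one constructed case; a real distribution may behave less cleanly.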

Fixed top-k clipped valid options

Symptom: Outputs stayed narrow in creative prompts and erratic in factual prompts with the same k.
Likely cause: Top-k doesn't adapt to distribution shape. The same cutoff is too small in flat contexts and too large in peaked ones.
Fix: Treat top-k as a simple baseline. Compare top-p or min-p when candidate-set size should react to distribution shape.

Same knobs changed behavior after a runtime swap

Symptom: Another engine gives different outputs even with matching documented settings.
Likely cause: Sampler order, default values, or special cases such as temperature=0 differ across implementations.
Fix: Inspect the actual sampler path, log intermediate logits if needed, and validate behavior on representative prompts before blaming the model.
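
A minimal sketch of how ordering alone can change the candidate set: both functions below use the same temperature and top-p values, but one scales logits before truncating and the other truncates first. The function names and example logits are hypothetical, and neither function is meant to represent a specific engine.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_keep(logits, p):
    """Token ids in the smallest top-ranked set whose mass reaches p."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= p:
            break
    return sorted(kept)

def temperature_then_top_p(logits, temperature, p):
    # Truncation sees the already-flattened (or sharpened) distribution.
    scaled = [x / temperature for x in logits]
    kept = top_p_keep(scaled, p)
    return dict(zip(kept, softmax([scaled[i] for i in kept])))

def top_p_then_temperature(logits, temperature, p):
    # Truncation sees the unscaled distribution; temperature reshapes survivors only.
    kept = top_p_keep(logits, p)
    return dict(zip(kept, softmax([logits[i] / temperature for i in kept])))

logits = [3.0, 2.0, 1.0, 0.0, -1.0]
scaled_first = temperature_then_top_p(logits, temperature=2.0, p=0.9)
truncated_first = top_p_then_temperature(logits, temperature=2.0, p=0.9)

print(sorted(scaled_first))     # [0, 1, 2, 3]
print(sorted(truncated_first))  # [0, 1, 2]
```

With identical visible settings, the first ordering can sample token 3 and the second never can. Comparing surviving token sets like this on a few logged steps is a quick way to confirm whether two runtimes really share a sampler path.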

temperature=0 was sent through the sampling math directly

Symptom: The runtime crashes, returns NaNs, or behaves inconsistently when someone sets temperature to zero.
Likely cause: The implementation divided logits by zero instead of treating temperature=0 as a greedy special case.
Fix: Handle temperature=0 explicitly as deterministic decoding, and keep the sampling path only for temperatures greater than zero.

Truncated probabilities no longer sum to one

Symptom: Logged or returned survivor probabilities sum to less than one after truncation.
Likely cause: Tail probabilities were set to zero, but code that consumes or reports probabilities still expects a normalized distribution. Some sampling APIs accept unnormalized nonnegative weights, but metrics and probability contracts usually don't.
Fix: Either mask logits with negative infinity and run softmax again, or divide surviving probabilities by their remaining mass before returning them.
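
Both fixes lead to the same distribution, which the following sketch checks on a hypothetical four-token example. The logits and the keep mask are assumptions for illustration; in a real sampler the mask would come from top-k, top-p, or min-p.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0, -1.0]
keep = [True, True, False, False]  # hypothetical truncation decision

# The bug: zeroing tail probabilities leaves less than one unit of mass.
probs = softmax(logits)
zeroed = [prob if kept else 0.0 for prob, kept in zip(probs, keep)]
print(round(sum(zeroed), 4))  # 0.8808

# Fix 1: mask logits with negative infinity, then run softmax again.
masked_logits = [x if kept else float("-inf") for x, kept in zip(logits, keep)]
via_masking = softmax(masked_logits)

# Fix 2: divide surviving probabilities by their remaining mass.
remaining = sum(zeroed)
via_division = [prob / remaining for prob in zeroed]

print([round(prob, 4) for prob in via_masking])   # [0.7311, 0.2689, 0.0, 0.0]
print([round(prob, 4) for prob in via_division])  # [0.7311, 0.2689, 0.0, 0.0]
```

Masking logits tends to be the safer default inside a sampler chain because later stages can keep operating on logits, while division is convenient when only probabilities are available.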

Sources and further study

The papers behind these techniques are cited throughout the article: Holtzman et al. (2020) on neural text degeneration,^{[2]Reference 2The Curious Case of Neural Text Degeneration.https://arxiv.org/abs/1904.09751} Meister et al. (2020) on the beam search paradox,^{[4]Reference 4If Beam Search is the Answer, What was the Question?.https://arxiv.org/abs/2010.02650} Fan et al. (2018) on top-k sampling,^{[5]Reference 5Hierarchical Neural Story Generation.https://arxiv.org/abs/1805.04833} Wu et al. (2016) on GNMT beam-search length normalization,^{[3]Reference 3Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.https://arxiv.org/abs/1609.08144} and Nguyen et al. (2025) on min-p as a newer confidence-scaled variant,^{[6]Reference 6Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs.https://arxiv.org/abs/2407.01082} along with a 2025 critical reanalysis that questions min-p's reported gains.^{[7]Reference 7Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Modelshttps://arxiv.org/abs/2506.13681}

Related articles

Inference Mechanics: TTFT, TPS, and KV Cache - How decoding fits into the broader inference pipeline.
Speculative Decoding - Accelerating generation by drafting tokens with a smaller model.
Perplexity and Language Model Evaluation - Understanding why human text has high perplexity.

Practice drill

Create a decoding audit table for three production routes: factual policy answer, creative merchandising copy, and code completion.

Pick baseline settings for temperature, top-p or top-k, repetition penalty, and stop conditions.
Write two failure examples per route: one too deterministic, one too random or unsupported.
Compare outputs under at least three setting changes, then mark accuracy, diversity, latency, and parseability.
Choose one launch setting per route and write the rollback trigger that would force you to change it.

Good decoding work ends with an eval-backed operating policy, not a favorite temperature.

Next Step

Continue to Scaling Laws & Compute-Optimal Training

You can now trace how transformer internals produce logits and how decoding turns them into output tokens. Next, shift from running a trained model to deciding how to train one: scaling laws explain how parameters, data, and compute should be balanced before an expensive training run.


References

[1] Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023. https://arxiv.org/abs/2211.17192

[2] Holtzman, A., et al. (2020). The Curious Case of Neural Text Degeneration. ICLR 2020. https://arxiv.org/abs/1904.09751

[3] Wu, Y., et al. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. https://arxiv.org/abs/1609.08144

[4] Meister, C., Cotterell, R., & Vieira, T. (2020). If Beam Search is the Answer, What was the Question? EMNLP 2020. https://arxiv.org/abs/2010.02650

[5] Fan, A., Lewis, M., & Dauphin, Y. (2018). Hierarchical Neural Story Generation. ACL 2018. https://arxiv.org/abs/1805.04833

[6] Nguyen, et al. (2025). Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs. ICLR 2025 Oral. https://arxiv.org/abs/2407.01082

[7] Schaeffer, R., Kazdan, J., & Denisov-Blanch, Y. (2025). Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models. https://arxiv.org/abs/2506.13681

[8] DeepSeek Team (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://arxiv.org/abs/2501.12948

[9] Keskar, N. S., et al. (2019). CTRL: A Conditional Transformer Language Model for Controllable Generation. https://arxiv.org/abs/1909.05858
