Turn retrieved evidence into controlled text by implementing stable softmax, sampling filters, constrained decoding, beam search, and reproducible generation audits.
The previous chapter put the right policy passage into a support assistant's context. That wasn't the final answer. A language model still has to turn its next-token scores into actual text, one token at a time.
Suppose retrieval found this evidence:
Annual-plan purchases are refundable within 30 days. This customer purchased 12 days ago.
The answer should communicate eligibility. Yet a model may assign some probability to pending, denied, or an irrelevant token such as coupon. Decoding is the algorithm that selects a continuation from those scores. It can make output deterministic, diverse, shorter, constrained, or faster to serve. It can't rescue missing evidence or guarantee that a fluent token is supported by the policy.
In this chapter, you'll build the critical pieces yourself: stable softmax, greedy generation, temperature and truncation, seeded sampling, output masks, beam search, length-aware scoring, and a decoder evaluation receipt. Those are the controls that must be understood before comparing an LLM product experiment.
coupon, and the retrieved 30-day policy is still needed to establish that only eligible is supported for a purchase made 12 days ago.For readable arithmetic, pretend each word below is a single token. The prompt ends with:
1Evidence: Annual-plan purchases are refundable within 30 days. Purchase age: 12 days.
2Answer: Your refund request isAssume a model produces these raw scores, called logits, for the next token:
| Token | Logit | Interpretation |
|---|---|---|
eligible | 3.0 | Supported by the retrieved policy. |
pending | 2.5 | Plausible service language, but it doesn't answer eligibility. |
denied | 1.0 | Conflicts with the evidence. |
coupon | -1.0 | Off-topic tail token. |
Logits aren't probabilities: they can be negative and don't sum to one. Softmax turns them into a probability distribution:
Subtracting the maximum logit doesn't change the probabilities because it rescales every exponential by the same constant. It does prevent overflow when scores become large, which matters in real inference code.
1from math import exp
2
3tokens = ["eligible", "pending", "denied", "coupon"]
4logits = [3.0, 2.5, 1.0, -1.0]
5
6def softmax(scores: list[float]) -> list[float]:
7 offset = max(scores)
8 weights = [exp(score - offset) for score in scores]
9 total = sum(weights)
10 return [weight / total for weight in weights]
11
12probabilities = softmax(logits)
13for token, score, probability in zip(tokens, logits, probabilities):
14 print(f"{token:8} logit={score:>4.1f} probability={probability:.3f}")
15print(f"sum={sum(probabilities):.3f}")1eligible logit= 3.0 probability=0.568
2pending logit= 2.5 probability=0.345
3denied logit= 1.0 probability=0.077
4coupon logit=-1.0 probability=0.010
5sum=1.000The distribution tells a precise story: eligible is most likely, but the model hasn't assigned zero probability to unsupported text. Sampling policy therefore affects product behavior.
The stable formula isn't an optional optimization. Directly calculating exp(1000) overflows ordinary floating-point arithmetic even though the correct softmax distribution is well behaved.
1from math import exp
2
3large_logits = [1000.0, 999.0, 998.0]
4
5try:
6 raw_weights = [exp(score) for score in large_logits]
7 naive = [weight / sum(raw_weights) for weight in raw_weights]
8 print("naive:", naive)
9except OverflowError:
10 print("naive softmax: overflow")
11
12offset = max(large_logits)
13stable_weights = [exp(score - offset) for score in large_logits]
14stable = [weight / sum(stable_weights) for weight in stable_weights]
15print("stable:", [round(probability, 3) for probability in stable])1naive softmax: overflow
2stable: [0.665, 0.245, 0.09]A decoder with unstable arithmetic doesn't produce "slightly different" text. It produces invalid probabilities or fails to generate at all.
0.0 to -4.0. Their exponential weights sum to 1.760; normalization gives eligible probability 0.568, greedy selection appends it, and the updated prefix becomes the next model input.Generation is autoregressive: after selecting one token, the system appends it to the sequence and asks the model for new logits for the next position. In a transformer service, a key-value (KV) cache stores attention state used by later steps; memory management for that growing cache is a central serving concern.[1]
Greedy decoding chooses the highest-logit token at every step. Here are two captured model steps for our policy answer. This is enough to implement the decoder without pretending we have a full model in the notebook.
1captured_logits = [
2 {"eligible": 3.0, "pending": 2.5, "denied": 1.0, "coupon": -1.0},
3 {".": 4.0, "today": 1.5, "after": 0.5},
4]
5
6def greedy_token(scores: dict[str, float]) -> str:
7 return max(scores, key=scores.get)
8
9generated = []
10for step, scores in enumerate(captured_logits, start=1):
11 chosen = greedy_token(scores)
12 generated.append(chosen)
13 print(f"step {step}: chose {chosen!r} from {len(scores)} candidates")
14
15answer = "Your refund request is " + " ".join(generated).replace(" .", ".")
16print(answer)1step 1: chose 'eligible' from 4 candidates
2step 2: chose '.' from 3 candidates
3Your refund request is eligible.Greedy generation is valuable for tests and structured responses because the policy contains no random draw. Its limitation is just as important: picking the best next token doesn't explore a weaker first token that could lead to a better complete sequence.
Temperature divides every logit by a positive value before softmax:
T < 1 sharpens the distribution, pushing more mass toward the current winner.T > 1 flattens it, giving lower-scored tokens more opportunity to be sampled.The formula requires T > 0. For a deterministic baseline, use greedy selection rather than dividing logits by zero.
Entropy measures the uncertainty of a distribution. Higher entropy here means the decoder has more plausible ways to vary the next token, including harmful ones.
1from math import exp, log
2
3tokens = ["eligible", "pending", "denied", "coupon"]
4logits = [3.0, 2.5, 1.0, -1.0]
5
6def softmax(scores: list[float]) -> list[float]:
7 offset = max(scores)
8 weights = [exp(score - offset) for score in scores]
9 total = sum(weights)
10 return [weight / total for weight in weights]
11
12def entropy(probabilities: list[float]) -> float:
13 return -sum(p * log(p) for p in probabilities)
14
15for temperature in (0.5, 1.0, 2.0):
16 probabilities = softmax([score / temperature for score in logits])
17 table = dict(zip(tokens, probabilities))
18 print(
19 f"T={temperature:.1f}: eligible={table['eligible']:.3f} "
20 f"coupon={table['coupon']:.3f} entropy={entropy(probabilities):.3f}"
21 )1T=0.5: eligible=0.721 coupon=0.000 entropy=0.647
2T=1.0: eligible=0.568 coupon=0.010 entropy=0.933
3T=2.0: eligible=0.438 coupon=0.059 entropy=1.190At higher temperature, the unsupported coupon continuation receives more probability along with genuinely useful variation. The printed 0.000 value at T=0.5 is rounded; the finite-logit token still has a small positive probability. Temperature is a diversity control, not a groundedness control.
A sampling decoder draws from probabilities rather than taking the maximum. This is useful when multiple outputs are acceptable, but it makes comparisons noisy unless you record the policy and the random seed used by your own experiment harness.
The next lab uses Python's random-number generator, so the same seed exactly replays the same draws in this local implementation.
1from random import Random
2
3tokens = ["eligible", "pending", "denied", "coupon"]
4probabilities = [0.568, 0.345, 0.077, 0.010]
5
6def draw_sequence(seed: int, length: int = 8) -> list[str]:
7 rng = Random(seed)
8 return rng.choices(tokens, weights=probabilities, k=length)
9
10for seed in (7, 7, 21):
11 print(f"seed={seed}: {' '.join(draw_sequence(seed))}")1seed=7: eligible eligible pending eligible eligible eligible eligible eligible
2seed=7: eligible eligible pending eligible eligible eligible eligible eligible
3seed=21: eligible pending pending eligible eligible pending pending eligibleProduction backends may have their own reproducibility contract, and hardware or implementation changes can defeat byte-for-byte replay. At minimum, store model identifier, prompt version, decoder settings, token limit, and any supported seed alongside each evaluation output.
Temperature leaves every token in play. Top-k sampling keeps up to k highest-probability candidates, then renormalizes their probability mass before drawing. Top-k was used for neural story generation before nucleus sampling was proposed.[2]
1tokens = ["eligible", "pending", "denied", "coupon"]
2probabilities = [0.568, 0.345, 0.077, 0.010]
3
4def top_k(items: list[str], probs: list[float], k: int) -> list[tuple[str, float]]:
5 ranked = sorted(zip(items, probs), key=lambda row: row[1], reverse=True)[:k]
6 retained_mass = sum(probability for _, probability in ranked)
7 return [(token, probability / retained_mass) for token, probability in ranked]
8
9for token, probability in top_k(tokens, probabilities, k=2):
10 print(f"{token:8} renormalized={probability:.3f}")1eligible renormalized=0.622
2pending renormalized=0.378Top-p, also called nucleus sampling, keeps the smallest high-probability prefix whose cumulative mass reaches a threshold p, then renormalizes and samples from that nucleus. Holtzman et al. proposed it after finding that open-ended likelihood-maximizing generation can become repetitive, while unrestricted sampling can draw from an unreliable tail.[3]
1def nucleus(probabilities: dict[str, float], threshold: float) -> list[str]:
2 ranked = sorted(probabilities.items(), key=lambda row: row[1], reverse=True)
3 kept = []
4 cumulative = 0.0
5 for token, probability in ranked:
6 kept.append(token)
7 cumulative += probability
8 if cumulative >= threshold:
9 break
10 return kept
11
12peaked = {"eligible": 0.568, "pending": 0.345, "denied": 0.077, "coupon": 0.010}
13flat = {"eligible": 0.28, "pending": 0.25, "denied": 0.24, "coupon": 0.23}
14
15print("peaked p=0.90:", nucleus(peaked, 0.90))
16print("flat p=0.90:", nucleus(flat, 0.90))1peaked p=0.90: ['eligible', 'pending']
2flat p=0.90: ['eligible', 'pending', 'denied', 'coupon']This is the defining contrast:
| Decoder control | Changes probabilities? | Removes candidates? | Candidate count |
|---|---|---|---|
| Temperature | Yes | No | Unchanged |
| Top-k | Renormalizes retained tokens | Yes | Up to k |
| Top-p | Renormalizes retained tokens | Yes | Adapts to each distribution |
An e-commerce assistant often needs structured output, such as a refund_status field with allowed values. A token mask can remove invalid values before selection. This is constrained decoding.
Make the distinction explicit:
coupon as a refund status.denied.1logits = {"coupon": 6.0, "denied": 4.0, "eligible": 3.0, "pending": 2.0}
2schema_values = {"eligible", "pending", "denied"}
3supported_by_policy = {"eligible"}
4
5def constrained_greedy(scores: dict[str, float], allowed: set[str]) -> str:
6 candidates = {token: score for token, score in scores.items() if token in allowed}
7 if not candidates:
8 raise ValueError("constraint removed every token")
9 return max(candidates, key=candidates.get)
10
11print("unconstrained:", max(logits, key=logits.get))
12print("schema-only:", constrained_greedy(logits, schema_values))
13print("policy-supported:", constrained_greedy(logits, supported_by_policy))1unconstrained: coupon
2schema-only: denied
3policy-supported: eligibleThe output shows why structured generation and grounded generation are different requirements. Good production evaluations test both.
This fixture makes each status one token. In a real tokenizer, a valid value may span several tokens. Production constrained decoding tracks allowed prefixes after every generated token rather than applying one static final-value mask.
Greedy selection can miss a better continuation at a later step. Suppose the response prefix is Your refund request, and the model has two ways to continue:
| First token | Probability | Strongest next token | Probability | Joint probability after two steps |
|---|---|---|---|---|
eligible | 0.40 | . | 0.20 | 0.080 |
is | 0.35 | eligible | 0.70 | 0.245 |
denied | 0.25 | . | 0.30 | 0.075 |
Greedy picks eligible immediately. At this fixed two-step horizon, beam search with width 2 retains both eligible and is, expands each path, then selects is eligible because that partial path scores higher.
This teaching fixture gives each retained prefix one captured second-token continuation. A real beam decoder expands every allowed next token for every retained prefix, combines the new log-probability scores, and prunes back to the configured beam width after each step. Unfinished prefixes remain candidates until an end-of-sequence token or another stop condition marks them complete.
Implement the fixture's two-step search in log space. Adding log probabilities is numerically safer than multiplying many tiny probabilities across a long sequence.
1from math import exp, log
2
3first_step = {"eligible": 0.40, "is": 0.35, "denied": 0.25}
4second_step = {
5 "eligible": {".": 0.20},
6 "is": {"eligible": 0.70},
7 "denied": {".": 0.30},
8}
9
10greedy_first = max(first_step, key=first_step.get)
11greedy_second = max(second_step[greedy_first], key=second_step[greedy_first].get)
12greedy_probability = first_step[greedy_first] * second_step[greedy_first][greedy_second]
13
14prefixes = sorted(first_step, key=first_step.get, reverse=True)[:2]
15candidates = []
16for first in prefixes:
17 for second, p_second in second_step[first].items():
18 candidates.append(((first, second), log(first_step[first]) + log(p_second)))
19
20winner, winner_log_probability = max(candidates, key=lambda row: row[1])
21print("greedy:", f"{greedy_first} {greedy_second}", f"probability={greedy_probability:.3f}")
22print("beam:", " ".join(winner), f"probability={exp(winner_log_probability):.3f}")1greedy: eligible . probability=0.080
2beam: is eligible probability=0.245Beam search is useful when complete-sequence likelihood matters, such as input-grounded transformation or tightly formatted output. For open-ended generation, Holtzman et al. found that maximization-based decoding, including beam search, can produce bland or repetitive text.[3]
Raw sequence probability shrinks with each generated token because it multiplies values no larger than one. That creates a preference for short completions. Beam-search systems therefore define a length-scoring policy rather than comparing raw totals blindly. For example, Google's neural machine translation system used length normalization and a coverage penalty in its beam search.[4]
The simplified lab below uses average log probability to make the issue visible. It isn't a universal beam-search formula; a serving stack must document the formula it actually uses.
1candidates = [
2 {"text": "eligible", "log_probability": -0.25, "tokens": 1},
3 {"text": "eligible within 30 days", "log_probability": -0.42, "tokens": 4},
4]
5
6raw_winner = max(candidates, key=lambda row: row["log_probability"])
7normalized_winner = max(candidates, key=lambda row: row["log_probability"] / row["tokens"])
8
9print("raw winner:", raw_winner["text"])
10print("length-normalized winner:", normalized_winner["text"])
11for row in candidates:
12 average = row["log_probability"] / row["tokens"]
13 print(f"{row['text']:24} raw={row['log_probability']:.3f} average={average:.3f}")1raw winner: eligible
2length-normalized winner: eligible within 30 days
3eligible raw=-0.250 average=-0.250
4eligible within 30 days raw=-0.420 average=-0.105Search scores optimize a measurable objective. They don't automatically optimize usefulness, groundedness, or customer clarity. Those need evaluations.
Temperature, truncation, constraints, and beam search change how text is selected. Speculative decoding addresses a different bottleneck: autoregressive generation normally needs a costly target-model step for each accepted token.
Leviathan, Kalman, and Matias describe a draft-and-verify algorithm in which a faster approximation model proposes tokens and the target model checks several proposals in parallel. Their exact sampling procedure preserves the target model's output distribution while reducing latency in suitable workloads.[5]
Keep this boundary clear:
| Technique | Primary purpose | Can it change selected-output distribution? |
|---|---|---|
| Temperature / top-k / top-p | Change diversity and tail risk | Yes |
| Beam search / length scoring | Search for high-scoring sequences | Yes |
| Token constraints | Enforce allowable forms | Yes |
| Exact speculative decoding | Produce target distribution faster | No |
A shortcut that accepts draft tokens without the target algorithm's verification and correction isn't equivalent speculative decoding. It is a different generator and must be evaluated as such.
Suppose a model change and a sampler change land together. If support resolution improves, you won't know which change caused it. Every generation evaluation should keep a decoder receipt beside each output:
Here is a tiny offline evaluation table. It treats eligible as the only response supported by the retrieved policy and compares three recorded decoder runs.
1runs = {
2 "greedy T=1": ["eligible", "eligible", "eligible", "eligible"],
3 "sample T=2": ["eligible", "pending", "coupon", "denied"],
4 "top-p p=.9": ["eligible", "pending", "eligible", "eligible"],
5}
6supported = {"eligible"}
7
8for receipt, outputs in runs.items():
9 grounded = sum(output in supported for output in outputs)
10 rate = grounded / len(outputs)
11 print(f"{receipt:12} grounded={grounded}/{len(outputs)} rate={rate:.2f}")1greedy T=1 grounded=4/4 rate=1.00
2sample T=2 grounded=1/4 rate=0.25
3top-p p=.9 grounded=3/4 rate=0.75This isn't proof that greedy is always best. It proves the procedure: compare decoders on held-out evidence and preserve settings so the next experiment measures the product change you meant to test.
Before shipping a generated support answer, separate these questions:
| Review question | Failure symptom | Evidence to collect |
|---|---|---|
| Are logits converted stably? | NaN, overflow, or invalid probabilities. | Large-logit unit test. |
| Is diversity deliberate? | Unsupported tail tokens appear unexpectedly. | Temperature/top-p sweep with groundedness labels. |
| Are formats enforced? | Invalid status or JSON value reaches application code. | Constraint and schema tests. |
| Is sequence scoring appropriate? | Replies are too short or oddly generic. | Beam/length policy report. |
| Is evaluation reproducible? | Two experiments can't be compared. | Decoder receipts and fixed eval cases. |
| Is latency optimization equivalent? | Faster serving silently changes answers. | Target-distribution audit for speculative serving. |
[1000.0, 999.0, 998.0] to stable-softmax.py. Explain why subtracting the maximum changes numeric safety without changing the probabilities.top-p-adapts-to-the-distribution.py to use threshold=0.70. Predict which tokens remain for the peaked and flat distributions before running it.mask-invalid-status-tokens.py with a two-token allowed value such as needs review. Sketch the prefix states a production decoder must allow.two-step-beam-search.py. Prune back to beam width 2 after expansion and confirm the winner.[0.665, 0.245, 0.090]. Shifting logits changes their absolute values, not their differences.p=0.70, the peaked distribution needs eligible and pending; the flat distribution needs eligible, pending, and denied.needs, the decoder must still allow the continuation that completes needs review while rejecting prefixes that can never reach a valid value.eligible is grounded while coupon isn't.| Symptom | Cause | Fix |
|---|---|---|
| Large logits lead to invalid probabilities. | Softmax exponentiated raw values without shifting. | Subtract the maximum logit before exp. |
| Output becomes more varied but also less supported. | Temperature flattened probability mass into the tail. | Evaluate groundedness, then use deliberate truncation or constraints. |
| Valid JSON contains a policy-wrong status. | Schema enforcement was mistaken for factual validation. | Test generated values against retrieved evidence. |
| Beam output is extremely short. | Raw sequence scores favor fewer multiplied probabilities. | Declare and evaluate a length-scoring policy. |
| A model experiment can't be reproduced. | Decoder settings or retrieval snapshot wasn't logged. | Store a decoder receipt with every evaluated output. |
Answer every question, then check your score. Score above 75% to mark this lesson complete.
10 questions remaining.
Efficient Memory Management for Large Language Model Serving with PagedAttention.
Kwon, W., et al. · 2023 · SOSP 2023
Hierarchical Neural Story Generation.
Fan, A., Lewis, M., & Dauphin, Y. · 2018 · ACL 2018
The Curious Case of Neural Text Degeneration.
Holtzman, A., et al. · 2020 · ICLR 2020
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
Wu, Y., et al. · 2016
Fast Inference from Transformers via Speculative Decoding.
Leviathan, Y., Kalman, M., & Matias, Y. · 2023 · ICML 2023