Compute perplexity from held-out token probabilities, compare models under a fixed protocol, normalize across tokenizers, and decide what PPL can't tell you.
Tokens gave the model its vocabulary. Embeddings gave those tokens useful geometry. Now an engineer needs a measurement: when the real next token appears, how much probability did the model assign to it?
Suppose a shipping update begins "The package left the". A model that assigns high probability to "hub" and low probability to "volcano" understands this local pattern better than a model that treats both as equally plausible. Perplexity converts that held-out prediction behavior into one number.
You will build that number from probabilities, implement its failure-resistant details, and write an evaluation report that answers a more important question than "is the score low?": is this score comparable to the previous run, and is it enough for the product decision?
Key idea: Perplexity is a next-token fit metric for causal language models. It's useful when you keep the evaluation contract fixed. It isn't a score for factuality, helpfulness, or safe product behavior.
A causal language model assigns a probability to every possible next token. During evaluation, you don't reward the model for a token it could have emitted. You score the probability it assigned to the token that actually occurred in held-out text.
If the observed token has probability $p$, its negative log-likelihood (NLL), or surprise, is:

$$\text{NLL} = -\ln p$$

With the natural logarithm, the unit is nats.
A certain correct prediction has probability 1 and surprise 0. A probability close to 0 produces a large penalty. This is why confident misses hurt language-model loss so much.
```python
import math

probabilities = {
    "hub": 0.60,
    "warehouse": 0.25,
    "volcano": 0.001,
}

# Surprise is the negative natural log of the probability, measured in nats.
for token in ["hub", "warehouse", "volcano"]:
    surprise = -math.log(probabilities[token])
    print(f"{token:9s} probability={probabilities[token]:.3f} surprise={surprise:.3f} nats")
```

```text
hub       probability=0.600 surprise=0.511 nats
warehouse probability=0.250 surprise=1.386 nats
volcano   probability=0.001 surprise=6.908 nats
```

The held-out token matters. If the log says "hub", the first score counts. It doesn't matter that "warehouse" also sounded reasonable: likelihood evaluates the text the model was asked to predict.
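For held-out tokens $x_1, \dots, x_N$, the average NLL is:

$$\text{NLL}_{\text{avg}} = -\frac{1}{N} \sum_{i=1}^{N} \ln p(x_i \mid x_{<i})$$

Perplexity exponentiates that average:

$$\text{PPL} = \exp\left(\text{NLL}_{\text{avg}}\right)$$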
This is the standard definition used for autoregressive, or causal, language models. It isn't the standard metric for masked models such as BERT, because they are trained to predict masked positions rather than the next token in sequence.[1]
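One step the definition leaves implicit: the exponential of an average log is a geometric mean, so perplexity is the inverse geometric mean of the probabilities assigned to the observed tokens. A short derivation, writing $p_i$ as shorthand (notation assumed here for illustration) for the probability assigned to the $i$-th of $N$ held-out tokens given its preceding context:

```latex
\begin{aligned}
\mathrm{PPL}
  &= \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln p_i\right) \\
  &= \exp\!\left(\ln\!\left[\left(\prod_{i=1}^{N} p_i\right)^{-1/N}\right]\right) \\
  &= \left(\prod_{i=1}^{N} p_i\right)^{-1/N}
\end{aligned}
```

A single near-zero probability drags the whole product down, which is the same confident-miss penalty seen in the per-token surprise.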
Start with three observed token probabilities:
```python
import math

observed_probabilities = [0.50, 0.10, 0.80]
# Per-token surprise, then one average, then one exponentiation.
token_nll = [-math.log(probability) for probability in observed_probabilities]
average_nll = sum(token_nll) / len(token_nll)
perplexity = math.exp(average_nll)

print(f"token NLL: {[round(value, 3) for value in token_nll]}")
print(f"average NLL: {average_nll:.3f} nats")
print(f"perplexity: {perplexity:.2f}")
```

```text
token NLL: [0.693, 2.303, 0.223]
average NLL: 1.073 nats
perplexity: 2.92
```

The output means the model behaved, on average, as though it faced about 2.92 equally likely choices at each prediction step. That effective choice count is an interpretation, not a claim that exactly 2.92 vocabulary tokens were available. The uniform case makes the interpretation concrete: with k equally likely options, perplexity is exactly k.
```python
import math

for equally_likely_options in [1, 4, 20, 100]:
    probability = 1 / equally_likely_options
    # -log(1.0) is -0.0 in floating point, hence the printed "-0.000".
    loss = -math.log(probability)
    perplexity = math.exp(loss)
    print(f"{equally_likely_options:3d} options -> loss={loss:.3f}, PPL={perplexity:.1f}")
```

```text
  1 options -> loss=-0.000, PPL=1.0
  4 options -> loss=1.386, PPL=4.0
 20 options -> loss=2.996, PPL=20.0
100 options -> loss=4.605, PPL=100.0
```
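The examples so far use natural logarithms, so losses are in nats. Some reports quote bits instead. A minimal sketch of the conversion, assuming only the identity that one nat equals 1/ln 2 bits; the helper name `nats_to_bits` is made up for this illustration:

```python
import math

def nats_to_bits(loss_nats: float) -> float:
    """Convert a loss from nats (natural log) to bits (log base 2)."""
    return loss_nats / math.log(2)

average_nll_nats = 1.073  # average NLL from the three-token example
average_nll_bits = nats_to_bits(average_nll_nats)

# Perplexity is the same number in either unit, as long as the
# exponent base matches the logarithm base.
ppl_from_nats = math.exp(average_nll_nats)
ppl_from_bits = 2 ** average_nll_bits

print(f"{average_nll_nats:.3f} nats = {average_nll_bits:.3f} bits")
print(f"PPL from nats={ppl_from_nats:.2f}, PPL from bits={ppl_from_bits:.2f}")
```

The practical point is to record the unit next to any logged loss, since a loss in bits is numerically larger than the same loss in nats.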
Models produce logits, not probabilities. A naive implementation calls exp(logit) directly. Large logits can overflow even though the eventual softmax probabilities are ordinary values. Stable log-softmax subtracts the largest logit before exponentiating.
```python
import math

def stable_log_softmax(logits: list[float]) -> list[float]:
    # Subtracting the maximum keeps every exponent at or below zero.
    maximum = max(logits)
    log_normalizer = maximum + math.log(
        sum(math.exp(value - maximum) for value in logits)
    )
    return [value - log_normalizer for value in logits]

logits = [1000.0, 998.0, 997.0]
observed_token_id = 0

# The naive path fails before any normalization can happen.
try:
    math.exp(logits[observed_token_id])
except OverflowError:
    print("naive exp(logit) overflowed")

log_probabilities = stable_log_softmax(logits)
nll = -log_probabilities[observed_token_id]
print(f"stable NLL={nll:.3f}, PPL={math.exp(nll):.3f}")
```

```text
naive exp(logit) overflowed
stable NLL=0.170, PPL=1.185
```

In a framework evaluator, cross-entropy normally applies this stable computation for you. You still need to know the principle when debugging inf losses, implementing metrics, or reviewing a custom evaluation loop.
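When reviewing a custom evaluation loop, it can help to see the whole path from logits to perplexity in one place. The sketch below is a dependency-free stand-in for what a framework's cross-entropy does, not any library's actual implementation; `sequence_perplexity` and the toy logits are hypothetical.

```python
import math

def log_softmax(logits: list[float]) -> list[float]:
    """Log-probabilities via the max-subtraction (log-sum-exp) trick."""
    maximum = max(logits)
    log_normalizer = maximum + math.log(
        sum(math.exp(value - maximum) for value in logits)
    )
    return [value - log_normalizer for value in logits]

def sequence_perplexity(
    logit_rows: list[list[float]], target_ids: list[int]
) -> float:
    """Perplexity over all positions; row i holds the logits for target i."""
    if len(logit_rows) != len(target_ids) or not target_ids:
        raise ValueError("need one logit row per target token, and at least one")
    total_nll = 0.0
    for logits, target_id in zip(logit_rows, target_ids):
        total_nll += -log_softmax(logits)[target_id]
    # Divide once and exponentiate once, over the whole sequence.
    return math.exp(total_nll / len(target_ids))

# Toy three-token vocabulary; the first row would overflow a naive exp.
logit_rows = [
    [1000.0, 998.0, 997.0],
    [-3.0, 2.0, 0.5],
    [0.0, 0.0, 0.0],
]
target_ids = [0, 1, 2]
print(f"PPL={sequence_perplexity(logit_rows, target_ids):.3f}")
```

A quick sanity check for any such loop: all-equal logits over a vocabulary of size V should give a perplexity of exactly V.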
A perplexity score is never complete without its units and conditioning rules. At minimum, log:
| Contract field | Why it changes the score |
|---|---|
| Dataset and split | A delivery-status corpus isn't a legal-policy corpus; train data isn't held-out data. |
| Tokenizer revision | Tokens set the denominator and the events being predicted. |
| Context length and stride | More usable left context generally makes token prediction easier. |
| Special-token and masking policy | Scoring or skipping initial and padding tokens changes the aggregate. |
| Model objective | A causal next-token model isn't directly comparable to a masked-language objective. |
Represent that contract in code before you compare checkpoints:
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvaluationContract:
    dataset: str
    tokenizer: str
    context_tokens: int
    stride_tokens: int
    objective: str = "causal-next-token"

def comparable(left: EvaluationContract, right: EvaluationContract) -> bool:
    # Dataclass equality compares every field, so any mismatch blocks comparison.
    return left == right

baseline = EvaluationContract("support-holdout-v3", "bpe-v7", 2048, 512)
new_checkpoint = EvaluationContract("support-holdout-v3", "bpe-v7", 2048, 512)
short_context_run = EvaluationContract("support-holdout-v3", "bpe-v7", 512, 512)

print("baseline vs new checkpoint:", comparable(baseline, new_checkpoint))
print("baseline vs short context:", comparable(baseline, short_context_run))
```

```text
baseline vs new checkpoint: True
baseline vs short context: False
```

Now compare two model checkpoints on the same token outcomes:
```python
import math

def perplexity(observed_probabilities: list[float]) -> float:
    average_nll = sum(-math.log(p) for p in observed_probabilities) / len(
        observed_probabilities
    )
    return math.exp(average_nll)

# Probabilities each checkpoint assigned to the same five held-out tokens.
held_out_probabilities = {
    "checkpoint-0400": [0.31, 0.44, 0.18, 0.52, 0.24],
    "checkpoint-0800": [0.42, 0.59, 0.29, 0.61, 0.35],
}

for name, probabilities in held_out_probabilities.items():
    print(f"{name}: PPL={perplexity(probabilities):.2f}")
```

```text
checkpoint-0400: PPL=3.18
checkpoint-0800: PPL=2.31
```

The result supports a narrow statement: checkpoint-0800 predicts tokens in this held-out set better under this protocol. It doesn't yet prove better support answers.
Evaluation windows are rarely the same size. Averaging each window's already exponentiated PPL weights every window equally regardless of how many tokens it scored, so a short window gets too much influence. Sum NLL weighted by scored-token count, divide once, and exponentiate once.
```python
import math

windows = [
    {"average_nll": 0.50, "scored_tokens": 2},
    {"average_nll": 2.00, "scored_tokens": 8},
]

# Wrong: exponentiate per window, then take an unweighted mean.
wrong = sum(math.exp(window["average_nll"]) for window in windows) / len(windows)

# Right: recover total NLL, divide by total tokens once, exponentiate once.
total_nll = sum(
    window["average_nll"] * window["scored_tokens"] for window in windows
)
total_tokens = sum(window["scored_tokens"] for window in windows)
right = math.exp(total_nll / total_tokens)

print(f"average of window PPLs: {wrong:.2f}")
print(f"token-weighted corpus PPL: {right:.2f}")
```

```text
average of window PPLs: 4.52
token-weighted corpus PPL: 5.47
```

Here the easy two-token window pulls the naive average down to 4.52, even though it holds only two of the ten scored tokens.

Token-level perplexity depends on tokenization. The string "Package at hub" may be four subword tokens for one model and fourteen character tokens for another. A probability event per large subword isn't the same unit as a probability event per character. Hugging Face's perplexity documentation explicitly warns that tokenization affects PPL comparisons.[1]
For the same raw UTF-8 evaluation text, bits per byte (BPB) gives both models one shared denominator. PALOMA uses BPB as a practical compromise when tokenizers differ, while noting that it still scores the canonical token sequence chosen by each tokenizer rather than marginalizing over every valid segmentation.[2]
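$$\text{BPB} = \frac{-\sum_{i=1}^{N} \ln p(x_i \mid x_{<i})}{B \ln 2}$$

where the numerator is the total NLL in nats over the model's own $N$ tokens and $B$ is the byte count of the original text. A related metric, bits per character, is useful when a benchmark defines character units instead of bytes.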
```python
import math

text = "Package at hub"
byte_count = len(text.encode("utf-8"))
evaluations = [
    {"name": "subword model", "tokens": 4, "total_nll": 8.4},
    {"name": "character model", "tokens": 14, "total_nll": 9.0},
]

for run in evaluations:
    # Token PPL divides by each model's own token count.
    ppl = math.exp(run["total_nll"] / run["tokens"])
    # BPB divides by the shared byte count and converts nats to bits.
    bpb = run["total_nll"] / (byte_count * math.log(2))
    print(f"{run['name']:15s} token PPL={ppl:.2f}, BPB={bpb:.3f}")

print("Lower BPB identifies less surprise on identical bytes.")
```

```text
subword model   token PPL=8.17, BPB=0.866
character model token PPL=1.90, BPB=0.927
Lower BPB identifies less surprise on identical bytes.
```

The character model looks dramatically better under raw token PPL because it predicts smaller units. BPB gives the fairer comparison: the subword model assigned more total probability to the same bytes. It doesn't erase every tokenization effect, so keep logging the tokenizer and use a fixed vocabulary when possible.
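Bytes and characters coincide only for ASCII text, so bits per byte and bits per character diverge as soon as the evaluation text contains anything else. A small sketch of the difference, using a made-up string and a made-up total NLL purely for illustration:

```python
import math

text = "Package at dépôt"  # hypothetical non-ASCII evaluation text
total_nll_nats = 12.0      # hypothetical total NLL, in nats

char_count = len(text)                  # Unicode code points
byte_count = len(text.encode("utf-8"))  # UTF-8 bytes; accented letters need more than one

total_bits = total_nll_nats / math.log(2)
bits_per_character = total_bits / char_count
bits_per_byte = total_bits / byte_count

print(f"characters={char_count}, bytes={byte_count}")
print(f"BPC={bits_per_character:.3f}, BPB={bits_per_byte:.3f}")
```

Whichever unit a benchmark uses, the report should name it, because the two numbers are not interchangeable once the text leaves ASCII.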
A real evaluation file may contain thousands of tokens, while a model accepts only a fixed number of context tokens. Cutting text into disjoint blocks is fast, but tokens at each block boundary lose usable left context. A strided sliding window reuses context and scores only newly exposed target tokens.
Hugging Face demonstrates this protocol for GPT-2 Large on WikiText-2: a no-overlap stride = 1024 run reports PPL 19.44, while stride = 512 reports 16.44 for the same model and corpus. The score improved because more context was supplied, not because model weights changed.[1]
First simulate which positions a sliding-window loop scores:
```python
tokens = list("ABCDEFGHIJ")
# Each window sees "context" as input but only scores the "score" range.
windows = [
    {"context": (0, 5), "score": (1, 5)},
    {"context": (3, 8), "score": (5, 8)},
    {"context": (5, 10), "score": (8, 10)},
]

scored_tokens: list[str] = []
for index, window in enumerate(windows, start=1):
    begin, end = window["context"]
    score_begin, score_end = window["score"]
    context = "".join(tokens[begin:end])
    scored = "".join(tokens[score_begin:score_end])
    scored_tokens.extend(tokens[score_begin:score_end])
    print(f"window {index}: context={context}, newly scored={scored}")

print("scored exactly once:", scored_tokens == tokens[1:])
```

```text
window 1: context=ABCDE, newly scored=BCDE
window 2: context=DEFGH, newly scored=FGH
window 3: context=FGHIJ, newly scored=IJ
scored exactly once: True
```

The first token is input context because a causal model needs a previous position before it can score a next-token target. In a framework implementation, context-only labels are commonly masked with -100 so cross-entropy ignores them.[1]
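The window list above was written by hand. A sketch of how such windows and their label masks could be generated from a context size and a stride follows. The function name `sliding_windows` is hypothetical, the `-100` value mirrors the common framework convention rather than any specific library call, and the labels are assumed to be aligned with the inputs, as in frameworks that shift targets internally.

```python
from collections.abc import Iterator

IGNORE_INDEX = -100  # conventional "do not score this label" value

def sliding_windows(
    token_ids: list[int], max_context: int, stride: int
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (input_ids, labels); labels equal IGNORE_INDEX where not scored."""
    # stride < max_context guarantees every later window keeps some left context.
    if not 1 <= stride < max_context:
        raise ValueError("stride must satisfy 1 <= stride < max_context")
    previous_end = 0
    for begin in range(0, len(token_ids), stride):
        end = min(begin + max_context, len(token_ids))
        new_targets = end - previous_end  # tokens no earlier window has scored
        input_ids = token_ids[begin:end]
        labels = [IGNORE_INDEX] * (len(input_ids) - new_targets)
        labels += input_ids[-new_targets:]
        if begin == 0:
            labels[0] = IGNORE_INDEX  # the first token has no left context
        yield input_ids, labels
        previous_end = end
        if end == len(token_ids):
            break

token_ids = list(range(100, 110))  # ten placeholder token ids
scored: list[int] = []
for input_ids, labels in sliding_windows(token_ids, max_context=5, stride=3):
    scored.extend(label for label in labels if label != IGNORE_INDEX)
    print(input_ids, labels)

print("scored exactly once:", scored == token_ids[1:])
```

Requiring the stride to be strictly smaller than the context is a design choice of this sketch: it trades some repeated computation for the guarantee that no scored token sits at the start of a window with nothing to its left.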
Here is a dependency-free evaluation loop over precomputed token NLL values. A real model supplies the losses; aggregation logic stays the same.
```python
import math

# One inner list per window, holding NLL only for newly scored targets.
new_target_losses = [
    [0.30, 0.72, 0.51, 0.43],
    [0.27, 0.61, 0.38],
    [0.56, 0.48],
]

total_nll = sum(sum(window) for window in new_target_losses)
scored_tokens = sum(len(window) for window in new_target_losses)
perplexity = math.exp(total_nll / scored_tokens)

print(f"scored tokens={scored_tokens}")
print(f"average NLL={total_nll / scored_tokens:.3f}")
print(f"PPL={perplexity:.2f}")
```

```text
scored tokens=9
average NLL=0.473
PPL=1.61
```

For every reported PPL, store max_context_tokens, stride_tokens, the first-token policy, and the masking policy beside the score. Those details are measurement settings, not implementation trivia.
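One way to keep those settings attached to the score is to write them into the same record. The field values and policy labels below are illustrative assumptions, not a standard schema.

```python
import json

result_record = {
    "metric": "perplexity",
    "value": 1.61,
    "scored_tokens": 9,
    "max_context_tokens": 5,                    # hypothetical setting
    "stride_tokens": 3,                         # hypothetical setting
    "first_token_policy": "context-only",       # hypothetical label
    "masking_policy": "ignore-context-labels",  # hypothetical label
}

# sort_keys keeps the serialized form stable, so two runs can be diffed.
serialized = json.dumps(result_record, sort_keys=True, indent=2)
print(serialized)

# Round-trip check: nothing is lost when the record is read back.
restored = json.loads(serialized)
print("round trip intact:", restored == result_record)
```

Storing the score and its settings in one object means a later reader cannot pick up the number without also picking up the conditions under which it was measured.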
Suppose your support assistant predicts common delivery-status language fluently but states a wrong refund deadline. Perplexity can reward fluent next-token prediction without detecting that policy failure. Likewise, changing decoding strategy can change generated text quality even when the underlying model is unchanged, as Holtzman et al. demonstrated when studying repetitive neural generation.[3]
Use PPL for the question it answers:
| Decision | Useful measurement |
|---|---|
| Did a base-model checkpoint get better at held-out next-token prediction? | PPL under fixed protocol, or BPB across tokenizers |
| Did the assistant provide the correct order status and cite supplied evidence? | Task-specific deterministic checks |
| Did an open-ended reply follow a rubric for clarity and groundedness? | Calibrated judge or human review |
| Is a release safe for a high-impact workflow? | Task regressions plus human-reviewed edge cases |
```python
candidates = [
    {"name": "fluent-wrong", "ppl": 8.9, "policy_checks_passed": 1},
    {"name": "grounded-answer", "ppl": 9.8, "policy_checks_passed": 3},
]

# The two rankings answer different questions and can disagree.
best_language_fit = min(candidates, key=lambda row: row["ppl"])
best_product_answer = max(candidates, key=lambda row: row["policy_checks_passed"])

print("best held-out language fit:", best_language_fit["name"])
print("best support policy result:", best_product_answer["name"])
```

```text
best held-out language fit: fluent-wrong
best support policy result: grounded-answer
```
Many support cases can be checked without a model judge: expected status, required citation identifier, and forbidden policy claim are deterministic. Use those checks first.
```python
EXPECTED_STATUS = "delayed"
REQUIRED_SOURCE = "tracking_event_483"

def score_answer(answer: dict[str, str]) -> tuple[int, list[str]]:
    # Each failed check costs one point out of two.
    failures: list[str] = []
    if answer["status"] != EXPECTED_STATUS:
        failures.append("wrong status")
    if answer["source"] != REQUIRED_SOURCE:
        failures.append("missing evidence")
    return 2 - len(failures), failures

answers = [
    {"name": "A", "status": "delayed", "source": "tracking_event_483"},
    {"name": "B", "status": "delivered", "source": "tracking_event_483"},
]

for answer in answers:
    score, failures = score_answer(answer)
    print(answer["name"], score, failures or ["pass"])
```

```text
A 2 ['pass']
B 1 ['wrong status']
```

Open-ended tone, clarity, and partial correctness may require rubric review. LLM judges can scale that review, but the MT-Bench and Chatbot Arena study documents position, verbosity, and self-enhancement bias. Treat a judge as a calibrated measurement instrument, not truth.[4]
This lesson draws the boundary. Later evaluation lessons build judge calibration, benchmark selection, and online experimentation in depth.
PPL needs held-out text. If training data includes your evaluation records, lower loss may reflect memorization rather than generalization. Product task suites have the same failure: if prompt examples or fine-tuning rows include hidden test tickets, release metrics lose meaning.
```python
training_record_ids = {"ticket-101", "ticket-102", "ticket-103"}
validation_record_ids = {"ticket-201", "ticket-202", "ticket-103"}

# Any shared identifier means the validation set is not truly held out.
overlap = training_record_ids & validation_record_ids
if overlap:
    print("FAIL leaked record ids:", sorted(overlap))
else:
    print("PASS validation set is disjoint")
```

```text
FAIL leaked record ids: ['ticket-103']
```

For public LLM benchmarks, test content can also enter later training corpora. LiveBench addresses that risk with frequently updated questions from recent sources and objective ground-truth scoring; it limits contamination risk rather than making every future score immune to leakage.[5]
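Identifier checks miss a record that was copied under a new ID. As a complementary sketch, and as an assumption beyond the identifier check shown in this lesson, one could also compare hashes of normalized text. The normalization here (lowercasing and collapsing whitespace) is deliberately crude and only catches near-verbatim copies; the records are invented.

```python
import hashlib

def fingerprint(text: str) -> str:
    """SHA-256 of lowercased, whitespace-collapsed text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hypothetical records: ticket-301 repeats ticket-101 under a new identifier.
training_records = {
    "ticket-101": "Package delayed at hub.",
    "ticket-102": "Refund issued for damaged item.",
}
validation_records = {
    "ticket-201": "Driver could not find the address.",
    "ticket-301": "package  delayed at HUB.",
}

training_fingerprints = {fingerprint(text) for text in training_records.values()}
leaked_ids = sorted(
    record_id
    for record_id, text in validation_records.items()
    if fingerprint(text) in training_fingerprints
)
print("content-level leaks:", leaked_ids or "none")
```

Paraphrased or partially overlapping records would pass this check, so it supplements disjoint identifiers rather than replacing them.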
An engineering metric becomes useful when it ships with enough context to reproduce a decision. A compact report should include metric value, protocol fields, leakage checks, and product task gates.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Report:
    checkpoint: str
    perplexity: float
    dataset: str
    tokenizer: str
    context_tokens: int
    stride_tokens: int
    leaked_records: int
    policy_pass_rate: float

def release_gate(report: Report) -> str:
    # Contamination is checked first: a leaked set invalidates every other number.
    if report.leaked_records:
        return "BLOCK: contaminated evaluation set"
    if report.policy_pass_rate < 1.0:
        return "BLOCK: product regressions"
    return "PASS: protocol recorded and product checks passed"

report = Report(
    checkpoint="support-lm-0800",
    perplexity=9.81,
    dataset="support-holdout-v3",
    tokenizer="bpe-v7",
    context_tokens=2048,
    stride_tokens=512,
    leaked_records=0,
    policy_pass_rate=1.0,
)

print(f"{report.checkpoint}: PPL={report.perplexity} @ {report.context_tokens}/{report.stride_tokens}")
print(release_gate(report))
```

```text
support-lm-0800: PPL=9.81 @ 2048/512
PASS: protocol recorded and product checks passed
```

The report refuses two common shortcuts: treating an untrusted held-out set as evidence, and treating language fit as a substitute for application correctness.
Perplexity is exp(average NLL): an interpretable view of held-out next-token surprise.

Symptom: A team declares victory from PPL 10 versus PPL 12 but can't name the tokenizer, data split, or stride.
Cause: The score was treated as a universal model rating instead of a metric with units and conditioning rules.
Fix: Store the evaluation contract with every result and compare raw PPL only when contracts match.
Symptom: Long-document PPL changes when window boundaries move, even though scored token losses are unchanged.
Cause: Per-window perplexities were averaged directly.
Fix: Sum token NLL across all windows, divide by scored-token count once, then exponentiate.
Symptom: A fluent model ships a wrong policy answer because it had the lowest PPL.
Cause: Intrinsic next-token evaluation was confused with application correctness.
Fix: Gate releases on deterministic task checks and calibrated review in addition to base-model fit metrics.
Symptom: Evaluation looks unusually strong, then fails on genuinely new tickets.
Cause: Training or prompt examples overlap with hidden evaluation data.
Fix: Enforce disjoint identifiers, keep private held-out records, and rotate realistic challenge cases.
[1] Hugging Face. Perplexity of fixed-length models. 2026.
[2] Magnusson, I., et al. PALOMA: A Benchmark for Evaluating Language Model Fit. NeurIPS 2024 Datasets and Benchmarks Track, 2024.
[3] Holtzman, A., et al. The Curious Case of Neural Text Degeneration. ICLR 2020.
[4] Zheng, L., et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
[5] White, C., et al. LiveBench: A Challenging, Contamination-Limited LLM Benchmark. 2024.