
Perplexity & Model Evaluation

Compute perplexity from held-out token probabilities, compare models under a fixed protocol, normalize across tokenizers, and decide what PPL can't tell you.


Tokens gave the model its vocabulary. Embeddings gave those tokens useful geometry. Now an engineer needs a measurement: when the real next token appears, how much probability did the model assign to it?

Suppose a shipping update begins The package left the. A model that assigns high probability to hub and low probability to volcano understands this local pattern better than a model that treats both as equally plausible. Perplexity converts that held-out prediction behavior into one number.

You will build that number from probabilities, implement its failure-resistant details, and write an evaluation report that answers a more important question than "is the score low?": is this score comparable to the previous run, and is it enough for the product decision?

Key idea: Perplexity is a next-token fit metric for causal language models. It's useful when you keep the evaluation contract fixed. It isn't a score for factuality, helpfulness, or safe product behavior.

From surprise to a metric

A causal language model assigns a probability to every possible next token. During evaluation, you don't reward the model for a token it could have emitted. You score the probability it assigned to the token that actually occurred in held-out text.

If the observed token has probability $p$, its negative log-likelihood (NLL), or surprise, is:

$$\text{surprise}(p) = -\ln(p)$$

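A certain correct prediction has probability 1 and surprise 0. A probability close to 0 produces a large penalty, which is why confident misses hurt language-model loss so much. With the natural logarithm, surprise is measured in nats; dividing by $\ln 2$ converts it to bits.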

compare-token-surprise.py

import math

probabilities = {
    "hub": 0.60,
    "warehouse": 0.25,
    "volcano": 0.001,
}

for token in ["hub", "warehouse", "volcano"]:
    surprise = -math.log(probabilities[token])
    print(f"{token:9s} probability={probabilities[token]:.3f} surprise={surprise:.3f} nats")

Output

hub       probability=0.600 surprise=0.511 nats
warehouse probability=0.250 surprise=1.386 nats
volcano   probability=0.001 surprise=6.908 nats

The held-out token matters. If the log says hub, the first score counts. It doesn't matter that warehouse also sounded reasonable: likelihood evaluates the text the model was asked to predict.

Average surprise becomes perplexity

For held-out tokens $x_1, x_2, \ldots, x_N$, the average NLL is:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\ln p_\theta(x_i \mid x_{<i})$$

Perplexity exponentiates that average:

$$\text{PPL} = \exp(L)$$

This is the standard definition used for autoregressive, or causal, language models. It isn't well defined for masked language models such as BERT, which predict masked positions from context on both sides and so don't assign a left-to-right probability to a whole sequence.^[1]

Start with three observed token probabilities:

compute-perplexity-from-probabilities.py

import math

observed_probabilities = [0.50, 0.10, 0.80]
token_nll = [-math.log(probability) for probability in observed_probabilities]
average_nll = sum(token_nll) / len(token_nll)
perplexity = math.exp(average_nll)

print(f"token NLL: {[round(value, 3) for value in token_nll]}")
print(f"average NLL: {average_nll:.3f} nats")
print(f"perplexity: {perplexity:.2f}")

Output

token NLL: [0.693, 2.303, 0.223]
average NLL: 1.073 nats
perplexity: 2.92

The output means the model behaved, on average, as though it faced about 2.92 equally likely choices at each prediction step. That effective choice count is an interpretation, not a claim that exactly 2.92 vocabulary tokens were available.

recover-effective-choice-count.py

import math

for equally_likely_options in [1, 4, 20, 100]:
    probability = 1 / equally_likely_options
    loss = -math.log(probability)
    perplexity = math.exp(loss)
    print(f"{equally_likely_options:3d} options -> loss={loss:.3f}, PPL={perplexity:.1f}")

Output

  1 options -> loss=-0.000, PPL=1.0
  4 options -> loss=1.386, PPL=4.0
 20 options -> loss=2.996, PPL=20.0
100 options -> loss=4.605, PPL=100.0
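The uniform case is one instance of a more general identity. As a short derivation that uses only the definitions of $L$ and PPL given earlier, perplexity is the reciprocal of the geometric mean of the observed-token probabilities:

```latex
\text{PPL}
  = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln p_\theta(x_i \mid x_{<i})\right)
  = \left(\prod_{i=1}^{N} p_\theta(x_i \mid x_{<i})\right)^{-1/N}
```

When every observed token has probability $1/k$, the product is $k^{-N}$ and PPL equals $k$. For the three-token example with probabilities 0.50, 0.10, and 0.80, the product is 0.04 and $0.04^{-1/3} \approx 2.92$, matching the value computed from the average NLL.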

Line chart showing how average negative log-likelihood maps exponentially to perplexity. — Read loss and perplexity as two views of the same average surprise: a fixed loss reduction multiplies perplexity by a fixed ratio.

Perplexity shown as effective next-token branching factor. — Effective choices are intuitive only inside one evaluation setup; a lower value is better only after the protocol matches.
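The fixed-ratio relationship between loss and perplexity can be checked numerically. The sketch below uses hypothetical baseline losses chosen only for illustration; it applies the same 0.10-nat reduction to each and compares the resulting PPL ratios.

```python
import math

# Hypothetical baseline losses in nats per token (illustrative values only).
baseline_losses = [1.5, 2.5, 3.5]
loss_reduction = 0.10  # the same absolute improvement applied to every baseline

ratios = []
for loss in baseline_losses:
    before = math.exp(loss)
    after = math.exp(loss - loss_reduction)
    ratios.append(after / before)
    print(
        f"loss {loss:.2f} -> {loss - loss_reduction:.2f}: "
        f"PPL {before:.2f} -> {after:.2f} (ratio {after / before:.4f})"
    )

# Every ratio equals exp(-loss_reduction), whatever the starting loss was.
expected_ratio = math.exp(-loss_reduction)
print(f"expected ratio: {expected_ratio:.4f}")
```

A 0.10-nat improvement is therefore always about a 9.5% relative drop in PPL, which is why absolute PPL differences are hard to read across very different baselines.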

Compute from logits without numerical failure

Models produce logits, not probabilities. A naive implementation calls exp(logit) directly. Large logits can overflow even though the eventual softmax probabilities are ordinary values. Stable log-softmax subtracts the largest logit before exponentiating.

stable-log-softmax-for-perplexity.py

import math

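def stable_log_softmax(logits: list[float]) -> list[float]:
    # Shift by the largest logit so every exponent is <= 0 and exp() cannot overflow.
    maximum = max(logits)
    # Add the shift back to recover log(sum(exp(logits))).
    log_normalizer = maximum + math.log(
        sum(math.exp(value - maximum) for value in logits)
    )
    return [value - log_normalizer for value in logits]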

logits = [1000.0, 998.0, 997.0]
observed_token_id = 0

try:
    math.exp(logits[observed_token_id])
except OverflowError:
    print("naive exp(logit) overflowed")

log_probabilities = stable_log_softmax(logits)
nll = -log_probabilities[observed_token_id]
print(f"stable NLL={nll:.3f}, PPL={math.exp(nll):.3f}")

Output

naive exp(logit) overflowed
stable NLL=0.170, PPL=1.185

In a framework evaluator, cross-entropy normally applies this stable computation for you. You still need to know the principle when debugging inf losses, implementing metrics, or reviewing a custom evaluation loop.
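When reviewing a custom loop, it can help to see the whole path from logits to a sequence-level score in one place. The following is a minimal dependency-free sketch, not a framework API: `log_sum_exp`, `sequence_perplexity`, and the logit values are illustrative assumptions.

```python
import math

def log_sum_exp(logits: list[float]) -> float:
    """Return log(sum(exp(logit))) using the max-subtraction trick."""
    maximum = max(logits)
    return maximum + math.log(sum(math.exp(value - maximum) for value in logits))

def sequence_perplexity(logit_rows: list[list[float]], target_ids: list[int]) -> float:
    """Sum per-position NLL over a sequence, average once, exponentiate once."""
    if not target_ids or len(logit_rows) != len(target_ids):
        raise ValueError("need a non-empty sequence with one target id per logit row")
    total_nll = 0.0
    for logits, target_id in zip(logit_rows, target_ids):
        # NLL of the observed token = log-normalizer minus that token's own logit.
        total_nll += log_sum_exp(logits) - logits[target_id]
    return math.exp(total_nll / len(target_ids))

# Made-up logits over a three-token vocabulary, one row per scored position.
logit_rows = [
    [1000.0, 998.0, 997.0],
    [2.0, 0.5, -1.0],
    [-3.0, 0.0, 4.0],
]
target_ids = [0, 1, 2]

ppl = sequence_perplexity(logit_rows, target_ids)
print(f"sequence PPL={ppl:.3f}")
```

The function never materializes probabilities, so very large logits in the first row do not overflow.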

The comparison contract

A perplexity score is never complete without its units and conditioning rules. At minimum, log:

Contract field	Why it changes the score
Dataset and split	A delivery-status corpus isn't a legal-policy corpus; train data isn't held-out data.
Tokenizer revision	Tokens set the denominator and the events being predicted.
Context length and stride	More usable left context generally makes token prediction easier.
Special-token and masking policy	Scoring or skipping initial and padding tokens changes the aggregate.
Model objective	A causal next-token model isn't directly comparable to a masked-language objective.

Compact guide for valid versus invalid raw perplexity comparisons. — A raw PPL improvement is credible only when dataset, tokenizer, context, stride, and objective are held fixed.

Represent that contract in code before you compare checkpoints:

enforce-a-perplexity-contract.py

from dataclasses import dataclass

@dataclass(frozen=True)
class EvaluationContract:
    dataset: str
    tokenizer: str
    context_tokens: int
    stride_tokens: int
    objective: str = "causal-next-token"

def comparable(left: EvaluationContract, right: EvaluationContract) -> bool:
    return left == right

baseline = EvaluationContract("support-holdout-v3", "bpe-v7", 2048, 512)
new_checkpoint = EvaluationContract("support-holdout-v3", "bpe-v7", 2048, 512)
short_context_run = EvaluationContract("support-holdout-v3", "bpe-v7", 512, 512)

print("baseline vs new checkpoint:", comparable(baseline, new_checkpoint))
print("baseline vs short context:", comparable(baseline, short_context_run))

Output

baseline vs new checkpoint: True
baseline vs short context: False
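A boolean answer says that two runs differ but not where. As a hedged extension, the sketch below lists the mismatched fields by name. The `masking_policy` field and its label string, the `mismatched_fields` helper, and the `bpe-v8` tokenizer name are hypothetical additions for illustration.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class EvaluationContract:
    dataset: str
    tokenizer: str
    context_tokens: int
    stride_tokens: int
    objective: str = "causal-next-token"
    # Hypothetical field: the contract table lists a masking policy,
    # but this label string is invented for the example.
    masking_policy: str = "skip-first-token-and-padding"

def mismatched_fields(left: EvaluationContract, right: EvaluationContract) -> list[str]:
    """Return the names of the contract fields whose values differ."""
    return [
        field.name
        for field in fields(left)
        if getattr(left, field.name) != getattr(right, field.name)
    ]

baseline = EvaluationContract("support-holdout-v3", "bpe-v7", 2048, 512)
short_context_run = EvaluationContract("support-holdout-v3", "bpe-v7", 512, 512)
retokenized_run = EvaluationContract("support-holdout-v3", "bpe-v8", 2048, 256)

short_context_diff = mismatched_fields(baseline, short_context_run)
retokenized_diff = mismatched_fields(baseline, retokenized_run)
print("baseline vs short context:", short_context_diff)
print("baseline vs retokenized run:", retokenized_diff)
```

Logging the differing field names next to a rejected comparison makes it clear whether a rerun or a different metric, such as bits per byte, is the right response.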

Now compare two model checkpoints on the same token outcomes:

compare-checkpoints-on-one-held-out-set.py

import math

def perplexity(observed_probabilities: list[float]) -> float:
    average_nll = sum(-math.log(p) for p in observed_probabilities) / len(
        observed_probabilities
    )
    return math.exp(average_nll)

held_out_probabilities = {
    "checkpoint-0400": [0.31, 0.44, 0.18, 0.52, 0.24],
    "checkpoint-0800": [0.42, 0.59, 0.29, 0.61, 0.35],
}

for name, probabilities in held_out_probabilities.items():
    print(f"{name}: PPL={perplexity(probabilities):.2f}")

Output

checkpoint-0400: PPL=3.18
checkpoint-0800: PPL=2.31

The result supports a narrow statement: checkpoint-0800 predicts tokens in this held-out set better under this protocol. It doesn't yet prove better support answers.

Aggregate loss once

Evaluation windows are rarely the same size. Averaging each window's already exponentiated PPL gives every window equal weight regardless of how many tokens it scored, so a short window has too much influence on the result. Sum NLL weighted by scored-token count, divide once, and exponentiate once.

aggregate-loss-before-exponentiating.py

import math

windows = [
    {"average_nll": 0.50, "scored_tokens": 2},
    {"average_nll": 2.00, "scored_tokens": 8},
]

wrong = sum(math.exp(window["average_nll"]) for window in windows) / len(windows)
total_nll = sum(
    window["average_nll"] * window["scored_tokens"] for window in windows
)
total_tokens = sum(window["scored_tokens"] for window in windows)
right = math.exp(total_nll / total_tokens)

print(f"average of window PPLs: {wrong:.2f}")
print(f"token-weighted corpus PPL: {right:.2f}")

Output

average of window PPLs: 4.52
token-weighted corpus PPL: 5.47
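One way to test an aggregation routine is to move the window boundaries while keeping the token losses fixed. The sketch below does that with hypothetical per-token NLL values and window sizes; the helper names are illustrative.

```python
import math

# Hypothetical per-token NLL values (nats) for one eight-token document.
token_nll = [0.2, 0.4, 3.0, 2.6, 0.3, 0.5, 2.8, 0.1]

def split(values: list[float], sizes: tuple[int, ...]) -> list[list[float]]:
    """Cut the loss list into consecutive windows of the given sizes."""
    windows, start = [], 0
    for size in sizes:
        windows.append(values[start:start + size])
        start += size
    return windows

def mean_of_window_ppls(windows: list[list[float]]) -> float:
    """The mistaken aggregate: exponentiate per window, then average."""
    return sum(math.exp(sum(w) / len(w)) for w in windows) / len(windows)

def corpus_ppl(windows: list[list[float]]) -> float:
    """The token-weighted aggregate: sum NLL, divide once, exponentiate once."""
    total = sum(sum(w) for w in windows)
    count = sum(len(w) for w in windows)
    return math.exp(total / count)

results = {}
for sizes in [(4, 4), (2, 6), (3, 3, 2)]:
    windows = split(token_nll, sizes)
    results[sizes] = (mean_of_window_ppls(windows), corpus_ppl(windows))
    print(
        f"sizes={sizes}: mean of window PPLs={results[sizes][0]:.3f}, "
        f"corpus PPL={results[sizes][1]:.3f}"
    )
```

The corpus value stays the same for every split, while the mean of window PPLs moves with the boundaries even though no token loss changed.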

Different tokenizers need common units

Token-level perplexity depends on tokenization. The string Package at hub may be four subword tokens for one model and fourteen character tokens for another. A probability event per large subword isn't the same unit as a probability event per character. Hugging Face's perplexity documentation explicitly warns that tokenization affects PPL comparisons.^[1]

For the same raw UTF-8 evaluation text, bits per byte (BPB) gives both models one shared denominator. PALOMA uses BPB as a practical compromise when tokenizers differ, while noting that it still scores the canonical token sequence chosen by each tokenizer rather than marginalizing over every valid segmentation.^[2]

$$\text{BPB} = \frac{-\sum_i \ln p_\theta(x_i \mid x_{<i})}{B \ln 2}$$

where $B$ is the byte count of the original text. A related metric, bits per character, is useful when a benchmark defines character units instead of bytes.

Tokenizer normalization visual comparing subword, character, and byte units for the same text. — Token PPL changes with segmentation. BPB compares surprise over the same raw bytes, making cross-tokenizer comparison fairer rather than perfect.

compare-tokenizers-with-bits-per-byte.py

import math

text = "Package at hub"
byte_count = len(text.encode("utf-8"))
evaluations = [
    {"name": "subword model", "tokens": 4, "total_nll": 8.4},
    {"name": "character model", "tokens": 14, "total_nll": 9.0},
]

for run in evaluations:
    ppl = math.exp(run["total_nll"] / run["tokens"])
    bpb = run["total_nll"] / (byte_count * math.log(2))
    print(f"{run['name']:15s} token PPL={ppl:.2f}, BPB={bpb:.3f}")

print("Lower BPB identifies less surprise on identical bytes.")

Output

subword model   token PPL=8.17, BPB=0.866
character model token PPL=1.90, BPB=0.927
Lower BPB identifies less surprise on identical bytes.

The character model looks dramatically better under raw token PPL because it predicts smaller units. BPB gives the fairer comparison: the subword model assigned more total probability to the same bytes. It doesn't erase every tokenization effect, so keep logging the tokenizer and use a fixed vocabulary when possible.
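Bytes and characters coincide for the ASCII string above, but not in general. The sketch below, using a hypothetical evaluation string and a made-up summed NLL, shows how the two denominators diverge once the text contains multi-byte UTF-8 characters. The `bits_per_unit` helper is an illustrative name.

```python
import math

def bits_per_unit(total_nll_nats: float, unit_count: int) -> float:
    """Convert a summed NLL in nats to bits per unit (byte or character)."""
    return total_nll_nats / (unit_count * math.log(2))

# Hypothetical text with three accented characters, written with escapes
# so the code points are unambiguous: "Colis livré à Zürich".
text = "Colis livr\u00e9 \u00e0 Z\u00fcrich"
total_nll = 14.0  # made-up summed NLL in nats

char_count = len(text)                  # Unicode code points
byte_count = len(text.encode("utf-8"))  # each accented letter takes two bytes

bpc = bits_per_unit(total_nll, char_count)
bpb = bits_per_unit(total_nll, byte_count)
print(f"characters={char_count}, bytes={byte_count}")
print(f"bits per character={bpc:.3f}, bits per byte={bpb:.3f}")
```

The same total NLL yields a lower number per byte than per character here, so a report should state which unit it used and how the text was encoded.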

Long documents need a scoring policy

A real evaluation file may contain thousands of tokens, while a model accepts only a fixed number of context tokens. Cutting text into disjoint blocks is fast, but tokens at each block boundary lose usable left context. A strided sliding window reuses context and scores only newly exposed target tokens.

Hugging Face demonstrates this protocol for GPT-2 Large on WikiText-2: a no-overlap stride = 1024 run reports PPL 19.44, while stride = 512 reports 16.44 for the same model and corpus. The score improved because more context was supplied, not because model weights changed.^[1]

Strided sliding-window perplexity evaluation showing reused context in blue, newly scored tokens in green, and one final aggregate loss. — Blue tokens condition predictions without being counted again; green target tokens contribute loss once before the final exponentiation.

First simulate which positions a sliding-window loop scores:

score-each-target-token-once.py

tokens = list("ABCDEFGHIJ")
windows = [
    {"context": (0, 5), "score": (1, 5)},
    {"context": (3, 8), "score": (5, 8)},
    {"context": (5, 10), "score": (8, 10)},
]

scored_tokens: list[str] = []
for index, window in enumerate(windows, start=1):
    begin, end = window["context"]
    score_begin, score_end = window["score"]
    context = "".join(tokens[begin:end])
    scored = "".join(tokens[score_begin:score_end])
    scored_tokens.extend(tokens[score_begin:score_end])
    print(f"window {index}: context={context}, newly scored={scored}")

print("scored exactly once:", scored_tokens == tokens[1:])

Output

window 1: context=ABCDE, newly scored=BCDE
window 2: context=DEFGH, newly scored=FGH
window 3: context=FGHIJ, newly scored=IJ
scored exactly once: True

The first token is input context because a causal model needs a previous position before it can score a next-token target. In a framework implementation, context-only labels are commonly masked with -100 so cross-entropy ignores them.^[1]

Here is a dependency-free evaluation loop over precomputed token NLL values. A real model supplies the losses; aggregation logic stays the same.

aggregate-a-strided-evaluation-run.py

import math

new_target_losses = [
    [0.30, 0.72, 0.51, 0.43],
    [0.27, 0.61, 0.38],
    [0.56, 0.48],
]

total_nll = sum(sum(window) for window in new_target_losses)
scored_tokens = sum(len(window) for window in new_target_losses)
perplexity = math.exp(total_nll / scored_tokens)

print(f"scored tokens={scored_tokens}")
print(f"average NLL={total_nll / scored_tokens:.3f}")
print(f"PPL={perplexity:.2f}")

Output

scored tokens=9
average NLL=0.473
PPL=1.61

For every reported PPL, store max_context_tokens, stride_tokens, the first-token policy, and the masking policy beside the score. Those details are measurement settings, not implementation trivia.

PPL answers one question, not every question

Suppose your support assistant predicts common delivery-status language fluently but states a wrong refund deadline. Perplexity can reward fluent next-token prediction without detecting that policy failure. Likewise, changing decoding strategy can change generated text quality even when the underlying model is unchanged, as Holtzman et al. demonstrated when studying repetitive neural generation.^[3]

Use PPL for the question it answers:

Decision	Useful measurement
Did a base-model checkpoint get better at held-out next-token prediction?	PPL under fixed protocol, or BPB across tokenizers
Did the assistant provide the correct order status and cite supplied evidence?	Task-specific deterministic checks
Did an open-ended reply follow a rubric for clarity and groundedness?	Calibrated judge or human review
Is a release safe for a high-impact workflow?	Task regressions plus human-reviewed edge cases

separate-language-fit-from-product-correctness.py

candidates = [
    {"name": "fluent-wrong", "ppl": 8.9, "policy_checks_passed": 1},
    {"name": "grounded-answer", "ppl": 9.8, "policy_checks_passed": 3},
]

best_language_fit = min(candidates, key=lambda row: row["ppl"])
best_product_answer = max(candidates, key=lambda row: row["policy_checks_passed"])

print("best held-out language fit:", best_language_fit["name"])
print("best support policy result:", best_product_answer["name"])

Output

best held-out language fit: fluent-wrong
best support policy result: grounded-answer

Production evaluation stack from perplexity through task checks, calibrated rubric review, and human audit. — PPL catches cheap language-fit regressions; release decisions move upward to checks that represent actual user and policy failures.

When deterministic checks end

Many support cases can be checked without a model judge: expected status, required citation identifier, and forbidden policy claim are deterministic. Use those checks first.

build-a-deterministic-support-eval.py

EXPECTED_STATUS = "delayed"
REQUIRED_SOURCE = "tracking_event_483"

def score_answer(answer: dict[str, str]) -> tuple[int, list[str]]:
    failures: list[str] = []
    if answer["status"] != EXPECTED_STATUS:
        failures.append("wrong status")
    if answer["source"] != REQUIRED_SOURCE:
        failures.append("missing evidence")
    return 2 - len(failures), failures

answers = [
    {"name": "A", "status": "delayed", "source": "tracking_event_483"},
    {"name": "B", "status": "delivered", "source": "tracking_event_483"},
]

for answer in answers:
    score, failures = score_answer(answer)
    print(answer["name"], score, failures or ["pass"])

Output

A 2 ['pass']
B 1 ['wrong status']
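The third deterministic check named above, a forbidden policy claim, can be handled the same way. This is a minimal sketch under stated assumptions: the `reply` field, the phrases in `FORBIDDEN_CLAIMS`, and the `check_answer` helper are hypothetical, and a real phrase list would come from whoever owns the policy.

```python
EXPECTED_STATUS = "delayed"
REQUIRED_SOURCE = "tracking_event_483"
# Hypothetical forbidden phrases, invented for this example.
FORBIDDEN_CLAIMS = ("guaranteed refund", "delivery by tomorrow")

def check_answer(answer: dict[str, str]) -> list[str]:
    """Return every deterministic failure found in one structured answer."""
    failures: list[str] = []
    if answer["status"] != EXPECTED_STATUS:
        failures.append("wrong status")
    if answer["source"] != REQUIRED_SOURCE:
        failures.append("missing evidence")
    reply = answer["reply"].lower()  # case-insensitive substring match
    for claim in FORBIDDEN_CLAIMS:
        if claim in reply:
            failures.append(f"forbidden claim: {claim}")
    return failures

answers = [
    {
        "name": "A",
        "status": "delayed",
        "source": "tracking_event_483",
        "reply": "Your package is delayed at the hub.",
    },
    {
        "name": "C",
        "status": "delayed",
        "source": "tracking_event_483",
        "reply": "It is delayed, but you have a Guaranteed Refund.",
    },
]

outcomes = {answer["name"]: check_answer(answer) for answer in answers}
for name, failures in outcomes.items():
    print(name, failures or ["pass"])
```

Substring matching is deliberately crude: it catches exact phrasings and misses paraphrases, which is one reason rubric review still has a role.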

Open-ended tone, clarity, and partial correctness may require rubric review. LLM judges can scale that review, but the MT-Bench and Chatbot Arena study documents position, verbosity, and self-enhancement bias. Treat a judge as a calibrated measurement instrument, not truth.^[4]

LLM-as-judge pipeline with prompt, target response, rubric, independent judge, score, rationale, and bias controls. — A judge is appropriate only after its rubric and ordering policy are fixed and its scores are checked against human-labeled cases.
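One simple control for position bias is to ask the judge twice with the response order swapped and accept the verdict only if the same response wins both times. The sketch below uses two stand-in judge functions rather than a real model; every name in it is a hypothetical illustration.

```python
from typing import Callable

# A judge takes (prompt, first_response, second_response) and returns
# "first" or "second". These stand-ins are assumptions, not a real model API.
Judge = Callable[[str, str, str], str]

def position_biased_judge(prompt: str, first: str, second: str) -> str:
    """Stand-in judge that always prefers whichever response is shown first."""
    return "first"

def length_judge(prompt: str, first: str, second: str) -> str:
    """Stand-in judge that prefers the longer response regardless of order."""
    return "first" if len(first) >= len(second) else "second"

def order_consistent(judge: Judge, prompt: str, response_a: str, response_b: str) -> bool:
    """True when the same response wins with A shown first and with B shown first."""
    forward = judge(prompt, response_a, response_b)
    reverse = judge(prompt, response_b, response_a)
    winner_forward = "A" if forward == "first" else "B"
    winner_reverse = "B" if reverse == "first" else "A"
    return winner_forward == winner_reverse

prompt = "Where is my package?"
response_a = "It is delayed at the hub (tracking_event_483)."
response_b = "It is on the way."

biased_ok = order_consistent(position_biased_judge, prompt, response_a, response_b)
length_ok = order_consistent(length_judge, prompt, response_a, response_b)
print("position-biased judge consistent:", biased_ok)
print("length-preferring judge consistent:", length_ok)
```

The length-preferring stand-in passes the swap test while still showing verbosity bias, so order consistency is a necessary check rather than evidence that a judge is calibrated.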

This lesson draws the boundary. Later evaluation lessons build judge calibration, benchmark selection, and online experimentation in depth.

Keep evaluation data clean

PPL needs held-out text. If training data includes your evaluation records, lower loss may reflect memorization rather than generalization. Product task suites have the same failure: if prompt examples or fine-tuning rows include hidden test tickets, release metrics lose meaning.

fail-on-evaluation-data-leakage.py

training_record_ids = {"ticket-101", "ticket-102", "ticket-103"}
validation_record_ids = {"ticket-201", "ticket-202", "ticket-103"}

overlap = training_record_ids & validation_record_ids
if overlap:
    print("FAIL leaked record ids:", sorted(overlap))
else:
    print("PASS validation set is disjoint")

Output

FAIL leaked record ids: ['ticket-103']
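Identifier checks miss copies that were re-ingested under a new ID. As a hedged complement, the sketch below fingerprints normalized text and reports validation records whose content already appears in training. The record texts and the `fingerprint` helper are hypothetical, and exact-match hashing after normalization catches only trivial variants, not paraphrases.

```python
import hashlib
import re

def fingerprint(text: str) -> str:
    """Hash text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Made-up records: ticket-201 repeats ticket-101 with different case and spacing.
training = {
    "ticket-101": "Where is my package?",
    "ticket-102": "Refund request for order 77",
}
validation = {
    "ticket-201": "  where is MY package? ",
    "ticket-202": "Address change for order 12",
}

training_prints = {fingerprint(text): record_id for record_id, text in training.items()}
leaks = [
    (validation_id, training_prints[fingerprint(text)])
    for validation_id, text in validation.items()
    if fingerprint(text) in training_prints
]

if leaks:
    print("FAIL duplicated content:", leaks)
else:
    print("PASS no duplicated content found")
```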

For public LLM benchmarks, test content can also enter later training corpora. LiveBench addresses that risk with frequently updated questions from recent sources and objective ground-truth scoring; it limits contamination risk rather than making every future score immune to leakage.^[5]

Evaluation hygiene flow separating training records from held-out and rotating challenge sets. — Protect measurement before debating scores: reject train/eval overlap, keep a private holdout, and refresh challenge cases over time.

Build an evaluation report

An engineering metric becomes useful when it ships with enough context to reproduce a decision. A compact report should include metric value, protocol fields, leakage checks, and product task gates.

emit-a-release-evaluation-report.py

from dataclasses import dataclass

@dataclass(frozen=True)
class Report:
    checkpoint: str
    perplexity: float
    dataset: str
    tokenizer: str
    context_tokens: int
    stride_tokens: int
    leaked_records: int
    policy_pass_rate: float

def release_gate(report: Report) -> str:
    if report.leaked_records:
        return "BLOCK: contaminated evaluation set"
    if report.policy_pass_rate < 1.0:
        return "BLOCK: product regressions"
    return "PASS: protocol recorded and product checks passed"

report = Report(
    checkpoint="support-lm-0800",
    perplexity=9.81,
    dataset="support-holdout-v3",
    tokenizer="bpe-v7",
    context_tokens=2048,
    stride_tokens=512,
    leaked_records=0,
    policy_pass_rate=1.0,
)

print(f"{report.checkpoint}: PPL={report.perplexity} @ {report.context_tokens}/{report.stride_tokens}")
print(release_gate(report))

Output

support-lm-0800: PPL=9.81 @ 2048/512
PASS: protocol recorded and product checks passed

The report refuses two common shortcuts: treating an untrusted held-out set as evidence, and treating language fit as a substitute for application correctness.
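To make a report reproducible, it helps to store it in a form that can be reloaded and compared later. The sketch below serializes the same fields to JSON with the standard library and reads them back. Storing reports as JSON is an assumption of this example, not a requirement of the lesson.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class Report:
    checkpoint: str
    perplexity: float
    dataset: str
    tokenizer: str
    context_tokens: int
    stride_tokens: int
    leaked_records: int
    policy_pass_rate: float

report = Report(
    checkpoint="support-lm-0800",
    perplexity=9.81,
    dataset="support-holdout-v3",
    tokenizer="bpe-v7",
    context_tokens=2048,
    stride_tokens=512,
    leaked_records=0,
    policy_pass_rate=1.0,
)

# sort_keys keeps the serialized form stable, so stored reports diff cleanly.
record = json.dumps(asdict(report), sort_keys=True)
restored = Report(**json.loads(record))

print(record)
print("round trip preserved the report:", restored == report)
```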

Key takeaways

Perplexity is exp(average NLL): an interpretable view of held-out next-token surprise.
Raw PPL comparison requires the same dataset, tokenizer, objective, context, stride, and masking policy.
Bits per byte puts models with different tokenizers onto one raw-text denominator.
Long-document evaluation must score new target tokens once while reusing context and aggregating loss before exponentiating.
Low PPL doesn't establish factual, useful, or safe outputs; application checks and calibrated review answer those questions.
Leakage invalidates confident evaluation claims, whether the set measures PPL or product behavior.

Mastery check

Key concepts

Held-out next-token likelihood
Cross-entropy to perplexity
Stable log-probability scoring
Evaluation protocol contracts
Bits-per-byte normalization
Strided context windows
Intrinsic versus product quality
Leakage-resistant evaluation sets

Evaluation rubric

Foundational: Computes token surprise, average NLL, and PPL from observed probabilities.
Intermediate: Explains effective choice count without treating it as a vocabulary-size claim.
Intermediate: Rejects invalid raw PPL comparisons by checking protocol fields.
Advanced: Uses BPB when tokenizers differ and aggregates strided loss correctly.
Advanced: Designs a report that separates language-fit metrics from product and leakage gates.


Common pitfalls

Comparing scores without a protocol

Symptom: A team declares victory from PPL 10 versus PPL 12 but can't name the tokenizer, data split, or stride.

Cause: The score was treated as a universal model rating instead of a metric with units and conditioning rules.

Fix: Store the evaluation contract with every result and compare raw PPL only when contracts match.

Averaging window perplexities

Symptom: Long-document PPL changes when window boundaries move, even though scored token losses are unchanged.

Cause: Per-window perplexities were averaged directly.

Fix: Sum token NLL across all windows, divide by scored-token count once, then exponentiate.

Selecting chat behavior using language fit alone

Symptom: A fluent model ships a wrong policy answer because it had the lowest PPL.

Cause: Intrinsic next-token evaluation was confused with application correctness.

Fix: Gate releases on deterministic task checks and calibrated review in addition to base-model fit metrics.

Testing on leaked records

Symptom: Evaluation looks unusually strong, then fails on genuinely new tickets.

Cause: Training or prompt examples overlap with hidden evaluation data.

Fix: Enforce disjoint identifiers, keep private held-out records, and rotate realistic challenge cases.

Next Step

Continue to File Ingestion for AI

You can now measure prediction fit and protect evaluation sets; next you will turn source documents into clean, traceable records that a system can evaluate and cite.


References

[1] Hugging Face. Perplexity of fixed-length models. 2026.

[2] Magnusson, I., et al. PALOMA: A Benchmark for Evaluating Language Model Fit. NeurIPS 2024 Datasets and Benchmarks Track, 2024.

[3] Holtzman, A., et al. The Curious Case of Neural Text Degeneration. ICLR 2020, 2020.

[4] Zheng, L., et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023, 2023.

[5] White, C., et al. LiveBench: A Challenging, Contamination-Limited LLM Benchmark. 2024.

Back to Topics

LearnCore LLM FoundationsPerplexity & Model Evaluation

📊MediumEvaluation & Benchmarks

Perplexity & Model Evaluation

Compute perplexity from held-out token probabilities, compare models under a fixed protocol, normalize across tokenizers, and decide what PPL can't tell you.

15 min read

Learning path

Step 48 of 155 in the full curriculum

Static to Contextual Embeddings File Ingestion for AI

Key idea: Perplexity is a next-token fit metric for causal language models. It's useful when you keep the evaluation contract fixed. It isn't a score for factuality, helpfulness, or safe product behavior.

From surprise to a metric

If the observed token has probability $p$ , its negative log-likelihood (NLL), or surprise, is:

\text{surprise}(p) = -\ln(p)

A certain correct prediction has probability 1 and surprise 0. A probability close to 0 produces a large penalty. This is why confident misses hurt language-model loss so much.

compare-token-surprise.py

import math

probabilities = {
    "hub": 0.60,
    "warehouse": 0.25,
    "volcano": 0.001,
}

for token in ["hub", "warehouse", "volcano"]:
    surprise = -math.log(probabilities[token])
    print(f"{token:9s} probability={probabilities[token]:.3f} surprise={surprise:.3f} nats")

Output

hub       probability=0.600 surprise=0.511 nats
warehouse probability=0.250 surprise=1.386 nats
volcano   probability=0.001 surprise=6.908 nats

The held-out token matters. If the log says hub, the first score counts. It doesn't matter that warehouse also sounded reasonable: likelihood evaluates the text the model was asked to predict.

Average surprise becomes perplexity

For held-out tokens $x_1, x_2, \ldots, x_N$ , the average NLL is:

L = -\frac{1}{N}\sum_{i=1}^{N}\ln p_\theta(x_i \mid x_{<i})

Perplexity exponentiates that average:

\text{PPL} = \exp(L)

Start with three observed token probabilities:

compute-perplexity-from-probabilities.py

import math

observed_probabilities = [0.50, 0.10, 0.80]
token_nll = [-math.log(probability) for probability in observed_probabilities]
average_nll = sum(token_nll) / len(token_nll)
perplexity = math.exp(average_nll)

print(f"token NLL: {[round(value, 3) for value in token_nll]}")
print(f"average NLL: {average_nll:.3f} nats")
print(f"perplexity: {perplexity:.2f}")

Output

token NLL: [0.693, 2.303, 0.223]
average NLL: 1.073 nats
perplexity: 2.92

recover-effective-choice-count.py

import math

for equally_likely_options in [1, 4, 20, 100]:
    probability = 1 / equally_likely_options
    loss = -math.log(probability)
    perplexity = math.exp(loss)
    print(f"{equally_likely_options:3d} options -> loss={loss:.3f}, PPL={perplexity:.1f}")

Output

options -> loss=-0.000, PPL=1.0
options -> loss=1.386, PPL=4.0
options -> loss=2.996, PPL=20.0
options -> loss=4.605, PPL=100.0

Compute from logits without numerical failure

stable-log-softmax-for-perplexity.py

import math

def stable_log_softmax(logits: list[float]) -> list[float]:
    maximum = max(logits)
    log_normalizer = maximum + math.log(
        sum(math.exp(value - maximum) for value in logits)
    )
    return [value - log_normalizer for value in logits]

logits = [1000.0, 998.0, 997.0]
observed_token_id = 0

try:
    math.exp(logits[observed_token_id])
except OverflowError:
    print("naive exp(logit) overflowed")

log_probabilities = stable_log_softmax(logits)
nll = -log_probabilities[observed_token_id]
print(f"stable NLL={nll:.3f}, PPL={math.exp(nll):.3f}")

Output

naive exp(logit) overflowed
stable NLL=0.170, PPL=1.185

The comparison contract

A perplexity score is never complete without its units and conditioning rules. At minimum, log:

Contract field	Why it changes the score
Dataset and split	A delivery-status corpus isn't a legal-policy corpus; train data isn't held-out data.
Tokenizer revision	Tokens set the denominator and the events being predicted.
Context length and stride	More usable left context generally makes token prediction easier.
Special-token and masking policy	Scoring or skipping initial and padding tokens changes the aggregate.
Model objective	A causal next-token model isn't directly comparable to a masked-language objective.

Represent that contract in code before you compare checkpoints:

enforce-a-perplexity-contract.py

from dataclasses import dataclass

@dataclass(frozen=True)
class EvaluationContract:
    dataset: str
    tokenizer: str
    context_tokens: int
    stride_tokens: int
    objective: str = "causal-next-token"

def comparable(left: EvaluationContract, right: EvaluationContract) -> bool:
    return left == right

baseline = EvaluationContract("support-holdout-v3", "bpe-v7", 2048, 512)
new_checkpoint = EvaluationContract("support-holdout-v3", "bpe-v7", 2048, 512)
short_context_run = EvaluationContract("support-holdout-v3", "bpe-v7", 512, 512)

print("baseline vs new checkpoint:", comparable(baseline, new_checkpoint))
print("baseline vs short context:", comparable(baseline, short_context_run))

Output

baseline vs new checkpoint: True
baseline vs short context: False

Now compare two model checkpoints on the same token outcomes:

compare-checkpoints-on-one-held-out-set.py

import math

def perplexity(observed_probabilities: list[float]) -> float:
    average_nll = sum(-math.log(p) for p in observed_probabilities) / len(
        observed_probabilities
    )
    return math.exp(average_nll)

held_out_probabilities = {
    "checkpoint-0400": [0.31, 0.44, 0.18, 0.52, 0.24],
    "checkpoint-0800": [0.42, 0.59, 0.29, 0.61, 0.35],
}

for name, probabilities in held_out_probabilities.items():
    print(f"{name}: PPL={perplexity(probabilities):.2f}")

Output

checkpoint-0400: PPL=3.18
checkpoint-0800: PPL=2.31

The result supports a narrow statement: checkpoint-0800 predicts tokens in this held-out set better under this protocol. It doesn't yet prove better support answers.

Aggregate loss once

aggregate-loss-before-exponentiating.py

import math

windows = [
    {"average_nll": 0.50, "scored_tokens": 2},
    {"average_nll": 2.00, "scored_tokens": 8},
]

wrong = sum(math.exp(window["average_nll"]) for window in windows) / len(windows)
total_nll = sum(
    window["average_nll"] * window["scored_tokens"] for window in windows
)
total_tokens = sum(window["scored_tokens"] for window in windows)
right = math.exp(total_nll / total_tokens)

print(f"average of window PPLs: {wrong:.2f}")
print(f"token-weighted corpus PPL: {right:.2f}")

Output

average of window PPLs: 4.52
token-weighted corpus PPL: 5.47

Different tokenizers need common units

\text{BPB} = \frac{-\sum_i \ln p_\theta(x_i \mid x_{<i})} {B \ln 2}

where $B$ is the byte count of the original text. A related metric, bits per character, is useful when a benchmark defines character units instead of bytes.

compare-tokenizers-with-bits-per-byte.py

import math

text = "Package at hub"
byte_count = len(text.encode("utf-8"))
evaluations = [
    {"name": "subword model", "tokens": 4, "total_nll": 8.4},
    {"name": "character model", "tokens": 14, "total_nll": 9.0},
]

for run in evaluations:
    ppl = math.exp(run["total_nll"] / run["tokens"])
    bpb = run["total_nll"] / (byte_count * math.log(2))
    print(f"{run['name']:15s} token PPL={ppl:.2f}, BPB={bpb:.3f}")

print("Lower BPB identifies less surprise on identical bytes.")

Output

subword model   token PPL=8.17, BPB=0.866
character model token PPL=1.90, BPB=0.927
Lower BPB identifies less surprise on identical bytes.

Long documents need a scoring policy

First simulate which positions a sliding-window loop scores:

score-each-target-token-once.py

tokens = list("ABCDEFGHIJ")
windows = [
    {"context": (0, 5), "score": (1, 5)},
    {"context": (3, 8), "score": (5, 8)},
    {"context": (5, 10), "score": (8, 10)},
]

scored_tokens: list[str] = []
for index, window in enumerate(windows, start=1):
    begin, end = window["context"]
    score_begin, score_end = window["score"]
    context = "".join(tokens[begin:end])
    scored = "".join(tokens[score_begin:score_end])
    scored_tokens.extend(tokens[score_begin:score_end])
    print(f"window {index}: context={context}, newly scored={scored}")

print("scored exactly once:", scored_tokens == tokens[1:])

Output

window 1: context=ABCDE, newly scored=BCDE
window 2: context=DEFGH, newly scored=FGH
window 3: context=FGHIJ, newly scored=IJ
scored exactly once: True
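
The window table above was written by hand. As a sketch of one possible policy, the hypothetical helper `plan_windows` below derives the same spans from a context length and a stride: the first window fills the context, each later window advances its end by the stride, and the last window is clipped at the document end while keeping a full-length context. Other policies (for example, shrinking the final context) are equally valid as long as the choice is recorded.

```python
def plan_windows(n_tokens: int, context: int, stride: int) -> list[dict[str, tuple[int, int]]]:
    """Plan context and score spans so each target position is scored once."""
    if not 0 < stride <= context:
        raise ValueError("stride must be between 1 and the context length")
    windows = []
    prev_end = 0
    while prev_end < n_tokens:
        # First window uses the full context; later windows advance by the stride.
        end = min(context if prev_end == 0 else prev_end + stride, n_tokens)
        begin = max(0, end - context)
        # Position 0 has no left context, so it is never a scored target.
        score_begin = max(prev_end, 1)
        if score_begin < end:
            windows.append({"context": (begin, end), "score": (score_begin, end)})
        prev_end = end
    return windows

tokens = list("ABCDEFGHIJ")
plan = plan_windows(len(tokens), context=5, stride=3)
for window in plan:
    print(window)

scored_positions = [i for window in plan for i in range(*window["score"])]
print("scored exactly once:", scored_positions == list(range(1, len(tokens))))
```

With 10 tokens, a context of 5, and a stride of 3, the plan matches the three hand-written windows.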

Here is a dependency-free aggregation over precomputed per-token NLL values, one list per window, holding only the newly scored targets. A real model would supply the losses; the aggregation logic stays the same.

aggregate-a-strided-evaluation-run.py

import math

new_target_losses = [
    [0.30, 0.72, 0.51, 0.43],
    [0.27, 0.61, 0.38],
    [0.56, 0.48],
]

total_nll = sum(sum(window) for window in new_target_losses)
scored_tokens = sum(len(window) for window in new_target_losses)
perplexity = math.exp(total_nll / scored_tokens)

print(f"scored tokens={scored_tokens}")
print(f"average NLL={total_nll / scored_tokens:.3f}")
print(f"PPL={perplexity:.2f}")

Output

scored tokens=9
average NLL=0.473
PPL=1.61
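
The lists above already contain only new targets. A real forward pass returns a loss for every position in the window, including the overlap, so the evaluator has to slice out the new targets itself. A minimal sketch under stated assumptions: the overlap losses (0.47, 0.60, 0.40) are made up, and `NO_PREDICTION` is a hypothetical marker for the first position of each window, which has no in-window context. Using NaN for it means an off-by-one slice poisons the total instead of passing silently.

```python
import math

NO_PREDICTION = float("nan")  # first token in a window has nothing to condition on

# Each entry: (per-position NLL aligned to the context span, context span, score span).
# Overlap positions are recomputed context and must not be counted a second time.
windows = [
    ([NO_PREDICTION, 0.30, 0.72, 0.51, 0.43], (0, 5), (1, 5)),
    ([NO_PREDICTION, 0.47, 0.27, 0.61, 0.38], (3, 8), (5, 8)),
    ([NO_PREDICTION, 0.60, 0.40, 0.56, 0.48], (5, 10), (8, 10)),
]

total_nll = 0.0
scored_tokens = 0
for position_nll, (context_begin, _), (score_begin, score_end) in windows:
    # Convert absolute token positions into offsets inside this window.
    new_losses = position_nll[score_begin - context_begin:score_end - context_begin]
    total_nll += sum(new_losses)
    scored_tokens += len(new_losses)

print(f"scored tokens={scored_tokens}")
print(f"PPL={math.exp(total_nll / scored_tokens):.2f}")
```

The slice recovers the same nine losses as the precomputed lists, so the result is again 9 scored tokens and PPL=1.61.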

PPL answers one question, not every question

Use PPL for the question it answers:

Decision	Useful measurement
Did a base-model checkpoint get better at held-out next-token prediction?	PPL under fixed protocol, or BPB across tokenizers
Did the assistant provide the correct order status and cite supplied evidence?	Task-specific deterministic checks
Did an open-ended reply follow a rubric for clarity and groundedness?	Calibrated judge or human review
Is a release safe for a high-impact workflow?	Task regressions plus human-reviewed edge cases

separate-language-fit-from-product-correctness.py

candidates = [
    {"name": "fluent-wrong", "ppl": 8.9, "policy_checks_passed": 1},
    {"name": "grounded-answer", "ppl": 9.8, "policy_checks_passed": 3},
]

best_language_fit = min(candidates, key=lambda row: row["ppl"])
best_product_answer = max(candidates, key=lambda row: row["policy_checks_passed"])

print("best held-out language fit:", best_language_fit["name"])
print("best support policy result:", best_product_answer["name"])

Output

best held-out language fit: fluent-wrong
best support policy result: grounded-answer
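
One way to combine the two signals is to treat product checks as a gate and PPL as a tiebreaker among the candidates that pass. A minimal sketch, assuming three policy checks per candidate (`REQUIRED_CHECKS` and `pick_release` are hypothetical names; the rows above record only how many checks passed):

```python
REQUIRED_CHECKS = 3  # assumption: total number of policy checks per candidate

candidates = [
    {"name": "fluent-wrong", "ppl": 8.9, "policy_checks_passed": 1},
    {"name": "grounded-answer", "ppl": 9.8, "policy_checks_passed": 3},
]

def pick_release(rows):
    """Return the lowest-PPL candidate among those passing every check, or None."""
    eligible = [row for row in rows if row["policy_checks_passed"] == REQUIRED_CHECKS]
    if not eligible:
        return None
    return min(eligible, key=lambda row: row["ppl"])

choice = pick_release(candidates)
print("release candidate:", choice["name"] if choice else "none")
```

Here the gate removes `fluent-wrong` before PPL is consulted, so the lower score never gets a chance to override a failed policy check.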

When deterministic checks end

Many support cases can be checked without a model judge: the expected status, the required citation identifier, and the absence of a forbidden policy claim can all be verified deterministically. Use those checks first.

build-a-deterministic-support-eval.py

EXPECTED_STATUS = "delayed"
REQUIRED_SOURCE = "tracking_event_483"

def score_answer(answer: dict[str, str]) -> tuple[int, list[str]]:
    failures: list[str] = []
    if answer["status"] != EXPECTED_STATUS:
        failures.append("wrong status")
    if answer["source"] != REQUIRED_SOURCE:
        failures.append("missing evidence")
    return 2 - len(failures), failures

answers = [
    {"name": "A", "status": "delayed", "source": "tracking_event_483"},
    {"name": "B", "status": "delivered", "source": "tracking_event_483"},
]

for answer in answers:
    score, failures = score_answer(answer)
    print(answer["name"], score, failures or ["pass"])

Output

A 2 ['pass']
B 1 ['wrong status']
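
The function above checks status and evidence. The third deterministic check named earlier, a forbidden policy claim, can be sketched in the same style. The `text` field and the phrases in `FORBIDDEN_CLAIMS` are hypothetical, chosen only for illustration; substring matching is a deliberately simple stand-in for whatever rule a real policy defines.

```python
FORBIDDEN_CLAIMS = ("guaranteed refund", "delivery by tomorrow")  # hypothetical phrases

def forbidden_claims(answer_text: str) -> list[str]:
    """Return every forbidden phrase found in the answer, ignoring case."""
    lowered = answer_text.lower()
    return [claim for claim in FORBIDDEN_CLAIMS if claim in lowered]

answers = [
    {"name": "A", "text": "Your package is delayed at the hub."},
    {"name": "C", "text": "It is delayed, but you get a Guaranteed Refund."},
]

for answer in answers:
    hits = forbidden_claims(answer["text"])
    print(answer["name"], "FAIL " + ", ".join(hits) if hits else "pass")
```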

This lesson draws the boundary. Later evaluation lessons build judge calibration, benchmark selection, and online experimentation in depth.

Keep evaluation data clean

If a held-out record also appears in the training data, a strong score may reflect memorization rather than generalization. Fail the run when record identifiers overlap:

fail-on-evaluation-data-leakage.py

training_record_ids = {"ticket-101", "ticket-102", "ticket-103"}
validation_record_ids = {"ticket-201", "ticket-202", "ticket-103"}

overlap = training_record_ids & validation_record_ids
if overlap:
    print("FAIL leaked record ids:", sorted(overlap))
else:
    print("PASS validation set is disjoint")

Output

FAIL leaked record ids: ['ticket-103']
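
Identifier checks miss the same ticket stored under two different ids. A complementary sketch, assuming the record text is available, compares a hash of normalized text. The `normalize` rule (lowercase, collapsed whitespace) and the sample tickets are illustrative assumptions, and exact-hash matching will not catch paraphrases.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide a duplicate."""
    return " ".join(text.lower().split())

def fingerprint(text: str) -> str:
    """Stable content hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

training = {
    "ticket-101": "Where is my package?",
    "ticket-102": "Refund request for order 88",
}
validation = {
    "ticket-201": "where is  my package?",  # same text as ticket-101, different id
    "ticket-202": "Change delivery address",
}

train_prints = {fingerprint(text) for text in training.values()}
leaked = sorted(
    record_id for record_id, text in validation.items()
    if fingerprint(text) in train_prints
)

if leaked:
    print("FAIL leaked by content:", leaked)
else:
    print("PASS no content overlap")
```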

Build an evaluation report

An engineering metric becomes useful when it ships with enough context to reproduce a decision. A compact report should include the metric value, the protocol fields, the leakage checks, and the product task gates.

emit-a-release-evaluation-report.py

from dataclasses import dataclass

@dataclass(frozen=True)
class Report:
    checkpoint: str
    perplexity: float
    dataset: str
    tokenizer: str
    context_tokens: int
    stride_tokens: int
    leaked_records: int
    policy_pass_rate: float

def release_gate(report: Report) -> str:
    if report.leaked_records:
        return "BLOCK: contaminated evaluation set"
    if report.policy_pass_rate < 1.0:
        return "BLOCK: product regressions"
    return "PASS: protocol recorded and product checks passed"

report = Report(
    checkpoint="support-lm-0800",
    perplexity=9.81,
    dataset="support-holdout-v3",
    tokenizer="bpe-v7",
    context_tokens=2048,
    stride_tokens=512,
    leaked_records=0,
    policy_pass_rate=1.0,
)

print(f"{report.checkpoint}: PPL={report.perplexity} @ {report.context_tokens}/{report.stride_tokens}")
print(release_gate(report))

Output

support-lm-0800: PPL=9.81 @ 2048/512
PASS: protocol recorded and product checks passed

The report refuses two common shortcuts: treating an untrusted held-out set as evidence, and treating language fit as a substitute for application correctness.
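
Before the PPL values of two reports are compared, their protocol fields should match. A minimal sketch, using a plain dict per run instead of the `Report` class so the block runs on its own. `PROTOCOL_FIELDS` and `protocol_mismatches` are hypothetical names, the baseline run (`support-lm-0700`, PPL 10.4) is invented for illustration, and a fuller contract would also record objective and masking policy.

```python
PROTOCOL_FIELDS = ("dataset", "tokenizer", "context_tokens", "stride_tokens")

def protocol_mismatches(run_a: dict, run_b: dict) -> list[str]:
    """List protocol fields whose values differ between two evaluation runs."""
    return [field for field in PROTOCOL_FIELDS if run_a[field] != run_b[field]]

# Hypothetical baseline run, evaluated with a different stride.
baseline = {
    "checkpoint": "support-lm-0700", "perplexity": 10.4,
    "dataset": "support-holdout-v3", "tokenizer": "bpe-v7",
    "context_tokens": 2048, "stride_tokens": 2048,
}
candidate = {
    "checkpoint": "support-lm-0800", "perplexity": 9.81,
    "dataset": "support-holdout-v3", "tokenizer": "bpe-v7",
    "context_tokens": 2048, "stride_tokens": 512,
}

mismatches = protocol_mismatches(baseline, candidate)
if mismatches:
    print("NOT COMPARABLE, differing fields:", mismatches)
else:
    delta = round(candidate["perplexity"] - baseline["perplexity"], 2)
    print("comparable: PPL delta =", delta)
```

The candidate's lower PPL cannot be credited to the checkpoint here, because the stride changed between runs and a smaller stride gives each scored token more context.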

Key takeaways

Perplexity is exp(average NLL): an interpretable view of held-out next-token surprise.
Raw PPL comparison requires the same dataset, tokenizer, objective, context, stride, and masking policy.
Bits per byte puts models with different tokenizers onto one raw-text denominator.
Long-document evaluation must score new target tokens once while reusing context and aggregating loss before exponentiating.
Low PPL doesn't establish factual, useful, or safe outputs; application checks and calibrated review answer those questions.
Leakage invalidates confident evaluation claims, whether the set measures PPL or product behavior.

Mastery check

Key concepts

Held-out next-token likelihood
Cross-entropy to perplexity
Stable log-probability scoring
Evaluation protocol contracts
Bits-per-byte normalization
Strided context windows
Intrinsic versus product quality
Leakage-resistant evaluation sets

Evaluation rubric

Foundational: Computes token surprise, average NLL, and PPL from observed probabilities.
Intermediate: Explains effective choice count without treating it as a vocabulary-size claim.
Intermediate: Rejects invalid raw PPL comparisons by checking protocol fields.
Advanced: Uses BPB when tokenizers differ and aggregates strided loss correctly.
Advanced: Designs a report that separates language-fit metrics from product and leakage gates.

Follow-up questions

Common pitfalls

Comparing scores without a protocol

Symptom: A team declares victory from PPL 10 versus PPL 12 but can't name the tokenizer, data split, or stride.

Cause: The score was treated as a universal model rating instead of a metric with units and conditioning rules.

Fix: Store the evaluation contract with every result and compare raw PPL only when contracts match.

Averaging window perplexities

Symptom: Long-document PPL changes when window boundaries move, even though scored token losses are unchanged.

Cause: Per-window perplexities were averaged directly.

Fix: Sum token NLL across all windows, divide by scored-token count once, then exponentiate.

Selecting chat behavior using language fit alone

Symptom: A fluent model ships a wrong policy answer because it had the lowest PPL.

Cause: Intrinsic next-token evaluation was confused with application correctness.

Fix: Gate releases on deterministic task checks and calibrated review in addition to base-model fit metrics.

Testing on leaked records

Symptom: Evaluation looks unusually strong, then fails on genuinely new tickets.

Cause: Training or prompt examples overlap with hidden evaluation data.

Fix: Enforce disjoint identifiers, keep private held-out records, and rotate realistic challenge cases.

Next Step

Continue to File Ingestion for AI

You can now measure prediction fit and protect evaluation sets; next you will turn source documents into clean, traceable records that a system can evaluate and cite.

Perplexity & Model Evaluation

From surprise to a metric

A model assigns probability 0.8 to the observed token on one step and 0.02 on another. Which step dominates its loss, and why?

Average surprise becomes perplexity

Compute from logits without numerical failure

Why should an evaluator accumulate negative log-likelihood instead of multiplying token probabilities together?

The comparison contract

Aggregate loss once

Run A reports PPL 12 and run B reports PPL 10. What must be checked before declaring B better?

Different tokenizers need common units

Long documents need a scoring policy

PPL answers one question, not every question

When deterministic checks end

Keep evaluation data clean

Build an evaluation report

Key takeaways

Mastery check

Key concepts

Evaluation rubric

Follow-up questions

Why can two models not be compared by raw PPL when their tokenizers differ?

Why does decreasing stride often lower PPL for the same fixed-context model?

A support model has lower PPL but fails refund-policy checks. Which model should ship?

Common pitfalls

Comparing scores without a protocol

Averaging window perplexities

Selecting chat behavior using language fit alone

Testing on leaked records
