LeetLLM
📊 Medium · Evaluation & Benchmarks

Perplexity & Model Evaluation

Compute perplexity from held-out token probabilities, compare models under a fixed protocol, normalize across tokenizers, and recognize what PPL can't tell you.

15 min read

Tokens gave the model its vocabulary. Embeddings gave those tokens useful geometry. Now an engineer needs a measurement: when the real next token appears, how much probability did the model assign to it?

Suppose a shipping update begins "The package left the". A model that assigns high probability to "hub" and low probability to "volcano" captures this local pattern better than a model that treats both as equally plausible. Perplexity converts that held-out prediction behavior into one number.

You will build that number from probabilities, implement the details that keep it numerically stable and comparable, and write an evaluation report that answers a more important question than "is the score low?": is this score comparable to the previous run, and is it enough for the product decision?

Key idea: Perplexity is a next-token fit metric for causal language models. It's useful when you keep the evaluation contract fixed. It isn't a score for factuality, helpfulness, or safe product behavior.

From surprise to a metric

A causal language model assigns a probability to every possible next token. During evaluation, you don't reward the model for a token it could have emitted. You score the probability it assigned to the token that actually occurred in held-out text.

If the observed token has probability $p$, its negative log-likelihood (NLL), or surprise, is:

$$\text{surprise}(p) = -\ln(p)$$

A certain correct prediction has probability 1 and surprise 0. A probability close to 0 produces a large penalty. This is why confident misses hurt language-model loss so much.

compare-token-surprise.py

    import math

    probabilities = {
        "hub": 0.60,
        "warehouse": 0.25,
        "volcano": 0.001,
    }

    for token in ["hub", "warehouse", "volcano"]:
        surprise = -math.log(probabilities[token])
        print(f"{token:9s} probability={probabilities[token]:.3f} surprise={surprise:.3f} nats")

Output

    hub       probability=0.600 surprise=0.511 nats
    warehouse probability=0.250 surprise=1.386 nats
    volcano   probability=0.001 surprise=6.908 nats

The held-out token matters. If the held-out text says "hub", only the first score counts. It doesn't matter that "warehouse" also sounded reasonable: likelihood evaluates the text the model was asked to predict.

Average surprise becomes perplexity

For held-out tokens $x_1, x_2, \ldots, x_N$, the average NLL is:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\ln p_\theta(x_i \mid x_{<i})$$

Perplexity exponentiates that average:

$$\text{PPL} = \exp(L)$$

This is the standard definition used for autoregressive, or causal, language models. It isn't the standard metric for masked models such as BERT, because they are trained to predict masked positions rather than the next token in sequence.[1]

Start with three observed token probabilities:

compute-perplexity-from-probabilities.py

    import math

    observed_probabilities = [0.50, 0.10, 0.80]
    token_nll = [-math.log(probability) for probability in observed_probabilities]
    average_nll = sum(token_nll) / len(token_nll)
    perplexity = math.exp(average_nll)

    print(f"token NLL: {[round(value, 3) for value in token_nll]}")
    print(f"average NLL: {average_nll:.3f} nats")
    print(f"perplexity: {perplexity:.2f}")

Output

    token NLL: [0.693, 2.303, 0.223]
    average NLL: 1.073 nats
    perplexity: 2.92

The output means the model behaved, on average, as though it faced about 2.92 equally likely choices at each prediction step. That effective choice count is an interpretation, not a claim that exactly 2.92 vocabulary tokens were available.
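
The effective-choice reading rests on an algebraic identity: exponentiating the average NLL is the same as taking the inverse geometric mean of the observed probabilities. The sketch below checks that identity on the three probabilities used above; the variable names are illustrative and not part of any library.

```python
import math

observed_probabilities = [0.50, 0.10, 0.80]
count = len(observed_probabilities)

# Route 1: exponentiate the average negative log-likelihood.
average_nll = sum(-math.log(p) for p in observed_probabilities) / count
ppl_from_nll = math.exp(average_nll)

# Route 2: inverse geometric mean of the observed probabilities.
geometric_mean = math.prod(observed_probabilities) ** (1 / count)
ppl_from_geometric_mean = 1 / geometric_mean

print(f"exp(average NLL):   {ppl_from_nll:.4f}")
print(f"1 / geometric mean: {ppl_from_geometric_mean:.4f}")
```

Because a geometric mean is pulled down sharply by small factors, a single near-zero probability raises PPL far more than a single near-one probability lowers it.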

recover-effective-choice-count.py

    import math

    for equally_likely_options in [1, 4, 20, 100]:
        probability = 1 / equally_likely_options
        loss = -math.log(probability)
        perplexity = math.exp(loss)
        print(f"{equally_likely_options:3d} options -> loss={loss:.3f}, PPL={perplexity:.1f}")

Output

      1 options -> loss=-0.000, PPL=1.0
      4 options -> loss=1.386, PPL=4.0
     20 options -> loss=2.996, PPL=20.0
    100 options -> loss=4.605, PPL=100.0

[Figure: Line chart showing how average negative log-likelihood maps exponentially to perplexity.]
Read loss and perplexity as two views of the same average surprise: a fixed loss reduction multiplies perplexity by a fixed ratio.

[Figure: Perplexity shown as effective next-token branching factor.]
Effective choices are intuitive only inside one evaluation setup; a lower value is better only after the protocol matches.
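
The "fixed ratio" reading of the loss curve can be checked numerically. This is a minimal sketch with illustrative baseline losses and an assumed reduction of 0.10 nats per token: whatever the starting loss, PPL shrinks by the same factor, exp(-0.10).

```python
import math

loss_reduction = 0.10  # nats per token; an illustrative value, not a benchmark result
ratios = []

for baseline_loss in [1.0, 2.0, 3.0]:
    before = math.exp(baseline_loss)
    after = math.exp(baseline_loss - loss_reduction)
    ratios.append(after / before)
    print(
        f"loss {baseline_loss:.2f} -> {baseline_loss - loss_reduction:.2f}: "
        f"PPL {before:.2f} -> {after:.2f}, ratio={after / before:.4f}"
    )

print(f"expected ratio exp(-{loss_reduction}) = {math.exp(-loss_reduction):.4f}")
```

One practical consequence is that the same loss improvement looks like a much larger absolute PPL drop at high loss than at low loss, so differences in loss are often easier to compare across training stages.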

Compute from logits without numerical failure

Models produce logits, not probabilities. A naive implementation calls exp(logit) directly. Large logits can overflow even though the eventual softmax probabilities are ordinary values. Stable log-softmax subtracts the largest logit before exponentiating.

stable-log-softmax-for-perplexity.py

    import math

    def stable_log_softmax(logits: list[float]) -> list[float]:
        maximum = max(logits)
        log_normalizer = maximum + math.log(
            sum(math.exp(value - maximum) for value in logits)
        )
        return [value - log_normalizer for value in logits]

    logits = [1000.0, 998.0, 997.0]
    observed_token_id = 0

    try:
        math.exp(logits[observed_token_id])
    except OverflowError:
        print("naive exp(logit) overflowed")

    log_probabilities = stable_log_softmax(logits)
    nll = -log_probabilities[observed_token_id]
    print(f"stable NLL={nll:.3f}, PPL={math.exp(nll):.3f}")

Output

    naive exp(logit) overflowed
    stable NLL=0.170, PPL=1.185

In a framework evaluator, cross-entropy normally applies this stable computation for you. You still need to know the principle when debugging inf losses, implementing metrics, or reviewing a custom evaluation loop.
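
For a custom evaluation loop, the same principle extends from one prediction step to a whole sequence. The following is a dependency-free sketch, not a framework API: `log_sum_exp` and `sequence_perplexity` are hypothetical helper names, and the logits are made-up values over a three-token vocabulary.

```python
import math

def log_sum_exp(values: list[float]) -> float:
    """Stable log(sum(exp(v))): factor out the maximum before exponentiating."""
    maximum = max(values)
    return maximum + math.log(sum(math.exp(v - maximum) for v in values))

def sequence_perplexity(step_logits: list[list[float]], observed_ids: list[int]) -> float:
    """PPL over one sequence: one logit row and one observed token id per step."""
    if len(step_logits) != len(observed_ids):
        raise ValueError("need exactly one observed token id per logit row")
    total_nll = 0.0
    for logits, token_id in zip(step_logits, observed_ids):
        # -log softmax(logits)[token_id] = logsumexp(logits) - logits[token_id]
        total_nll += log_sum_exp(logits) - logits[token_id]
    return math.exp(total_nll / len(observed_ids))

# Hypothetical logits for three prediction steps.
step_logits = [
    [1000.0, 998.0, 997.0],
    [2.0, 0.5, -1.0],
    [-3.0, 0.0, 4.0],
]
observed_ids = [0, 1, 2]
sequence_ppl = sequence_perplexity(step_logits, observed_ids)
print(f"sequence PPL={sequence_ppl:.3f}")
```

The loop accumulates NLL in log space and exponentiates only once at the end, so no intermediate value depends on `exp` of a raw logit.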

The comparison contract

A perplexity score is never complete without its units and conditioning rules. At minimum, log:

| Contract field | Why it changes the score |
| --- | --- |
| Dataset and split | A delivery-status corpus isn't a legal-policy corpus; train data isn't held-out data. |
| Tokenizer revision | Tokens set the denominator and the events being predicted. |
| Context length and stride | More usable left context generally makes token prediction easier. |
| Special-token and masking policy | Scoring or skipping initial and padding tokens changes the aggregate. |
| Model objective | A causal next-token model isn't directly comparable to a masked-language objective. |

[Figure: Compact guide for valid versus invalid raw perplexity comparisons.]
A raw PPL improvement is credible only when dataset, tokenizer, context, stride, and objective are held fixed.

Represent that contract in code before you compare checkpoints:

enforce-a-perplexity-contract.py

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class EvaluationContract:
        dataset: str
        tokenizer: str
        context_tokens: int
        stride_tokens: int
        objective: str = "causal-next-token"

    def comparable(left: EvaluationContract, right: EvaluationContract) -> bool:
        return left == right

    baseline = EvaluationContract("support-holdout-v3", "bpe-v7", 2048, 512)
    new_checkpoint = EvaluationContract("support-holdout-v3", "bpe-v7", 2048, 512)
    short_context_run = EvaluationContract("support-holdout-v3", "bpe-v7", 512, 512)

    print("baseline vs new checkpoint:", comparable(baseline, new_checkpoint))
    print("baseline vs short context:", comparable(baseline, short_context_run))

Output

    baseline vs new checkpoint: True
    baseline vs short context: False
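
A boolean answer says that two runs are not comparable but not why. One possible extension, sketched here with a hypothetical helper named `contract_differences`, lists the fields that disagree so a reviewer can see which setting drifted.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class EvaluationContract:
    dataset: str
    tokenizer: str
    context_tokens: int
    stride_tokens: int
    objective: str = "causal-next-token"

def contract_differences(
    left: EvaluationContract, right: EvaluationContract
) -> dict[str, tuple[object, object]]:
    """Map each mismatched field name to its (left, right) values."""
    left_fields, right_fields = asdict(left), asdict(right)
    return {
        name: (left_fields[name], right_fields[name])
        for name in left_fields
        if left_fields[name] != right_fields[name]
    }

baseline = EvaluationContract("support-holdout-v3", "bpe-v7", 2048, 512)
short_context_run = EvaluationContract("support-holdout-v3", "bpe-v7", 512, 512)

differences = contract_differences(baseline, short_context_run)
print(differences or "contracts match")
```

An empty dictionary means the raw PPL values can be compared directly; any entry names a setting that has to be aligned first.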

Now compare two model checkpoints on the same token outcomes:

compare-checkpoints-on-one-held-out-set.py

    import math

    def perplexity(observed_probabilities: list[float]) -> float:
        average_nll = sum(-math.log(p) for p in observed_probabilities) / len(
            observed_probabilities
        )
        return math.exp(average_nll)

    held_out_probabilities = {
        "checkpoint-0400": [0.31, 0.44, 0.18, 0.52, 0.24],
        "checkpoint-0800": [0.42, 0.59, 0.29, 0.61, 0.35],
    }

    for name, probabilities in held_out_probabilities.items():
        print(f"{name}: PPL={perplexity(probabilities):.2f}")

Output

    checkpoint-0400: PPL=3.18
    checkpoint-0800: PPL=2.31

The result supports a narrow statement: checkpoint-0800 predicts tokens in this held-out set better under this protocol. It doesn't yet prove better support answers.

Aggregate loss once

Evaluation windows are rarely the same size. Averaging each window's already exponentiated PPL weights every window equally, so a short window counts as much as a long one, and averaging after exponentiation distorts the result further. Sum NLL weighted by scored-token count, divide once, and exponentiate once.

aggregate-loss-before-exponentiating.py

    import math

    windows = [
        {"average_nll": 0.50, "scored_tokens": 2},
        {"average_nll": 2.00, "scored_tokens": 8},
    ]

    wrong = sum(math.exp(window["average_nll"]) for window in windows) / len(windows)
    total_nll = sum(
        window["average_nll"] * window["scored_tokens"] for window in windows
    )
    total_tokens = sum(window["scored_tokens"] for window in windows)
    right = math.exp(total_nll / total_tokens)

    print(f"average of window PPLs: {wrong:.2f}")
    print(f"token-weighted corpus PPL: {right:.2f}")

Output

    average of window PPLs: 4.52
    token-weighted corpus PPL: 5.47
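
A quick way to see why the token-weighted form is the safer aggregate is to move window boundaries over the same token losses. In this sketch the per-token losses and the helper names (`split`, `average_of_window_ppls`, `token_weighted_ppl`) are illustrative assumptions.

```python
import math

token_nll = [0.2, 0.4, 3.0, 2.5, 0.3, 0.1]  # illustrative per-token losses in nats

def split(losses: list[float], sizes: tuple[int, ...]) -> list[list[float]]:
    """Cut one loss sequence into consecutive windows of the given sizes."""
    windows, start = [], 0
    for size in sizes:
        windows.append(losses[start:start + size])
        start += size
    return windows

def average_of_window_ppls(windows: list[list[float]]) -> float:
    """The problematic aggregate: exponentiate per window, then average."""
    return sum(math.exp(sum(w) / len(w)) for w in windows) / len(windows)

def token_weighted_ppl(windows: list[list[float]]) -> float:
    """Sum all token losses, divide once, exponentiate once."""
    total_nll = sum(sum(w) for w in windows)
    total_tokens = sum(len(w) for w in windows)
    return math.exp(total_nll / total_tokens)

for sizes in [(2, 4), (4, 2)]:
    windows = split(token_nll, sizes)
    print(
        f"window sizes {sizes}: "
        f"average of PPLs={average_of_window_ppls(windows):.2f}, "
        f"token-weighted PPL={token_weighted_ppl(windows):.2f}"
    )
```

The token-weighted value is the same for both splits, while the average of window PPLs shifts with the boundaries even though no token loss changed.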

Different tokenizers need common units

Token-level perplexity depends on tokenization. The string Package at hub may be four subword tokens for one model and fourteen character tokens for another. A probability event per large subword isn't the same unit as a probability event per character. Hugging Face's perplexity documentation explicitly warns that tokenization affects PPL comparisons.[1]

For the same raw UTF-8 evaluation text, bits per byte (BPB) gives both models one shared denominator. PALOMA uses BPB as a practical compromise when tokenizers differ, while noting that it still scores the canonical token sequence chosen by each tokenizer rather than marginalizing over every valid segmentation.[2]

$$\text{BPB} = \frac{-\sum_i \ln p_\theta(x_i \mid x_{<i})}{B \ln 2}$$

where $B$ is the byte count of the original text. A related metric, bits per character, is useful when a benchmark defines character units instead of bytes.

[Figure: Tokenizer normalization visual comparing subword, character, and byte units for the same text.]
Token PPL changes with segmentation. BPB compares surprise over the same raw bytes, making cross-tokenizer comparison fairer rather than perfect.

compare-tokenizers-with-bits-per-byte.py

    import math

    text = "Package at hub"
    byte_count = len(text.encode("utf-8"))
    evaluations = [
        {"name": "subword model", "tokens": 4, "total_nll": 8.4},
        {"name": "character model", "tokens": 14, "total_nll": 9.0},
    ]

    for run in evaluations:
        ppl = math.exp(run["total_nll"] / run["tokens"])
        bpb = run["total_nll"] / (byte_count * math.log(2))
        print(f"{run['name']:15s} token PPL={ppl:.2f}, BPB={bpb:.3f}")

    print("Lower BPB identifies less surprise on identical bytes.")

Output

    subword model   token PPL=8.17, BPB=0.866
    character model token PPL=1.90, BPB=0.927
    Lower BPB identifies less surprise on identical bytes.

The character model looks dramatically better under raw token PPL because it predicts smaller units. BPB gives the fairer comparison: the subword model assigned more total probability to the same bytes. It doesn't erase every tokenization effect, so keep logging the tokenizer and use a fixed vocabulary when possible.
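
Bits per byte and bits per character coincide for pure ASCII text but diverge as soon as a character needs more than one UTF-8 byte. A small sketch, using an illustrative French string and a hypothetical total NLL, shows the two denominators side by side.

```python
import math

text = "Colis arriv\u00e9"  # "Colis arrivé"; the final character takes two UTF-8 bytes
total_nll_nats = 7.5        # hypothetical summed NLL for the whole string

character_count = len(text)
byte_count = len(text.encode("utf-8"))
total_bits = total_nll_nats / math.log(2)  # convert nats to bits

print(f"characters={character_count}, bytes={byte_count}")
print(f"bits per character={total_bits / character_count:.3f}")
print(f"bits per byte={total_bits / byte_count:.3f}")
```

The gap grows for scripts where most characters take several bytes, so a report should state which unit it uses and keep it fixed across runs.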

Long documents need a scoring policy

A real evaluation file may contain thousands of tokens, while a model accepts only a fixed number of context tokens. Cutting text into disjoint blocks is fast, but tokens at each block boundary lose usable left context. A strided sliding window reuses context and scores only newly exposed target tokens.

Hugging Face demonstrates this protocol for GPT-2 Large on WikiText-2: a no-overlap stride = 1024 run reports PPL 19.44, while stride = 512 reports 16.44 for the same model and corpus. The score improved because more context was supplied, not because model weights changed.[1]

[Figure: Strided sliding-window perplexity evaluation showing reused context in blue, newly scored tokens in green, and one final aggregate loss.]
Blue tokens condition predictions without being counted again; green target tokens contribute loss once before the final exponentiation.

First simulate which positions a sliding-window loop scores:

score-each-target-token-once.py

    tokens = list("ABCDEFGHIJ")
    windows = [
        {"context": (0, 5), "score": (1, 5)},
        {"context": (3, 8), "score": (5, 8)},
        {"context": (5, 10), "score": (8, 10)},
    ]

    scored_tokens: list[str] = []
    for index, window in enumerate(windows, start=1):
        begin, end = window["context"]
        score_begin, score_end = window["score"]
        context = "".join(tokens[begin:end])
        scored = "".join(tokens[score_begin:score_end])
        scored_tokens.extend(tokens[score_begin:score_end])
        print(f"window {index}: context={context}, newly scored={scored}")

    print("scored exactly once:", scored_tokens == tokens[1:])

Output

    window 1: context=ABCDE, newly scored=BCDE
    window 2: context=DEFGH, newly scored=FGH
    window 3: context=FGHIJ, newly scored=IJ
    scored exactly once: True

The first token is input context because a causal model needs a previous position before it can score a next-token target. In a framework implementation, context-only labels are commonly masked with -100 so cross-entropy ignores them.[1]
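
The window positions and the -100 masking can be generated rather than written by hand. This is a sketch under stated assumptions: `strided_windows` and `masked_labels` are hypothetical helpers, `stride` is assumed to be no larger than `max_context`, and the framework is assumed to shift labels by one position internally so the first token of the sequence is never used as a target.

```python
IGNORE_INDEX = -100  # label value that cross-entropy is commonly configured to skip

def strided_windows(sequence_length: int, max_context: int, stride: int):
    """Yield (begin, end, new_targets) so each position is a new target exactly once.

    Assumes stride <= max_context; a larger stride would skip tokens.
    """
    previous_end = 0
    for begin in range(0, sequence_length, stride):
        end = min(begin + max_context, sequence_length)
        yield begin, end, end - previous_end
        previous_end = end
        if end == sequence_length:
            break

def masked_labels(token_ids: list[int], begin: int, end: int, new_targets: int) -> list[int]:
    """Copy the window, then hide positions that only serve as left context."""
    window = token_ids[begin:end]
    context_only = len(window) - new_targets
    return [IGNORE_INDEX] * context_only + window[context_only:]

token_ids = list(range(10))  # stand-in token ids 0..9
for begin, end, new_targets in strided_windows(len(token_ids), max_context=5, stride=3):
    labels = masked_labels(token_ids, begin, end, new_targets)
    print(f"window [{begin}:{end}] labels={labels}")
```

Position 0 keeps its label in the first window, but under the shifting assumption it has no preceding position to predict it from, so nine of the ten tokens end up scored, matching the simulation above.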

Here is a dependency-free evaluation loop over precomputed token NLL values. A real model supplies the losses; aggregation logic stays the same.

aggregate-a-strided-evaluation-run.py

    import math

    new_target_losses = [
        [0.30, 0.72, 0.51, 0.43],
        [0.27, 0.61, 0.38],
        [0.56, 0.48],
    ]

    total_nll = sum(sum(window) for window in new_target_losses)
    scored_tokens = sum(len(window) for window in new_target_losses)
    perplexity = math.exp(total_nll / scored_tokens)

    print(f"scored tokens={scored_tokens}")
    print(f"average NLL={total_nll / scored_tokens:.3f}")
    print(f"PPL={perplexity:.2f}")

Output

    scored tokens=9
    average NLL=0.473
    PPL=1.61

For every reported PPL, store max_context_tokens, stride_tokens, the first-token policy, and the masking policy beside the score. Those details are measurement settings, not implementation trivia.

PPL answers one question, not every question

Suppose your support assistant predicts common delivery-status language fluently but states a wrong refund deadline. Perplexity can reward fluent next-token prediction without detecting that policy failure. Likewise, changing decoding strategy can change generated text quality even when the underlying model, and therefore its perplexity on held-out text, is unchanged, as Holtzman et al. demonstrated when studying repetitive neural generation.[3]

Use PPL for the question it answers:

| Decision | Useful measurement |
| --- | --- |
| Did a base-model checkpoint get better at held-out next-token prediction? | PPL under fixed protocol, or BPB across tokenizers |
| Did the assistant provide the correct order status and cite supplied evidence? | Task-specific deterministic checks |
| Did an open-ended reply follow a rubric for clarity and groundedness? | Calibrated judge or human review |
| Is a release safe for a high-impact workflow? | Task regressions plus human-reviewed edge cases |

separate-language-fit-from-product-correctness.py

    candidates = [
        {"name": "fluent-wrong", "ppl": 8.9, "policy_checks_passed": 1},
        {"name": "grounded-answer", "ppl": 9.8, "policy_checks_passed": 3},
    ]

    best_language_fit = min(candidates, key=lambda row: row["ppl"])
    best_product_answer = max(candidates, key=lambda row: row["policy_checks_passed"])

    print("best held-out language fit:", best_language_fit["name"])
    print("best support policy result:", best_product_answer["name"])

Output

    best held-out language fit: fluent-wrong
    best support policy result: grounded-answer

[Figure: Production evaluation stack from perplexity through task checks, calibrated rubric review, and human audit.]
PPL catches cheap language-fit regressions; release decisions move upward to checks that represent actual user and policy failures.

When deterministic checks end

Many support cases can be checked without a model judge: expected status, required citation identifier, and forbidden policy claim are deterministic. Use those checks first.

build-a-deterministic-support-eval.py

    EXPECTED_STATUS = "delayed"
    REQUIRED_SOURCE = "tracking_event_483"

    def score_answer(answer: dict[str, str]) -> tuple[int, list[str]]:
        failures: list[str] = []
        if answer["status"] != EXPECTED_STATUS:
            failures.append("wrong status")
        if answer["source"] != REQUIRED_SOURCE:
            failures.append("missing evidence")
        return 2 - len(failures), failures

    answers = [
        {"name": "A", "status": "delayed", "source": "tracking_event_483"},
        {"name": "B", "status": "delivered", "source": "tracking_event_483"},
    ]

    for answer in answers:
        score, failures = score_answer(answer)
        print(answer["name"], score, failures or ["pass"])

Output

    A 2 ['pass']
    B 1 ['wrong status']

Open-ended tone, clarity, and partial correctness may require rubric review. LLM judges can scale that review, but the MT-Bench and Chatbot Arena study documents position, verbosity, and self-enhancement bias. Treat a judge as a calibrated measurement instrument, not truth.[4]

[Figure: LLM-as-judge pipeline with prompt, target response, rubric, independent judge, score, rationale, and bias controls.]
A judge is appropriate only after its rubric and ordering policy are fixed and its scores are checked against human-labeled cases.
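
One simple control for position bias is to present each answer pair in both orders and accept a verdict only when the same answer wins both times. The sketch below uses two stand-in judge functions instead of a real model call; every name in it (`order_consistent`, `position_biased_judge`, `citation_judge`) is hypothetical.

```python
from typing import Callable

# A judge sees (first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str], str]

REQUIRED_SOURCE = "tracking_event_483"

def order_consistent(judge: Judge, answer_a: str, answer_b: str) -> bool:
    """True when the judge prefers the same answer in both presentation orders."""
    forward = judge(answer_a, answer_b)   # A shown first
    backward = judge(answer_b, answer_a)  # B shown first
    winner_forward = answer_a if forward == "first" else answer_b
    winner_backward = answer_b if backward == "first" else answer_a
    return winner_forward == winner_backward

def position_biased_judge(first: str, second: str) -> str:
    """Stand-in for a flawed judge: always prefers whatever is shown first."""
    return "first"

def citation_judge(first: str, second: str) -> str:
    """Stand-in for a content-based judge: prefers the answer citing the source.

    Ties fall back to the first answer, so tied pairs would still need review.
    """
    if REQUIRED_SOURCE in second and REQUIRED_SOURCE not in first:
        return "second"
    return "first"

answer_a = "Delayed; see tracking_event_483."
answer_b = "Delivered yesterday."

for name, judge in [("position-biased", position_biased_judge), ("citation", citation_judge)]:
    print(f"{name} judge order-consistent: {order_consistent(judge, answer_a, answer_b)}")
```

Order consistency is only one check; agreement with human-labeled cases still has to be measured separately before judge scores gate a release.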

This lesson draws the boundary. Later evaluation lessons build judge calibration, benchmark selection, and online experimentation in depth.

Keep evaluation data clean

PPL needs held-out text. If training data includes your evaluation records, lower loss may reflect memorization rather than generalization. Product task suites have the same failure mode: if prompt examples or fine-tuning rows include hidden test tickets, release metrics lose meaning.

fail-on-evaluation-data-leakage.py

    training_record_ids = {"ticket-101", "ticket-102", "ticket-103"}
    validation_record_ids = {"ticket-201", "ticket-202", "ticket-103"}

    overlap = training_record_ids & validation_record_ids
    if overlap:
        print("FAIL leaked record ids:", sorted(overlap))
    else:
        print("PASS validation set is disjoint")

Output

    FAIL leaked record ids: ['ticket-103']
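
An identifier check misses the case where the same text was re-ingested under a new id. A possible complement, sketched here with made-up ticket texts and a hypothetical `fingerprint` helper, compares hashes of normalized content instead of ids.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Made-up records: ticket-201 repeats ticket-101 with different case and spacing.
training_texts = {
    "ticket-101": "Where is my package?",
    "ticket-102": "Refund not received.",
}
validation_texts = {
    "ticket-201": "where is  my package?",
    "ticket-202": "Label was damaged.",
}

training_fingerprints = {fingerprint(text) for text in training_texts.values()}
leaked = sorted(
    record_id
    for record_id, text in validation_texts.items()
    if fingerprint(text) in training_fingerprints
)

if leaked:
    print("FAIL leaked content under new ids:", leaked)
else:
    print("PASS no normalized-text overlap")
```

This only catches duplicates that are identical after normalization; paraphrased or partially overlapping records would need a fuzzier comparison.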

For public LLM benchmarks, test content can also enter later training corpora. LiveBench addresses that risk with frequently updated questions from recent sources and objective ground-truth scoring; it limits contamination risk rather than making every future score immune to leakage.[5]

[Figure: Evaluation hygiene flow separating training records from held-out and rotating challenge sets.]
Protect measurement before debating scores: reject train/eval overlap, keep a private holdout, and refresh challenge cases over time.

Build an evaluation report

An engineering metric becomes useful when it ships with enough context to reproduce a decision. A compact report should include metric value, protocol fields, leakage checks, and product task gates.

emit-a-release-evaluation-report.py

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Report:
        checkpoint: str
        perplexity: float
        dataset: str
        tokenizer: str
        context_tokens: int
        stride_tokens: int
        leaked_records: int
        policy_pass_rate: float

    def release_gate(report: Report) -> str:
        if report.leaked_records:
            return "BLOCK: contaminated evaluation set"
        if report.policy_pass_rate < 1.0:
            return "BLOCK: product regressions"
        return "PASS: protocol recorded and product checks passed"

    report = Report(
        checkpoint="support-lm-0800",
        perplexity=9.81,
        dataset="support-holdout-v3",
        tokenizer="bpe-v7",
        context_tokens=2048,
        stride_tokens=512,
        leaked_records=0,
        policy_pass_rate=1.0,
    )

    print(f"{report.checkpoint}: PPL={report.perplexity} @ {report.context_tokens}/{report.stride_tokens}")
    print(release_gate(report))

Output

    support-lm-0800: PPL=9.81 @ 2048/512
    PASS: protocol recorded and product checks passed

The report refuses two common shortcuts: treating an untrusted held-out set as evidence, and treating language fit as a substitute for application correctness.
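
To make a decision reproducible, the report has to outlive the process that produced it. One way to do that, sketched here with the standard library only, is to serialize the dataclass to JSON and confirm that it loads back unchanged; the storage format is an assumption, not something the lesson prescribes.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class Report:
    checkpoint: str
    perplexity: float
    dataset: str
    tokenizer: str
    context_tokens: int
    stride_tokens: int
    leaked_records: int
    policy_pass_rate: float

report = Report(
    checkpoint="support-lm-0800",
    perplexity=9.81,
    dataset="support-holdout-v3",
    tokenizer="bpe-v7",
    context_tokens=2048,
    stride_tokens=512,
    leaked_records=0,
    policy_pass_rate=1.0,
)

# Sorted keys keep the serialized form stable, which makes diffs between runs readable.
serialized = json.dumps(asdict(report), sort_keys=True)
restored = Report(**json.loads(serialized))

print(serialized)
print("round trip preserved every field:", restored == report)
```

Storing the score and its protocol fields in one record means a later reader never sees a PPL value without the settings that produced it.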

Key takeaways

  1. Perplexity is exp(average NLL): an interpretable view of held-out next-token surprise.
  2. Raw PPL comparison requires the same dataset, tokenizer, objective, context, stride, and masking policy.
  3. Bits per byte puts models with different tokenizers onto one raw-text denominator.
  4. Long-document evaluation must score new target tokens once while reusing context and aggregating loss before exponentiating.
  5. Low PPL doesn't establish factual, useful, or safe outputs; application checks and calibrated review answer those questions.
  6. Leakage invalidates confident evaluation claims, whether the set measures PPL or product behavior.

Mastery check

Key concepts

  • Held-out next-token likelihood
  • Cross-entropy to perplexity
  • Stable log-probability scoring
  • Evaluation protocol contracts
  • Bits-per-byte normalization
  • Strided context windows
  • Intrinsic versus product quality
  • Leakage-resistant evaluation sets

Evaluation rubric

  • Foundational: Computes token surprise, average NLL, and PPL from observed probabilities.
  • Intermediate: Explains effective choice count without treating it as a vocabulary-size claim.
  • Intermediate: Rejects invalid raw PPL comparisons by checking protocol fields.
  • Advanced: Uses BPB when tokenizers differ and aggregates strided loss correctly.
  • Advanced: Designs a report that separates language-fit metrics from product and leakage gates.

Common pitfalls

Comparing scores without a protocol

Symptom: A team declares victory from PPL 10 versus PPL 12 but can't name the tokenizer, data split, or stride.

Cause: The score was treated as a universal model rating instead of a metric with units and conditioning rules.

Fix: Store the evaluation contract with every result and compare raw PPL only when contracts match.

Averaging window perplexities

Symptom: Long-document PPL changes when window boundaries move, even though scored token losses are unchanged.

Cause: Per-window perplexities were averaged directly.

Fix: Sum token NLL across all windows, divide by scored-token count once, then exponentiate.

Selecting chat behavior using language fit alone

Symptom: A fluent model ships a wrong policy answer because it had the lowest PPL.

Cause: Intrinsic next-token evaluation was confused with application correctness.

Fix: Gate releases on deterministic task checks and calibrated review in addition to base-model fit metrics.

Testing on leaked records

Symptom: Evaluation looks unusually strong, then fails on genuinely new tickets.

Cause: Training or prompt examples overlap with hidden evaluation data.

Fix: Enforce disjoint identifiers, keep private held-out records, and rotate realistic challenge cases.

Next Step

You can now measure prediction fit and protect evaluation sets; next you will turn source documents into clean, traceable records that a system can evaluate and cite.

References

[1] Hugging Face. Perplexity of fixed-length models. 2026.

[2] Magnusson, I., et al. PALOMA: A Benchmark for Evaluating Language Model Fit. NeurIPS 2024 Datasets and Benchmarks Track, 2024.

[3] Holtzman, A., et al. The Curious Case of Neural Text Degeneration. ICLR 2020, 2020.

[4] Zheng, L., et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023, 2023.

[5] White, C., et al. LiveBench: A Challenging, Contamination-Limited LLM Benchmark. 2024.