LeetLLM
© 2026 LeetLLM. All rights reserved.

📊MediumEvaluation & Benchmarks

LLM Benchmarks & Limitations

Build an evaluation suite for a policy-answering LLM: score evidence use, understand public benchmark contracts, control judge bias, and make release decisions from private tests.

19 min read
Learning path
Step 51 of 155 in the full curriculum

In the previous lesson, you turned a returns-policy document into chunks that preserve answerable evidence. One retrieved chunk contained this exact rule:

Retrieved evidence: Damaged electronics: report within 48 hours with photos.

Now a support assistant answers: "Send photos within 48 hours so we can review the damage." That looks good. A second model answers: "You can return any electronics within 30 days." That sounds helpful, but the retrieved clause doesn't support it.

This chapter answers the question that comes next: how do you measure whether a large language model (LLM) system is safe to improve or deploy? You'll build a small private evaluation set, run deterministic checks, learn what public benchmark scores do and don't prove, measure code generation with pass@k, control judge bias, and turn results into a release decision.

Figure: Evaluation path for a policy-answering assistant, from public signals and a declared harness through private grounded-answer cases, bias controls, and release gates.
An evaluation result becomes useful only after you know the task, harness, scorer, and release bar. Public scores can shortlist a model; private policy checks decide whether it handles customer traffic.

Start With Questions You Must Answer Correctly

A benchmark is a collection of tasks plus a scoring procedure. A production evaluation is the same idea applied to your own workload. Instead of asking only whether a model knows academic facts, you ask whether the entire system retrieves the right clause, answers from that clause, follows the output contract, and meets latency and cost limits.

Here is a tiny private evaluation set for ShopFlow. The policy statements are fictional product rules for this lab, not general retail advice.

| Case | Customer question | Retrieved policy evidence | Answer must include | Answer must not add |
|---|---|---|---|---|
| damaged-electronics | "My earbuds arrived crushed. What do I send, and by when?" | Damaged electronics: report within 48 hours with photos. | 48 hours, photos | 30 days |
| late-shipment | "Tracking hasn't changed for a week. What happens now?" | If a shipment has no scan for 7 days, open a carrier investigation. | 7 days, carrier investigation | automatic refund |
| sealed-return | "Can I send unopened headphones back?" | Unopened headphones may be returned within 30 days. | unopened, 30 days | opened items accepted |

These rows are more useful than a generic "answer quality" prompt because each one names the behavior that passes and the risky extra claim that fails.

Before scoring a model, validate the evaluation set itself. A duplicated ID can silently overwrite a result, and a required phrase absent from its cited evidence creates an impossible test.

validate-private-eval-set.py
```python
rows = [
    {
        "id": "damaged-electronics",
        "evidence": "Damaged electronics: report within 48 hours with photos.",
        "required": ("48 hours", "photos"),
    },
    {
        "id": "late-shipment",
        "evidence": "If a shipment has no scan for 7 days, open a carrier investigation.",
        "required": ("7 days", "carrier investigation"),
    },
    {
        "id": "sealed-return",
        "evidence": "Unopened headphones may be returned within 30 days.",
        "required": ("unopened", "30 days"),
    },
]

ids = [row["id"] for row in rows]
assert len(ids) == len(set(ids)), "evaluation IDs must be unique"

for row in rows:
    evidence = row["evidence"].lower()
    missing = [phrase for phrase in row["required"] if phrase not in evidence]
    assert not missing, f"{row['id']} has unsupported gold facts: {missing}"
    print(f"{row['id']:20} valid required={len(row['required'])}")

print(f"validated_cases={len(rows)}")
```

Output

```text
damaged-electronics  valid required=2
late-shipment        valid required=2
sealed-return        valid required=2
validated_cases=3
```

A Score Is Only Meaningful With Its Contract

When a model card says "86%," the number is incomplete without the task and harness that produced it. Five parts make an evaluation result reproducible:

| Contract part | Question to record | ShopFlow policy example |
|---|---|---|
| Dataset | Which cases were scored, and at which version? | policy-golden-v3, 120 reviewed support questions |
| Input path | What context reached the model? | top-3 chunks from the index, with policy version IDs |
| Output contract | What must the response look like? | concise answer plus cited source_page |
| Scorer | How is success computed? | required facts, forbidden claims, schema validity, human audit |
| Release bar | What result permits deployment? | no critical policy misses, plus latency and cost bounds |
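One way to keep a score attached to its contract is to store the five parts as a record and derive a fingerprint from them. The sketch below is a minimal illustration, not part of the ShopFlow lab: the `EvalContract` class, its field names, and the 12-character fingerprint length are assumptions chosen for this example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalContract:
    """Hypothetical record of the five contract parts from the table above."""

    dataset: str
    input_path: str
    output_contract: str
    scorer: str
    release_bar: str

    def fingerprint(self) -> str:
        # Serialize with sorted keys so the hash depends only on field values.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


contract = EvalContract(
    dataset="policy-golden-v3",
    input_path="top-3 chunks with policy version IDs",
    output_contract="concise answer plus cited source_page",
    scorer="required facts, forbidden claims, schema validity, human audit",
    release_bar="no critical policy misses, plus latency and cost bounds",
)

# Changing any single part, here the retrieval depth, yields a different fingerprint.
changed = EvalContract(**{**asdict(contract), "input_path": "top-5 chunks with policy version IDs"})

print("contract fingerprint:", contract.fingerprint())
print("same contract after change:", contract.fingerprint() == changed.fingerprint())
```

Reporting the fingerprint next to each score makes it harder to compare two numbers that came from different contracts without noticing.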

The preceding lesson addressed the second step of this pipeline: getting complete evidence into the context. This lesson measures what happens after retrieval. Work from left to right:

Figure: Four-step evaluation pipeline: (1) private cases (question + expected facts), (2) retrieve evidence (policy chunks), (3) generate answer with source ID, and (4) score behavior (facts + risk + format).

If a response fails at step 4, don't immediately blame the model. Inspect the retrieved chunk first. A missing fact may be a retrieval failure, while a contradicted fact may be a generation or instruction-following failure.

First Gate: Preserve Policy Evidence

Begin with deterministic checks whenever the answer has explicit requirements. This isn't a complete semantic evaluator. It is a fast test that catches answers missing required policy terms or adding known unsafe claims.

The following lab uses three fixture answers. It also asserts that the evaluation rows themselves are valid: every required phrase must be present in the cited evidence.

score-grounded-policy-answers.py
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    evidence: str
    required_phrases: tuple[str, ...]
    forbidden_phrases: tuple[str, ...]

cases = [
    EvalCase(
        "damaged-electronics",
        "Damaged electronics: report within 48 hours with photos.",
        ("48 hours", "photos"),
        ("30 days",),
    ),
    EvalCase(
        "late-shipment",
        "If a shipment has no scan for 7 days, open a carrier investigation.",
        ("7 days", "carrier investigation"),
        ("automatic refund",),
    ),
    EvalCase(
        "sealed-return",
        "Unopened headphones may be returned within 30 days.",
        ("unopened", "30 days"),
        ("opened items accepted",),
    ),
]

answers = {
    "damaged-electronics": "Please report the damage within 48 hours and attach photos.",
    "late-shipment": "You qualify for an automatic refund now.",
    "sealed-return": "Unopened headphones may be returned within 30 days.",
}

def evaluate(case: EvalCase, answer: str) -> tuple[list[str], list[str]]:
    evidence = case.evidence.lower()
    text = answer.lower()
    assert all(phrase in evidence for phrase in case.required_phrases)
    missing = [phrase for phrase in case.required_phrases if phrase not in text]
    unsupported = [phrase for phrase in case.forbidden_phrases if phrase in text]
    return missing, unsupported

passed = 0
for case in cases:
    missing, unsupported = evaluate(case, answers[case.case_id])
    status = "PASS" if not missing and not unsupported else "FAIL"
    passed += status == "PASS"
    print(f"{case.case_id:20} {status} missing={missing} unsupported={unsupported}")

print(f"summary: grounded_policy_pass={passed}/{len(cases)}")
```

Output

```text
damaged-electronics  PASS missing=[] unsupported=[]
late-shipment        FAIL missing=['7 days', 'carrier investigation'] unsupported=['automatic refund']
sealed-return        PASS missing=[] unsupported=[]
summary: grounded_policy_pass=2/3
```

The failed late-shipment answer is fluent, but it never mentions the required investigation and invents an automatic refund. A helpful-sounding response isn't enough when a customer may act on an unsupported policy.

This split between retrieved context and generated answer also appears in evaluation research. Ragas evaluates whether retrieved context supports the answer and whether the answer addresses the question, rather than collapsing all failures into one vague score.[1]
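The deterministic scorer relies on plain substring checks, which can misfire when a phrase sits inside a longer word or number. The sketch below shows one possible tightening with word-boundary matching. The `contains_phrase` helper and the sample answer are hypothetical and are not part of the lab's scorer.

```python
import re


def contains_phrase(text: str, phrase: str) -> bool:
    """Match `phrase` only when it is not glued to neighbouring word characters."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Hypothetical answer that trips both substring checks.
answer = "Unopened items accepted within 130 days."

for phrase in ("opened items accepted", "30 days"):
    substring_hit = phrase in answer.lower()
    boundary_hit = contains_phrase(answer, phrase)
    print(f"{phrase!r}: substring={substring_hit} word_boundary={boundary_hit}")
```

Substring matching reports both phrases as present, because "unopened" contains "opened" and "130 days" contains "30 days". The boundary-aware check reports neither. It is still a lexical test, so paraphrases such as "two days" for "48 hours" need a semantic scorer or human review.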

Diagnose Retrieval and Generation Separately

Suppose an answer omits the 48 hours deadline. That doesn't tell you which component failed. Compare the gold requirement with both the retrieved context and the generated answer:

attribute-rag-failures.py
```python
cases = [
    {
        "id": "good-answer",
        "required": ("48 hours", "photos"),
        "retrieved": "Damaged electronics: report within 48 hours with photos.",
        "answer": "Report within 48 hours and send photos.",
    },
    {
        "id": "retrieval-miss",
        "required": ("48 hours", "photos"),
        "retrieved": "Electronics category page. Contact support for assistance.",
        "answer": "Contact support for assistance.",
    },
    {
        "id": "generation-miss",
        "required": ("48 hours", "photos"),
        "retrieved": "Damaged electronics: report within 48 hours with photos.",
        "answer": "Please send photos so we can investigate.",
    },
]

def failure_stage(case: dict[str, object]) -> str:
    required = case["required"]
    retrieved = str(case["retrieved"]).lower()
    answer = str(case["answer"]).lower()
    if any(term not in retrieved for term in required):
        return "bad_retrieval"
    if any(term not in answer for term in required):
        return "bad_generation"
    return "pass"

for case in cases:
    print(f"{case['id']:16} {failure_stage(case)}")
```

Output

```text
good-answer      pass
retrieval-miss   bad_retrieval
generation-miss  bad_generation
```

Public Benchmarks Answer Narrower Questions

Private tests tell you whether your application meets its contract. Public benchmarks help compare capabilities under standardized tasks, but each family asks a different question.

MMLU evaluates multiple-choice accuracy across 57 academic and professional subjects.[2] It can reveal broad knowledge differences under a stated prompt protocol. It can't prove that a policy assistant quotes the right return clause.

GPQA narrows the question to difficult expert QA. Its 448 multiple-choice questions were written by domain experts in biology, physics, and chemistry.[3] A strong GPQA result can signal expert STEM question-answering ability under that harness, but it still doesn't prove that a support workflow follows policy.

Math benchmarks narrow the question. GSM8K uses grade-school word problems, while MATH uses harder competition-style problems; both are useful when a workflow must produce checkable mathematical answers, but they don't measure open-ended support quality.[4][5] At the harder frontier, Humanity's Last Exam (HLE) collects expert-level questions across domains, and FrontierMath targets advanced mathematical problems. ARC-AGI-2 asks systems to infer novel transformations from small visual-grid demonstrations rather than recall subject matter.[6][7][8] These benchmarks answer different research questions; a model isn't "better at reasoning" in every product merely because it improves on one of them.

HumanEval provides handwritten Python functions with tests and introduced functional correctness reporting with repeated samples and pass@k.[9] SWE-bench moves from isolated functions to real GitHub issues whose patches are tested in repository environments.[10] These are relevant if your system edits code, but neither evaluates customer-facing policy truthfulness.

MT-Bench uses a model judge for open-ended, multi-turn responses, while Chatbot Arena collects blind pairwise human preferences.[11][12] Preference is useful for tone and helpfulness, but an answer that users prefer can still cite the wrong policy.

| Evaluation family | Scored unit | Typical metric | Useful signal | Missing deployment proof |
|---|---|---|---|---|
| Broad knowledge | MMLU question | multiple-choice accuracy | academic/professional task coverage | evidence grounding and tool behavior |
| Expert STEM QA | GPQA question | multiple-choice accuracy | difficult biology, physics, and chemistry QA | production workflow behavior |
| Checkable mathematics | GSM8K or MATH problem | parsed final-answer accuracy | mathematical answer execution under a fixed harness | grounded customer-policy answers |
| Frontier reasoning research | HLE, FrontierMath, or ARC-AGI-2 item | benchmark-specific correctness | expert-question or novel-task performance in that suite | general deployment reliability |
| Executable code | HumanEval function or SWE-bench patch | pass@k or resolved rate | programs that satisfy tests | support answer policy compliance |
| Open-ended preference | MT-Bench or Arena comparison | judge score or pairwise preference | style and perceived helpfulness | factual ground truth |
| Private application eval | question, retrieved clause, response | policy pass rate plus gates | behavior on your release path | capability outside your covered slices |

The practical rule is simple: compare results only when the dataset, input path, output contract, prompt, sampling settings, tools, and scorer match. 84% on a multiple-choice set and 42% on repository fixes aren't competing measurements of the same property.

That rule is easy to encode. This helper allows comparisons only when the benchmark and the harness fields are identical.

check-harness-comparability.py
```python
runs = {
    "candidate-a": {
        "dataset": "policy-golden-v3",
        "retriever": "chunks-v2/top3",
        "prompt": "support-answer-v4",
        "output_contract": "answer-with-source-page-v2",
        "toolset": ("policy-search-v2",),
        "temperature": 0.0,
        "max_tokens": 180,
        "scorer": "facts-v2",
        "score": 0.97,
    },
    "candidate-b": {
        "dataset": "policy-golden-v3",
        "retriever": "chunks-v2/top3",
        "prompt": "support-answer-v4",
        "output_contract": "answer-with-source-page-v2",
        "toolset": ("policy-search-v2",),
        "temperature": 0.0,
        "max_tokens": 180,
        "scorer": "facts-v2",
        "score": 0.99,
    },
    "candidate-c": {
        "dataset": "policy-golden-v3",
        "retriever": "chunks-v2/top5",
        "prompt": "support-answer-v4",
        "output_contract": "answer-with-source-page-v2",
        "toolset": ("policy-search-v2",),
        "temperature": 0.0,
        "max_tokens": 180,
        "scorer": "facts-v2",
        "score": 1.00,
    },
}

contract_fields = (
    "dataset",
    "retriever",
    "prompt",
    "output_contract",
    "toolset",
    "temperature",
    "max_tokens",
    "scorer",
)

def comparable(left: dict[str, object], right: dict[str, object]) -> bool:
    return all(left[field] == right[field] for field in contract_fields)

print("A vs B comparable:", comparable(runs["candidate-a"], runs["candidate-b"]))
print("A vs C comparable:", comparable(runs["candidate-a"], runs["candidate-c"]))
print("C differs because its retriever sees more chunks.")
```

Output

```text
A vs B comparable: True
A vs C comparable: False
C differs because its retriever sees more chunks.
```
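The scorer belongs in the contract because the same raw outputs can earn different accuracies under different answer parsers. The sketch below is illustrative only: the four outputs, the gold labels, and both parser functions are hypothetical and do not come from any benchmark's official harness.

```python
import re
from typing import Callable, Optional

# Hypothetical gold labels and raw model outputs for four multiple-choice items.
gold = ["B", "C", "A", "D"]
raw_outputs = [
    "B",
    "The answer is C.",
    "(A)",
    "I think it is probably d",
]


def strict_parse(output: str) -> Optional[str]:
    """Accept only a bare capital letter A-D."""
    text = output.strip()
    return text if text in {"A", "B", "C", "D"} else None


def lenient_parse(output: str) -> Optional[str]:
    """Accept the last standalone letter A-D, in either case."""
    matches = re.findall(r"\b([A-Da-d])\b", output)
    return matches[-1].upper() if matches else None


def accuracy(parser: Callable[[str], Optional[str]]) -> float:
    parsed = [parser(output) for output in raw_outputs]
    return sum(p == g for p, g in zip(parsed, gold)) / len(gold)


print(f"strict parser accuracy:  {accuracy(strict_parse):.2f}")
print(f"lenient parser accuracy: {accuracy(lenient_parse):.2f}")
```

On these four outputs the strict parser credits only the bare "B", while the lenient parser credits all four. Neither number is wrong; they answer different questions, so a report needs to say which parser produced the score.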

pass@k: Measure a Workflow That Can Test Several Drafts

For code generation, a single answer isn't always the real workflow. A developer may generate several candidates and run tests until one passes. HumanEval's pass@k estimator measures the probability that at least one of $k$ selected samples is correct, given $n$ generated programs and $c$ correct programs.[9]

Suppose an inventory-code task generated 100 candidates and tests accepted 20. Checking one random candidate succeeds with probability 0.20. Checking 10 distinct candidates succeeds much more often:

$$
\operatorname{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}
$$

Here, $n$ is the number of generated programs, $c$ is the number that pass tests, and $k$ is the number you are allowed to try.
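As a worked check, assuming the same 100 generated programs and 20 passing programs as the inventory example, the cases $k = 1$ and $k = 2$ can be expanded by hand:

$$
\begin{aligned}
\operatorname{pass}@1 &= 1 - \frac{\binom{80}{1}}{\binom{100}{1}} = 1 - \frac{80}{100} = 0.20 \\
\operatorname{pass}@2 &= 1 - \frac{\binom{80}{2}}{\binom{100}{2}} = 1 - \frac{80 \cdot 79}{100 \cdot 99} \approx 0.3616
\end{aligned}
$$

The fraction is the probability that every drawn candidate comes from the 80 failing programs, so subtracting it from 1 gives the probability that at least one drawn candidate passes.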

estimate-pass-at-k.py
```python
from math import comb

def pass_at_k(total_samples: int, correct_samples: int, k: int) -> float:
    if not 0 <= correct_samples <= total_samples:
        raise ValueError("correct_samples must be inside the sampled set")
    if not 1 <= k <= total_samples:
        raise ValueError("k must be between 1 and total_samples")
    if total_samples - correct_samples < k:
        return 1.0
    return 1.0 - comb(total_samples - correct_samples, k) / comb(total_samples, k)

for k in (1, 5, 10, 20):
    print(f"pass@{k:<2} = {pass_at_k(100, 20, k):.4f}")
```

Output

```text
pass@1  = 0.2000
pass@5  = 0.6807
pass@10 = 0.9049
pass@20 = 0.9934
```
Figure: Pass at k curve for 100 generated code samples with 20 correct candidates, showing win probability rising sharply as more attempts are tested.
`pass@k` measures the value of generating and testing several code drafts. It doesn't mean a single unverified response became more reliable.

This distinction matters. Multiple tested drafts make sense for a generated inventory function because unit tests select a passing implementation. You shouldn't sample several customer-facing refund answers and silently choose the most confident-looking one without a policy-grounded scorer.
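As a sanity check on the closed-form estimator, the same quantity can be approximated by simulation: repeatedly draw $k$ distinct candidates and count how often at least one passes. The sketch below assumes the inventory example's 100 candidates with 20 passing. The function name, trial count, and seed are arbitrary choices for illustration.

```python
import random


def simulate_pass_at_k(total: int, correct: int, k: int, trials: int, seed: int = 0) -> float:
    """Estimate pass@k by drawing k distinct candidates many times."""
    rng = random.Random(seed)
    # Mark the first `correct` candidates as passing; the rest fail.
    passing = [index < correct for index in range(total)]
    hits = 0
    for _ in range(trials):
        drawn = rng.sample(passing, k)  # sampling without replacement
        hits += any(drawn)
    return hits / trials


estimate = simulate_pass_at_k(100, 20, 10, trials=20_000)
print(f"simulated pass@10 ~ {estimate:.3f}")
```

The estimate should land close to the closed-form value of 0.9049 for pass@10, with small run-to-run variation that depends on the seed and trial count.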

Open-Ended Answers Need Controlled Judgment

Deterministic checks catch explicit clauses, but they don't settle every question. Two answers can both cite the correct deadline while differing in clarity or empathy. For open-ended comparisons, teams often use humans or an LLM judge.

The MT-Bench study found that capable LLM judges can approximate human preferences in its setting, but also documents position, verbosity, and self-enhancement biases.[11] A judge result is measurement output, not ground truth.

The small simulation below intentionally gives a toy judge a position bias: if two answers contain the same required facts, it chooses whichever answer appears first. Running each comparison in both orders exposes that instability.

detect-position-biased-judge.py
```python
def biased_judge(answer_a: str, answer_b: str, required: tuple[str, ...]) -> str:
    score_a = sum(term in answer_a.lower() for term in required)
    score_b = sum(term in answer_b.lower() for term in required)
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "A"  # Deliberate position bias for equal-quality answers.

def swap_checked(answer_a: str, answer_b: str, required: tuple[str, ...]) -> dict[str, object]:
    first_order = biased_judge(answer_a, answer_b, required)
    swapped_order = biased_judge(answer_b, answer_a, required)
    normalized_swapped = {"A": "B", "B": "A"}[swapped_order]
    if first_order != normalized_swapped:
        return {"winner": "needs_human_review", "stable": False}
    return {"winner": first_order, "stable": True}

required = ("48 hours", "photos")
concise = "Report the damaged earbuds within 48 hours and attach photos."
friendly = "Sorry they arrived damaged. Please send photos within 48 hours."
unsupported = "Return the earbuds whenever convenient."

print("equivalent:", swap_checked(concise, friendly, required))
print("factual miss:", swap_checked(concise, unsupported, required))
```

Output

```text
equivalent: {'winner': 'needs_human_review', 'stable': False}
factual miss: {'winner': 'A', 'stable': True}
```

An unstable comparison isn't a failure of either candidate. It means the measurement can't distinguish them reliably under its current rubric. Send such cases to a human reviewer or improve the rubric before aggregating a winner.

Figure: Pairwise LLM judge control loop with answer-order swap, length control, reference anchor, stable winner check, and human review for unstable cases.
Pairwise grading needs controls. Swap answer order, anchor the rubric to policy facts, and treat unstable wins as a request for review rather than a model victory.
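When many pairs are judged, one option is to report stable wins and the review queue separately rather than folding unstable cases into a single win rate. The sketch below assumes outcome records with `winner` and `stable` keys. The five outcomes and the `summarize` helper are hypothetical fixtures for illustration.

```python
from collections import Counter

# Hypothetical swap-checked outcomes for candidate A versus candidate B.
outcomes = [
    {"winner": "A", "stable": True},
    {"winner": "A", "stable": True},
    {"winner": "B", "stable": True},
    {"winner": "needs_human_review", "stable": False},
    {"winner": "needs_human_review", "stable": False},
]


def summarize(results: list[dict[str, object]]) -> dict[str, float]:
    """Report win rate on stable pairs and the size of the review queue."""
    stable = [result for result in results if result["stable"]]
    wins = Counter(str(result["winner"]) for result in stable)
    return {
        "stable_fraction": len(stable) / len(results),
        "a_win_rate_on_stable": wins["A"] / len(stable) if stable else 0.0,
        "review_queue": float(len(results) - len(stable)),
    }


summary = summarize(outcomes)
for key, value in summary.items():
    print(f"{key:22} {value:.2f}")
```

Here A wins two of the three stable pairs, but 40% of comparisons were unstable. Reporting both figures keeps a large review queue from hiding behind a confident-looking win rate.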

Public Test Sets Can Lose Their Meaning

A public benchmark is reproducible because everyone can access the tasks. That same visibility creates a risk: benchmark questions or solutions may appear in training corpora. A model that encountered a test item during training is no longer being measured on a genuinely unseen example.

Do not jump from risk to accusation. A public score alone doesn't prove contamination in a particular model. It means the report should state what decontamination checks were used, and high-stakes selection should include fresh or private tasks.

LiveCodeBench was designed around this problem: it continually collects coding tasks released over time and evaluates models on problems that appear after their stated training cutoff.[13] Time splits reduce exposure risk, although they still depend on accurate cutoff information and a stable harness.

Figure: Benchmark contamination cycle where public tasks spread onto web mirrors, leak into training corpora, inflate leaderboard scores, and force a shift toward time-split or private evals.
A public test remains valuable only while it measures unseen behavior. Fresh time-split tasks and private release cases preserve signal when popular static sets may have circulated widely.

For the ShopFlow policy assistant, keep private policy cases separate from prompt examples, demo transcripts, and support documentation that may later be used for tuning. If an evaluation case becomes a training example, retire it from the holdout set or record that it now measures regression behavior rather than generalization.
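A simple way to notice that a holdout case has leaked into tuning material is to look for long word sequences shared between the two. The sketch below uses word 5-gram overlap with a 0.5 threshold. Both values, the helper names, and the tuning transcripts are assumptions for illustration, and production decontamination checks are usually more elaborate.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(eval_text: str, corpus_text: str, n: int = 5) -> float:
    """Fraction of the evaluation text's n-grams that also occur in the corpus text."""
    eval_grams = word_ngrams(eval_text, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & word_ngrams(corpus_text, n)) / len(eval_grams)


eval_cases = {
    "damaged-electronics": "My earbuds arrived crushed. What do I send, and by when?",
    "late-shipment": "Tracking hasn't changed for a week. What happens now?",
}

# Hypothetical tuning transcripts; the first one reuses an evaluation question.
tuning_docs = [
    "Customer: My earbuds arrived crushed. What do I send, and by when? Agent: Send photos.",
    "Customer: Where is my invoice? Agent: Check the orders page.",
]

for case_id, question in eval_cases.items():
    worst = max(overlap_ratio(question, doc) for doc in tuning_docs)
    flag = "RETIRE_OR_RELABEL" if worst >= 0.5 else "keep"
    print(f"{case_id:20} max_overlap={worst:.2f} {flag}")
```

The copied question is flagged with full overlap, while the untouched case stays in the holdout. Exact n-gram matching misses paraphrased leaks, so a clean result lowers the risk without proving the case is unseen.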

A time-split suite also helps when new policy versions arrive. A model tuned using cases created through March shouldn't be evaluated as "unseen" on those same cases. Hold out cases authored after the tuning snapshot:

make-time-split-holdout.py
```python
from datetime import date

tuning_snapshot = date.fromisoformat("2026-03-31")
policy_cases = [
    ("holiday-return-window", "2026-02-10"),
    ("damaged-electronics-photo-rule", "2026-04-08"),
    ("split-shipment-investigation", "2026-04-21"),
    ("seller-battery-restriction", "2026-05-03"),
]

training_visible = []
fresh_holdout = []
for case_id, created_at in policy_cases:
    destination = fresh_holdout if date.fromisoformat(created_at) > tuning_snapshot else training_visible
    destination.append(case_id)

print("known before tuning:", training_visible)
print("fresh holdout:", fresh_holdout)
assert "damaged-electronics-photo-rule" in fresh_holdout
```

Output

```text
known before tuning: ['holiday-return-window']
fresh holdout: ['damaged-electronics-photo-rule', 'split-shipment-investigation', 'seller-battery-restriction']
```

Turn Results Into a Release Gate

Your evaluation is useful when it changes an engineering decision. A model may give grounded answers but be too slow for live chat. Another may be fast and cheap, but fail critical policy cases. Define the bar before comparing models.

Figure: Production model selection flow where public benchmarks shortlist candidates, private support tasks test policy and formatting, and latency plus cost gates decide ship, route, or reject.
Model selection ends with your own release gates. A candidate ships only when policy quality and operational limits pass together; otherwise route selectively or reject it.

The fixture values below represent measurements from a fictional private evaluation run. The code makes the release rule explicit: a candidate that misses policy or schema quality is rejected, while a correct but expensive model may be routed only to difficult cases.

apply-release-gates.py
```python
candidates = [
    {
        "name": "fast-small",
        "policy_pass_rate": 0.91,
        "schema_pass_rate": 1.00,
        "p95_latency_ms": 420,
        "cost_per_case": 0.002,
    },
    {
        "name": "balanced",
        "policy_pass_rate": 0.99,
        "schema_pass_rate": 1.00,
        "p95_latency_ms": 680,
        "cost_per_case": 0.008,
    },
    {
        "name": "slow-specialist",
        "policy_pass_rate": 1.00,
        "schema_pass_rate": 1.00,
        "p95_latency_ms": 1800,
        "cost_per_case": 0.031,
    },
]

minimum_policy = 0.98
minimum_schema = 0.995
online_latency_limit_ms = 900
online_cost_limit = 0.015

def decision(candidate: dict[str, float | str]) -> str:
    if candidate["policy_pass_rate"] < minimum_policy:
        return "reject: policy quality below bar"
    if candidate["schema_pass_rate"] < minimum_schema:
        return "reject: output contract below bar"
    if candidate["p95_latency_ms"] > online_latency_limit_ms:
        return "route: reserve for escalations because latency is high"
    if candidate["cost_per_case"] > online_cost_limit:
        return "route: reserve for escalations because cost is high"
    return "ship: passes online gates"

for candidate in candidates:
    print(f"{candidate['name']:15} {decision(candidate)}")
```

Output

```text
fast-small      reject: policy quality below bar
balanced        ship: passes online gates
slow-specialist route: reserve for escalations because latency is high
```

The public benchmark shortlist never appears inside `decision`. That is intentional. Public scores help decide which candidates deserve testing. A release gate uses measurements taken on the workload you are about to serve.

Inspect Failure Slices, Not Only the Average

An aggregate score can hide the one category with the largest customer impact. The run below clears four of six cases overall, but damaged-electronics answers fail most often.

report-failure-slices.py
```python
from collections import defaultdict

results = [
    ("damaged-electronics", True),
    ("damaged-electronics", False),
    ("damaged-electronics", False),
    ("late-shipment", True),
    ("late-shipment", True),
    ("sealed-return", True),
]

by_slice: dict[str, list[bool]] = defaultdict(list)
for category, passed in results:
    by_slice[category].append(passed)

overall = sum(passed for _, passed in results) / len(results)
print(f"overall={overall:.2%}")
for category in sorted(by_slice):
    values = by_slice[category]
    rate = sum(values) / len(values)
    print(f"{category:20} pass_rate={rate:.2%} cases={len(values)}")
```

Output

```text
overall=66.67%
damaged-electronics  pass_rate=33.33% cases=3
late-shipment        pass_rate=100.00% cases=2
sealed-return        pass_rate=100.00% cases=1
```

That report should block a release if damaged-item policy mistakes are critical, even when the average seems acceptable.
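
One way to turn that rule into an automated check is a per-slice floor that is stricter for critical categories. The sketch below is a minimal illustration, not part of the lesson's scripts: `SLICE_FLOORS`, `DEFAULT_FLOOR`, and `blocking_slices` are hypothetical names, and the thresholds are placeholders you would set from your own risk review.

```python
from collections import defaultdict

# Hypothetical floors: critical slices get a stricter bar than the default.
SLICE_FLOORS = {"damaged-electronics": 0.95}
DEFAULT_FLOOR = 0.80


def blocking_slices(results: list[tuple[str, bool]]) -> list[str]:
    """Return the slices whose pass rate falls below their floor."""
    by_slice: dict[str, list[bool]] = defaultdict(list)
    for category, passed in results:
        by_slice[category].append(passed)

    blocked = []
    for category in sorted(by_slice):
        values = by_slice[category]
        rate = sum(values) / len(values)
        if rate < SLICE_FLOORS.get(category, DEFAULT_FLOOR):
            blocked.append(category)
    return blocked


results = [
    ("damaged-electronics", True),
    ("damaged-electronics", False),
    ("damaged-electronics", False),
    ("late-shipment", True),
    ("late-shipment", True),
    ("sealed-return", True),
]

blocked = blocking_slices(results)
print("BLOCK" if blocked else "SHIP", blocked)
# prints: BLOCK ['damaged-electronics']
```

The gate reports which slices failed rather than a single boolean, so the release note can name the category that needs work.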

Small Samples Need an Uncertainty Margin

Three passing examples make a good tutorial, not a production release argument. Earlier statistics lessons introduced uncertainty in estimated rates. For a binary pass metric, the lower end of the 95% Wilson score interval (z = 1.96) gives a conservative view of the pass rate supported by a finite sample:

require-confidence-before-release.py
from math import sqrt

def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    proportion = successes / total
    denominator = 1 + z**2 / total
    center = proportion + z**2 / (2 * total)
    margin = z * sqrt((proportion * (1 - proportion) + z**2 / (4 * total)) / total)
    return (center - margin) / denominator

release_floor = 0.95
for successes, total in ((3, 3), (49, 50), (990, 1000)):
    lower = wilson_lower_bound(successes, total)
    status = "PASS" if lower >= release_floor else "COLLECT_MORE_OR_FIX"
    print(f"{successes:>3}/{total:<4} lower_bound={lower:.3f} {status}")

Output

  3/3    lower_bound=0.438 COLLECT_MORE_OR_FIX
 49/50   lower_bound=0.895 COLLECT_MORE_OR_FIX
990/1000 lower_bound=0.982 PASS

Even a perfect 3/3 has a lower bound of only about 0.44, and 49/50 still falls short of a 0.95 floor. Build a risk-stratified holdout large enough that the lower bound, not the point estimate, clears the bar before claiming a system meets a production-quality standard.

Write Down Enough to Reproduce the Claim

Evaluation reports should let another engineer rerun the comparison and inspect its failures. Record these fields before sharing any model ranking:

| Report field | Why it matters | Example entry |
| --- | --- | --- |
| Dataset version and holdout policy | prevents accidental training leakage | policy-golden-v3, never used for prompt tuning |
| Retrieval setup | separates missing evidence from bad answers | chunker version, top-k, index snapshot |
| Prompt and chat format | instruction formatting changes outputs | system prompt hash, template version |
| Model and sampling settings | output variance depends on decoding | model ID, temperature, max tokens |
| Scorer and judge controls | metrics can hide bias | deterministic gates, order swap, audit sample |
| Failure slices | averages conceal unsafe cases | damaged electronics, late scans, multilingual queries |
| Latency and cost | a correct system still needs to run | p50/p95 latency, cost per successful case |

That prompt-and-chat-format row points to the next lesson. If a model regularly omits required structure or speaks in the wrong role, evaluation has exposed a behavior gap. You then need to understand how instruction tuning and chat templates teach the interaction contract.

Failure Patterns to Diagnose

| Symptom | Likely cause | First fix to test |
| --- | --- | --- |
| Answer omits the deadline although policy chunk contains it | generation or prompt contract failure | require cited facts and inspect prompt/template |
| Answer sounds confident but cites a clause absent from context | unsupported generation | add forbidden-claim checks and human review slice |
| Two published model scores reverse under a new harness | incomparable prompt or scoring setup | rerun identical harness and publish configuration |
| Judge selects whichever answer appears first | position bias | score both orders and reject unstable comparisons |
| Public score rises while private cases stagnate | task mismatch or contaminated public signal | prioritize fresh private holdout and failure analysis |
| Correct model misses live latency target | operational bottleneck | route difficult cases or select a faster candidate |
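
The position-bias fix in the table, scoring both orders and rejecting unstable comparisons, can be sketched in a few lines. The two judge functions here are hypothetical stubs standing in for a real LLM judge call, and `stable_preference` is an illustrative helper name. The length-based stub exists only to show a verdict that survives the swap; it is not a recommended scoring rule.

```python
from typing import Callable

# A judge takes (first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str], str]


def stable_preference(judge: Judge, answer_a: str, answer_b: str) -> str:
    """Score both orders; return "A", "B", or "unstable" if the verdict flips."""
    forward = judge(answer_a, answer_b)   # A shown in the first slot
    backward = judge(answer_b, answer_a)  # B shown in the first slot
    winner_forward = "A" if forward == "first" else "B"
    winner_backward = "B" if backward == "first" else "A"
    return winner_forward if winner_forward == winner_backward else "unstable"


def first_position_judge(first: str, second: str) -> str:
    # Hypothetical biased judge: always prefers whatever is in slot one.
    return "first"


def longer_answer_judge(first: str, second: str) -> str:
    # Hypothetical judge whose verdict depends on content, not on position.
    return "first" if len(first) >= len(second) else "second"


answer_a = "a longer candidate answer with supporting detail"
answer_b = "a short answer"

print(stable_preference(first_position_judge, answer_a, answer_b))  # prints: unstable
print(stable_preference(longer_answer_judge, answer_a, answer_b))   # prints: A
```

Comparisons that come back as unstable are better excluded from win-rate totals or sent to human review than counted as half a win for each side.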

Mastery check

Key concepts

  • Evaluation contracts and private golden sets
  • Grounded answer checks for RAG
  • Benchmark families and harness comparability
  • HumanEval and pass@k
  • Judge bias controls
  • Contamination and time-split holdouts
  • Quality, latency, and cost release gates

Evaluation rubric

  • Foundational: Explains why public benchmark scores can't replace a private policy-answer evaluation contract.
  • Intermediate: Builds a grounded-answer check that attributes retrieval versus generation failures.
  • Intermediate: Explains when pass@k is appropriate and why it doesn't equal policy-answer accuracy.
  • Advanced: Designs judge-order, failure-slice, uncertainty, latency, and cost controls for a release decision.

Common pitfalls

  • Treating a public benchmark gain as evidence that policy answers are grounded.
  • Comparing model scores produced by different prompts, tools, or scoring rules.
  • Accepting an LLM judge result without checking position bias or factual anchors.
  • Shipping on a mean quality score while ignoring critical slices, latency, or cost.

Practice extension

Add a fourth policy case whose retrieved context is intentionally wrong. Extend score-grounded-policy-answers.py to label failures as bad_retrieval or bad_generation. Then add a source_page field to the candidate response and reject answers whose citation doesn't match the evidence row. You have now built the first slice of an evaluation dashboard for a RAG assistant.

Next Step
Continue to Instruction Tuning & Chat Templates

You can now measure whether an answer is grounded, reproducible, and safe to release. Next you will learn how training examples and chat formatting shape the response behavior those evaluations are designed to test.

References

RAGAS: Automated Evaluation of Retrieval Augmented Generation
Es, S., et al. · 2023 · arXiv preprint

Measuring Massive Multitask Language Understanding (MMLU)
Hendrycks, D., et al. · 2021 · ICLR 2021

GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Rein, D., et al. · 2023

Training Verifiers to Solve Math Word Problems (GSM8K)
Cobbe, K., et al. · 2021

Measuring Mathematical Problem Solving with the MATH Dataset
Hendrycks, D., et al. · 2021 · NeurIPS 2021

Humanity's Last Exam
Center for AI Safety, Scale AI, and HLE Contributors Consortium · 2025 · arXiv preprint

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Glazer, E., et al. (Epoch AI) · 2024

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
Chollet, F., et al. · 2025

Evaluating Large Language Models Trained on Code (HumanEval)
Chen, M., et al. · 2021 · arXiv preprint

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Jimenez, C. E., et al. · 2024 · ICLR 2024

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Zheng, L., et al. · 2023 · NeurIPS 2023

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Chiang, W. L., et al. · 2024

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Jain, N., et al. · 2024