
📊 Medium · Evaluation & Benchmarks

LLM Benchmarks & Limitations

Build an evaluation suite for a policy-answering LLM: score evidence use, understand public benchmark contracts, control judge bias, and make release decisions from private tests.

19 min read

Learning path

Step 51 of 155 in the full curriculum


In the previous lesson, you turned a returns-policy document into chunks that preserve answerable evidence. One retrieved chunk contained this exact rule:

Retrieved evidence: Damaged electronics: report within 48 hours with photos.

Now a support assistant answers: "Send photos within 48 hours so we can review the damage." That looks good. A second model answers: "You can return any electronics within 30 days." That sounds helpful, but the retrieved clause doesn't support it.

This chapter answers the question that comes next: how do you measure whether a large language model (LLM) system is safe to improve or deploy? You'll build a small private evaluation set, run deterministic checks, learn what public benchmark scores do and don't prove, measure code generation with pass@k, control judge bias, and turn results into a release decision.

Evaluation path for a policy-answering assistant, from public signals and a declared harness through private grounded-answer cases, bias controls, and release gates. — An evaluation result becomes useful only after you know the task, harness, scorer, and release bar. Public scores can shortlist a model; private policy checks decide whether it handles customer traffic.

Start With Questions You Must Answer Correctly

A benchmark is a collection of tasks plus a scoring procedure. A production evaluation is the same idea applied to your own workload. Instead of asking only whether a model knows academic facts, you ask whether the entire system retrieves the right clause, answers from that clause, follows the output contract, and meets latency and cost limits.

Here is a tiny private evaluation set for ShopFlow. The policy statements are fictional product rules for this lab, not general retail advice.

Case	Customer question	Retrieved policy evidence	Answer must include	Answer must not add
`damaged-electronics`	"My earbuds arrived crushed. What do I send, and by when?"	Damaged electronics: report within 48 hours with photos.	`48 hours`, `photos`	`30 days`
`late-shipment`	"Tracking hasn't changed for a week. What happens now?"	If a shipment has no scan for 7 days, open a carrier investigation.	`7 days`, `carrier investigation`	`automatic refund`
`sealed-return`	"Can I send unopened headphones back?"	Unopened headphones may be returned within 30 days.	`unopened`, `30 days`	`opened items accepted`

These rows are more useful than a generic "answer quality" prompt because each one names the behavior that passes and the risky extra claim that fails.

Before scoring a model, validate the evaluation set itself. A duplicated ID can silently overwrite a result, and a required phrase absent from its cited evidence creates an impossible test.

validate-private-eval-set.py

rows = [
    {
        "id": "damaged-electronics",
        "evidence": "Damaged electronics: report within 48 hours with photos.",
        "required": ("48 hours", "photos"),
    },
    {
        "id": "late-shipment",
        "evidence": "If a shipment has no scan for 7 days, open a carrier investigation.",
        "required": ("7 days", "carrier investigation"),
    },
    {
        "id": "sealed-return",
        "evidence": "Unopened headphones may be returned within 30 days.",
        "required": ("unopened", "30 days"),
    },
]

ids = [row["id"] for row in rows]
assert len(ids) == len(set(ids)), "evaluation IDs must be unique"

for row in rows:
    evidence = row["evidence"].lower()
    missing = [phrase for phrase in row["required"] if phrase.lower() not in evidence]
    assert not missing, f"{row['id']} has unsupported gold facts: {missing}"
    print(f"{row['id']:20} valid required={len(row['required'])}")

print(f"validated_cases={len(rows)}")

Output

damaged-electronics  valid required=2
late-shipment        valid required=2
sealed-return        valid required=2
validated_cases=3

A Score Is Only Meaningful With Its Contract

When a model card says "86%," the number is incomplete without the task and harness that produced it. Five parts make an evaluation result reproducible:

Contract part	Question to record	ShopFlow policy example
Dataset	Which cases were scored, and at which version?	`policy-golden-v3`, 120 reviewed support questions
Input path	What context reached the model?	top-3 chunks from the index, with policy version IDs
Output contract	What must the response look like?	concise answer plus cited `source_page`
Scorer	How is success computed?	required facts, forbidden claims, schema validity, human audit
Release bar	What result permits deployment?	no critical policy misses, plus latency and cost bounds

The preceding lesson addressed the second step of the pipeline below, retrieving evidence: getting complete evidence into the context. This lesson measures what happens after retrieval. Work from left to right:

Diagram: 1. Private cases (question + expected facts) → 2. Retrieve evidence (policy chunks) → 3. Generate answer (with source ID) → 4. Score behavior (facts + risk + format).

If a response fails at step 4, don't immediately blame the model. Inspect the retrieved chunk first. A missing fact may be a retrieval failure, while a contradicted fact may be a generation or instruction-following failure.

First Gate: Preserve Policy Evidence

Begin with deterministic checks whenever the answer has explicit requirements. This isn't a complete semantic evaluator. It is a fast test that catches answers missing required policy terms or adding known unsafe claims.

The following lab uses three fixture answers. It also asserts that the evaluation rows themselves are valid: every required phrase must be present in the cited evidence.

score-grounded-policy-answers.py

from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    evidence: str
    required_phrases: tuple[str, ...]
    forbidden_phrases: tuple[str, ...]

cases = [
    EvalCase(
        "damaged-electronics",
        "Damaged electronics: report within 48 hours with photos.",
        ("48 hours", "photos"),
        ("30 days",),
    ),
    EvalCase(
        "late-shipment",
        "If a shipment has no scan for 7 days, open a carrier investigation.",
        ("7 days", "carrier investigation"),
        ("automatic refund",),
    ),
    EvalCase(
        "sealed-return",
        "Unopened headphones may be returned within 30 days.",
        ("unopened", "30 days"),
        ("opened items accepted",),
    ),
]

answers = {
    "damaged-electronics": "Please report the damage within 48 hours and attach photos.",
    "late-shipment": "You qualify for an automatic refund now.",
    "sealed-return": "Unopened headphones may be returned within 30 days.",
}

def evaluate(case: EvalCase, answer: str) -> tuple[list[str], list[str]]:
    evidence = case.evidence.lower()
    text = answer.lower()
    # Lowercase each phrase too, so a capitalized gold phrase can't cause a false miss.
    assert all(phrase.lower() in evidence for phrase in case.required_phrases)
    missing = [phrase for phrase in case.required_phrases if phrase.lower() not in text]
    unsupported = [phrase for phrase in case.forbidden_phrases if phrase.lower() in text]
    return missing, unsupported

passed = 0
for case in cases:
    missing, unsupported = evaluate(case, answers[case.case_id])
    status = "PASS" if not missing and not unsupported else "FAIL"
    passed += status == "PASS"
    print(f"{case.case_id:20} {status} missing={missing} unsupported={unsupported}")

print(f"summary: grounded_policy_pass={passed}/{len(cases)}")

Output

damaged-electronics  PASS missing=[] unsupported=[]
late-shipment        FAIL missing=['7 days', 'carrier investigation'] unsupported=['automatic refund']
sealed-return        PASS missing=[] unsupported=[]
summary: grounded_policy_pass=2/3

The failed late-shipment answer is fluent, but it never mentions the required investigation and invents an automatic refund. A helpful-sounding response isn't enough when a customer may act on an unsupported policy.

This split between retrieved context and generated answer also appears in evaluation research. Ragas evaluates whether retrieved context supports the answer and whether the answer addresses the question, rather than collapsing all failures into one vague score.^[1]
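The substring checks in score-grounded-policy-answers.py can misfire on near-miss text: a plain `in` test finds `30 days` inside `130 days`. The following is a minimal sketch of a stricter matcher, assuming a hypothetical helper named `contains_phrase` that is not part of the lab script. It matches on word boundaries and tolerates extra whitespace. It still misses paraphrases such as "two days", so it tightens the deterministic gate rather than replacing semantic review.

```python
import re

def contains_phrase(text: str, phrase: str) -> bool:
    """Return True when phrase occurs in text on word boundaries (case-insensitive)."""
    # Escape each word, then allow any run of whitespace between words.
    words = [re.escape(word) for word in phrase.split()]
    pattern = r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(contains_phrase("Return it within 130 days.", "30 days"))   # False
print(contains_phrase("Return it within 30  days.", "30 days"))   # True
print(contains_phrase("Attach PHOTOS of the damage.", "photos"))  # True
```

Swapping this helper in for the `in` checks would keep a forbidden phrase like `30 days` from failing an answer that says something else entirely.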

Diagnose Retrieval and Generation Separately

Suppose an answer omits the 48 hours deadline. That doesn't tell you which component failed. Compare the gold requirement with both the retrieved context and the generated answer:

attribute-rag-failures.py

cases = [
    {
        "id": "good-answer",
        "required": ("48 hours", "photos"),
        "retrieved": "Damaged electronics: report within 48 hours with photos.",
        "answer": "Report within 48 hours and send photos.",
    },
    {
        "id": "retrieval-miss",
        "required": ("48 hours", "photos"),
        "retrieved": "Electronics category page. Contact support for assistance.",
        "answer": "Contact support for assistance.",
    },
    {
        "id": "generation-miss",
        "required": ("48 hours", "photos"),
        "retrieved": "Damaged electronics: report within 48 hours with photos.",
        "answer": "Please send photos so we can investigate.",
    },
]

def failure_stage(case: dict[str, object]) -> str:
    required = case["required"]
    retrieved = str(case["retrieved"]).lower()
    answer = str(case["answer"]).lower()
    if any(term not in retrieved for term in required):
        return "bad_retrieval"
    if any(term not in answer for term in required):
        return "bad_generation"
    return "pass"

for case in cases:
    print(f"{case['id']:16} {failure_stage(case)}")

Output

good-answer      pass
retrieval-miss   bad_retrieval
generation-miss  bad_generation

Public Benchmarks Answer Narrower Questions

Private tests tell you whether your application meets its contract. Public benchmarks help compare capabilities under standardized tasks, but each family asks a different question.

MMLU evaluates multiple-choice accuracy across 57 academic and professional subjects.^[2] It can reveal broad knowledge differences under a stated prompt protocol. It can't prove that a policy assistant quotes the right return clause.

GPQA narrows the question to difficult expert QA. Its 448 multiple-choice questions were written by domain experts in biology, physics, and chemistry.^[3] A strong GPQA result can signal expert STEM question-answering ability under that harness, but it still doesn't prove that a support workflow follows policy.

Math benchmarks narrow the question in a different direction: to problems with a checkable final answer. GSM8K uses grade-school word problems, while MATH uses harder competition-style problems; both are useful when a workflow must produce checkable mathematical answers, but they don't measure open-ended support quality.^[4]^[5] At the harder frontier, Humanity's Last Exam (HLE) collects expert-level questions across domains,^[6] and FrontierMath targets advanced mathematical problems.^[7] ARC-AGI-2 asks systems to infer novel transformations from small visual-grid demonstrations rather than recall subject matter.^[8] These benchmarks answer different research questions; a model isn't "better at reasoning" in every product merely because it improves on one of them.

HumanEval provides handwritten Python programming problems with unit tests, and its paper popularized reporting functional correctness from repeated samples with an unbiased pass@k estimator.^[9] SWE-bench moves from isolated functions to real GitHub issues whose patches are tested in repository environments.^[10] These are relevant if your system edits code, but neither evaluates customer-facing policy truthfulness.

MT-Bench uses a model judge for open-ended, multi-turn responses, while Chatbot Arena collects blind pairwise human preferences.^[11]^[12] Preference is useful for tone and helpfulness, but an answer that users prefer can still cite the wrong policy.

Evaluation family	Scored unit	Typical metric	Useful signal	Missing deployment proof
Broad knowledge	MMLU question	multiple-choice accuracy	academic/professional task coverage	evidence grounding and tool behavior
Expert STEM QA	GPQA question	multiple-choice accuracy	difficult biology, physics, and chemistry QA	production workflow behavior
Checkable mathematics	GSM8K or MATH problem	parsed final-answer accuracy	mathematical answer execution under a fixed harness	grounded customer-policy answers
Frontier reasoning research	HLE, FrontierMath, or ARC-AGI-2 item	benchmark-specific correctness	expert-question or novel-task performance in that suite	general deployment reliability
Executable code	HumanEval function or SWE-bench patch	`pass@k` or resolved rate	programs that satisfy tests	support answer policy compliance
Open-ended preference	MT-Bench or Arena comparison	judge score or pairwise preference	style and perceived helpfulness	factual ground truth
Private application eval	question, retrieved clause, and response	policy pass rate plus gates	behavior on your release path	capability outside your covered slices

The practical rule is simple: compare results only when the dataset, input path, output contract, prompt, sampling settings, tools, and scorer match. 84% on a multiple-choice set and 42% on repository fixes aren't competing measurements of the same property.

That rule is easy to encode. This helper allows comparisons only when the benchmark and the harness fields are identical.

check-harness-comparability.py

runs = {
    "candidate-a": {
        "dataset": "policy-golden-v3",
        "retriever": "chunks-v2/top3",
        "prompt": "support-answer-v4",
        "output_contract": "answer-with-source-page-v2",
        "toolset": ("policy-search-v2",),
        "temperature": 0.0,
        "max_tokens": 180,
        "scorer": "facts-v2",
        "score": 0.97,
    },
    "candidate-b": {
        "dataset": "policy-golden-v3",
        "retriever": "chunks-v2/top3",
        "prompt": "support-answer-v4",
        "output_contract": "answer-with-source-page-v2",
        "toolset": ("policy-search-v2",),
        "temperature": 0.0,
        "max_tokens": 180,
        "scorer": "facts-v2",
        "score": 0.99,
    },
    "candidate-c": {
        "dataset": "policy-golden-v3",
        "retriever": "chunks-v2/top5",
        "prompt": "support-answer-v4",
        "output_contract": "answer-with-source-page-v2",
        "toolset": ("policy-search-v2",),
        "temperature": 0.0,
        "max_tokens": 180,
        "scorer": "facts-v2",
        "score": 1.00,
    },
}

contract_fields = (
    "dataset",
    "retriever",
    "prompt",
    "output_contract",
    "toolset",
    "temperature",
    "max_tokens",
    "scorer",
)

def comparable(left: dict[str, object], right: dict[str, object]) -> bool:
    return all(left[field] == right[field] for field in contract_fields)

print("A vs B comparable:", comparable(runs["candidate-a"], runs["candidate-b"]))
print("A vs C comparable:", comparable(runs["candidate-a"], runs["candidate-c"]))
print("C differs because its retriever sees more chunks.")

Output

A vs B comparable: True
A vs C comparable: False
C differs because its retriever sees more chunks.
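When two runs aren't comparable, the next question is which contract field changed. The following is a small sketch, assuming a hypothetical helper named `differing_fields` and trimmed copies of the run records above. It lists the mismatched fields instead of relying on a hand-written explanation.

```python
def differing_fields(left: dict, right: dict, fields: tuple[str, ...]) -> list[str]:
    """List the contract fields whose values differ between two runs."""
    return [field for field in fields if left.get(field) != right.get(field)]

# Trimmed copies of candidate-a and candidate-c; only two contract fields are shown.
run_a = {"dataset": "policy-golden-v3", "retriever": "chunks-v2/top3", "score": 0.97}
run_c = {"dataset": "policy-golden-v3", "retriever": "chunks-v2/top5", "score": 1.00}

print(differing_fields(run_a, run_c, ("dataset", "retriever")))  # ['retriever']
```

The score is deliberately left out of the compared fields: it is the measurement, not part of the contract.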

`pass@k`: Measure a Workflow That Can Test Several Drafts

For code generation, a single answer isn't always the real workflow. A developer may generate several candidates and run tests until one passes. HumanEval's pass@k estimator measures the probability that at least one of $k$ selected samples is correct, given $n$ generated programs and $c$ correct programs.^[9]

Suppose an inventory-code task generated 100 candidates and tests accepted 20. Checking one random candidate succeeds with probability 0.20. Checking 10 distinct candidates succeeds much more often:

$$\operatorname{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

Here, $n$ is the number of generated programs, $c$ is the number that pass tests, and $k$ is the number you are allowed to try.

estimate-pass-at-k.py

from math import comb

def pass_at_k(total_samples: int, correct_samples: int, k: int) -> float:
    if not 0 <= correct_samples <= total_samples:
        raise ValueError("correct_samples must be inside the sampled set")
    if not 1 <= k <= total_samples:
        raise ValueError("k must be between 1 and total_samples")
    if total_samples - correct_samples < k:
        return 1.0
    return 1.0 - comb(total_samples - correct_samples, k) / comb(total_samples, k)

for k in (1, 5, 10, 20):
    print(f"pass@{k:<2} = {pass_at_k(100, 20, k):.4f}")

Output

pass@1  = 0.2000
pass@5  = 0.6807
pass@10 = 0.9049
pass@20 = 0.9934
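The binomial ratio can also be written as a product, $1 - \prod_{i=n-c+1}^{n} (1 - k/i)$, which avoids building large binomial coefficients; the HumanEval paper computes the estimator in this form.^[9] Here is a minimal sketch of that variant. The function names `pass_at_k_product` and `mean_pass_at_k` are illustrative, and averaging per-task estimates is shown as one common way to report a suite-level number, not as this lab's required method.

```python
from math import comb, prod

def pass_at_k_product(n: int, c: int, k: int) -> float:
    """pass@k as 1 - prod_{i=n-c+1}^{n} (1 - k/i), equal to the binomial ratio."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))

def mean_pass_at_k(per_task: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    return sum(pass_at_k_product(n, c, k) for n, c in per_task) / len(per_task)

# Same fixture as above: 100 samples, 20 correct.
print(f"{pass_at_k_product(100, 20, 10):.4f}")  # 0.9049

# Cross-check against the binomial form.
assert abs(pass_at_k_product(100, 20, 10) - (1 - comb(80, 10) / comb(100, 10))) < 1e-12
```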

Pass at k curve for 100 generated code samples with 20 correct candidates, showing win probability rising sharply as more attempts are tested. — `pass@k` measures the value of generating and testing several code drafts. It doesn't mean a single unverified response became more reliable.

This distinction matters. Multiple tested drafts make sense for a generated inventory function because unit tests select a passing implementation. You shouldn't sample several customer-facing refund answers and silently choose the most confident-looking one without a policy-grounded scorer.

Open-Ended Answers Need Controlled Judgment

Deterministic checks catch explicit clauses, but they don't settle every question. Two answers can both cite the correct deadline while differing in clarity or empathy. For open-ended comparisons, teams often use humans or an LLM judge.

The MT-Bench study found that capable LLM judges can approximate human preferences in its setting, but also documents position, verbosity, and self-enhancement biases.^[11] A judge result is measurement output, not ground truth.

The small simulation below intentionally gives a toy judge a position bias: if two answers contain the same required facts, it chooses whichever answer appears first. Running each comparison in both orders exposes that instability.

detect-position-biased-judge.py

def biased_judge(answer_a: str, answer_b: str, required: tuple[str, ...]) -> str:
    score_a = sum(term in answer_a.lower() for term in required)
    score_b = sum(term in answer_b.lower() for term in required)
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "A"  # Deliberate position bias for equal-quality answers.

def swap_checked(answer_a: str, answer_b: str, required: tuple[str, ...]) -> dict[str, object]:
    first_order = biased_judge(answer_a, answer_b, required)
    swapped_order = biased_judge(answer_b, answer_a, required)
    normalized_swapped = {"A": "B", "B": "A"}[swapped_order]
    if first_order != normalized_swapped:
        return {"winner": "needs_human_review", "stable": False}
    return {"winner": first_order, "stable": True}

required = ("48 hours", "photos")
concise = "Report the damaged earbuds within 48 hours and attach photos."
friendly = "Sorry they arrived damaged. Please send photos within 48 hours."
unsupported = "Return the earbuds whenever convenient."

print("equivalent:", swap_checked(concise, friendly, required))
print("factual miss:", swap_checked(concise, unsupported, required))

Output

equivalent: {'winner': 'needs_human_review', 'stable': False}
factual miss: {'winner': 'A', 'stable': True}

An unstable comparison isn't a failure of either candidate. It means the measurement can't distinguish them reliably under its current rubric. Send such cases to a human reviewer or improve the rubric before aggregating a winner.
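Across a batch of pairs, the same swap check yields an aggregate: how often the judge's verdict survives reordering. The following is a sketch under stated assumptions: `summarize_swaps` is a hypothetical helper, and the verdict pairs are made-up fixtures rather than output from a real judge.

```python
from collections import Counter

def summarize_swaps(verdicts: list[tuple[str, str]]) -> dict[str, object]:
    """Summarize (first_order, swapped_order) judge verdicts, each "A" or "B".

    The swapped verdict is reported in swapped positions, so a consistent
    judge answers "A" then "B" (or "B" then "A") for the same pair.
    """
    flip = {"A": "B", "B": "A"}
    wins: Counter[str] = Counter()
    unstable = 0
    for first, swapped in verdicts:
        if first == flip[swapped]:
            wins[first] += 1   # same answer won in both orders
        else:
            unstable += 1      # verdict followed position, not content
    total = len(verdicts)
    return {
        "stable_rate": (total - unstable) / total if total else 0.0,
        "wins": dict(wins),
        "needs_human_review": unstable,
    }

# Hypothetical verdicts for four pairs: two stable wins for A, one for B, one unstable.
print(summarize_swaps([("A", "B"), ("A", "B"), ("B", "A"), ("A", "A")]))
# {'stable_rate': 0.75, 'wins': {'A': 2, 'B': 1}, 'needs_human_review': 1}
```

Reporting the stable rate next to the win counts keeps a position-driven judge from quietly inflating one candidate's total.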

Pairwise LLM judge control loop with answer-order swap, length control, reference anchor, stable winner check, and human review for unstable cases. — Pairwise grading needs controls. Swap answer order, anchor the rubric to policy facts, and treat unstable wins as a request for review rather than a model victory.

Public Test Sets Can Lose Their Meaning

A public benchmark is reproducible because everyone can access the tasks. That same visibility creates a risk: benchmark questions or solutions may appear in training corpora. A model that encountered a test item during training is no longer being measured on a genuinely unseen example.

Do not jump from risk to accusation. A public score alone doesn't prove contamination in a particular model. It means the report should state what decontamination checks were used, and high-stakes selection should include fresh or private tasks.

LiveCodeBench was designed around this problem: it continually collects coding tasks released over time and evaluates models on problems that appear after their stated training cutoff.^[13] Time splits reduce exposure risk, although they still depend on accurate cutoff information and a stable harness.

Benchmark contamination cycle where public tasks spread onto web mirrors, leak into training corpora, inflate leaderboard scores, and force a shift toward time-split or private evals. — A public test remains valuable only while it measures unseen behavior. Fresh time-split tasks and private release cases preserve signal when popular static sets may have circulated widely.

For the ShopFlow policy assistant, keep private policy cases separate from prompt examples, demo transcripts, and support documentation that may later be used for tuning. If an evaluation case becomes a training example, retire it from the holdout set or record that it now measures regression behavior rather than generalization.
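One way to enforce that separation is a simple overlap check between holdout questions and any text that could be used for tuning. The following is a deliberately simple sketch, assuming hypothetical helpers (`normalize`, `word_ngrams`, `leaked_cases`) and a 5-word window chosen only for illustration. Real decontamination checks vary and should be described in the evaluation report.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivial edits still match."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive words; empty when the text has fewer than n words."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def leaked_cases(holdout: dict[str, str], tuning_texts: list[str], n: int = 5) -> list[str]:
    """Return holdout IDs sharing at least one word n-gram with any tuning text."""
    seen: set[tuple[str, ...]] = set()
    for text in tuning_texts:
        seen |= word_ngrams(text, n)
    return sorted(case_id for case_id, question in holdout.items()
                  if word_ngrams(question, n) & seen)

holdout = {
    "damaged-electronics": "My earbuds arrived crushed. What do I send, and by when?",
    "sealed-return": "Can I send unopened headphones back?",
}
# Hypothetical prompt example that reuses a holdout question almost verbatim.
tuning_texts = ["Example: my earbuds arrived crushed, what do I send and by when"]
print(leaked_cases(holdout, tuning_texts))  # ['damaged-electronics']
```

A flagged case isn't proof of leakage into a model; it is a prompt to retire the case from the holdout or relabel it as a regression test.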

A time-split suite also helps when new policy versions arrive. A model tuned using cases created through March shouldn't be evaluated as "unseen" on those same cases. Hold out cases authored after the tuning snapshot:

make-time-split-holdout.py

from datetime import date

tuning_snapshot = date.fromisoformat("2026-03-31")
policy_cases = [
    ("holiday-return-window", "2026-02-10"),
    ("damaged-electronics-photo-rule", "2026-04-08"),
    ("split-shipment-investigation", "2026-04-21"),
    ("seller-battery-restriction", "2026-05-03"),
]

training_visible = []
fresh_holdout = []
for case_id, created_at in policy_cases:
    destination = fresh_holdout if date.fromisoformat(created_at) > tuning_snapshot else training_visible
    destination.append(case_id)

print("known before tuning:", training_visible)
print("fresh holdout:", fresh_holdout)
assert "damaged-electronics-photo-rule" in fresh_holdout

Output

known before tuning: ['holiday-return-window']
fresh holdout: ['damaged-electronics-photo-rule', 'split-shipment-investigation', 'seller-battery-restriction']

Turn Results Into a Release Gate

Your evaluation is useful when it changes an engineering decision. A model may give grounded answers but be too slow for live chat. Another may be fast and cheap, but fail critical policy cases. Define the bar before comparing models.

Production model selection flow where public benchmarks shortlist candidates, private support tasks test policy and formatting, and latency plus cost gates decide ship, route, or reject. — Model selection ends with your own release gates. A candidate ships only when policy quality and operational limits pass together; otherwise route selectively or reject it.

The fixture values below represent measurements from a fictional private evaluation run. The code makes the release rule explicit: a candidate that misses policy or schema quality is rejected, while a correct but expensive model may be routed only to difficult cases.

apply-release-gates.py

candidates = [
    {
        "name": "fast-small",
        "policy_pass_rate": 0.91,
        "schema_pass_rate": 1.00,
        "p95_latency_ms": 420,
        "cost_per_case": 0.002,
    },
    {
        "name": "balanced",
        "policy_pass_rate": 0.99,
        "schema_pass_rate": 1.00,
        "p95_latency_ms": 680,
        "cost_per_case": 0.008,
    },
    {
        "name": "slow-specialist",
        "policy_pass_rate": 1.00,
        "schema_pass_rate": 1.00,
        "p95_latency_ms": 1800,
        "cost_per_case": 0.031,
    },
]

minimum_policy = 0.98
minimum_schema = 0.995
online_latency_limit_ms = 900
online_cost_limit = 0.015

def decision(candidate: dict[str, float | str]) -> str:
    if candidate["policy_pass_rate"] < minimum_policy:
        return "reject: policy quality below bar"
    if candidate["schema_pass_rate"] < minimum_schema:
        return "reject: output contract below bar"
    if candidate["p95_latency_ms"] > online_latency_limit_ms:
        return "route: reserve for escalations because latency is high"
    if candidate["cost_per_case"] > online_cost_limit:
        return "route: reserve for escalations because cost is high"
    return "ship: passes online gates"

for candidate in candidates:
    print(f"{candidate['name']:15} {decision(candidate)}")

Output

fast-small      reject: policy quality below bar
balanced        ship: passes online gates
slow-specialist route: reserve for escalations because latency is high

The public benchmark shortlist never appears inside the `decision` function. That is intentional. Public scores help decide which candidates deserve testing. A release gate uses measurements taken on the workload you are about to serve.

Inspect Failure Slices, Not Only the Average

An aggregate score can hide the one category with the largest customer impact. The run below clears four of six cases overall, but damaged-electronics answers fail most often.

report-failure-slices.py

from collections import defaultdict

results = [
    ("damaged-electronics", True),
    ("damaged-electronics", False),
    ("damaged-electronics", False),
    ("late-shipment", True),
    ("late-shipment", True),
    ("sealed-return", True),
]

by_slice: dict[str, list[bool]] = defaultdict(list)
for category, passed in results:
    by_slice[category].append(passed)

overall = sum(passed for _, passed in results) / len(results)
print(f"overall={overall:.2%}")
for category in sorted(by_slice):
    values = by_slice[category]
    rate = sum(values) / len(values)
    print(f"{category:20} pass_rate={rate:.2%} cases={len(values)}")

Output

overall=66.67%
damaged-electronics  pass_rate=33.33% cases=3
late-shipment        pass_rate=100.00% cases=2
sealed-return        pass_rate=100.00% cases=1

That report should block a release if damaged-item policy mistakes are critical, whatever the overall average looks like.
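A per-slice gate makes that blocking rule explicit. This is a minimal sketch, assuming a hypothetical `slice_gate` helper and illustrative minimums. The pass rates are the ones from the report above, and a critical slice with no measured rate is treated as failing.

```python
def slice_gate(pass_rates: dict[str, float], minimums: dict[str, float]) -> list[str]:
    """Return the slices whose pass rate is below their declared minimum."""
    # A slice with a declared minimum but no measurement counts as 0.0 and fails.
    return sorted(name for name, floor in minimums.items()
                  if pass_rates.get(name, 0.0) < floor)

# Pass rates from the report above; the minimums are assumed for illustration.
pass_rates = {"damaged-electronics": 1 / 3, "late-shipment": 1.0, "sealed-return": 1.0}
critical_minimums = {"damaged-electronics": 0.98, "late-shipment": 0.95}

blocked = slice_gate(pass_rates, critical_minimums)
print("BLOCK" if blocked else "OK", blocked)  # BLOCK ['damaged-electronics']
```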

Small Samples Need an Uncertainty Margin

Three passing examples make a good tutorial, not a production release argument. Earlier statistics lessons introduced uncertainty in estimated rates. For a binary pass metric, a Wilson lower bound gives a conservative view of the pass rate supported by a finite sample:

require-confidence-before-release.py

from math import sqrt

def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    proportion = successes / total
    denominator = 1 + z**2 / total
    center = proportion + z**2 / (2 * total)
    margin = z * sqrt((proportion * (1 - proportion) + z**2 / (4 * total)) / total)
    return (center - margin) / denominator

release_floor = 0.95
for successes, total in ((3, 3), (49, 50), (990, 1000)):
    lower = wilson_lower_bound(successes, total)
    status = "PASS" if lower >= release_floor else "COLLECT_MORE_OR_FIX"
    print(f"{successes:>3}/{total:<4} lower_bound={lower:.3f} {status}")

Output

  3/3    lower_bound=0.438 COLLECT_MORE_OR_FIX
 49/50   lower_bound=0.895 COLLECT_MORE_OR_FIX
990/1000 lower_bound=0.982 PASS

Even perfect performance on three examples has a weak lower bound. Build a sufficiently large, risk-stratified holdout before claiming a system clears a production-quality bar.
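To get a feel for "sufficiently large," you can ask how many cases a flawless run needs before its Wilson lower bound reaches the floor. The sketch below reuses the same `wilson_lower_bound` formula; `smallest_perfect_sample` is a hypothetical helper added for illustration. With zero failures the bound simplifies to $n / (n + z^2)$, so the search is only a convenience.

```python
from math import sqrt

def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    proportion = successes / total
    denominator = 1 + z**2 / total
    center = proportion + z**2 / (2 * total)
    margin = z * sqrt((proportion * (1 - proportion) + z**2 / (4 * total)) / total)
    return (center - margin) / denominator

def smallest_perfect_sample(release_floor: float, z: float = 1.96, limit: int = 100_000) -> int:
    """Smallest n for which n passes out of n gives a lower bound >= release_floor."""
    for total in range(1, limit + 1):
        if wilson_lower_bound(total, total, z) >= release_floor:
            return total
    raise ValueError("no sample size up to limit clears the floor")

print(smallest_perfect_sample(0.95))  # 73
```

Any observed failure pushes the required sample size higher, which is why the 49/50 run above still falls short of the 0.95 floor.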

Write Down Enough to Reproduce the Claim

Evaluation reports should let another engineer rerun the comparison and inspect its failures. Record these fields before sharing any model ranking:

Report field	Why it matters	Example entry
Dataset version and holdout policy	prevents accidental training leakage	`policy-golden-v3`, never used for prompt tuning
Retrieval setup	separates missing evidence from bad answers	chunker version, top-k, index snapshot
Prompt and chat format	instruction formatting changes outputs	system prompt hash, template version
Model and sampling settings	output variance depends on decoding	model ID, temperature, max tokens
Scorer and judge controls	metrics can hide bias	deterministic gates, order swap, audit sample
Failure slices	averages conceal unsafe cases	damaged electronics, late scans, multilingual queries
Latency and cost	a correct system still needs to run	p50/p95 latency, cost per successful case
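These report fields can travel with the results as a small machine-readable manifest. The following is a sketch only: the field names and the `text_hash` helper are assumptions, and every angle-bracketed or `None` value is a placeholder to be filled from a real run.

```python
import hashlib
import json

def text_hash(text: str) -> str:
    """Short SHA-256 prefix so a prompt can be identified without pasting it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

# Placeholder values for illustration, not real run data.
manifest = {
    "dataset": {"version": "policy-golden-v3", "used_for_prompt_tuning": False},
    "retrieval": {"chunker": "chunks-v2", "top_k": 3, "index_snapshot": "<snapshot-id>"},
    "prompt": {"system_prompt_hash": text_hash("<system prompt text>"),
               "template": "<template-version>"},
    "model": {"id": "<model-id>", "temperature": 0.0, "max_tokens": 180},
    "scorer": {"name": "facts-v2", "judge_order_swap": True, "audit_sample": None},
    "failure_slices": ["damaged-electronics", "late-shipment", "sealed-return"],
    "operations": {"p50_latency_ms": None, "p95_latency_ms": None,
                   "cost_per_successful_case": None},
}

# Sorted keys keep diffs between two manifests easy to read.
print(json.dumps(manifest, indent=2, sort_keys=True))
```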

That prompt-and-chat-format row points to the next lesson. If a model regularly omits required structure or speaks in the wrong role, evaluation has exposed a behavior gap. You then need to understand how instruction tuning and chat templates teach the interaction contract.

Failure Patterns to Diagnose

Symptom	Likely cause	First fix to test
Answer omits the deadline although policy chunk contains it	generation or prompt contract failure	require cited facts and inspect prompt/template
Answer sounds confident but cites a clause absent from context	unsupported generation	add forbidden-claim checks and human review slice
Two published model scores reverse under a new harness	incomparable prompt or scoring setup	rerun identical harness and publish configuration
Judge selects whichever answer appears first	position bias	score both orders and reject unstable comparisons
Public score rises while private cases stagnate	task mismatch or contaminated public signal	prioritize fresh private holdout and failure analysis
Correct model misses live latency target	operational bottleneck	route difficult cases or select a faster candidate

Mastery check

Key concepts

Evaluation contracts and private golden sets
Grounded answer checks for RAG
Benchmark families and harness comparability
HumanEval and pass@k
Judge bias controls
Contamination and time-split holdouts
Quality, latency, and cost release gates

Evaluation rubric

Foundational: Explains why public benchmark scores can't replace a private policy-answer evaluation contract.
Intermediate: Builds a grounded-answer check that attributes retrieval versus generation failures.
Intermediate: Explains when pass@k is appropriate and why it doesn't equal policy-answer accuracy.
Advanced: Designs judge-order, failure-slice, uncertainty, latency, and cost controls for a release decision.

Common pitfalls

Treating a public benchmark gain as evidence that policy answers are grounded.
Comparing model scores produced by different prompts, tools, or scoring rules.
Accepting an LLM judge result without checking position bias or factual anchors.
Shipping on a mean quality score while ignoring critical slices, latency, or cost.


Practice extension

Add a fourth policy case whose retrieved context is intentionally wrong. Extend score-grounded-policy-answers.py to label failures as bad_retrieval or bad_generation. Then add a source_page field to the candidate response and reject answers whose citation doesn't match the evidence row. You have now built the first slice of an evaluation dashboard for a RAG assistant.

Next Step

Continue to Instruction Tuning & Chat Templates

You can now measure whether an answer is grounded, reproducible, and safe to release. Next you will learn how training examples and chat formatting shape the response behavior those evaluations are designed to test.


References

RAGAS: Automated Evaluation of Retrieval Augmented Generation.

Es, S., et al. · 2023 · arXiv preprint

Measuring Massive Multitask Language Understanding (MMLU).

Hendrycks, D., et al. · 2021 · ICLR 2021

GPQA: A Graduate-Level Google-Proof Q&A Benchmark.

Rein, D., et al. · 2023

Training Verifiers to Solve Math Word Problems (GSM8K).

Cobbe, K., et al. · 2021

Measuring Mathematical Problem Solving with the MATH Dataset.

Hendrycks, D., et al. · 2021 · NeurIPS 2021

Humanity's Last Exam

Center for AI Safety, Scale AI, and HLE Contributors Consortium · 2025 · arXiv preprint

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Glazer, E., et al. (Epoch AI) · 2024

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

Chollet, F., et al. · 2025

Evaluating Large Language Models Trained on Code (HumanEval).

Chen, M., et al. · 2021 · arXiv preprint

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez, C. E., et al. · 2024 · ICLR 2024

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.

Zheng, L., et al. · 2023 · NeurIPS 2023

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.

Chiang, W. L., et al. · 2024

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Jain, N., et al. · 2024


Start With Questions You Must Answer Correctly

Here is a tiny private evaluation set for ShopFlow. The policy statements are fictional product rules for this lab, not general retail advice.

Case	Customer question	Retrieved policy evidence	Answer must include	Answer must not add
`damaged-electronics`	"My earbuds arrived crushed. What do I send, and by when?"	Damaged electronics: report within 48 hours with photos.	`48 hours`, `photos`	`30 days`
`late-shipment`	"Tracking hasn't changed for a week. What happens now?"	If a shipment has no scan for 7 days, open a carrier investigation.	`7 days`, `carrier investigation`	`automatic refund`
`sealed-return`	"Can I send unopened headphones back?"	Unopened headphones may be returned within 30 days.	`unopened`, `30 days`	`opened items accepted`

These rows are more useful than a generic "answer quality" prompt because each one names the behavior that passes and the risky extra claim that fails.

Before scoring a model, validate the evaluation set itself. A duplicated ID can silently overwrite a result, and a required phrase absent from its cited evidence creates an impossible test.

validate-private-eval-set.py

rows = [
    {
        "id": "damaged-electronics",
        "evidence": "Damaged electronics: report within 48 hours with photos.",
        "required": ("48 hours", "photos"),
    },
    {
        "id": "late-shipment",
        "evidence": "If a shipment has no scan for 7 days, open a carrier investigation.",
        "required": ("7 days", "carrier investigation"),
    },
    {
        "id": "sealed-return",
        "evidence": "Unopened headphones may be returned within 30 days.",
        "required": ("unopened", "30 days"),
    },
]

ids = [row["id"] for row in rows]
assert len(ids) == len(set(ids)), "evaluation IDs must be unique"

for row in rows:
    evidence = row["evidence"].lower()
    missing = [phrase for phrase in row["required"] if phrase not in evidence]
    assert not missing, f"{row['id']} has unsupported gold facts: {missing}"
    print(f"{row['id']:20} valid required={len(row['required'])}")

print(f"validated_cases={len(rows)}")

Output

damaged-electronics  valid required=2
late-shipment        valid required=2
sealed-return        valid required=2
validated_cases=3

A Score Is Only Meaningful With Its Contract

When a model card says "86%," the number is incomplete without the task and harness that produced it. Five parts make an evaluation result reproducible:

Contract part	Question to record	ShopFlow policy example
Dataset	Which cases were scored, and at which version?	`policy-golden-v3`, 120 reviewed support questions
Input path	What context reached the model?	top-3 chunks from the index, with policy version IDs
Output contract	What must the response look like?	concise answer plus cited `source_page`
Scorer	How is success computed?	required facts, forbidden claims, schema validity, human audit
Release bar	What result permits deployment?	no critical policy misses, plus latency and cost bounds
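
One way to keep the five contract parts attached to a score is to store them in a single record and derive a fingerprint from it. The sketch below is illustrative only: the `EvalContract` class, its field names, and the `fingerprint` helper are assumptions made for this example, not part of the ShopFlow lab.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalContract:
    """Hypothetical record holding the five contract parts from the table."""
    dataset: str
    input_path: str
    output_contract: str
    scorer: str
    release_bar: str


def fingerprint(contract: EvalContract) -> str:
    """Return a short, stable hash of the contract fields."""
    payload = json.dumps(asdict(contract), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


baseline = EvalContract(
    dataset="policy-golden-v3",
    input_path="top-3 chunks with policy version IDs",
    output_contract="concise answer plus cited source_page",
    scorer="required facts, forbidden claims, schema validity",
    release_bar="no critical policy misses; latency and cost bounds",
)
# Same contract except for the retrieval depth.
wider = replace(baseline, input_path="top-5 chunks with policy version IDs")

print("baseline:", fingerprint(baseline))
print("wider:   ", fingerprint(wider))
print("same contract:", fingerprint(baseline) == fingerprint(wider))
```

Storing the fingerprint next to every reported score makes it harder to compare two numbers that came from different contracts without noticing.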

The preceding lesson addressed the second row of this contract, the input path: getting complete evidence into the context. This lesson measures what happens after retrieval.

First Gate: Preserve Policy Evidence

The following lab uses three fixture answers. It also asserts that the evaluation rows themselves are valid: every required phrase must be present in the cited evidence.

score-grounded-policy-answers.py

from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    evidence: str
    required_phrases: tuple[str, ...]
    forbidden_phrases: tuple[str, ...]

cases = [
    EvalCase(
        "damaged-electronics",
        "Damaged electronics: report within 48 hours with photos.",
        ("48 hours", "photos"),
        ("30 days",),
    ),
    EvalCase(
        "late-shipment",
        "If a shipment has no scan for 7 days, open a carrier investigation.",
        ("7 days", "carrier investigation"),
        ("automatic refund",),
    ),
    EvalCase(
        "sealed-return",
        "Unopened headphones may be returned within 30 days.",
        ("unopened", "30 days"),
        ("opened items accepted",),
    ),
]

answers = {
    "damaged-electronics": "Please report the damage within 48 hours and attach photos.",
    "late-shipment": "You qualify for an automatic refund now.",
    "sealed-return": "Unopened headphones may be returned within 30 days.",
}

def evaluate(case: EvalCase, answer: str) -> tuple[list[str], list[str]]:
    evidence = case.evidence.lower()
    text = answer.lower()
    assert all(phrase in evidence for phrase in case.required_phrases)
    missing = [phrase for phrase in case.required_phrases if phrase not in text]
    unsupported = [phrase for phrase in case.forbidden_phrases if phrase in text]
    return missing, unsupported

passed = 0
for case in cases:
    missing, unsupported = evaluate(case, answers[case.case_id])
    status = "PASS" if not missing and not unsupported else "FAIL"
    passed += status == "PASS"
    print(f"{case.case_id:20} {status} missing={missing} unsupported={unsupported}")

print(f"summary: grounded_policy_pass={passed}/{len(cases)}")

Output

damaged-electronics  PASS missing=[] unsupported=[]
late-shipment        FAIL missing=['7 days', 'carrier investigation'] unsupported=['automatic refund']
sealed-return        PASS missing=[] unsupported=[]
summary: grounded_policy_pass=2/3

Diagnose Retrieval and Generation Separately

Suppose an answer omits the 48-hour deadline. That alone doesn't tell you which component failed. Compare the gold requirement with both the retrieved context and the generated answer:

attribute-rag-failures.py

cases = [
    {
        "id": "good-answer",
        "required": ("48 hours", "photos"),
        "retrieved": "Damaged electronics: report within 48 hours with photos.",
        "answer": "Report within 48 hours and send photos.",
    },
    {
        "id": "retrieval-miss",
        "required": ("48 hours", "photos"),
        "retrieved": "Electronics category page. Contact support for assistance.",
        "answer": "Contact support for assistance.",
    },
    {
        "id": "generation-miss",
        "required": ("48 hours", "photos"),
        "retrieved": "Damaged electronics: report within 48 hours with photos.",
        "answer": "Please send photos so we can investigate.",
    },
]

def failure_stage(case: dict[str, object]) -> str:
    required = case["required"]
    retrieved = str(case["retrieved"]).lower()
    answer = str(case["answer"]).lower()
    if any(term not in retrieved for term in required):
        return "bad_retrieval"
    if any(term not in answer for term in required):
        return "bad_generation"
    return "pass"

for case in cases:
    print(f"{case['id']:16} {failure_stage(case)}")

Output

good-answer      pass
retrieval-miss   bad_retrieval
generation-miss  bad_generation

Public Benchmarks Answer Narrower Questions

Private tests tell you whether your application meets its contract. Public benchmarks help compare capabilities under standardized tasks, but each family asks a different question.

Evaluation family	Scored unit	Typical metric	Useful signal	Missing deployment proof
Broad knowledge	MMLU question	multiple-choice accuracy	academic/professional task coverage	evidence grounding and tool behavior
Expert STEM QA	GPQA question	multiple-choice accuracy	difficult biology, physics, and chemistry QA	production workflow behavior
Checkable mathematics	GSM8K or MATH problem	parsed final-answer accuracy	mathematical answer execution under a fixed harness	grounded customer-policy answers
Frontier reasoning research	HLE, FrontierMath, or ARC-AGI-2 item	benchmark-specific correctness	expert-question or novel-task performance in that suite	general deployment reliability
Executable code	HumanEval function or SWE-bench patch	`pass@k` or resolved rate	programs that satisfy tests	support answer policy compliance
Open-ended preference	MT-Bench or Arena comparison	judge score or pairwise preference	style and perceived helpfulness	factual ground truth
Private application eval	question, retrieved clause, response,	policy pass rate plus gates	behavior on your release path	capability outside your covered slices

Two scores are comparable only when they come from the same benchmark under the same harness. That rule is easy to encode: this helper allows a comparison only when every contract field is identical.

check-harness-comparability.py

runs = {
    "candidate-a": {
        "dataset": "policy-golden-v3",
        "retriever": "chunks-v2/top3",
        "prompt": "support-answer-v4",
        "output_contract": "answer-with-source-page-v2",
        "toolset": ("policy-search-v2",),
        "temperature": 0.0,
        "max_tokens": 180,
        "scorer": "facts-v2",
        "score": 0.97,
    },
    "candidate-b": {
        "dataset": "policy-golden-v3",
        "retriever": "chunks-v2/top3",
        "prompt": "support-answer-v4",
        "output_contract": "answer-with-source-page-v2",
        "toolset": ("policy-search-v2",),
        "temperature": 0.0,
        "max_tokens": 180,
        "scorer": "facts-v2",
        "score": 0.99,
    },
    "candidate-c": {
        "dataset": "policy-golden-v3",
        "retriever": "chunks-v2/top5",
        "prompt": "support-answer-v4",
        "output_contract": "answer-with-source-page-v2",
        "toolset": ("policy-search-v2",),
        "temperature": 0.0,
        "max_tokens": 180,
        "scorer": "facts-v2",
        "score": 1.00,
    },
}

contract_fields = (
    "dataset",
    "retriever",
    "prompt",
    "output_contract",
    "toolset",
    "temperature",
    "max_tokens",
    "scorer",
)

def comparable(left: dict[str, object], right: dict[str, object]) -> bool:
    return all(left[field] == right[field] for field in contract_fields)

print("A vs B comparable:", comparable(runs["candidate-a"], runs["candidate-b"]))
print("A vs C comparable:", comparable(runs["candidate-a"], runs["candidate-c"]))
print("C differs because its retriever sees more chunks.")

Output

A vs B comparable: True
A vs C comparable: False
C differs because its retriever sees more chunks.

`pass@k`: Measure a Workflow That Can Test Several Drafts

$$\operatorname{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

Here, $n$ is the number of generated programs, $c$ is the number that pass tests, and $k$ is the number you are allowed to try.

estimate-pass-at-k.py

from math import comb

def pass_at_k(total_samples: int, correct_samples: int, k: int) -> float:
    if not 0 <= correct_samples <= total_samples:
        raise ValueError("correct_samples must be inside the sampled set")
    if not 1 <= k <= total_samples:
        raise ValueError("k must be between 1 and total_samples")
    if total_samples - correct_samples < k:
        return 1.0
    return 1.0 - comb(total_samples - correct_samples, k) / comb(total_samples, k)

for k in (1, 5, 10, 20):
    print(f"pass@{k:<2} = {pass_at_k(100, 20, k):.4f}")

Output

pass@1  = 0.2000
pass@5  = 0.6807
pass@10 = 0.9049
pass@20 = 0.9934
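
A benchmark-level `pass@k` is the mean of the per-problem estimates, not a single estimate computed from pooled counts. The sketch below shows that averaging step on three made-up problems. The problem names and sample counts are invented for illustration, and `pass_at_k` is repeated in simplified form so the block runs on its own.

```python
from math import comb


def pass_at_k(total_samples: int, correct_samples: int, k: int) -> float:
    """Per-problem estimate: 1 - C(n-c, k) / C(n, k)."""
    if total_samples - correct_samples < k:
        return 1.0
    return 1.0 - comb(total_samples - correct_samples, k) / comb(total_samples, k)


# Hypothetical results: (samples drawn, samples that passed the tests).
problems = {"easy": (10, 9), "medium": (10, 3), "hard": (10, 0)}

k = 5
per_problem = {name: pass_at_k(n, c, k) for name, (n, c) in problems.items()}
benchmark_score = sum(per_problem.values()) / len(per_problem)

for name, value in per_problem.items():
    print(f"{name:6} pass@{k} = {value:.4f}")
print(f"mean   pass@{k} = {benchmark_score:.4f}")
```

A problem with no passing samples contributes 0 at every `k`, so raising `k` cannot lift the mean above the share of problems the model solves at least once.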

Open-Ended Answers Need Controlled Judgment

detect-position-biased-judge.py

def biased_judge(answer_a: str, answer_b: str, required: tuple[str, ...]) -> str:
    score_a = sum(term in answer_a.lower() for term in required)
    score_b = sum(term in answer_b.lower() for term in required)
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "A"  # Deliberate position bias for equal-quality answers.

def swap_checked(answer_a: str, answer_b: str, required: tuple[str, ...]) -> dict[str, object]:
    first_order = biased_judge(answer_a, answer_b, required)
    swapped_order = biased_judge(answer_b, answer_a, required)
    normalized_swapped = {"A": "B", "B": "A"}[swapped_order]
    if first_order != normalized_swapped:
        return {"winner": "needs_human_review", "stable": False}
    return {"winner": first_order, "stable": True}

required = ("48 hours", "photos")
concise = "Report the damaged earbuds within 48 hours and attach photos."
friendly = "Sorry they arrived damaged. Please send photos within 48 hours."
unsupported = "Return the earbuds whenever convenient."

print("equivalent:", swap_checked(concise, friendly, required))
print("factual miss:", swap_checked(concise, unsupported, required))

Output

equivalent: {'winner': 'needs_human_review', 'stable': False}
factual miss: {'winner': 'A', 'stable': True}
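
Order swapping catches position bias, but it does not stop a judge from preferring a fluent answer that is wrong. One possible control is to apply a deterministic fact gate first and consult the judge only when both answers pass. The sketch below assumes a hypothetical `gate_then_judge` helper and a stand-in `length_judge` that simply prefers the longer answer; neither represents a real judge API.

```python
from typing import Callable

Judge = Callable[[str, str], str]  # returns "A" or "B"


def has_required_facts(answer: str, required: tuple[str, ...]) -> bool:
    """Deterministic anchor: every required phrase appears in the answer."""
    text = answer.lower()
    return all(phrase in text for phrase in required)


def length_judge(answer_a: str, answer_b: str) -> str:
    """Stand-in judge with a verbosity bias: prefers the longer answer."""
    return "A" if len(answer_a) >= len(answer_b) else "B"


def gate_then_judge(
    answer_a: str, answer_b: str, required: tuple[str, ...], judge: Judge
) -> str:
    """Let facts decide first; consult the judge only when both answers pass."""
    a_ok = has_required_facts(answer_a, required)
    b_ok = has_required_facts(answer_b, required)
    if a_ok and not b_ok:
        return "A (fact gate)"
    if b_ok and not a_ok:
        return "B (fact gate)"
    if not a_ok and not b_ok:
        return "neither (both fail fact gate)"
    return f"{judge(answer_a, answer_b)} (judge)"


required = ("48 hours", "photos")
grounded = "Report within 48 hours and attach photos."
verbose_wrong = (
    "We are so sorry about your experience! You can return any electronics "
    "within 30 days, no questions asked."
)

print("judge alone:", length_judge(grounded, verbose_wrong))
print("gated:      ", gate_then_judge(grounded, verbose_wrong, required, length_judge))
```

The gate keeps style preferences from overriding a factual miss. In practice it would be combined with the order swap above rather than replace it.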

Public Test Sets Can Lose Their Meaning

make-time-split-holdout.py

from datetime import date

tuning_snapshot = date.fromisoformat("2026-03-31")
policy_cases = [
    ("holiday-return-window", "2026-02-10"),
    ("damaged-electronics-photo-rule", "2026-04-08"),
    ("split-shipment-investigation", "2026-04-21"),
    ("seller-battery-restriction", "2026-05-03"),
]

training_visible = []
fresh_holdout = []
for case_id, created_at in policy_cases:
    destination = fresh_holdout if date.fromisoformat(created_at) > tuning_snapshot else training_visible
    destination.append(case_id)

print("known before tuning:", training_visible)
print("fresh holdout:", fresh_holdout)
assert "damaged-electronics-photo-rule" in fresh_holdout

Output

known before tuning: ['holiday-return-window']
fresh holdout: ['damaged-electronics-photo-rule', 'split-shipment-investigation', 'seller-battery-restriction']

Turn Results Into a Release Gate

apply-release-gates.py

candidates = [
    {
        "name": "fast-small",
        "policy_pass_rate": 0.91,
        "schema_pass_rate": 1.00,
        "p95_latency_ms": 420,
        "cost_per_case": 0.002,
    },
    {
        "name": "balanced",
        "policy_pass_rate": 0.99,
        "schema_pass_rate": 1.00,
        "p95_latency_ms": 680,
        "cost_per_case": 0.008,
    },
    {
        "name": "slow-specialist",
        "policy_pass_rate": 1.00,
        "schema_pass_rate": 1.00,
        "p95_latency_ms": 1800,
        "cost_per_case": 0.031,
    },
]

minimum_policy = 0.98
minimum_schema = 0.995
online_latency_limit_ms = 900
online_cost_limit = 0.015

def decision(candidate: dict[str, float | str]) -> str:
    if candidate["policy_pass_rate"] < minimum_policy:
        return "reject: policy quality below bar"
    if candidate["schema_pass_rate"] < minimum_schema:
        return "reject: output contract below bar"
    if candidate["p95_latency_ms"] > online_latency_limit_ms:
        return "route: reserve for escalations because latency is high"
    if candidate["cost_per_case"] > online_cost_limit:
        return "route: reserve for escalations because cost is high"
    return "ship: passes online gates"

for candidate in candidates:
    print(f"{candidate['name']:15} {decision(candidate)}")

Output

fast-small      reject: policy quality below bar
balanced        ship: passes online gates
slow-specialist route: reserve for escalations because latency is high

Inspect Failure Slices, Not Only the Average

An aggregate score can hide the one category with the largest customer impact. The run below clears four of six cases overall, but damaged-electronics answers fail most often.

report-failure-slices.py

from collections import defaultdict

results = [
    ("damaged-electronics", True),
    ("damaged-electronics", False),
    ("damaged-electronics", False),
    ("late-shipment", True),
    ("late-shipment", True),
    ("sealed-return", True),
]

by_slice: dict[str, list[bool]] = defaultdict(list)
for category, passed in results:
    by_slice[category].append(passed)

overall = sum(passed for _, passed in results) / len(results)
print(f"overall={overall:.2%}")
for category in sorted(by_slice):
    values = by_slice[category]
    rate = sum(values) / len(values)
    print(f"{category:20} pass_rate={rate:.2%} cases={len(values)}")

Output

overall=66.67%
damaged-electronics  pass_rate=33.33% cases=3
late-shipment        pass_rate=100.00% cases=2
sealed-return        pass_rate=100.00% cases=1

That report should block a release if damaged-item policy mistakes are critical, even when the average seems acceptable.
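
A per-slice floor can turn that judgment into an automated check. The sketch below is a minimal illustration: the `slice_floors` thresholds and the `blocking_slices` helper are assumptions chosen for this example, not ShopFlow requirements.

```python
from collections import defaultdict

results = [
    ("damaged-electronics", True),
    ("damaged-electronics", False),
    ("damaged-electronics", False),
    ("late-shipment", True),
    ("late-shipment", True),
    ("sealed-return", True),
]

# Hypothetical minimum pass rate per slice; the critical slice gets a stricter floor.
slice_floors = {"damaged-electronics": 1.00, "late-shipment": 0.90, "sealed-return": 0.90}


def blocking_slices(rows: list[tuple[str, bool]], floors: dict[str, float]) -> list[str]:
    """Return the slices whose pass rate falls below their floor."""
    by_slice: dict[str, list[bool]] = defaultdict(list)
    for category, passed in rows:
        by_slice[category].append(passed)
    return sorted(
        category
        for category, values in by_slice.items()
        # Slices without a declared floor default to the strictest bar.
        if sum(values) / len(values) < floors.get(category, 1.0)
    )


blocked = blocking_slices(results, slice_floors)
print("release blocked by:", blocked or "nothing")
```

Defaulting unknown slices to a floor of 1.0 is a deliberately conservative choice here: a new category has to be given an explicit threshold before it can pass with failures.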

Small Samples Need an Uncertainty Margin

require-confidence-before-release.py

from math import sqrt

def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    proportion = successes / total
    denominator = 1 + z**2 / total
    center = proportion + z**2 / (2 * total)
    margin = z * sqrt((proportion * (1 - proportion) + z**2 / (4 * total)) / total)
    return (center - margin) / denominator

release_floor = 0.95
for successes, total in ((3, 3), (49, 50), (990, 1000)):
    lower = wilson_lower_bound(successes, total)
    status = "PASS" if lower >= release_floor else "COLLECT_MORE_OR_FIX"
    print(f"{successes:>3}/{total:<4} lower_bound={lower:.3f} {status}")

Output

  3/3    lower_bound=0.438 COLLECT_MORE_OR_FIX
 49/50   lower_bound=0.895 COLLECT_MORE_OR_FIX
990/1000 lower_bound=0.982 PASS

Even perfect performance on three examples has a weak lower bound. Build a sufficiently large, risk-stratified holdout before claiming a system clears a production-quality bar.
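
For a run with no failures, the Wilson lower bound simplifies to $n / (n + z^2)$, which makes it easy to ask how many consecutive passes a given floor demands. The sketch below assumes the same $z = 1.96$ used above; `perfect_run_lower_bound` and `min_perfect_cases` are hypothetical helper names introduced for this example.

```python
def perfect_run_lower_bound(total: int, z: float = 1.96) -> float:
    """Wilson lower bound when every one of `total` cases passes."""
    return total / (total + z**2)


def min_perfect_cases(floor: float, z: float = 1.96) -> int:
    """Smallest all-pass sample size whose lower bound reaches `floor`."""
    if not 0 < floor < 1:
        raise ValueError("floor must be strictly between 0 and 1")
    total = 1
    while perfect_run_lower_bound(total, z) < floor:
        total += 1
    return total


for floor in (0.90, 0.95, 0.99):
    needed = min_perfect_cases(floor)
    print(f"floor={floor:.2f} needs {needed} consecutive passes")
```

With a 0.95 floor, the holdout needs 73 cases with zero failures. Any failure pushes the required size higher, which is one reason to size the risk-stratified holdout well above the bare minimum.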

Write Down Enough to Reproduce the Claim

Evaluation reports should let another engineer rerun the comparison and inspect its failures. Record these fields before sharing any model ranking:

Report field	Why it matters	Example entry
Dataset version and holdout policy	prevents accidental training leakage	`policy-golden-v3`, never used for prompt tuning
Retrieval setup	separates missing evidence from bad answers	chunker version, top-k, index snapshot
Prompt and chat format	instruction formatting changes outputs	system prompt hash, template version
Model and sampling settings	output variance depends on decoding	model ID, temperature, max tokens
Scorer and judge controls	metrics can hide bias	deterministic gates, order swap, audit sample
Failure slices	averages conceal unsafe cases	damaged electronics, late scans, multilingual queries
Latency and cost	a correct system still needs to run	p50/p95 latency, cost per successful case

Failure Patterns to Diagnose

Symptom	Likely cause	First fix to test
Answer omits the deadline although policy chunk contains it	generation or prompt contract failure	require cited facts and inspect prompt/template
Answer sounds confident but cites a clause absent from context	unsupported generation	add forbidden-claim checks and human review slice
Two published model scores reverse under a new harness	incomparable prompt or scoring setup	rerun identical harness and publish configuration
Judge selects whichever answer appears first	position bias	score both orders and reject unstable comparisons
Public score rises while private cases stagnate	task mismatch or contaminated public signal	prioritize fresh private holdout and failure analysis
Correct model misses live latency target	operational bottleneck	route difficult cases or select a faster candidate


