Build an evaluation suite for a policy-answering LLM: score evidence use, understand public benchmark contracts, control judge bias, and make release decisions from private tests.
In the previous lesson, you turned a returns-policy document into chunks that preserve answerable evidence. One retrieved chunk contained this exact rule:
Retrieved evidence: Damaged electronics: report within 48 hours with photos.
Now a support assistant answers: "Send photos within 48 hours so we can review the damage." That looks good. A second model answers: "You can return any electronics within 30 days." That sounds helpful, but the retrieved clause doesn't support it.
This chapter answers the question that comes next: how do you measure whether a large language model (LLM) system is safe to improve or deploy? You'll build a small private evaluation set, run deterministic checks, learn what public benchmark scores do and don't prove, measure code generation with pass@k, control judge bias, and turn results into a release decision.
A benchmark is a collection of tasks plus a scoring procedure. A production evaluation is the same idea applied to your own workload. Instead of asking only whether a model knows academic facts, you ask whether the entire system retrieves the right clause, answers from that clause, follows the output contract, and meets latency and cost limits.
Here is a tiny private evaluation set for ShopFlow. The policy statements are fictional product rules for this lab, not general retail advice.
| Case | Customer question | Retrieved policy evidence | Answer must include | Answer must not add |
|---|---|---|---|---|
| damaged-electronics | "My earbuds arrived crushed. What do I send, and by when?" | Damaged electronics: report within 48 hours with photos. | 48 hours, photos | 30 days |
| late-shipment | "Tracking hasn't changed for a week. What happens now?" | If a shipment has no scan for 7 days, open a carrier investigation. | 7 days, carrier investigation | automatic refund |
| sealed-return | "Can I send unopened headphones back?" | Unopened headphones may be returned within 30 days. | unopened, 30 days | opened items accepted |
These rows are more useful than a generic "answer quality" prompt because each one names the behavior that passes and the risky extra claim that fails.
Before scoring a model, validate the evaluation set itself. A duplicated ID can silently overwrite a result, and a required phrase absent from its cited evidence creates an impossible test.
```python
rows = [
    {
        "id": "damaged-electronics",
        "evidence": "Damaged electronics: report within 48 hours with photos.",
        "required": ("48 hours", "photos"),
    },
    {
        "id": "late-shipment",
        "evidence": "If a shipment has no scan for 7 days, open a carrier investigation.",
        "required": ("7 days", "carrier investigation"),
    },
    {
        "id": "sealed-return",
        "evidence": "Unopened headphones may be returned within 30 days.",
        "required": ("unopened", "30 days"),
    },
]

# A duplicated ID would silently overwrite an earlier result.
ids = [row["id"] for row in rows]
assert len(ids) == len(set(ids)), "evaluation IDs must be unique"

# Every required phrase must appear in the evidence it cites.
for row in rows:
    evidence = row["evidence"].lower()
    missing = [phrase for phrase in row["required"] if phrase not in evidence]
    assert not missing, f"{row['id']} has unsupported gold facts: {missing}"
    print(f"{row['id']:20} valid required={len(row['required'])}")

print(f"validated_cases={len(rows)}")
```

```text
damaged-electronics  valid required=2
late-shipment        valid required=2
sealed-return        valid required=2
validated_cases=3
```

When a model card says "86%," the number is incomplete without the task and harness that produced it. Five parts make an evaluation result reproducible:
| Contract part | Question to record | ShopFlow policy example |
|---|---|---|
| Dataset | Which cases were scored, and at which version? | policy-golden-v3, 120 reviewed support questions |
| Input path | What context reached the model? | top-3 chunks from the index, with policy version IDs |
| Output contract | What must the response look like? | concise answer plus cited source_page |
| Scorer | How is success computed? | required facts, forbidden claims, schema validity, human audit |
| Release bar | What result permits deployment? | no critical policy misses, plus latency and cost bounds |
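One way to keep these five parts attached to every score is to store them as a single record and fingerprint it. The sketch below is a minimal illustration, not part of the ShopFlow lab: the `EvalContract` class, the `fingerprint` helper, and the "top-5" variant are hypothetical names and values chosen for this example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalContract:
    """Hypothetical record holding the five contract parts from the table."""
    dataset: str
    input_path: str
    output_contract: str
    scorer: str
    release_bar: str


def fingerprint(contract: EvalContract) -> str:
    """Return a short, stable hash of the contract fields."""
    # sort_keys keeps the serialization independent of field order.
    payload = json.dumps(asdict(contract), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


baseline = EvalContract(
    dataset="policy-golden-v3",
    input_path="top-3 chunks from the index, with policy version IDs",
    output_contract="concise answer plus cited source_page",
    scorer="required facts, forbidden claims, schema validity, human audit",
    release_bar="no critical policy misses, plus latency and cost bounds",
)
# Illustrative variant: only the input path changes.
changed = replace(baseline, input_path="top-5 chunks from the index, with policy version IDs")

print("same contract matches:", fingerprint(baseline) == fingerprint(baseline))
print("changed input path matches:", fingerprint(baseline) == fingerprint(changed))
```

Storing the fingerprint next to each score makes it cheap to notice when two numbers came from different contracts, although a hash only signals that something differs, not which field.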
The preceding lesson addressed the input path of this pipeline: getting complete evidence into the context. This lesson measures what happens after retrieval.
If a response fails at step 4, don't immediately blame the model. Inspect the retrieved chunk first. A missing fact may be a retrieval failure, while a contradicted fact may be a generation or instruction-following failure.
Begin with deterministic checks whenever the answer has explicit requirements. This isn't a complete semantic evaluator. It is a fast test that catches answers missing required policy terms or adding known unsafe claims.
The following lab uses three fixture answers. It also asserts that the evaluation rows themselves are valid: every required phrase must be present in the cited evidence.
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    evidence: str
    required_phrases: tuple[str, ...]
    forbidden_phrases: tuple[str, ...]


cases = [
    EvalCase(
        "damaged-electronics",
        "Damaged electronics: report within 48 hours with photos.",
        ("48 hours", "photos"),
        ("30 days",),
    ),
    EvalCase(
        "late-shipment",
        "If a shipment has no scan for 7 days, open a carrier investigation.",
        ("7 days", "carrier investigation"),
        ("automatic refund",),
    ),
    EvalCase(
        "sealed-return",
        "Unopened headphones may be returned within 30 days.",
        ("unopened", "30 days"),
        ("opened items accepted",),
    ),
]

# Fixture answers standing in for model output.
answers = {
    "damaged-electronics": "Please report the damage within 48 hours and attach photos.",
    "late-shipment": "You qualify for an automatic refund now.",
    "sealed-return": "Unopened headphones may be returned within 30 days.",
}


def evaluate(case: EvalCase, answer: str) -> tuple[list[str], list[str]]:
    evidence = case.evidence.lower()
    text = answer.lower()
    # The evaluation row itself must be valid before it scores anything.
    assert all(phrase in evidence for phrase in case.required_phrases)
    missing = [phrase for phrase in case.required_phrases if phrase not in text]
    unsupported = [phrase for phrase in case.forbidden_phrases if phrase in text]
    return missing, unsupported


passed = 0
for case in cases:
    missing, unsupported = evaluate(case, answers[case.case_id])
    status = "PASS" if not missing and not unsupported else "FAIL"
    passed += status == "PASS"
    print(f"{case.case_id:20} {status} missing={missing} unsupported={unsupported}")

print(f"summary: grounded_policy_pass={passed}/{len(cases)}")
```

```text
damaged-electronics  PASS missing=[] unsupported=[]
late-shipment        FAIL missing=['7 days', 'carrier investigation'] unsupported=['automatic refund']
sealed-return        PASS missing=[] unsupported=[]
summary: grounded_policy_pass=2/3
```

The failed late-shipment answer is fluent, but it never mentions the required investigation and invents an automatic refund. A helpful-sounding response isn't enough when a customer may act on an unsupported policy.
This split between retrieved context and generated answer also appears in evaluation research. Ragas evaluates whether retrieved context supports the answer and whether the answer addresses the question, rather than collapsing all failures into one vague score.[1]
Suppose an answer omits the 48 hours deadline. That doesn't tell you which component failed. Compare the gold requirement with both the retrieved context and the generated answer:
```python
cases = [
    {
        "id": "good-answer",
        "required": ("48 hours", "photos"),
        "retrieved": "Damaged electronics: report within 48 hours with photos.",
        "answer": "Report within 48 hours and send photos.",
    },
    {
        "id": "retrieval-miss",
        "required": ("48 hours", "photos"),
        "retrieved": "Electronics category page. Contact support for assistance.",
        "answer": "Contact support for assistance.",
    },
    {
        "id": "generation-miss",
        "required": ("48 hours", "photos"),
        "retrieved": "Damaged electronics: report within 48 hours with photos.",
        "answer": "Please send photos so we can investigate.",
    },
]


def failure_stage(case: dict[str, object]) -> str:
    required = case["required"]
    retrieved = str(case["retrieved"]).lower()
    answer = str(case["answer"]).lower()
    # Check retrieval first: the model cannot answer from evidence it never saw.
    if any(term not in retrieved for term in required):
        return "bad_retrieval"
    if any(term not in answer for term in required):
        return "bad_generation"
    return "pass"


for case in cases:
    print(f"{case['id']:16} {failure_stage(case)}")
```

```text
good-answer      pass
retrieval-miss   bad_retrieval
generation-miss  bad_generation
```

Private tests tell you whether your application meets its contract. Public benchmarks help compare capabilities under standardized tasks, but each family asks a different question.
MMLU evaluates multiple-choice accuracy across 57 academic and professional subjects.[2] It can reveal broad knowledge differences under a stated prompt protocol. It can't prove that a policy assistant quotes the right return clause.
GPQA narrows the question to difficult expert QA. Its 448 multiple-choice questions were written by domain experts in biology, physics, and chemistry.[3] A strong GPQA result can signal expert STEM question-answering ability under that harness, but it still doesn't prove that a support workflow follows policy.
Math benchmarks narrow the question. GSM8K uses grade-school word problems, while MATH uses harder competition-style problems; both are useful when a workflow must produce checkable mathematical answers, but they don't measure open-ended support quality.[4][5] At the harder frontier, Humanity's Last Exam (HLE) collects expert-level questions across domains, and FrontierMath targets advanced mathematical problems. ARC-AGI-2 asks systems to infer novel transformations from small visual-grid demonstrations rather than recall subject matter.[6][7][8] These benchmarks answer different research questions; a model isn't "better at reasoning" in every product merely because it improves on one of them.
HumanEval provides handwritten Python functions with tests and introduced functional correctness reporting with repeated samples and pass@k.[9] SWE-bench moves from isolated functions to real GitHub issues whose patches are tested in repository environments.[10] These are relevant if your system edits code, but neither evaluates customer-facing policy truthfulness.
MT-Bench uses a model judge for open-ended, multi-turn responses, while Chatbot Arena collects blind pairwise human preferences.[11][12] Preference is useful for tone and helpfulness, but an answer that users prefer can still cite the wrong policy.
| Evaluation family | Scored unit | Typical metric | Useful signal | Missing deployment proof |
|---|---|---|---|---|
| Broad knowledge | MMLU question | multiple-choice accuracy | academic/professional task coverage | evidence grounding and tool behavior |
| Expert STEM QA | GPQA question | multiple-choice accuracy | difficult biology, physics, and chemistry QA | production workflow behavior |
| Checkable mathematics | GSM8K or MATH problem | parsed final-answer accuracy | mathematical answer execution under a fixed harness | grounded customer-policy answers |
| Frontier reasoning research | HLE, FrontierMath, or ARC-AGI-2 item | benchmark-specific correctness | expert-question or novel-task performance in that suite | general deployment reliability |
| Executable code | HumanEval function or SWE-bench patch | pass@k or resolved rate | programs that satisfy tests | support answer policy compliance |
| Open-ended preference | MT-Bench or Arena comparison | judge score or pairwise preference | style and perceived helpfulness | factual ground truth |
| Private application eval | question, retrieved clause, response | policy pass rate plus gates | behavior on your release path | capability outside your covered slices |
The practical rule is simple: compare results only when the dataset, input path, output contract, prompt, sampling settings, tools, and scorer match. 84% on a multiple-choice set and 42% on repository fixes aren't competing measurements of the same property.
That rule is easy to encode. This helper allows comparisons only when the benchmark and the harness fields are identical.
```python
runs = {
    "candidate-a": {
        "dataset": "policy-golden-v3",
        "retriever": "chunks-v2/top3",
        "prompt": "support-answer-v4",
        "output_contract": "answer-with-source-page-v2",
        "toolset": ("policy-search-v2",),
        "temperature": 0.0,
        "max_tokens": 180,
        "scorer": "facts-v2",
        "score": 0.97,
    },
    "candidate-b": {
        "dataset": "policy-golden-v3",
        "retriever": "chunks-v2/top3",
        "prompt": "support-answer-v4",
        "output_contract": "answer-with-source-page-v2",
        "toolset": ("policy-search-v2",),
        "temperature": 0.0,
        "max_tokens": 180,
        "scorer": "facts-v2",
        "score": 0.99,
    },
    "candidate-c": {
        "dataset": "policy-golden-v3",
        "retriever": "chunks-v2/top5",
        "prompt": "support-answer-v4",
        "output_contract": "answer-with-source-page-v2",
        "toolset": ("policy-search-v2",),
        "temperature": 0.0,
        "max_tokens": 180,
        "scorer": "facts-v2",
        "score": 1.00,
    },
}

# Fields that must match before two scores may be compared.
contract_fields = (
    "dataset",
    "retriever",
    "prompt",
    "output_contract",
    "toolset",
    "temperature",
    "max_tokens",
    "scorer",
)


def comparable(left: dict[str, object], right: dict[str, object]) -> bool:
    return all(left[field] == right[field] for field in contract_fields)


print("A vs B comparable:", comparable(runs["candidate-a"], runs["candidate-b"]))
print("A vs C comparable:", comparable(runs["candidate-a"], runs["candidate-c"]))
print("C differs because its retriever sees more chunks.")
```

```text
A vs B comparable: True
A vs C comparable: False
C differs because its retriever sees more chunks.
```

pass@k: Measure a Workflow That Can Test Several Drafts

For code generation, a single answer isn't always the real workflow. A developer may generate several candidates and run tests until one passes. HumanEval's pass@k estimator measures the probability that at least one of $k$ selected samples is correct, given $n$ generated programs and $c$ correct programs.[9]
Suppose an inventory-code task generated 100 candidates and tests accepted 20. Checking one random candidate succeeds with probability 0.20. Checking 10 distinct candidates succeeds much more often:

$$\text{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}, \qquad \text{pass@}10 = 1 - \frac{\binom{80}{10}}{\binom{100}{10}} \approx 0.9049$$

Here, $n$ is the number of generated programs, $c$ is the number that pass tests, and $k$ is the number you are allowed to try.
```python
from math import comb


def pass_at_k(total_samples: int, correct_samples: int, k: int) -> float:
    if not 0 <= correct_samples <= total_samples:
        raise ValueError("correct_samples must be inside the sampled set")
    if not 1 <= k <= total_samples:
        raise ValueError("k must be between 1 and total_samples")
    # Fewer than k failing samples: any k distinct picks include a pass.
    if total_samples - correct_samples < k:
        return 1.0
    return 1.0 - comb(total_samples - correct_samples, k) / comb(total_samples, k)


for k in (1, 5, 10, 20):
    print(f"pass@{k:<2} = {pass_at_k(100, 20, k):.4f}")
```

```text
pass@1  = 0.2000
pass@5  = 0.6807
pass@10 = 0.9049
pass@20 = 0.9934
```
This distinction matters. Multiple tested drafts make sense for a generated inventory function because unit tests select a passing implementation. You shouldn't sample several customer-facing refund answers and silently choose the most confident-looking one without a policy-grounded scorer.
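The ratio of binomial coefficients in pass@k can also be written as a running product, which avoids building very large intermediate integers when the sample count grows. The sketch below is an algebraically equivalent rewrite checked against the `comb` form. The name `pass_at_k_product` is introduced here for illustration.

```python
from math import comb, isclose


def pass_at_k_product(n: int, c: int, k: int) -> float:
    """pass@k as 1 - prod_{i=n-c+1}^{n} (1 - k / i)."""
    if n - c < k:
        return 1.0
    failure = 1.0
    # Each factor (i - k) / i comes from expanding C(n - c, k) / C(n, k).
    for i in range(n - c + 1, n + 1):
        failure *= 1.0 - k / i
    return 1.0 - failure


def pass_at_k_exact(n: int, c: int, k: int) -> float:
    """Reference form using exact binomial coefficients."""
    return 1.0 - comb(n - c, k) / comb(n, k)


for k in (1, 5, 10, 20):
    product_form = pass_at_k_product(100, 20, k)
    exact_form = pass_at_k_exact(100, 20, k)
    print(f"pass@{k:<2} product={product_form:.4f} agrees={isclose(product_form, exact_form)}")
```

Both forms give the same estimate. The product form is mainly convenient when the computation is vectorized or when $n$ is large enough that exact integer arithmetic becomes slow.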
Deterministic checks catch explicit clauses, but they don't settle every question. Two answers can both cite the correct deadline while differing in clarity or empathy. For open-ended comparisons, teams often use humans or an LLM judge.
The MT-Bench study found that capable LLM judges can approximate human preferences in its setting, but it also documented position, verbosity, and self-enhancement biases.[11] A judge result is measurement output, not ground truth.
The small simulation below intentionally gives a toy judge a position bias: if two answers contain the same required facts, it chooses whichever answer appears first. Running each comparison in both orders exposes that instability.
```python
def biased_judge(answer_a: str, answer_b: str, required: tuple[str, ...]) -> str:
    score_a = sum(term in answer_a.lower() for term in required)
    score_b = sum(term in answer_b.lower() for term in required)
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "A"  # Deliberate position bias for equal-quality answers.


def swap_checked(answer_a: str, answer_b: str, required: tuple[str, ...]) -> dict[str, object]:
    first_order = biased_judge(answer_a, answer_b, required)
    swapped_order = biased_judge(answer_b, answer_a, required)
    # Map the swapped verdict back to the original labels.
    normalized_swapped = {"A": "B", "B": "A"}[swapped_order]
    if first_order != normalized_swapped:
        return {"winner": "needs_human_review", "stable": False}
    return {"winner": first_order, "stable": True}


required = ("48 hours", "photos")
concise = "Report the damaged earbuds within 48 hours and attach photos."
friendly = "Sorry they arrived damaged. Please send photos within 48 hours."
unsupported = "Return the earbuds whenever convenient."

print("equivalent:", swap_checked(concise, friendly, required))
print("factual miss:", swap_checked(concise, unsupported, required))
```

```text
equivalent: {'winner': 'needs_human_review', 'stable': False}
factual miss: {'winner': 'A', 'stable': True}
```

An unstable comparison isn't a failure of either candidate. It means the measurement can't distinguish them reliably under its current rubric. Send such cases to a human reviewer or improve the rubric before aggregating a winner.
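When many pairs are judged, the same order-swap rule can be applied before any wins are counted. The sketch below assumes each comparison has already been judged in both orders and mapped back to the original labels. The `verdicts` fixture and the `summarize` helper are hypothetical.

```python
from collections import Counter

# Each tuple: (winner in original order, winner in swapped order),
# both expressed with the original candidate labels "A" and "B".
verdicts = [("A", "A"), ("A", "B"), ("B", "B"), ("A", "A"), ("B", "A")]


def summarize(pairs: list[tuple[str, str]]) -> Counter:
    """Count a win only when both orders agree; otherwise mark it unstable."""
    tally: Counter = Counter()
    for first, second in pairs:
        tally[first if first == second else "unstable"] += 1
    return tally


tally = summarize(verdicts)
unstable_rate = tally["unstable"] / len(verdicts)
print(f"A={tally['A']} B={tally['B']} unstable={tally['unstable']}")
print(f"unstable_rate={unstable_rate:.0%}")
```

Reporting the unstable rate next to the win counts shows how much of the comparison the judge could not settle, instead of letting position bias quietly favor one candidate.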
A public benchmark is reproducible because everyone can access the tasks. That same visibility creates a risk: benchmark questions or solutions may appear in training corpora. A model that encountered a test item during training is no longer being measured on a genuinely unseen example.
Do not jump from risk to accusation. A public score alone doesn't prove contamination in a particular model. It means the report should state what decontamination checks were used, and high-stakes selection should include fresh or private tasks.
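One simple decontamination check looks for long word n-grams shared between an evaluation item and training text. The sketch below is a toy version under stated assumptions: the 8-word window, the whitespace tokenizer, and all three fixture strings are illustrative choices. Real pipelines typically normalize text more carefully and scan far larger corpora.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in the text."""
    words = text.lower().split()
    # Texts shorter than n words produce an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shares_ngram(eval_item: str, training_text: str, n: int = 8) -> bool:
    """Flag an evaluation item that shares any n-gram with the training text."""
    return not word_ngrams(eval_item, n).isdisjoint(word_ngrams(training_text, n))


training_text = (
    "forum post: write a function that returns the sum of all even "
    "numbers in a list of integers thanks"
)
leaked_item = "Write a function that returns the sum of all even numbers in a list."
fresh_item = "Write a function that merges two sorted inventory lists without duplicates."

print("leaked item flagged:", shares_ngram(leaked_item, training_text))
print("fresh item flagged:", shares_ngram(fresh_item, training_text))
```

A flag from a check like this is a reason to review or retire the item, not proof that a particular model memorized it.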
LiveCodeBench was designed around this problem: it continually collects coding tasks released over time and evaluates models on problems that appear after their stated training cutoff.[13] Time splits reduce exposure risk, although they still depend on accurate cutoff information and a stable harness.
For the ShopFlow policy assistant, keep private policy cases separate from prompt examples, demo transcripts, and support documentation that may later be used for tuning. If an evaluation case becomes a training example, retire it from the holdout set or record that it now measures regression behavior rather than generalization.
A time-split suite also helps when new policy versions arrive. A model tuned using cases created through March shouldn't be evaluated as "unseen" on those same cases. Hold out cases authored after the tuning snapshot:
```python
from datetime import date

tuning_snapshot = date.fromisoformat("2026-03-31")
policy_cases = [
    ("holiday-return-window", "2026-02-10"),
    ("damaged-electronics-photo-rule", "2026-04-08"),
    ("split-shipment-investigation", "2026-04-21"),
    ("seller-battery-restriction", "2026-05-03"),
]

training_visible = []
fresh_holdout = []
for case_id, created_at in policy_cases:
    # Only cases authored after the tuning snapshot count as unseen.
    if date.fromisoformat(created_at) > tuning_snapshot:
        fresh_holdout.append(case_id)
    else:
        training_visible.append(case_id)

print("known before tuning:", training_visible)
print("fresh holdout:", fresh_holdout)
assert "damaged-electronics-photo-rule" in fresh_holdout
```

```text
known before tuning: ['holiday-return-window']
fresh holdout: ['damaged-electronics-photo-rule', 'split-shipment-investigation', 'seller-battery-restriction']
```

Your evaluation is useful when it changes an engineering decision. A model may give grounded answers but be too slow for live chat. Another may be fast and cheap, but fail critical policy cases. Define the bar before comparing models.
The fixture values below represent measurements from a fictional private evaluation run. The code makes the release rule explicit: a candidate that misses policy or schema quality is rejected, while a correct but expensive model may be routed only to difficult cases.
```python
candidates = [
    {
        "name": "fast-small",
        "policy_pass_rate": 0.91,
        "schema_pass_rate": 1.00,
        "p95_latency_ms": 420,
        "cost_per_case": 0.002,
    },
    {
        "name": "balanced",
        "policy_pass_rate": 0.99,
        "schema_pass_rate": 1.00,
        "p95_latency_ms": 680,
        "cost_per_case": 0.008,
    },
    {
        "name": "slow-specialist",
        "policy_pass_rate": 1.00,
        "schema_pass_rate": 1.00,
        "p95_latency_ms": 1800,
        "cost_per_case": 0.031,
    },
]

minimum_policy = 0.98
minimum_schema = 0.995
online_latency_limit_ms = 900
online_cost_limit = 0.015


def decision(candidate: dict[str, float | str]) -> str:
    # Quality gates come first: a fast but wrong candidate is never shipped.
    if candidate["policy_pass_rate"] < minimum_policy:
        return "reject: policy quality below bar"
    if candidate["schema_pass_rate"] < minimum_schema:
        return "reject: output contract below bar"
    # Operational gates route a correct candidate instead of rejecting it.
    if candidate["p95_latency_ms"] > online_latency_limit_ms:
        return "route: reserve for escalations because latency is high"
    if candidate["cost_per_case"] > online_cost_limit:
        return "route: reserve for escalations because cost is high"
    return "ship: passes online gates"


for candidate in candidates:
    print(f"{candidate['name']:15} {decision(candidate)}")
```

```text
fast-small      reject: policy quality below bar
balanced        ship: passes online gates
slow-specialist route: reserve for escalations because latency is high
```

The public benchmark shortlist never appears inside `decision`. That is intentional. Public scores help decide which candidates deserve testing. A release gate uses measurements taken on the workload you are about to serve.
An aggregate score can hide the one category with the largest customer impact. The run below clears four of six cases overall, but damaged-electronics answers fail most often.
```python
from collections import defaultdict

results = [
    ("damaged-electronics", True),
    ("damaged-electronics", False),
    ("damaged-electronics", False),
    ("late-shipment", True),
    ("late-shipment", True),
    ("sealed-return", True),
]

# Group pass/fail outcomes by policy category.
by_slice: dict[str, list[bool]] = defaultdict(list)
for category, passed in results:
    by_slice[category].append(passed)

overall = sum(passed for _, passed in results) / len(results)
print(f"overall={overall:.2%}")
for category in sorted(by_slice):
    values = by_slice[category]
    rate = sum(values) / len(values)
    print(f"{category:20} pass_rate={rate:.2%} cases={len(values)}")
```

```text
overall=66.67%
damaged-electronics  pass_rate=33.33% cases=3
late-shipment        pass_rate=100.00% cases=2
sealed-return        pass_rate=100.00% cases=1
```

That report should block a release if damaged-item policy mistakes are critical, even when the average seems acceptable.
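A slice-level gate can make that blocking rule explicit. The sketch below assumes per-slice pass rates have already been computed as in the report above. The `critical_slices` set, the `slice_floor` threshold, and the `blocking_slices` helper are hypothetical values and names for illustration.

```python
slice_rates = {
    "damaged-electronics": 1 / 3,
    "late-shipment": 1.0,
    "sealed-return": 1.0,
}
critical_slices = {"damaged-electronics"}  # Assumed business decision.
slice_floor = 0.95  # Assumed minimum pass rate for a critical slice.


def blocking_slices(rates: dict[str, float], critical: set[str], floor: float) -> list[str]:
    """Return critical slices below the floor; a missing slice counts as failing."""
    return sorted(name for name in critical if rates.get(name, 0.0) < floor)


blocked = blocking_slices(slice_rates, critical_slices, slice_floor)
overall = sum(slice_rates.values()) / len(slice_rates)
print(f"mean_of_slices={overall:.2%}")
print("release blocked by:", blocked if blocked else "nothing")
```

Treating an unmeasured critical slice as a failure is a deliberate choice in this sketch: a release should not pass simply because nobody collected cases for the riskiest category.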
Three passing examples make a good tutorial, not a production release argument. Earlier statistics lessons introduced uncertainty in estimated rates. For a binary pass metric, a Wilson lower bound gives a conservative view of the pass rate supported by a finite sample:
```python
from math import sqrt


def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    proportion = successes / total
    denominator = 1 + z**2 / total
    center = proportion + z**2 / (2 * total)
    margin = z * sqrt((proportion * (1 - proportion) + z**2 / (4 * total)) / total)
    return (center - margin) / denominator


release_floor = 0.95
for successes, total in ((3, 3), (49, 50), (990, 1000)):
    lower = wilson_lower_bound(successes, total)
    status = "PASS" if lower >= release_floor else "COLLECT_MORE_OR_FIX"
    print(f"{successes:>3}/{total:<4} lower_bound={lower:.3f} {status}")
```

```text
  3/3    lower_bound=0.438 COLLECT_MORE_OR_FIX
 49/50   lower_bound=0.895 COLLECT_MORE_OR_FIX
990/1000 lower_bound=0.982 PASS
```

Even perfect performance on three examples has a weak lower bound. Build a sufficiently large, risk-stratified holdout before claiming a system clears a production-quality bar.
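The same bound can be turned around to ask how many cases a holdout needs. The sketch below searches for the smallest all-passing sample whose Wilson lower bound reaches a floor. It repeats `wilson_lower_bound` so that it runs on its own, and `min_perfect_sample` and its `limit` parameter are names introduced here for illustration.

```python
from math import sqrt


def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    proportion = successes / total
    denominator = 1 + z**2 / total
    center = proportion + z**2 / (2 * total)
    margin = z * sqrt((proportion * (1 - proportion) + z**2 / (4 * total)) / total)
    return (center - margin) / denominator


def min_perfect_sample(floor: float, z: float = 1.96, limit: int = 100_000) -> int:
    """Smallest n such that n passes out of n clears the floor."""
    for total in range(1, limit + 1):
        if wilson_lower_bound(total, total, z) >= floor:
            return total
    raise ValueError("floor not reachable within limit")


print("cases needed with zero failures:", min_perfect_sample(0.95))
```

With `z = 1.96` and a 0.95 floor this prints 73: for an all-passing sample the bound reduces to $n / (n + z^2)$, which first reaches 0.95 at $n = 73$. Any observed failure pushes the required sample size higher, so treat this as a lower limit when planning a holdout.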
Evaluation reports should let another engineer rerun the comparison and inspect its failures. Record these fields before sharing any model ranking:
| Report field | Why it matters | Example entry |
|---|---|---|
| Dataset version and holdout policy | prevents accidental training leakage | policy-golden-v3, never used for prompt tuning |
| Retrieval setup | separates missing evidence from bad answers | chunker version, top-k, index snapshot |
| Prompt and chat format | instruction formatting changes outputs | system prompt hash, template version |
| Model and sampling settings | output variance depends on decoding | model ID, temperature, max tokens |
| Scorer and judge controls | metrics can hide bias | deterministic gates, order swap, audit sample |
| Failure slices | averages conceal unsafe cases | damaged electronics, late scans, multilingual queries |
| Latency and cost | a correct system still needs to run | p50/p95 latency, cost per successful case |
That prompt-and-chat-format row points to the next lesson. If a model regularly omits required structure or speaks in the wrong role, evaluation has exposed a behavior gap. You then need to understand how instruction tuning and chat templates teach the interaction contract.
| Symptom | Likely cause | First fix to test |
|---|---|---|
| Answer omits the deadline although policy chunk contains it | generation or prompt contract failure | require cited facts and inspect prompt/template |
| Answer sounds confident but cites a clause absent from context | unsupported generation | add forbidden-claim checks and human review slice |
| Two published model scores reverse under a new harness | incomparable prompt or scoring setup | rerun identical harness and publish configuration |
| Judge selects whichever answer appears first | position bias | score both orders and reject unstable comparisons |
| Public score rises while private cases stagnate | task mismatch or contaminated public signal | prioritize fresh private holdout and failure analysis |
| Correct model misses live latency target | operational bottleneck | route difficult cases or select a faster candidate |
When pass@k is appropriate and why it doesn't equal policy-answer accuracy.

Add a fourth policy case whose retrieved context is intentionally wrong. Extend score-grounded-policy-answers.py to label failures as bad_retrieval or bad_generation. Then add a source_page field to the candidate response and reject answers whose citation doesn't match the evidence row. You have now built the first slice of an evaluation dashboard for a RAG assistant.
[1] RAGAS: Automated Evaluation of Retrieval Augmented Generation. Es, S., et al. · 2023 · arXiv preprint
[2] Measuring Massive Multitask Language Understanding (MMLU). Hendrycks, D., et al. · 2021 · ICLR 2021
[3] GPQA: A Graduate-Level Google-Proof Q&A Benchmark. Rein, D., et al. · 2023
[4] Training Verifiers to Solve Math Word Problems (GSM8K). Cobbe, K., et al. · 2021
[5] Measuring Mathematical Problem Solving with the MATH Dataset. Hendrycks, D., et al. · 2021 · NeurIPS 2021
[6] Humanity's Last Exam. Center for AI Safety, Scale AI, and HLE Contributors Consortium · 2025 · arXiv preprint
[7] FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI. Glazer, E., et al. (Epoch AI) · 2024
[8] ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems. Chollet, F., et al. · 2025
[9] Evaluating Large Language Models Trained on Code (HumanEval). Chen, M., et al. · 2021 · arXiv preprint
[10] SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Jimenez, C. E., et al. · 2024 · ICLR 2024
[11] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Zheng, L., et al. · 2023 · NeurIPS 2023
[12] Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. Chiang, W. L., et al. · 2024
[13] LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. Jain, N., et al. · 2024