Evaluate a permission-safe RAG answer trace with context, claim, citation, failure-attribution, and release gates before automating softer judgments.
policy-answerer-v3 selects one current, permitted policy chunk: deploy-freeze-approval-rule. That's necessary evidence for Maya's reply, but it isn't proof of a reliable answer. A generator can still say the deploy may start without a rollback plan, or cite the right rule beside a claim it doesn't support.
The next step is an evaluation harness for that trace. Retrieval-augmented generation (RAG) evaluation should tell you where a response first became unsafe: evidence access, candidate retrieval, context selection, answer claims, or citations. You'll turn each boundary into an executable eval gate before the next lesson introduces large language model (LLM) judges for semantic cases that deterministic labels can't cover.
Maya's question remains unchanged:
Can I deploy payment-service during the release freeze? Incident commander approval is attached, and the rollback plan is linked.
policy-answerer-v3 already established that hybrid retrieval found the correct rule and reranking selected it for context. The evaluation fixture carries that evidence path forward instead of inventing a new example.
```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, replace

TARGET_ID = "deploy-freeze-approval-rule"

@dataclass(frozen=True)
class EvidenceChunk:
    chunk_id: str
    document_id: str
    parent_id: str
    version: str
    permitted: bool
    current: bool
    text: str

@dataclass(frozen=True)
class GoldCase:
    case_id: str
    question: str
    required_source_ids: frozenset[str]
    required_points: frozenset[str]

@dataclass(frozen=True)
class RagTrace:
    case_id: str
    first_stage_ids: tuple[str, ...]
    rerank_input_ids: tuple[str, ...]
    reranked_ids: tuple[str, ...]
    selected_context_ids: tuple[str, ...]
    selected_versions: tuple[str, ...]
    versions: tuple[tuple[str, str], ...]

EVIDENCE = {
    TARGET_ID: EvidenceChunk(
        TARGET_ID,
        "deploy-policy",
        "deploy-policy-v2",
        "deploy-policy/2026-06-01",
        True,
        True,
        (
            "Rule DEPLOY-17. Payment-service production deploys during a release "
            "freeze require incident commander approval and a linked rollback plan "
            "before rollout."
        ),
    ),
    "payment-service-rollback-runbook": EvidenceChunk(
        "payment-service-rollback-runbook",
        "payment-rollback",
        "payment-rollback-v1",
        "payment-rollback/2026-05-20",
        True,
        True,
        (
            "Payment-service rollback drills must keep the previous artifact "
            "available for 30 minutes. This runbook does not authorize freeze deploys."
        ),
    ),
    "frontend-docs-deploy-rule": EvidenceChunk(
        "frontend-docs-deploy-rule",
        "frontend-docs",
        "frontend-docs-v1",
        "frontend-docs/2026-03-01",
        True,
        True,
        "Frontend documentation deploys may ship during normal business hours without freeze approval.",
    ),
    "restricted-breakglass-note": EvidenceChunk(
        "restricted-breakglass-note",
        "breakglass-terms",
        "breakglass-terms",
        "breakglass/2026-05-01",
        False,
        True,
        "Payment-service deploys may bypass freeze approval during executive escalations.",
    ),
}
GOLD = GoldCase(
    case_id="payment-freeze-deploy-001",
    question=(
        "Can I deploy payment-service during the release freeze? Incident commander "
        "approval is attached, and the rollback plan is linked."
    ),
    required_source_ids=frozenset({TARGET_ID}),
    required_points=frozenset({"freeze-scope", "approval", "rollback-plan"}),
)
TRACE = RagTrace(
    case_id=GOLD.case_id,
    first_stage_ids=(
        "payment-service-rollback-runbook",
        "frontend-docs-deploy-rule",
        TARGET_ID,
    ),
    rerank_input_ids=(
        "payment-service-rollback-runbook",
        "frontend-docs-deploy-rule",
        TARGET_ID,
    ),
    reranked_ids=(TARGET_ID, "payment-service-rollback-runbook", "frontend-docs-deploy-rule"),
    selected_context_ids=(TARGET_ID,),
    selected_versions=("deploy-policy/2026-06-01",),
    versions=(
        ("retriever", "policy-retriever-v2"),
        ("index", "policy-index/2026-05-27"),
        ("sparse", "bm25-tokenizer-v1"),
        ("dense", "fixture-embeddings-v1"),
        ("fusion", "rrf-k60"),
        ("reranker", "fixture-cross-encoder-v1"),
    ),
)

selected_sources = [
    (
        chunk.chunk_id,
        chunk.document_id,
        chunk.parent_id,
        chunk.version,
    )
    for chunk in (EVIDENCE[chunk_id] for chunk_id in TRACE.selected_context_ids)
]
print("Selected context:", TRACE.selected_context_ids)
print("Selected sources:", selected_sources)
print("Pipeline versions:", dict(TRACE.versions))
assert TRACE.selected_context_ids == (TARGET_ID,)
```
```text
Selected context: ('deploy-freeze-approval-rule',)
Selected sources: [('deploy-freeze-approval-rule', 'deploy-policy', 'deploy-policy-v2', 'deploy-policy/2026-06-01')]
Pipeline versions: {'retriever': 'policy-retriever-v2', 'index': 'policy-index/2026-05-27', 'sparse': 'bm25-tokenizer-v1', 'dense': 'fixture-embeddings-v1', 'fusion': 'rrf-k60', 'reranker': 'fixture-cross-encoder-v1'}
```
The trace is an evaluation input, not a log decoration. It binds a question to exact source identities and pipeline versions without storing raw policy text. During replay, the evaluator loads permitted source text from the versioned evidence store.
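The replay step can be sketched as follows. This is a minimal illustration under stated assumptions: VERSIONED_STORE, StoredChunk, and load_for_replay are hypothetical names that do not appear in the lesson's pipeline, and a real evidence store would sit behind its own access checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredChunk:
    text: str
    permitted: bool


# Hypothetical versioned store: (chunk_id, version) -> stored chunk.
VERSIONED_STORE = {
    ("deploy-freeze-approval-rule", "deploy-policy/2026-06-01"): StoredChunk(
        "Rule DEPLOY-17. Payment-service production deploys during a release "
        "freeze require incident commander approval and a linked rollback plan "
        "before rollout.",
        permitted=True,
    ),
    ("restricted-breakglass-note", "breakglass/2026-05-01"): StoredChunk(
        "Payment-service deploys may bypass freeze approval during executive escalations.",
        permitted=False,
    ),
}


def load_for_replay(chunk_id: str, version: str) -> str:
    """Return source text only for a known, permitted (chunk_id, version) pair."""
    chunk = VERSIONED_STORE.get((chunk_id, version))
    if chunk is None:
        raise LookupError(f"unknown chunk or version: {chunk_id}@{version}")
    if not chunk.permitted:
        raise PermissionError(f"restricted chunk: {chunk_id}")
    return chunk.text


# The trace stores only identities; text is resolved at replay time.
selected = (("deploy-freeze-approval-rule", "deploy-policy/2026-06-01"),)
texts = [load_for_replay(chunk_id, version) for chunk_id, version in selected]
print(len(texts), "chunk(s) loaded for replay")
```

Keying the store on both the chunk ID and the version means a stale version fails to load instead of silently resolving to newer text.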
An answer should fail immediately if restricted, stale, or unknown evidence entered retrieval, reranking, or selected context. A relevance score can't make forbidden evidence acceptable, even when a later stage happens to drop it.
```python
def admissible_evidence_path(trace: RagTrace, gold: GoldCase) -> bool:
    if trace.case_id != gold.case_id:
        return False
    if not trace.selected_context_ids:
        return False
    if len(trace.selected_context_ids) != len(trace.selected_versions):
        return False
    required_version_keys = {"retriever", "index", "sparse", "dense", "fusion", "reranker"}
    version_map = dict(trace.versions)
    if len(version_map) != len(trace.versions):
        return False
    if not required_version_keys.issubset(version_map):
        return False
    stage_ids = (
        trace.first_stage_ids,
        trace.rerank_input_ids,
        trace.reranked_ids,
        trace.selected_context_ids,
    )
    if any(len(ids) != len(set(ids)) for ids in stage_ids):
        return False
    traced_ids = tuple(chunk_id for ids in stage_ids for chunk_id in ids)
    if any(chunk_id not in EVIDENCE for chunk_id in traced_ids):
        return False
    traced = [EVIDENCE[chunk_id] for chunk_id in traced_ids]
    selected = [EVIDENCE[chunk_id] for chunk_id in trace.selected_context_ids]
    rerank_input_came_from_retrieval = set(trace.rerank_input_ids).issubset(
        trace.first_stage_ids
    )
    reranked_same_candidates = set(trace.reranked_ids) == set(trace.rerank_input_ids)
    returned_by_reranker = set(trace.selected_context_ids).issubset(trace.reranked_ids)
    versions_match = all(
        chunk.version == version
        for chunk, version in zip(selected, trace.selected_versions)
    )
    allowed_and_current = all(chunk.permitted and chunk.current for chunk in traced)
    return (
        rerank_input_came_from_retrieval
        and reranked_same_candidates
        and returned_by_reranker
        and versions_match
        and allowed_and_current
    )

restricted_trace = replace(
    TRACE,
    first_stage_ids=("restricted-breakglass-note",),
    rerank_input_ids=("restricted-breakglass-note",),
    reranked_ids=("restricted-breakglass-note",),
    selected_context_ids=("restricted-breakglass-note",),
    selected_versions=("breakglass/2026-05-01",),
)
blocked_candidate_trace = replace(
    TRACE,
    first_stage_ids=TRACE.first_stage_ids + ("restricted-breakglass-note",),
    rerank_input_ids=TRACE.rerank_input_ids + ("restricted-breakglass-note",),
    reranked_ids=TRACE.reranked_ids + ("restricted-breakglass-note",),
)
unknown_candidate_trace = replace(TRACE, first_stage_ids=TRACE.first_stage_ids + ("missing",))
stale_version_trace = replace(TRACE, selected_versions=("deploy-policy/2025-02-01",))
missing_version_trace = replace(TRACE, versions=TRACE.versions[:-1])
wrong_case_trace = replace(TRACE, case_id="payment-freeze-deploy-002")
duplicate_candidate_trace = replace(
    TRACE,
    first_stage_ids=TRACE.first_stage_ids + (TARGET_ID,),
)
print("Production path admissible:", admissible_evidence_path(TRACE, GOLD))
print("Restricted context admissible:", admissible_evidence_path(restricted_trace, GOLD))
print("Blocked candidate admissible:", admissible_evidence_path(blocked_candidate_trace, GOLD))
print("Unknown candidate admissible:", admissible_evidence_path(unknown_candidate_trace, GOLD))
print("Wrong-version path admissible:", admissible_evidence_path(stale_version_trace, GOLD))
print("Incomplete trace admissible:", admissible_evidence_path(missing_version_trace, GOLD))
print("Wrong-case trace admissible:", admissible_evidence_path(wrong_case_trace, GOLD))
print("Duplicate candidate admissible:", admissible_evidence_path(duplicate_candidate_trace, GOLD))
assert admissible_evidence_path(TRACE, GOLD)
assert not admissible_evidence_path(restricted_trace, GOLD)
assert not admissible_evidence_path(blocked_candidate_trace, GOLD)
assert not admissible_evidence_path(unknown_candidate_trace, GOLD)
assert not admissible_evidence_path(stale_version_trace, GOLD)
assert not admissible_evidence_path(missing_version_trace, GOLD)
assert not admissible_evidence_path(wrong_case_trace, GOLD)
assert not admissible_evidence_path(duplicate_candidate_trace, GOLD)
```
```text
Production path admissible: True
Restricted context admissible: False
Blocked candidate admissible: False
Unknown candidate admissible: False
Wrong-version path admissible: False
Incomplete trace admissible: False
Wrong-case trace admissible: False
Duplicate candidate admissible: False
```
The case-ID check prevents labels from one replay from grading another replay. The uniqueness check rejects malformed rankings before duplicated IDs distort later metrics.
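To see how a duplicate can distort a metric, here is a small sketch. It assumes the set-over-length precision formula this lesson uses for selected context; the malformed ranking is a hypothetical example, not output from the pipeline.

```python
from __future__ import annotations


def context_precision(ids: tuple[str, ...], useful_ids: frozenset[str]) -> float:
    # Unique useful hits divided by the raw ranking length.
    return len(set(ids) & useful_ids) / len(ids) if ids else 0.0


useful = frozenset({"deploy-freeze-approval-rule"})
clean = ("deploy-freeze-approval-rule", "payment-service-rollback-runbook")
# Hypothetical malformed ranking: the target ID appears twice.
duplicated = clean + ("deploy-freeze-approval-rule",)

print(f"clean precision: {context_precision(clean, useful):.2f}")
print(f"duplicated precision: {context_precision(duplicated, useful):.2f}")
```

Both rankings contain the same two chunks, yet the score moves from one half to one third because the duplicate inflates the denominator. Rejecting the trace up front avoids debating which number is right.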
The previous two lessons already taught ranking metrics such as Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). Those metrics answer whether the best evidence appears early in a ranking.[1] At answer-evaluation time, retain simpler recall and precision questions at two layers:
| Layer | Gate | Meaning on this trace |
|---|---|---|
| Candidate retrieval | candidate_recall | Did hybrid search provide the required rule? |
| Selected context | context_recall | Did the rule survive reranking and admission? |
| Selected context | context_precision | Did context avoid unnecessary distractors? |
```python
def coverage(ids: tuple[str, ...], required_ids: frozenset[str]) -> float:
    return len(set(ids) & required_ids) / len(required_ids)

def context_precision(ids: tuple[str, ...], useful_ids: frozenset[str]) -> float:
    return len(set(ids) & useful_ids) / len(ids) if ids else 0.0

candidate_recall = coverage(TRACE.first_stage_ids, GOLD.required_source_ids)
context_recall = coverage(TRACE.selected_context_ids, GOLD.required_source_ids)
selected_precision = context_precision(TRACE.selected_context_ids, GOLD.required_source_ids)
print(f"Candidate recall: {candidate_recall:.1f}")
print(f"Selected-context recall: {context_recall:.1f}")
print(f"Selected-context precision: {selected_precision:.1f}")
assert (candidate_recall, context_recall, selected_precision) == (1.0, 1.0, 1.0)
```
```text
Candidate recall: 1.0
Selected-context recall: 1.0
Selected-context precision: 1.0
```
This result clears only the evidence path. It doesn't say what the model wrote.
Consider two responses produced from the same valid context:
| Response | What changed? |
|---|---|
| unsafe-bypass | Adds a no-rollback-plan claim that the deploy rule never states |
| supported-deploy | States only the deploy scope and conditions present in DEPLOY-17 |
For a labeled release case, represent an answer as atomic policy claims. Each claim names its cited source, the source phrases that must establish it, and the expected answer point it covers.
```python
@dataclass(frozen=True)
class Claim:
    claim_id: str
    text: str
    citation_id: str | None
    support_phrases: tuple[str, ...]
    answer_point: str

@dataclass(frozen=True)
class Answer:
    answer_id: str
    claims: tuple[Claim, ...]

UNSAFE_BYPASS = Answer(
    "unsafe-bypass",
    (
        Claim(
            "freeze-scope",
            "The request is governed by the release-freeze deploy rule.",
            TARGET_ID,
            ("payment-service production deploys", "release freeze"),
            "freeze-scope",
        ),
        Claim(
            "bypass",
            "The deploy can start without a linked rollback plan.",
            TARGET_ID,
            ("without a linked rollback plan",),
            "rollback-plan",
        ),
    ),
)
SUPPORTED_DEPLOY = Answer(
    "supported-deploy",
    (
        Claim(
            "freeze-scope",
            "The request is governed by the release-freeze deploy rule.",
            TARGET_ID,
            ("payment-service production deploys", "release freeze"),
            "freeze-scope",
        ),
        Claim(
            "approval",
            "Incident commander approval is required.",
            TARGET_ID,
            ("require incident commander approval",),
            "approval",
        ),
        Claim(
            "rollback-plan",
            "A linked rollback plan is required before rollout.",
            TARGET_ID,
            ("linked rollback plan", "before rollout"),
            "rollback-plan",
        ),
    ),
)
EMPTY_ANSWER = Answer("empty", ())

print("Unsafe claims:", [claim.claim_id for claim in UNSAFE_BYPASS.claims])
print("Supported claims:", [claim.claim_id for claim in SUPPORTED_DEPLOY.claims])
print("Empty claims:", [claim.claim_id for claim in EMPTY_ANSWER.claims])
```
```text
Unsafe claims: ['freeze-scope', 'bypass']
Supported claims: ['freeze-scope', 'approval', 'rollback-plan']
Empty claims: []
```
This is a golden-set contract. It's intentionally strict and inspectable. It won't recognize every correct paraphrase in live traffic; that softer matching problem belongs after you understand the release invariant.
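The strictness is easy to demonstrate. The sketch below assumes support is decided by case-insensitive substring matching against the rule text; the phrases_present helper and the paraphrased phrase are hypothetical illustrations, not output from policy-answerer-v3.

```python
RULE_TEXT = (
    "Rule DEPLOY-17. Payment-service production deploys during a release "
    "freeze require incident commander approval and a linked rollback plan "
    "before rollout."
)


def phrases_present(phrases: tuple, source_text: str) -> bool:
    # Assumed matching rule: every phrase must appear verbatim, ignoring case.
    source = source_text.lower()
    return all(phrase.lower() in source for phrase in phrases)


exact = ("require incident commander approval",)
# Hypothetical paraphrase with the same meaning but different wording.
paraphrase = ("needs sign-off from the incident commander",)

print("exact phrase matches:", phrases_present(exact, RULE_TEXT))
print("paraphrase matches:", phrases_present(paraphrase, RULE_TEXT))
```

The paraphrase says the same thing as the rule but fails the check. That is acceptable for a labeled regression case, where the labels are written against the source text, and it is the reason unlabeled traffic needs a softer matcher.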
Faithfulness asks whether selected context supports the answer's claims. RAGAS defines this family of evaluation by decomposing a response into claims and checking support from retrieved context.[2] For a labeled high-impact policy case, exact required phrases give a deterministic first gate:
```python
def source_supports(claim: Claim, chunk: EvidenceChunk) -> bool:
    source = chunk.text.lower()
    return all(phrase.lower() in source for phrase in claim.support_phrases)

def claim_supported_by_context(claim: Claim, trace: RagTrace) -> bool:
    return any(
        source_supports(claim, EVIDENCE[chunk_id])
        for chunk_id in trace.selected_context_ids
    )

def faithfulness(answer: Answer, trace: RagTrace) -> float:
    supported = sum(
        claim_supported_by_context(claim, trace) for claim in answer.claims
    )
    return supported / len(answer.claims) if answer.claims else 0.0

print(f"unsafe-bypass faithfulness: {faithfulness(UNSAFE_BYPASS, TRACE):.2f}")
print(
    "supported-deploy faithfulness: "
    f"{faithfulness(SUPPORTED_DEPLOY, TRACE):.2f}"
)
print(f"empty faithfulness: {faithfulness(EMPTY_ANSWER, TRACE):.2f}")
assert faithfulness(UNSAFE_BYPASS, TRACE) == 0.5
assert faithfulness(SUPPORTED_DEPLOY, TRACE) == 1.0
assert faithfulness(EMPTY_ANSWER, TRACE) == 0.0
```
```text
unsafe-bypass faithfulness: 0.50
supported-deploy faithfulness: 1.00
empty faithfulness: 0.00
```
The unsafe response is on topic and cites a real selected rule. It still fails because one claim outruns the rule.
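A per-claim ledger makes that failure visible instead of leaving it inside a ratio. This is a minimal sketch, assuming the same substring support rule; the claim_ledger helper and the trimmed (claim_id, phrases) tuples are illustrative stand-ins for the full Claim objects.

```python
RULE_TEXT = (
    "Rule DEPLOY-17. Payment-service production deploys during a release "
    "freeze require incident commander approval and a linked rollback plan "
    "before rollout."
).lower()

# Trimmed claims: (claim_id, phrases that must appear in the rule text).
UNSAFE_CLAIMS = (
    ("freeze-scope", ("payment-service production deploys", "release freeze")),
    ("bypass", ("without a linked rollback plan",)),
)


def claim_ledger(claims, source_text):
    """Map each claim ID to whether all of its phrases occur in the source."""
    return {
        claim_id: all(phrase.lower() in source_text for phrase in phrases)
        for claim_id, phrases in claims
    }


ledger = claim_ledger(UNSAFE_CLAIMS, RULE_TEXT)
unsupported = [claim_id for claim_id, ok in ledger.items() if not ok]
print("ledger:", ledger)
print("unsupported claims:", unsupported)
```

A reviewer reading the ledger sees that the bypass claim is the one to remove, which a score of 0.50 alone does not say.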
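A citation metric needs two checks: coverage (does every claim carry a citation?) and support (is the cited chunk in the selected context, and does its text establish the claim?).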
A fabricated rollback exemption can achieve perfect citation coverage by attaching the correct-looking chunk ID. That's why presence alone is a weak gate.
```python
MIS_CITED = Answer(
    "mis-cited",
    tuple(
        replace(claim, citation_id="payment-service-rollback-runbook")
        for claim in SUPPORTED_DEPLOY.claims
    ),
)

def citation_coverage(answer: Answer) -> float:
    cited = sum(claim.citation_id is not None for claim in answer.claims)
    return cited / len(answer.claims) if answer.claims else 0.0

def citation_support(answer: Answer, trace: RagTrace) -> float:
    supported = 0
    for claim in answer.claims:
        if claim.citation_id not in trace.selected_context_ids:
            continue
        if source_supports(claim, EVIDENCE[claim.citation_id]):
            supported += 1
    return supported / len(answer.claims) if answer.claims else 0.0

print(f"unsafe citation coverage: {citation_coverage(UNSAFE_BYPASS):.2f}")
print(f"unsafe citation support: {citation_support(UNSAFE_BYPASS, TRACE):.2f}")
print(f"mis-cited answer faithfulness: {faithfulness(MIS_CITED, TRACE):.2f}")
print(f"mis-cited citation support: {citation_support(MIS_CITED, TRACE):.2f}")
assert citation_coverage(UNSAFE_BYPASS) == 1.0
assert citation_support(UNSAFE_BYPASS, TRACE) == 0.5
assert faithfulness(MIS_CITED, TRACE) == 1.0
assert citation_support(MIS_CITED, TRACE) == 0.0
```
```text
unsafe citation coverage: 1.00
unsafe citation support: 0.50
mis-cited answer faithfulness: 1.00
mis-cited citation support: 0.00
```
In the mis-cited response, selected context could support every sentence, so faithfulness is high. Its citations still fail because they don't point to the selected source that proves those sentences.
A faithful response can still omit a required condition. For this golden case, Maya needs three answer points: deploy scope, approval, and rollback plan. Treat point coverage as a labeled completeness gate:
```python
def supported_point_coverage(answer: Answer, trace: RagTrace, gold: GoldCase) -> float:
    supported_points = {
        claim.answer_point
        for claim in answer.claims
        if claim_supported_by_context(claim, trace)
    }
    return len(supported_points & gold.required_points) / len(gold.required_points)

unsafe_coverage = supported_point_coverage(UNSAFE_BYPASS, TRACE, GOLD)
supported_coverage = supported_point_coverage(SUPPORTED_DEPLOY, TRACE, GOLD)
print(f"unsafe supported point coverage: {unsafe_coverage:.2f}")
print(f"supported answer point coverage: {supported_coverage:.2f}")
assert unsafe_coverage == 1 / 3
assert supported_coverage == 1.0
```
```text
unsafe supported point coverage: 0.33
supported answer point coverage: 1.00
```
This metric is stronger than asking whether an answer sounds relevant. It knows exactly which policy conditions this regression case must preserve. On live questions without labels, you'll need sampled human review or a calibrated judge, which is the next lesson's problem.
One trace can fail at several layers after the first defect. If selected context drops the target rule, later unsupported claims are consequences, not the first fix. Diagnose in pipeline order.
```python
RETRIEVAL_MISS = replace(
    TRACE,
    first_stage_ids=("payment-service-rollback-runbook", "frontend-docs-deploy-rule"),
    rerank_input_ids=("payment-service-rollback-runbook", "frontend-docs-deploy-rule"),
    reranked_ids=("payment-service-rollback-runbook", "frontend-docs-deploy-rule"),
    selected_context_ids=("payment-service-rollback-runbook",),
    selected_versions=("payment-rollback/2026-05-20",),
)
SELECTION_MISS = replace(
    TRACE,
    selected_context_ids=("payment-service-rollback-runbook",),
    selected_versions=("payment-rollback/2026-05-20",),
)

def first_failed_stage(trace: RagTrace, answer: Answer, gold: GoldCase) -> str:
    if not admissible_evidence_path(trace, gold):
        return "admissibility"
    if coverage(trace.first_stage_ids, gold.required_source_ids) < 1.0:
        return "candidate retrieval"
    if coverage(trace.selected_context_ids, gold.required_source_ids) < 1.0:
        return "context selection"
    if not answer.claims:
        return "answer completeness"
    if faithfulness(answer, trace) < 1.0:
        return "answer faithfulness"
    if citation_support(answer, trace) < 1.0:
        return "citation support"
    if supported_point_coverage(answer, trace, gold) < 1.0:
        return "answer completeness"
    return "pass"

diagnoses = {
    "missing candidate": first_failed_stage(RETRIEVAL_MISS, SUPPORTED_DEPLOY, GOLD),
    "dropped context": first_failed_stage(SELECTION_MISS, SUPPORTED_DEPLOY, GOLD),
    "invented bypass": first_failed_stage(TRACE, UNSAFE_BYPASS, GOLD),
    "wrong citation": first_failed_stage(TRACE, MIS_CITED, GOLD),
    "empty answer": first_failed_stage(TRACE, EMPTY_ANSWER, GOLD),
    "supported answer": first_failed_stage(TRACE, SUPPORTED_DEPLOY, GOLD),
}
for variant, stage in diagnoses.items():
    print(f"{variant}: {stage}")
assert diagnoses["missing candidate"] == "candidate retrieval"
assert diagnoses["dropped context"] == "context selection"
assert diagnoses["invented bypass"] == "answer faithfulness"
assert diagnoses["wrong citation"] == "citation support"
assert diagnoses["empty answer"] == "answer completeness"
assert diagnoses["supported answer"] == "pass"
```
```text
missing candidate: candidate retrieval
dropped context: context selection
invented bypass: answer faithfulness
wrong citation: citation support
empty answer: answer completeness
supported answer: pass
```
The ordering is important. If the required rule never reaches context, a claim-support failure doesn't justify tuning the generation prompt yet.
Evaluation should show both that a supported answer is accepted and that known bad answers are blocked. These variants become a tiny regression suite:
```python
@dataclass(frozen=True)
class ReleaseCase:
    name: str
    trace: RagTrace
    answer: Answer
    should_release: bool

def can_release(trace: RagTrace, answer: Answer, gold: GoldCase) -> bool:
    return first_failed_stage(trace, answer, gold) == "pass"

RELEASE_CASES = (
    ReleaseCase("supported answer", TRACE, SUPPORTED_DEPLOY, True),
    ReleaseCase("unsupported bypass", TRACE, UNSAFE_BYPASS, False),
    ReleaseCase("wrong citation", TRACE, MIS_CITED, False),
    ReleaseCase("empty answer", TRACE, EMPTY_ANSWER, False),
    ReleaseCase("dropped source", SELECTION_MISS, SUPPORTED_DEPLOY, False),
)
regression_passes = 0
for case in RELEASE_CASES:
    observed = can_release(case.trace, case.answer, GOLD)
    regression_passes += observed == case.should_release
    print(f"{case.name}: release={observed}, expected={case.should_release}")
print(f"Regression checks passed: {regression_passes}/{len(RELEASE_CASES)}")
assert regression_passes == len(RELEASE_CASES)
```
```text
supported answer: release=True, expected=True
unsupported bypass: release=False, expected=False
wrong citation: release=False, expected=False
empty answer: release=False, expected=False
dropped source: release=False, expected=False
Regression checks passed: 5/5
```
Release behavior is now concrete: pass the supported response and reject the four controlled defects.
A larger golden set should include release-freeze deploys, incident hotfixes, schema migrations, and data backfills. Aggregate quality can remain high while a costly policy slice regresses.
```python
@dataclass(frozen=True)
class LabeledOutcome:
    workflow: str
    released_correctly: bool

OUTCOMES = (
    LabeledOutcome("release-freeze", True),
    LabeledOutcome("release-freeze", False),
    LabeledOutcome("incident-hotfix", True),
    LabeledOutcome("incident-hotfix", True),
    LabeledOutcome("schema-migration", False),
)
by_workflow: dict[str, list[bool]] = defaultdict(list)
for outcome in OUTCOMES:
    by_workflow[outcome.workflow].append(outcome.released_correctly)

overall = sum(outcome.released_correctly for outcome in OUTCOMES) / len(OUTCOMES)
print(f"Overall pass rate: {overall:.0%}")
for workflow, results in sorted(by_workflow.items()):
    print(f"{workflow}: {sum(results) / len(results):.0%}")
assert overall == 0.6
assert sum(by_workflow["schema-migration"]) == 0
```
```text
Overall pass rate: 60%
incident-hotfix: 100%
release-freeze: 50%
schema-migration: 0%
```
```python
release_report = {
    "service_version": "policy-answerer-v4-eval",
    "source_trace": TRACE.case_id,
    "required_policy_version": EVIDENCE[TARGET_ID].version,
    "checks": {
        "admissible_evidence_path": admissible_evidence_path(TRACE, GOLD),
        "supported_answer_released": can_release(TRACE, SUPPORTED_DEPLOY, GOLD),
        "known_bad_answers_blocked": all(
            not can_release(case.trace, case.answer, GOLD)
            for case in RELEASE_CASES
            if not case.should_release
        ),
        "slice_regression_clear": all(
            sum(results) / len(results) >= 0.95
            for results in by_workflow.values()
        ),
    },
}
print("Release checks:", release_report["checks"])
print("Release allowed:", all(release_report["checks"].values()))
assert not all(release_report["checks"].values())
```
```text
Release checks: {'admissible_evidence_path': True, 'supported_answer_released': True, 'known_bad_answers_blocked': True, 'slice_regression_clear': False}
Release allowed: False
```
The supported single trace passes, and bad-answer detection works. The broader fixture still blocks release because two slices fall below the 0.95 threshold: release-freeze at 50% and schema-migration at 0%. This is the correct result: a demonstration case can't overrule a failing workflow.
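The boolean slice_regression_clear check does not say which workflows failed. A small sketch of a more specific report follows, reusing the fixture's outcomes and the 0.95 threshold; the failing_slices helper is a hypothetical addition, not part of the harness.

```python
from __future__ import annotations

from collections import defaultdict

# Same labeled outcomes as the fixture: (workflow, released_correctly).
OUTCOMES = (
    ("release-freeze", True),
    ("release-freeze", False),
    ("incident-hotfix", True),
    ("incident-hotfix", True),
    ("schema-migration", False),
)


def failing_slices(outcomes, threshold: float = 0.95) -> dict[str, float]:
    """Return the pass rate of every workflow that falls below the threshold."""
    by_workflow: dict[str, list[bool]] = defaultdict(list)
    for workflow, released_correctly in outcomes:
        by_workflow[workflow].append(released_correctly)
    rates = {name: sum(r) / len(r) for name, r in by_workflow.items()}
    return {name: rate for name, rate in sorted(rates.items()) if rate < threshold}


for workflow, rate in failing_slices(OUTCOMES).items():
    print(f"{workflow}: {rate:.0%} (below threshold)")
```

Returning the failing slices with their rates gives the release report something actionable to attach to the blocked release.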
The harness used exact labels because policy release cases should be unambiguous. At scale, operators paraphrase questions and models paraphrase valid answers. RAGAS presents reference-free RAG metrics including faithfulness and answer relevance; ARES proposes learned judges for context relevance, answer faithfulness, and answer relevance.[2][3]
Those approaches don't remove the need for this trace design. An automated judge still needs:
| Input | Why the judge needs it |
|---|---|
| Question and expected workflow | Decide whether the answer addressed the request |
| Admitted context and version | Decide whether claims are grounded in allowed evidence |
| Atomic claims and citations | Explain which claim failed and which source was cited |
| Human-reviewed calibration set | Detect judge mistakes and drift |
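The calibration row can be made concrete with a simple agreement check. This sketch assumes hypothetical judge verdicts and human labels keyed by case ID. The calibration-case IDs, the verdict values, and the judge_agreement helper are invented for illustration, and no real judge is called.

```python
from __future__ import annotations

# Hypothetical human-reviewed labels: case_id -> should the answer be released?
HUMAN_LABELS = {
    "payment-freeze-deploy-001": True,
    "calibration-case-a": False,
    "calibration-case-b": False,
    "calibration-case-c": True,
}
# Hypothetical verdicts from an automated judge on the same cases.
JUDGE_VERDICTS = {
    "payment-freeze-deploy-001": True,
    "calibration-case-a": False,
    "calibration-case-b": True,  # judge released an answer that humans rejected
    "calibration-case-c": True,
}


def judge_agreement(
    human: dict[str, bool], judge: dict[str, bool]
) -> tuple[float, list[str]]:
    """Return the agreement rate and the case IDs where the judge disagrees."""
    if not human or human.keys() != judge.keys():
        raise ValueError("judge and human labels must cover the same non-empty case set")
    disagreements = sorted(case for case in human if human[case] != judge[case])
    return 1 - len(disagreements) / len(human), disagreements


rate, disagreements = judge_agreement(HUMAN_LABELS, JUDGE_VERDICTS)
print(f"agreement: {rate:.0%}")
print("review these cases:", disagreements)
```

Tracking the disagreeing case IDs, not only the rate, shows whether judge errors cluster in one workflow. Rerunning the same check after a judge or prompt change is one way to notice drift.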
The next lesson will replace selected deterministic judgments with rubric-driven LLM judgments while keeping hard evidence and citation gates visible.
Before releasing a RAG answer pipeline:
| Gate | Minimum evidence | First fix when it fails |
|---|---|---|
| Admissibility | Retrieved, reranked, and selected chunk IDs; versions; ACL/freshness decision | Evidence filtering |
| Candidate recall | Required chunk in retrieved candidates | Chunking, query, or retrieval lane |
| Context admission | Required chunk in selected context, no noise padding | Reranker or context gate |
| Faithfulness | Supported claim ledger | Prompt, abstention, or generation policy |
| Citation support | Claim-to-source validation | Citation attachment or validator |
| Slice health | Workflow-level regression report | Block release and debug affected slice |
Don't compress these gates into one quality percentage. A dashboard should tell an engineer what to repair first.
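One way to keep the gates separate in a report is to return the earliest failing gate together with its first fix, in checklist order. The gate names and fixes below come from the table; the first_repair helper and the sample results are hypothetical.

```python
from __future__ import annotations

# Gates in pipeline order, each paired with the first fix from the checklist.
GATE_FIXES = (
    ("admissibility", "evidence filtering"),
    ("candidate recall", "chunking, query, or retrieval lane"),
    ("context admission", "reranker or context gate"),
    ("faithfulness", "prompt, abstention, or generation policy"),
    ("citation support", "citation attachment or validator"),
    ("slice health", "block release and debug affected slice"),
)


def first_repair(results: dict[str, bool]) -> tuple[str, str] | None:
    """Return (gate, fix) for the earliest failing gate, or None if all pass."""
    for gate, fix in GATE_FIXES:
        # A gate that was never evaluated is treated as failing.
        if not results.get(gate, False):
            return gate, fix
    return None


# Hypothetical run: two gates fail, but only the earliest is reported first.
sample = {gate: True for gate, _ in GATE_FIXES}
sample["faithfulness"] = False
sample["slice health"] = False
print(first_repair(sample))
```

Treating a missing gate result as a failure is a deliberate choice in this sketch: a report that skipped a gate should not read as a pass.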
You're ready to evaluate a RAG answer pipeline when you can:
| Level | Evidence in submission |
|---|---|
| Foundational | Carries case ID, source identity, and pipeline versions into answer evaluation. |
| Applied | Separates candidate retrieval, context admission, claim support, citations, and completeness. |
| Strong | Rejects malformed traces and known bad answers while reporting the first failed stage. |
| Production-ready | Adds slice-level release gates and explains where calibrated judges may enter. |
[1] Manning, C. D., Raghavan, P., Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
[2] Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint.
[3] Saad-Falcon, J., et al. (2023). ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. NAACL 2024.