RAG Evaluation for Reliable Answers

Evaluate a permission-safe RAG answer trace with context, claim, citation, failure-attribution, and release gates before automating softer judgments.

policy-answerer-v3 selects one current, permitted policy chunk: deploy-freeze-approval-rule. That's necessary evidence for Maya's reply, but it isn't proof of a reliable answer. A generator can still say the deploy may start without a rollback plan, or cite the right rule beside a claim it doesn't support.

The next step is an evaluation harness for that trace. Retrieval-augmented generation (RAG) evaluation should tell you where a response first became unsafe: evidence access, candidate retrieval, context selection, answer claims, or citations. You'll turn each boundary into an executable eval gate before the next lesson introduces large language model (LLM) judges for semantic cases that deterministic labels can't cover.

Figure: A shared admissible RAG trace splits into a supported deploy answer and an unsafe bypass answer. Both answers start from the same admissible trace. The supported path keeps every proof chip green, while one invented bypass claim is enough to block release.

Carry the trace forward

Maya's question remains unchanged:

Can I deploy payment-service during the release freeze? Incident commander approval is attached, and the rollback plan is linked.

policy-answerer-v3 already established that hybrid retrieval found the correct rule and reranking selected it for context. The evaluation fixture carries that evidence path forward instead of inventing a new example.

evaluation-trace-fixture.py

from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, replace

TARGET_ID = "deploy-freeze-approval-rule"

@dataclass(frozen=True)
class EvidenceChunk:
    chunk_id: str
    document_id: str
    parent_id: str
    version: str
    permitted: bool
    current: bool
    text: str

@dataclass(frozen=True)
class GoldCase:
    case_id: str
    question: str
    required_source_ids: frozenset[str]
    required_points: frozenset[str]

@dataclass(frozen=True)
class RagTrace:
    case_id: str
    first_stage_ids: tuple[str, ...]
    rerank_input_ids: tuple[str, ...]
    reranked_ids: tuple[str, ...]
    selected_context_ids: tuple[str, ...]
    selected_versions: tuple[str, ...]
    versions: tuple[tuple[str, str], ...]

EVIDENCE = {
    TARGET_ID: EvidenceChunk(
        TARGET_ID,
        "deploy-policy",
        "deploy-policy-v2",
        "deploy-policy/2026-06-01",
        True,
        True,
        (
            "Rule DEPLOY-17. Payment-service production deploys during a release "
            "freeze require incident commander approval and a linked rollback plan "
            "before rollout."
        ),
    ),
    "payment-service-rollback-runbook": EvidenceChunk(
        "payment-service-rollback-runbook",
        "payment-rollback",
        "payment-rollback-v1",
        "payment-rollback/2026-05-20",
        True,
        True,
        (
            "Payment-service rollback drills must keep the previous artifact "
            "available for 30 minutes. This runbook does not authorize freeze deploys."
        ),
    ),
    "frontend-docs-deploy-rule": EvidenceChunk(
        "frontend-docs-deploy-rule",
        "frontend-docs",
        "frontend-docs-v1",
        "frontend-docs/2026-03-01",
        True,
        True,
        "Frontend documentation deploys may ship during normal business hours without freeze approval.",
    ),
    "restricted-breakglass-note": EvidenceChunk(
        "restricted-breakglass-note",
        "breakglass-terms",
        "breakglass-terms",
        "breakglass/2026-05-01",
        False,
        True,
        "Payment-service deploys may bypass freeze approval during executive escalations.",
    ),
}
GOLD = GoldCase(
    case_id="payment-freeze-deploy-001",
    question=(
        "Can I deploy payment-service during the release freeze? Incident commander "
        "approval is attached, and the rollback plan is linked."
    ),
    required_source_ids=frozenset({TARGET_ID}),
    required_points=frozenset({"freeze-scope", "approval", "rollback-plan"}),
)
TRACE = RagTrace(
    case_id=GOLD.case_id,
    first_stage_ids=(
        "payment-service-rollback-runbook",
        "frontend-docs-deploy-rule",
        TARGET_ID,
    ),
    rerank_input_ids=(
        "payment-service-rollback-runbook",
        "frontend-docs-deploy-rule",
        TARGET_ID,
    ),
    reranked_ids=(TARGET_ID, "payment-service-rollback-runbook", "frontend-docs-deploy-rule"),
    selected_context_ids=(TARGET_ID,),
    selected_versions=("deploy-policy/2026-06-01",),
    versions=(
        ("retriever", "policy-retriever-v2"),
        ("index", "policy-index/2026-05-27"),
        ("sparse", "bm25-tokenizer-v1"),
        ("dense", "fixture-embeddings-v1"),
        ("fusion", "rrf-k60"),
        ("reranker", "fixture-cross-encoder-v1"),
    ),
)

selected_sources = [
    (
        chunk.chunk_id,
        chunk.document_id,
        chunk.parent_id,
        chunk.version,
    )
    for chunk in (EVIDENCE[chunk_id] for chunk_id in TRACE.selected_context_ids)
]
print("Selected context:", TRACE.selected_context_ids)
print("Selected sources:", selected_sources)
print("Pipeline versions:", dict(TRACE.versions))
assert TRACE.selected_context_ids == (TARGET_ID,)

Output

Selected context: ('deploy-freeze-approval-rule',)
Selected sources: [('deploy-freeze-approval-rule', 'deploy-policy', 'deploy-policy-v2', 'deploy-policy/2026-06-01')]
Pipeline versions: {'retriever': 'policy-retriever-v2', 'index': 'policy-index/2026-05-27', 'sparse': 'bm25-tokenizer-v1', 'dense': 'fixture-embeddings-v1', 'fusion': 'rrf-k60', 'reranker': 'fixture-cross-encoder-v1'}

The trace is an evaluation input, not a log decoration. Through its case ID, it binds a question to exact source identities and pipeline versions without storing raw policy text. During replay, the evaluator loads permitted source text from the versioned evidence store.
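The replay step can be sketched as a lookup keyed by chunk ID and version. This is a minimal illustration, not the lesson's evaluator: `StoredChunk`, `STORE`, and `load_for_replay` are hypothetical names, and the in-memory dictionary stands in for whatever versioned evidence store is actually in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredChunk:
    """Hypothetical evidence-store record; field names are assumptions."""

    chunk_id: str
    version: str
    permitted: bool
    text: str


# Stand-in for a versioned evidence store keyed by (chunk_id, version).
STORE = {
    ("deploy-freeze-approval-rule", "deploy-policy/2026-06-01"): StoredChunk(
        "deploy-freeze-approval-rule",
        "deploy-policy/2026-06-01",
        True,
        "Rule DEPLOY-17. Payment-service production deploys during a release "
        "freeze require incident commander approval and a linked rollback plan "
        "before rollout.",
    ),
}


def load_for_replay(chunk_ids, versions):
    """Resolve traced (chunk_id, version) pairs to permitted source text."""
    if len(chunk_ids) != len(versions):
        raise ValueError("each selected chunk needs exactly one version")
    texts = []
    for key in zip(chunk_ids, versions):
        chunk = STORE.get(key)
        if chunk is None:
            raise LookupError(f"no stored evidence for {key}")
        if not chunk.permitted:
            raise PermissionError(f"{key[0]} is not permitted for replay")
        texts.append(chunk.text)
    return texts


replayed = load_for_replay(
    ("deploy-freeze-approval-rule",), ("deploy-policy/2026-06-01",)
)
print("Replayed chunks:", len(replayed))
```

Because the lookup key includes the version, a trace that names an older policy version fails loudly instead of silently grading against newer text.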

Gate admissibility before scoring quality

An answer should fail immediately if restricted, stale, or unknown evidence entered retrieval, reranking, or selected context. A relevance score can't make forbidden evidence acceptable, even when a later stage happens to drop it.

admissible-evidence-path-gate.py

def admissible_evidence_path(trace: RagTrace, gold: GoldCase) -> bool:
    if trace.case_id != gold.case_id:
        return False
    if not trace.selected_context_ids:
        return False
    if len(trace.selected_context_ids) != len(trace.selected_versions):
        return False
    required_version_keys = {"retriever", "index", "sparse", "dense", "fusion", "reranker"}
    version_map = dict(trace.versions)
    if len(version_map) != len(trace.versions):
        return False
    if not required_version_keys.issubset(version_map):
        return False
    stage_ids = (
        trace.first_stage_ids,
        trace.rerank_input_ids,
        trace.reranked_ids,
        trace.selected_context_ids,
    )
    if any(len(ids) != len(set(ids)) for ids in stage_ids):
        return False
    traced_ids = tuple(chunk_id for ids in stage_ids for chunk_id in ids)
    if any(chunk_id not in EVIDENCE for chunk_id in traced_ids):
        return False
    traced = [EVIDENCE[chunk_id] for chunk_id in traced_ids]
    selected = [EVIDENCE[chunk_id] for chunk_id in trace.selected_context_ids]
    rerank_input_came_from_retrieval = set(trace.rerank_input_ids).issubset(
        trace.first_stage_ids
    )
    reranked_same_candidates = set(trace.reranked_ids) == set(trace.rerank_input_ids)
    returned_by_reranker = set(trace.selected_context_ids).issubset(trace.reranked_ids)
    versions_match = all(
        chunk.version == version
        for chunk, version in zip(selected, trace.selected_versions)
    )
    allowed_and_current = all(chunk.permitted and chunk.current for chunk in traced)
    return (
        rerank_input_came_from_retrieval
        and reranked_same_candidates
        and returned_by_reranker
        and versions_match
        and allowed_and_current
    )

restricted_trace = replace(
    TRACE,
    first_stage_ids=("restricted-breakglass-note",),
    rerank_input_ids=("restricted-breakglass-note",),
    reranked_ids=("restricted-breakglass-note",),
    selected_context_ids=("restricted-breakglass-note",),
    selected_versions=("breakglass/2026-05-01",),
)
blocked_candidate_trace = replace(
    TRACE,
    first_stage_ids=TRACE.first_stage_ids + ("restricted-breakglass-note",),
    rerank_input_ids=TRACE.rerank_input_ids + ("restricted-breakglass-note",),
    reranked_ids=TRACE.reranked_ids + ("restricted-breakglass-note",),
)
unknown_candidate_trace = replace(TRACE, first_stage_ids=TRACE.first_stage_ids + ("missing",))
stale_version_trace = replace(TRACE, selected_versions=("deploy-policy/2025-02-01",))
missing_version_trace = replace(TRACE, versions=TRACE.versions[:-1])
wrong_case_trace = replace(TRACE, case_id="payment-freeze-deploy-002")
duplicate_candidate_trace = replace(
    TRACE,
    first_stage_ids=TRACE.first_stage_ids + (TARGET_ID,),
)
print("Production path admissible:", admissible_evidence_path(TRACE, GOLD))
print("Restricted context admissible:", admissible_evidence_path(restricted_trace, GOLD))
print("Blocked candidate admissible:", admissible_evidence_path(blocked_candidate_trace, GOLD))
print("Unknown candidate admissible:", admissible_evidence_path(unknown_candidate_trace, GOLD))
print("Wrong-version path admissible:", admissible_evidence_path(stale_version_trace, GOLD))
print("Incomplete trace admissible:", admissible_evidence_path(missing_version_trace, GOLD))
print("Wrong-case trace admissible:", admissible_evidence_path(wrong_case_trace, GOLD))
print("Duplicate candidate admissible:", admissible_evidence_path(duplicate_candidate_trace, GOLD))
assert admissible_evidence_path(TRACE, GOLD)
assert not admissible_evidence_path(restricted_trace, GOLD)
assert not admissible_evidence_path(blocked_candidate_trace, GOLD)
assert not admissible_evidence_path(unknown_candidate_trace, GOLD)
assert not admissible_evidence_path(stale_version_trace, GOLD)
assert not admissible_evidence_path(missing_version_trace, GOLD)
assert not admissible_evidence_path(wrong_case_trace, GOLD)
assert not admissible_evidence_path(duplicate_candidate_trace, GOLD)

Output

Production path admissible: True
Restricted context admissible: False
Blocked candidate admissible: False
Unknown candidate admissible: False
Wrong-version path admissible: False
Incomplete trace admissible: False
Wrong-case trace admissible: False
Duplicate candidate admissible: False

The case-ID check prevents labels from one replay from grading another replay. The uniqueness check rejects malformed rankings before duplicated IDs distort later metrics.
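To see why duplicated IDs matter, consider a per-position hit rate that counts every occurrence. The sketch below is illustrative only: `naive_hit_rate` is a hypothetical helper, not one of the lesson's metrics, and it is deliberately naive about repeats.

```python
def naive_hit_rate(ids, useful_ids):
    """Per-position hit rate: counts every occurrence, including repeats."""
    return sum(chunk_id in useful_ids for chunk_id in ids) / len(ids)


useful = {"deploy-freeze-approval-rule"}
clean = ("deploy-freeze-approval-rule", "frontend-docs-deploy-rule")
duplicated = clean + ("deploy-freeze-approval-rule",)

print(f"clean ranking: {naive_hit_rate(clean, useful):.2f}")
print(f"duplicated ranking: {naive_hit_rate(duplicated, useful):.2f}")
```

The same two distinct chunks score 0.50 without the repeat and 0.67 with it, so rejecting the malformed ranking up front keeps later numbers comparable.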

Separate candidates from selected context

The previous two lessons already taught ranking metrics such as Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). Those metrics answer whether the best evidence appears early in a ranking.^[1] At answer evaluation time, retain two simpler questions:

Layer	Gate	Meaning on this trace
Candidate retrieval	`candidate_recall`	Did hybrid search provide the required rule?
Selected context	`context_recall`	Did the rule survive reranking and admission?
Selected context	`context_precision`	Did context avoid unnecessary distractors?

evidence-path-metrics.py

def coverage(ids: tuple[str, ...], required_ids: frozenset[str]) -> float:
    return len(set(ids) & required_ids) / len(required_ids)

def context_precision(ids: tuple[str, ...], useful_ids: frozenset[str]) -> float:
    return len(set(ids) & useful_ids) / len(ids) if ids else 0.0

candidate_recall = coverage(TRACE.first_stage_ids, GOLD.required_source_ids)
context_recall = coverage(TRACE.selected_context_ids, GOLD.required_source_ids)
selected_precision = context_precision(TRACE.selected_context_ids, GOLD.required_source_ids)
print(f"Candidate recall: {candidate_recall:.1f}")
print(f"Selected-context recall: {context_recall:.1f}")
print(f"Selected-context precision: {selected_precision:.1f}")
assert (candidate_recall, context_recall, selected_precision) == (1.0, 1.0, 1.0)

Output

Candidate recall: 1.0
Selected-context recall: 1.0
Selected-context precision: 1.0

This result clears only the evidence path. It doesn't say what the model wrote.
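For contrast with the set-based gates above, a rank-aware metric such as reciprocal rank does distinguish the two stages of this trace. The following is a small sketch using the standard definition (one over the rank of the first relevant item); `reciprocal_rank` is an illustrative helper, and the ID tuples are copied from the trace fixture.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1 / rank of the first relevant ID, or 0.0 if none appears."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


required = {"deploy-freeze-approval-rule"}
first_stage = (
    "payment-service-rollback-runbook",
    "frontend-docs-deploy-rule",
    "deploy-freeze-approval-rule",
)
reranked = (
    "deploy-freeze-approval-rule",
    "payment-service-rollback-runbook",
    "frontend-docs-deploy-rule",
)

print(f"first-stage reciprocal rank: {reciprocal_rank(first_stage, required):.2f}")
print(f"reranked reciprocal rank: {reciprocal_rank(reranked, required):.2f}")
```

Recall is 1.0 for both lists because the rule is present in each, while reciprocal rank moves from 0.33 to 1.00 once reranking puts the rule first. The answer-level gates only need the former question answered.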

Turn an answer into testable claims

Consider two responses produced from the same valid context:

Response	What changed?
`unsafe-bypass`	Adds a no-rollback-plan claim that the deploy rule never states
`supported-deploy`	States only the deploy scope and conditions present in `DEPLOY-17`

For a labeled release case, represent an answer as atomic policy claims. Each claim names its cited source, the source phrases that must establish it, and the expected answer point it covers.

answer-claim-fixtures.py

@dataclass(frozen=True)
class Claim:
    claim_id: str
    text: str
    citation_id: str | None
    support_phrases: tuple[str, ...]
    answer_point: str

@dataclass(frozen=True)
class Answer:
    answer_id: str
    claims: tuple[Claim, ...]

UNSAFE_BYPASS = Answer(
    "unsafe-bypass",
    (
        Claim(
            "freeze-scope",
            "The request is governed by the release-freeze deploy rule.",
            TARGET_ID,
            ("payment-service production deploys", "release freeze"),
            "freeze-scope",
        ),
        Claim(
            "bypass",
            "The deploy can start without a linked rollback plan.",
            TARGET_ID,
            ("without a linked rollback plan",),
            "rollback-plan",
        ),
    ),
)
SUPPORTED_DEPLOY = Answer(
    "supported-deploy",
    (
        Claim(
            "freeze-scope",
            "The request is governed by the release-freeze deploy rule.",
            TARGET_ID,
            ("payment-service production deploys", "release freeze"),
            "freeze-scope",
        ),
        Claim(
            "approval",
            "Incident commander approval is required.",
            TARGET_ID,
            ("require incident commander approval",),
            "approval",
        ),
        Claim(
            "rollback-plan",
            "A linked rollback plan is required before rollout.",
            TARGET_ID,
            ("linked rollback plan", "before rollout"),
            "rollback-plan",
        ),
    ),
)
EMPTY_ANSWER = Answer("empty", ())

print("Unsafe claims:", [claim.claim_id for claim in UNSAFE_BYPASS.claims])
print("Supported claims:", [claim.claim_id for claim in SUPPORTED_DEPLOY.claims])
print("Empty claims:", [claim.claim_id for claim in EMPTY_ANSWER.claims])

Output

Unsafe claims: ['freeze-scope', 'bypass']
Supported claims: ['freeze-scope', 'approval', 'rollback-plan']
Empty claims: []

This is a golden-set contract. It's intentionally strict and inspectable. It won't recognize every correct paraphrase in live traffic; that softer matching problem belongs after you understand the release invariant.
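The strictness is easy to demonstrate. The sketch below assumes support phrases are matched as case-insensitive substrings of the source text; `phrases_present` is an illustrative helper, and the paraphrase is an invented example of wording a model might plausibly produce.

```python
SOURCE = (
    "Rule DEPLOY-17. Payment-service production deploys during a release "
    "freeze require incident commander approval and a linked rollback plan "
    "before rollout."
)


def phrases_present(phrases, source):
    """Return True only if every phrase is a case-insensitive substring."""
    lowered = source.lower()
    return all(phrase.lower() in lowered for phrase in phrases)


exact = ("require incident commander approval",)
paraphrase = ("approval from the incident commander is needed",)

print("exact phrase matches:", phrases_present(exact, SOURCE))
print("paraphrase matches:", phrases_present(paraphrase, SOURCE))
```

The paraphrase means the same thing and still fails, which is acceptable for a labeled regression case and is exactly the gap a softer matcher would have to cover.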

Faithfulness checks claims against context

Faithfulness asks whether selected context supports the answer's claims. RAGAS defines this family of evaluation by decomposing a response into claims and checking support from retrieved context.^[2] For a labeled high-impact policy case, exact required phrases give a deterministic first gate:

$$
\operatorname{faithfulness} = \frac{\text{claims supported by selected context}}{\text{claims in the answer}}
$$

claim-faithfulness-gate.py

def source_supports(claim: Claim, chunk: EvidenceChunk) -> bool:
    source = chunk.text.lower()
    return all(phrase.lower() in source for phrase in claim.support_phrases)

def claim_supported_by_context(claim: Claim, trace: RagTrace) -> bool:
    return any(
        source_supports(claim, EVIDENCE[chunk_id])
        for chunk_id in trace.selected_context_ids
    )

def faithfulness(answer: Answer, trace: RagTrace) -> float:
    supported = sum(
        claim_supported_by_context(claim, trace) for claim in answer.claims
    )
    return supported / len(answer.claims) if answer.claims else 0.0

print(f"unsafe-bypass faithfulness: {faithfulness(UNSAFE_BYPASS, TRACE):.2f}")
print(
    "supported-deploy faithfulness: "
    f"{faithfulness(SUPPORTED_DEPLOY, TRACE):.2f}"
)
print(f"empty faithfulness: {faithfulness(EMPTY_ANSWER, TRACE):.2f}")
assert faithfulness(UNSAFE_BYPASS, TRACE) == 0.5
assert faithfulness(SUPPORTED_DEPLOY, TRACE) == 1.0
assert faithfulness(EMPTY_ANSWER, TRACE) == 0.0

Output

unsafe-bypass faithfulness: 0.50
supported-deploy faithfulness: 1.00
empty faithfulness: 0.00

The unsafe response is on topic and cites a real selected rule. It still fails because one claim outruns the rule.

Citation presence isn't citation support

A citation metric needs two checks:

Coverage: Does every policy claim cite a source?
Support: Does the cited selected source establish that claim?

A fabricated rollback exemption can achieve perfect citation coverage by attaching the correct-looking chunk ID. That's why presence alone is a weak gate.

citation-support-gate.py

MIS_CITED = Answer(
    "mis-cited",
    tuple(
        replace(claim, citation_id="payment-service-rollback-runbook")
        for claim in SUPPORTED_DEPLOY.claims
    ),
)

def citation_coverage(answer: Answer) -> float:
    cited = sum(claim.citation_id is not None for claim in answer.claims)
    return cited / len(answer.claims) if answer.claims else 0.0

def citation_support(answer: Answer, trace: RagTrace) -> float:
    supported = 0
    for claim in answer.claims:
        if claim.citation_id not in trace.selected_context_ids:
            continue
        if source_supports(claim, EVIDENCE[claim.citation_id]):
            supported += 1
    return supported / len(answer.claims) if answer.claims else 0.0

print(f"unsafe citation coverage: {citation_coverage(UNSAFE_BYPASS):.2f}")
print(f"unsafe citation support: {citation_support(UNSAFE_BYPASS, TRACE):.2f}")
print(f"mis-cited answer faithfulness: {faithfulness(MIS_CITED, TRACE):.2f}")
print(f"mis-cited citation support: {citation_support(MIS_CITED, TRACE):.2f}")
assert citation_coverage(UNSAFE_BYPASS) == 1.0
assert citation_support(UNSAFE_BYPASS, TRACE) == 0.5
assert faithfulness(MIS_CITED, TRACE) == 1.0
assert citation_support(MIS_CITED, TRACE) == 0.0

Output

unsafe citation coverage: 1.00
unsafe citation support: 0.50
mis-cited answer faithfulness: 1.00
mis-cited citation support: 0.00

In the mis-cited response, selected context could support every sentence, so faithfulness is high. Its citations still fail because they don't point to the selected source that proves those sentences.

Check whether the answer finished the task

A faithful response can still omit a required condition. For this golden case, Maya needs three answer points: deploy scope, approval, and rollback plan. Treat point coverage as a labeled completeness gate:

answer-point-coverage.py

def supported_point_coverage(answer: Answer, trace: RagTrace, gold: GoldCase) -> float:
    supported_points = {
        claim.answer_point
        for claim in answer.claims
        if claim_supported_by_context(claim, trace)
    }
    return len(supported_points & gold.required_points) / len(gold.required_points)

unsafe_coverage = supported_point_coverage(UNSAFE_BYPASS, TRACE, GOLD)
supported_coverage = supported_point_coverage(SUPPORTED_DEPLOY, TRACE, GOLD)
print(f"unsafe supported point coverage: {unsafe_coverage:.2f}")
print(f"supported answer point coverage: {supported_coverage:.2f}")
assert unsafe_coverage == 1 / 3
assert supported_coverage == 1.0

Output

unsafe supported point coverage: 0.33
supported answer point coverage: 1.00

This metric is stronger than asking whether an answer sounds relevant. It knows exactly which policy conditions this regression case must preserve. On live questions without labels, you'll need sampled human review or a calibrated judge, which is the next lesson's problem.
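One way to pick live traces for human review is deterministic sampling by hashing a trace identifier, so the same trace is always in or out of the sample. This is a sketch under assumptions: `in_review_sample`, the `trace-N` IDs, and the 10% rate are all hypothetical, and the lesson does not prescribe a sampling method.

```python
import hashlib


def in_review_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


# Hypothetical live trace IDs; about one in ten should land in the sample.
sampled = sum(in_review_sample(f"trace-{i}", 0.10) for i in range(1000))
print("traces selected for review:", sampled)
```

Hashing instead of calling a random generator keeps the review set stable across reruns, which makes reviewer labels reusable as calibration data later.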

Attribute the first failure

One trace can fail at several layers after the first defect. If selected context drops the target rule, later unsupported claims are consequences, not the first fix. Diagnose in pipeline order.

Figure: Gate-by-gate diagnosis flow for one RAG trace. Every variant shares one gate path: a missing candidate stops at retrieval, dropped context stops at selection, an empty answer stops at claims, an invented bypass stops at faithfulness, a wrong source stops at citation, and the supported answer clears every gate. The first red branch owns the repair.

first-failed-stage.py

RETRIEVAL_MISS = replace(
    TRACE,
    first_stage_ids=("payment-service-rollback-runbook", "frontend-docs-deploy-rule"),
    rerank_input_ids=("payment-service-rollback-runbook", "frontend-docs-deploy-rule"),
    reranked_ids=("payment-service-rollback-runbook", "frontend-docs-deploy-rule"),
    selected_context_ids=("payment-service-rollback-runbook",),
    selected_versions=("payment-rollback/2026-05-20",),
)
SELECTION_MISS = replace(
    TRACE,
    selected_context_ids=("payment-service-rollback-runbook",),
    selected_versions=("payment-rollback/2026-05-20",),
)

def first_failed_stage(trace: RagTrace, answer: Answer, gold: GoldCase) -> str:
    if not admissible_evidence_path(trace, gold):
        return "admissibility"
    if coverage(trace.first_stage_ids, gold.required_source_ids) < 1.0:
        return "candidate retrieval"
    if coverage(trace.selected_context_ids, gold.required_source_ids) < 1.0:
        return "context selection"
    if not answer.claims:
        return "answer completeness"
    if faithfulness(answer, trace) < 1.0:
        return "answer faithfulness"
    if citation_support(answer, trace) < 1.0:
        return "citation support"
    if supported_point_coverage(answer, trace, gold) < 1.0:
        return "answer completeness"
    return "pass"

diagnoses = {
    "missing candidate": first_failed_stage(RETRIEVAL_MISS, SUPPORTED_DEPLOY, GOLD),
    "dropped context": first_failed_stage(SELECTION_MISS, SUPPORTED_DEPLOY, GOLD),
    "invented bypass": first_failed_stage(TRACE, UNSAFE_BYPASS, GOLD),
    "wrong citation": first_failed_stage(TRACE, MIS_CITED, GOLD),
    "empty answer": first_failed_stage(TRACE, EMPTY_ANSWER, GOLD),
    "supported answer": first_failed_stage(TRACE, SUPPORTED_DEPLOY, GOLD),
}
for variant, stage in diagnoses.items():
    print(f"{variant}: {stage}")
assert diagnoses["missing candidate"] == "candidate retrieval"
assert diagnoses["dropped context"] == "context selection"
assert diagnoses["invented bypass"] == "answer faithfulness"
assert diagnoses["wrong citation"] == "citation support"
assert diagnoses["empty answer"] == "answer completeness"
assert diagnoses["supported answer"] == "pass"

Output

missing candidate: candidate retrieval
dropped context: context selection
invented bypass: answer faithfulness
wrong citation: citation support
empty answer: answer completeness
supported answer: pass

The ordering is important. If the required rule never reaches context, a claim-support failure doesn't justify tuning the generation prompt yet.

Make bad behaviors part of release testing

Evaluation should show both that a supported answer is accepted and that known bad answers are blocked. These variants become a tiny regression suite:

answer-release-regression.py

@dataclass(frozen=True)
class ReleaseCase:
    name: str
    trace: RagTrace
    answer: Answer
    should_release: bool

def can_release(trace: RagTrace, answer: Answer, gold: GoldCase) -> bool:
    return first_failed_stage(trace, answer, gold) == "pass"

RELEASE_CASES = (
    ReleaseCase("supported answer", TRACE, SUPPORTED_DEPLOY, True),
    ReleaseCase("unsupported bypass", TRACE, UNSAFE_BYPASS, False),
    ReleaseCase("wrong citation", TRACE, MIS_CITED, False),
    ReleaseCase("empty answer", TRACE, EMPTY_ANSWER, False),
    ReleaseCase("dropped source", SELECTION_MISS, SUPPORTED_DEPLOY, False),
)
regression_passes = 0
for case in RELEASE_CASES:
    observed = can_release(case.trace, case.answer, GOLD)
    regression_passes += observed == case.should_release
    print(f"{case.name}: release={observed}, expected={case.should_release}")
print(f"Regression checks passed: {regression_passes}/{len(RELEASE_CASES)}")
assert regression_passes == len(RELEASE_CASES)

Output

supported answer: release=True, expected=True
unsupported bypass: release=False, expected=False
wrong citation: release=False, expected=False
empty answer: release=False, expected=False
dropped source: release=False, expected=False
Regression checks passed: 5/5

Release behavior is now concrete: pass the supported response and reject the four controlled defects.

Slice results before trusting an average

A larger golden set should include release-freeze deploys, incident hotfixes, schema migrations, and data backfills. Aggregate quality can remain high while a costly policy slice regresses.

slice-level-report.py

@dataclass(frozen=True)
class LabeledOutcome:
    workflow: str
    released_correctly: bool

OUTCOMES = (
    LabeledOutcome("release-freeze", True),
    LabeledOutcome("release-freeze", False),
    LabeledOutcome("incident-hotfix", True),
    LabeledOutcome("incident-hotfix", True),
    LabeledOutcome("schema-migration", False),
)
by_workflow: dict[str, list[bool]] = defaultdict(list)
for outcome in OUTCOMES:
    by_workflow[outcome.workflow].append(outcome.released_correctly)

overall = sum(outcome.released_correctly for outcome in OUTCOMES) / len(OUTCOMES)
print(f"Overall pass rate: {overall:.0%}")
for workflow, results in sorted(by_workflow.items()):
    print(f"{workflow}: {sum(results) / len(results):.0%}")
assert overall == 0.6
assert sum(by_workflow["schema-migration"]) == 0

Output

Overall pass rate: 60%
incident-hotfix: 100%
release-freeze: 50%
schema-migration: 0%

release-report.py

release_report = {
    "service_version": "policy-answerer-v4-eval",
    "source_trace": TRACE.case_id,
    "required_policy_version": EVIDENCE[TARGET_ID].version,
    "checks": {
        "admissible_evidence_path": admissible_evidence_path(TRACE, GOLD),
        "supported_answer_released": can_release(TRACE, SUPPORTED_DEPLOY, GOLD),
        "known_bad_answers_blocked": all(
            not can_release(case.trace, case.answer, GOLD)
            for case in RELEASE_CASES
            if not case.should_release
        ),
        "slice_regression_clear": all(
            sum(results) / len(results) >= 0.95
            for results in by_workflow.values()
        ),
    },
}
print("Release checks:", release_report["checks"])
print("Release allowed:", all(release_report["checks"].values()))
assert not all(release_report["checks"].values())

Output

Release checks: {'admissible_evidence_path': True, 'supported_answer_released': True, 'known_bad_answers_blocked': True, 'slice_regression_clear': False}
Release allowed: False

The supported single trace passes, and bad-answer detection works. The broader fixture still blocks release because two slices fall below the 95% threshold: release-freeze at 50% and schema-migration at 0%. This is the correct result: a demonstration case can't overrule a failing workflow.
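A boolean slice check says that release is blocked but not where. A small sketch that names the offending slices, assuming the same 0.95 threshold and the same outcomes as the fixture above; `failing_slices` is a hypothetical helper, and treating an empty slice as failing is a design assumption.

```python
def failing_slices(results_by_slice, threshold=0.95):
    """Return sorted slice names whose pass rate is below the threshold.

    A slice with no results is reported as failing, since it proves nothing.
    """
    return sorted(
        name
        for name, results in results_by_slice.items()
        if not results or sum(results) / len(results) < threshold
    )


# Same outcomes as the slice-level fixture, grouped by workflow.
by_workflow = {
    "release-freeze": [True, False],
    "incident-hotfix": [True, True],
    "schema-migration": [False],
}

print("Slices blocking release:", failing_slices(by_workflow))
```

Listing the slices turns the release report into a work queue: each named workflow is a place to start debugging.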

Where automated judges enter

The harness used exact labels because policy release cases should be unambiguous. At scale, operators paraphrase questions and models paraphrase valid answers. RAGAS presents reference-free RAG metrics including faithfulness and answer relevance; ARES proposes learned judges for context relevance, answer faithfulness, and answer relevance.^[2]^[3]

Those approaches don't remove the need for this trace design. An automated judge still needs:

Input	Why the judge needs it
Question and expected workflow	Decide whether the answer addressed the request
Admitted context and version	Decide whether claims are grounded in allowed evidence
Atomic claims and citations	Explain which claim failed and which source was cited
Human-reviewed calibration set	Detect judge mistakes and drift
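The per-answer inputs in the table can be bundled into one record and checked before a judge ever runs. The sketch below is an assumption about shape, not a real judge API: `JudgeClaim`, `JudgeInput`, and `missing_inputs` are hypothetical names. The calibration set is left out because it belongs to the judge as a whole rather than to a single answer.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class JudgeClaim:
    claim_id: str
    text: str
    citation_id: Optional[str]


@dataclass(frozen=True)
class JudgeInput:
    question: str
    workflow: str
    context: Tuple[Tuple[str, str], ...]  # admitted (chunk_id, version) pairs
    claims: Tuple[JudgeClaim, ...]


def missing_inputs(item: JudgeInput) -> List[str]:
    """List the inputs a judge would lack; an empty list means ready to judge."""
    missing = []
    if not item.question.strip():
        missing.append("question")
    if not item.workflow.strip():
        missing.append("workflow")
    if not item.context:
        missing.append("context")
    if not item.claims:
        missing.append("claims")
    return missing


complete = JudgeInput(
    question="Can I deploy payment-service during the release freeze?",
    workflow="release-freeze",
    context=(("deploy-freeze-approval-rule", "deploy-policy/2026-06-01"),),
    claims=(
        JudgeClaim(
            "approval",
            "Incident commander approval is required.",
            "deploy-freeze-approval-rule",
        ),
    ),
)
print("complete record is missing:", missing_inputs(complete))
print("no-context record is missing:", missing_inputs(replace(complete, context=())))
```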

The next lesson will replace selected deterministic judgments with rubric-driven LLM judgments while keeping hard evidence and citation gates visible.

Production checklist

Before releasing a RAG answer pipeline:

Gate	Minimum evidence	First fix when it fails
Admissibility	Retrieved, reranked, and selected chunk IDs; versions; ACL/freshness decision	Evidence filtering
Candidate recall	Required chunk in retrieved candidates	Chunking, query, or retrieval lane
Context admission	Required chunk in selected context, no noise padding	Reranker or context gate
Faithfulness	Supported claim ledger	Prompt, abstention, or generation policy
Citation support	Claim-to-source validation	Citation attachment or validator
Slice health	Workflow-level regression report	Block release and debug affected slice

Don't compress these gates into one quality percentage. A dashboard should tell an engineer what to repair first.
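As a rough illustration of why a single percentage hides the repair, the sketch below encodes the checklist's gate order and first fixes as data. The mapping is copied from the table; `repair_first` and the example gate results are hypothetical, and a gate with no recorded result is treated as failing by assumption.

```python
# Gate order and first fixes, copied from the production checklist.
FIRST_FIX = {
    "Admissibility": "Evidence filtering",
    "Candidate recall": "Chunking, query, or retrieval lane",
    "Context admission": "Reranker or context gate",
    "Faithfulness": "Prompt, abstention, or generation policy",
    "Citation support": "Citation attachment or validator",
    "Slice health": "Block release and debug affected slice",
}


def repair_first(gate_results):
    """Return (gate, fix) for the earliest failing gate, or None if all pass."""
    for gate, fix in FIRST_FIX.items():
        if not gate_results.get(gate, False):  # a missing result counts as a failure
            return gate, fix
    return None


# Hypothetical run: evidence gates pass, two answer-level gates fail.
results = {
    "Admissibility": True,
    "Candidate recall": True,
    "Context admission": True,
    "Faithfulness": False,
    "Citation support": False,
    "Slice health": True,
}
share_passed = sum(results.values()) / len(results)
print(f"Single quality percentage: {share_passed:.0%}")
print("Repair first:", repair_first(results))
```

The percentage reads 67% whether the failure is in retrieval or in generation; the ordered lookup points at the generation policy.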

Mastery check

You're ready to evaluate a RAG answer pipeline when you can:

Carry versioned source identity and pipeline versions from retrieval into answer evaluation.
Treat authorization, freshness, trace integrity, and case identity as hard gates before quality scores.
Separate candidate recall, selected-context recall, claim support, citation support, and answer completeness.
Diagnose the first failing layer instead of tuning every component at once.
Build regression cases that accept supported answers and block known bad behaviors.
Explain why automated judges require calibration and can't replace deterministic gates.

Evaluation rubric

Level	Evidence in submission
Foundational	Carries case ID, source identity, and pipeline versions into answer evaluation.
Applied	Separates candidate retrieval, context admission, claim support, citations, and completeness.
Strong	Rejects malformed traces and known bad answers while reporting the first failed stage.
Production-ready	Adds slice-level release gates and explains where calibrated judges may enter.

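Follow-up questions

Candidate recall is 1.0, but selected-context recall is 0.0. Should you rewrite the answer prompt first?

Every claim has a citation, but citation support is below threshold. What does this measure expose?

One supported deploy example passes while schema-migration traces fail. Can you release the pipeline?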

Common pitfalls

Evaluation starts from answer text only

Symptom: A wrong response is visible, but nobody can tell whether evidence was missing, dropped, or ignored.
Cause: The system didn't retain source identities, retrieved IDs, reranker inputs, selected context IDs, pipeline versions, and citations beside the answer.
Fix: Store and replay a full evidence trace for every labeled evaluation case.
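A stored trace only helps if it survives a round trip unchanged. The following is a minimal sketch of persisting one as JSON; `TraceRecord`, its field names, and the two helper functions are assumptions for illustration rather than the lesson's schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class TraceRecord:
    """Hypothetical stored trace; field names are assumptions for this sketch."""

    case_id: str
    retrieved_ids: Tuple[str, ...]
    rerank_input_ids: Tuple[str, ...]
    selected_context_ids: Tuple[str, ...]
    pipeline_versions: Tuple[Tuple[str, str], ...]
    citations: Tuple[str, ...]


def dump_trace(record: TraceRecord) -> str:
    """Serialize the trace next to the answer; JSON turns tuples into lists."""
    return json.dumps(asdict(record), sort_keys=True)


def load_trace(payload: str) -> TraceRecord:
    """Rebuild the immutable record so a replay sees exactly what was stored."""
    raw = json.loads(payload)
    return TraceRecord(
        case_id=raw["case_id"],
        retrieved_ids=tuple(raw["retrieved_ids"]),
        rerank_input_ids=tuple(raw["rerank_input_ids"]),
        selected_context_ids=tuple(raw["selected_context_ids"]),
        pipeline_versions=tuple(
            (name, version) for name, version in raw["pipeline_versions"]
        ),
        citations=tuple(raw["citations"]),
    )


record = TraceRecord(
    case_id="payment-freeze-deploy-001",
    retrieved_ids=("payment-service-rollback-runbook", "deploy-freeze-approval-rule"),
    rerank_input_ids=("payment-service-rollback-runbook", "deploy-freeze-approval-rule"),
    selected_context_ids=("deploy-freeze-approval-rule",),
    pipeline_versions=(("retriever", "policy-retriever-v2"), ("fusion", "rrf-k60")),
    citations=("deploy-freeze-approval-rule",),
)
print("Round trip preserved trace:", load_trace(dump_trace(record)) == record)
```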

Citation coverage is mistaken for grounding

Symptom: Responses show citations on every sentence, yet reviewers find unsupported rollback-bypass or deploy-approval promises.
Cause: Tests check that a citation exists, not that the cited selected chunk supports the claim.
Fix: Validate claim-to-source support and block unsupported policy claims.

Judge scores replace hard safety gates

Symptom: A high judge score releases an answer based on restricted or stale policy text.
Cause: Semantic quality was scored before the admission and source-version invariants were enforced.
Fix: Enforce authorization and freshness deterministically; apply judges only after evidence is admissible.
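The ordering in this fix can be made structural by never calling the judge on an inadmissible trace. The sketch below is hypothetical throughout: `release_decision`, `confident_judge`, and the 0.8 threshold are illustrative assumptions, with the stand-in judge replacing a real model call.

```python
from typing import Callable

JUDGE_THRESHOLD = 0.8  # assumed value for illustration only


def release_decision(admissible: bool, judge: Callable[[], float]) -> str:
    """Apply the deterministic evidence gate first; score only admissible traces."""
    if not admissible:
        return "blocked: inadmissible evidence"
    if judge() < JUDGE_THRESHOLD:
        return "blocked: judge score below threshold"
    return "release"


judge_calls = []


def confident_judge() -> float:
    """Stand-in judge that always returns a high score and records each call."""
    judge_calls.append("called")
    return 0.97


print(release_decision(False, confident_judge))
print("judge calls after inadmissible trace:", len(judge_calls))
print(release_decision(True, confident_judge))
```

Passing the judge as a callable means a high score cannot exist for a trace that failed admission, so there is nothing for a dashboard to average in by mistake.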

Next Step

Continue to LLM-as-a-Judge Evaluation

You can now diagnose a RAG response from its evidence trace and block unsupported claims deterministically. Next you'll automate softer semantic judgments with rubrics, calibration, and judge-failure controls.

References

[1] Manning, C. D., Raghavan, P., Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

[2] Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint.

[3] Saad-Falcon, J., et al. (2023). ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. NAACL 2024.
