LeetLLM
© 2026 LeetLLM. All rights reserved.

📊 Medium · Evaluation & Benchmarks

RAG Evaluation for Reliable Answers

Evaluate a permission-safe RAG answer trace with context, claim, citation, failure-attribution, and release gates before automating softer judgments.

15 min read
Learning path
Step 67 of 158 in the full curriculum
Previous: Reranking and Cross-Encoders for RAG · Next: LLM-as-a-Judge Evaluation

policy-answerer-v3 selects one current, permitted policy chunk: deploy-freeze-approval-rule. That's necessary evidence for Maya's reply, but it isn't proof of a reliable answer. A generator can still say the deploy may start without a rollback plan, or cite the right rule beside a claim it doesn't support.

The next step is an evaluation harness for that trace. Retrieval-augmented generation (RAG) evaluation should tell you where a response first became unsafe: evidence access, candidate retrieval, context selection, answer claims, or citations. You'll turn each boundary into an executable eval gate before the next lesson introduces large language model (LLM) judges for semantic cases that deterministic labels can't cover.

[Figure: Shared admissible RAG trace splitting into a supported deploy answer and an unsafe bypass answer; the supported path keeps every proof chip green while one unsupported bypass claim turns the unsafe path red and blocks release.]
Both answers start from the same admissible trace. The supported path keeps every proof chip green, while one invented bypass claim is enough to block release.

Carry the trace forward

Maya's question remains unchanged:

Can I deploy payment-service during the release freeze? Incident commander approval is attached, and the rollback plan is linked.

policy-answerer-v3 already established that hybrid retrieval found the correct rule and reranking selected it for context. The evaluation fixture carries that evidence path forward instead of inventing a new example.

evaluation-trace-fixture.py
```python
from __future__ import annotations

# defaultdict and replace are used by later snippets that build on this fixture.
from collections import defaultdict
from dataclasses import dataclass, replace

TARGET_ID = "deploy-freeze-approval-rule"

@dataclass(frozen=True)
class EvidenceChunk:
    chunk_id: str
    document_id: str
    parent_id: str
    version: str
    permitted: bool
    current: bool
    text: str

@dataclass(frozen=True)
class GoldCase:
    case_id: str
    question: str
    required_source_ids: frozenset[str]
    required_points: frozenset[str]

@dataclass(frozen=True)
class RagTrace:
    case_id: str
    first_stage_ids: tuple[str, ...]
    rerank_input_ids: tuple[str, ...]
    reranked_ids: tuple[str, ...]
    selected_context_ids: tuple[str, ...]
    selected_versions: tuple[str, ...]
    versions: tuple[tuple[str, str], ...]

# Versioned evidence store: the only place raw policy text lives.
EVIDENCE = {
    TARGET_ID: EvidenceChunk(
        TARGET_ID,
        "deploy-policy",
        "deploy-policy-v2",
        "deploy-policy/2026-06-01",
        True,
        True,
        (
            "Rule DEPLOY-17. Payment-service production deploys during a release "
            "freeze require incident commander approval and a linked rollback plan "
            "before rollout."
        ),
    ),
    "payment-service-rollback-runbook": EvidenceChunk(
        "payment-service-rollback-runbook",
        "payment-rollback",
        "payment-rollback-v1",
        "payment-rollback/2026-05-20",
        True,
        True,
        (
            "Payment-service rollback drills must keep the previous artifact "
            "available for 30 minutes. This runbook does not authorize freeze deploys."
        ),
    ),
    "frontend-docs-deploy-rule": EvidenceChunk(
        "frontend-docs-deploy-rule",
        "frontend-docs",
        "frontend-docs-v1",
        "frontend-docs/2026-03-01",
        True,
        True,
        "Frontend documentation deploys may ship during normal business hours without freeze approval.",
    ),
    # permitted=False: this chunk must never appear anywhere in a trace.
    "restricted-breakglass-note": EvidenceChunk(
        "restricted-breakglass-note",
        "breakglass-terms",
        "breakglass-terms",
        "breakglass/2026-05-01",
        False,
        True,
        "Payment-service deploys may bypass freeze approval during executive escalations.",
    ),
}
GOLD = GoldCase(
    case_id="payment-freeze-deploy-001",
    question=(
        "Can I deploy payment-service during the release freeze? Incident commander "
        "approval is attached, and the rollback plan is linked."
    ),
    required_source_ids=frozenset({TARGET_ID}),
    required_points=frozenset({"freeze-scope", "approval", "rollback-plan"}),
)
TRACE = RagTrace(
    case_id=GOLD.case_id,
    first_stage_ids=(
        "payment-service-rollback-runbook",
        "frontend-docs-deploy-rule",
        TARGET_ID,
    ),
    rerank_input_ids=(
        "payment-service-rollback-runbook",
        "frontend-docs-deploy-rule",
        TARGET_ID,
    ),
    reranked_ids=(TARGET_ID, "payment-service-rollback-runbook", "frontend-docs-deploy-rule"),
    selected_context_ids=(TARGET_ID,),
    selected_versions=("deploy-policy/2026-06-01",),
    versions=(
        ("retriever", "policy-retriever-v2"),
        ("index", "policy-index/2026-05-27"),
        ("sparse", "bm25-tokenizer-v1"),
        ("dense", "fixture-embeddings-v1"),
        ("fusion", "rrf-k60"),
        ("reranker", "fixture-cross-encoder-v1"),
    ),
)

selected_sources = [
    (
        chunk.chunk_id,
        chunk.document_id,
        chunk.parent_id,
        chunk.version,
    )
    for chunk in (EVIDENCE[chunk_id] for chunk_id in TRACE.selected_context_ids)
]
print("Selected context:", TRACE.selected_context_ids)
print("Selected sources:", selected_sources)
print("Pipeline versions:", dict(TRACE.versions))
assert TRACE.selected_context_ids == (TARGET_ID,)
```
Output
```text
Selected context: ('deploy-freeze-approval-rule',)
Selected sources: [('deploy-freeze-approval-rule', 'deploy-policy', 'deploy-policy-v2', 'deploy-policy/2026-06-01')]
Pipeline versions: {'retriever': 'policy-retriever-v2', 'index': 'policy-index/2026-05-27', 'sparse': 'bm25-tokenizer-v1', 'dense': 'fixture-embeddings-v1', 'fusion': 'rrf-k60', 'reranker': 'fixture-cross-encoder-v1'}
```

The trace is an evaluation input, not a log decoration. It binds a question to exact source identities and pipeline versions without storing raw policy text. During replay, the evaluator loads permitted source text from the versioned evidence store.
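The replay step can be sketched as a lookup keyed by chunk ID and version. This is a minimal sketch, not the lesson's implementation: `StoredChunk`, `EVIDENCE_STORE`, and `load_replay_context` are hypothetical names, and the store here is an in-memory dict standing in for whatever versioned storage you use.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StoredChunk:
    chunk_id: str
    version: str
    permitted: bool
    text: str


# Hypothetical versioned evidence store keyed by (chunk_id, version).
EVIDENCE_STORE = {
    ("deploy-freeze-approval-rule", "deploy-policy/2026-06-01"): StoredChunk(
        "deploy-freeze-approval-rule",
        "deploy-policy/2026-06-01",
        True,
        "Rule DEPLOY-17. Payment-service production deploys during a release "
        "freeze require incident commander approval and a linked rollback plan "
        "before rollout.",
    ),
}


def load_replay_context(
    selected_ids: tuple[str, ...], selected_versions: tuple[str, ...]
) -> list[str]:
    """Resolve trace references to source text, failing closed on any gap."""
    if len(selected_ids) != len(selected_versions):
        raise ValueError("trace must pin one version per selected chunk")
    texts = []
    for key in zip(selected_ids, selected_versions):
        chunk = EVIDENCE_STORE.get(key)
        if chunk is None:
            # Unknown chunk or a version the store no longer serves.
            raise LookupError(f"no stored chunk for {key}")
        if not chunk.permitted:
            raise PermissionError(f"chunk {key[0]} is not permitted for replay")
        texts.append(chunk.text)
    return texts


context = load_replay_context(
    ("deploy-freeze-approval-rule",), ("deploy-policy/2026-06-01",)
)
print(len(context), "chunk(s) loaded for replay")
```

Because the trace stores only identifiers, a replay against a changed policy version fails loudly instead of silently grading the answer against different text.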

Gate admissibility before scoring quality

An answer should fail immediately if restricted, stale, or unknown evidence entered retrieval, reranking, or selected context. A relevance score can't make forbidden evidence acceptable, even when a later stage happens to drop it.

admissible-evidence-path-gate.py
```python
def admissible_evidence_path(trace: RagTrace, gold: GoldCase) -> bool:
    # Structural checks: the trace must belong to this case and be complete.
    if trace.case_id != gold.case_id:
        return False
    if not trace.selected_context_ids:
        return False
    if len(trace.selected_context_ids) != len(trace.selected_versions):
        return False
    required_version_keys = {"retriever", "index", "sparse", "dense", "fusion", "reranker"}
    version_map = dict(trace.versions)
    if len(version_map) != len(trace.versions):
        return False
    if not required_version_keys.issubset(version_map):
        return False
    stage_ids = (
        trace.first_stage_ids,
        trace.rerank_input_ids,
        trace.reranked_ids,
        trace.selected_context_ids,
    )
    # Reject duplicated IDs within any stage.
    if any(len(ids) != len(set(ids)) for ids in stage_ids):
        return False
    traced_ids = tuple(chunk_id for ids in stage_ids for chunk_id in ids)
    # Reject IDs the evidence store does not know.
    if any(chunk_id not in EVIDENCE for chunk_id in traced_ids):
        return False
    traced = [EVIDENCE[chunk_id] for chunk_id in traced_ids]
    selected = [EVIDENCE[chunk_id] for chunk_id in trace.selected_context_ids]
    rerank_input_came_from_retrieval = set(trace.rerank_input_ids).issubset(
        trace.first_stage_ids
    )
    reranked_same_candidates = set(trace.reranked_ids) == set(trace.rerank_input_ids)
    returned_by_reranker = set(trace.selected_context_ids).issubset(trace.reranked_ids)
    versions_match = all(
        chunk.version == version
        for chunk, version in zip(selected, trace.selected_versions)
    )
    # Every chunk seen at any stage must be permitted and current.
    allowed_and_current = all(chunk.permitted and chunk.current for chunk in traced)
    return (
        rerank_input_came_from_retrieval
        and reranked_same_candidates
        and returned_by_reranker
        and versions_match
        and allowed_and_current
    )

restricted_trace = replace(
    TRACE,
    first_stage_ids=("restricted-breakglass-note",),
    rerank_input_ids=("restricted-breakglass-note",),
    reranked_ids=("restricted-breakglass-note",),
    selected_context_ids=("restricted-breakglass-note",),
    selected_versions=("breakglass/2026-05-01",),
)
blocked_candidate_trace = replace(
    TRACE,
    first_stage_ids=TRACE.first_stage_ids + ("restricted-breakglass-note",),
    rerank_input_ids=TRACE.rerank_input_ids + ("restricted-breakglass-note",),
    reranked_ids=TRACE.reranked_ids + ("restricted-breakglass-note",),
)
unknown_candidate_trace = replace(TRACE, first_stage_ids=TRACE.first_stage_ids + ("missing",))
stale_version_trace = replace(TRACE, selected_versions=("deploy-policy/2025-02-01",))
missing_version_trace = replace(TRACE, versions=TRACE.versions[:-1])
wrong_case_trace = replace(TRACE, case_id="payment-freeze-deploy-002")
duplicate_candidate_trace = replace(
    TRACE,
    first_stage_ids=TRACE.first_stage_ids + (TARGET_ID,),
)
print("Production path admissible:", admissible_evidence_path(TRACE, GOLD))
print("Restricted context admissible:", admissible_evidence_path(restricted_trace, GOLD))
print("Blocked candidate admissible:", admissible_evidence_path(blocked_candidate_trace, GOLD))
print("Unknown candidate admissible:", admissible_evidence_path(unknown_candidate_trace, GOLD))
print("Wrong-version path admissible:", admissible_evidence_path(stale_version_trace, GOLD))
print("Incomplete trace admissible:", admissible_evidence_path(missing_version_trace, GOLD))
print("Wrong-case trace admissible:", admissible_evidence_path(wrong_case_trace, GOLD))
print("Duplicate candidate admissible:", admissible_evidence_path(duplicate_candidate_trace, GOLD))
assert admissible_evidence_path(TRACE, GOLD)
assert not admissible_evidence_path(restricted_trace, GOLD)
assert not admissible_evidence_path(blocked_candidate_trace, GOLD)
assert not admissible_evidence_path(unknown_candidate_trace, GOLD)
assert not admissible_evidence_path(stale_version_trace, GOLD)
assert not admissible_evidence_path(missing_version_trace, GOLD)
assert not admissible_evidence_path(wrong_case_trace, GOLD)
assert not admissible_evidence_path(duplicate_candidate_trace, GOLD)
```
Output
```text
Production path admissible: True
Restricted context admissible: False
Blocked candidate admissible: False
Unknown candidate admissible: False
Wrong-version path admissible: False
Incomplete trace admissible: False
Wrong-case trace admissible: False
Duplicate candidate admissible: False
```

The case-ID check prevents labels from one replay from grading another replay. The uniqueness check rejects malformed rankings before duplicated IDs distort later metrics.
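A boolean gate says that a trace is inadmissible but not why. One possible extension, sketched here under assumptions, returns a reason per violation. `CHUNK_STATUS` and `admissibility_violations` are hypothetical names, and the sketch covers only the duplicate, unknown, restricted, and stale checks, not the case-ID, version-map, or stage-ordering checks.

```python
from __future__ import annotations

# Hypothetical ACL/freshness lookup: chunk_id -> (permitted, current).
CHUNK_STATUS = {
    "deploy-freeze-approval-rule": (True, True),
    "payment-service-rollback-runbook": (True, True),
    "restricted-breakglass-note": (False, True),
}


def admissibility_violations(stages: dict[str, tuple[str, ...]]) -> list[str]:
    """Return one reason string per violation; an empty list means no violation found."""
    reasons = []
    for stage, ids in stages.items():
        if len(ids) != len(set(ids)):
            reasons.append(f"{stage}: duplicate chunk IDs")
        for chunk_id in dict.fromkeys(ids):  # de-duplicate while keeping order
            status = CHUNK_STATUS.get(chunk_id)
            if status is None:
                reasons.append(f"{stage}: unknown chunk {chunk_id}")
                continue
            permitted, current = status
            if not permitted:
                reasons.append(f"{stage}: restricted chunk {chunk_id}")
            if not current:
                reasons.append(f"{stage}: stale chunk {chunk_id}")
    return reasons


clean = admissibility_violations(
    {
        "first_stage": ("payment-service-rollback-runbook", "deploy-freeze-approval-rule"),
        "selected_context": ("deploy-freeze-approval-rule",),
    }
)
leaky = admissibility_violations(
    {
        "first_stage": ("deploy-freeze-approval-rule", "restricted-breakglass-note", "missing"),
        "selected_context": ("deploy-freeze-approval-rule",),
    }
)
print("clean:", clean)
print("leaky:", leaky)
```

The leaky trace fails even though the selected context is clean, which matches the rule that forbidden evidence is unacceptable at any stage.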

Separate candidates from selected context

The previous two lessons already taught ranking metrics such as Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). Those metrics answer whether the best evidence appears early in a ranking.[1] At answer evaluation time, retain three simpler gates across two layers:

| Layer | Gate | Meaning on this trace |
| --- | --- | --- |
| Candidate retrieval | candidate_recall | Did hybrid search provide the required rule? |
| Selected context | context_recall | Did the rule survive reranking and admission? |
| Selected context | context_precision | Did context avoid unnecessary distractors? |

evidence-path-metrics.py
```python
def coverage(ids: tuple[str, ...], required_ids: frozenset[str]) -> float:
    # Fraction of required sources present in the given stage.
    return len(set(ids) & required_ids) / len(required_ids)

def context_precision(ids: tuple[str, ...], useful_ids: frozenset[str]) -> float:
    # Fraction of selected chunks that are labeled useful; empty context scores 0.
    return len(set(ids) & useful_ids) / len(ids) if ids else 0.0

candidate_recall = coverage(TRACE.first_stage_ids, GOLD.required_source_ids)
context_recall = coverage(TRACE.selected_context_ids, GOLD.required_source_ids)
selected_precision = context_precision(TRACE.selected_context_ids, GOLD.required_source_ids)
print(f"Candidate recall: {candidate_recall:.1f}")
print(f"Selected-context recall: {context_recall:.1f}")
print(f"Selected-context precision: {selected_precision:.1f}")
assert (candidate_recall, context_recall, selected_precision) == (1.0, 1.0, 1.0)
```
Output
```text
Candidate recall: 1.0
Selected-context recall: 1.0
Selected-context precision: 1.0
```

This result clears only the evidence path. It doesn't say what the model wrote.
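To see why precision is tracked separately from recall, consider a hypothetical variant of the trace that admits all three permitted candidates into context. The sketch below restates the precision formula on its own; `padded_context` is an illustrative value, not part of the lesson's fixture.

```python
def context_precision(ids, useful_ids):
    """Share of selected chunks that are labeled useful for the case."""
    return len(set(ids) & useful_ids) / len(ids) if ids else 0.0


required = frozenset({"deploy-freeze-approval-rule"})

# Hypothetical padded context: the required rule plus two distractors.
padded_context = (
    "deploy-freeze-approval-rule",
    "payment-service-rollback-runbook",
    "frontend-docs-deploy-rule",
)

recall = len(set(padded_context) & required) / len(required)
precision = context_precision(padded_context, required)
print(f"recall={recall:.2f} precision={precision:.2f}")
# prints: recall=1.00 precision=0.33
```

Recall stays perfect while precision drops to one in three, so a recall-only gate would not notice the distractors.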

Turn an answer into testable claims

Consider two responses produced from the same valid context:

| Response | What changed? |
| --- | --- |
| unsafe-bypass | Adds a no-rollback-plan claim that the deploy rule never states |
| supported-deploy | States only the deploy scope and conditions present in DEPLOY-17 |

For a labeled release case, represent an answer as atomic policy claims. Each claim names its cited source, the source phrases that must establish it, and the expected answer point it covers.

answer-claim-fixtures.py
```python
@dataclass(frozen=True)
class Claim:
    claim_id: str
    text: str
    citation_id: str | None
    support_phrases: tuple[str, ...]  # phrases the cited source must contain
    answer_point: str                 # required point this claim covers

@dataclass(frozen=True)
class Answer:
    answer_id: str
    claims: tuple[Claim, ...]

UNSAFE_BYPASS = Answer(
    "unsafe-bypass",
    (
        Claim(
            "freeze-scope",
            "The request is governed by the release-freeze deploy rule.",
            TARGET_ID,
            ("payment-service production deploys", "release freeze"),
            "freeze-scope",
        ),
        Claim(
            "bypass",
            "The deploy can start without a linked rollback plan.",
            TARGET_ID,
            ("without a linked rollback plan",),
            "rollback-plan",
        ),
    ),
)
SUPPORTED_DEPLOY = Answer(
    "supported-deploy",
    (
        Claim(
            "freeze-scope",
            "The request is governed by the release-freeze deploy rule.",
            TARGET_ID,
            ("payment-service production deploys", "release freeze"),
            "freeze-scope",
        ),
        Claim(
            "approval",
            "Incident commander approval is required.",
            TARGET_ID,
            ("require incident commander approval",),
            "approval",
        ),
        Claim(
            "rollback-plan",
            "A linked rollback plan is required before rollout.",
            TARGET_ID,
            ("linked rollback plan", "before rollout"),
            "rollback-plan",
        ),
    ),
)
EMPTY_ANSWER = Answer("empty", ())

print("Unsafe claims:", [claim.claim_id for claim in UNSAFE_BYPASS.claims])
print("Supported claims:", [claim.claim_id for claim in SUPPORTED_DEPLOY.claims])
print("Empty claims:", [claim.claim_id for claim in EMPTY_ANSWER.claims])
```
Output
```text
Unsafe claims: ['freeze-scope', 'bypass']
Supported claims: ['freeze-scope', 'approval', 'rollback-plan']
Empty claims: []
```

This is a golden-set contract. It's intentionally strict and inspectable. It won't recognize every correct paraphrase in live traffic; that softer matching problem belongs after you understand the release invariant.
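The strictness is easy to demonstrate. In this small sketch, `phrases_supported` is a hypothetical helper that applies case-insensitive substring matching to the DEPLOY-17 text, and the paraphrase is an invented example of wording a live model might plausibly produce.

```python
SOURCE = (
    "Rule DEPLOY-17. Payment-service production deploys during a release "
    "freeze require incident commander approval and a linked rollback plan "
    "before rollout."
)


def phrases_supported(support_phrases, source):
    """True only if every phrase appears verbatim (case-insensitive) in the source."""
    lowered = source.lower()
    return all(phrase.lower() in lowered for phrase in support_phrases)


# Exact wording from the rule: accepted.
exact = phrases_supported(("require incident commander approval",), SOURCE)

# Same meaning, different wording (illustrative paraphrase): rejected.
paraphrase = phrases_supported(("needs sign-off from the incident commander",), SOURCE)

print("exact phrase supported:", exact)
print("paraphrase supported:", paraphrase)
```

The rejection of a correct paraphrase is acceptable for a labeled regression case, because the label author controls the phrases. It would be a false negative on unlabeled live traffic.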

Faithfulness checks claims against context

Faithfulness asks whether selected context supports the answer's claims. RAGAS defines this family of evaluation by decomposing a response into claims and checking support from retrieved context.[2] For a labeled high-impact policy case, exact required phrases give a deterministic first gate:

$$\operatorname{faithfulness} = \frac{\text{claims supported by selected context}}{\text{claims in the answer}}$$

claim-faithfulness-gate.py
```python
def source_supports(claim: Claim, chunk: EvidenceChunk) -> bool:
    # Every required phrase must appear in the chunk text (case-insensitive).
    source = chunk.text.lower()
    return all(phrase.lower() in source for phrase in claim.support_phrases)

def claim_supported_by_context(claim: Claim, trace: RagTrace) -> bool:
    return any(
        source_supports(claim, EVIDENCE[chunk_id])
        for chunk_id in trace.selected_context_ids
    )

def faithfulness(answer: Answer, trace: RagTrace) -> float:
    supported = sum(
        claim_supported_by_context(claim, trace) for claim in answer.claims
    )
    # An answer with no claims scores 0.0 rather than dividing by zero.
    return supported / len(answer.claims) if answer.claims else 0.0

print(f"unsafe-bypass faithfulness: {faithfulness(UNSAFE_BYPASS, TRACE):.2f}")
print(
    "supported-deploy faithfulness: "
    f"{faithfulness(SUPPORTED_DEPLOY, TRACE):.2f}"
)
print(f"empty faithfulness: {faithfulness(EMPTY_ANSWER, TRACE):.2f}")
assert faithfulness(UNSAFE_BYPASS, TRACE) == 0.5
assert faithfulness(SUPPORTED_DEPLOY, TRACE) == 1.0
assert faithfulness(EMPTY_ANSWER, TRACE) == 0.0
```
Output
```text
unsafe-bypass faithfulness: 0.50
supported-deploy faithfulness: 1.00
empty faithfulness: 0.00
```

The unsafe response is on topic and cites a real selected rule. It still fails because one claim outruns the rule.
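A score of 0.50 doesn't say which claim failed. One way to keep that detail, sketched here with a hypothetical `claim_ledger` helper and claims reduced to `(claim_id, support_phrases)` pairs, is to record the missing phrases per claim and derive the score from the ledger.

```python
SOURCE = (
    "Rule DEPLOY-17. Payment-service production deploys during a release "
    "freeze require incident commander approval and a linked rollback plan "
    "before rollout."
)

# The two unsafe-bypass claims, reduced to (claim_id, support_phrases).
CLAIMS = (
    ("freeze-scope", ("payment-service production deploys", "release freeze")),
    ("bypass", ("without a linked rollback plan",)),
)


def claim_ledger(claims, source):
    """One row per claim, listing any required phrase the source lacks."""
    lowered = source.lower()
    ledger = []
    for claim_id, phrases in claims:
        missing = [p for p in phrases if p.lower() not in lowered]
        ledger.append(
            {"claim_id": claim_id, "supported": not missing, "missing_phrases": missing}
        )
    return ledger


ledger = claim_ledger(CLAIMS, SOURCE)
score = sum(row["supported"] for row in ledger) / len(ledger)
for row in ledger:
    print(row)
print(f"faithfulness from ledger: {score:.2f}")
```

The ledger row for `bypass` names the exact phrase the rule never states, which is the detail a reviewer needs when deciding whether to change the prompt or the generation policy.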

Citation presence isn't citation support

A citation metric needs two checks:

  1. Coverage: Does every policy claim cite a source?
  2. Support: Does the cited selected source establish that claim?

A fabricated rollback exemption can achieve perfect citation coverage by attaching the correct-looking chunk ID. That's why presence alone is a weak gate.

citation-support-gate.py
```python
# Same claims as the supported answer, but every citation points at the runbook.
MIS_CITED = Answer(
    "mis-cited",
    tuple(
        replace(claim, citation_id="payment-service-rollback-runbook")
        for claim in SUPPORTED_DEPLOY.claims
    ),
)

def citation_coverage(answer: Answer) -> float:
    cited = sum(claim.citation_id is not None for claim in answer.claims)
    return cited / len(answer.claims) if answer.claims else 0.0

def citation_support(answer: Answer, trace: RagTrace) -> float:
    supported = 0
    for claim in answer.claims:
        # A missing citation or one outside selected context can't support a claim.
        if claim.citation_id not in trace.selected_context_ids:
            continue
        if source_supports(claim, EVIDENCE[claim.citation_id]):
            supported += 1
    return supported / len(answer.claims) if answer.claims else 0.0

print(f"unsafe citation coverage: {citation_coverage(UNSAFE_BYPASS):.2f}")
print(f"unsafe citation support: {citation_support(UNSAFE_BYPASS, TRACE):.2f}")
print(f"mis-cited answer faithfulness: {faithfulness(MIS_CITED, TRACE):.2f}")
print(f"mis-cited citation support: {citation_support(MIS_CITED, TRACE):.2f}")
assert citation_coverage(UNSAFE_BYPASS) == 1.0
assert citation_support(UNSAFE_BYPASS, TRACE) == 0.5
assert faithfulness(MIS_CITED, TRACE) == 1.0
assert citation_support(MIS_CITED, TRACE) == 0.0
```
Output
```text
unsafe citation coverage: 1.00
unsafe citation support: 0.50
mis-cited answer faithfulness: 1.00
mis-cited citation support: 0.00
```

In the mis-cited response, selected context could support every sentence, so faithfulness is high. Its citations still fail because they don't point to the selected source that proves those sentences.
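The distinct ways a citation can fail can also be labeled per claim. This is a sketch under assumptions: `citation_status`, `SOURCES`, and the status strings are hypothetical, and support is still decided by verbatim phrase matching.

```python
from __future__ import annotations

SOURCES = {
    "deploy-freeze-approval-rule": (
        "Payment-service production deploys during a release freeze require "
        "incident commander approval and a linked rollback plan before rollout."
    ),
    "payment-service-rollback-runbook": (
        "Payment-service rollback drills must keep the previous artifact "
        "available for 30 minutes. This runbook does not authorize freeze deploys."
    ),
}
SELECTED = ("deploy-freeze-approval-rule",)


def citation_status(
    citation_id: str | None,
    phrases: tuple[str, ...],
    selected_ids: tuple[str, ...] = SELECTED,
) -> str:
    """Classify one claim's citation; check order mirrors the two checks above."""
    if citation_id is None:
        return "missing citation"
    if citation_id not in selected_ids:
        return "cited source not in selected context"
    text = SOURCES[citation_id].lower()
    if not all(p.lower() in text for p in phrases):
        return "cited source does not support claim"
    return "ok"


approval = ("require incident commander approval",)
statuses = [
    citation_status(None, approval),
    citation_status("payment-service-rollback-runbook", approval),
    citation_status("deploy-freeze-approval-rule", ("without a linked rollback plan",)),
    citation_status("deploy-freeze-approval-rule", approval),
]
for status in statuses:
    print(status)
```

The second status corresponds to the mis-cited answer and the third to the invented bypass claim, so each failure points at a different repair.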

Check whether the answer finished the task

A faithful response can still omit a required condition. For this golden case, Maya needs three answer points: deploy scope, approval, and rollback plan. Treat point coverage as a labeled completeness gate:

answer-point-coverage.py
```python
def supported_point_coverage(answer: Answer, trace: RagTrace, gold: GoldCase) -> float:
    # Only claims that selected context supports count toward a required point.
    supported_points = {
        claim.answer_point
        for claim in answer.claims
        if claim_supported_by_context(claim, trace)
    }
    return len(supported_points & gold.required_points) / len(gold.required_points)

unsafe_coverage = supported_point_coverage(UNSAFE_BYPASS, TRACE, GOLD)
supported_coverage = supported_point_coverage(SUPPORTED_DEPLOY, TRACE, GOLD)
print(f"unsafe supported point coverage: {unsafe_coverage:.2f}")
print(f"supported answer point coverage: {supported_coverage:.2f}")
assert unsafe_coverage == 1 / 3
assert supported_coverage == 1.0
```
Output
```text
unsafe supported point coverage: 0.33
supported answer point coverage: 1.00
```

This metric is stronger than asking whether an answer sounds relevant. It knows exactly which policy conditions this regression case must preserve. On live questions without labels, you'll need sampled human review or a calibrated judge, which is the next lesson's problem.
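A coverage of 0.33 is more actionable when the report also names the missing points. The helper below is a hypothetical sketch; `supported_points` is written out by hand to match what unsafe-bypass supports, since its only supported claim covers `freeze-scope`.

```python
REQUIRED_POINTS = frozenset({"freeze-scope", "approval", "rollback-plan"})


def missing_points(supported_points, required_points=REQUIRED_POINTS):
    """Required answer points that no supported claim covers, sorted for stable output."""
    return sorted(required_points - set(supported_points))


# unsafe-bypass: only the freeze-scope claim is supported by context.
unsafe_missing = missing_points({"freeze-scope"})
# supported-deploy: all three points are covered.
supported_missing = missing_points({"freeze-scope", "approval", "rollback-plan"})

print("unsafe-bypass missing:", unsafe_missing)
print("supported-deploy missing:", supported_missing)
# prints: unsafe-bypass missing: ['approval', 'rollback-plan']
# prints: supported-deploy missing: []
```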

Attribute the first failure

One trace can fail at several layers after the first defect. If selected context drops the target rule, later unsupported claims are consequences, not the first fix. Diagnose in pipeline order.

[Figure: Gate-by-gate diagnosis flow for one RAG trace: missing candidate stops at retrieval, dropped context stops at selection, empty answer stops at claims, invented bypass stops at faithfulness, wrong source stops at citation, and the supported answer clears every gate.]
Every variant shares one gate path. The first red branch owns the repair, and the supported answer shows full pass-through.
first-failed-stage.py
```python
# The target rule never reaches the candidate list.
RETRIEVAL_MISS = replace(
    TRACE,
    first_stage_ids=("payment-service-rollback-runbook", "frontend-docs-deploy-rule"),
    rerank_input_ids=("payment-service-rollback-runbook", "frontend-docs-deploy-rule"),
    reranked_ids=("payment-service-rollback-runbook", "frontend-docs-deploy-rule"),
    selected_context_ids=("payment-service-rollback-runbook",),
    selected_versions=("payment-rollback/2026-05-20",),
)
# The target rule is retrieved and reranked but not selected for context.
SELECTION_MISS = replace(
    TRACE,
    selected_context_ids=("payment-service-rollback-runbook",),
    selected_versions=("payment-rollback/2026-05-20",),
)

def first_failed_stage(trace: RagTrace, answer: Answer, gold: GoldCase) -> str:
    # Checks run in pipeline order so the earliest defect is reported.
    if not admissible_evidence_path(trace, gold):
        return "admissibility"
    if coverage(trace.first_stage_ids, gold.required_source_ids) < 1.0:
        return "candidate retrieval"
    if coverage(trace.selected_context_ids, gold.required_source_ids) < 1.0:
        return "context selection"
    if not answer.claims:
        return "answer completeness"
    if faithfulness(answer, trace) < 1.0:
        return "answer faithfulness"
    if citation_support(answer, trace) < 1.0:
        return "citation support"
    if supported_point_coverage(answer, trace, gold) < 1.0:
        return "answer completeness"
    return "pass"

diagnoses = {
    "missing candidate": first_failed_stage(RETRIEVAL_MISS, SUPPORTED_DEPLOY, GOLD),
    "dropped context": first_failed_stage(SELECTION_MISS, SUPPORTED_DEPLOY, GOLD),
    "invented bypass": first_failed_stage(TRACE, UNSAFE_BYPASS, GOLD),
    "wrong citation": first_failed_stage(TRACE, MIS_CITED, GOLD),
    "empty answer": first_failed_stage(TRACE, EMPTY_ANSWER, GOLD),
    "supported answer": first_failed_stage(TRACE, SUPPORTED_DEPLOY, GOLD),
}
for variant, stage in diagnoses.items():
    print(f"{variant}: {stage}")
assert diagnoses["missing candidate"] == "candidate retrieval"
assert diagnoses["dropped context"] == "context selection"
assert diagnoses["invented bypass"] == "answer faithfulness"
assert diagnoses["wrong citation"] == "citation support"
assert diagnoses["empty answer"] == "answer completeness"
assert diagnoses["supported answer"] == "pass"
```
Output
```text
missing candidate: candidate retrieval
dropped context: context selection
invented bypass: answer faithfulness
wrong citation: citation support
empty answer: answer completeness
supported answer: pass
```

The ordering is important. If the required rule never reaches context, a claim-support failure doesn't justify tuning the generation prompt yet.

Make bad behaviors part of release testing

Evaluation should show both that a supported answer is accepted and that known bad answers are blocked. These variants become a tiny regression suite:

answer-release-regression.py
```python
@dataclass(frozen=True)
class ReleaseCase:
    name: str
    trace: RagTrace
    answer: Answer
    should_release: bool

def can_release(trace: RagTrace, answer: Answer, gold: GoldCase) -> bool:
    return first_failed_stage(trace, answer, gold) == "pass"

RELEASE_CASES = (
    ReleaseCase("supported answer", TRACE, SUPPORTED_DEPLOY, True),
    ReleaseCase("unsupported bypass", TRACE, UNSAFE_BYPASS, False),
    ReleaseCase("wrong citation", TRACE, MIS_CITED, False),
    ReleaseCase("empty answer", TRACE, EMPTY_ANSWER, False),
    ReleaseCase("dropped source", SELECTION_MISS, SUPPORTED_DEPLOY, False),
)
regression_passes = 0
for case in RELEASE_CASES:
    observed = can_release(case.trace, case.answer, GOLD)
    # A check passes when the gate's decision matches the labeled expectation.
    regression_passes += observed == case.should_release
    print(f"{case.name}: release={observed}, expected={case.should_release}")
print(f"Regression checks passed: {regression_passes}/{len(RELEASE_CASES)}")
assert regression_passes == len(RELEASE_CASES)
```
Output
```text
supported answer: release=True, expected=True
unsupported bypass: release=False, expected=False
wrong citation: release=False, expected=False
empty answer: release=False, expected=False
dropped source: release=False, expected=False
Regression checks passed: 5/5
```

Release behavior is now concrete: pass the supported response and reject the four controlled defects.

Slice results before trusting an average

A larger golden set should include release-freeze deploys, incident hotfixes, schema migrations, and data backfills. Aggregate quality can remain high while a costly policy slice regresses.

slice-level-report.py
```python
@dataclass(frozen=True)
class LabeledOutcome:
    workflow: str
    released_correctly: bool

OUTCOMES = (
    LabeledOutcome("release-freeze", True),
    LabeledOutcome("release-freeze", False),
    LabeledOutcome("incident-hotfix", True),
    LabeledOutcome("incident-hotfix", True),
    LabeledOutcome("schema-migration", False),
)
# Group outcomes by workflow so each slice gets its own pass rate.
by_workflow: dict[str, list[bool]] = defaultdict(list)
for outcome in OUTCOMES:
    by_workflow[outcome.workflow].append(outcome.released_correctly)

overall = sum(outcome.released_correctly for outcome in OUTCOMES) / len(OUTCOMES)
print(f"Overall pass rate: {overall:.0%}")
for workflow, results in sorted(by_workflow.items()):
    print(f"{workflow}: {sum(results) / len(results):.0%}")
assert overall == 0.6
assert sum(by_workflow["schema-migration"]) == 0
```
Output
```text
Overall pass rate: 60%
incident-hotfix: 100%
release-freeze: 50%
schema-migration: 0%
```
release-report.py
1release_report = { 2 "service_version": "policy-answerer-v4-eval", 3 "source_trace": TRACE.case_id, 4 "required_policy_version": EVIDENCE[TARGET_ID].version, 5 "checks": { 6 "admissible_evidence_path": admissible_evidence_path(TRACE, GOLD), 7 "supported_answer_released": can_release(TRACE, SUPPORTED_DEPLOY, GOLD), 8 "known_bad_answers_blocked": all( 9 not can_release(case.trace, case.answer, GOLD) 10 for case in RELEASE_CASES 11 if not case.should_release 12 ), 13 "slice_regression_clear": all( 14 sum(results) / len(results) >= 0.95 15 for results in by_workflow.values() 16 ), 17 }, 18} 19print("Release checks:", release_report["checks"]) 20print("Release allowed:", all(release_report["checks"].values())) 21assert not all(release_report["checks"].values())

Output

    Release checks: {'admissible_evidence_path': True, 'supported_answer_released': True, 'known_bad_answers_blocked': True, 'slice_regression_clear': False}
    Release allowed: False

The supported single trace passes, and bad-answer detection works. The broader fixture still blocks release because two slices fall below the 95% threshold: release-freeze at 50% and schema-migration at 0%. This is the correct result: a demonstration case can't overrule a failing workflow.
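
To make the blocking slices visible in the report rather than only a single boolean, a minimal sketch could list every workflow under the threshold. The `failing_slices` helper and the `MIN_SLICE_PASS_RATE` name are assumptions for illustration; the 0.95 threshold and the outcome pairs mirror the fixture above.

```python
from collections import defaultdict

# Hypothetical constant name; 0.95 matches the slice gate in the release report.
MIN_SLICE_PASS_RATE = 0.95

# Same (workflow, released_correctly) pairs as the slice-level fixture.
OUTCOMES = [
    ("release-freeze", True),
    ("release-freeze", False),
    ("incident-hotfix", True),
    ("incident-hotfix", True),
    ("schema-migration", False),
]


def failing_slices(outcomes, threshold=MIN_SLICE_PASS_RATE):
    """Return {workflow: pass_rate} for every slice below the threshold."""
    by_workflow = defaultdict(list)
    for workflow, passed in outcomes:
        by_workflow[workflow].append(passed)
    rates = {name: sum(r) / len(r) for name, r in by_workflow.items()}
    return {name: rate for name, rate in sorted(rates.items()) if rate < threshold}


blocked = failing_slices(OUTCOMES)
for workflow, rate in blocked.items():
    print(f"BLOCKED {workflow}: {rate:.0%} < {MIN_SLICE_PASS_RATE:.0%}")
```

Returning the failing slices with their rates gives the engineer the affected workflows directly, instead of a lone `slice_regression_clear: False`.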

Where automated judges enter

The harness used exact labels because policy release cases should be unambiguous. At scale, operators paraphrase questions and models paraphrase valid answers. RAGAS presents reference-free RAG metrics including faithfulness and answer relevance; ARES proposes learned judges for context relevance, answer faithfulness, and answer relevance.[2][3]

Those approaches don't remove the need for this trace design. An automated judge still needs:

| Input | Why the judge needs it |
| --- | --- |
| Question and expected workflow | Decide whether the answer addressed the request |
| Admitted context and version | Decide whether claims are grounded in allowed evidence |
| Atomic claims and citations | Explain which claim failed and which source was cited |
| Human-reviewed calibration set | Detect judge mistakes and drift |
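
One way to keep those inputs together is a single record handed to the judge, plus an agreement check against the human-reviewed set. This is a sketch under assumptions: `JudgeInput`, `judge_agreement`, the field names, the sample question, and the case IDs are hypothetical and are not part of RAGAS or ARES.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeInput:
    """Hypothetical record; field names are illustrative, not a library schema."""
    question: str
    expected_workflow: str
    context_ids: tuple[str, ...]       # admitted context chunk IDs
    context_versions: tuple[str, ...]  # one version per admitted chunk
    claims: tuple[str, ...]            # atomic claims extracted from the answer
    citations: tuple[str, ...]         # cited source ID for each claim


def judge_agreement(judge_labels: dict[str, bool], human_labels: dict[str, bool]) -> float:
    """Fraction of human-reviewed cases where the judge gives the same verdict."""
    shared = judge_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no overlapping cases to calibrate on")
    return sum(judge_labels[c] == human_labels[c] for c in shared) / len(shared)


example = JudgeInput(
    question="Can payment-service deploy during the release freeze?",
    expected_workflow="release-freeze",
    context_ids=("deploy-freeze-approval-rule",),
    context_versions=("deploy-policy/2026-06-01",),
    claims=("Incident commander approval is required.",),
    citations=("deploy-freeze-approval-rule",),
)

# Made-up calibration labels: the judge disagrees with reviewers on case-2.
human = {"case-1": True, "case-2": False, "case-3": True, "case-4": False}
judge = {"case-1": True, "case-2": True, "case-3": True, "case-4": False}
agreement = judge_agreement(judge, human)
print(f"Judge agrees with reviewers on {agreement:.0%} of calibration cases")
```

Tracking this agreement over time is one simple way to notice judge drift; a real calibration set would also be sliced by workflow.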

The next lesson will replace selected deterministic judgments with rubric-driven LLM judgments while keeping hard evidence and citation gates visible.

Production checklist

Before releasing a RAG answer pipeline:

| Gate | Minimum evidence | First fix when it fails |
| --- | --- | --- |
| Admissibility | Retrieved, reranked, and selected chunk IDs; versions; ACL/freshness decision | Evidence filtering |
| Candidate recall | Required chunk in retrieved candidates | Chunking, query, or retrieval lane |
| Context admission | Required chunk in selected context, no noise padding | Reranker or context gate |
| Faithfulness | Supported claim ledger | Prompt, abstention, or generation policy |
| Citation support | Claim-to-source validation | Citation attachment or validator |
| Slice health | Workflow-level regression report | Block release and debug affected slice |

Don't compress these gates into one quality percentage. A dashboard should tell an engineer what to repair first.
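
A minimal sketch of that idea: evaluate the gates in pipeline order and report only the earliest failure together with its first fix. The gate names and fixes come from the checklist; the `first_failed_gate` helper, the snake_case keys, and the sample results are assumptions for illustration.

```python
# Gates in pipeline order, each paired with the first fix from the checklist.
GATES = (
    ("admissibility", "Evidence filtering"),
    ("candidate_recall", "Chunking, query, or retrieval lane"),
    ("context_admission", "Reranker or context gate"),
    ("faithfulness", "Prompt, abstention, or generation policy"),
    ("citation_support", "Citation attachment or validator"),
    ("slice_health", "Block release and debug affected slice"),
)


def first_failed_gate(results: dict[str, bool]):
    """Return (gate, first_fix) for the earliest failing gate, or None if all pass.

    A gate missing from `results` counts as failed, so an incomplete
    report can't be released by accident.
    """
    for gate, fix in GATES:
        if not results.get(gate, False):
            return gate, fix
    return None


# Hypothetical results: two gates fail, but only the earliest is reported.
results = {
    "admissibility": True,
    "candidate_recall": True,
    "context_admission": False,
    "faithfulness": True,
    "citation_support": False,
    "slice_health": True,
}
failure = first_failed_gate(results)
print("Repair first:", failure)
```

Reporting the earliest failure reflects the ordering of the table: a citation problem downstream of a dropped chunk is usually not worth debugging until context admission is fixed.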

Mastery check

You're ready to evaluate a RAG answer pipeline when you can:

  • Carry versioned source identity and pipeline versions from retrieval into answer evaluation.
  • Treat authorization, freshness, trace integrity, and case identity as hard gates before quality scores.
  • Separate candidate recall, selected-context recall, claim support, citation support, and answer completeness.
  • Diagnose the first failing layer instead of tuning every component at once.
  • Build regression cases that accept supported answers and block known bad behaviors.
  • Explain why automated judges require calibration and can't replace deterministic gates.

Evaluation rubric

| Level | Evidence in submission |
| --- | --- |
| Foundational | Carries case ID, source identity, and pipeline versions into answer evaluation. |
| Applied | Separates candidate retrieval, context admission, claim support, citations, and completeness. |
| Strong | Rejects malformed traces and known bad answers while reporting the first failed stage. |
| Production-ready | Adds slice-level release gates and explains where calibrated judges may enter. |

Common pitfalls

Evaluation starts from answer text only

  • Symptom: A wrong response is visible, but nobody can tell whether evidence was missing, dropped, or ignored.
  • Cause: The system didn't retain source identities, retrieved IDs, reranker inputs, selected context IDs, pipeline versions, and citations beside the answer.
  • Fix: Store and replay a full evidence trace for every labeled evaluation case.
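
As a rough sketch of that fix, the trace can be a single serializable record that is written next to each labeled case and reloaded for replay. `EvidenceTrace`, `save_trace`, `load_trace`, the `reranked_ids` and `pipeline_versions` fields, and the sample values are assumptions for illustration, not the lesson's actual trace type.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvidenceTrace:
    """Hypothetical trace record holding what the answer was built from."""
    case_id: str
    first_stage_ids: tuple[str, ...]       # retrieved candidates
    reranked_ids: tuple[str, ...]          # reranker output order
    selected_context_ids: tuple[str, ...]  # chunks admitted to the prompt
    selected_versions: tuple[str, ...]     # source version per selected chunk
    pipeline_versions: dict[str, str]      # e.g. retriever and reranker versions
    answer: str
    citations: tuple[str, ...]


def save_trace(trace: EvidenceTrace) -> str:
    """Serialize the trace to JSON for storage beside the labeled case."""
    return json.dumps(asdict(trace), sort_keys=True)


def load_trace(payload: str) -> EvidenceTrace:
    """Rebuild the trace; JSON arrays are converted back to tuples."""
    raw = json.loads(payload)
    fields = {k: tuple(v) if isinstance(v, list) else v for k, v in raw.items()}
    return EvidenceTrace(**fields)


trace = EvidenceTrace(
    case_id="case-1",
    first_stage_ids=("deploy-freeze-approval-rule", "payment-service-rollback-runbook"),
    reranked_ids=("deploy-freeze-approval-rule", "payment-service-rollback-runbook"),
    selected_context_ids=("deploy-freeze-approval-rule",),
    selected_versions=("deploy-policy/2026-06-01",),
    pipeline_versions={"retriever": "v1", "reranker": "v1"},
    answer="Incident commander approval and a linked rollback plan are required.",
    citations=("deploy-freeze-approval-rule",),
)
replayed = load_trace(save_trace(trace))
print("Round trip preserved trace:", replayed == trace)
```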

Citation coverage is mistaken for grounding

  • Symptom: Responses show citations on every sentence, yet reviewers find unsupported rollback-bypass or deploy-approval promises.
  • Cause: Tests check that a citation exists, not that the cited selected chunk supports the claim.
  • Fix: Validate claim-to-source support and block unsupported policy claims.
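
The gap between the two measurements can be sketched in a few lines. `Claim`, `citation_coverage`, and `citation_support` are hypothetical names, and the `supported_by` labels are assumed to come from human review of the selected context.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    text: str
    cited_id: Optional[str]  # None when the claim carries no citation
    supported_by: frozenset  # labeled chunk IDs that actually support the claim


def citation_coverage(claims) -> float:
    """Share of claims that carry any citation ID."""
    if not claims:
        return 0.0
    return sum(c.cited_id is not None for c in claims) / len(claims)


def citation_support(claims) -> float:
    """Share of claims whose cited chunk is one that supports them."""
    if not claims:
        return 0.0
    return sum(c.cited_id in c.supported_by for c in claims) / len(claims)


claims = [
    Claim(
        "The request is governed by the release-freeze deploy rule.",
        cited_id="DEPLOY-17",
        supported_by=frozenset({"DEPLOY-17"}),
    ),
    Claim(
        "The deploy can start without a linked rollback plan.",
        cited_id="DEPLOY-17",
        supported_by=frozenset(),  # no selected chunk supports this claim
    ),
]
coverage = citation_coverage(claims)
support = citation_support(claims)
print(f"coverage={coverage:.1f} support={support:.1f}")
```

Both claims cite a source, so coverage is perfect, yet only one citation supports its claim. A test that checks coverage alone would pass this answer.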

Judge scores replace hard safety gates

  • Symptom: A high judge score releases an answer based on restricted or stale policy text.
  • Cause: Semantic quality was evaluated before admission and source-version invariants.
  • Fix: Enforce authorization and freshness deterministically; apply judges only after evidence is admissible.
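
The ordering in that fix can be sketched as a release function that never calls the judge on inadmissible evidence. `Evidence`, `release_decision`, the 0.8 score threshold, and the `break-glass-procedure` chunk ID are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Evidence:
    chunk_id: str
    authorized: bool  # ACL decision for the requesting user
    current: bool     # True when the chunk is the current policy version


def release_decision(evidence, judge: Callable[[], float], min_score: float = 0.8):
    """Apply deterministic gates first; call the judge only on admissible evidence."""
    if not evidence:
        return False, "blocked: no evidence"
    for item in evidence:
        if not item.authorized:
            return False, f"blocked: {item.chunk_id} is restricted"
        if not item.current:
            return False, f"blocked: {item.chunk_id} is stale"
    score = judge()
    return score >= min_score, f"judge score {score:.2f}"


judge_calls = []


def confident_judge() -> float:
    """Stand-in judge that always returns a high score and records each call."""
    judge_calls.append(1)
    return 0.97


evidence = [
    Evidence("deploy-freeze-approval-rule", authorized=True, current=True),
    Evidence("break-glass-procedure", authorized=False, current=True),
]
allowed, reason = release_decision(evidence, confident_judge)
print(allowed, reason, "| judge calls:", len(judge_calls))
```

Because the restricted chunk fails the gate, the high-scoring judge is never consulted, so a semantic score cannot release the answer.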

Mastery Check

1. During replay, a trace has the correct case ID and selected_context_ids ('deploy-freeze-approval-rule'), but selected_versions says 'deploy-policy/2025-02-01' while the evidence store has 'deploy-policy/2026-06-01'. The trace also omits the reranker version key. What should the evaluator do before scoring quality?
2. A trace retrieved and reranked a restricted break-glass chunk, then selected only the permitted current deploy rule for final context. How should the admissibility gate handle the trace before computing relevance?
3. A labeled RAG trace shows candidate recall of 1.0 because the required rule was in first_stage_ids, but selected-context recall of 0.0 because selected_context_ids omitted it. What should the first repair target be?
4. An admissible trace has required_source_ids {'deploy-freeze-approval-rule'} and selected_context_ids ('deploy-freeze-approval-rule', 'payment-service-rollback-runbook', 'frontend-docs-deploy-rule'). What are selected-context recall and context precision?
5. The selected context contains only DEPLOY-17, which says payment-service production deploys during a release freeze require incident commander approval and a linked rollback plan. An answer makes two claims: 'The request is governed by the release-freeze deploy rule' and 'The deploy can start without a linked rollback plan.' What faithfulness score does the deterministic claim gate assign?
6. An answer has citation coverage of 1.0 because every policy claim has a citation ID, but citation support is 0.5. What does that expose?
7. In a labeled policy case, the answer must cover the deploy scope, approval, and rollback plan. The trace is admissible, but the model emits no policy claims. What should the release gate do?
8. A release suite currently checks only that the supported deploy answer is accepted. Which addition most directly tests the release invariant against known bad behavior?
9. A release report says the supported deploy case passes and all known bad answer variants are blocked. The slice report still shows schema-migration at 0% pass rate. Can the pipeline be released?
10. Exact phrase labels are too brittle for paraphrased live answers, so you add an LLM judge for faithfulness and answer relevance. What design should remain in place?


Next Step
Continue to LLM-as-a-Judge Evaluation

You can now diagnose a RAG response from its evidence trace and block unsupported claims deterministically. Next you'll automate softer semantic judgments with rubrics, calibration, and judge-failure controls.

References

[1] Manning, C. D., Raghavan, P., Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

[2] Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint.

[3] Saad-Falcon, J., et al. (2023). ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. NAACL 2024.