
Hallucination Detection & Mitigation

Build a claim-level grounding gate for delivery updates that verifies evidence, catches confident fabrication, abstains safely, and records release traces.


The fairness lesson blocked a soft evaluator when it could route equivalent customer requests differently. Factual reliability needs an equally strict boundary: a fluent large language model (LLM) answer can't invent a carrier event just because the customer would like certainty.

ShopFlow's next ticket asks, "Where is order #A10234?" The admitted FastShip record says the parcel departed a regional hub on May 26 at 08:14 UTC. It contains no delivery estimate. A draft answer that adds "It will arrive on May 28" may sound helpful, but the system has no source for that promise.

This lesson builds tracking-answerer-v1, a small serving gate. It turns an answer into atomic claims, checks each claim against versioned evidence, routes unsupported details to abstention, and records why release remains blocked. For a customer-facing grounded answer, "not present in admitted evidence" is enough reason not to serve a factual claim. It doesn't prove the claim is false in the wider world.

Three verdicts are enough to start

Hallucination taxonomies can become abstract quickly. For a retrieved delivery record, begin with the source relationship:

| Verdict | Meaning in this product | Delivery-update example | Customer action |
| --- | --- | --- | --- |
| Supported | Admitted source states the fact | "Last scan: departed regional hub on May 26." | May be served |
| Not supported | Admitted source doesn't establish the fact | "Delivery is expected May 28." | Remove or abstain |
| Contradicted | Admitted source states an incompatible fact | "Your order has been delivered." while status is in transit | Block and investigate |

Surveys of LLM hallucination distinguish statements that conflict with supplied context from statements that add information not established by it.[1] In an open-world setting, an unsupported statement might later turn out true. In this answer path it's still unsafe, because the product promised evidence-grounded delivery updates.


Figure 1: A grounded delivery answer is served only when every factual claim has an admitted supporting source.

[Figure: A delivery-answer grounding gate that checks draft claims against a versioned FastShip record and abstains when an invented ETA lacks evidence.]
The carrier record supports a scan event, not a delivery estimate. Verification blocks the invented ETA before a confident sentence reaches the customer.

Put the evidence in code

The carrier record isn't prose decoration. It's the authority the answer must obey. Our first cell records its source ID and version alongside the candidate claims.

versioned-carrier-evidence.py

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

@dataclass(frozen=True)
class Evidence:
    source_id: str
    version: str
    facts: dict[str, str]

@dataclass(frozen=True)
class Claim:
    claim_id: str
    text: str
    field: str
    value: str
    citation_id: str

# The admitted carrier record. Note that it has no delivery_eta field.
tracking = Evidence(
    source_id="fastship-A10234",
    version="scan-feed/2026-05-27T10:00:00Z",
    facts={
        "carrier": "FastShip",
        "status": "in transit",
        "last_scan": "departed regional hub",
        "last_scan_at": "May 26 at 08:14 UTC",
    },
)
sources = {tracking.source_id: tracking}

# Atomic claims taken from the draft answer; the last one is the invented ETA.
draft_claims = [
    Claim("carrier", "Carrier: FastShip.", "carrier", "FastShip", tracking.source_id),
    Claim("scan_place", "Last scan: departed regional hub.", "last_scan", "departed regional hub", tracking.source_id),
    Claim("scan_time", "Scan time: May 26 at 08:14 UTC.", "last_scan_at", "May 26 at 08:14 UTC", tracking.source_id),
    Claim("eta", "Expected delivery is May 28.", "delivery_eta", "May 28", tracking.source_id),
]

print(f"Evidence version: {tracking.version}")
print(f"Available facts: {sorted(tracking.facts)}")
print(f"Draft factual claims: {len(draft_claims)}")
```

Output

```text
Evidence version: scan-feed/2026-05-27T10:00:00Z
Available facts: ['carrier', 'last_scan', 'last_scan_at', 'status']
Draft factual claims: 4
```

An LLM can write the ETA sentence because it has seen delivery-date patterns. That doesn't make the sentence admissible. The admitted evidence has no delivery_eta field.

Verify each atomic claim

FActScore popularized evaluating long-form generations as atomic facts rather than one answer-level impression.[2] The verifier below uses the same reasoning on a small operational record: a claim is either supported by its cited source, missing from that source, contradicted by a different value, or missing an admitted source entirely.

claim-verdicts.py

```python
class Verdict(str, Enum):
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    CONTRADICTED = "contradicted"
    NO_SOURCE = "no_source"

@dataclass(frozen=True)
class Verification:
    claim: Claim
    verdict: Verdict
    evidence_version: str | None

def verify_claim(claim: Claim) -> Verification:
    # The cited source must be admitted before any field is compared.
    source = sources.get(claim.citation_id)
    if source is None:
        return Verification(claim, Verdict.NO_SOURCE, None)
    # A field absent from the source means the claim is unsupported, not false.
    expected = source.facts.get(claim.field)
    if expected is None:
        return Verification(claim, Verdict.NOT_SUPPORTED, source.version)
    # A field present with a different value is a contradiction.
    if expected != claim.value:
        return Verification(claim, Verdict.CONTRADICTED, source.version)
    return Verification(claim, Verdict.SUPPORTED, source.version)

draft_verdicts = [verify_claim(claim) for claim in draft_claims]
delivered_claim = Claim(
    "delivered",
    "Order #A10234 has been delivered.",
    "status",
    "delivered",
    tracking.source_id,
)

assert [item.verdict for item in draft_verdicts] == [
    Verdict.SUPPORTED,
    Verdict.SUPPORTED,
    Verdict.SUPPORTED,
    Verdict.NOT_SUPPORTED,
]
assert verify_claim(delivered_claim).verdict == Verdict.CONTRADICTED

for result in draft_verdicts + [verify_claim(delivered_claim)]:
    print(f"{result.claim.claim_id:10} {result.verdict.value:15} {result.claim.text}")
```

Output

```text
carrier    supported       Carrier: FastShip.
scan_place supported       Last scan: departed regional hub.
scan_time  supported       Scan time: May 26 at 08:14 UTC.
eta        not_supported   Expected delivery is May 28.
delivered  contradicted    Order #A10234 has been delivered.
```

The distinction matters. The ETA isn't proven false; it's simply absent from this source. The delivered statement is worse: it conflicts with status=in transit.

Route the response, not only the score

A detector that writes a warning beside an unsafe response has not protected the customer. Once any factual claim fails, tracking-answerer-v1 keeps supported information and replaces the invented promise with a bounded statement.

safe-answer-route.py

```python
def safe_answer(claims: list[Claim]) -> dict[str, object]:
    verdicts = [verify_claim(claim) for claim in claims]
    failures = [item for item in verdicts if item.verdict != Verdict.SUPPORTED]
    # Only supported claims are ever copied into the customer answer.
    supported_text = " ".join(
        item.claim.text for item in verdicts if item.verdict == Verdict.SUPPORTED
    )
    if failures:
        return {
            "route": "abstain_on_missing_detail",
            "answer": supported_text + " The carrier record does not provide a delivery estimate yet.",
            "blocked_claims": [item.claim.claim_id for item in failures],
        }
    return {
        "route": "serve",
        "answer": supported_text,
        "blocked_claims": [],
    }

decision = safe_answer(draft_claims)
assert decision["route"] == "abstain_on_missing_detail"
assert decision["blocked_claims"] == ["eta"]
assert "May 28" not in str(decision["answer"])

print(f"Route: {decision['route']}")
print(f"Blocked claims: {decision['blocked_claims']}")
print(f"Customer answer: {decision['answer']}")
```

Output

```text
Route: abstain_on_missing_detail
Blocked claims: ['eta']
Customer answer: Carrier: FastShip. Last scan: departed regional hub. Scan time: May 26 at 08:14 UTC. The carrier record does not provide a delivery estimate yet.
```

Measure failure before choosing mitigation

A single blocked ETA gives a regression case, not a release metric. Build a small suite of four cases: one clean scan summary as a control and three failure modes (an unsupported ETA, a contradiction, and a missing source).

[Figure: Four delivery-answer regression cases mapped to supported, unsupported, contradicted, and missing-source claim verdicts.]
Atomic verdicts locate the failure. Unsupported and contradicted claims both leave the customer path, but they call for different investigation.

grounding-regression-suite.py

```python
@dataclass(frozen=True)
class AnswerCase:
    case_id: str
    claims: list[Claim]

# One control case and three failure modes.
cases = [
    AnswerCase("clean_scan", draft_claims[:3]),
    AnswerCase("invented_eta", draft_claims),
    AnswerCase("wrong_status", [delivered_claim]),
    AnswerCase(
        "unadmitted_source",
        [Claim("carrier", "FastShip is carrying order #B404.", "carrier", "FastShip", "missing-feed")],
    ),
]

def has_unsafe_claim(case: AnswerCase) -> bool:
    return any(verify_claim(claim).verdict != Verdict.SUPPORTED for claim in case.claims)

# Baseline: how many answers would be unsafe if every draft shipped unchanged.
baseline_served_unsafe = sum(has_unsafe_claim(case) for case in cases)
verdict_counts = Counter(
    verify_claim(claim).verdict.value
    for case in cases
    for claim in case.claims
)

print(f"Baseline unsafe serves if all drafts ship: {baseline_served_unsafe}/{len(cases)}")
print(f"Claim verdict counts: {dict(verdict_counts)}")
assert baseline_served_unsafe == 3
```

Output

```text
Baseline unsafe serves if all drafts ship: 3/4
Claim verdict counts: {'supported': 6, 'not_supported': 1, 'contradicted': 1, 'no_source': 1}
```

For a grounded status product, three simple metrics are useful:

| Metric | Calculation | Release meaning |
| --- | --- | --- |
| Claim support rate | Supported factual claims / all factual claims | How much draft content evidence admits |
| Unsafe serve rate | Served answers containing any failed factual claim / served answers | Whether bad claims reach customers |
| Abstention rate | Answers withheld or safely shortened / total requests | Cost of being cautious |

Claim support can improve while unsafe serves remain unacceptable. A single contradicted delivery status shown to a customer is still a serious failure.
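As a rough sketch, the three rates can be computed from per-answer summaries. The `AnswerRecord` shape and its field names are assumptions made for illustration and are not part of tracking-answerer-v1. The claim counts mirror the four regression cases, scored once as if every draft shipped unchanged and once with failed drafts shortened or withheld.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnswerRecord:
    # Hypothetical per-answer summary; these field names are assumptions.
    case_id: str
    supported_claims: int
    failed_claims: int
    served_as_drafted: bool      # the draft reached the customer unchanged
    shortened_or_withheld: bool  # the gate removed claims or abstained

def claim_support_rate(records: list[AnswerRecord]) -> float:
    supported = sum(r.supported_claims for r in records)
    total = sum(r.supported_claims + r.failed_claims for r in records)
    return supported / total if total else 0.0

def unsafe_serve_rate(records: list[AnswerRecord]) -> float:
    served = [r for r in records if r.served_as_drafted]
    unsafe = sum(r.failed_claims > 0 for r in served)
    return unsafe / len(served) if served else 0.0

def abstention_rate(records: list[AnswerRecord]) -> float:
    withheld = sum(r.shortened_or_withheld for r in records)
    return withheld / len(records) if records else 0.0

# (case_id, supported claims, failed claims) for the four regression cases.
claim_counts = [("clean_scan", 3, 0), ("invented_eta", 3, 1),
                ("wrong_status", 0, 1), ("unadmitted_source", 0, 1)]

ungated = [AnswerRecord(cid, ok, bad, True, False) for cid, ok, bad in claim_counts]
gated = [AnswerRecord(cid, ok, bad, bad == 0, bad > 0) for cid, ok, bad in claim_counts]

for name, records in [("ungated", ungated), ("gated", gated)]:
    print(f"{name:8} support={claim_support_rate(records):.1%} "
          f"unsafe={unsafe_serve_rate(records):.1%} "
          f"abstain={abstention_rate(records):.1%}")
```

Claim support is the same in both runs (6 of 9 claims) because it describes the drafts rather than the routing. Only the unsafe serve rate and the abstention rate move when the gate is switched on, which is why the three numbers are reported together.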

Consistency is an alarm, not evidence

Black-box methods such as SelfCheckGPT compare an answer with additional generations sampled from the same model; disagreement among them can flag content the model is inventing, with no external database required.[3] Semantic entropy similarly groups sampled answers by meaning and reports high uncertainty when those meanings vary.[4] These signals are useful when no admitted source is available or when you need to prioritize expensive checks.

They don't authorize a delivery claim. A model can repeat the same unsupported ETA every time.
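To make the meaning-level idea concrete, here is a minimal sketch of entropy over meaning clusters. It is not the published method: `normalize_meaning` is a hypothetical stand-in that merges answers by simple string cleanup, whereas semantic entropy as described in [4] clusters answers by whether they entail each other.

```python
import math
from collections import Counter

def normalize_meaning(answer: str) -> str:
    # Assumption: crude string cleanup stands in for real meaning clustering.
    return " ".join(answer.lower().replace(".", "").split())

def meaning_entropy(samples: list[str]) -> float:
    # Entropy (in nats) of the distribution over meaning clusters.
    clusters = Counter(normalize_meaning(sample) for sample in samples)
    total = sum(clusters.values())
    return sum((n / total) * math.log(total / n) for n in clusters.values())

varied = ["May 28", "may 28.", "May 29", "May 30"]    # three distinct meanings
repeated = ["May 28", "may 28.", "May 28", "May 28 "]  # one meaning, four spellings

print(f"varied:   {meaning_entropy(varied):.2f}")
print(f"repeated: {meaning_entropy(repeated):.2f}")
```

The repeated samples collapse into a single cluster and score zero entropy even though no source supports May 28. Low entropy says the model is stable, not that the date is admissible.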

[Figure: A detection hierarchy where evidence verification controls delivery facts and consistency sampling only escalates uncertainty when evidence has already passed.]
Sampling disagreement can find unstable answers, but stable repetition isn't proof. Source support remains the serving authority for tracking facts.

consistency-is-not-truth.py

```python
def disagreement_rate(values: list[str]) -> float:
    # Share of samples that differ from the most common sample.
    most_common_count = Counter(values).most_common(1)[0][1]
    return 1 - most_common_count / len(values)

unstable_eta_samples = ["May 28", "May 29", "May 28", "May 30"]
stable_false_eta_samples = ["May 28", "May 28", "May 28", "May 28"]
eta_verdict = verify_claim(draft_claims[-1]).verdict

print(f"Unstable ETA disagreement: {disagreement_rate(unstable_eta_samples):.2f}")
print(f"Repeated ETA disagreement: {disagreement_rate(stable_false_eta_samples):.2f}")
print(f"Repeated ETA evidence verdict: {eta_verdict.value}")

# Perfect agreement across samples does not change the evidence verdict.
assert disagreement_rate(stable_false_eta_samples) == 0.0
assert eta_verdict == Verdict.NOT_SUPPORTED
```

Output

```text
Unstable ETA disagreement: 0.50
Repeated ETA disagreement: 0.00
Repeated ETA evidence verdict: not_supported
```

This fixes a common misconception: low uncertainty means "the model repeats itself," not "the fact is true."

Combine signals with the right authority

Evidence failure blocks a fact even when samples agree. When factual evidence passes but generation is unstable, consistency can route the case to review rather than presenting an answer that changes from run to run.

evidence-first-routing.py

```python
def route_with_signals(case: AnswerCase, sampled_statuses: list[str]) -> str:
    # Evidence is checked first; consistency is consulted only after it passes.
    if has_unsafe_claim(case):
        return "abstain_evidence_failure"
    if disagreement_rate(sampled_statuses) > 0.25:
        return "review_unstable_generation"
    return "serve_supported_answer"

clean_case = cases[0]
eta_case = cases[1]

routes = {
    "clean_stable": route_with_signals(clean_case, ["in transit"] * 4),
    "clean_unstable": route_with_signals(
        clean_case, ["in transit", "in transit", "out for delivery", "delivered"]
    ),
    "eta_repeated": route_with_signals(eta_case, stable_false_eta_samples),
}

for name, route_name in routes.items():
    print(f"{name:14} -> {route_name}")

assert routes["clean_stable"] == "serve_supported_answer"
assert routes["clean_unstable"] == "review_unstable_generation"
assert routes["eta_repeated"] == "abstain_evidence_failure"
```

Output

```text
clean_stable   -> serve_supported_answer
clean_unstable -> review_unstable_generation
eta_repeated   -> abstain_evidence_failure
```

Citations must be checked, not decorated

A response that prints [fastship-A10234] isn't necessarily grounded. The citation must resolve to the admitted version and support the nearby claim. Otherwise a model can attach a real-looking source marker to an invented delivery promise.

citation-faithfulness.py

```python
def cited_sentence(claim: Claim) -> str:
    # A citation is attached only after the claim has been verified.
    result = verify_claim(claim)
    if result.verdict != Verdict.SUPPORTED:
        raise ValueError(f"cannot cite {claim.claim_id}: {result.verdict.value}")
    return f"{claim.text} [{claim.citation_id}@{result.evidence_version}]"

served_sentences = [cited_sentence(claim) for claim in draft_claims[:3]]
served_answer = " ".join(served_sentences)

blocked_citation = ""  # stays empty if the ETA claim is unexpectedly citable
try:
    cited_sentence(draft_claims[-1])
except ValueError as error:
    blocked_citation = str(error)

print(served_answer)
print(blocked_citation)
assert "May 28" not in served_answer
assert "not_supported" in blocked_citation
```

Output

```text
Carrier: FastShip. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z] Last scan: departed regional hub. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z] Scan time: May 26 at 08:14 UTC. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z]
cannot cite eta: not_supported
```

This is a narrower version of claim-level factual evaluation: decompose, retrieve or admit evidence, verify, and retain the provenance for failed and passed claims. It also extends the evaluation lesson: citation presence is no longer confused with citation support.
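The resolution half of that check can also run on text that has already been rendered. The sketch below assumes the `[source_id@version]` tag format printed above, and assumes every bracketed span in an answer is a citation. `ADMITTED_VERSIONS` and `unresolved_citations` are hypothetical names, not part of the lesson's code.

```python
import re

# Assumption: admitted sources are tracked as source_id -> currently admitted version.
ADMITTED_VERSIONS = {"fastship-A10234": "scan-feed/2026-05-27T10:00:00Z"}

# Matches [source_id] or [source_id@version].
CITATION = re.compile(r"\[(?P<source_id>[^\[\]@]+)(?:@(?P<version>[^\[\]]+))?\]")

def unresolved_citations(answer: str) -> list[str]:
    """Return citation tags whose source is unknown or whose version is not admitted."""
    problems = []
    for match in CITATION.finditer(answer):
        admitted = ADMITTED_VERSIONS.get(match["source_id"])
        if admitted is None or admitted != match["version"]:
            problems.append(match.group(0))
    return problems

current = "Carrier: FastShip. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z]"
stale = "Carrier: FastShip. [fastship-A10234@scan-feed/2026-05-20T10:00:00Z]"
bare = "Expected delivery is May 28. [fastship-A10234]"
unknown = "Carrier: FastShip. [missing-feed@v1]"

for text in (current, stale, bare, unknown):
    print(unresolved_citations(text))
```

Passing this check shows only that a tag points at an admitted version. It says nothing about whether the nearby sentence is supported, so it complements claim verification rather than replacing it.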

Attribute failures before adding complexity

A failed answer can originate at more than one stage, so locate the first stage that failed:

| First failed stage | Symptom | Appropriate next action |
| --- | --- | --- |
| Evidence admission | No carrier record was retrieved for an order | Fix retrieval, permissions, freshness, or tool failure |
| Claim generation | Source exists, but answer adds unsupported ETA | Tighten generation and post-generation claim gate |
| Consistency only | Supported facts vary across samples | Review decoding or prompt stability; don't call it a source failure |

Techniques such as Chain-of-Verification ask a model to draft verification questions, answer them independently, and revise the response; the original paper reports reduced hallucination across list questions, closed-book question answering, and long-form generation.[5] That's a possible additional generator control. It doesn't replace an authoritative carrier record or the final claim gate in this product.

first-failure-attribution.py

```python
@dataclass(frozen=True)
class RunTrace:
    request_id: str
    case: AnswerCase
    evidence_admitted: bool
    sampled_statuses: list[str]

def first_failed_stage(trace: RunTrace) -> str:
    # Stages are checked in pipeline order, so the earliest failure wins.
    if not trace.evidence_admitted:
        return "evidence_admission"
    if has_unsafe_claim(trace.case):
        return "claim_generation"
    if disagreement_rate(trace.sampled_statuses) > 0.25:
        return "generation_stability"
    return "passed"

traces = [
    RunTrace("req-clean", cases[0], True, ["in transit"] * 4),
    RunTrace("req-eta", cases[1], True, stable_false_eta_samples),
    RunTrace("req-missing", cases[3], False, ["unknown"] * 4),
    RunTrace("req-vary", cases[0], True, ["in transit", "delivered", "in transit", "out for delivery"]),
]

for trace in traces:
    print(f"{trace.request_id:11} -> {first_failed_stage(trace)}")

assert [first_failed_stage(trace) for trace in traces] == [
    "passed",
    "claim_generation",
    "evidence_admission",
    "generation_stability",
]
```

Output

```text
req-clean   -> passed
req-eta     -> claim_generation
req-missing -> evidence_admission
req-vary    -> generation_stability
```
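For readers who want to see the shape of the Chain-of-Verification loop mentioned earlier, here is a hedged sketch of its four steps: draft, plan verification questions, answer them independently, and revise. The `Model` interface, the prompt wording, and `stub_model` with its canned replies are all assumptions made so the example runs offline; they are not taken from the paper or from any real API.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical text-in, text-out interface

def chain_of_verification(question: str, model: Model) -> str:
    # 1. Draft a baseline answer.
    draft = model(f"Answer: {question}")
    # 2. Plan verification questions, one per line.
    plan = model(f"List verification questions, one per line, for: {draft}")
    questions = [line.strip() for line in plan.splitlines() if line.strip()]
    # 3. Answer each question on its own; the draft is kept out of these prompts.
    checks = [(check, model(check)) for check in questions]
    # 4. Revise the draft in light of the independent answers.
    findings = "\n".join(f"{check} -> {reply}" for check, reply in checks)
    return model(f"Revise the draft using these checks.\nDraft: {draft}\nChecks:\n{findings}")

def stub_model(prompt: str) -> str:
    # Canned replies standing in for a real model (illustration only).
    if prompt.startswith("Answer:"):
        return "Departed regional hub May 26. Arrives May 28."
    if prompt.startswith("List verification questions"):
        return "Is there a delivery estimate?\nWhen was the last scan?"
    if prompt.startswith("Revise the draft"):
        return "Departed regional hub May 26. No delivery estimate is available."
    if prompt == "Is there a delivery estimate?":
        return "No estimate on record."
    return "May 26 at 08:14 UTC."

final_answer = chain_of_verification("Where is order #A10234?", stub_model)
print(final_answer)
```

Because the revision is still model output, it would pass through the same claim gate as any other draft before reaching a customer.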

Evaluate a candidate gate

Research benchmarks help compare methods, but they don't execute this carrier feed, prompt, citation format, or customer route. Use external datasets for breadth and product regressions for release:

| Artifact | Teaches or tests | Place in this workflow |
| --- | --- | --- |
| SelfCheckGPT[3] | Sample consistency without external facts | Triage signal when evidence is absent or costly |
| Semantic entropy[4] | Meaning-level uncertainty over samples | Escalation feature for confabulations |
| FActScore[2] | Atomic factual precision | Design model for claim-level verification |
| ShopFlow regression | Exact carrier facts and response policy | Release gate for tracking-answerer-v1 |

[Figure: A hallucination validation stack with research probes for breadth, product claim regressions for release, and monitored carrier-answer traces after deployment.]
External research probes explain behavior and compare detectors. Versioned product regressions decide whether this delivery-answer path may ship.

candidate-gate-metrics.py

```python
def baseline_router(case: AnswerCase) -> str:
    return "serve"

def grounded_router(case: AnswerCase) -> str:
    return "abstain" if has_unsafe_claim(case) else "serve"

def score_router(router) -> dict[str, float | int]:
    served = [case for case in cases if router(case) == "serve"]
    unsafe_serves = sum(has_unsafe_claim(case) for case in served)
    # Coverage: the share of fully supported cases that are still served.
    supported_cases = [case for case in cases if not has_unsafe_claim(case)]
    supported_serves = sum(router(case) == "serve" for case in supported_cases)
    return {
        "served": len(served),
        "unsafe_serves": unsafe_serves,
        "unsafe_serve_rate": unsafe_serves / max(len(served), 1),
        "supported_coverage": supported_serves / len(supported_cases),
        "abstentions": len(cases) - len(served),
    }

baseline_metrics = score_router(baseline_router)
candidate_metrics = score_router(grounded_router)

for name, metrics in [("baseline", baseline_metrics), ("candidate", candidate_metrics)]:
    print(
        f"{name:9} unsafe_rate={metrics['unsafe_serve_rate']:.1%} "
        f"coverage={metrics['supported_coverage']:.1%} abstentions={metrics['abstentions']}"
    )

assert baseline_metrics["unsafe_serve_rate"] == 0.75
assert candidate_metrics["unsafe_serve_rate"] == 0.0
assert candidate_metrics["supported_coverage"] == 1.0
```

Output

```text
baseline  unsafe_rate=75.0% coverage=100.0% abstentions=0
candidate unsafe_rate=0.0% coverage=100.0% abstentions=3
```

This is a useful repair on four deliberately small cases. It isn't enough to claim production factuality. The candidate abstains on three of four requests here because the test suite is failure-heavy by design.
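One way to see how little four cases establish is to put an interval around the candidate's unsafe serve rate. The sketch below uses a Wilson score interval at roughly 95% confidence (z = 1.96). That choice of interval and the 300-answer holdout size are assumptions for illustration, not something the lesson prescribes.

```python
import math

def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate observed as failures out of n."""
    if n == 0:
        return (0.0, 1.0)
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# The candidate served one answer on the regression suite, with zero unsafe serves.
low_small, high_small = wilson_interval(failures=0, n=1)
# Hypothetical holdout: zero unsafe serves across 300 served answers.
low_large, high_large = wilson_interval(failures=0, n=300)

print(f"1 served answer:    upper bound {high_small:.1%}")
print(f"300 served answers: upper bound {high_large:.1%}")
```

With a single served answer, the observed 0% is compatible with a very high true unsafe rate. A larger labeled holdout is what narrows the bound.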

Keep the mitigation stack small and testable

For this workflow, extra decoding tricks aren't the next priority. The current failure is already localized: the draft adds a fact absent from a live source. Start with controls whose effect can be tested against that source:

[Figure: A focused mitigation stack for carrier updates: admit versioned evidence, constrain claims, verify citation support, route unsupported answers, and monitor outcomes.]
Use the smallest control chain that resolves the measured failure. Broader model interventions can be evaluated later if supported answers still fail.

| Layer | Invariant | Failure it prevents |
| --- | --- | --- |
| Admit evidence | Record source ID and version before answer generation | Stale or untraceable facts |
| Generate conservatively | Ask only for claims licensed by source fields | Unnecessary unsupported detail |
| Verify claims | Every factual clause receives a verdict | Fluent fabrications |
| Route safely | Failed clauses trigger removal, abstention, or review | Unsafe customer promises |
| Measure after launch | Retain verdicts, versions, route, and owner | Silent recurrence |
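The "Generate conservatively" row can be sketched as a prompt that lists only the fields the admitted record contains. The prompt wording and the `build_prompt` helper are hypothetical. Any real prompt would need its own testing, and it would not replace the claim gate.

```python
def build_prompt(question: str, facts: dict[str, str]) -> str:
    # Only fields present in the admitted record are offered to the generator.
    field_lines = [f"- {name}: {facts[name]}" for name in sorted(facts)]
    return (
        "Answer using only the fields listed below. "
        "If the customer asks for anything else, say the record does not provide it.\n"
        + "\n".join(field_lines)
        + f"\nCustomer question: {question}"
    )

carrier_facts = {
    "carrier": "FastShip",
    "status": "in transit",
    "last_scan": "departed regional hub",
    "last_scan_at": "May 26 at 08:14 UTC",
}

prompt = build_prompt("Where is order #A10234?", carrier_facts)
print(prompt)
```

The prompt names no delivery estimate field, so the generator is never invited to fill one in. A model can still ignore the instruction, which is why verification stays a separate layer.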

Hand the next lesson a trace

The next chapter can't monitor a correctness property that this chapter never records. Store enough information to reconstruct the answer decision without retaining unnecessary customer text: source versions, verdict counts, route, and failure stage.

release-trace-contract.py

```python
@dataclass(frozen=True)
class MonitorEvent:
    request_id: str
    evidence_version: str | None
    route: str
    first_failed_stage: str
    verdict_counts: dict[str, int]

def monitor_event(trace: RunTrace) -> MonitorEvent:
    # Verdict counts are kept; the customer text itself is not.
    counts = Counter(verify_claim(claim).verdict.value for claim in trace.case.claims)
    return MonitorEvent(
        request_id=trace.request_id,
        evidence_version=tracking.version if trace.evidence_admitted else None,
        route=grounded_router(trace.case),
        first_failed_stage=first_failed_stage(trace),
        verdict_counts=dict(counts),
    )

events = [monitor_event(trace) for trace in traces]
# Promotion requires every item to be True; the last two are still open.
requirements = {
    "known_unsafe_claims_not_served": candidate_metrics["unsafe_serves"] == 0,
    "source_version_logged_when_admitted": all(
        event.evidence_version is not None
        for event in events
        if event.first_failed_stage != "evidence_admission"
    ),
    "representative_labeled_holdout_collected": False,
    "monitoring_alert_owner_assigned": False,
}
failed_requirements = [name for name, passed in requirements.items() if not passed]
decision = "APPROVED" if not failed_requirements else "BLOCKED"

print(f"Candidate promotion: {decision}")
for name in failed_requirements:
    print(f" missing: {name}")
print(f"Example failed stage: {events[1].first_failed_stage}")

assert decision == "BLOCKED"
```

Output

```text
Candidate promotion: BLOCKED
 missing: representative_labeled_holdout_collected
 missing: monitoring_alert_owner_assigned
Example failed stage: claim_generation
```

The candidate gate fixes its known fabricated-ETA regressions, but promotion remains blocked. It still needs a representative labeled holdout and an operational owner for alerts. The code has now produced exactly the facts a monitoring system should aggregate.
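A trace is only useful downstream if it can be written and read back without loss. As a small sketch, the event below is serialized to a single JSON line. The class is redeclared with the same fields so the example runs on its own, and the JSON-lines format and the `to_log_line` helper are assumptions rather than a format the lesson specifies.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class MonitorEvent:
    # Same fields as the lesson's MonitorEvent, redeclared so this sketch runs alone.
    request_id: str
    evidence_version: Optional[str]
    route: str
    first_failed_stage: str
    verdict_counts: dict[str, int]

def to_log_line(event: MonitorEvent) -> str:
    # sort_keys keeps the field order stable across runs, which helps diffing.
    return json.dumps(asdict(event), sort_keys=True)

event = MonitorEvent(
    request_id="req-eta",
    evidence_version="scan-feed/2026-05-27T10:00:00Z",
    route="abstain",
    first_failed_stage="claim_generation",
    verdict_counts={"supported": 3, "not_supported": 1},
)

line = to_log_line(event)
restored = MonitorEvent(**json.loads(line))
print(line)
```

The line carries versions, route, stage, and counts, and none of the customer-facing text, which matches the retention goal stated above.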

Mastery check

Key concepts

  • Supported, not-supported, and contradicted factual claims
  • Atomic claim verification against admitted evidence
  • Evidence-versioned citations
  • Claim support, unsafe-serve, and abstention rates
  • Consistency sampling as triage rather than truth
  • Safe answer shortening and abstention
  • Failure-stage attribution
  • Regression release gates and monitoring events

Evaluation rubric

  • Treats absence of source support as a serving failure without claiming outside-world falsity
  • Decomposes a draft answer into testable factual claims
  • Blocks both invented ETAs and claims contradicted by admitted state
  • Demonstrates why stable repeated output isn't proof of truth
  • Keeps citations coupled to verified source versions
  • Locates first failure before proposing new mitigation layers
  • Evaluates safe routing with both unsafe-serve rate and supported coverage
  • Produces a trace that the monitoring lesson can aggregate

Common pitfalls

Repeated answers are mistaken for true answers

  • Symptom: A stable sampled ETA is served despite no source field establishing it.
  • Cause: Consistency was treated as evidence.
  • Fix: Use consistency to prioritize verification or review; require admitted source support for customer-facing facts.

Citations are generated but not verified

  • Symptom: The answer has plausible source tags beside an invented delivery promise.
  • Cause: Citation formatting was checked, but claim-to-source support was not.
  • Fix: Verify each cited factual claim against its source version before serving it.

The retriever is blamed for generator overreach

  • Symptom: More documents are indexed even though correct tracking evidence was already retrieved.
  • Cause: The team didn't identify the first failed stage.
  • Fix: Separate evidence-admission failure from claim-generation failure in every trace.

Abstention is declared a release success on tiny fixtures

  • Symptom: Four regression cases are used to claim production factual reliability.
  • Cause: A focused regression suite was confused with a representative holdout.
  • Fix: Keep the regressions, then collect labeled workflow slices and assign monitoring ownership before promotion.

Next Step
Continue to LLM Observability & Monitoring

You can now emit a versioned trace whenever a factual claim is served, removed, or blocked. Next you will turn those traces into quality metrics, alerts, and request-level debugging.

References

[1] Huang et al. (2023). A Survey on Hallucination in Large Language Models.

[2] Min, S., et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023.

[3] Manakul, P., et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. EMNLP 2023.

[4] Farquhar, S., et al. (2024). Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature.

[5] Dhuliawala, S., et al. (2023). Chain-of-Verification Reduces Hallucination in Large Language Models.