Hallucination Detection & Mitigation

Build a claim-level grounding gate for delivery updates that verifies evidence, catches confident fabrication, abstains safely, and records release traces.

The fairness lesson blocked a soft evaluator when it could route equivalent customer requests differently. Factual reliability needs an equally strict boundary: a fluent large language model (LLM) answer can't invent a carrier event just because the customer would like certainty.

ShopFlow's next ticket asks, "Where is order #A10234?" The admitted FastShip record says the parcel departed a regional hub on May 26 at 08:14 UTC. It contains no delivery estimate. A draft answer that adds "It will arrive on May 28" may sound helpful, but the system has no source for that promise.

This lesson builds tracking-answerer-v1, a small serving gate. It turns an answer into atomic claims, checks each claim against versioned evidence, routes unsupported details to abstention, and records why release remains blocked. For a customer-facing grounded answer, "not present in admitted evidence" is enough reason not to serve a factual claim. It doesn't prove the claim is false in the wider world.

Three verdicts are enough to start

Hallucination taxonomies can become abstract quickly. For a retrieved delivery record, begin with the source relationship:

Verdict	Meaning in this product	Delivery-update example	Customer action
Supported	Admitted source states the fact	"Last scan: departed regional hub on May 26."	May be served
Not supported	Admitted source doesn't establish the fact	"Delivery is expected May 28."	Remove or abstain
Contradicted	Admitted source states an incompatible fact	"Your order has been delivered." while status is `in transit`	Block and investigate

Surveys of LLM hallucination distinguish statements that conflict with supplied context from statements that add information not established by it.^[1] In an open-world setting, an unsupported statement might later turn out true. In this answer path it's still unsafe, because the product promised evidence-grounded delivery updates.
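The table above can be read as a lookup from source relationship to customer action. The sketch below is one minimal way to encode it; the `SourceRelation` name and the action labels are illustrative assumptions, not part of tracking-answerer-v1.

```python
from enum import Enum

class SourceRelation(str, Enum):
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    CONTRADICTED = "contradicted"

# Hypothetical action labels that mirror the "Customer action" column.
CUSTOMER_ACTION = {
    SourceRelation.SUPPORTED: "serve",
    SourceRelation.NOT_SUPPORTED: "remove_or_abstain",
    SourceRelation.CONTRADICTED: "block_and_investigate",
}

def action_for(relation: SourceRelation) -> str:
    # A missing key raises KeyError, so a newly added relation cannot silently default to "serve".
    return CUSTOMER_ACTION[relation]

print(action_for(SourceRelation.NOT_SUPPORTED))
```

Using an explicit table rather than an `if`/`else` chain makes the default visible: nothing is served unless a row says so.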

Diagram: versioned carrier record, draft answer claims, claim verifier, and serve gate.

Figure 1: A grounded delivery answer is served only when every factual claim has an admitted supporting source.

Figure: A delivery-answer grounding gate that checks draft claims against a versioned FastShip record and abstains when an invented ETA lacks evidence. The carrier record supports a scan event, not a delivery estimate. Verification blocks the invented ETA before a confident sentence reaches the customer.

Put the evidence in code

The carrier record isn't prose decoration. It's the authority the answer must obey. Our first cell records its source ID and version alongside the candidate claims.

versioned-carrier-evidence.py

from collections import Counter
from dataclasses import dataclass
from enum import Enum

@dataclass(frozen=True)
class Evidence:
    source_id: str
    version: str
    facts: dict[str, str]

@dataclass(frozen=True)
class Claim:
    claim_id: str
    text: str
    field: str
    value: str
    citation_id: str

tracking = Evidence(
    source_id="fastship-A10234",
    version="scan-feed/2026-05-27T10:00:00Z",
    facts={
        "carrier": "FastShip",
        "status": "in transit",
        "last_scan": "departed regional hub",
        "last_scan_at": "May 26 at 08:14 UTC",
    },
)
sources = {tracking.source_id: tracking}

draft_claims = [
    Claim("carrier", "Carrier: FastShip.", "carrier", "FastShip", tracking.source_id),
    Claim("scan_place", "Last scan: departed regional hub.", "last_scan", "departed regional hub", tracking.source_id),
    Claim("scan_time", "Scan time: May 26 at 08:14 UTC.", "last_scan_at", "May 26 at 08:14 UTC", tracking.source_id),
    Claim("eta", "Expected delivery is May 28.", "delivery_eta", "May 28", tracking.source_id),
]

print(f"Evidence version: {tracking.version}")
print(f"Available facts: {sorted(tracking.facts)}")
print(f"Draft factual claims: {len(draft_claims)}")

Output

Evidence version: scan-feed/2026-05-27T10:00:00Z
Available facts: ['carrier', 'last_scan', 'last_scan_at', 'status']
Draft factual claims: 4

An LLM can write the ETA sentence because it has seen delivery-date patterns. That doesn't make the sentence admissible. The admitted evidence has no delivery_eta field.
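A quick way to see the gap is to compare the fields the draft talks about with the fields the record holds. This is a small standalone sketch; the field names are copied from the cell above, and the variable names are assumptions made for illustration.

```python
# Field names copied from the lesson's carrier record and draft claims.
admitted_fields = {"carrier", "status", "last_scan", "last_scan_at"}
draft_fields = ["carrier", "last_scan", "last_scan_at", "delivery_eta"]

# Any draft field outside the admitted set has no evidence to check against.
unlicensed = [name for name in draft_fields if name not in admitted_fields]
print(unlicensed)
```

The check says nothing about whether May 28 is a plausible date. It only shows that the record has no field that could confirm or refute it.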

Verify each atomic claim

FActScore popularized evaluating long-form generations as atomic facts rather than one answer-level impression.^[2] The verifier below uses the same reasoning on a small operational record: a claim is either supported by its cited source, missing from that source, contradicted by a different value, or missing an admitted source entirely.

claim-verdicts.py

class Verdict(str, Enum):
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    CONTRADICTED = "contradicted"
    NO_SOURCE = "no_source"

@dataclass(frozen=True)
class Verification:
    claim: Claim
    verdict: Verdict
    evidence_version: str | None

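def verify_claim(claim: Claim) -> Verification:
    # The citation must resolve to an admitted source before any field is compared.
    source = sources.get(claim.citation_id)
    if source is None:
        return Verification(claim, Verdict.NO_SOURCE, None)
    # A field absent from the source is unsupported, not proven false.
    expected = source.facts.get(claim.field)
    if expected is None:
        return Verification(claim, Verdict.NOT_SUPPORTED, source.version)
    # A present field with a different value conflicts with the admitted record.
    if expected != claim.value:
        return Verification(claim, Verdict.CONTRADICTED, source.version)
    return Verification(claim, Verdict.SUPPORTED, source.version)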

draft_verdicts = [verify_claim(claim) for claim in draft_claims]
delivered_claim = Claim(
    "delivered",
    "Order #A10234 has been delivered.",
    "status",
    "delivered",
    tracking.source_id,
)

assert [item.verdict for item in draft_verdicts] == [
    Verdict.SUPPORTED,
    Verdict.SUPPORTED,
    Verdict.SUPPORTED,
    Verdict.NOT_SUPPORTED,
]
assert verify_claim(delivered_claim).verdict == Verdict.CONTRADICTED

for result in draft_verdicts + [verify_claim(delivered_claim)]:
    print(f"{result.claim.claim_id:10} {result.verdict.value:15} {result.claim.text}")

Output

carrier    supported       Carrier: FastShip.
scan_place supported       Last scan: departed regional hub.
scan_time  supported       Scan time: May 26 at 08:14 UTC.
eta        not_supported   Expected delivery is May 28.
delivered  contradicted    Order #A10234 has been delivered.

The distinction matters. The ETA isn't proven false; it's simply absent from this source. The delivered statement is worse: it conflicts with status=in transit.
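Exact string equality keeps the lesson's verifier easy to read, but it would label "In Transit" as contradicting "in transit". One possible refinement, sketched here under the assumption that case and spacing carry no meaning in this feed, is to normalize both sides before comparing. The `normalize` and `relation` helpers are hypothetical.

```python
from typing import Optional

def normalize(value: str) -> str:
    # Hypothetical rule: lowercase, trim, and collapse internal whitespace.
    return " ".join(value.lower().split())

def relation(expected: Optional[str], claimed: str) -> str:
    if expected is None:
        return "not_supported"  # the source has no such field
    if normalize(expected) != normalize(claimed):
        return "contradicted"   # the source states something else
    return "supported"

print(relation("in transit", "In  Transit "))
print(relation("in transit", "delivered"))
print(relation(None, "May 28"))
```

Normalization is a trade-off: rules that are too aggressive can hide a real contradiction, so each rule deserves its own regression case.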

Route the response, not only the score

A detector that writes a warning beside an unsafe response has not protected the customer. Once any factual claim fails, tracking-answerer-v1 keeps supported information and replaces the invented promise with a bounded statement.

safe-answer-route.py

def safe_answer(claims: list[Claim]) -> dict[str, object]:
    verdicts = [verify_claim(claim) for claim in claims]
    failures = [item for item in verdicts if item.verdict != Verdict.SUPPORTED]
    supported_text = " ".join(
        item.claim.text for item in verdicts if item.verdict == Verdict.SUPPORTED
    )
    if failures:
        return {
            "route": "abstain_on_missing_detail",
            "answer": supported_text + " The carrier record does not provide a delivery estimate yet.",
            "blocked_claims": [item.claim.claim_id for item in failures],
        }
    return {
        "route": "serve",
        "answer": supported_text,
        "blocked_claims": [],
    }

decision = safe_answer(draft_claims)
assert decision["route"] == "abstain_on_missing_detail"
assert decision["blocked_claims"] == ["eta"]
assert "May 28" not in str(decision["answer"])

print(f"Route: {decision['route']}")
print(f"Blocked claims: {decision['blocked_claims']}")
print(f"Customer answer: {decision['answer']}")

Output

Route: abstain_on_missing_detail
Blocked claims: ['eta']
Customer answer: Carrier: FastShip. Last scan: departed regional hub. Scan time: May 26 at 08:14 UTC. The carrier record does not provide a delivery estimate yet.
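The `safe_answer` cell appends the same delivery-estimate sentence for every failure, which fits the invented-ETA case but would read oddly after a contradicted status. A possible extension, sketched below with hypothetical wording and names, picks the bounded sentence from the failed verdict and field.

```python
# Hypothetical fallback sentences keyed by (verdict, field); wording is illustrative only.
FALLBACKS = {
    ("not_supported", "delivery_eta"): "The carrier record does not provide a delivery estimate yet.",
    ("contradicted", "status"): "We are confirming the latest status with the carrier.",
}
GENERIC_FALLBACK = "Some details could not be confirmed from the carrier record."

def fallback_sentences(failures: list[tuple[str, str]]) -> list[str]:
    """Return one bounded sentence per failed (verdict, field) pair, without duplicates."""
    sentences: list[str] = []
    for key in failures:
        sentence = FALLBACKS.get(key, GENERIC_FALLBACK)
        if sentence not in sentences:
            sentences.append(sentence)
    return sentences

print(fallback_sentences([("not_supported", "delivery_eta")]))
print(fallback_sentences([("contradicted", "status"), ("no_source", "carrier")]))
```

Every fallback sentence is itself customer-facing text, so it should stay free of new factual claims.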

Measure failure before choosing mitigation

A single blocked ETA gives a regression case, not a release metric. Build a small suite that separates four failure modes: a clean scan summary, an unsupported ETA, a contradiction, and a missing source.

Figure: Four delivery-answer regression cases mapped to supported, unsupported, contradicted, and missing-source claim verdicts. Atomic verdicts locate the failure. Unsupported and contradicted claims both leave the customer path, but they call for different investigation.

grounding-regression-suite.py

@dataclass(frozen=True)
class AnswerCase:
    case_id: str
    claims: list[Claim]

cases = [
    AnswerCase("clean_scan", draft_claims[:3]),
    AnswerCase("invented_eta", draft_claims),
    AnswerCase("wrong_status", [delivered_claim]),
    AnswerCase(
        "unadmitted_source",
        [Claim("carrier", "FastShip is carrying order #B404.", "carrier", "FastShip", "missing-feed")],
    ),
]

def has_unsafe_claim(case: AnswerCase) -> bool:
    return any(verify_claim(claim).verdict != Verdict.SUPPORTED for claim in case.claims)

baseline_served_unsafe = sum(has_unsafe_claim(case) for case in cases)
verdict_counts = Counter(
    verify_claim(claim).verdict.value
    for case in cases
    for claim in case.claims
)

print(f"Baseline unsafe serves if all drafts ship: {baseline_served_unsafe}/{len(cases)}")
print(f"Claim verdict counts: {dict(verdict_counts)}")
assert baseline_served_unsafe == 3

Output

Baseline unsafe serves if all drafts ship: 3/4
Claim verdict counts: {'supported': 6, 'not_supported': 1, 'contradicted': 1, 'no_source': 1}

For a grounded status product, three simple metrics are useful:

Metric	Calculation	Release meaning
Claim support rate	Supported factual claims / all factual claims	How much draft content evidence admits
Unsafe serve rate	Served answers containing any failed factual claim / served answers	Whether bad claims reach customers
Abstention rate	Answers withheld or safely shortened / total requests	Cost of being cautious

Claim support can improve while unsafe serves remain unacceptable. A single contradicted delivery status shown to a customer is still a serious failure.
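The calculations in the metrics table can be sketched over per-request log rows. The `AnswerRecord` shape and its field names are assumptions for illustration; the four rows mirror the regression cases after the grounded gate has routed them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnswerRecord:
    # Hypothetical per-request log row; field names are assumptions.
    served: bool
    supported_claims: int
    failed_claims: int

def release_metrics(records: list[AnswerRecord]) -> dict[str, float]:
    total_claims = sum(r.supported_claims + r.failed_claims for r in records)
    served = [r for r in records if r.served]
    # An answer is an unsafe serve if it reached the customer with any failed claim.
    unsafe = sum(1 for r in served if r.failed_claims > 0)
    return {
        "claim_support_rate": sum(r.supported_claims for r in records) / max(total_claims, 1),
        "unsafe_serve_rate": unsafe / max(len(served), 1),
        "abstention_rate": (len(records) - len(served)) / max(len(records), 1),
    }

records = [
    AnswerRecord(served=True, supported_claims=3, failed_claims=0),   # clean scan
    AnswerRecord(served=False, supported_claims=3, failed_claims=1),  # invented ETA
    AnswerRecord(served=False, supported_claims=0, failed_claims=1),  # wrong status
    AnswerRecord(served=False, supported_claims=0, failed_claims=1),  # unadmitted source
]
print(release_metrics(records))
```

Reporting the three numbers together keeps a high support rate from hiding an unsafe serve, and keeps a zero unsafe rate from hiding a gate that abstains on everything.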

Consistency is an alarm, not evidence

Black-box methods such as SelfCheckGPT compare an answer to additional sampled generations; disagreement among the samples can flag likely invented content without consulting an external database.^[3] Semantic entropy similarly groups sampled answers by meaning and signals high uncertainty when those meanings vary.^[4] These signals are useful when no admitted source is available or when you need to prioritize expensive checks.

They don't authorize a delivery claim. A model can repeat the same unsupported ETA every time.

Figure: A detection hierarchy where evidence verification controls delivery facts and consistency sampling only escalates uncertainty when evidence has already passed. Sampling disagreement can find unstable answers, but stable repetition isn't proof. Source support remains the serving authority for tracking facts.

consistency-is-not-truth.py

def disagreement_rate(values: list[str]) -> float:
    most_common_count = Counter(values).most_common(1)[0][1]
    return 1 - most_common_count / len(values)

unstable_eta_samples = ["May 28", "May 29", "May 28", "May 30"]
stable_false_eta_samples = ["May 28", "May 28", "May 28", "May 28"]
eta_verdict = verify_claim(draft_claims[-1]).verdict

print(f"Unstable ETA disagreement: {disagreement_rate(unstable_eta_samples):.2f}")
print(f"Repeated ETA disagreement: {disagreement_rate(stable_false_eta_samples):.2f}")
print(f"Repeated ETA evidence verdict: {eta_verdict.value}")

assert disagreement_rate(stable_false_eta_samples) == 0.0
assert eta_verdict == Verdict.NOT_SUPPORTED

Output

Unstable ETA disagreement: 0.50
Repeated ETA disagreement: 0.00
Repeated ETA evidence verdict: not_supported

This fixes a common misconception: low uncertainty means "the model repeats itself," not "the fact is true."
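The same point holds for meaning-level uncertainty. The sketch below is a simplified, discrete illustration of the idea behind semantic entropy: group samples by meaning, then compute Shannon entropy over the groups. The published method clusters generations with an entailment model; the regex key here is only a toy stand-in, and both function names are assumptions.

```python
import math
import re
from collections import Counter
from typing import Callable

def meaning_entropy(samples: list[str], meaning_key: Callable[[str], str]) -> float:
    """Shannon entropy in bits over clusters of samples that share a meaning key."""
    clusters = Counter(meaning_key(sample) for sample in samples)
    total = len(samples)
    probs = [count / total for count in clusters.values()]
    return sum(p * math.log2(1 / p) for p in probs)

def toy_date_key(answer: str) -> str:
    # Assumption: two answers mean the same thing if they name the same May date.
    match = re.search(r"may \d{1,2}", answer.lower())
    return match.group(0) if match else answer.lower().strip()

varied = ["Arrives May 28.", "Expected on May 28.", "Delivery by May 29.", "May 28 is the estimate."]
repeated = ["Arrives May 28.", "Expected on May 28.", "May 28 is the estimate.", "It should come May 28."]

print(f"varied:   {meaning_entropy(varied, toy_date_key):.2f} bits")
print(f"repeated: {meaning_entropy(repeated, toy_date_key):.2f} bits")
```

The repeated set scores zero entropy even though every sample asserts an ETA that the carrier record does not contain, so the score can rank cases for review but cannot admit a fact.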

Combine signals with the right authority

Evidence failure blocks a fact even when samples agree. When factual evidence passes but generation is unstable, consistency can route the case to review rather than presenting an answer that changes from run to run.

evidence-first-routing.py

def route_with_signals(case: AnswerCase, sampled_statuses: list[str]) -> str:
    if has_unsafe_claim(case):
        return "abstain_evidence_failure"
    if disagreement_rate(sampled_statuses) > 0.25:
        return "review_unstable_generation"
    return "serve_supported_answer"

clean_case = cases[0]
eta_case = cases[1]

routes = {
    "clean_stable": route_with_signals(clean_case, ["in transit"] * 4),
    "clean_unstable": route_with_signals(
        clean_case, ["in transit", "in transit", "out for delivery", "delivered"]
    ),
    "eta_repeated": route_with_signals(eta_case, stable_false_eta_samples),
}

for name, route_name in routes.items():
    print(f"{name:14} -> {route_name}")

assert routes["clean_stable"] == "serve_supported_answer"
assert routes["clean_unstable"] == "review_unstable_generation"
assert routes["eta_repeated"] == "abstain_evidence_failure"

Output

clean_stable   -> serve_supported_answer
clean_unstable -> review_unstable_generation
eta_repeated   -> abstain_evidence_failure

Citations must be checked, not decorated

A response that prints [fastship-A10234] isn't necessarily grounded. The citation must resolve to the admitted version and support the nearby claim. Otherwise a model can attach a real-looking source marker to an invented delivery promise.

citation-faithfulness.py

def cited_sentence(claim: Claim) -> str:
    result = verify_claim(claim)
    if result.verdict != Verdict.SUPPORTED:
        raise ValueError(f"cannot cite {claim.claim_id}: {result.verdict.value}")
    return f"{claim.text} [{claim.citation_id}@{result.evidence_version}]"

served_sentences = [cited_sentence(claim) for claim in draft_claims[:3]]
served_answer = " ".join(served_sentences)

blocked_citation = ""  # stays empty only if the unsupported claim was wrongly cited
try:
    cited_sentence(draft_claims[-1])
except ValueError as error:
    blocked_citation = str(error)

print(served_answer)
print(blocked_citation)
assert "May 28" not in served_answer
assert "not_supported" in blocked_citation

Output

Carrier: FastShip. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z] Last scan: departed regional hub. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z] Scan time: May 26 at 08:14 UTC. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z]
cannot cite eta: not_supported

This is a narrower version of claim-level factual evaluation: decompose, retrieve or admit evidence, verify, and retain the provenance for failed and passed claims. It also extends the evaluation lesson: citation presence is no longer confused with citation support.
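A complementary check can run on the finished answer text: every tag must resolve to the admitted version of its source. The sketch assumes the `[source_id@version]` tag format printed by `cited_sentence`; `ADMITTED_VERSIONS`, `TAG_PATTERN`, and `unresolved_tags` are hypothetical names.

```python
import re

# Admitted source versions; the single entry is copied from the lesson's carrier record.
ADMITTED_VERSIONS = {"fastship-A10234": "scan-feed/2026-05-27T10:00:00Z"}

# Matches tags shaped like [source_id@version].
TAG_PATTERN = re.compile(r"\[([^\[\]@]+)@([^\[\]]+)\]")

def unresolved_tags(answer: str) -> list[str]:
    """Return citation tags whose source or version is not the admitted one."""
    problems = []
    for source_id, version in TAG_PATTERN.findall(answer):
        if ADMITTED_VERSIONS.get(source_id) != version:
            problems.append(f"{source_id}@{version}")
    return problems

current = "Carrier: FastShip. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z]"
stale = "Carrier: FastShip. [fastship-A10234@scan-feed/2026-05-20T10:00:00Z]"
print(unresolved_tags(current))
print(unresolved_tags(stale))
```

Resolution is necessary but not sufficient. A tag can point at the right version and still sit beside a claim the record does not support, so this check complements claim verification rather than replacing it.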

Attribute failures before adding complexity

A failed answer can break first at evidence admission, at claim generation, or only in the stability of sampled outputs:

First failed stage	Symptom	Appropriate next action
Evidence admission	No carrier record was retrieved for an order	Fix retrieval, permissions, freshness, or tool failure
Claim generation	Source exists, but answer adds unsupported ETA	Tighten generation and post-generation claim gate
Generation stability	Supported facts vary across samples	Review decoding or prompt stability; don't call it a source failure

Techniques such as Chain-of-Verification ask a model to draft verification questions, answer them independently, and revise the response; the original paper reports reduced hallucination across list questions, closed-book question answering, and long-form generation.^[5] That's a possible additional generator control. It doesn't replace an authoritative carrier record or the final claim gate in this product.
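The draft, question, answer, revise loop described above can be sketched with the model call injected as a plain function. Everything here is an assumption made for illustration: the prompts are not the paper's, and `fake_ask` and `fake_plan` are stubs standing in for real model calls.

```python
from typing import Callable

def chain_of_verification(
    question: str,
    ask: Callable[[str], str],
    plan: Callable[[str], list[str]],
) -> dict[str, object]:
    """Draft, plan checks, answer each check without the draft, then revise."""
    draft = ask(question)
    checks = plan(draft)
    # Each check is sent on its own, so the draft text cannot bias its answer.
    findings = {check: ask(check) for check in checks}
    evidence = "\n".join(f"{q} -> {a}" for q, a in findings.items())
    revised = ask(
        "Revise using only verified findings.\n"
        f"Question: {question}\nDraft: {draft}\nFindings:\n{evidence}"
    )
    return {"draft": draft, "findings": findings, "revised": revised}

def fake_ask(prompt: str) -> str:
    # Stub model: canned replies chosen by prompt prefix.
    if prompt.startswith("Revise"):
        return "Last scan: departed regional hub. No delivery estimate is available."
    if prompt.startswith("Does the carrier record"):
        return "No."
    return "It departed the regional hub and will arrive on May 28."

def fake_plan(draft: str) -> list[str]:
    # Stub planner: a real one would derive questions from the draft.
    return ["Does the carrier record include a delivery estimate?"]

result = chain_of_verification("Where is order #A10234?", fake_ask, fake_plan)
print(result["revised"])
```

Even when the revision looks clean, its claims would still pass through the claim gate before serving, because the verification answers come from a model rather than from the carrier record.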

first-failure-attribution.py

@dataclass(frozen=True)
class RunTrace:
    request_id: str
    case: AnswerCase
    evidence_admitted: bool
    sampled_statuses: list[str]

def first_failed_stage(trace: RunTrace) -> str:
    if not trace.evidence_admitted:
        return "evidence_admission"
    if has_unsafe_claim(trace.case):
        return "claim_generation"
    if disagreement_rate(trace.sampled_statuses) > 0.25:
        return "generation_stability"
    return "passed"

traces = [
    RunTrace("req-clean", cases[0], True, ["in transit"] * 4),
    RunTrace("req-eta", cases[1], True, stable_false_eta_samples),
    RunTrace("req-missing", cases[3], False, ["unknown"] * 4),
    RunTrace("req-vary", cases[0], True, ["in transit", "delivered", "in transit", "out for delivery"]),
]

for trace in traces:
    print(f"{trace.request_id:11} -> {first_failed_stage(trace)}")

assert [first_failed_stage(trace) for trace in traces] == [
    "passed",
    "claim_generation",
    "evidence_admission",
    "generation_stability",
]

Output

req-clean   -> passed
req-eta     -> claim_generation
req-missing -> evidence_admission
req-vary    -> generation_stability

Evaluate a candidate gate

Research benchmarks help compare methods, but they don't execute this carrier feed, prompt, citation format, or customer route. Use external datasets for breadth and product regressions for release:

Artifact	Teaches or tests	Place in this workflow
SelfCheckGPT^[3]	Sample consistency without external facts	Triage signal when evidence is absent or costly
Semantic entropy^[4]	Meaning-level uncertainty over samples	Escalation feature for confabulations
FActScore^[2]	Atomic factual precision	Design model for claim-level verification
ShopFlow regression	Exact carrier facts and response policy	Release gate for `tracking-answerer-v1`

Figure: A hallucination validation stack with research probes for breadth, product claim regressions for release, and monitored carrier-answer traces after deployment. External research probes explain behavior and compare detectors. Versioned product regressions decide whether this delivery-answer path may ship.

candidate-gate-metrics.py

def baseline_router(case: AnswerCase) -> str:
    return "serve"

def grounded_router(case: AnswerCase) -> str:
    return "abstain" if has_unsafe_claim(case) else "serve"

def score_router(router) -> dict[str, float | int]:
    served = [case for case in cases if router(case) == "serve"]
    unsafe_serves = sum(has_unsafe_claim(case) for case in served)
    supported_cases = [case for case in cases if not has_unsafe_claim(case)]
    supported_serves = sum(router(case) == "serve" for case in supported_cases)
    return {
        "served": len(served),
        "unsafe_serves": unsafe_serves,
        "unsafe_serve_rate": unsafe_serves / max(len(served), 1),
        "supported_coverage": supported_serves / len(supported_cases),
        "abstentions": len(cases) - len(served),
    }

baseline_metrics = score_router(baseline_router)
candidate_metrics = score_router(grounded_router)

for name, metrics in [("baseline", baseline_metrics), ("candidate", candidate_metrics)]:
    print(
        f"{name:9} unsafe_rate={metrics['unsafe_serve_rate']:.1%} "
        f"coverage={metrics['supported_coverage']:.1%} abstentions={metrics['abstentions']}"
    )

assert baseline_metrics["unsafe_serve_rate"] == 0.75
assert candidate_metrics["unsafe_serve_rate"] == 0.0
assert candidate_metrics["supported_coverage"] == 1.0

Output

baseline  unsafe_rate=75.0% coverage=100.0% abstentions=0
candidate unsafe_rate=0.0% coverage=100.0% abstentions=3

This is a useful repair on four deliberately small cases. It isn't enough to claim production factuality. The candidate abstains on three of four requests here because the test suite is failure-heavy by design.
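One way to quantify how little a tiny suite shows is to ask how high the true unsafe-serve rate could be while still producing zero unsafe serves. The sketch below solves (1 - p)^n = 1 - confidence for p. It assumes served answers are independent draws from the same distribution, which a hand-built regression suite does not satisfy, so treat the numbers as intuition only. The function name is hypothetical.

```python
def zero_failure_upper_bound(served: int, confidence: float = 0.95) -> float:
    """Upper bound on the unsafe rate after observing zero unsafe serves.

    Solves (1 - p) ** served = 1 - confidence for p.
    """
    if served <= 0:
        return 1.0  # no served answers, no information
    return 1.0 - (1.0 - confidence) ** (1.0 / served)

for n in (1, 100, 300):
    print(f"served={n:3d}  upper bound={zero_failure_upper_bound(n):.4f}")
```

With a single served answer, zero observed failures is compatible with almost any underlying rate. Pushing the bound below one percent takes a few hundred representative served answers, which is the role of the labeled holdout.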

Keep the mitigation stack small and testable

For this workflow, extra decoding tricks aren't the next priority. The current failure is already localized: the draft adds a fact absent from a live source. Start with controls whose effect can be tested against that source:

Figure: A focused mitigation stack for carrier updates: admit versioned evidence, constrain claims, verify citation support, route unsupported answers, and monitor outcomes. Use the smallest control chain that resolves the measured failure. Broader model interventions can be evaluated later if supported answers still fail.

Layer	Invariant	Failure it prevents
Admit evidence	Record source ID and version before answer generation	Stale or untraceable facts
Generate conservatively	Ask only for claims licensed by source fields	Unnecessary unsupported detail
Verify claims	Every factual clause receives a verdict	Fluent fabrications
Route safely	Failed clauses trigger removal, abstention, or review	Unsafe customer promises
Measure after launch	Retain verdicts, versions, route, and owner	Silent recurrence
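For the "Generate conservatively" layer, one option is to skip free-form generation for the factual sentences and render them from source fields instead. This sketch swaps in deterministic templates, which is a different technique from prompting a model to stay within the fields; `TEMPLATES` and `licensed_sentences` are hypothetical names.

```python
# Hypothetical sentence templates, one per source field the product may talk about.
TEMPLATES = {
    "carrier": "Carrier: {value}.",
    "last_scan": "Last scan: {value}.",
    "last_scan_at": "Scan time: {value}.",
    "delivery_eta": "Expected delivery is {value}.",
}

def licensed_sentences(facts: dict[str, str]) -> list[str]:
    """Render a sentence only for fields that exist in the admitted facts."""
    return [
        template.format(value=facts[field])
        for field, template in TEMPLATES.items()
        if field in facts
    ]

# Facts copied from the lesson's carrier record; there is no delivery_eta field.
facts = {
    "carrier": "FastShip",
    "status": "in transit",
    "last_scan": "departed regional hub",
    "last_scan_at": "May 26 at 08:14 UTC",
}
print(" ".join(licensed_sentences(facts)))
```

An ETA sentence appears only once the feed actually carries a `delivery_eta` field. The claim gate still runs afterwards, since templates can drift from the record schema.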

Hand the next lesson a trace

The next chapter can't monitor a correctness property that this chapter never records. Store enough information to reconstruct the answer decision without retaining unnecessary customer text: source versions, verdict counts, route, and failure stage.

release-trace-contract.py

@dataclass(frozen=True)
class MonitorEvent:
    request_id: str
    evidence_version: str | None
    route: str
    first_failed_stage: str
    verdict_counts: dict[str, int]

def monitor_event(trace: RunTrace) -> MonitorEvent:
    counts = Counter(verify_claim(claim).verdict.value for claim in trace.case.claims)
    return MonitorEvent(
        request_id=trace.request_id,
        evidence_version=tracking.version if trace.evidence_admitted else None,
        route=grounded_router(trace.case),
        first_failed_stage=first_failed_stage(trace),
        verdict_counts=dict(counts),
    )

events = [monitor_event(trace) for trace in traces]
requirements = {
    "known_unsafe_claims_not_served": candidate_metrics["unsafe_serves"] == 0,
    "source_version_logged_when_admitted": all(
        event.evidence_version is not None
        for event in events
        if event.first_failed_stage != "evidence_admission"
    ),
    "representative_labeled_holdout_collected": False,
    "monitoring_alert_owner_assigned": False,
}
failed_requirements = [name for name, passed in requirements.items() if not passed]
decision = "APPROVED" if not failed_requirements else "BLOCKED"

print(f"Candidate promotion: {decision}")
for name in failed_requirements:
    print(f"  missing: {name}")
print(f"Example failed stage: {events[1].first_failed_stage}")

assert decision == "BLOCKED"

Output

Candidate promotion: BLOCKED
  missing: representative_labeled_holdout_collected
  missing: monitoring_alert_owner_assigned
Example failed stage: claim_generation

The candidate gate fixes its known fabricated-ETA regressions, but promotion remains blocked. It still needs a representative labeled holdout and an operational owner for alerts. The code has now produced exactly the facts a monitoring system should aggregate.
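To hand these events to a monitoring pipeline, one common approach is to write one JSON object per line. The sketch below redefines a stand-in `TraceEvent` with the same fields as `MonitorEvent` so it runs on its own; the JSON Lines choice and the helper name are assumptions, not something the lesson prescribes.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class TraceEvent:
    # Mirrors the fields of the lesson's MonitorEvent; redefined so the sketch runs alone.
    request_id: str
    evidence_version: Optional[str]
    route: str
    first_failed_stage: str
    verdict_counts: dict[str, int]

def to_json_line(event: TraceEvent) -> str:
    # sort_keys keeps each line stable across runs, which helps diffing and tests.
    return json.dumps(asdict(event), sort_keys=True)

event = TraceEvent(
    request_id="req-eta",
    evidence_version="scan-feed/2026-05-27T10:00:00Z",
    route="abstain",
    first_failed_stage="claim_generation",
    verdict_counts={"supported": 3, "not_supported": 1},
)
line = to_json_line(event)
print(line)
```

The event carries versions, counts, route, and stage but no customer text, which matches the retention goal stated at the start of this section.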

Mastery check

Key concepts

Supported, not-supported, and contradicted factual claims
Atomic claim verification against admitted evidence
Evidence-versioned citations
Claim support, unsafe-serve, and abstention rates
Consistency sampling as triage rather than truth
Safe answer shortening and abstention
Failure-stage attribution
Regression release gates and monitoring events

Evaluation rubric

Treats absence of source support as a serving failure without claiming outside-world falsity
Decomposes a draft answer into testable factual claims
Blocks both invented ETAs and claims contradicted by admitted state
Demonstrates why stable repeated output isn't proof of truth
Keeps citations coupled to verified source versions
Locates first failure before proposing new mitigation layers
Evaluates safe routing with both unsafe-serve rate and supported coverage
Produces a trace that the monitoring lesson can aggregate

Common pitfalls

Repeated answers are mistaken for true answers

Symptom: A stable sampled ETA is served despite no source field establishing it. Cause: Consistency was treated as evidence. Fix: Use consistency to prioritize verification or review; require admitted source support for customer-facing facts.

Citations are generated but not verified

Symptom: The answer has plausible source tags beside an invented delivery promise. Cause: Citation formatting was checked, but claim-to-source support was not. Fix: Verify each cited factual claim against its source version before serving it.

The retriever is blamed for generator overreach

Symptom: More documents are indexed even though correct tracking evidence was already retrieved. Cause: The team didn't identify the first failed stage. Fix: Separate evidence-admission failure from claim-generation failure in every trace.

Abstention is declared a release success on tiny fixtures

Symptom: Four regression cases are used to claim production factual reliability. Cause: A focused regression suite was confused with a representative holdout. Fix: Keep the regressions, then collect labeled workflow slices and assign monitoring ownership before promotion.

Next Step

Continue to LLM Observability & Monitoring

You can now emit a versioned trace whenever a factual claim is served, removed, or blocked. Next you will turn those traces into quality metrics, alerts, and request-level debugging.

References

[1] Huang et al. (2023). A Survey on Hallucination in Large Language Models.

[2] Min, S., et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023.

[3] Manakul, P., et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. EMNLP 2023.

[4] Farquhar, S., et al. (2024). Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature.

[5] Dhuliawala, S., et al. (2023). Chain-of-Verification Reduces Hallucination in Large Language Models.

Back to Topics

LearnApplied LLM EngineeringHallucination Detection & Mitigation

🛡️MediumAlignment & Safety

Hallucination Detection & Mitigation

Build a claim-level grounding gate for delivery updates that verifies evidence, catches confident fabrication, abstains safely, and records release traces.

14 min read

Learning path

Step 67 of 155 in the full curriculum

Bias & Fairness in LLMs LLM Observability & Monitoring

Three verdicts are enough to start

Hallucination taxonomies can become abstract quickly. For a retrieved delivery record, begin with the source relationship:

Verdict	Meaning in this product	Delivery-update example	Customer action
Supported	Admitted source states the fact	"Last scan: departed regional hub on May 26."	May be served
Not supported	Admitted source doesn't establish the fact	"Delivery is expected May 28."	Remove or abstain
Contradicted	Admitted source states an incompatible fact	"Your order has been delivered." while status is `in transit`	Block and investigate

Figure 1: A grounded delivery answer is served only when every factual claim has an admitted supporting source.

Put the evidence in code

The carrier record isn't prose decoration. It's the authority the answer must obey. Our first cell records its source ID and version alongside the candidate claims.

versioned-carrier-evidence.py

from collections import Counter
from dataclasses import dataclass
from enum import Enum

@dataclass(frozen=True)
class Evidence:
    source_id: str
    version: str
    facts: dict[str, str]

@dataclass(frozen=True)
class Claim:
    claim_id: str
    text: str
    field: str
    value: str
    citation_id: str

tracking = Evidence(
    source_id="fastship-A10234",
    version="scan-feed/2026-05-27T10:00:00Z",
    facts={
        "carrier": "FastShip",
        "status": "in transit",
        "last_scan": "departed regional hub",
        "last_scan_at": "May 26 at 08:14 UTC",
    },
)
sources = {tracking.source_id: tracking}

draft_claims = [
    Claim("carrier", "Carrier: FastShip.", "carrier", "FastShip", tracking.source_id),
    Claim("scan_place", "Last scan: departed regional hub.", "last_scan", "departed regional hub", tracking.source_id),
    Claim("scan_time", "Scan time: May 26 at 08:14 UTC.", "last_scan_at", "May 26 at 08:14 UTC", tracking.source_id),
    Claim("eta", "Expected delivery is May 28.", "delivery_eta", "May 28", tracking.source_id),
]

print(f"Evidence version: {tracking.version}")
print(f"Available facts: {sorted(tracking.facts)}")
print(f"Draft factual claims: {len(draft_claims)}")

Output

Evidence version: scan-feed/2026-05-27T10:00:00Z
Available facts: ['carrier', 'last_scan', 'last_scan_at', 'status']
Draft factual claims: 4

An LLM can write the ETA sentence because it has seen delivery-date patterns. That doesn't make the sentence admissible. The admitted evidence has no delivery_eta field.

Verify each atomic claim

claim-verdicts.py

class Verdict(str, Enum):
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    CONTRADICTED = "contradicted"
    NO_SOURCE = "no_source"

@dataclass(frozen=True)
class Verification:
    claim: Claim
    verdict: Verdict
    evidence_version: str | None

def verify_claim(claim: Claim) -> Verification:
    source = sources.get(claim.citation_id)
    if source is None:
        return Verification(claim, Verdict.NO_SOURCE, None)
    expected = source.facts.get(claim.field)
    if expected is None:
        return Verification(claim, Verdict.NOT_SUPPORTED, source.version)
    if expected != claim.value:
        return Verification(claim, Verdict.CONTRADICTED, source.version)
    return Verification(claim, Verdict.SUPPORTED, source.version)

draft_verdicts = [verify_claim(claim) for claim in draft_claims]
delivered_claim = Claim(
    "delivered",
    "Order #A10234 has been delivered.",
    "status",
    "delivered",
    tracking.source_id,
)

assert [item.verdict for item in draft_verdicts] == [
    Verdict.SUPPORTED,
    Verdict.SUPPORTED,
    Verdict.SUPPORTED,
    Verdict.NOT_SUPPORTED,
]
assert verify_claim(delivered_claim).verdict == Verdict.CONTRADICTED

for result in draft_verdicts + [verify_claim(delivered_claim)]:
    print(f"{result.claim.claim_id:10} {result.verdict.value:15} {result.claim.text}")

Output

carrier    supported       Carrier: FastShip.
scan_place supported       Last scan: departed regional hub.
scan_time  supported       Scan time: May 26 at 08:14 UTC.
eta        not_supported   Expected delivery is May 28.
delivered  contradicted    Order #A10234 has been delivered.

The distinction matters. The ETA isn't proven false; it's simply absent from this source. The delivered statement is worse: it conflicts with status=in transit.

Route the response, not only the score

safe-answer-route.py

def safe_answer(claims: list[Claim]) -> dict[str, object]:
    verdicts = [verify_claim(claim) for claim in claims]
    failures = [item for item in verdicts if item.verdict != Verdict.SUPPORTED]
    supported_text = " ".join(
        item.claim.text for item in verdicts if item.verdict == Verdict.SUPPORTED
    )
    if failures:
        return {
            "route": "abstain_on_missing_detail",
            "answer": supported_text + " The carrier record does not provide a delivery estimate yet.",
            "blocked_claims": [item.claim.claim_id for item in failures],
        }
    return {
        "route": "serve",
        "answer": supported_text,
        "blocked_claims": [],
    }

decision = safe_answer(draft_claims)
assert decision["route"] == "abstain_on_missing_detail"
assert decision["blocked_claims"] == ["eta"]
assert "May 28" not in str(decision["answer"])

print(f"Route: {decision['route']}")
print(f"Blocked claims: {decision['blocked_claims']}")
print(f"Customer answer: {decision['answer']}")

Output

Route: abstain_on_missing_detail
Blocked claims: ['eta']
Customer answer: Carrier: FastShip. Last scan: departed regional hub. Scan time: May 26 at 08:14 UTC. The carrier record does not provide a delivery estimate yet.
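
The route above attaches the same "no delivery estimate" sentence to every failure, which fits the invented ETA but not a contradicted status. The verdict table at the start of the lesson gives contradicted claims a stricter action (block and investigate) than unsupported ones (remove or abstain). The following is a minimal sketch of that split, not the lesson's implementation: `CheckedClaim`, `route_by_verdict`, and the route labels are hypothetical, and verdicts are passed in as plain strings instead of calling `verify_claim`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckedClaim:
    """Hypothetical stand-in for a claim that already carries a verdict."""
    claim_id: str
    text: str
    verdict: str  # "supported", "not_supported", "contradicted", or "no_source"


def route_by_verdict(checked: list[CheckedClaim]) -> dict[str, object]:
    supported = [c.text for c in checked if c.verdict == "supported"]
    blocked = [c.claim_id for c in checked if c.verdict != "supported"]
    # A contradiction means the draft disagrees with the record, so nothing is served.
    if any(c.verdict == "contradicted" for c in checked):
        return {"route": "block_and_investigate", "answer": None, "blocked_claims": blocked}
    # With no supported sentence left, there is nothing safe to shorten to.
    if blocked and not supported:
        return {"route": "abstain", "answer": None, "blocked_claims": blocked}
    if blocked:
        return {"route": "serve_shortened", "answer": " ".join(supported), "blocked_claims": blocked}
    return {"route": "serve", "answer": " ".join(supported), "blocked_claims": []}


eta_draft = [
    CheckedClaim("carrier", "Carrier: FastShip.", "supported"),
    CheckedClaim("eta", "Expected delivery is May 28.", "not_supported"),
]
delivered_draft = [
    CheckedClaim("carrier", "Carrier: FastShip.", "supported"),
    CheckedClaim("delivered", "Order #A10234 has been delivered.", "contradicted"),
]

print(route_by_verdict(eta_draft)["route"])        # serve_shortened
print(route_by_verdict(delivered_draft)["route"])  # block_and_investigate
```

Whether a shortened answer should also tell the customer which detail is missing is a product decision; the sketch only separates the routes.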

Measure failure before choosing mitigation

grounding-regression-suite.py

@dataclass(frozen=True)
class AnswerCase:
    case_id: str
    claims: list[Claim]

cases = [
    AnswerCase("clean_scan", draft_claims[:3]),
    AnswerCase("invented_eta", draft_claims),
    AnswerCase("wrong_status", [delivered_claim]),
    AnswerCase(
        "unadmitted_source",
        [Claim("carrier", "FastShip is carrying order #B404.", "carrier", "FastShip", "missing-feed")],
    ),
]

def has_unsafe_claim(case: AnswerCase) -> bool:
    return any(verify_claim(claim).verdict != Verdict.SUPPORTED for claim in case.claims)

baseline_served_unsafe = sum(has_unsafe_claim(case) for case in cases)
verdict_counts = Counter(
    verify_claim(claim).verdict.value
    for case in cases
    for claim in case.claims
)

print(f"Baseline unsafe serves if all drafts ship: {baseline_served_unsafe}/{len(cases)}")
print(f"Claim verdict counts: {dict(verdict_counts)}")
assert baseline_served_unsafe == 3

Output

Baseline unsafe serves if all drafts ship: 3/4
Claim verdict counts: {'supported': 6, 'not_supported': 1, 'contradicted': 1, 'no_source': 1}

For a grounded status product, three simple metrics are useful:

Metric	Calculation	Release meaning
Claim support rate	Supported factual claims / all factual claims	How much draft content evidence admits
Unsafe serve rate	Served answers containing any failed factual claim / served answers	Whether bad claims reach customers
Abstention rate	Answers withheld or safely shortened / total requests	Cost of being cautious

Claim support can improve while unsafe serves remain unacceptable. A single contradicted delivery status shown to a customer is still a serious failure.
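
The three rates in the table can be computed from per-request records. The sketch below assumes a hypothetical `RequestRecord` with illustrative field names, and it assumes that "served answers" means every answer that reached the customer, including shortened ones. It is one way to read the table, not the lesson's own metric code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """Hypothetical per-request log row; field names are illustrative."""
    route: str                # "serve", "shorten", or "withhold"
    draft_claims: int         # factual claims in the draft answer
    draft_supported: int      # draft claims with a supported verdict
    failed_claims_shown: int  # failed claims that still reached the customer


def rates(records: list[RequestRecord]) -> dict[str, float]:
    total_claims = sum(r.draft_claims for r in records)
    # Assumption: a shortened answer still counts as served to the customer.
    reached = [r for r in records if r.route != "withhold"]
    cautious = [r for r in records if r.route in ("shorten", "withhold")]
    return {
        "claim_support_rate": sum(r.draft_supported for r in records) / max(total_claims, 1),
        "unsafe_serve_rate": sum(r.failed_claims_shown > 0 for r in reached) / max(len(reached), 1),
        "abstention_rate": len(cautious) / max(len(records), 1),
    }


records = [
    RequestRecord("serve", 3, 3, 0),
    RequestRecord("shorten", 4, 3, 0),
    RequestRecord("withhold", 1, 0, 0),
    RequestRecord("serve", 2, 1, 1),  # one failed claim slipped through
]
summary = rates(records)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this made-up sample, 7 of 10 draft claims are supported, yet one of the three answers that reached a customer still carried a failed claim. That is the gap between the first two metrics.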

Consistency is an alarm, not evidence

Agreement across sampled generations is a useful triage signal, but consistency checks don't authorize a delivery claim. A model can repeat the same unsupported ETA every time.

consistency-is-not-truth.py

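def disagreement_rate(values: list[str]) -> float:
    """Share of samples that differ from the most common sample (0.0 means all agree)."""
    if not values:
        raise ValueError("disagreement_rate needs at least one sample")
    most_common_count = Counter(values).most_common(1)[0][1]
    return 1 - most_common_count / len(values)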

unstable_eta_samples = ["May 28", "May 29", "May 28", "May 30"]
stable_false_eta_samples = ["May 28", "May 28", "May 28", "May 28"]
eta_verdict = verify_claim(draft_claims[-1]).verdict

print(f"Unstable ETA disagreement: {disagreement_rate(unstable_eta_samples):.2f}")
print(f"Repeated ETA disagreement: {disagreement_rate(stable_false_eta_samples):.2f}")
print(f"Repeated ETA evidence verdict: {eta_verdict.value}")

assert disagreement_rate(stable_false_eta_samples) == 0.0
assert eta_verdict == Verdict.NOT_SUPPORTED

Output

Unstable ETA disagreement: 0.50
Repeated ETA disagreement: 0.00
Repeated ETA evidence verdict: not_supported

This corrects a common misconception: low disagreement across samples means "the model repeats itself," not "the fact is true."
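
The disagreement rate only looks at the most common sample. A slightly richer alarm is Shannon entropy over the whole sample distribution. The sketch below is an illustration under a strong assumption: samples are grouped by exact match after lowercasing and trimming, whereas the semantic entropy work cited in this lesson^[4] groups samples by meaning, which this code does not attempt. The name `sample_entropy_bits` is hypothetical.

```python
import math
from collections import Counter


def sample_entropy_bits(samples: list[str]) -> float:
    """Shannon entropy in bits of the samples, grouped by normalized exact match."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(s.strip().lower() for s in samples)
    total = len(samples)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


unstable = ["May 28", "May 29", "May 28", "May 30"]
repeated = ["May 28", "May 28", "May 28", "May 28"]

print(f"{sample_entropy_bits(unstable):.2f}")  # 1.50
print(f"{sample_entropy_bits(repeated):.2f}")  # 0.00
```

Zero entropy on the repeated ETA carries the same caveat as zero disagreement: it says nothing about evidence, and the claim stays `not_supported`.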

Combine signals with the right authority

evidence-first-routing.py

def route_with_signals(case: AnswerCase, sampled_statuses: list[str]) -> str:
    if has_unsafe_claim(case):
        return "abstain_evidence_failure"
    if disagreement_rate(sampled_statuses) > 0.25:
        return "review_unstable_generation"
    return "serve_supported_answer"

clean_case = cases[0]
eta_case = cases[1]

routes = {
    "clean_stable": route_with_signals(clean_case, ["in transit"] * 4),
    "clean_unstable": route_with_signals(
        clean_case, ["in transit", "in transit", "out for delivery", "delivered"]
    ),
    "eta_repeated": route_with_signals(eta_case, stable_false_eta_samples),
}

for name, route_name in routes.items():
    print(f"{name:14} -> {route_name}")

assert routes["clean_stable"] == "serve_supported_answer"
assert routes["clean_unstable"] == "review_unstable_generation"
assert routes["eta_repeated"] == "abstain_evidence_failure"

Output

clean_stable   -> serve_supported_answer
clean_unstable -> review_unstable_generation
eta_repeated   -> abstain_evidence_failure

Citations must be checked, not decorated

citation-faithfulness.py

def cited_sentence(claim: Claim) -> str:
    result = verify_claim(claim)
    if result.verdict != Verdict.SUPPORTED:
        raise ValueError(f"cannot cite {claim.claim_id}: {result.verdict.value}")
    return f"{claim.text} [{claim.citation_id}@{result.evidence_version}]"

served_sentences = [cited_sentence(claim) for claim in draft_claims[:3]]
served_answer = " ".join(served_sentences)

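blocked_citation = ""  # stays empty if no error is raised, so the assert below fails clearly
try:
    cited_sentence(draft_claims[-1])
except ValueError as error:
    # The unsupported ETA must never receive a citation.
    blocked_citation = str(error)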

print(served_answer)
print(blocked_citation)
assert "May 28" not in served_answer
assert "not_supported" in blocked_citation

Output

Carrier: FastShip. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z] Last scan: departed regional hub. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z] Scan time: May 26 at 08:14 UTC. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z]
cannot cite eta: not_supported
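
A citation check can also run on the final text just before serving. The sketch below parses tags in the `[source_id@version]` shape printed above and reports any tag that does not point at admitted evidence. `ADMITTED`, `CITATION_TAG`, and `unknown_citations` are hypothetical names, and the pattern assumes that source IDs and versions never contain `@`, `[`, or `]`.

```python
import re

# Assumption: the (source_id, version) pairs admitted for this request.
ADMITTED = {("fastship-A10234", "scan-feed/2026-05-27T10:00:00Z")}

# Matches tags shaped like [source_id@version].
CITATION_TAG = re.compile(r"\[([^\[\]@]+)@([^\[\]@]+)\]")


def unknown_citations(answer: str) -> list[tuple[str, str]]:
    """Return cited (source_id, version) pairs that were never admitted."""
    return [pair for pair in CITATION_TAG.findall(answer) if pair not in ADMITTED]


good = "Carrier: FastShip. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z]"
stale = "Carrier: FastShip. [fastship-A10234@scan-feed/2026-05-20T10:00:00Z]"

print(unknown_citations(good))   # []
print(unknown_citations(stale))  # the stale version is reported
```

A tag check like this only shows that a citation points at admitted evidence. Whether the sentence next to it is supported still depends on the claim-level verdict.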

Attribute failures before adding complexity

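An unsupported answer can originate in retrieval or in generation, and unstable sampling is a separate problem again. Locate the first failed stage before adding a mitigation: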

First failed stage	Symptom	Appropriate next action
Evidence admission	No carrier record was retrieved for an order	Fix retrieval, permissions, freshness, or tool failure
Claim generation	Source exists, but answer adds unsupported ETA	Tighten generation and post-generation claim gate
Consistency only	Supported facts vary across samples	Review decoding or prompt stability; don't call it a source failure

first-failure-attribution.py

@dataclass(frozen=True)
class RunTrace:
    request_id: str
    case: AnswerCase
    evidence_admitted: bool
    sampled_statuses: list[str]

def first_failed_stage(trace: RunTrace) -> str:
    if not trace.evidence_admitted:
        return "evidence_admission"
    if has_unsafe_claim(trace.case):
        return "claim_generation"
    if disagreement_rate(trace.sampled_statuses) > 0.25:
        return "generation_stability"
    return "passed"

traces = [
    RunTrace("req-clean", cases[0], True, ["in transit"] * 4),
    RunTrace("req-eta", cases[1], True, stable_false_eta_samples),
    RunTrace("req-missing", cases[3], False, ["unknown"] * 4),
    RunTrace("req-vary", cases[0], True, ["in transit", "delivered", "in transit", "out for delivery"]),
]

for trace in traces:
    print(f"{trace.request_id:11} -> {first_failed_stage(trace)}")

assert [first_failed_stage(trace) for trace in traces] == [
    "passed",
    "claim_generation",
    "evidence_admission",
    "generation_stability",
]

Output

req-clean   -> passed
req-eta     -> claim_generation
req-missing -> evidence_admission
req-vary    -> generation_stability

Evaluate a candidate gate

Research benchmarks help compare methods, but they don't exercise this carrier feed, prompt, citation format, or customer route. Use external datasets for breadth and product regressions for release:

Artifact	Teaches or tests	Place in this workflow
SelfCheckGPT^[3]	Sample consistency without external facts	Triage signal when evidence is absent or costly
Semantic entropy^[4]	Meaning-level uncertainty over samples	Escalation feature for confabulations
FActScore^[2]	Atomic factual precision	Design model for claim-level verification
ShopFlow regression	Exact carrier facts and response policy	Release gate for `tracking-answerer-v1`

candidate-gate-metrics.py

def baseline_router(case: AnswerCase) -> str:
    return "serve"

def grounded_router(case: AnswerCase) -> str:
    return "abstain" if has_unsafe_claim(case) else "serve"

def score_router(router) -> dict[str, float | int]:
    served = [case for case in cases if router(case) == "serve"]
    unsafe_serves = sum(has_unsafe_claim(case) for case in served)
    supported_cases = [case for case in cases if not has_unsafe_claim(case)]
    supported_serves = sum(router(case) == "serve" for case in supported_cases)
    return {
        "served": len(served),
        "unsafe_serves": unsafe_serves,
        # max(..., 1) avoids dividing by zero when nothing is served or nothing is supported.
        "unsafe_serve_rate": unsafe_serves / max(len(served), 1),
        "supported_coverage": supported_serves / max(len(supported_cases), 1),
        "abstentions": len(cases) - len(served),
    }

baseline_metrics = score_router(baseline_router)
candidate_metrics = score_router(grounded_router)

for name, metrics in [("baseline", baseline_metrics), ("candidate", candidate_metrics)]:
    print(
        f"{name:9} unsafe_rate={metrics['unsafe_serve_rate']:.1%} "
        f"coverage={metrics['supported_coverage']:.1%} abstentions={metrics['abstentions']}"
    )

assert baseline_metrics["unsafe_serve_rate"] == 0.75
assert candidate_metrics["unsafe_serve_rate"] == 0.0
assert candidate_metrics["supported_coverage"] == 1.0

Output

baseline  unsafe_rate=75.0% coverage=100.0% abstentions=0
candidate unsafe_rate=0.0% coverage=100.0% abstentions=3

Keep the mitigation stack small and testable

Layer	Invariant	Failure it prevents
Admit evidence	Record source ID and version before answer generation	Stale or untraceable facts
Generate conservatively	Ask only for claims licensed by source fields	Unnecessary unsupported detail
Verify claims	Every factual clause receives a verdict	Fluent fabrications
Route safely	Failed clauses trigger removal, abstention, or review	Unsafe customer promises
Measure after launch	Retain verdicts, versions, route, and owner	Silent recurrence

Hand the next lesson a trace

release-trace-contract.py

@dataclass(frozen=True)
class MonitorEvent:
    request_id: str
    evidence_version: str | None
    route: str
    first_failed_stage: str
    verdict_counts: dict[str, int]

def monitor_event(trace: RunTrace) -> MonitorEvent:
    counts = Counter(verify_claim(claim).verdict.value for claim in trace.case.claims)
    return MonitorEvent(
        request_id=trace.request_id,
        evidence_version=tracking.version if trace.evidence_admitted else None,
        route=grounded_router(trace.case),
        first_failed_stage=first_failed_stage(trace),
        verdict_counts=dict(counts),
    )

events = [monitor_event(trace) for trace in traces]
requirements = {
    "known_unsafe_claims_not_served": candidate_metrics["unsafe_serves"] == 0,
    "source_version_logged_when_admitted": all(
        event.evidence_version is not None
        for event in events
        if event.first_failed_stage != "evidence_admission"
    ),
    "representative_labeled_holdout_collected": False,
    "monitoring_alert_owner_assigned": False,
}
failed_requirements = [name for name, passed in requirements.items() if not passed]
decision = "APPROVED" if not failed_requirements else "BLOCKED"

print(f"Candidate promotion: {decision}")
for name in failed_requirements:
    print(f"  missing: {name}")
print(f"Example failed stage: {events[1].first_failed_stage}")

assert decision == "BLOCKED"

Output

Candidate promotion: BLOCKED
  missing: representative_labeled_holdout_collected
  missing: monitoring_alert_owner_assigned
Example failed stage: claim_generation

Mastery check

Key concepts

Supported, not-supported, and contradicted factual claims
Atomic claim verification against admitted evidence
Evidence-versioned citations
Claim support, unsafe-serve, and abstention rates
Consistency sampling as triage rather than truth
Safe answer shortening and abstention
Failure-stage attribution
Regression release gates and monitoring events

Evaluation rubric

Treats absence of source support as a serving failure without claiming outside-world falsity
Decomposes a draft answer into testable factual claims
Blocks both invented ETAs and claims contradicted by admitted state
Demonstrates why stable repeated output isn't proof of truth
Keeps citations coupled to verified source versions
Locates first failure before proposing new mitigation layers
Evaluates safe routing with both unsafe-serve rate and supported coverage
Produces a trace that the monitoring lesson can aggregate

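Follow-up questions

FastShip later delivers the parcel on May 28. Was it correct to block the earlier draft ETA?

Four sampled generations all say "Delivery is May 28." May the system now serve the date?

The source record exists, but the response says "delivered" while its status is in transit. Is that not supported or contradicted?

Why record evidence versions and first failed stage in monitoring events?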

Common pitfalls

Repeated answers are mistaken for true answers

Citations are generated but not verified

The retriever is blamed for generator overreach

Abstention is declared a release success on tiny fixtures

Next Step

Continue to LLM Observability & Monitoring

You can now emit a versioned trace whenever a factual claim is served, removed, or blocked. Next you will turn those traces into quality metrics, alerts, and request-level debugging.


References

1. Huang et al. (2023). A Survey on Hallucination in Large Language Models.

2. Min, S., et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023.

3. Manakul, P., et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. EMNLP 2023.

4. Farquhar, S., et al. (2024). Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature.

5. Dhuliawala, S., et al. (2023). Chain-of-Verification Reduces Hallucination in Large Language Models.

