Build a claim-level grounding gate for delivery updates that verifies evidence, catches confident fabrication, abstains safely, and records release traces.
The fairness lesson blocked a soft evaluator when it could route equivalent customer requests differently. Factual reliability needs an equally strict boundary: a fluent large language model (LLM) answer can't invent a carrier event just because the customer would like certainty.
ShopFlow's next ticket asks, "Where is order #A10234?" The admitted FastShip record says the parcel departed a regional hub on May 26 at 08:14 UTC. It contains no delivery estimate. A draft answer that adds "It will arrive on May 28" may sound helpful, but the system has no source for that promise.
This lesson builds tracking-answerer-v1, a small serving gate. It turns an answer into atomic claims, checks each claim against versioned evidence, routes unsupported details to abstention, and records why release remains blocked. For a customer-facing grounded answer, "not present in admitted evidence" is enough reason not to serve a factual claim. It doesn't prove the claim is false in the wider world.
Hallucination taxonomies can become abstract quickly. For a retrieved delivery record, begin with the source relationship:
| Verdict | Meaning in this product | Delivery-update example | Customer action |
|---|---|---|---|
| Supported | Admitted source states the fact | "Last scan: departed regional hub on May 26." | May be served |
| Not supported | Admitted source doesn't establish the fact | "Delivery is expected May 28." | Remove or abstain |
| Contradicted | Admitted source states an incompatible fact | "Your order has been delivered." while status is in transit | Block and investigate |
Surveys of LLM hallucination distinguish statements that conflict with supplied context from statements that add information not established by it.[1] In an open-world setting, an unsupported statement might later turn out true. In this answer path it's still unsafe, because the product promised evidence-grounded delivery updates.
Figure 1: A grounded delivery answer is served only when every factual claim has an admitted supporting source.
The carrier record isn't prose decoration. It's the authority the answer must obey. Our first cell records its source ID and version alongside the candidate claims.
```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

@dataclass(frozen=True)
class Evidence:
    source_id: str
    version: str
    facts: dict[str, str]

@dataclass(frozen=True)
class Claim:
    claim_id: str
    text: str
    field: str
    value: str
    citation_id: str

tracking = Evidence(
    source_id="fastship-A10234",
    version="scan-feed/2026-05-27T10:00:00Z",
    facts={
        "carrier": "FastShip",
        "status": "in transit",
        "last_scan": "departed regional hub",
        "last_scan_at": "May 26 at 08:14 UTC",
    },
)
sources = {tracking.source_id: tracking}

draft_claims = [
    Claim("carrier", "Carrier: FastShip.", "carrier", "FastShip", tracking.source_id),
    Claim("scan_place", "Last scan: departed regional hub.", "last_scan", "departed regional hub", tracking.source_id),
    Claim("scan_time", "Scan time: May 26 at 08:14 UTC.", "last_scan_at", "May 26 at 08:14 UTC", tracking.source_id),
    Claim("eta", "Expected delivery is May 28.", "delivery_eta", "May 28", tracking.source_id),
]

print(f"Evidence version: {tracking.version}")
print(f"Available facts: {sorted(tracking.facts)}")
print(f"Draft factual claims: {len(draft_claims)}")
```

```text
Evidence version: scan-feed/2026-05-27T10:00:00Z
Available facts: ['carrier', 'last_scan', 'last_scan_at', 'status']
Draft factual claims: 4
```

An LLM can write the ETA sentence because it has seen delivery-date patterns. That doesn't make the sentence admissible. The admitted evidence has no delivery_eta field.
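The gap is visible even before any verifier runs: compare the fields the draft relies on with the fields the record admits. The sketch below is self-contained and uses plain dictionaries; `admitted_facts`, `draft_fields`, and `unlicensed_fields` are illustrative names standing in for the record and draft claims above.

```python
# Fields present in the admitted carrier record (copied from the record above).
admitted_facts = {
    "carrier": "FastShip",
    "status": "in transit",
    "last_scan": "departed regional hub",
    "last_scan_at": "May 26 at 08:14 UTC",
}

# Fields the draft answer makes claims about (illustrative list).
draft_fields = ["carrier", "last_scan", "last_scan_at", "delivery_eta"]

# Any field the draft relies on that the record never provides.
unlicensed_fields = [field for field in draft_fields if field not in admitted_facts]

print(f"Fields without admitted evidence: {unlicensed_fields}")
```

A non-empty list here only says a field is missing. It says nothing about whether the values of the remaining fields match, which is the verifier's job.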
FActScore popularized evaluating long-form generations as atomic facts rather than one answer-level impression.[2] The verifier below uses the same reasoning on a small operational record: a claim is either supported by its cited source, missing from that source, contradicted by a different value, or missing an admitted source entirely.
```python
class Verdict(str, Enum):
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    CONTRADICTED = "contradicted"
    NO_SOURCE = "no_source"

@dataclass(frozen=True)
class Verification:
    claim: Claim
    verdict: Verdict
    evidence_version: str | None

def verify_claim(claim: Claim) -> Verification:
    source = sources.get(claim.citation_id)
    if source is None:
        return Verification(claim, Verdict.NO_SOURCE, None)
    expected = source.facts.get(claim.field)
    if expected is None:
        return Verification(claim, Verdict.NOT_SUPPORTED, source.version)
    if expected != claim.value:
        return Verification(claim, Verdict.CONTRADICTED, source.version)
    return Verification(claim, Verdict.SUPPORTED, source.version)

draft_verdicts = [verify_claim(claim) for claim in draft_claims]
delivered_claim = Claim(
    "delivered",
    "Order #A10234 has been delivered.",
    "status",
    "delivered",
    tracking.source_id,
)

assert [item.verdict for item in draft_verdicts] == [
    Verdict.SUPPORTED,
    Verdict.SUPPORTED,
    Verdict.SUPPORTED,
    Verdict.NOT_SUPPORTED,
]
assert verify_claim(delivered_claim).verdict == Verdict.CONTRADICTED

for result in draft_verdicts + [verify_claim(delivered_claim)]:
    print(f"{result.claim.claim_id:10} {result.verdict.value:15} {result.claim.text}")
```

```text
carrier    supported       Carrier: FastShip.
scan_place supported       Last scan: departed regional hub.
scan_time  supported       Scan time: May 26 at 08:14 UTC.
eta        not_supported   Expected delivery is May 28.
delivered  contradicted    Order #A10234 has been delivered.
```

The distinction matters. The ETA isn't proven false; it's simply absent from this source. The delivered statement is worse: it conflicts with status=in transit.
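The verifier compares strings exactly, so a draft that writes "In Transit" would be marked contradicted even though it agrees with the record. One possible refinement, offered as an assumption rather than as part of tracking-answerer-v1, is to normalize case and whitespace before comparing. The helper names `normalize` and `compare_values` are hypothetical.

```python
from typing import Optional

def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count."""
    return " ".join(value.lower().split())

def compare_values(expected: Optional[str], claimed: str) -> str:
    """Return a verdict label for one claimed value against one admitted value."""
    if expected is None:
        return "not_supported"   # the record has no such field
    if normalize(expected) != normalize(claimed):
        return "contradicted"    # the record states something else
    return "supported"

print(compare_values("in transit", "In  Transit"))
print(compare_values("in transit", "delivered"))
print(compare_values(None, "May 28"))
```

Normalization should stay conservative. Anything that changes meaning, such as rounding a timestamp or mapping "out for delivery" to "in transit", would let a real contradiction pass as support.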
A detector that writes a warning beside an unsafe response has not protected the customer. Once any factual claim fails, tracking-answerer-v1 keeps supported information and replaces the invented promise with a bounded statement.
```python
def safe_answer(claims: list[Claim]) -> dict[str, object]:
    verdicts = [verify_claim(claim) for claim in claims]
    failures = [item for item in verdicts if item.verdict != Verdict.SUPPORTED]
    supported_text = " ".join(
        item.claim.text for item in verdicts if item.verdict == Verdict.SUPPORTED
    )
    if failures:
        return {
            "route": "abstain_on_missing_detail",
            "answer": supported_text + " The carrier record does not provide a delivery estimate yet.",
            "blocked_claims": [item.claim.claim_id for item in failures],
        }
    return {
        "route": "serve",
        "answer": supported_text,
        "blocked_claims": [],
    }

decision = safe_answer(draft_claims)
assert decision["route"] == "abstain_on_missing_detail"
assert decision["blocked_claims"] == ["eta"]
assert "May 28" not in str(decision["answer"])

print(f"Route: {decision['route']}")
print(f"Blocked claims: {decision['blocked_claims']}")
print(f"Customer answer: {decision['answer']}")
```

```text
Route: abstain_on_missing_detail
Blocked claims: ['eta']
Customer answer: Carrier: FastShip. Last scan: departed regional hub. Scan time: May 26 at 08:14 UTC. The carrier record does not provide a delivery estimate yet.
```

A single blocked ETA gives a regression case, not a release metric. Build a small suite of four cases: one clean scan summary and three distinct failure modes (an unsupported ETA, a contradiction, and a missing source).
```python
@dataclass(frozen=True)
class AnswerCase:
    case_id: str
    claims: list[Claim]

cases = [
    AnswerCase("clean_scan", draft_claims[:3]),
    AnswerCase("invented_eta", draft_claims),
    AnswerCase("wrong_status", [delivered_claim]),
    AnswerCase(
        "unadmitted_source",
        [Claim("carrier", "FastShip is carrying order #B404.", "carrier", "FastShip", "missing-feed")],
    ),
]

def has_unsafe_claim(case: AnswerCase) -> bool:
    return any(verify_claim(claim).verdict != Verdict.SUPPORTED for claim in case.claims)

baseline_served_unsafe = sum(has_unsafe_claim(case) for case in cases)
verdict_counts = Counter(
    verify_claim(claim).verdict.value
    for case in cases
    for claim in case.claims
)

print(f"Baseline unsafe serves if all drafts ship: {baseline_served_unsafe}/{len(cases)}")
print(f"Claim verdict counts: {dict(verdict_counts)}")
assert baseline_served_unsafe == 3
```

```text
Baseline unsafe serves if all drafts ship: 3/4
Claim verdict counts: {'supported': 6, 'not_supported': 1, 'contradicted': 1, 'no_source': 1}
```

For a grounded status product, three simple metrics are useful:
| Metric | Calculation | Release meaning |
|---|---|---|
| Claim support rate | Supported factual claims / all factual claims | How much draft content evidence admits |
| Unsafe serve rate | Served answers containing any failed factual claim / served answers | Whether bad claims reach customers |
| Abstention rate | Answers withheld or safely shortened / total requests | Cost of being cautious |
Claim support can improve while unsafe serves remain unacceptable. A single contradicted delivery status shown to a customer is still a serious failure.
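To make the table concrete, the sketch below computes all three rates for the four-case suite, assuming a gate that serves only fully supported drafts. It is self-contained, so the per-case verdicts are written out by hand to match the verdict counts printed earlier. The `CaseResult` record and `has_failed_claim` helper are illustrative names, not part of the lesson's code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CaseResult:
    case_id: str
    verdicts: list[str]  # one verdict label per factual claim
    served: bool         # True if the draft was served as written

def has_failed_claim(result: CaseResult) -> bool:
    return any(verdict != "supported" for verdict in result.verdicts)

# Assumed gate behaviour: only the clean case is served as drafted.
results = [
    CaseResult("clean_scan", ["supported"] * 3, served=True),
    CaseResult("invented_eta", ["supported"] * 3 + ["not_supported"], served=False),
    CaseResult("wrong_status", ["contradicted"], served=False),
    CaseResult("unadmitted_source", ["no_source"], served=False),
]

all_verdicts = [verdict for result in results for verdict in result.verdicts]
served = [result for result in results if result.served]

# Supported factual claims / all factual claims.
claim_support_rate = all_verdicts.count("supported") / len(all_verdicts)
# Served answers containing any failed claim / served answers.
unsafe_serve_rate = sum(has_failed_claim(r) for r in served) / max(len(served), 1)
# Answers withheld or shortened / total requests.
abstention_rate = (len(results) - len(served)) / len(results)

print(f"Claim support rate: {claim_support_rate:.1%}")
print(f"Unsafe serve rate:  {unsafe_serve_rate:.1%}")
print(f"Abstention rate:    {abstention_rate:.1%}")
```

Six of nine claims are supported, yet only one of four drafts is safe to serve. The claim-level rate alone would hide that.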
Black-box methods such as SelfCheckGPT compare an answer to additional sampled generations; disagreement can flag factual content that the model is inventing without an external database.[3] Semantic entropy similarly groups sampled answers by meaning and detects high uncertainty when those meanings vary.[4] These are useful when no admitted source is available or when you need to prioritize expensive checks.
They don't authorize a delivery claim. A model can repeat the same unsupported ETA every time.
```python
def disagreement_rate(values: list[str]) -> float:
    most_common_count = Counter(values).most_common(1)[0][1]
    return 1 - most_common_count / len(values)

unstable_eta_samples = ["May 28", "May 29", "May 28", "May 30"]
stable_false_eta_samples = ["May 28", "May 28", "May 28", "May 28"]
eta_verdict = verify_claim(draft_claims[-1]).verdict

print(f"Unstable ETA disagreement: {disagreement_rate(unstable_eta_samples):.2f}")
print(f"Repeated ETA disagreement: {disagreement_rate(stable_false_eta_samples):.2f}")
print(f"Repeated ETA evidence verdict: {eta_verdict.value}")

assert disagreement_rate(stable_false_eta_samples) == 0.0
assert eta_verdict == Verdict.NOT_SUPPORTED
```

```text
Unstable ETA disagreement: 0.50
Repeated ETA disagreement: 0.00
Repeated ETA evidence verdict: not_supported
```

This fixes a common misconception: low uncertainty means "the model repeats itself," not "the fact is true."
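The disagreement rate above treats every distinct string as a distinct answer. Semantic entropy instead groups samples by meaning before measuring spread. The following is a simplified sketch of that idea, not the published method: `meaning_key` is a hypothetical stand-in that only normalizes surface form, where a real system would need a proper semantic equivalence check.

```python
import math
from collections import Counter

def meaning_key(answer: str) -> str:
    """Hypothetical stand-in for semantic clustering: normalize surface form only."""
    return " ".join(answer.lower().replace(".", "").split())

def cluster_entropy(samples: list[str]) -> float:
    """Shannon entropy in bits over clusters of samples sharing a meaning key."""
    counts = Counter(meaning_key(sample) for sample in samples)
    total = sum(counts.values())
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

same_meaning = ["May 28", "may 28.", "May  28", "May 28"]
mixed_meaning = ["May 28", "May 29", "May 28", "May 30"]

print(f"Same meaning entropy:  {cluster_entropy(same_meaning):.2f} bits")
print(f"Mixed meaning entropy: {cluster_entropy(mixed_meaning):.2f} bits")
```

The first list collapses into one cluster and scores zero, even though three different strings appear in it. A zero score still says nothing about whether "May 28" is in the carrier record.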
Evidence failure blocks a fact even when samples agree. When factual evidence passes but generation is unstable, consistency can route the case to review rather than presenting an answer that changes from run to run.
```python
def route_with_signals(case: AnswerCase, sampled_statuses: list[str]) -> str:
    if has_unsafe_claim(case):
        return "abstain_evidence_failure"
    if disagreement_rate(sampled_statuses) > 0.25:
        return "review_unstable_generation"
    return "serve_supported_answer"

clean_case = cases[0]
eta_case = cases[1]

routes = {
    "clean_stable": route_with_signals(clean_case, ["in transit"] * 4),
    "clean_unstable": route_with_signals(
        clean_case, ["in transit", "in transit", "out for delivery", "delivered"]
    ),
    "eta_repeated": route_with_signals(eta_case, stable_false_eta_samples),
}

for name, route_name in routes.items():
    print(f"{name:14} -> {route_name}")

assert routes["clean_stable"] == "serve_supported_answer"
assert routes["clean_unstable"] == "review_unstable_generation"
assert routes["eta_repeated"] == "abstain_evidence_failure"
```

```text
clean_stable   -> serve_supported_answer
clean_unstable -> review_unstable_generation
eta_repeated   -> abstain_evidence_failure
```

A response that prints [fastship-A10234] isn't necessarily grounded. The citation must resolve to the admitted version and support the nearby claim. Otherwise a model can attach a real-looking source marker to an invented delivery promise.
```python
def cited_sentence(claim: Claim) -> str:
    result = verify_claim(claim)
    if result.verdict != Verdict.SUPPORTED:
        raise ValueError(f"cannot cite {claim.claim_id}: {result.verdict.value}")
    return f"{claim.text} [{claim.citation_id}@{result.evidence_version}]"

served_sentences = [cited_sentence(claim) for claim in draft_claims[:3]]
served_answer = " ".join(served_sentences)

try:
    cited_sentence(draft_claims[-1])
except ValueError as error:
    blocked_citation = str(error)

print(served_answer)
print(blocked_citation)
assert "May 28" not in served_answer
assert "not_supported" in blocked_citation
```

```text
Carrier: FastShip. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z] Last scan: departed regional hub. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z] Scan time: May 26 at 08:14 UTC. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z]
cannot cite eta: not_supported
```

This is a narrower version of claim-level factual evaluation: decompose, retrieve or admit evidence, verify, and retain the provenance for failed and passed claims. It also extends the evaluation lesson: citation presence is no longer confused with citation support.
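The same idea can be applied to text that has already been assembled, for example as an audit before serving. The sketch below parses markers of the form [source_id@version] and reports any that do not resolve to an admitted version. It assumes that marker format and a hypothetical `ADMITTED_VERSIONS` lookup. It checks resolution only, so claim support still has to come from the verifier.

```python
import re

# Admitted source versions keyed by source ID (mirrors the record in this lesson).
ADMITTED_VERSIONS = {"fastship-A10234": "scan-feed/2026-05-27T10:00:00Z"}

# Matches markers of the form [source_id@version].
CITATION_PATTERN = re.compile(r"\[([^\[\]@]+)@([^\[\]]+)\]")

def unresolved_citations(answer: str) -> list[str]:
    """Return every citation marker that does not match an admitted source and version."""
    problems = []
    for source_id, version in CITATION_PATTERN.findall(answer):
        if ADMITTED_VERSIONS.get(source_id) != version:
            problems.append(f"{source_id}@{version}")
    return problems

good = "Carrier: FastShip. [fastship-A10234@scan-feed/2026-05-27T10:00:00Z]"
stale = "Carrier: FastShip. [fastship-A10234@scan-feed/2026-05-20T10:00:00Z]"

print(unresolved_citations(good))
print(unresolved_citations(stale))
```

A marker that points at an older feed version is flagged even though the source ID is real, which is the failure a format-only citation check would miss.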
An unsupported answer can start at retrieval or at generation, so locate the first stage that failed before choosing a fix:
| First failed stage | Symptom | Appropriate next action |
|---|---|---|
| Evidence admission | No carrier record was retrieved for an order | Fix retrieval, permissions, freshness, or tool failure |
| Claim generation | Source exists, but answer adds unsupported ETA | Tighten generation and post-generation claim gate |
| Consistency only | Supported facts vary across samples | Review decoding or prompt stability; don't call it a source failure |
Techniques such as Chain-of-Verification ask a model to draft verification questions, answer them independently, and revise the response; the original paper reports reduced hallucination across list questions, closed-book question answering, and long-form generation.[5] That's a possible additional generator control. It doesn't replace an authoritative carrier record or the final claim gate in this product.
```python
@dataclass(frozen=True)
class RunTrace:
    request_id: str
    case: AnswerCase
    evidence_admitted: bool
    sampled_statuses: list[str]

def first_failed_stage(trace: RunTrace) -> str:
    if not trace.evidence_admitted:
        return "evidence_admission"
    if has_unsafe_claim(trace.case):
        return "claim_generation"
    if disagreement_rate(trace.sampled_statuses) > 0.25:
        return "generation_stability"
    return "passed"

traces = [
    RunTrace("req-clean", cases[0], True, ["in transit"] * 4),
    RunTrace("req-eta", cases[1], True, stable_false_eta_samples),
    RunTrace("req-missing", cases[3], False, ["unknown"] * 4),
    RunTrace("req-vary", cases[0], True, ["in transit", "delivered", "in transit", "out for delivery"]),
]

for trace in traces:
    print(f"{trace.request_id:11} -> {first_failed_stage(trace)}")

assert [first_failed_stage(trace) for trace in traces] == [
    "passed",
    "claim_generation",
    "evidence_admission",
    "generation_stability",
]
```

```text
req-clean   -> passed
req-eta     -> claim_generation
req-missing -> evidence_admission
req-vary    -> generation_stability
```

Research benchmarks help compare methods, but they don't execute this carrier feed, prompt, citation format, or customer route. Use external datasets for breadth and product regressions for release:
| Artifact | Teaches or tests | Place in this workflow |
|---|---|---|
| SelfCheckGPT[3] | Sample consistency without external facts | Triage signal when evidence is absent or costly |
| Semantic entropy[4] | Meaning-level uncertainty over samples | Escalation feature for confabulations |
| FActScore[2] | Atomic factual precision | Design model for claim-level verification |
| ShopFlow regression | Exact carrier facts and response policy | Release gate for tracking-answerer-v1 |
```python
def baseline_router(case: AnswerCase) -> str:
    return "serve"

def grounded_router(case: AnswerCase) -> str:
    return "abstain" if has_unsafe_claim(case) else "serve"

def score_router(router) -> dict[str, float | int]:
    served = [case for case in cases if router(case) == "serve"]
    unsafe_serves = sum(has_unsafe_claim(case) for case in served)
    supported_cases = [case for case in cases if not has_unsafe_claim(case)]
    supported_serves = sum(router(case) == "serve" for case in supported_cases)
    return {
        "served": len(served),
        "unsafe_serves": unsafe_serves,
        "unsafe_serve_rate": unsafe_serves / max(len(served), 1),
        "supported_coverage": supported_serves / len(supported_cases),
        "abstentions": len(cases) - len(served),
    }

baseline_metrics = score_router(baseline_router)
candidate_metrics = score_router(grounded_router)

for name, metrics in [("baseline", baseline_metrics), ("candidate", candidate_metrics)]:
    print(
        f"{name:9} unsafe_rate={metrics['unsafe_serve_rate']:.1%} "
        f"coverage={metrics['supported_coverage']:.1%} abstentions={metrics['abstentions']}"
    )

assert baseline_metrics["unsafe_serve_rate"] == 0.75
assert candidate_metrics["unsafe_serve_rate"] == 0.0
assert candidate_metrics["supported_coverage"] == 1.0
```

```text
baseline  unsafe_rate=75.0% coverage=100.0% abstentions=0
candidate unsafe_rate=0.0% coverage=100.0% abstentions=3
```

This is a useful repair on four deliberately small cases. It isn't enough to claim production factuality. The candidate abstains on three of four requests here because the test suite is failure-heavy by design.
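A rough way to see why so few cases cannot support a production claim is to put an upper bound on the unsafe serve rate that the observed count is still compatible with. The sketch below uses the Wilson score interval, which is one common choice and an addition to this lesson rather than something the ShopFlow gate computes. The function name `wilson_upper_bound` is illustrative.

```python
import math

def wilson_upper_bound(failures: int, trials: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for a failure rate (z=1.96 is roughly 95%)."""
    if trials == 0:
        return 1.0  # nothing served, so nothing can be ruled out
    p = failures / trials
    denom = 1 + z**2 / trials
    center = p + z**2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (center + margin) / denom

# The candidate served one answer with zero unsafe serves.
print(f"0 unsafe in 1 served:    upper bound {wilson_upper_bound(0, 1):.1%}")
# A hypothetical holdout with many served answers narrows the bound.
print(f"0 unsafe in 1000 served: upper bound {wilson_upper_bound(0, 1000):.2%}")
```

With a single served answer the bound sits near 79%, so a 0% observed rate rules out almost nothing. The bound only becomes informative when the served answers come from a representative sample.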
For this workflow, extra decoding tricks aren't the next priority. The current failure is already localized: the draft adds a fact absent from a live source. Start with controls whose effect can be tested against that source:
| Layer | Invariant | Failure it prevents |
|---|---|---|
| Admit evidence | Record source ID and version before answer generation | Stale or untraceable facts |
| Generate conservatively | Ask only for claims licensed by source fields | Unnecessary unsupported detail |
| Verify claims | Every factual clause receives a verdict | Fluent fabrications |
| Route safely | Failed clauses trigger removal, abstention, or review | Unsafe customer promises |
| Measure after launch | Retain verdicts, versions, route, and owner | Silent recurrence |
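The "Route safely" row leaves three options open. One way to make the choice explicit is a lookup from verdict to action in which the most severe failure decides the route. The sketch below follows the customer actions in the verdict table at the start of the lesson. The action names, the severity order, and the treatment of a missing source as abstention are assumptions.

```python
# Hypothetical verdict-to-action policy, listed from least to most severe.
ACTIONS = {
    "supported": "serve",
    "not_supported": "remove_claim",
    "no_source": "abstain",
    "contradicted": "block_and_investigate",
}
SEVERITY = list(ACTIONS)  # dict insertion order doubles as the severity order

def route_for(verdicts: list[str]) -> str:
    """Pick the action for the most severe verdict among an answer's claims.

    Raises ValueError on an empty list: an answer with no claims has nothing to route.
    """
    worst = max(verdicts, key=SEVERITY.index)
    return ACTIONS[worst]

print(route_for(["supported", "supported", "supported"]))
print(route_for(["supported", "not_supported"]))
print(route_for(["not_supported", "contradicted"]))
```

Letting the worst verdict win keeps a contradiction from being downgraded to a quiet removal when it appears next to a merely unsupported detail.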
The next chapter can't monitor a correctness property that this chapter never records. Store enough information to reconstruct the answer decision without retaining unnecessary customer text: source versions, verdict counts, route, and failure stage.
```python
@dataclass(frozen=True)
class MonitorEvent:
    request_id: str
    evidence_version: str | None
    route: str
    first_failed_stage: str
    verdict_counts: dict[str, int]

def monitor_event(trace: RunTrace) -> MonitorEvent:
    counts = Counter(verify_claim(claim).verdict.value for claim in trace.case.claims)
    return MonitorEvent(
        request_id=trace.request_id,
        evidence_version=tracking.version if trace.evidence_admitted else None,
        route=grounded_router(trace.case),
        first_failed_stage=first_failed_stage(trace),
        verdict_counts=dict(counts),
    )

events = [monitor_event(trace) for trace in traces]
requirements = {
    "known_unsafe_claims_not_served": candidate_metrics["unsafe_serves"] == 0,
    "source_version_logged_when_admitted": all(
        event.evidence_version is not None
        for event in events
        if event.first_failed_stage != "evidence_admission"
    ),
    "representative_labeled_holdout_collected": False,
    "monitoring_alert_owner_assigned": False,
}
failed_requirements = [name for name, passed in requirements.items() if not passed]
decision = "APPROVED" if not failed_requirements else "BLOCKED"

print(f"Candidate promotion: {decision}")
for name in failed_requirements:
    print(f"  missing: {name}")
print(f"Example failed stage: {events[1].first_failed_stage}")

assert decision == "BLOCKED"
```

```text
Candidate promotion: BLOCKED
  missing: representative_labeled_holdout_collected
  missing: monitoring_alert_owner_assigned
Example failed stage: claim_generation
```

The candidate gate fixes its known fabricated-ETA regressions, but promotion remains blocked. It still needs a representative labeled holdout and an operational owner for alerts. The code has now produced exactly the facts a monitoring system should aggregate.
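As a small illustration of that aggregation, the sketch below counts events by first failed stage and computes an abstention rate. It is self-contained, so the events are plain dictionaries whose values match what the four traces above produce. The dictionary shape is an assumption, not a required log format.

```python
from collections import Counter

# Illustrative events shaped like the MonitorEvent records above.
logged_events = [
    {"request_id": "req-clean", "route": "serve", "first_failed_stage": "passed"},
    {"request_id": "req-eta", "route": "abstain", "first_failed_stage": "claim_generation"},
    {"request_id": "req-missing", "route": "abstain", "first_failed_stage": "evidence_admission"},
    {"request_id": "req-vary", "route": "serve", "first_failed_stage": "generation_stability"},
]

# How often each stage was the first to fail.
stage_counts = Counter(event["first_failed_stage"] for event in logged_events)
# Share of requests where the gate withheld the draft.
abstention_rate = sum(event["route"] == "abstain" for event in logged_events) / len(logged_events)

print(f"Stage counts: {dict(stage_counts)}")
print(f"Abstention rate: {abstention_rate:.0%}")
```

The req-vary event shows route "serve" next to a generation_stability failure, because the event's route comes from `grounded_router`, which checks evidence only. Whether monitoring should record the consistency-aware route instead is a design decision this sketch leaves open.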
Symptom: A stable sampled ETA is served despite no source field establishing it. Cause: Consistency was treated as evidence. Fix: Use consistency to prioritize verification or review; require admitted source support for customer-facing facts.
Symptom: The answer has plausible source tags beside an invented delivery promise. Cause: Citation formatting was checked, but claim-to-source support was not. Fix: Verify each cited factual claim against its source version before serving it.
Symptom: More documents are indexed even though correct tracking evidence was already retrieved. Cause: The team didn't identify the first failed stage. Fix: Separate evidence-admission failure from claim-generation failure in every trace.
Symptom: Four regression cases are used to claim production factual reliability. Cause: A focused regression suite was confused with a representative holdout. Fix: Keep the regressions, then collect labeled workflow slices and assign monitoring ownership before promotion.
[1] Huang et al. (2023). A Survey on Hallucination in Large Language Models.
[2] Min, S., et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. EMNLP 2023.
[3] Manakul, P., et al. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. EMNLP 2023.
[4] Farquhar, S., et al. (2024). Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature.
[5] Dhuliawala, S., et al. (2023). Chain-of-Verification Reduces Hallucination in Large Language Models.