Design a secure, traceable RAG service around versioned policy evidence, grounded answers, abstention, release gates, and latency budgets.
Your ShopFlow support agent can now be evaluated as an agent. It still can't answer a policy question safely unless it receives the right evidence. A resolution rule may change by region, merchant, product condition, or policy revision. An answer that sounds right but cites last year's rule can authorize a costly mistake.
Retrieval-augmented generation (RAG) gives a language model retrieved evidence at answer time instead of expecting its weights to contain current private facts.[1] In production, that simple idea becomes a system contract: index traceable evidence, retrieve only evidence the user may see, generate from that evidence, abstain when it isn't enough, and retain a trace that a reviewer can inspect.
This lesson builds that contract for policy-answerer-v1. You won't implement BM25, dense embeddings, fusion, or reranking here. Those retrieval algorithms belong in the next lessons. Here, you'll make the pipeline around any retriever trustworthy.
Suppose Luna, an EU support specialist, asks:
Can a customer get a replacement for a damaged refurbished laptop delivered 10 days ago?
The answer isn't just text. A release-worthy response must satisfy four properties:
| Property | What the user needs | Failure you must block |
|---|---|---|
| Correct evidence | Current EU refurbished-electronics policy | Old US or superseded rule retrieved |
| Authorization | Only sources Luna may read | Restricted merchant addendum leaks |
| Grounding | Each policy claim points to evidence | Model invents a replacement window |
| Operability | Trace and latency data for the request | Team can't reproduce a bad promise |
The data path has an offline side and an online side:
The indexer produces evidence records when documents change. The request path filters those records by identity and policy state, asks a retriever for candidates, packs source-labelled context, and returns either a supported answer or an abstention. The release path replays frozen questions before a new index, prompt, retriever, or model version serves users.
Earlier chunking lessons showed how to cut a document into searchable spans. A production service adds the fields needed to use those spans later: a stable document identifier, a parent section for citations, a version, an effective date range, a region, and access tags.
Our tiny corpus has three current policies and one superseded policy. Notice that the US and EU rules deliberately differ. That difference turns an access-control bug into a visible wrong answer.
```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
import re

@dataclass(frozen=True)
class PolicyChunk:
    chunk_id: str
    document_id: str
    parent_id: str
    version: str
    region: str
    acl_tag: str
    effective_from: date
    effective_to: date | None
    text: str

# Fixed evaluation date so replays don't depend on the wall clock.
EVAL_DATE = date(2026, 5, 27)
CHUNKS = [
    PolicyChunk(
        chunk_id="eu-refurb-v2-rule",
        document_id="eu-electronics",
        parent_id="eu-electronics-v2",
        version="eu-electronics/2026-04-01",
        region="EU",
        acl_tag="support:eu",
        effective_from=date(2026, 4, 1),
        effective_to=None,
        text=(
            "Damaged refurbished laptops qualify for replacement within "
            "14 days of delivery when damage is reported within 48 hours."
        ),
    ),
    PolicyChunk(
        chunk_id="eu-refurb-v1-rule",
        document_id="eu-electronics",
        parent_id="eu-electronics-v1",
        version="eu-electronics/2025-02-01",
        region="EU",
        acl_tag="support:eu",
        effective_from=date(2025, 2, 1),
        effective_to=date(2026, 3, 31),
        text="Damaged refurbished laptops qualify for return within 30 days.",
    ),
    PolicyChunk(
        chunk_id="us-refurb-v4-rule",
        document_id="us-electronics",
        parent_id="us-electronics-v4",
        version="us-electronics/2026-03-15",
        region="US",
        acl_tag="support:us",
        effective_from=date(2026, 3, 15),
        effective_to=None,
        text="Damaged refurbished laptops qualify for refund within 30 days.",
    ),
    PolicyChunk(
        chunk_id="eu-shoes-v1-rule",
        document_id="eu-footwear",
        parent_id="eu-footwear-v1",
        version="eu-footwear/2026-01-03",
        region="EU",
        acl_tag="support:eu",
        effective_from=date(2026, 1, 3),
        effective_to=None,
        text="Unworn footwear may be returned within 30 days of delivery.",
    ),
]

def is_current(chunk: PolicyChunk, on_date: date) -> bool:
    return (
        chunk.effective_from <= on_date
        and (chunk.effective_to is None or on_date <= chunk.effective_to)
    )

current_ids = [chunk.chunk_id for chunk in CHUNKS if is_current(chunk, EVAL_DATE)]
print("All evidence records:", len(CHUNKS))
print("Current records:", current_ids)
assert "eu-refurb-v1-rule" not in current_ids
```

```text
All evidence records: 4
Current records: ['eu-refurb-v2-rule', 'us-refurb-v4-rule', 'eu-shoes-v1-rule']
```

The record is deliberately more boring than a model call. That's good. Every later stage can now prove which policy revision it used. The fixed EVAL_DATE also makes this replay reproducible instead of changing behavior with the wall clock.
Indexing whole policy pages gives a retriever too much irrelevant prose. Indexing one sentence can lose surrounding exceptions. Parent-child indexing stores a compact child span for search and a parent section for final evidence. The retriever matches on the child span and returns its ID; context assembly then fetches the parent section and its stable citation metadata.
The compact lab keeps child text inline and carries document_id plus parent_id. A full parent-child implementation resolves parent_id to a version-matched, permitted parent section before packing. Keep those fields separate instead of parsing meaning out of an ID string.
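The parent lookup described above could be sketched as follows. This is a minimal illustration, not the lab's code: the `ParentSection` type, the in-memory `PARENTS` store keyed by `(parent_id, version)`, the `resolve_parent` helper, and the placeholder text are all assumptions. The point it shows is that the parent is returned only when the version matches and the caller's tags cover it, even though the child already matched.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ParentSection:
    parent_id: str
    version: str
    acl_tag: str
    text: str


# Hypothetical parent store keyed by (parent_id, version). A real service
# would likely back this with a document store rather than a dict.
PARENTS = {
    ("eu-electronics-v2", "eu-electronics/2026-04-01"): ParentSection(
        parent_id="eu-electronics-v2",
        version="eu-electronics/2026-04-01",
        acl_tag="support:eu",
        text="(placeholder for the full parent section text)",
    ),
}


def resolve_parent(
    parent_id: str,
    version: str,
    caller_tags: frozenset[str],
) -> ParentSection | None:
    """Return the parent only if the version matches and the caller may read it."""
    parent = PARENTS.get((parent_id, version))
    if parent is None or parent.acl_tag not in caller_tags:
        return None
    return parent


eu_tags = frozenset({"support:eu"})
found = resolve_parent("eu-electronics-v2", "eu-electronics/2026-04-01", eu_tags)
denied = resolve_parent(
    "eu-electronics-v2", "eu-electronics/2026-04-01", frozenset({"support:us"})
)
stale = resolve_parent("eu-electronics-v2", "eu-electronics/2025-02-01", eu_tags)
print(found is not None, denied, stale)
```

Keying the store on both fields means a child from one revision can never be packed with a parent from another.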
Chunk overlap remains useful when a sentence straddles a boundary, but it isn't a magic setting. Treat it as an indexing candidate that must survive retrieval tests on your own policy questions.
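For reference, overlap can be sketched as a sliding word window. The `window_chunks` helper and the `size` and `overlap` values below are illustrative assumptions; the earlier chunking lessons may split on sentences or tokens instead.

```python
from __future__ import annotations


def window_chunks(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split words into windows of `size` that share `overlap` words with the next window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


words = (
    "Damaged refurbished laptops qualify for replacement within "
    "14 days of delivery when damage is reported within 48 hours."
).split()
chunks = window_chunks(words, size=8, overlap=3)
for chunk in chunks:
    print(" ".join(chunk))
```

Each setting of `size` and `overlap` produces a different index, which is why the choice belongs in the retrieval tests rather than in a default.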
The next fragment checks a basic index invariant: at most one current version for the same region and document. Two active revisions would allow the request path to retrieve contradictory promises.
```python
from collections import defaultdict

def validate_current_versions(chunks: list[PolicyChunk], on_date: date) -> None:
    # Collect distinct versions per scope so that several chunks from the
    # same revision are not reported as a conflict.
    active_by_scope: dict[tuple[str, str], set[str]] = defaultdict(set)
    for chunk in chunks:
        if is_current(chunk, on_date):
            scope = (chunk.region, chunk.document_id)
            active_by_scope[scope].add(chunk.version)

    conflicts = {
        scope: sorted(versions)
        for scope, versions in active_by_scope.items()
        if len(versions) > 1
    }
    if conflicts:
        raise ValueError(f"Conflicting active policy versions: {conflicts}")

validate_current_versions(CHUNKS, EVAL_DATE)
print("Current-version invariant: pass")
print("Superseded EU record stays indexed for audit, not answering.")
```

```text
Current-version invariant: pass
Superseded EU record stays indexed for audit, not answering.
```

An embedding index doesn't know whether Luna can read a document. A highly similar restricted chunk is still forbidden. The safe order is to filter first and search second: restrict the corpus to current records the caller may read, rank only those records, and only then pack context for the model.
Filtering after text has already reached the model is too late. The model, request trace, cache, or error report may already contain restricted content.
The lab uses a simple term-overlap search so its authorization behavior is obvious. Its retrieve() interface stays fixed; the implementation behind it is the part you'll replace with hybrid search in the next chapter.
```python
@dataclass(frozen=True)
class Caller:
    actor_id: str
    region: str
    acl_tags: frozenset[str]

LUNA = Caller("luna-48291", "EU", frozenset({"support:eu"}))

def allowed_chunks(caller: Caller, chunks: list[PolicyChunk], on_date: date) -> list[PolicyChunk]:
    # Permission, region, and freshness are checked before any scoring happens.
    return [
        chunk
        for chunk in chunks
        if is_current(chunk, on_date)
        and chunk.region == caller.region
        and chunk.acl_tag in caller.acl_tags
    ]

def terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(
    query: str,
    caller: Caller,
    chunks: list[PolicyChunk],
    on_date: date,
    top_k: int = 2,
    min_matching_terms: int = 2,
) -> list[PolicyChunk]:
    permitted = allowed_chunks(caller, chunks, on_date)
    query_terms = terms(query)
    scored = [
        (len(query_terms & terms(chunk.text)), chunk)
        for chunk in permitted
    ]
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    return [
        chunk
        for score, chunk in ranked
        if score >= min_matching_terms
    ][:top_k]

question = "damaged refurbished laptop replacement after 10 days"
hits = retrieve(question, LUNA, CHUNKS, EVAL_DATE)
print("Retrieved:", [(chunk.chunk_id, chunk.version) for chunk in hits])
print("US evidence exposed:", any(chunk.region == "US" for chunk in hits))
assert hits[0].chunk_id == "eu-refurb-v2-rule"
assert all(chunk.acl_tag == "support:eu" for chunk in hits)
```

```text
Retrieved: [('eu-refurb-v2-rule', 'eu-electronics/2026-04-01')]
US evidence exposed: False
```

This retriever isn't production search: its two-term threshold rejects weak hits, but it misses paraphrases such as "reconditioned notebook." It is a clean test double for the surrounding pipeline. Once the authorization and trace contract work, you can improve recall without weakening the boundary.
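One way to keep that boundary fixed while the ranking changes is to let the pipeline own the permission filter and accept any scoring function. The sketch below is an assumption about how that could look, not the lab's design: `Doc`, `Scorer`, `guarded_search`, both scorers, and the sample documents are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Doc:
    doc_id: str
    acl_tag: str
    text: str


# A scorer only ranks. It never decides what the caller may see.
Scorer = Callable[[str, Doc], float]


def guarded_search(
    query: str,
    caller_tags: frozenset[str],
    docs: list[Doc],
    scorer: Scorer,
    top_k: int = 2,
) -> list[Doc]:
    """Filter by permission first, then rank only the permitted documents."""
    permitted = [doc for doc in docs if doc.acl_tag in caller_tags]
    ranked = sorted(permitted, key=lambda doc: scorer(query, doc), reverse=True)
    return ranked[:top_k]


def overlap_scorer(query: str, doc: Doc) -> float:
    return float(len(set(query.lower().split()) & set(doc.text.lower().split())))


def length_scorer(query: str, doc: Doc) -> float:
    # Deliberately naive stand-in for "a different retriever".
    return float(len(doc.text))


DOCS = [
    Doc("public-rule", "support:eu", "damaged laptops qualify for replacement"),
    Doc("secret-rule", "merchant:vip-ops", "damaged laptops qualify for full refund today"),
]
tags = frozenset({"support:eu"})
for scorer in (overlap_scorer, length_scorer):
    ids = [doc.doc_id for doc in guarded_search("damaged laptops", tags, DOCS, scorer)]
    print(scorer.__name__, ids)
```

Because the scorer never sees forbidden documents, swapping it can change the ranking but not what is exposed.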
A useful test shouldn't only prove success. It should include a result that would rank well if the permission filter were missing.
```python
restricted = PolicyChunk(
    chunk_id="merchant-vip-refurb",
    document_id="merchant-vip-terms",
    parent_id="merchant-vip-terms",
    version="merchant-vip/2026-05-01",
    region="EU",
    acl_tag="merchant:vip-ops",
    effective_from=date(2026, 5, 1),
    effective_to=None,
    text=(
        "Damaged refurbished laptops receive immediate full refund within "
        "14 days for VIP merchant operations."
    ),
)

corpus_with_restricted = [restricted, *CHUNKS]
safe_hits = retrieve(question, LUNA, corpus_with_restricted, EVAL_DATE)
visible_ids = [chunk.chunk_id for chunk in safe_hits]

print("Visible hit ids:", visible_ids)
print("Restricted VIP policy hidden:", restricted.chunk_id not in visible_ids)
assert restricted.chunk_id not in visible_ids
```

```text
Visible hit ids: ['eu-refurb-v2-rule']
Restricted VIP policy hidden: True
```

| Design choice | Unsafe shortcut | Observable consequence |
|---|---|---|
| Filter before retrieval | Retrieve everything, redact after generation | Secret rule may enter prompt or trace |
| Store versions and dates | Overwrite the old chunk in place | Can't reproduce a historical answer |
| Preserve parent citation | Return text with no source identity | Reviewer can't verify a claim |
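The second row is what makes historical replay possible. A minimal sketch of an as-of lookup follows, assuming a hypothetical `Revision` record and `version_as_of` helper that are not part of the lab; the versions and dates mirror the EU electronics records above.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Revision:
    version: str
    effective_from: date
    effective_to: date | None  # None means still in force


REVISIONS = [
    Revision("eu-electronics/2025-02-01", date(2025, 2, 1), date(2026, 3, 31)),
    Revision("eu-electronics/2026-04-01", date(2026, 4, 1), None),
]


def version_as_of(revisions: list[Revision], on_date: date) -> str | None:
    """Return the single revision in force on on_date, or None if there is none."""
    active = [
        rev.version
        for rev in revisions
        if rev.effective_from <= on_date
        and (rev.effective_to is None or on_date <= rev.effective_to)
    ]
    if len(active) > 1:
        raise ValueError(f"Overlapping revisions on {on_date}: {active}")
    return active[0] if active else None


print(version_as_of(REVISIONS, date(2026, 3, 1)))
print(version_as_of(REVISIONS, date(2026, 5, 27)))
print(version_as_of(REVISIONS, date(2024, 1, 1)))
```

If the index had overwritten the old record in place, the first lookup would have nothing to return and an answer given in March could not be reproduced.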
Retrieval produces candidate records, not an answer. Context assembly must give the generator source labels, version information, and a clear instruction to abstain when the evidence doesn't establish the requested promise.
Don't stuff every near-match into the prompt. Even when a context window fits a large amount of text, models can use relevant information less reliably when it sits among long distractors, especially in the middle of a long input.[2] Pack the strongest permitted evidence first, keep the set small, and evaluate this policy rather than assuming it works.
```python
@dataclass(frozen=True)
class PackedEvidence:
    source_id: str
    chunk_id: str
    document_id: str
    parent_id: str
    version: str
    text: str

def pack_evidence(hits: list[PolicyChunk], max_characters: int = 400) -> list[PackedEvidence]:
    packed: list[PackedEvidence] = []
    used = 0
    for position, chunk in enumerate(hits, start=1):
        # Stop at the first chunk that would exceed the character budget.
        if used + len(chunk.text) > max_characters:
            break
        packed.append(
            PackedEvidence(
                source_id=f"E{position}",
                chunk_id=chunk.chunk_id,
                document_id=chunk.document_id,
                parent_id=chunk.parent_id,
                version=chunk.version,
                text=chunk.text,
            )
        )
        used += len(chunk.text)
    return packed

packed = pack_evidence(safe_hits)
context = "\n".join(
    f"[{item.source_id}] {item.parent_id} ({item.version}): {item.text}"
    for item in packed
)
print(context)
assert "[E1]" in context
assert packed[0].document_id == "eu-electronics"
assert packed[0].parent_id == "eu-electronics-v2"
assert "merchant-vip" not in context
```

```text
[E1] eu-electronics-v2 (eu-electronics/2026-04-01): Damaged refurbished laptops qualify for replacement within 14 days of delivery when damage is reported within 48 hours.
```

In an actual service, a language model would receive the packed context and an instruction to cite it. For the lab, a deterministic answerer makes the core contract inspectable: it emits the rule only when the required evidence is present and otherwise refuses to promise a resolution outcome.
```python
@dataclass(frozen=True)
class Answer:
    text: str
    cited_sources: tuple[str, ...]
    abstained: bool

def answer_from_evidence(question: str, evidence: list[PackedEvidence]) -> Answer:
    for item in evidence:
        if "14 days" in item.text and "48 hours" in item.text:
            return Answer(
                text=(
                    "Yes, if damage was reported within 48 hours of delivery; "
                    "the replacement window is 14 days. "
                    f"[{item.source_id}]"
                ),
                cited_sources=(item.source_id,),
                abstained=False,
            )
    # No packed evidence supports the claim, so refuse to promise an outcome.
    return Answer(
        text="I can't confirm that outcome from permitted current policy evidence.",
        cited_sources=(),
        abstained=True,
    )

supported = answer_from_evidence(question, packed)
missing = answer_from_evidence("Can I refund a damaged drone?", [])
print("Supported:", supported.text)
print("No evidence:", missing.text)
assert supported.cited_sources == ("E1",)
assert missing.abstained
```

```text
Supported: Yes, if damage was reported within 48 hours of delivery; the replacement window is 14 days. [E1]
No evidence: I can't confirm that outcome from permitted current policy evidence.
```

The lab uses string checks only to make the invariant runnable. A real candidate may use a model, structured citations, and claim verification. The release rule remains: if permitted current evidence doesn't support a material policy claim, the system must abstain or escalate.
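When a model writes the answer, one hedged way to enforce that rule is to validate its citations before returning anything. The sketch assumes the model emits `[E1]`-style markers in its draft; `validate_citations` is a hypothetical helper, and checking that a marker points at packed evidence is weaker than verifying the claim itself.

```python
from __future__ import annotations

import re

ABSTAIN_TEXT = "I can't confirm that outcome from permitted current policy evidence."


def validate_citations(draft: str, packed_source_ids: set[str]) -> tuple[str, bool]:
    """Return (text, abstained). Abstain unless every cited marker is a packed source."""
    cited = set(re.findall(r"\[(E\d+)\]", draft))
    # No citation at all, or a citation to a source that was never packed,
    # both count as unsupported.
    if not cited or not cited <= packed_source_ids:
        return ABSTAIN_TEXT, True
    return draft, False


packed_ids = {"E1"}
print(validate_citations("Replacement window is 14 days. [E1]", packed_ids))
print(validate_citations("Refund within 30 days. [E2]", packed_ids))
print(validate_citations("Refund within 30 days.", packed_ids))
```

A check like this catches invented or missing source IDs; a claim verifier would still be needed to confirm that the cited text says what the draft claims.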
The agent evaluation lesson treated traces as observable release evidence. RAG needs the same discipline. Record versions and decisions needed to reproduce an answer, but don't copy restricted source text into broad logs.
| Trace field | Example | Why it matters |
|---|---|---|
| request_id, actor_id, region | rag-0007, luna-48291, EU | Establishes authorization context |
| index_version | policy-index/2026-05-27 | Lets you replay against the same evidence state |
| retrieved_chunk_ids, source_map | ["eu-refurb-v2-rule"], {"E1": {...}} | Connects packed citations to versioned parent evidence |
| cited_source_ids | ["E1"] | Connects answer claim to packed evidence |
| abstained | false | Makes coverage and failures measurable |
| Stage timings | retrieve_ms=18, ttft_ms=320, generate_ms=410 | Locates latency regressions |
```python
def trace_request(
    request_id: str,
    caller: Caller,
    hits: list[PolicyChunk],
    evidence: list[PackedEvidence],
    answer: Answer,
) -> dict[str, object]:
    # The trace stores identifiers and versions only, never raw policy text.
    return {
        "request_id": request_id,
        "actor_id": caller.actor_id,
        "region": caller.region,
        "index_version": "policy-index/2026-05-27",
        "retrieved_chunk_ids": [chunk.chunk_id for chunk in hits],
        "retrieved_versions": [chunk.version for chunk in hits],
        "source_map": {
            item.source_id: {
                "chunk_id": item.chunk_id,
                "document_id": item.document_id,
                "parent_id": item.parent_id,
                "version": item.version,
            }
            for item in evidence
        },
        "cited_source_ids": list(answer.cited_sources),
        "abstained": answer.abstained,
        "timings_ms": {
            "authorize": 2,
            "retrieve": 18,
            "pack": 1,
            "ttft": 320,
            "generate": 410,
            "trace": 3,
        },
    }

trace = trace_request("rag-0007", LUNA, safe_hits, packed, supported)
stores_raw_policy_text = any(
    chunk.text in str(trace)
    for chunk in corpus_with_restricted
)
print("Trace chunks:", trace["retrieved_chunk_ids"])
print("Trace source map:", trace["source_map"])
print("Trace cites:", trace["cited_source_ids"])
print("Trace stores raw policy text:", stores_raw_policy_text)
assert not stores_raw_policy_text
```

```text
Trace chunks: ['eu-refurb-v2-rule']
Trace source map: {'E1': {'chunk_id': 'eu-refurb-v2-rule', 'document_id': 'eu-electronics', 'parent_id': 'eu-electronics-v2', 'version': 'eu-electronics/2026-04-01'}}
Trace cites: ['E1']
Trace stores raw policy text: False
```

RAG adds work before the first generated token: authorization, retrieval, and context packing. Track that work separately from time to first token (TTFT). The fixture records generate as time after the first token, so its stages add up to one request timeline. If TTFT rises after a corpus change while retrieval stays fast, prompt size may be the issue.
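The trace fixture hard-codes its timings. A minimal sketch of how per-stage durations could be measured instead, assuming a hypothetical `StageTimer` helper built on `time.perf_counter`; the sleeps stand in for real work.

```python
from __future__ import annotations

import time
from contextlib import contextmanager
from typing import Iterator


class StageTimer:
    """Collects wall-clock milliseconds per named stage."""

    def __init__(self) -> None:
        self.timings_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raises.
            self.timings_ms[name] = (time.perf_counter() - start) * 1000.0


timer = StageTimer()
with timer.stage("authorize"):
    time.sleep(0.001)  # stand-in for the permission filter
with timer.stage("retrieve"):
    time.sleep(0.002)  # stand-in for the search call
print(sorted(timer.timings_ms))
```

Timing each stage under its own name is what lets a regression be pinned to retrieval, packing, or generation rather than to the request as a whole.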
```python
LATENCY_BUDGET_MS = {
    "authorize": 10,
    "retrieve": 80,
    "pack": 10,
    "ttft": 500,
    "generate": 500,
    "trace": 10,
}

def exceeded_budgets(timings: dict[str, int]) -> list[str]:
    # A missing stage timing counts as a failure, not as a pass.
    return [
        stage
        for stage, budget in LATENCY_BUDGET_MS.items()
        if stage not in timings or timings[stage] > budget
    ]

healthy = trace["timings_ms"]
regressed = {**healthy, "ttft": 740}
missing_trace = {
    stage: duration
    for stage, duration in healthy.items()
    if stage != "trace"
}
print("Healthy exceeded:", exceeded_budgets(healthy))
print("Regressed exceeded:", exceeded_budgets(regressed))
print("Missing timing exceeded:", exceeded_budgets(missing_trace))
assert exceeded_budgets(healthy) == []
assert exceeded_budgets(regressed) == ["ttft"]
assert exceeded_budgets(missing_trace) == ["trace"]
```

```text
Healthy exceeded: []
Regressed exceeded: ['ttft']
Missing timing exceeded: ['trace']
```

An appealing demo question doesn't establish reliability. Create frozen cases from policy questions, authorization attacks, outdated revisions, and missing-evidence requests. Keep the expected evidence IDs with each case. This separates retrieval failure from generation failure before users see the candidate.
RAG evaluation research also separates retrieval evidence quality from answer faithfulness and relevance rather than hiding all failures inside one final score.[3] The dedicated RAG evaluation lesson will implement those metrics. This lesson starts with hard release assertions that catch expensive mistakes immediately.
```python
@dataclass(frozen=True)
class EvalCase:
    name: str
    question: str
    corpus: tuple[PolicyChunk, ...]
    expected_chunk_ids: tuple[str, ...]
    forbidden_chunk_ids: tuple[str, ...]
    should_abstain: bool

CASES = [
    EvalCase(
        "supported-eu-laptop",
        "damaged refurbished laptop replacement after 10 days",
        tuple(CHUNKS),
        ("eu-refurb-v2-rule",),
        ("eu-refurb-v1-rule", "us-refurb-v4-rule"),
        False,
    ),
    EvalCase(
        "restricted-vip-source",
        "VIP merchant damaged refurbished laptop replacement",
        tuple(corpus_with_restricted),
        ("eu-refurb-v2-rule",),
        ("merchant-vip-refurb",),
        False,
    ),
    EvalCase(
        "superseded-window",
        "damaged refurbished laptops replacement window",
        tuple(CHUNKS),
        ("eu-refurb-v2-rule",),
        ("eu-refurb-v1-rule",),
        False,
    ),
    EvalCase(
        "missing-drone-policy",
        "drone propeller damage return rule",
        tuple(corpus_with_restricted),
        (),
        ("merchant-vip-refurb",),
        True,
    ),
]

def run_case(case: EvalCase) -> tuple[bool, str]:
    hits = retrieve(case.question, LUNA, list(case.corpus), EVAL_DATE)
    evidence = pack_evidence(hits)
    result = answer_from_evidence(case.question, evidence)
    ids = [chunk.chunk_id for chunk in hits]
    # A case passes only if nothing forbidden was retrieved, the abstention
    # decision matches, and the retrieved IDs equal the expected evidence.
    passed = (
        all(forbidden_id not in ids for forbidden_id in case.forbidden_chunk_ids)
        and result.abstained == case.should_abstain
        and tuple(ids) == case.expected_chunk_ids
    )
    return passed, f"{case.name}: ids={ids}, abstained={result.abstained}"

results = [run_case(case) for case in CASES]
for passed, summary in results:
    print("PASS" if passed else "BLOCK", summary)
print("Candidate promoted:", all(passed for passed, _ in results))
assert all(passed for passed, _ in results)
```

```text
PASS supported-eu-laptop: ids=['eu-refurb-v2-rule'], abstained=False
PASS restricted-vip-source: ids=['eu-refurb-v2-rule'], abstained=False
PASS superseded-window: ids=['eu-refurb-v2-rule'], abstained=False
PASS missing-drone-policy: ids=[], abstained=True
Candidate promoted: True
```

The minimal suite already checks three high-impact failures: a forbidden chunk, a superseded chunk, and an unsupported answer. A serious deployment adds paraphrases, policy conflicts, index deletion cases, model-judge calibration, human reviews, and latency distributions.
| Gate | Block when | First repair location |
|---|---|---|
| Authorization | Any returned chunk lacks the caller's permission | Metadata and retrieval filter |
| Freshness | Answer cites a superseded version | Index lifecycle and effective-date filter |
| Evidence | Required source isn't in top candidates | Retriever, chunking, or metadata |
| Grounding | Answer asserts a policy not supported by context | Prompt, answer validator, or abstention |
| Latency | A critical stage exceeds budget consistently | Trace the stage before changing architecture |
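The latency row says "consistently", which a single trace can't show. One way to express that is to gate on a high percentile across many replayed requests. The sketch below uses a hypothetical `p95` helper (nearest-rank method) and made-up sample values; only the 500 ms TTFT budget comes from the lab.

```python
from __future__ import annotations


def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical TTFT samples in milliseconds from 20 replayed requests.
ttft_samples = [300.0] * 18 + [480.0, 900.0]
TTFT_BUDGET_MS = 500

observed = p95(ttft_samples)
print("p95 ttft:", observed, "within budget:", observed <= TTFT_BUDGET_MS)
```

With these samples the single 900 ms outlier does not block the release, while a sustained shift in the distribution would.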
policy-answerer-v1 artifact
You now have the bones of a production RAG service. Turn the fragments into a small portfolio artifact:
- chunk_id, document_id, parent_id, effective dates, region, and ACL tags.
- retrieve() contract. Keep the simple overlap baseline first.

```python
release_hits = retrieve(question, LUNA, corpus_with_restricted, EVAL_DATE)
release_report = {
    "candidate": "policy-answerer-v1",
    "index_version": trace["index_version"],
    "evaluated_cases": len(CASES),
    "authorization_gate": restricted.chunk_id not in [
        chunk.chunk_id for chunk in release_hits
    ],
    "freshness_gate": "eu-refurb-v1-rule" not in [
        chunk.chunk_id for chunk in release_hits
    ],
    "latency_gate": exceeded_budgets(trace["timings_ms"]) == [],
    "case_gate": all(passed for passed, _ in results),
}
# Promote only when every *_gate entry is literally True.
promote = all(
    value is True
    for key, value in release_report.items()
    if key.endswith("_gate")
)
print("Candidate:", release_report["candidate"])
print("Index:", release_report["index_version"])
print("All hard gates pass:", promote)
assert promote
```

```text
Candidate: policy-answerer-v1
Index: policy-index/2026-05-27
All hard gates pass: True
```

You are ready to design a production RAG pipeline when you can:
| Level | Evidence in your submission |
|---|---|
| Foundational | Versioned chunks, current-policy filtering, and a supported cited answer |
| Applied | Authorization attack stays hidden and missing evidence triggers abstention |
| Strong | Frozen cases, request traces, and explicit stage budgets block bad releases |
| Production-ready | Retriever upgrades improve evidence recall without changing permission or grounding guarantees |
| Symptom | Likely cause | Repair |
|---|---|---|
| Answer cites last year's resolution window | Index overwrote or failed to filter superseded policies | Keep versioned records and filter by effective date |
| Restricted merchant rule appears in prompt | Permission check happened after retrieval | Filter candidates inside the retrieval boundary |
| Correct policy isn't enough to explain a response | Citation IDs weren't carried into packed context and trace | Keep stable chunk and parent identifiers end to end |
| Bot promises an outcome absent from evidence | Generation had no enforced abstention path | Require supported claims or escalation |
| Quality debates can't be resolved | Tests record answers but not retrieved evidence | Freeze expected evidence IDs and save traces |