Production RAG Pipelines

Design a secure, traceable RAG service around versioned policy evidence, grounded answers, abstention, release gates, and latency budgets.

Your internal security-policy support agent can now be evaluated as an agent. It still can't answer a policy question safely unless it receives the right evidence. A rotation rule may change by region, account type, risk state, or policy revision. An answer that sounds right but cites last year's rule can authorize a costly mistake.

Retrieval-augmented generation (RAG) gives a language model retrieved evidence at answer time instead of expecting its weights to contain current private facts [1]. In production, that simple idea becomes a system contract: index traceable evidence, retrieve only evidence the user may see, generate from that evidence, abstain when it isn't enough, and retain a trace that a reviewer can inspect.

For policy-answerer-v1, that contract is the main deliverable. You won't implement BM25, dense embeddings, fusion, or reranking here. Those retrieval algorithms belong in the next lessons. Here, you'll make the pipeline around any retriever trustworthy.

The promise the service must keep

Suppose Luna, an EU support specialist, asks:

Can a stale service-account key be rotated automatically if the risk signal arrived 10 days ago?

The answer isn't just text. A release-worthy response must satisfy four properties:

Property	What the user needs	Failure you must block
Correct evidence	Current EU key-rotation policy	Old US or superseded rule retrieved
Authorization	Only sources Luna may read	Restricted admin addendum leaks
Grounding	Each policy claim points to evidence	Model invents a rotation window
Operability	Trace and latency data for the request	Team can't reproduce a bad promise

Figure: Three-lane production RAG flow. Offline indexing versions evidence, online retrieval authorizes before model context, and frozen release cases replay the exact path before promotion. A production RAG answer is the end of an evidence path: version evidence offline, authorize before model context, then replay the same path on frozen cases before release.

The data path has an offline side and an online side. When documents change, the indexer produces evidence records. The request path filters those records by identity and policy state, asks a retriever for candidates, packs source-labelled context, and returns either a supported answer or an abstention. Before a new index, prompt, retriever, or model version serves users, the release path replays frozen questions.

Build the evidence record

Earlier chunking lessons showed how to cut a document into searchable spans. A production service adds the fields needed to use those spans later: a stable document identifier, a parent section for citations, a version, an effective date range, a region, and an access control list (ACL) tag.

Our tiny corpus has three current policies and one superseded policy. Notice that the US and EU rules deliberately differ. That difference turns an access-control bug into a visible wrong answer.

evidence-records.py

from __future__ import annotations

from dataclasses import dataclass
from datetime import date
import re

@dataclass(frozen=True)
class PolicyChunk:
    chunk_id: str
    document_id: str
    parent_id: str
    version: str
    region: str
    acl_tag: str
    effective_from: date
    effective_to: date | None
    text: str

EVAL_DATE = date(2026, 5, 27)
CHUNKS = [
    PolicyChunk(
        chunk_id="eu-key-rotation-v2-rule",
        document_id="eu-access",
        parent_id="eu-access-v2",
        version="eu-access/2026-04-01",
        region="EU",
        acl_tag="support:eu",
        effective_from=date(2026, 4, 1),
        effective_to=None,
        text=(
            "Stale service-account keys qualify for automated rotation within "
            "14 days when a risk signal arrives within 48 hours."
        ),
    ),
    PolicyChunk(
        chunk_id="eu-key-rotation-v1-rule",
        document_id="eu-access",
        parent_id="eu-access-v1",
        version="eu-access/2025-02-01",
        region="EU",
        acl_tag="support:eu",
        effective_from=date(2025, 2, 1),
        effective_to=date(2026, 3, 31),
        text="Stale service-account keys require manual rotation within 30 days.",
    ),
    PolicyChunk(
        chunk_id="us-key-rotation-v4-rule",
        document_id="us-access",
        parent_id="us-access-v4",
        version="us-access/2026-03-15",
        region="US",
        acl_tag="support:us",
        effective_from=date(2026, 3, 15),
        effective_to=None,
        text="Stale service-account keys require security review within 30 days.",
    ),
    PolicyChunk(
        chunk_id="eu-session-timeout-v1-rule",
        document_id="eu-session",
        parent_id="eu-session-timeout-v1",
        version="eu-session/2026-01-03",
        region="EU",
        acl_tag="support:eu",
        effective_from=date(2026, 1, 3),
        effective_to=None,
        text="Idle browser sessions expire after 30 days of inactivity.",
    ),
]

def is_current(chunk: PolicyChunk, on_date: date) -> bool:
    return (
        chunk.effective_from <= on_date
        and (chunk.effective_to is None or on_date <= chunk.effective_to)
    )

current_ids = [chunk.chunk_id for chunk in CHUNKS if is_current(chunk, EVAL_DATE)]
print("All evidence records:", len(CHUNKS))
print("Current records:", current_ids)
assert "eu-key-rotation-v1-rule" not in current_ids

Output

All evidence records: 4
Current records: ['eu-key-rotation-v2-rule', 'us-key-rotation-v4-rule', 'eu-session-timeout-v1-rule']

The record is deliberately more boring than a model call. That's good. Every later stage can now prove which policy revision it used. The fixed EVAL_DATE also makes this replay reproducible instead of changing behavior with the wall clock.

Retrieve small, cite enough context

Indexing whole policy pages gives a retriever too much irrelevant text. Indexing one sentence can lose surrounding exceptions. Parent-child indexing stores a compact child span for search and a parent section for final evidence. The retriever can match the child ID, then context assembly can fetch the parent section and its stable citation metadata.

The compact lab keeps child text inline and carries document_id plus parent_id. A full parent-child implementation resolves parent_id to a version-matched, permitted parent section before packing. Keep those fields separate instead of parsing meaning out of an ID string.
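The parent lookup described above can be sketched as follows. This is a minimal illustration, not the lab's code: `ParentSection`, `ChildHit`, and `resolve_parent` are hypothetical names, and the parent store is assumed to be an in-memory dictionary keyed by `parent_id`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ParentSection:
    # Hypothetical parent record; field names are illustrative assumptions.
    parent_id: str
    version: str
    acl_tag: str
    text: str


@dataclass(frozen=True)
class ChildHit:
    # Minimal stand-in for a retrieved child chunk.
    chunk_id: str
    parent_id: str
    version: str


def resolve_parent(
    hit: ChildHit,
    parents: dict[str, ParentSection],
    caller_acl_tags: frozenset[str],
) -> ParentSection | None:
    """Return the parent only if it matches the hit's version and the caller may read it."""
    parent = parents.get(hit.parent_id)
    if parent is None:
        return None
    if parent.version != hit.version:
        return None  # never expand a child into a different revision
    if parent.acl_tag not in caller_acl_tags:
        return None  # the parent must sit inside the same permission boundary
    return parent


PARENTS = {
    "eu-access-v2": ParentSection(
        parent_id="eu-access-v2",
        version="eu-access/2026-04-01",
        acl_tag="support:eu",
        text="(full parent section text would live here)",
    )
}
hit = ChildHit("eu-key-rotation-v2-rule", "eu-access-v2", "eu-access/2026-04-01")
parent = resolve_parent(hit, PARENTS, frozenset({"support:eu"}))
print(parent.parent_id if parent else "no permitted parent")
```

Returning `None` instead of falling back to another revision keeps the expansion step from quietly widening what the caller can see.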

Figure: Parent-child indexing flow. A query hits a compact 14-day child span, then parent_id expands that match into a version-matched parent section with caveats and citation metadata. The child span helps retrieval find the exact rule. The parent section carries caveats, citation context, and the same version and authorization boundary.

Chunk overlap remains useful when a sentence straddles a boundary, but it isn't a default setting to trust blindly. Treat it as an indexing candidate that must survive retrieval tests on your own policy questions.
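To see the boundary effect concretely, here is a small sketch that cuts the same text into word windows with and without overlap. The `window_chunks` helper, the window size of 20 words, and the surrounding filler sentences are assumptions chosen for illustration; only the rule sentence comes from the lab corpus.

```python
def window_chunks(words: list[str], size: int, overlap: int) -> list[str]:
    """Cut a word list into windows of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


RULE = (
    "Stale service-account keys qualify for automated rotation within "
    "14 days when a risk signal arrives within 48 hours."
)
# Filler sentences are invented so the rule straddles a window boundary.
PAGE = (
    "This section defines rotation rules for support staff in Europe. "
    + RULE
    + " Exceptions require written approval from security."
)
words = PAGE.split()

hard_split = window_chunks(words, size=20, overlap=0)
overlapped = window_chunks(words, size=20, overlap=10)

print("Hard split keeps the rule whole:", any(RULE in chunk for chunk in hard_split))
print("Overlap keeps the rule whole:", any(RULE in chunk for chunk in overlapped))
```

With these particular sizes the hard split separates the 14-day window from the 48-hour condition, while one overlapped window holds the whole rule. Different sizes can flip that result, which is why overlap still has to be measured on labeled queries.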

Figure: Chunk boundary comparison. A hard split separates a stale service-account key-rotation rule from its 14-day rotation window and 48-hour risk-signal condition, while overlapping windows preserve the complete rule inside one retrievable span. A boundary that cuts the 14-day rule in half makes even a good retriever fail. Overlap can preserve a complete evidence span, but you still measure the result on labeled queries.

The next fragment checks a basic index invariant: at most one current version of each policy document per region. Two active revisions would allow the request path to retrieve contradictory promises.

index-invariants.py

from collections import defaultdict

def validate_current_versions(chunks: list[PolicyChunk], on_date: date) -> None:
    # Track distinct versions per scope so several chunks from one revision
    # don't look like a conflict.
    active_by_scope: dict[tuple[str, str], set[str]] = defaultdict(set)
    for chunk in chunks:
        if is_current(chunk, on_date):
            scope = (chunk.region, chunk.document_id)
            active_by_scope[scope].add(chunk.version)

    conflicts = {
        scope: versions
        for scope, versions in active_by_scope.items()
        if len(versions) > 1
    }
    if conflicts:
        raise ValueError(f"Conflicting active policy versions: {conflicts}")

validate_current_versions(CHUNKS, EVAL_DATE)
print("Current-version invariant: pass")
print("Superseded EU record stays indexed for audit, not answering.")

Output

Current-version invariant: pass
Superseded EU record stays indexed for audit, not answering.

Put authorization before similarity

An embedding index doesn't know whether Luna can read a document. A highly similar restricted chunk is still forbidden. The safe order is:

1. Determine the caller's tenant, role, region, and request date from trusted application state.
2. Select admissible evidence by those fields.
3. Search only within that admissible set, or use a store that enforces the filter inside retrieval.
4. Pass only returned permitted text to context assembly and logs visible to the caller.

Filtering after text has already reached the model is too late. The model, request trace, cache, or error report may already contain restricted content.

The fixture below has one internal security-policy tenant, so it models region and ACL tags directly. A multi-tenant service must enforce tenant isolation inside the same permission boundary; tenant identity can't depend on model instructions.
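For the multi-tenant case, the same filter can carry a tenant check. The sketch below is an assumption-laden illustration: `TenantChunk`, `TenantCaller`, the `tenant_id` field, and the tenant names are hypothetical and don't appear in the lab's `PolicyChunk`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TenantChunk:
    chunk_id: str
    tenant_id: str  # hypothetical field; the lab's PolicyChunk has no tenant
    region: str
    acl_tag: str


@dataclass(frozen=True)
class TenantCaller:
    tenant_id: str  # resolved from trusted session state, never from the prompt
    region: str
    acl_tags: frozenset[str]


def admissible(caller: TenantCaller, chunks: list[TenantChunk]) -> list[TenantChunk]:
    """Keep only chunks inside the caller's tenant, region, and ACL tags."""
    return [
        chunk
        for chunk in chunks
        if chunk.tenant_id == caller.tenant_id
        and chunk.region == caller.region
        and chunk.acl_tag in caller.acl_tags
    ]


caller = TenantCaller("acme", "EU", frozenset({"support:eu"}))
corpus = [
    TenantChunk("acme-eu-rule", "acme", "EU", "support:eu"),
    # Same region and ACL tag, different tenant: must stay invisible.
    TenantChunk("globex-eu-rule", "globex", "EU", "support:eu"),
]
visible = [chunk.chunk_id for chunk in admissible(caller, corpus)]
print("Visible to acme caller:", visible)
```

The second chunk shares the caller's region and ACL tag, so only the tenant comparison keeps it out. That is the case worth freezing as a regression test.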

The lab uses a simple term-overlap search so its authorization behavior is obvious. The implementation behind its retrieve() interface is the part you'll replace with hybrid search in the next chapter; the interface itself stays.

authorized-retrieval.py

@dataclass(frozen=True)
class Caller:
    actor_id: str
    region: str
    acl_tags: frozenset[str]

LUNA = Caller("luna-48291", "EU", frozenset({"support:eu"}))

def allowed_chunks(caller: Caller, chunks: list[PolicyChunk], on_date: date) -> list[PolicyChunk]:
    return [
        chunk
        for chunk in chunks
        if is_current(chunk, on_date)
        and chunk.region == caller.region
        and chunk.acl_tag in caller.acl_tags
    ]

def terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(
    query: str,
    caller: Caller,
    chunks: list[PolicyChunk],
    on_date: date,
    top_k: int = 2,
    min_matching_terms: int = 2,
) -> list[PolicyChunk]:
    permitted = allowed_chunks(caller, chunks, on_date)
    query_terms = terms(query)
    scored = [
        (len(query_terms & terms(chunk.text)), chunk)
        for chunk in permitted
    ]
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    return [
        chunk
        for score, chunk in ranked
        if score >= min_matching_terms
    ][:top_k]

question = "stale service-account key automated rotation after 10 days"
hits = retrieve(question, LUNA, CHUNKS, EVAL_DATE)
print("Retrieved:", [(chunk.chunk_id, chunk.version) for chunk in hits])
print("US evidence exposed:", any(chunk.region == "US" for chunk in hits))
assert hits[0].chunk_id == "eu-key-rotation-v2-rule"
assert all(chunk.acl_tag == "support:eu" for chunk in hits)

Output

Retrieved: [('eu-key-rotation-v2-rule', 'eu-access/2026-04-01'), ('eu-session-timeout-v1-rule', 'eu-session/2026-01-03')]
US evidence exposed: False

This retriever isn't production search. Its two-term threshold rejects the weakest hits, yet it still admits the session-timeout rule on generic words such as "after" and "days", and it misses paraphrases such as "refresh expired machine credential." It's a clean test double for the surrounding pipeline. Once the authorization and trace contract work, you can improve recall without weakening the boundary.
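The limits of term overlap are easy to check directly. The sketch below repeats the lab's `terms()` tokenizer so it runs on its own, then compares the literal query and the paraphrase against the two EU chunk texts.

```python
import re


def terms(text: str) -> set[str]:
    # Same tokenizer as the lab: lowercase alphanumeric runs.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


RULE = (
    "Stale service-account keys qualify for automated rotation within "
    "14 days when a risk signal arrives within 48 hours."
)
SESSION = "Idle browser sessions expire after 30 days of inactivity."

literal = "stale service-account key automated rotation after 10 days"
paraphrase = "refresh expired machine credential"

literal_overlap = terms(literal) & terms(RULE)
paraphrase_overlap = terms(paraphrase) & terms(RULE)
session_overlap = terms(literal) & terms(SESSION)

print("Literal query vs rule:", sorted(literal_overlap))
print("Paraphrase vs rule:", sorted(paraphrase_overlap))
print("Literal query vs session timeout:", sorted(session_overlap))
```

The paraphrase shares no terms with the rule it is asking about, while the unrelated session-timeout chunk clears the two-term threshold on generic words alone. Both gaps are recall and precision problems for the retriever, not for the authorization boundary.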

Failure test: a tempting but forbidden result

A useful test shouldn't only prove success. It should include a result that would rank well if the permission filter were missing.

acl-regression-test.py

restricted = PolicyChunk(
    chunk_id="restricted-admin-key-rotation",
    document_id="admin-override-terms",
    parent_id="admin-override-terms",
    version="admin-override/2026-05-01",
    region="EU",
    acl_tag="security:admins",
    effective_from=date(2026, 5, 1),
    effective_to=None,
    text=(
        "Security admins may run emergency key rotation without support approval."
    ),
)

corpus_with_restricted = [restricted, *CHUNKS]
safe_hits = retrieve(question, LUNA, corpus_with_restricted, EVAL_DATE)
visible_ids = [chunk.chunk_id for chunk in safe_hits]

print("Visible hit ids:", visible_ids)
print("Restricted admin policy hidden:", restricted.chunk_id not in visible_ids)
assert restricted.chunk_id not in visible_ids

Output

Visible hit ids: ['eu-key-rotation-v2-rule', 'eu-session-timeout-v1-rule']
Restricted admin policy hidden: True

Design choice	Unsafe shortcut	Observable consequence
Filter before retrieval	Retrieve everything, redact after generation	Secret rule may enter prompt or trace
Store versions and dates	Overwrite the old chunk in place	Can't reproduce a historical answer
Preserve parent citation	Return text with no source identity	Reviewer can't verify a claim

Pack evidence for a grounded answer

Retrieval produces candidate records, not an answer. Context assembly must give the generator source labels, version information, and a clear instruction to abstain when the evidence doesn't establish the requested promise.

Don't stuff every near-match into the prompt. Even when a context window fits a large amount of text, models can use relevant information less reliably when it sits among long distractors, especially in the middle of a long input [2]. Pack the strongest permitted evidence first, keep the set small, and evaluate this policy rather than assuming it works.

pack-cited-context.py

@dataclass(frozen=True)
class PackedEvidence:
    source_id: str
    chunk_id: str
    document_id: str
    parent_id: str
    version: str
    text: str

def pack_evidence(hits: list[PolicyChunk], max_characters: int = 400) -> list[PackedEvidence]:
    packed: list[PackedEvidence] = []
    used = 0
    for position, chunk in enumerate(hits, start=1):
        if used + len(chunk.text) > max_characters:
            break
        packed.append(
            PackedEvidence(
                source_id=f"E{position}",
                chunk_id=chunk.chunk_id,
                document_id=chunk.document_id,
                parent_id=chunk.parent_id,
                version=chunk.version,
                text=chunk.text,
            )
        )
        used += len(chunk.text)
    return packed

packed = pack_evidence(safe_hits)
context = "\n".join(
    f"[{item.source_id}] {item.parent_id} ({item.version}): {item.text}"
    for item in packed
)
print(context)
assert "[E1]" in context
assert packed[0].document_id == "eu-access"
assert packed[0].parent_id == "eu-access-v2"
assert "admin-override" not in context

Output

[E1] eu-access-v2 (eu-access/2026-04-01): Stale service-account keys qualify for automated rotation within 14 days when a risk signal arrives within 48 hours.
[E2] eu-session-timeout-v1 (eu-session/2026-01-03): Idle browser sessions expire after 30 days of inactivity.

Answer or abstain

In an actual service, a language model would receive the packed context and an instruction to cite it. For the lab, a deterministic answerer makes the core contract inspectable: it emits the rule only when the required evidence is present and otherwise refuses to promise an outcome.

grounded-answer.py

@dataclass(frozen=True)
class Answer:
    text: str
    cited_sources: tuple[str, ...]
    abstained: bool

def answer_from_evidence(question: str, evidence: list[PackedEvidence]) -> Answer:
    for item in evidence:
        if "14 days" in item.text and "48 hours" in item.text:
            return Answer(
                text=(
                    "Yes, if the risk signal arrived within 48 hours; "
                    "the automated rotation window is 14 days. "
                    f"[{item.source_id}]"
                ),
                cited_sources=(item.source_id,),
                abstained=False,
            )
    return Answer(
        text="I can't confirm that outcome from permitted current policy evidence.",
        cited_sources=(),
        abstained=True,
    )

supported = answer_from_evidence(question, packed)
missing = answer_from_evidence("Can I approve an unmanaged sandbox credential?", [])
print("Supported:", supported.text)
print("No evidence:", missing.text)
assert supported.cited_sources == ("E1",)
assert missing.abstained

Output

Supported: Yes, if the risk signal arrived within 48 hours; the automated rotation window is 14 days. [E1]
No evidence: I can't confirm that outcome from permitted current policy evidence.

The lab uses string checks only to make the invariant runnable. A real candidate may use a model, structured citations, and claim verification. The release rule remains: if permitted current evidence doesn't support a material policy claim, the system must abstain or escalate.
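As one illustration of what a post-generation check might look like, the sketch below validates a model-written answer against its packed evidence. It is a crude heuristic under stated assumptions, not real claim verification: `validate_citations` is a hypothetical helper, citations are assumed to use the `[E1]` label format from the packing step, and only numbers are compared against the cited text.

```python
import re


def validate_citations(answer_text: str, evidence: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems: list[str] = []
    cited = re.findall(r"\[(E\d+)\]", answer_text)
    if not cited:
        problems.append("no citation")
    for source_id in cited:
        if source_id not in evidence:
            problems.append(f"unknown source: {source_id}")

    # Every number stated in the answer must appear in some cited source.
    cited_text = " ".join(evidence[s] for s in cited if s in evidence)
    answer_without_labels = re.sub(r"\[E\d+\]", "", answer_text)
    for number in re.findall(r"\d+", answer_without_labels):
        if not re.search(rf"\b{number}\b", cited_text):
            problems.append(f"unsupported number: {number}")
    return problems


EVIDENCE = {
    "E1": (
        "Stale service-account keys qualify for automated rotation within "
        "14 days when a risk signal arrives within 48 hours."
    )
}
grounded = (
    "Yes, if the risk signal arrived within 48 hours; "
    "the automated rotation window is 14 days. [E1]"
)
invented = "The rotation window is 30 days. [E1]"

print("Grounded answer problems:", validate_citations(grounded, EVIDENCE))
print("Invented window problems:", validate_citations(invented, EVIDENCE))
```

A non-empty result would route the request to abstention or escalation instead of returning the answer. A check like this catches an invented rotation window but not a subtly wrong paraphrase, so it complements rather than replaces the frozen cases.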

Record a reproducible request trace

The agent evaluation lesson treated traces as observable release evidence. RAG needs the same discipline. Record the versions and decisions needed to reproduce an answer, but don't copy restricted source text into broad logs.

Trace field	Example	Why it matters
`request_id`, `actor_id`, `region`	`rag-0007`, `luna-48291`, `EU`	Establishes authorization context
`index_version`	`policy-index/2026-05-27`	Lets you replay against the same evidence state
`retrieved_chunk_ids`, `source_map`	`["eu-key-rotation-v2-rule"]`, `{"E1": {...}}`	Connects packed citations to versioned parent evidence
`cited_source_ids`	`["E1"]`	Connects answer claim to packed evidence
`abstained`	`false`	Makes coverage and failures measurable
Stage timings	`retrieve_ms=18`, `model_ttft_ms=320`, `generate_ms=410`	Locates latency regressions

request-trace.py

def trace_request(
    request_id: str,
    caller: Caller,
    hits: list[PolicyChunk],
    evidence: list[PackedEvidence],
    answer: Answer,
) -> dict[str, object]:
    return {
        "request_id": request_id,
        "actor_id": caller.actor_id,
        "region": caller.region,
        "index_version": "policy-index/2026-05-27",
        "retrieved_chunk_ids": [chunk.chunk_id for chunk in hits],
        "retrieved_versions": [chunk.version for chunk in hits],
        "source_map": {
            item.source_id: {
                "chunk_id": item.chunk_id,
                "document_id": item.document_id,
                "parent_id": item.parent_id,
                "version": item.version,
            }
            for item in evidence
        },
        "cited_source_ids": list(answer.cited_sources),
        "abstained": answer.abstained,
        "timings_ms": {
            "authorize": 2,
            "retrieve": 18,
            "pack": 1,
            "model_ttft": 320,
            "generate": 410,
            "trace": 3,
        },
    }

trace = trace_request("rag-0007", LUNA, safe_hits, packed, supported)
stores_raw_policy_text = any(
    chunk.text in str(trace)
    for chunk in corpus_with_restricted
)
print("Trace chunks:", trace["retrieved_chunk_ids"])
print("Trace source map:", trace["source_map"])
print("Trace cites:", trace["cited_source_ids"])
print("Trace stores raw policy text:", stores_raw_policy_text)
assert not stores_raw_policy_text

Output

Trace chunks: ['eu-key-rotation-v2-rule', 'eu-session-timeout-v1-rule']
Trace source map: {'E1': {'chunk_id': 'eu-key-rotation-v2-rule', 'document_id': 'eu-access', 'parent_id': 'eu-access-v2', 'version': 'eu-access/2026-04-01'}, 'E2': {'chunk_id': 'eu-session-timeout-v1-rule', 'document_id': 'eu-session', 'parent_id': 'eu-session-timeout-v1', 'version': 'eu-session/2026-01-03'}}
Trace cites: ['E1']
Trace stores raw policy text: False

Setting temperature = 0 doesn't make an answer reproducible. Greedy decoding removes sampling randomness, but outputs may still change when the model weights behind an alias, the prompt template, the retriever configuration, or the index change. Pin a model version or weight hash instead of latest, hash the exact prompt template, record the retriever and reranker configuration, and retain an index snapshot identifier. Extend this trace beyond index_version with model_version and prompt_hash fields. Even with those pins, provider-hosted generation may be only approximately reproducible because hardware and batching can perturb low-probability tokens. Store the produced answer as audit evidence instead of assuming byte-for-byte regeneration.
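One way to produce those extra trace fields is to hash the pinned inputs. The sketch below is illustrative: the prompt template, the model version string, the retriever settings, and the `retriever_config_hash` field name are all hypothetical placeholders, and SHA-256 is simply a convenient stable hash.

```python
import hashlib
import json

# Hypothetical prompt template; the real one would be the exact string sent to the model.
PROMPT_TEMPLATE = (
    "Answer only from the sources below. Cite source IDs. "
    "Abstain if the sources do not support the answer.\n\n"
    "{context}\n\nQuestion: {question}"
)


def sha256_hex(data: str) -> str:
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def reproducibility_pins(
    index_version: str,
    model_version: str,
    prompt_template: str,
    retriever_config: dict[str, object],
) -> dict[str, str]:
    """Collect the identifiers needed to replay a request against the same state."""
    return {
        "index_version": index_version,
        "model_version": model_version,
        "prompt_hash": sha256_hex(prompt_template),
        # sort_keys makes the hash independent of dictionary key order.
        "retriever_config_hash": sha256_hex(json.dumps(retriever_config, sort_keys=True)),
    }


pins = reproducibility_pins(
    index_version="policy-index/2026-05-27",
    model_version="example-model-2026-01-15",  # placeholder, not a real model name
    prompt_template=PROMPT_TEMPLATE,
    retriever_config={"top_k": 2, "min_matching_terms": 2},
)
print(sorted(pins))
```

Merging `pins` into the request trace lets a reviewer detect that a prompt edit or retriever setting changed between two answers, even when the index version stayed the same.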

Budget latency by stage

RAG adds work before the first generated token: authorization, retrieval, and context packing. Keep two measurements separate:

End-to-end time to first token (TTFT) is what the caller feels from request arrival until the first generated token.
Model TTFT starts when the service sends packed context to the model and ends when the first generated token arrives.

The fixture records model_ttft plus generate, where generate is time after the first token. That makes each stage additive while preserving the caller-visible TTFT. If model TTFT rises after a corpus change while retrieval stays fast, packed prompt size may be the issue.

Figure: RAG request timeline. Caller TTFT includes authorize, retrieve, pack, and model startup before token one, followed by generation and a small trace write. Stage timing separates front-end evidence work from slow model startup or long generation.

latency-gate.py

LATENCY_BUDGET_MS = {
    "authorize": 10,
    "retrieve": 80,
    "pack": 10,
    "model_ttft": 500,
    "generate": 500,
    "trace": 10,
}

def exceeded_budgets(timings: dict[str, int]) -> list[str]:
    return [
        stage
        for stage, budget in LATENCY_BUDGET_MS.items()
        if stage not in timings or timings[stage] > budget
    ]

healthy = trace["timings_ms"]
service_ttft = sum(
    healthy[stage]
    for stage in ("authorize", "retrieve", "pack", "model_ttft")
)
regressed = {**healthy, "model_ttft": 740}
missing_trace = {
    stage: duration
    for stage, duration in healthy.items()
    if stage != "trace"
}
print("Service TTFT:", service_ttft)
print("Healthy exceeded:", exceeded_budgets(healthy))
print("Regressed exceeded:", exceeded_budgets(regressed))
print("Missing timing exceeded:", exceeded_budgets(missing_trace))
assert service_ttft == 341
assert exceeded_budgets(healthy) == []
assert exceeded_budgets(regressed) == ["model_ttft"]
assert exceeded_budgets(missing_trace) == ["trace"]

Output

Service TTFT: 341
Healthy exceeded: []
Regressed exceeded: ['model_ttft']
Missing timing exceeded: ['trace']

Use frozen cases as a release gate

An appealing demo question doesn't establish reliability. Create frozen cases from policy questions, authorization attacks, outdated revisions, and missing-evidence requests. Keep the expected evidence IDs with each case. This turns the suite into an eval gate and separates retrieval failure from generation failure before users see the candidate.

RAG evaluation research also separates retrieval evidence quality from answer faithfulness and relevance rather than hiding all failures inside one final score [3]. The dedicated RAG evaluation lesson will implement those metrics. Start with hard release assertions that catch expensive mistakes immediately.
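Keeping expected evidence IDs with each case is what lets a failing case be attributed to a stage. The sketch below shows one possible attribution rule; `classify_case` and its labels are hypothetical, and `cited_chunk_ids` is assumed to be the chunk IDs that the answer's source labels resolve to.

```python
from __future__ import annotations


def classify_case(
    expected_ids: tuple[str, ...],
    retrieved_ids: tuple[str, ...],
    cited_chunk_ids: tuple[str, ...],
    abstained: bool,
    should_abstain: bool,
) -> str:
    """Attribute a frozen-case result to retrieval or generation."""
    if should_abstain:
        return "pass" if abstained else "generation: answered without support"
    missing = [chunk_id for chunk_id in expected_ids if chunk_id not in retrieved_ids]
    if missing:
        # The generator never saw the evidence, so don't blame the prompt.
        return f"retrieval: missing {missing}"
    if abstained:
        return "generation: abstained despite evidence"
    if not set(cited_chunk_ids) & set(expected_ids):
        return "generation: cited none of the expected evidence"
    return "pass"


print(classify_case(
    ("eu-key-rotation-v2-rule",),
    ("eu-session-timeout-v1-rule",),
    (),
    True,
    False,
))
print(classify_case(
    ("eu-key-rotation-v2-rule",),
    ("eu-key-rotation-v2-rule",),
    (),
    True,
    False,
))
```

Both example calls end in an abstention, but the first points at the retriever and the second at the answering step. A single pass/fail score would hide that difference.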

Figure: RAG release gate path. A frozen policy and attack suite passes authorization, freshness, evidence, answer, and latency checks in sequence before promotion, while any failed gate blocks the release. Frozen cases become release gates. A candidate ships only when policy, freshness, expected evidence, answer behavior, and latency all hold.

release-gates.py

@dataclass(frozen=True)
class EvalCase:
    name: str
    question: str
    corpus: tuple[PolicyChunk, ...]
    expected_chunk_ids: tuple[str, ...]
    forbidden_chunk_ids: tuple[str, ...]
    should_abstain: bool

CASES = [
    EvalCase(
        "supported-eu-key-rotation",
        "stale service-account automated rotation",
        tuple(CHUNKS),
        ("eu-key-rotation-v2-rule",),
        ("eu-key-rotation-v1-rule", "us-key-rotation-v4-rule"),
        False,
    ),
    EvalCase(
        "restricted-admin-source",
        "admin emergency stale service account rotation",
        tuple(corpus_with_restricted),
        ("eu-key-rotation-v2-rule",),
        ("restricted-admin-key-rotation",),
        False,
    ),
    EvalCase(
        "superseded-window",
        "stale service-account key rotation window",
        tuple(CHUNKS),
        ("eu-key-rotation-v2-rule",),
        ("eu-key-rotation-v1-rule",),
        False,
    ),
    EvalCase(
        "missing-test-key-policy",
        "sandbox credential exception policy",
        tuple(corpus_with_restricted),
        (),
        ("restricted-admin-key-rotation",),
        True,
    ),
]

def run_case(case: EvalCase) -> tuple[bool, str]:
    hits = retrieve(case.question, LUNA, list(case.corpus), EVAL_DATE)
    evidence = pack_evidence(hits)
    result = answer_from_evidence(case.question, evidence)
    ids = [chunk.chunk_id for chunk in hits]
    passed = (
        all(forbidden_id not in ids for forbidden_id in case.forbidden_chunk_ids)
        and result.abstained == case.should_abstain
        and tuple(ids) == case.expected_chunk_ids
    )
    return passed, f"{case.name}: ids={ids}, abstained={result.abstained}"

results = [run_case(case) for case in CASES]
for passed, summary in results:
    print("PASS" if passed else "BLOCK", summary)
print("Candidate promoted:", all(passed for passed, _ in results))
assert all(passed for passed, _ in results)

Output

PASS supported-eu-key-rotation: ids=['eu-key-rotation-v2-rule'], abstained=False
PASS restricted-admin-source: ids=['eu-key-rotation-v2-rule'], abstained=False
PASS superseded-window: ids=['eu-key-rotation-v2-rule'], abstained=False
PASS missing-test-key-policy: ids=[], abstained=True
Candidate promoted: True

The minimal suite already checks three high-impact failures: a forbidden chunk, a superseded chunk, and an unsupported answer. A serious deployment adds paraphrases, policy conflicts, index deletion cases, model-judge calibration, human reviews, and latency distributions.

What to block before launch

Gate	Block when	First repair location
Authorization	Any returned chunk lacks the caller's permission	Metadata and retrieval filter
Freshness	Answer cites a superseded version	Index lifecycle and effective-date filter
Evidence	Required source isn't in top candidates	Retriever, chunking, or metadata
Grounding	Answer asserts a policy not supported by context	Prompt, answer validator, or abstention
Latency	A critical stage exceeds budget consistently	Trace the stage before changing architecture
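The latency row says "consistently", which a single request can't show. One hedged way to make that concrete is to gate on a percentile over many traced requests. In the sketch below, the nearest-rank percentile, the 95th-percentile threshold, and the sample timings are all assumptions for illustration.

```python
def nearest_rank_percentile(samples: list[int], percent: int) -> int:
    """Nearest-rank percentile: the smallest sample with at least percent% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(samples)
    rank = (percent * len(ordered) + 99) // 100  # integer ceiling, avoids float rounding
    return ordered[rank - 1]


def stages_over_budget(
    samples_by_stage: dict[str, list[int]],
    budgets_ms: dict[str, int],
    percent: int = 95,
) -> list[str]:
    """List stages whose chosen percentile exceeds budget, or that have no samples key."""
    return [
        stage
        for stage, budget in budgets_ms.items()
        if stage not in samples_by_stage
        or nearest_rank_percentile(samples_by_stage[stage], percent) > budget
    ]


BUDGETS_MS = {"retrieve": 80, "model_ttft": 500}
one_spike = {"retrieve": [18] * 20, "model_ttft": [320] * 19 + [900]}
sustained = {"retrieve": [18] * 20, "model_ttft": [320] * 18 + [700, 720]}

print("One spike:", stages_over_budget(one_spike, BUDGETS_MS))
print("Sustained:", stages_over_budget(sustained, BUDGETS_MS))
```

With twenty samples, one slow request leaves the 95th percentile inside budget, while two slow requests push it over. That matches the table's advice to trace the stage before changing architecture.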

Ship the `policy-answerer-v1` artifact

The production RAG service has enough structure for a small portfolio artifact:

1. Store three versions of an access-control policy with chunk_id, document_id, parent_id, effective dates, region, and ACL tags.
2. Add at least four frozen questions: a supported EU request, a US-only request, a restricted admin-only policy attack, and a question whose answer is absent.
3. Implement a retriever behind the retrieve() contract. Keep the simple overlap baseline first.
4. Pack evidence with stable source IDs and return a cited answer or a documented abstention.
5. Write one trace JSON row per request without logging restricted text.
6. Produce a release report listing authorization, freshness, evidence, grounding, and latency gates.
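For the trace step in the list above, a minimal sketch of a JSON Lines writer with a guard against raw text fields might look like this. The `FORBIDDEN_KEYS` set and both helper names are assumptions, and a key-name check is only a weak guard: it catches a field called `text`, not policy text hidden under another name.

```python
import io
import json

# Assumed key names that would indicate raw source or prompt text in a trace.
FORBIDDEN_KEYS = frozenset({"text", "context", "prompt"})


def forbidden_paths(value: object, path: str = "") -> list[str]:
    """Recursively collect paths of dictionary keys that are not allowed in a trace."""
    found: list[str] = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else str(key)
            if key in FORBIDDEN_KEYS:
                found.append(child_path)
            found.extend(forbidden_paths(child, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            found.extend(forbidden_paths(child, f"{path}[{index}]"))
    return found


def write_trace_row(stream: io.TextIOBase, trace: dict[str, object]) -> None:
    """Append one JSON object per line, refusing traces that carry raw text fields."""
    leaks = forbidden_paths(trace)
    if leaks:
        raise ValueError(f"trace contains raw text fields: {leaks}")
    stream.write(json.dumps(trace, sort_keys=True) + "\n")


buffer = io.StringIO()  # stands in for an append-only log file
write_trace_row(
    buffer,
    {
        "request_id": "rag-0007",
        "source_map": {"E1": {"chunk_id": "eu-key-rotation-v2-rule"}},
        "abstained": False,
    },
)
print(buffer.getvalue(), end="")
```

An in-memory buffer keeps the example free of side effects; a real service would pass an open log file or a logging sink instead.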

release-report.py

release_hits = retrieve(question, LUNA, corpus_with_restricted, EVAL_DATE)
release_report = {
    "candidate": "policy-answerer-v1",
    "index_version": trace["index_version"],
    "evaluated_cases": len(CASES),
    "authorization_gate": restricted.chunk_id not in [
        chunk.chunk_id for chunk in release_hits
    ],
    "freshness_gate": "eu-key-rotation-v1-rule" not in [
        chunk.chunk_id for chunk in release_hits
    ],
    "latency_gate": exceeded_budgets(trace["timings_ms"]) == [],
    "case_gate": all(passed for passed, _ in results),
}
promote = all(
    value is True
    for key, value in release_report.items()
    if key.endswith("_gate")
)
print("Candidate:", release_report["candidate"])
print("Index:", release_report["index_version"])
print("All hard gates pass:", promote)
assert promote

Output

Candidate: policy-answerer-v1
Index: policy-index/2026-05-27
All hard gates pass: True

Mastery check

You're ready to design a production RAG pipeline when you can:

Explain why a RAG answer must be treated as an evidence-backed operation instead of generated text alone.
Define a versioned chunk record with stable citation identity, effective dates, region, and ACL metadata.
Enforce authorization and policy freshness before any retrieved text reaches the model.
Pack small, cited context and require an abstention when permitted evidence can't support the answer.
Record a reproducible request trace without storing restricted text in unsafe logs.
Separate caller-visible TTFT from model TTFT so latency regressions point to the right stage.
Gate a candidate on authorization, freshness, grounding, abstention, and latency behavior.
Preserve that contract while a later retrieval implementation replaces the simple search baseline.

Evaluation rubric

Level	Evidence in your submission
Foundational	Versioned chunks, current-policy filtering, and a supported cited answer
Applied	Authorization attack stays hidden and missing evidence triggers abstention
Strong	Frozen cases, request traces, and explicit stage budgets block bad releases
Production-ready	Retriever upgrades improve evidence recall without changing permission or grounding guarantees

Common pitfalls

Symptom	Likely cause	Repair
Answer cites last year's rotation window	Index overwrote or failed to filter superseded policies	Keep versioned records and filter by effective date
Restricted admin-only rule appears in prompt	Permission check happened after retrieval	Filter candidates inside the retrieval boundary
Reviewer can't tell which policy supports a response	Citation IDs weren't carried into packed context and trace	Keep stable chunk and parent identifiers end to end
Bot promises an outcome absent from evidence	Generation had no enforced abstention path	Require supported claims or escalation
Quality debates can't be resolved	Tests record answers but not retrieved evidence	Freeze expected evidence IDs and save traces

Next Step

Continue to Hybrid Search: Dense + Sparse

You now have the evidence, authorization, grounding, and release contract for a RAG service. Next you'll upgrade its retrieval lane so exact identifiers and paraphrased policy questions both recover the right permitted evidence.

References

[1] Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. https://arxiv.org/abs/2005.11401

[2] Liu, N.F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. TACL 2023. https://arxiv.org/abs/2307.03172

[3] Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint. https://arxiv.org/abs/2309.15217

Discussion

Questions and insights from fellow learners.

Discussion loads when you reach this section.

Back to Topics

Mastery Check

Why isn't a fluent answer with a source-looking citation sufficient evidence that a RAG service worked?

Why retain a superseded chunk if the retriever must not use it to answer today's question?

The model produced a concise answer that matches a policy from memory, but retrieval returned no permitted current evidence. What should the service return?

Authorization takes 2 ms, retrieval takes 18 ms, packing takes 1 ms, and model TTFT takes 320 ms. Why should a dashboard also report 341 ms as end-to-end TTFT?

What contract must remain unchanged when you swap the lab's term-overlap retriever for hybrid search?
