LeetLLM

Production RAG Pipelines

Design a secure, traceable RAG service around versioned policy evidence, grounded answers, abstention, release gates, and latency budgets.


Your ShopFlow support agent can now be evaluated as an agent. It still can't answer a policy question safely unless it receives the right evidence. A resolution rule may change by region, merchant, product condition, or policy revision. An answer that sounds right but cites last year's rule can authorize a costly mistake.

Retrieval-augmented generation (RAG) gives a language model retrieved evidence at answer time instead of expecting its weights to contain current private facts.[1] In production, that simple idea becomes a system contract: index traceable evidence, retrieve only evidence the user may see, generate from that evidence, abstain when it isn't enough, and retain a trace that a reviewer can inspect.

This lesson builds that contract for policy-answerer-v1. You won't implement BM25, dense embeddings, fusion, or reranking here. Those retrieval algorithms belong in the next lessons. Here, you'll make the pipeline around any retriever trustworthy.

The promise the service must keep

Suppose Luna, an EU support specialist, asks:

Can a customer get a replacement for a damaged refurbished laptop delivered 10 days ago?

The answer isn't just text. A release-worthy response must satisfy four properties:

| Property | What the user needs | Failure you must block |
| --- | --- | --- |
| Correct evidence | Current EU refurbished-electronics policy | Old US or superseded rule retrieved |
| Authorization | Only sources Luna may read | Restricted merchant addendum leaks |
| Grounding | Each policy claim points to evidence | Model invents a replacement window |
| Operability | Trace and latency data for the request | Team can't reproduce a bad promise |

[Figure: Production RAG pipeline showing versioned indexing, authorized retrieval, grounded answers, and release traces.]
A production RAG answer is the end of an evidence path. Indexing prepares versioned records; the online path authorizes retrieval before a model can see text; traces and eval gates decide whether a candidate can ship.

The data path has an offline side and an online side:

[Diagram: policy revisions, versioned evidence index, authorized question, and filter permitted evidence.]

The indexer produces evidence records when documents change. The request path filters those records by identity and policy state, asks a retriever for candidates, packs source-labelled context, and returns either a supported answer or an abstention. The release path replays frozen questions before a new index, prompt, retriever, or model version serves users.
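The request path described above can be sketched as a single function that wires the stages together. This is a minimal sketch under stated assumptions, not the lesson's implementation: `Evidence`, `Response`, `answer_request`, and the two toy stage functions are hypothetical names, and the retriever is assumed to apply identity and policy-state filtering internally.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Evidence:
    source_id: str  # label shown to the generator, e.g. "E1"
    chunk_id: str
    text: str


@dataclass(frozen=True)
class Response:
    text: str
    cited: tuple[str, ...]
    abstained: bool


ABSTAIN = Response("I can't confirm that from permitted current evidence.", (), True)


def answer_request(
    question: str,
    retrieve_permitted: Callable[[str], list[Evidence]],
    generate: Callable[[str, list[Evidence]], Optional[Response]],
) -> Response:
    """Retrieve, then generate; abstain when either stage comes back empty."""
    evidence = retrieve_permitted(question)  # filtering is assumed to happen inside
    if not evidence:
        return ABSTAIN
    draft = generate(question, evidence)
    if draft is None or not draft.cited:
        return ABSTAIN  # an uncited draft is treated as unsupported
    return draft


# Toy stand-ins (hypothetical) so the sketch runs without a real index or model.
def toy_retrieve(question: str) -> list[Evidence]:
    if "laptop" in question.lower():
        return [Evidence("E1", "eu-refurb-v2-rule", "Replacement within 14 days.")]
    return []


def toy_generate(question: str, evidence: list[Evidence]) -> Optional[Response]:
    first = evidence[0]
    return Response(f"{first.text} [{first.source_id}]", (first.source_id,), False)


supported = answer_request("Damaged laptop replacement?", toy_retrieve, toy_generate)
missing = answer_request("Drone propeller refund?", toy_retrieve, toy_generate)
print(supported.text)
print("Abstained on missing evidence:", missing.abstained)
```

Passing the stages in as callables keeps the abstention rule in one place, so swapping the retriever or generator later does not change when the service refuses to answer.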

Build the evidence record

Earlier chunking lessons showed how to cut a document into searchable spans. A production service adds the fields needed to use those spans later: a stable document identifier, a parent section for citations, a version, an effective date range, a region, and access tags.

Our tiny corpus has three current policies and one superseded policy. Notice that the US and EU rules deliberately differ. That difference turns an access-control bug into a visible wrong answer.

evidence-records.py

    from __future__ import annotations

    from dataclasses import dataclass
    from datetime import date
    import re

    @dataclass(frozen=True)
    class PolicyChunk:
        chunk_id: str
        document_id: str
        parent_id: str
        version: str
        region: str
        acl_tag: str
        effective_from: date
        effective_to: date | None
        text: str

    EVAL_DATE = date(2026, 5, 27)
    CHUNKS = [
        PolicyChunk(
            chunk_id="eu-refurb-v2-rule",
            document_id="eu-electronics",
            parent_id="eu-electronics-v2",
            version="eu-electronics/2026-04-01",
            region="EU",
            acl_tag="support:eu",
            effective_from=date(2026, 4, 1),
            effective_to=None,
            text=(
                "Damaged refurbished laptops qualify for replacement within "
                "14 days of delivery when damage is reported within 48 hours."
            ),
        ),
        PolicyChunk(
            chunk_id="eu-refurb-v1-rule",
            document_id="eu-electronics",
            parent_id="eu-electronics-v1",
            version="eu-electronics/2025-02-01",
            region="EU",
            acl_tag="support:eu",
            effective_from=date(2025, 2, 1),
            effective_to=date(2026, 3, 31),
            text="Damaged refurbished laptops qualify for return within 30 days.",
        ),
        PolicyChunk(
            chunk_id="us-refurb-v4-rule",
            document_id="us-electronics",
            parent_id="us-electronics-v4",
            version="us-electronics/2026-03-15",
            region="US",
            acl_tag="support:us",
            effective_from=date(2026, 3, 15),
            effective_to=None,
            text="Damaged refurbished laptops qualify for refund within 30 days.",
        ),
        PolicyChunk(
            chunk_id="eu-shoes-v1-rule",
            document_id="eu-footwear",
            parent_id="eu-footwear-v1",
            version="eu-footwear/2026-01-03",
            region="EU",
            acl_tag="support:eu",
            effective_from=date(2026, 1, 3),
            effective_to=None,
            text="Unworn footwear may be returned within 30 days of delivery.",
        ),
    ]

    def is_current(chunk: PolicyChunk, on_date: date) -> bool:
        # A record is current when on_date falls inside its effective range.
        return (
            chunk.effective_from <= on_date
            and (chunk.effective_to is None or on_date <= chunk.effective_to)
        )

    current_ids = [chunk.chunk_id for chunk in CHUNKS if is_current(chunk, EVAL_DATE)]
    print("All evidence records:", len(CHUNKS))
    print("Current records:", current_ids)
    assert "eu-refurb-v1-rule" not in current_ids

Output

    All evidence records: 4
    Current records: ['eu-refurb-v2-rule', 'us-refurb-v4-rule', 'eu-shoes-v1-rule']

The record is deliberately more boring than a model call. That's good. Every later stage can now prove which policy revision it used. The fixed EVAL_DATE also makes this replay reproducible instead of changing behavior with the wall clock.

Retrieve small, cite enough context

Indexing whole policy pages gives a retriever too much irrelevant prose. Indexing one sentence can lose surrounding exceptions. Parent-child indexing stores a compact child span for search and a parent section for final evidence. The retriever can match the child ID, then context assembly can fetch the parent section and its stable citation metadata.

The compact lab keeps child text inline and carries document_id plus parent_id. A full parent-child implementation resolves parent_id to a version-matched, permitted parent section before packing. Keep those fields separate instead of parsing meaning out of an ID string.
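The parent lookup that a fuller implementation would perform can be sketched as follows. This is an illustrative sketch, assuming an in-memory parent table: `ParentSection`, `ChildHit`, `PARENTS`, and `resolve_parent` are hypothetical names, and the parent text is a placeholder rather than real policy wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParentSection:
    parent_id: str
    version: str
    acl_tag: str
    text: str


@dataclass(frozen=True)
class ChildHit:
    chunk_id: str
    parent_id: str
    version: str


# Hypothetical parent store keyed by parent_id; the text is a placeholder.
PARENTS = {
    "eu-electronics-v2": ParentSection(
        parent_id="eu-electronics-v2",
        version="eu-electronics/2026-04-01",
        acl_tag="support:eu",
        text="(full section text, including caveats around the child rule)",
    ),
}


def resolve_parent(
    hit: ChildHit,
    parents: dict[str, ParentSection],
    caller_tags: frozenset[str],
) -> Optional[ParentSection]:
    """Return the parent only if it exists, matches the child's version,
    and is readable by the caller. Otherwise return None."""
    parent = parents.get(hit.parent_id)
    if parent is None:
        return None
    if parent.version != hit.version:
        return None  # never pair a child with a different revision's parent
    if parent.acl_tag not in caller_tags:
        return None  # the parent needs its own permission check
    return parent


hit = ChildHit("eu-refurb-v2-rule", "eu-electronics-v2", "eu-electronics/2026-04-01")
stale = ChildHit("eu-refurb-v1-rule", "eu-electronics-v2", "eu-electronics/2025-02-01")
eu_tags = frozenset({"support:eu"})
us_tags = frozenset({"support:us"})

print("Matched parent:", resolve_parent(hit, PARENTS, eu_tags) is not None)
print("Version mismatch blocked:", resolve_parent(stale, PARENTS, eu_tags) is None)
print("Permission mismatch blocked:", resolve_parent(hit, PARENTS, us_tags) is None)
```

Every check compares explicit fields, which is the reason to keep `document_id`, `parent_id`, and `version` separate instead of decoding them from an ID string.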

[Figure: Parent-child indexing diagram showing small child chunks embedded for precise retrieval, parent sections stored for full context, and parent IDs joining child hits back to broader evidence before generation.]
The child record helps retrieval identify a specific rule. The parent record preserves caveats and a stable citation target. Both need the same version and authorization boundary.

Chunk overlap remains useful when a sentence straddles a boundary, but it isn't a magic setting. Treat it as an indexing candidate that must survive retrieval tests on your own policy questions.

[Figure: Chunk overlap boundary example showing no-overlap chunks split the refurbished laptop 14-day policy rule while overlap creates one chunk containing the complete answer.]
A boundary that cuts the 14-day rule in half makes even a good retriever fail. Overlap can preserve a complete evidence span, but you still measure the result on labeled queries.
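To make the boundary effect concrete, here is a small word-window chunker. It is a sketch under simplifying assumptions: `chunk_words` is a hypothetical helper, the sample text is invented for the demonstration, and real chunkers usually split on tokens or sentences rather than whitespace.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Invented sample text for the demonstration.
TEXT = (
    "Refurbished laptops ship with a warranty. Damaged refurbished laptops "
    "qualify for replacement within 14 days of delivery."
)
RULE = "replacement within 14 days"

no_overlap = chunk_words(TEXT, size=6, overlap=0)
with_overlap = chunk_words(TEXT, size=6, overlap=3)

print("Rule intact without overlap:", any(RULE in chunk for chunk in no_overlap))
print("Rule intact with overlap:", any(RULE in chunk for chunk in with_overlap))
print("Chunk counts:", len(no_overlap), "vs", len(with_overlap))
```

The overlapping index keeps the rule in one chunk but also stores more chunks, which is the cost to weigh against measured retrieval results.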

The next fragment checks a basic index invariant: at most one current version for the same region and document. Two active revisions would allow the request path to retrieve contradictory promises.

index-invariants.py

    from collections import defaultdict

    def validate_current_versions(chunks: list[PolicyChunk], on_date: date) -> None:
        # A set, so several chunks from one revision don't look like a conflict.
        active_by_scope: dict[tuple[str, str], set[str]] = defaultdict(set)
        for chunk in chunks:
            if is_current(chunk, on_date):
                scope = (chunk.region, chunk.document_id)
                active_by_scope[scope].add(chunk.version)

        conflicts = {
            scope: sorted(versions)
            for scope, versions in active_by_scope.items()
            if len(versions) > 1
        }
        if conflicts:
            raise ValueError(f"Conflicting active policy versions: {conflicts}")

    validate_current_versions(CHUNKS, EVAL_DATE)
    print("Current-version invariant: pass")
    print("Superseded EU record stays indexed for audit, not answering.")

Output

    Current-version invariant: pass
    Superseded EU record stays indexed for audit, not answering.

Put authorization before similarity

An embedding index doesn't know whether Luna can read a document. A highly similar restricted chunk is still forbidden. The safe order is:

  1. Determine the caller's tenant, role, region, and request date from trusted application state.
  2. Select admissible evidence by those fields.
  3. Search only within that admissible set, or use a store that enforces the filter inside retrieval.
  4. Pass only returned permitted text to context assembly and logs visible to the caller.

Filtering after text has already reached the model is too late. The model, request trace, cache, or error report may already contain restricted content.
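The difference between filtering inside retrieval and filtering afterwards shows up even in a toy vector search. The sketch below is an assumption-laden illustration: `Record`, `search_prefiltered`, and `search_postfiltered` are hypothetical names, the two-dimensional vectors are made up, and a real vector store would expose its own filter mechanism.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    chunk_id: str
    acl_tag: str
    vector: tuple[float, ...]


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_prefiltered(query, records, caller_tags, top_k=1):
    """Safe order: drop forbidden records first, then rank what remains."""
    admissible = [r for r in records if r.acl_tag in caller_tags]
    ranked = sorted(admissible, key=lambda r: cosine(query, r.vector), reverse=True)
    return ranked[:top_k]


def search_postfiltered(query, records, caller_tags, top_k=1):
    """Contrast case: rank everything, cut to top_k, then drop forbidden hits."""
    ranked = sorted(records, key=lambda r: cosine(query, r.vector), reverse=True)
    return [r for r in ranked[:top_k] if r.acl_tag in caller_tags]


# Made-up vectors: the restricted record is the closest match to the query.
RECORDS = [
    Record("merchant-vip-refurb", "merchant:vip-ops", (1.0, 0.0)),
    Record("eu-refurb-v2-rule", "support:eu", (0.8, 0.6)),
]
QUERY = (1.0, 0.0)
TAGS = frozenset({"support:eu"})

pre = [r.chunk_id for r in search_prefiltered(QUERY, RECORDS, TAGS)]
post = [r.chunk_id for r in search_postfiltered(QUERY, RECORDS, TAGS)]
print("Pre-filter hits:", pre)
print("Post-filter hits:", post)
```

In the post-filter variant the restricted record is scored and occupies the only result slot, so the permitted rule is lost even though nothing forbidden is returned. Pre-filtering avoids both problems.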

The lab uses a simple term-overlap search so its authorization behavior is obvious. Its retrieve() interface is the part you'll replace with hybrid search in the next chapter.

authorized-retrieval.py

    @dataclass(frozen=True)
    class Caller:
        actor_id: str
        region: str
        acl_tags: frozenset[str]

    LUNA = Caller("luna-48291", "EU", frozenset({"support:eu"}))

    def allowed_chunks(caller: Caller, chunks: list[PolicyChunk], on_date: date) -> list[PolicyChunk]:
        # Authorization and freshness are applied before any scoring happens.
        return [
            chunk
            for chunk in chunks
            if is_current(chunk, on_date)
            and chunk.region == caller.region
            and chunk.acl_tag in caller.acl_tags
        ]

    def terms(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def retrieve(
        query: str,
        caller: Caller,
        chunks: list[PolicyChunk],
        on_date: date,
        top_k: int = 2,
        min_matching_terms: int = 2,
    ) -> list[PolicyChunk]:
        permitted = allowed_chunks(caller, chunks, on_date)
        query_terms = terms(query)
        scored = [
            (len(query_terms & terms(chunk.text)), chunk)
            for chunk in permitted
        ]
        ranked = sorted(scored, key=lambda item: item[0], reverse=True)
        return [
            chunk
            for score, chunk in ranked
            if score >= min_matching_terms
        ][:top_k]

    question = "damaged refurbished laptop replacement after 10 days"
    hits = retrieve(question, LUNA, CHUNKS, EVAL_DATE)
    print("Retrieved:", [(chunk.chunk_id, chunk.version) for chunk in hits])
    print("US evidence exposed:", any(chunk.region == "US" for chunk in hits))
    assert hits[0].chunk_id == "eu-refurb-v2-rule"
    assert all(chunk.acl_tag == "support:eu" for chunk in hits)

Output

    Retrieved: [('eu-refurb-v2-rule', 'eu-electronics/2026-04-01')]
    US evidence exposed: False

This retriever isn't production search: its two-term threshold rejects weak hits, but it misses paraphrases such as "reconditioned notebook." It is a clean test double for the surrounding pipeline. Once the authorization and trace contract work, you can improve recall without weakening the boundary.
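The paraphrase gap is easy to reproduce. The snippet below copies the lab's `terms()` tokenizer so it runs on its own; `overlap` and the two sample queries are hypothetical additions for illustration.

```python
import re


def terms(text: str) -> set[str]:
    # Same tokenizer as the lab: lowercase alphanumeric runs.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


RULE = (
    "Damaged refurbished laptops qualify for replacement within "
    "14 days of delivery when damage is reported within 48 hours."
)


def overlap(query: str, text: str) -> int:
    """Count distinct terms shared by the query and the text."""
    return len(terms(query) & terms(text))


literal = overlap("damaged refurbished laptop replacement", RULE)
paraphrase = overlap("broken reconditioned notebook swap", RULE)

print("Literal query overlap:", literal)
print("Paraphrased query overlap:", paraphrase)
```

The literal query matches three terms ("laptop" does not match "laptops" because there is no stemming), while the paraphrase matches none and falls below the two-term threshold. That is the recall gap a later retriever has to close.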

Failure test: a tempting but forbidden result

A useful test shouldn't only prove success. It should include a result that would rank well if the permission filter were missing.

acl-regression-test.py

    restricted = PolicyChunk(
        chunk_id="merchant-vip-refurb",
        document_id="merchant-vip-terms",
        parent_id="merchant-vip-terms",
        version="merchant-vip/2026-05-01",
        region="EU",
        acl_tag="merchant:vip-ops",
        effective_from=date(2026, 5, 1),
        effective_to=None,
        text=(
            "Damaged refurbished laptops receive immediate full refund within "
            "14 days for VIP merchant operations."
        ),
    )

    corpus_with_restricted = [restricted, *CHUNKS]
    safe_hits = retrieve(question, LUNA, corpus_with_restricted, EVAL_DATE)
    visible_ids = [chunk.chunk_id for chunk in safe_hits]

    print("Visible hit ids:", visible_ids)
    print("Restricted VIP policy hidden:", restricted.chunk_id not in visible_ids)
    assert restricted.chunk_id not in visible_ids

Output

    Visible hit ids: ['eu-refurb-v2-rule']
    Restricted VIP policy hidden: True

| Design choice | Unsafe shortcut | Observable consequence |
| --- | --- | --- |
| Filter before retrieval | Retrieve everything, redact after generation | Secret rule may enter prompt or trace |
| Store versions and dates | Overwrite the old chunk in place | Can't reproduce a historical answer |
| Preserve parent citation | Return text with no source identity | Reviewer can't verify a claim |

Pack evidence for a grounded answer

Retrieval produces candidate records, not an answer. Context assembly must give the generator source labels, version information, and a clear instruction to abstain when the evidence doesn't establish the requested promise.

Don't stuff every near-match into the prompt. Even when a context window fits a large amount of text, models can use relevant information less reliably when it sits among long distractors, especially in the middle of a long input.[2] Pack the strongest permitted evidence first, keep the set small, and evaluate this policy rather than assuming it works.

pack-cited-context.py

    @dataclass(frozen=True)
    class PackedEvidence:
        source_id: str
        chunk_id: str
        document_id: str
        parent_id: str
        version: str
        text: str

    def pack_evidence(hits: list[PolicyChunk], max_characters: int = 400) -> list[PackedEvidence]:
        packed: list[PackedEvidence] = []
        used = 0
        for position, chunk in enumerate(hits, start=1):
            # Stop at the first chunk that would exceed the character budget.
            if used + len(chunk.text) > max_characters:
                break
            packed.append(
                PackedEvidence(
                    source_id=f"E{position}",
                    chunk_id=chunk.chunk_id,
                    document_id=chunk.document_id,
                    parent_id=chunk.parent_id,
                    version=chunk.version,
                    text=chunk.text,
                )
            )
            used += len(chunk.text)
        return packed

    packed = pack_evidence(safe_hits)
    context = "\n".join(
        f"[{item.source_id}] {item.parent_id} ({item.version}): {item.text}"
        for item in packed
    )
    print(context)
    assert "[E1]" in context
    assert packed[0].document_id == "eu-electronics"
    assert packed[0].parent_id == "eu-electronics-v2"
    assert "merchant-vip" not in context

Output

    [E1] eu-electronics-v2 (eu-electronics/2026-04-01): Damaged refurbished laptops qualify for replacement within 14 days of delivery when damage is reported within 48 hours.

Answer or abstain

In an actual service, a language model would receive the packed context and an instruction to cite it. For the lab, a deterministic answerer makes the core contract inspectable: it emits the rule only when the required evidence is present and otherwise refuses to promise a resolution outcome.

grounded-answer.py

    @dataclass(frozen=True)
    class Answer:
        text: str
        cited_sources: tuple[str, ...]
        abstained: bool

    def answer_from_evidence(question: str, evidence: list[PackedEvidence]) -> Answer:
        for item in evidence:
            if "14 days" in item.text and "48 hours" in item.text:
                return Answer(
                    text=(
                        "Yes, if damage was reported within 48 hours of delivery; "
                        "the replacement window is 14 days. "
                        f"[{item.source_id}]"
                    ),
                    cited_sources=(item.source_id,),
                    abstained=False,
                )
        # No supporting evidence: refuse to promise an outcome.
        return Answer(
            text="I can't confirm that outcome from permitted current policy evidence.",
            cited_sources=(),
            abstained=True,
        )

    supported = answer_from_evidence(question, packed)
    missing = answer_from_evidence("Can I refund a damaged drone?", [])
    print("Supported:", supported.text)
    print("No evidence:", missing.text)
    assert supported.cited_sources == ("E1",)
    assert missing.abstained

Output

    Supported: Yes, if damage was reported within 48 hours of delivery; the replacement window is 14 days. [E1]
    No evidence: I can't confirm that outcome from permitted current policy evidence.

The lab uses string checks only to make the invariant runnable. A real candidate may use a model, structured citations, and claim verification. The release rule remains: if permitted current evidence doesn't support a material policy claim, the system must abstain or escalate.
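When a model writes the answer, one cheap structural check is to confirm that every cited label exists in the packed evidence. The sketch below assumes citations appear as bracketed labels such as `[E1]`; `CITATION` and `validate_citations` are hypothetical names. It verifies only that citations exist and resolve, not that the cited text supports the claim, which needs a separate verification step.

```python
import re

# Assumed citation format: a bracketed source label such as [E1].
CITATION = re.compile(r"\[(E\d+)\]")


def validate_citations(answer_text: str, packed_ids: set[str]) -> tuple[bool, list[str]]:
    """Return (ok, problems) for the citations found in a generated answer."""
    cited = CITATION.findall(answer_text)
    problems = []
    if not cited:
        problems.append("no citation")
    for source_id in cited:
        if source_id not in packed_ids:
            problems.append(f"unknown source {source_id}")
    return (not problems, problems)


packed_ids = {"E1"}
good = validate_citations("The replacement window is 14 days. [E1]", packed_ids)
invented = validate_citations("Refunds take 30 days. [E3]", packed_ids)
uncited = validate_citations("Refunds are always allowed.", packed_ids)

print("Valid answer:", good)
print("Invented source:", invented)
print("Uncited claim:", uncited)
```

A failed check would route the request to abstention or escalation rather than returning the draft.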

Record a reproducible request trace

The agent evaluation lesson treated traces as observable release evidence. RAG needs the same discipline. Record versions and decisions needed to reproduce an answer, but don't copy restricted source text into broad logs.

| Trace field | Example | Why it matters |
| --- | --- | --- |
| request_id, actor_id, region | rag-0007, luna-48291, EU | Establishes authorization context |
| index_version | policy-index/2026-05-27 | Lets you replay against the same evidence state |
| retrieved_chunk_ids, source_map | ["eu-refurb-v2-rule"], {"E1": {...}} | Connects packed citations to versioned parent evidence |
| cited_source_ids | ["E1"] | Connects answer claim to packed evidence |
| abstained | false | Makes coverage and failures measurable |
| Stage timings | retrieve_ms=18, ttft_ms=320, generate_ms=410 | Locates latency regressions |

request-trace.py

    def trace_request(
        request_id: str,
        caller: Caller,
        hits: list[PolicyChunk],
        evidence: list[PackedEvidence],
        answer: Answer,
    ) -> dict[str, object]:
        # Identifiers and versions only; no raw policy text enters the trace.
        return {
            "request_id": request_id,
            "actor_id": caller.actor_id,
            "region": caller.region,
            "index_version": "policy-index/2026-05-27",
            "retrieved_chunk_ids": [chunk.chunk_id for chunk in hits],
            "retrieved_versions": [chunk.version for chunk in hits],
            "source_map": {
                item.source_id: {
                    "chunk_id": item.chunk_id,
                    "document_id": item.document_id,
                    "parent_id": item.parent_id,
                    "version": item.version,
                }
                for item in evidence
            },
            "cited_source_ids": list(answer.cited_sources),
            "abstained": answer.abstained,
            "timings_ms": {
                "authorize": 2,
                "retrieve": 18,
                "pack": 1,
                "ttft": 320,
                "generate": 410,
                "trace": 3,
            },
        }

    trace = trace_request("rag-0007", LUNA, safe_hits, packed, supported)
    stores_raw_policy_text = any(
        chunk.text in str(trace)
        for chunk in corpus_with_restricted
    )
    print("Trace chunks:", trace["retrieved_chunk_ids"])
    print("Trace source map:", trace["source_map"])
    print("Trace cites:", trace["cited_source_ids"])
    print("Trace stores raw policy text:", stores_raw_policy_text)
    assert not stores_raw_policy_text

Output

    Trace chunks: ['eu-refurb-v2-rule']
    Trace source map: {'E1': {'chunk_id': 'eu-refurb-v2-rule', 'document_id': 'eu-electronics', 'parent_id': 'eu-electronics-v2', 'version': 'eu-electronics/2026-04-01'}}
    Trace cites: ['E1']
    Trace stores raw policy text: False
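If reviewers also need to confirm that the text behind a chunk ID hasn't changed since the request, one option is to log a digest of the text instead of the text itself. This is a sketch of that idea, not part of the lab: `text_digest` and `trace_entry` are hypothetical helpers, and the field name `text_sha256` is an assumption.

```python
import hashlib


def text_digest(text: str) -> str:
    """SHA-256 hex digest of the evidence text, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def trace_entry(chunk_id: str, version: str, text: str) -> dict[str, str]:
    # Store identifiers plus a digest; the raw text never enters the trace.
    return {
        "chunk_id": chunk_id,
        "version": version,
        "text_sha256": text_digest(text),
    }


RULE = "Damaged refurbished laptops qualify for replacement within 14 days."
entry = trace_entry("eu-refurb-v2-rule", "eu-electronics/2026-04-01", RULE)

print("Digest length:", len(entry["text_sha256"]))
print("Raw text in trace:", RULE in str(entry))
print("Replay matches:", entry["text_sha256"] == text_digest(RULE))
```

A plain digest still lets anyone who already holds a candidate text confirm a match, so it limits exposure rather than eliminating it. Whether that trade-off is acceptable depends on who can read the logs.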

Budget latency by stage

RAG adds work before the first generated token: authorization, retrieval, and context packing. Track that work separately from time to first token (TTFT). The fixture records generate as time after the first token, so its stages add up to one request timeline. If TTFT rises after a corpus change while retrieval stays fast, prompt size may be the issue.

[Figure: RAG request latency budget separating authorization, retrieval, prompt packing, time to first token, answer generation, and trace recording.]
One end-to-end number hides the repair. Stage timing distinguishes slow evidence lookup from excessive packed context or slow generation startup.

latency-gate.py

    LATENCY_BUDGET_MS = {
        "authorize": 10,
        "retrieve": 80,
        "pack": 10,
        "ttft": 500,
        "generate": 500,
        "trace": 10,
    }

    def exceeded_budgets(timings: dict[str, int]) -> list[str]:
        # A missing stage counts as a failure: an unmeasured stage can't pass.
        return [
            stage
            for stage, budget in LATENCY_BUDGET_MS.items()
            if stage not in timings or timings[stage] > budget
        ]

    healthy = trace["timings_ms"]
    regressed = {**healthy, "ttft": 740}
    missing_trace = {
        stage: duration
        for stage, duration in healthy.items()
        if stage != "trace"
    }
    print("Healthy exceeded:", exceeded_budgets(healthy))
    print("Regressed exceeded:", exceeded_budgets(regressed))
    print("Missing timing exceeded:", exceeded_budgets(missing_trace))
    assert exceeded_budgets(healthy) == []
    assert exceeded_budgets(regressed) == ["ttft"]
    assert exceeded_budgets(missing_trace) == ["trace"]

Output

    Healthy exceeded: []
    Regressed exceeded: ['ttft']
    Missing timing exceeded: ['trace']
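The fixture hard-codes its timings. In a running service they have to be measured, and a small context manager is one way to do that. This is a minimal sketch: `timed` is a hypothetical helper, and the `sleep` call merely stands in for real retrieval work.

```python
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def timed(stage: str, timings: dict[str, int]) -> Iterator[None]:
    """Record the wall-clock duration of a stage in whole milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the stage raises, so failed requests still have timings.
        timings[stage] = round((time.perf_counter() - start) * 1000)


timings: dict[str, int] = {}

with timed("retrieve", timings):
    time.sleep(0.01)  # stand-in for a retrieval call

try:
    with timed("generate", timings):
        raise RuntimeError("simulated generation failure")
except RuntimeError:
    pass

print("Recorded stages:", sorted(timings))
```

Because the timer writes in a `finally` block, a stage that fails still leaves a measurement behind, which keeps the missing-stage rule in the gate meaningful.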

Use frozen cases as a release gate

An appealing demo question doesn't establish reliability. Create frozen cases from policy questions, authorization attacks, outdated revisions, and missing-evidence requests. Keep the expected evidence IDs with each case. This separates retrieval failure from generation failure before users see the candidate.

RAG evaluation research also separates retrieval evidence quality from answer faithfulness and relevance rather than hiding all failures inside one final score.[3] The dedicated RAG evaluation lesson will implement those metrics. This lesson starts with hard release assertions that catch expensive mistakes immediately.
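A simple way to keep the two failure types apart in a frozen case is to check retrieval first and the answer second. The functions below are a hypothetical triage sketch, not the metrics from the cited work: `recall_at_k` and `diagnose` are illustrative names.

```python
def recall_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    """Fraction of expected evidence IDs found in the top-k retrieved IDs."""
    if not expected:
        return 1.0  # nothing was required, so nothing was missed
    return len(set(retrieved[:k]) & expected) / len(expected)


def diagnose(
    retrieved: list[str],
    expected: set[str],
    abstained: bool,
    should_abstain: bool,
    k: int = 2,
) -> str:
    """Attribute a failing case to retrieval before blaming generation."""
    if recall_at_k(retrieved, expected, k) < 1.0:
        return "retrieval"
    if abstained != should_abstain:
        return "generation"
    return "ok"


expected = {"eu-refurb-v2-rule"}
print(diagnose([], expected, abstained=True, should_abstain=False))
print(diagnose(["eu-refurb-v2-rule"], expected, abstained=True, should_abstain=False))
print(diagnose(["eu-refurb-v2-rule"], expected, abstained=False, should_abstain=False))
```

The first case never received its evidence, so the abstention is a retrieval problem. The second case had the evidence and still abstained, which points at the answer stage.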

[Figure: RAG release evidence board separating retrieval gates, answer gates, latency gates, and a blocked-release rule.]
Gate first on the correctness properties that must never fail: no unauthorized evidence, no stale policy citation, supported answers, correct abstentions, and an explicit latency budget.

release-gates.py

    @dataclass(frozen=True)
    class EvalCase:
        name: str
        question: str
        corpus: tuple[PolicyChunk, ...]
        expected_chunk_ids: tuple[str, ...]
        forbidden_chunk_ids: tuple[str, ...]
        should_abstain: bool

    CASES = [
        EvalCase(
            "supported-eu-laptop",
            "damaged refurbished laptop replacement after 10 days",
            tuple(CHUNKS),
            ("eu-refurb-v2-rule",),
            ("eu-refurb-v1-rule", "us-refurb-v4-rule"),
            False,
        ),
        EvalCase(
            "restricted-vip-source",
            "VIP merchant damaged refurbished laptop replacement",
            tuple(corpus_with_restricted),
            ("eu-refurb-v2-rule",),
            ("merchant-vip-refurb",),
            False,
        ),
        EvalCase(
            "superseded-window",
            "damaged refurbished laptops replacement window",
            tuple(CHUNKS),
            ("eu-refurb-v2-rule",),
            ("eu-refurb-v1-rule",),
            False,
        ),
        EvalCase(
            "missing-drone-policy",
            "drone propeller damage return rule",
            tuple(corpus_with_restricted),
            (),
            ("merchant-vip-refurb",),
            True,
        ),
    ]

    def run_case(case: EvalCase) -> tuple[bool, str]:
        hits = retrieve(case.question, LUNA, list(case.corpus), EVAL_DATE)
        evidence = pack_evidence(hits)
        result = answer_from_evidence(case.question, evidence)
        ids = [chunk.chunk_id for chunk in hits]
        # A case passes only if all three properties hold together.
        passed = (
            all(forbidden_id not in ids for forbidden_id in case.forbidden_chunk_ids)
            and result.abstained == case.should_abstain
            and tuple(ids) == case.expected_chunk_ids
        )
        return passed, f"{case.name}: ids={ids}, abstained={result.abstained}"

    results = [run_case(case) for case in CASES]
    for passed, summary in results:
        print("PASS" if passed else "BLOCK", summary)
    print("Candidate promoted:", all(passed for passed, _ in results))
    assert all(passed for passed, _ in results)

Output

    PASS supported-eu-laptop: ids=['eu-refurb-v2-rule'], abstained=False
    PASS restricted-vip-source: ids=['eu-refurb-v2-rule'], abstained=False
    PASS superseded-window: ids=['eu-refurb-v2-rule'], abstained=False
    PASS missing-drone-policy: ids=[], abstained=True
    Candidate promoted: True

The minimal suite already checks three high-impact failures: a forbidden chunk, a superseded chunk, and an unsupported answer. A serious deployment adds paraphrases, policy conflicts, index deletion cases, model-judge calibration, human reviews, and latency distributions.
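For latency distributions, a gate on a high percentile is less noisy than a gate on single samples. The sketch below uses the standard library's `statistics.quantiles`; `p95`, `stage_within_budget`, and the sample values are hypothetical, and the choice of the 95th percentile is an assumption rather than a recommendation from the lesson.

```python
import statistics


def p95(samples: list[float]) -> float:
    """95th percentile: the last of the 19 cut points that split data into 20 parts."""
    return statistics.quantiles(samples, n=20, method="inclusive")[-1]


def stage_within_budget(samples: list[float], budget_ms: float) -> bool:
    return p95(samples) <= budget_ms


TTFT_BUDGET_MS = 500

steady = [300.0] * 100                      # every request is fast
one_outlier = [300.0] * 99 + [900.0]        # a single slow request
regressed = [300.0] * 90 + [900.0] * 10     # one request in ten is slow

print("Steady passes:", stage_within_budget(steady, TTFT_BUDGET_MS))
print("Single outlier passes:", stage_within_budget(one_outlier, TTFT_BUDGET_MS))
print("Regression passes:", stage_within_budget(regressed, TTFT_BUDGET_MS))
```

A single slow request doesn't block the release, while a consistent slowdown in ten percent of requests does.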

What to block before launch

| Gate | Block when | First repair location |
| --- | --- | --- |
| Authorization | Any returned chunk lacks the caller's permission | Metadata and retrieval filter |
| Freshness | Answer cites a superseded version | Index lifecycle and effective-date filter |
| Evidence | Required source isn't in top candidates | Retriever, chunking, or metadata |
| Grounding | Answer asserts a policy not supported by context | Prompt, answer validator, or abstention |
| Latency | A critical stage exceeds budget consistently | Trace the stage before changing architecture |

Ship the policy-answerer-v1 artifact

You now have the bones of a production RAG service. Turn the fragments into a small portfolio artifact:

  1. Store three versions of an electronics-return policy with chunk_id, document_id, parent_id, effective dates, region, and ACL tags.
  2. Add at least four frozen questions: a supported EU request, a US-only request, a restricted merchant-policy attack, and a question whose answer is absent.
  3. Implement a retriever behind the retrieve() contract. Keep the simple overlap baseline first.
  4. Pack evidence with stable source IDs and return a cited answer or a documented abstention.
  5. Write one trace JSON row per request without logging restricted text.
  6. Produce a release report listing authorization, freshness, evidence, grounding, and latency gates.

release-report.py

    release_hits = retrieve(question, LUNA, corpus_with_restricted, EVAL_DATE)
    release_report = {
        "candidate": "policy-answerer-v1",
        "index_version": trace["index_version"],
        "evaluated_cases": len(CASES),
        "authorization_gate": restricted.chunk_id not in [
            chunk.chunk_id for chunk in release_hits
        ],
        "freshness_gate": "eu-refurb-v1-rule" not in [
            chunk.chunk_id for chunk in release_hits
        ],
        "latency_gate": exceeded_budgets(trace["timings_ms"]) == [],
        "case_gate": all(passed for passed, _ in results),
    }
    # Promote only when every *_gate entry is exactly True.
    promote = all(
        value is True
        for key, value in release_report.items()
        if key.endswith("_gate")
    )
    print("Candidate:", release_report["candidate"])
    print("Index:", release_report["index_version"])
    print("All hard gates pass:", promote)
    assert promote

Output

    Candidate: policy-answerer-v1
    Index: policy-index/2026-05-27
    All hard gates pass: True
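Step 5 of the artifact asks for one trace JSON row per request. A possible sketch is below, writing JSON Lines through an allowlist so an unexpected field (for example raw chunk text) fails loudly instead of being logged. `ALLOWED_FIELDS`, `to_trace_row`, and the in-memory `StringIO` sink are assumptions for illustration.

```python
import io
import json

# Hypothetical allowlist mirroring the trace fields used in this lesson.
ALLOWED_FIELDS = {
    "request_id", "actor_id", "region", "index_version",
    "retrieved_chunk_ids", "retrieved_versions", "source_map",
    "cited_source_ids", "abstained", "timings_ms",
}


def to_trace_row(trace: dict) -> str:
    """Serialize one trace as a single JSON line, rejecting unknown fields."""
    unexpected = set(trace) - ALLOWED_FIELDS
    if unexpected:
        raise ValueError(f"Refusing to log unexpected fields: {sorted(unexpected)}")
    return json.dumps(trace, sort_keys=True)


sink = io.StringIO()  # stand-in for an append-only log file
good_trace = {
    "request_id": "rag-0007",
    "actor_id": "luna-48291",
    "region": "EU",
    "retrieved_chunk_ids": ["eu-refurb-v2-rule"],
    "cited_source_ids": ["E1"],
    "abstained": False,
}
sink.write(to_trace_row(good_trace) + "\n")

try:
    to_trace_row({**good_trace, "chunk_text": "restricted policy wording"})
    rejected = False
except ValueError:
    rejected = True

rows = [json.loads(line) for line in sink.getvalue().splitlines()]
print("Rows written:", len(rows))
print("Unexpected field rejected:", rejected)
```

An allowlist only checks top-level field names. It doesn't inspect values, so it complements rather than replaces a test like the lab's raw-text check.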

Mastery check

You are ready to design a production RAG pipeline when you can:

  • Explain why a RAG answer must be treated as an evidence-backed operation, not merely generated prose.
  • Define a versioned chunk record with stable citation identity, effective dates, region, and ACL metadata.
  • Enforce authorization and policy freshness before any retrieved text reaches the model.
  • Pack small, cited context and require an abstention when permitted evidence can't support the answer.
  • Record a reproducible request trace without storing restricted text in unsafe logs.
  • Gate a candidate on authorization, freshness, grounding, abstention, and latency behavior.
  • Preserve that contract while a later retrieval implementation replaces the simple search baseline.

Evaluation rubric

| Level | Evidence in your submission |
| --- | --- |
| Foundational | Versioned chunks, current-policy filtering, and a supported cited answer |
| Applied | Authorization attack stays hidden and missing evidence triggers abstention |
| Strong | Frozen cases, request traces, and explicit stage budgets block bad releases |
| Production-ready | Retriever upgrades improve evidence recall without changing permission or grounding guarantees |

Common pitfalls

| Symptom | Likely cause | Repair |
| --- | --- | --- |
| Answer cites last year's resolution window | Index overwrote or failed to filter superseded policies | Keep versioned records and filter by effective date |
| Restricted merchant rule appears in prompt | Permission check happened after retrieval | Filter candidates inside the retrieval boundary |
| Correct policy isn't enough to explain a response | Citation IDs weren't carried into packed context and trace | Keep stable chunk and parent identifiers end to end |
| Bot promises an outcome absent from evidence | Generation had no enforced abstention path | Require supported claims or escalation |
| Quality debates can't be resolved | Tests record answers but not retrieved evidence | Freeze expected evidence IDs and save traces |

Next Step
Continue to Hybrid Search: Dense + Sparse

You now have the evidence, authorization, grounding, and release contract for a RAG service. Next you will upgrade its retrieval lane so exact identifiers and paraphrased policy questions both recover the right permitted evidence.

References

[1] Lewis, P., et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

[2] Liu, N. F., et al. Lost in the Middle: How Language Models Use Long Contexts. arXiv preprint, 2023; TACL, 2024.

[3] Es, S., et al. RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint, 2023.