
Advanced RAG: HyDE & Self-RAG

Learn how query rewriting, HyDE, Self-RAG, and Corrective RAG change retrieval control, and how to evaluate their cost and evidence quality.


Advanced retrieval-augmented generation (RAG) adds retrieval controls for queries that are ambiguous, underspecified, or mismatched with the document corpus. This chapter covers query rewriting, HyDE (Hypothetical Document Embeddings), Self-RAG (Self-reflective RAG), and correction loops, with attention to what each approach does and doesn't guarantee.

Vector database internals showed how an index finds nearby chunks quickly. This chapter starts one level above the index: how to turn a messy user request into the right retrieval work before the generator tries to answer.

Imagine a customer opens your order-tracking chat and types, "What about the other one?" They could mean the second item in their cart, the replacement shipment you sent yesterday, or the refund still pending. The retriever has too little context. Or imagine a warehouse clerk searching for "label" and getting barcode-label specs when they needed the return-label policy. This is the fundamental challenge of search: users rarely ask questions in the exact way that matches the documents they need.

In a basic RAG (Retrieval-Augmented Generation) system, the pipeline embeds the user's question, retrieves the most similar documents, and passes them to the Large Language Model (LLM) to answer. This "naive" approach works for simple fact lookups, but it fails in the real world where queries are ambiguous ("What about the other one?"), complex ("Compare X and Y"), or phrased differently than the source material.

For production search systems, adaptive retrieval logic is one option when measurements expose these failures. Instead of searching only the user's exact words, a route may reformulate queries, grade what it finds, or decide whether retrieval is needed at all.

The problem with naive RAG

A standard RAG implementation (often called "Naive RAG") typically embeds the user's raw query and fetches the top-k nearest neighbors from a vector database. This simple "retrieve-then-generate" pipeline suffers from systematic failure modes:

  1. The query is ambiguous: "What about the other one?" relies entirely on conversation history about a delayed order, which a stateless retriever lacks.
  2. The query and document are semantically misaligned: A user asks "How do I start a return?" but the documentation is titled "Return Merchandise Authorization (RMA) Workflow". The embedding of the conversational question may not match the embedding of the formal policy document.
  3. The retriever returns irrelevant documents: Embedding geometry, query wording, ANN settings, or corpus quality can all put warehouse packing instructions above the return-label policy. If irrelevant chunks enter context, the generator may produce an unsupported answer.
  4. The model doesn't know when to retrieve: Naive RAG retrieves for every query, even "Hi" or "What is 2+2?", which spends tokens and adds latency without improving the answer.

These failure modes are common in real applications. A useful production pipeline measures them separately, then adds the smallest retrieval intervention that repairs the observed failure rather than making every question pay for an elaborate loop.
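To make the baseline concrete, here is a minimal sketch of the retrieve step in a naive pipeline. It assumes a toy bag-of-words `embed` function in place of a real dense encoder, and the corpus ids and texts are invented for illustration. It reproduces failure modes 2 and 3 on a small scale: the conversational question shares surface words with the packing guide, so the formal RMA document ranks second.

```python
import math
from collections import Counter


def embed(text: str) -> Counter[str]:
    # Toy stand-in for a dense encoder: lowercase bag-of-words counts.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return Counter(cleaned.split())


def cosine(a: Counter[str], b: Counter[str]) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def naive_retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    # Embed the raw query, score every chunk, keep the top-k.
    # No rewriting, no grading, no decision about whether to retrieve.
    query_vector = embed(query)
    ranked = sorted(
        corpus,
        key=lambda doc_id: (-cosine(query_vector, embed(corpus[doc_id])), doc_id),
    )
    return ranked[:k]


# Hypothetical corpus for illustration only.
corpus = {
    "rma-workflow": "Return Merchandise Authorization (RMA) workflow for approved merchandise.",
    "packing-guide": "How warehouse staff start packing a box and print a barcode label.",
    "greeting-macro": "Hi, thanks for contacting support.",
}

top_ids = naive_retrieve("How do I start a return?", corpus)
print(top_ids)
```

A real encoder would narrow this gap, but the structure of the failure is the same: nothing in the pipeline checks whether the top-ranked chunk is the right one.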

Query rewriting and decomposition

Figure: Three query-transformation patterns for retrieval: standalone rewrite for a vague follow-up, multi-query expansion for delivery delays, and decomposition for a shipping comparison.
Three different failures need three different fixes. Rewrite resolves ambiguity, expansion broadens recall, and decomposition turns one hard search into smaller grounded hops.

User queries are rarely optimal for vector search. They're often short, lack context, or contain multiple distinct questions about orders, shipments, and refunds all at once.

1. Rewrite for disambiguation

Think of this like a support lead who turns a vague customer message into a precise search request before querying the policy base. Conversational queries often rely on implicit context. In rewrite-retrieve-read systems, the rewrite step turns the raw input into a retrieval-oriented query instead of treating the user's wording as sacred.[1] Ma et al. study that pattern with a dedicated rewrite stage in front of retrieval; in chat products, a pragmatic variant is to rewrite the latest turn into a standalone, search-optimized question before retrieval.[1]

Here's a concrete example. A customer sends two messages:

Customer: "My box arrived crushed."
Customer: "What do I do now?"

A stateless retriever searching for "What do I do now?" would pull generic help-desk articles. A rewrite step instead produces:

Standalone query: "How do I file a damage claim and request a replacement for a crushed package?"

In production, the rewrite model can be any instruction-tuned LLM behind a small interface. The copy-runnable example below uses a deterministic fake model so the parsing contract can be tested locally without API keys.

1-rewrite-for-de-ambiguation.py
```python
from typing import Protocol, TypedDict


class ChatMessage(TypedDict):
    role: str
    content: str


class RewriteModel(Protocol):
    def rewrite(self, latest_query: str, history_text: str) -> str: ...


class FakeRewriteModel:
    # Deterministic stand-in for an LLM so the contract can be tested offline.
    def rewrite(self, latest_query: str, history_text: str) -> str:
        if "crushed" in history_text.lower() and "what do i do" in latest_query.lower():
            return "How do I file a damage claim and request a replacement for a crushed package?"
        return latest_query


def rewrite_query_with_history(
    query: str, chat_history: list[ChatMessage], model: RewriteModel
) -> str:
    # Flatten the history into "role: content" lines for the rewrite model.
    history_text = "\n".join(
        f"{message['role']}: {message['content']}" for message in chat_history
    )
    return model.rewrite(query, history_text).strip()


history: list[ChatMessage] = [
    {"role": "customer", "content": "My box arrived crushed."},
]

rewritten = rewrite_query_with_history("What do I do now?", history, FakeRewriteModel())
print(rewritten)
```

Output

```text
How do I file a damage claim and request a replacement for a crushed package?
```
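To swap the fake for a real model, one option is a thin adapter that satisfies the same `rewrite(latest_query, history_text)` contract. The sketch below is an assumption about how that adapter might look: `PromptedRewriteModel`, `REWRITE_PROMPT`, and the `complete` callable are hypothetical names, and `complete` stands for whatever function sends a prompt to your LLM client and returns text.

```python
from typing import Callable

# Hypothetical prompt wording; tune it against your own rewrite evaluations.
REWRITE_PROMPT = (
    "Rewrite the customer's latest message as one standalone search query.\n"
    "Resolve pronouns and references using the conversation history.\n"
    "Return only the rewritten query.\n\n"
    "History:\n{history}\n\nLatest message: {query}\n\nStandalone query:"
)


class PromptedRewriteModel:
    """Adapts any text-completion callable to the rewrite contract."""

    def __init__(self, complete: Callable[[str], str]) -> None:
        self.complete = complete  # assumption: wraps your own LLM client call

    def rewrite(self, latest_query: str, history_text: str) -> str:
        prompt = REWRITE_PROMPT.format(history=history_text, query=latest_query)
        rewritten = self.complete(prompt).strip()
        # Fall back to the original wording if the model returns nothing usable.
        return rewritten or latest_query


def stub_complete(prompt: str) -> str:
    # Stand-in for a real model call so the adapter can be exercised offline.
    return "  How do I file a damage claim for a crushed package?  "


prompted_model = PromptedRewriteModel(stub_complete)
prompted_result = prompted_model.rewrite(
    "What do I do now?", "customer: My box arrived crushed."
)
print(prompted_result)
```

The fallback matters in production: an empty or whitespace-only completion should degrade to the user's original wording rather than send an empty query to the retriever.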

2. Multi-query expansion

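A single query might miss relevant documents due to vocabulary mismatch. Multi-query expansion generates paraphrases and related phrasings of the original query so that at least one variant matches the corpus vocabulary, which improves recall.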

For example, if a user asks "How do you handle peak-season delivery delays?", relevant internal documents might use terms like "carrier surge capacity", "warehouse backorder policies", or "expedited shipping alternatives". To improve recall across different vocabularies, we can automatically generate variations of the query.

Once you fan out into several queries, you have several ranked result lists to merge. A common pattern is RAG-Fusion, which pairs LLM-generated query variants with Reciprocal Rank Fusion (RRF) to combine the lists into one ranking.[2] RRF ignores raw similarity scores and sums reciprocal ranks instead, so a document that lands near the top of several lists rises even if no single list ranked it first.[3] Concretely, a document d gets score(d) = sum over result lists i of 1 / (k + rank_i(d)), where k is a smoothing constant (commonly 60, and unrelated to the top-k cutoff) and rank_i(d) is the 1-based position of d in list i. A document missing from a list contributes zero for that list. RRF is the same rank-merge trick used to fuse dense and sparse results in hybrid search, reused here to fuse query variants.

Here's how to implement multi-query expansion while keeping the boundary testable. The model returns one query per line, and the parser removes bullets or numbering so all variants can be searched in parallel.

2-multi-query-expansion.py
```python
import re
from typing import Protocol

# Matches one leading list marker: "-", "*", "1." or "1)".
LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


class QueryExpansionModel(Protocol):
    def expand(self, query: str, n: int) -> str: ...


class FakeExpansionModel:
    # Deterministic stand-in that returns one query variant per line.
    def expand(self, query: str, n: int) -> str:
        return "\n".join(
            [
                "Carrier surge capacity during holidays",
                "Warehouse backorder policies for high-volume periods",
                "Expedited shipping alternatives for delayed orders",
            ][:n]
        )


def clean_query_line(line: str) -> str:
    # Strip only a list marker. A plain lstrip over digits would also eat
    # queries that legitimately start with a number, such as "24-hour refunds".
    return LIST_MARKER.sub("", line).strip()


def generate_multi_queries(
    query: str, model: QueryExpansionModel, n: int = 3
) -> list[str]:
    content = model.expand(query, n)
    variants = [clean_query_line(line) for line in content.splitlines()]
    return [variant for variant in variants if variant]


queries = generate_multi_queries(
    "How do you handle peak-season delivery delays?", FakeExpansionModel(), n=3
)

for query in queries:
    print(f"- {query}")
```

Output

```text
- Carrier surge capacity during holidays
- Warehouse backorder policies for high-volume periods
- Expedited shipping alternatives for delayed orders
```

Generating variants isn't enough; the system must merge their result lists without assuming similarity scores from separate searches are directly comparable. RRF provides a deterministic rank-based merge:

fuse-query-variants-with-rrf.py
```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], rank_constant: int = 60
) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        # Ranks are 1-based; a document absent from a list adds nothing for that list.
        for rank, document_id in enumerate(ranking, start=1):
            scores[document_id] = scores.get(document_id, 0.0) + 1 / (rank_constant + rank)
    # Highest score first; ties are broken alphabetically by document id.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


ranked_lists = [
    ["carrier-capacity", "return-policy", "sla"],
    ["sla", "carrier-capacity", "backorder"],
    ["carrier-capacity", "expedited-options", "sla"],
]

for document_id, score in reciprocal_rank_fusion(ranked_lists)[:3]:
    print(f"{document_id}: {score:.4f}")
```

Output

```text
carrier-capacity: 0.0489
sla: 0.0481
expedited-options: 0.0161
```
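The printed scores can be checked by hand. The worked arithmetic below applies the RRF sum with k = 60 to the three example lists; the notation RRF(d) and rank_i(d) is introduced here for the derivation only.

```latex
\begin{aligned}
\mathrm{RRF}(d) &= \sum_{i} \frac{1}{k + \mathrm{rank}_i(d)}, \qquad k = 60 \\
\mathrm{RRF}(\text{carrier-capacity}) &= \tfrac{1}{61} + \tfrac{1}{62} + \tfrac{1}{61} \approx 0.0489 \\
\mathrm{RRF}(\text{sla}) &= \tfrac{1}{63} + \tfrac{1}{61} + \tfrac{1}{63} \approx 0.0481 \\
\mathrm{RRF}(\text{expedited-options}) &= \mathrm{RRF}(\text{return-policy}) = \tfrac{1}{62} \approx 0.0161
\end{aligned}
```

The last line is a tie: each of those documents appears once at rank 2. The example prints expedited-options third only because its sort key breaks ties alphabetically by document id.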

3. Query decomposition (least-to-most)

Complex questions often require multiple retrieval steps, as a single search may fail to gather all the necessary facts. Decomposition breaks a complex query into a series of simpler sub-queries that can be executed sequentially or in parallel.

For instance, consider the following analytical query:

"Compare standard and express shipping delivery times for fragile items."

A standard retriever might struggle to find a single document containing this exact comparison. Instead, we decompose it into sub-questions:

  1. "What is the standard shipping delivery time for fragile items?"
  2. "What is the express shipping delivery time for fragile items?"
  3. "What are the packaging requirements and liability policies for fragile item shipments?"
  4. "Compare standard and express shipping for fragile items given the above facts."

This decomposition pattern is closely related to least-to-most prompting, which decomposes hard problems into simpler steps.[4] In RAG, teams reuse that idea for retrieval coverage rather than chain-of-thought supervision. Answering simpler questions first can gather explicit evidence for each sub-fact before synthesis, but it also adds queries and possible error propagation. Measure supported-answer accuracy and latency before releasing it.

Here's a simple implementation. A real model would produce the sub-questions, but the rest of the system should only depend on the line-oriented contract.

3-query-decomposition-least-to-most.py
```python
import re
from typing import Protocol

# Matches one leading list marker: "-", "*", "1." or "1)".
LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


class DecompositionModel(Protocol):
    def decompose(self, query: str) -> str: ...


class FakeDecompositionModel:
    # Deterministic stand-in that returns one sub-question per line.
    def decompose(self, query: str) -> str:
        return "\n".join(
            [
                "What is the standard shipping delivery time for fragile items?",
                "What is the express shipping delivery time for fragile items?",
                "What packaging and liability rules apply to fragile item shipments?",
                "How do standard and express shipping compare given those facts?",
            ]
        )


def decompose_query(query: str, model: DecompositionModel) -> list[str]:
    # Strip list markers only, so sub-questions that start with a number survive.
    lines = [
        LIST_MARKER.sub("", line).strip()
        for line in model.decompose(query).splitlines()
    ]
    return [line for line in lines if line]


sub_questions = decompose_query(
    "Compare standard and express shipping delivery times for fragile items.",
    FakeDecompositionModel(),
)

for index, question in enumerate(sub_questions, start=1):
    print(f"{index}. {question}")
```

Output

```text
1. What is the standard shipping delivery time for fragile items?
2. What is the express shipping delivery time for fragile items?
3. What packaging and liability rules apply to fragile item shipments?
4. How do standard and express shipping compare given those facts?
```
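Producing sub-questions is only half of the pattern; they still have to be answered in order so that the final synthesis step can see the earlier results. The sketch below shows one way to run that loop. The `retrieve` and `answer` callables, the `FACTS` table, and the delivery times in it are all made up for illustration, not taken from any real policy.

```python
from typing import Callable


def answer_least_to_most(
    sub_questions: list[str],
    retrieve: Callable[[str], list[str]],
    answer: Callable[[str, list[str], list[str]], str],
) -> list[tuple[str, str]]:
    # Resolve sub-questions in order; each later step sees earlier answers as notes.
    solved: list[tuple[str, str]] = []
    for question in sub_questions:
        evidence = retrieve(question)
        notes = [f"{earlier} -> {result}" for earlier, result in solved]
        solved.append((question, answer(question, evidence, notes)))
    return solved


# Invented facts standing in for a policy corpus.
FACTS = {
    "standard": "Standard shipping for fragile items takes 5 to 7 business days.",
    "express": "Express shipping for fragile items takes 1 to 2 business days.",
}


def fake_retrieve(question: str) -> list[str]:
    return [text for key, text in FACTS.items() if key in question.lower()]


def fake_answer(question: str, evidence: list[str], notes: list[str]) -> str:
    # A real generator would write prose; the stub reports what it was given.
    return f"{len(evidence)} evidence chunk(s), {len(notes)} earlier answer(s)"


steps = answer_least_to_most(
    [
        "What is the standard shipping delivery time for fragile items?",
        "What is the express shipping delivery time for fragile items?",
        "How do standard and express shipping compare given those facts?",
    ],
    fake_retrieve,
    fake_answer,
)

for question, result in steps:
    print(f"{question} => {result}")
```

Because each step feeds the next, a wrong early answer propagates. Logging the per-step evidence, as the stub does, makes that propagation visible when you measure supported-answer accuracy.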

HyDE (Hypothetical Document Embeddings)

Standard dense retrieval matches a query embedding to document embeddings. However, queries are often short and interrogative, while indexed passages are longer and declarative. This semantic asymmetry hurts retrieval quality.

HyDE (Hypothetical Document Embeddings) bridges this gap by generating a hypothetical document first, then using that generated text for retrieval.[5] In Gao et al., the model is prompted to "write a document that answers the question." Think of it like searching for a specific return-policy clause when you can't remember the exact wording. Instead of asking the knowledge base "What do we do about broken things?", you write a one-paragraph summary of what you remember the policy says and ask, "Where are documents that look like this paragraph?"

Here's a concrete e-commerce example. A customer asks:

Query: "Can I get a refund if my perishable item spoils after 48 hours?"

A standard dense retriever might embed the short question and pull generic "refund policy" articles that don't mention perishables. HyDE instead prompts the model to write a hypothetical policy paragraph:

Hypothetical document: "Perishable goods refunds: Items classified as perishable (fresh food, flowers, temperature-sensitive medicine) must be reported within 24 hours of delivery. After 24 hours, spoilage claims require photo evidence and a carrier delay affidavit..."

That generated paragraph is longer, declarative, and uses vocabulary like "perishable goods", "carrier delay affidavit", and "temperature-sensitive". It is an illustrative search proxy, not an answer: the generated deadline may be wrong, and only retrieved source text may support the final response.

How HyDE works

The original HyDE pipeline has three phases:

  1. Generate: Prompt an instruction-tuned LM to write a hypothetical passage for the query. The passage may contain fabricated details, but it often captures relevance patterns that look like real documents.[5]
  2. Embed: Encode that generated passage with a document encoder such as Contriever.
  3. Retrieve: Search the corpus with the hypothetical-document embedding. The paper's key intuition is that the encoder's dense bottleneck filters much of the fabricated detail while preserving the semantic neighborhood of relevant documents.[5]

The production version uses a real generator and dense encoder. This small runnable version uses keyword sets so the retrieval contract is visible: the query first becomes a document-like paragraph, then the retriever searches with that paragraph rather than the original question.

how-hyde-works.py
```python
from dataclasses import dataclass
from typing import Protocol


class HypotheticalDocGenerator(Protocol):
    def generate(self, query: str) -> str: ...


@dataclass(frozen=True)
class Chunk:
    id: str
    text: str


class FakeHyDEGenerator:
    # Deterministic stand-in for the LLM that writes the hypothetical passage.
    def generate(self, query: str) -> str:
        return (
            "Perishable goods refunds require a spoilage claim, photo evidence, "
            "delivery timestamp review, and carrier delay notes."
        )


class KeywordRetriever:
    def __init__(self, chunks: list[Chunk]) -> None:
        self.chunks = chunks

    def search(self, search_text: str, k: int = 2) -> list[Chunk]:
        query_terms = set(
            search_text.lower().replace(",", " ").replace(".", " ").split()
        )

        def score(chunk: Chunk) -> int:
            # Keyword overlap stands in for embedding similarity.
            chunk_terms = set(
                chunk.text.lower().replace(",", " ").replace(".", " ").split()
            )
            return len(query_terms & chunk_terms)

        return sorted(self.chunks, key=score, reverse=True)[:k]


def hyde_retrieve(
    query: str, generator: HypotheticalDocGenerator, retriever: KeywordRetriever
) -> list[Chunk]:
    # Search with the generated passage, not with the original question.
    hypothetical_doc = generator.generate(query)
    return retriever.search(hypothetical_doc, k=2)


chunks = [
    Chunk("generic-refunds", "Refund requests require an order number and customer email."),
    Chunk(
        "perishable-refunds",
        "Perishable goods spoilage claims need photo evidence and delivery timestamp review.",
    ),
    Chunk("shipping-labels", "Return labels can be regenerated from the order portal."),
]

matches = hyde_retrieve(
    "Can I get a refund if my food spoils after delivery?",
    FakeHyDEGenerator(),
    KeywordRetriever(chunks),
)

match_ids = [chunk.id for chunk in matches]
print(f"retrieved: {match_ids}")
```

Output

```text
retrieved: ['perishable-refunds', 'generic-refunds']
```

The HyDE flow is simple but important: change the search object first, then retrieve.

Figure: HyDE retrieval flow showing a short question turned into a hypothetical policy paragraph, embedded as document-style text, then used to retrieve real evidence before answer generation.
HyDE changes what you search with. The system embeds a generated document-style proxy, retrieves real chunks near that proxy, and answers only from the retrieved evidence.

In the zero-shot experiments reported for HyDE, converting query text into hypothetical document text improved retrieval over the authors' Contriever baseline.[5] The mechanism can still fail on a new corpus, especially when the proxy invents a high-impact identifier or policy detail.
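One detail that helps with a single bad proxy is sampling several hypothetical documents and averaging their embeddings, together with the query embedding, into one search vector, which is how Gao et al. describe the method.[5] The sketch below illustrates that averaging step under loud assumptions: `toy_embed` is a normalized bag-of-words stand-in for a real dense encoder, and the sample passages and chunk texts are invented.

```python
from collections import Counter


def toy_embed(text: str) -> dict[str, float]:
    # Toy stand-in for a dense encoder: L1-normalized bag-of-words weights.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    counts = Counter(cleaned.split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()} if total else {}


def average_vectors(vectors: list[dict[str, float]]) -> dict[str, float]:
    merged: dict[str, float] = {}
    for vector in vectors:
        for term, weight in vector.items():
            merged[term] = merged.get(term, 0.0) + weight / len(vectors)
    return merged


def hyde_query_vector(query: str, hypothetical_docs: list[str]) -> dict[str, float]:
    # Average the query embedding with every sampled hypothetical-document embedding.
    return average_vectors([toy_embed(query)] + [toy_embed(doc) for doc in hypothetical_docs])


def dot(a: dict[str, float], b: dict[str, float]) -> float:
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def best_match(vector: dict[str, float], corpus: dict[str, str]) -> str:
    return max(corpus, key=lambda chunk_id: dot(vector, toy_embed(corpus[chunk_id])))


corpus = {
    "generic-refunds": "Refund requests require an order number and customer email.",
    "perishable-refunds": "Perishable goods spoilage claims need photo evidence and delivery timestamp review.",
}
query = "Can I get a refund if my food spoils?"
samples = [  # invented samples standing in for LLM generations
    "Perishable goods refunds require a spoilage claim with photo evidence.",
    "Spoilage claims for perishable food need a delivery timestamp review.",
]

plain_winner = best_match(toy_embed(query), corpus)
hyde_winner = best_match(hyde_query_vector(query, samples), corpus)
print(f"query only: {plain_winner}")
print(f"hyde average: {hyde_winner}")
```

In this toy setup the raw question matches only the word "refund", so it lands on the generic article, while the averaged vector shares policy vocabulary with the perishable chunk. Whether the same shift happens on a real corpus is an empirical question for your own retrieval evaluation.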

When to use HyDE

HyDE isn't a silver bullet, and its effectiveness depends heavily on the query type. It's particularly useful in the zero-shot setting studied in the paper, especially when there's a strong mismatch between how users ask a question and how documents phrase the answer.[5] That's why it tends to help with abstract or conceptual queries and multilingual retrieval.

In practice, many teams gate HyDE off for exact-match lookups such as IDs, dates, prices, or error codes. That part is an engineering inference from the mechanism, not a claim from the paper: if the pseudo-document invents a precise fact, retrieval can drift toward text that echoes the invention instead of the source chunk.

HyDE can also "hallucinate" a direction so far off-base that it retrieves irrelevant documents. Monitor retrieval quality when deploying HyDE in production, and route exact lookups around it with a cheap check before the generator runs.

route-exact-lookups-around-hyde.py
```python
import re

# Order numbers, or "tracking" followed by a code that contains at least one digit.
# The digit lookahead keeps phrases such as "tracking shipment" off the exact route.
EXACT_LOOKUP = re.compile(
    r"\b(?:order\s*#?\d+|tracking\s+(?=[A-Z0-9-]*\d)[A-Z0-9-]{8,})\b",
    re.IGNORECASE,
)


def retrieval_route(query: str) -> str:
    if EXACT_LOOKUP.search(query):
        return "hybrid_exact_preserving"
    return "hyde_candidate"


queries = [
    "What is the status of order #48291?",
    "How does customs clearance work for fragile imports?",
]

for query in queries:
    print(f"{retrieval_route(query)}: {query}")
```

Output

```text
hybrid_exact_preserving: What is the status of order #48291?
hyde_candidate: How does customs clearance work for fragile imports?
```

Self-RAG (Self-reflective RAG)

Many retrieve-then-generate pipelines fetch context once without a model-generated critique step. Self-RAG instead fine-tunes a generator to emit special reflection tokens that control retrieval and score candidate generation segments.[6]

Reflection tokens

Self-RAG uses one retrieval token family and three critique token families:[6]

  1. Retrieve with values Yes, No, or Continue. This decides whether the model should fetch evidence before generating the next segment.
  2. IsRel with labels such as Relevant or Irrelevant. This scores whether a retrieved passage is helpful for the current query or segment.
  3. IsSup with labels Fully supported, Partially supported, and No support. This checks whether the generated claim is grounded in retrieved evidence.
  4. IsUse with utility scores from 1 to 5. This measures how useful the final response is for the user.

Paper examples serialize these as inline control tags such as [Retrieve=Yes], [Relevant], and [Fully supported]. The exact bracket syntax is less important than the four decision families.

To see these tokens in action, imagine the same perishable-refund query. A Self-RAG model might generate the following token stream:

  1. [Retrieve=Yes], the model decides it needs evidence before answering.
  2. It retrieves a passage: "Perishable items must be reported within 24 hours of delivery..."
  3. [Relevant], the passage is useful for the current segment.
  4. The model generates: "For perishable items, the standard refund window is 24 hours from delivery."
  5. [Fully supported], the claim is grounded in the retrieved text.
  6. [IsUse=4], the response is helpful but could be more detailed.

Without reflection tokens, a standard RAG pipeline might have retrieved the same passage without explicitly scoring passage relevance or claim support. Reflection tokens expose the model's predicted judgments for scoring and control; they aren't proof that a claim is true.
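If a service logs these token streams, it helps to parse them into structured fields for scoring and monitoring. The parser below is a minimal sketch that assumes the bracket serialization used in this section ([Retrieve=Yes], [Relevant], [Fully supported], [IsUse=4]). The tokens emitted by a released checkpoint may be spelled differently, so treat the tag names as assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional

TAG = re.compile(r"\[([^\[\]]+)\]")
RELEVANCE_LABELS = {"relevant", "irrelevant"}
SUPPORT_LABELS = {"fully supported", "partially supported", "no support"}


@dataclass
class ReflectionTrace:
    retrieve: Optional[str] = None
    relevance: Optional[str] = None
    support: Optional[str] = None
    utility: Optional[int] = None
    text: str = ""


def parse_reflection_stream(stream: str) -> ReflectionTrace:
    trace = ReflectionTrace()
    for tag in TAG.findall(stream):
        label = tag.strip().lower()
        if label.startswith("retrieve="):
            trace.retrieve = label.split("=", 1)[1]
        elif label.startswith("isuse="):
            value = label.split("=", 1)[1]
            if value.isdigit():
                trace.utility = int(value)
        elif label in RELEVANCE_LABELS:
            trace.relevance = label
        elif label in SUPPORT_LABELS:
            trace.support = label
    # Whatever remains after removing the tags is the generated segment text.
    trace.text = " ".join(TAG.sub(" ", stream).split())
    return trace


stream = (
    "[Retrieve=Yes] [Relevant] For perishable items, the standard refund window "
    "is 24 hours from delivery. [Fully supported] [IsUse=4]"
)
trace = parse_reflection_stream(stream)
print(trace)
```

Unknown tags are ignored rather than rejected, which keeps the parser tolerant of tokenizer changes but also means a misspelled tag silently leaves its field as None. A stricter service might count unrecognized tags as a monitoring signal.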

Architecture

Self-RAG is more than "retrieve once, then critique at the end." At inference time it can emit a retrieval decision, retrieve top-k passages on demand, generate candidate segments conditioned on different passages in parallel, and score those branches with reflection-token probabilities.[6] That segment-level beam search is what makes Self-RAG distinct from a simple prompted guardrail loop.

Figure: Self-RAG inference loop where a query plus partial answer emits a Retrieve decision, optionally fetches passages, predicts relevance, support, and utility, then chooses the highest-scored continuation.
Self-RAG makes retrieval decisions part of generation. Each segment can retrieve evidence, predict critique tokens for candidate branches, and continue with the highest-scored path.

Implementation note

Deploying a true Self-RAG system requires a generator specifically fine-tuned to emit these reflection tokens during generation. The paper releases 7B and 13B checkpoints trained that way.[6] During training, the authors use a separate critic model to insert reflection tokens offline, then train the final generator to emit those tokens itself at inference time.[6]

A prompted frontier model can imitate parts of this control loop, but that isn't the same system. Without reflection-token fine-tuning, you're building a Self-RAG-inspired agentic pipeline: separate routing, retrieval grading, and answer validation calls stitched together in application code.
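The difference is easier to see in code. The sketch below wires the three prompted steps (routing, retrieval grading, answer validation) as plain callables. Every function name and stub rule here is hypothetical; in a real system each callable would wrap a separate model or classifier call, which is exactly the extra latency this design pays.

```python
from typing import Callable, Optional


def self_rag_inspired_answer(
    query: str,
    needs_retrieval: Callable[[str], bool],
    retrieve: Callable[[str], list[str]],
    is_relevant: Callable[[str, str], bool],
    generate: Callable[[str, list[str]], str],
    is_supported: Callable[[str, list[str]], bool],
) -> dict[str, object]:
    # Step 1: a routing call decides whether to retrieve at all.
    if not needs_retrieval(query):
        return {"route": "no_retrieval", "answer": generate(query, []), "evidence": []}
    # Step 2: a grading call filters the retrieved passages.
    evidence = [passage for passage in retrieve(query) if is_relevant(query, passage)]
    answer: Optional[str] = None
    if not evidence:
        return {"route": "abstain", "answer": answer, "evidence": []}
    # Step 3: a validation call checks the draft against the kept evidence.
    draft = generate(query, evidence)
    if not is_supported(draft, evidence):
        return {"route": "abstain", "answer": answer, "evidence": evidence}
    return {"route": "grounded", "answer": draft, "evidence": evidence}


# Hypothetical stubs standing in for model calls.
def stub_needs_retrieval(query: str) -> bool:
    return query.strip().lower() not in {"hi", "hello"}


def stub_retrieve(query: str) -> list[str]:
    return [
        "Perishable items must be reported within 24 hours of delivery.",
        "Return labels can be regenerated from the order portal.",
    ]


def stub_is_relevant(query: str, passage: str) -> bool:
    return "perishable" in passage.lower()


def stub_generate(query: str, evidence: list[str]) -> str:
    if not evidence:
        return "Hello! How can I help?"
    return "Report perishable items within 24 hours of delivery."


def stub_is_supported(draft: str, evidence: list[str]) -> bool:
    return "24 hours" in draft and any("24 hours" in passage for passage in evidence)


stubs = (stub_needs_retrieval, stub_retrieve, stub_is_relevant, stub_generate, stub_is_supported)
grounded = self_rag_inspired_answer("Can I return spoiled perishable food?", *stubs)
greeting = self_rag_inspired_answer("Hi", *stubs)
print(grounded["route"], "|", greeting["route"])
```

Unlike true Self-RAG, nothing here branches over passages or scores candidates with token probabilities. Each gate is a hard yes or no made outside the generator.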

Self-RAG cost comes from on-demand retrieval plus branching over multiple passages and scoring those branches, not just from inserting a few extra tokens. Evaluate it where support-aware generation is valuable enough to justify that additional serving path.

The scoring sketch below doesn't implement Self-RAG training. It shows how an inference service can rank branches once a trained model has supplied relevance, support, and utility probabilities; the weights are an explicit product policy.

score-self-rag-branches.py
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSegment:
    text: str
    relevance_probability: float
    support_probability: float
    utility_probability: float


def branch_score(candidate: CandidateSegment) -> float:
    # Weights are a product policy choice: support counts three times as much
    # as relevance or utility.
    return (
        0.2 * candidate.relevance_probability
        + 0.6 * candidate.support_probability
        + 0.2 * candidate.utility_probability
    )


candidates = [
    CandidateSegment("refund window is 24 hours", 0.92, 0.96, 0.81),
    CandidateSegment("refund window is 48 hours", 0.95, 0.28, 0.88),
]
winner = max(candidates, key=branch_score)

print(f"chosen segment: {winner.text}")
print(f"score: {branch_score(winner):.3f}")
```

Output

```text
chosen segment: refund window is 24 hours
score: 0.922
```

CRAG (Corrective RAG)

Corrective Retrieval Augmented Generation (CRAG) focuses on correction after imperfect retrieval. It adds a lightweight Retrieval Evaluator that scores retrieved question-document pairs and routes the request before generation.[7]

The evaluator

In the paper, the evaluator is a lightweight T5-large model fine-tuned to score each retrieved question-document pair. Thresholds on those scores select one of three actions:[7]

  1. Correct: At least one retrieved document clears the upper threshold. Action: refine internal knowledge and answer from it.
  2. Incorrect: All retrieved documents fall below the lower threshold. Action: discard them and fall back to web search.
  3. Ambiguous: The scores land between those two cases. Action: combine refined internal evidence with web results.

Imagine a customer asks about "international customs duties for textile orders." The internal knowledge base has weak coverage. The evaluator might score the retrieved internal documents as Incorrect, triggering a web-search fallback. A production system would still need source allowlists and citation checks before trusting those web results. If internal documents are somewhat relevant but incomplete, the evaluator returns Ambiguous, and CRAG combines refined internal strips with web results.

The crucial distinction from Self-RAG is where correction happens. CRAG doesn't train the generator to emit reflection tokens. Instead, it inserts a separate evaluator between retrieval and generation and uses that evaluator to trigger correction paths.

route-crag-evaluator-scores.py
```python
from typing import Literal

Decision = Literal["correct", "incorrect", "ambiguous"]


def crag_action(scores: list[float], lower: float = 0.2, upper: float = 0.7) -> Decision:
    # No retrieved documents means there is no internal evidence to trust.
    if not scores:
        return "incorrect"
    best_score = max(scores)
    # Correct: at least one document clears the upper threshold.
    if best_score > upper:
        return "correct"
    # Incorrect: every document falls below the lower threshold.
    if best_score < lower:
        return "incorrect"
    return "ambiguous"


print(f"strong retrieval: {crag_action([0.81, 0.15])}")
print(f"weak retrieval: {crag_action([0.10, 0.17])}")
print(f"uncertain retrieval: {crag_action([0.45, 0.09])}")
```

Output

```text
strong retrieval: correct
weak retrieval: incorrect
uncertain retrieval: ambiguous
```

Corrective RAG evaluator flow where internal retrieval is graded as correct, ambiguous, or incorrect, then routed to refinement, combine-with-web, or web fallback before generation.
CRAG adds a routing gate after retrieval. Evaluator scores select internal refinement, mixed internal and web context, or web fallback before generation.

Knowledge refinement

Even relevant documents contain noise. CRAG includes a decompose-then-recompose step:

  1. Break the document into fine-grained strips (sentences or small chunks).
  2. Score each strip for relevance.
  3. Concatenate only the relevant strips.
  4. Pass this "refined knowledge" to the generator.

Here's a simplified CRAG sketch in Python. The actual paper scores retrieved documents individually and then applies thresholds. For readability, the code below collapses that logic into a single classify helper. The refine_knowledge method then decomposes documents into smaller strips, scores those strips, and recomposes only the useful evidence before generation.

If you instantiate this class with real components and call pipeline.run("international customs duties for textile orders"), the evaluator might return "incorrect" because internal docs lack coverage. The pipeline would then call web_search.search(...) and pass the web results through refine_knowledge before generating the answer.

knowledge-refinement.py

```python
from typing import Literal, Protocol

Decision = Literal["correct", "incorrect", "ambiguous"]

class SearchBackend(Protocol):
    def search(self, query: str, k: int = 5) -> list[str]: ...

class RetrievalEvaluator(Protocol):
    def classify(self, query: str, docs: list[str]) -> Decision: ...

    def is_relevant_strip(self, query: str, strip: str) -> bool: ...

class CorrectiveRAG:
    def __init__(
        self,
        vector_db: SearchBackend,
        evaluator_model: RetrievalEvaluator,
        web_search_tool: SearchBackend,
    ) -> None:
        self.vector_db = vector_db
        self.evaluator = evaluator_model
        self.web_search = web_search_tool

    def run(self, query: str) -> str:
        # Initial retrieval can be wrong because the private corpus is incomplete.
        retrieved_docs = self.vector_db.search(query, k=5)

        # The evaluator decides whether internal evidence is usable.
        decision = self.evaluator.classify(query, retrieved_docs)

        if decision == "correct":
            final_context = self.refine_knowledge(query, retrieved_docs)

        elif decision == "incorrect":
            web_results = self.web_search.search(query)
            final_context = self.refine_knowledge(query, web_results)

        else:  # ambiguous
            internal_context = self.refine_knowledge(query, retrieved_docs)
            web_context = self.refine_knowledge(query, self.web_search.search(query))
            final_context = internal_context + web_context

        return self.generate(query, final_context)

    def refine_knowledge(self, query: str, docs: list[str]) -> list[str]:
        refined_strips: list[str] = []
        for doc in docs:
            strips = self.chunk_into_strips(doc)
            for strip in strips:
                if self.evaluator.is_relevant_strip(query, strip):
                    refined_strips.append(strip)
        return refined_strips

    def chunk_into_strips(self, doc: str) -> list[str]:
        # Teaching version: sentence segmentation by period.
        return [segment.strip() for segment in doc.split('.') if segment.strip()]

    def generate(self, query: str, context: list[str]) -> str:
        if not context:
            return "No reliable evidence found."
        return f"Answer to '{query}' using: " + " ".join(context)

class FakeVectorDB:
    def search(self, query: str, k: int = 5) -> list[str]:
        return [
            "Warehouse picking guide. Use label printers in aisle 4.",
            "Domestic returns policy. Return labels expire after 30 days.",
        ][:k]

class FakeWebSearch:
    def search(self, query: str, k: int = 5) -> list[str]:
        return [
            "Official customs duty schedule for textile orders.",
            "Textile shipments require HS code review before import.",
        ][:k]

class FakeEvaluator:
    def classify(self, query: str, docs: list[str]) -> Decision:
        joined_docs = " ".join(docs).lower()
        if "customs" in joined_docs or "textile" in joined_docs:
            return "correct"
        return "incorrect"

    def is_relevant_strip(self, query: str, strip: str) -> bool:
        keywords = {"customs", "duty", "duties", "textile", "import"}
        strip_words = set(strip.lower().replace(",", " ").split())
        return bool(keywords & strip_words)

pipeline = CorrectiveRAG(FakeVectorDB(), FakeEvaluator(), FakeWebSearch())
answer = pipeline.run("international customs duties for textile orders")
print(answer)
```

Output

```
Answer to 'international customs duties for textile orders' using: Official customs duty schedule for textile orders Textile shipments require HS code review before import
```

Agentic and iterative retrieval

Self-RAG and CRAG both introduce feedback, but at different boundaries: Self-RAG can make retrieval and critique decisions while generating segments, while CRAG routes after initial retrieval. An application can generalize feedback into agentic retrieval: instead of one fixed retrieve-then-generate pass, a model with tool access searches, reads results, tests whether evidence is sufficient, and either answers or searches again with a refined query.

The mechanism is older than the agent framing. Interleaving retrieval with reasoning across multiple hops is what IRCoT formalized: retrieve, reason a step, use that step to drive the next retrieval, and repeat until the chain of evidence is complete.[8] Multi-hop questions like "Which carrier handles our heaviest returns lane, and what is its peak-season SLA?" need exactly this, because the answer to the second part depends on resolving the first.
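The interleaving loop can be sketched in a few lines. This is a minimal illustration of the retrieve, reason, retrieve-again pattern described above, not the IRCoT authors' implementation. The `retrieve` and `reason_step` callables, the `ANSWER:` stop marker, the hop limit, and the toy corpus (including the carrier name "NorthFreight") are all assumptions made up for this example.

```python
from typing import Callable

# Hypothetical interfaces for this sketch: a retriever maps a query to passages,
# and a reasoner maps (question, evidence so far) to its next thought.
Retriever = Callable[[str], list[str]]
Reasoner = Callable[[str, list[str]], str]

STOP_MARKER = "ANSWER:"  # assumed convention for "the evidence chain is complete"

def interleaved_retrieval(
    question: str,
    retrieve: Retriever,
    reason_step: Reasoner,
    max_hops: int = 3,
) -> tuple[str, list[str]]:
    evidence: list[str] = []
    query = question  # the first hop searches with the raw question
    for _ in range(max_hops):
        for passage in retrieve(query):
            if passage not in evidence:
                evidence.append(passage)
        thought = reason_step(question, evidence)
        if thought.startswith(STOP_MARKER):
            return thought[len(STOP_MARKER):].strip(), evidence
        query = thought  # the latest reasoning step drives the next retrieval
    return "insufficient evidence", evidence

# Toy stand-ins so the loop runs without a model or an index.
CORPUS = {
    "heaviest returns lane": "The heaviest returns lane is handled by NorthFreight.",
    "northfreight": "NorthFreight peak-season SLA is 5 business days.",
}

def toy_retrieve(query: str) -> list[str]:
    lowered = query.lower()
    return [text for key, text in CORPUS.items() if key in lowered]

def toy_reason(question: str, evidence: list[str]) -> str:
    joined = " ".join(evidence)
    if "SLA is" in joined:
        return "ANSWER: NorthFreight, 5 business days"
    if "NorthFreight" in joined:
        return "Next: find the NorthFreight peak-season SLA"
    return question

answer, evidence = interleaved_retrieval(
    "Which carrier handles our heaviest returns lane, and what is its peak-season SLA?",
    toy_retrieve,
    toy_reason,
)
print(answer)
print(len(evidence), "passages gathered")
```

The first hop resolves the carrier, and only then can the second hop search for that carrier's SLA. The hop limit matters in practice: without it, a reasoner that never emits the stop marker would loop indefinitely.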

The engineering question is when to escalate up this ladder, since each rung costs latency and tokens:

  1. Naive RAG for single-fact lookups answerable by one retrieval.
  2. Query rewriting plus hybrid search and reranking for conversational ambiguity and vocabulary mismatch. This is a practical baseline to evaluate before more expensive routes.
  3. HyDE behind a router for abstract or vocabulary-mismatched queries where a document-style proxy helps.
  4. A correction gate (CRAG-style evaluator or prompted critique) when retrieval quality is inconsistent and a single bad context causes user-visible errors.
  5. Iterative or agentic retrieval for genuinely multi-hop questions, where one retrieval pass can't gather all the facts.

Escalate only when the failure mode and evaluation demand it. On a simple FAQ lookup, an agentic loop can add latency and failure paths without improving retrieved evidence.

Comparison of advanced techniques

Choose these techniques by failure mode, not by novelty. HyDE targets semantic mismatch. Self-RAG changes retrieval timing and branch scoring inside a trained generator. CRAG adds a separate correction gate after retrieval.

Comparison of advanced RAG architectures showing HyDE as search-object change, Self-RAG as generation-time retrieval control, and CRAG as post-retrieval correction.
Compare the three architectures side by side. HyDE changes the search object, Self-RAG changes generation-time control, and CRAG routes weak retrieval before answer synthesis.

| Feature | Naive RAG | HyDE | Self-RAG | CRAG |
| --- | --- | --- | --- | --- |
| Retrieval Trigger | Always | Always | Dynamic ([Retrieve]) | Always |
| Query Representation | Raw Query | Hypothetical Doc | Raw Query + Partial Generation | Raw Query |
| Retrieval Quality Check | None | None; changes query representation | Predicted relevance, support, and utility tokens | Retrieval evaluator confidence |
| External Search | No | No | No | Yes (on ambiguous/incorrect retrieval) |
| Primary Use Case | Simple Q&A | Abstract or vocabulary-mismatched queries | High-factuality generation with a specialized model | When internal retrieval quality is inconsistent |
| Latency | Low | Medium | High | Medium-High |

Production failure modes

Advanced RAG helps only when it matches the failure. Watch for these symptoms before adding another model call.

Symptom: the query is vague, but the answer needs history

Cause: The retriever sees only the latest turn, so it searches for "the other one" instead of the delayed replacement order or pending refund. Fix: Rewrite the latest turn into a standalone query using the last few conversation turns. Keep the raw user message for the generator, but search with the rewritten query.
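One way to keep the two messages separate is to carry them in a small record. The sketch below is illustrative: `RewrittenTurn`, `build_rewrite_prompt`, and `rewrite_turn` are hypothetical names, the prompt wording is an assumption, and `complete` stands in for whatever LLM client you use (here a fake that returns a fixed string).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RewrittenTurn:
    raw_message: str   # passed to the generator unchanged
    search_query: str  # used only for retrieval

def build_rewrite_prompt(history: list[str], message: str, max_turns: int = 4) -> str:
    # Only the most recent turns are included to keep the prompt short.
    transcript = "\n".join(history[-max_turns:])
    return (
        "Rewrite the final user message as a standalone search query.\n"
        "Resolve pronouns and references using the conversation.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Final user message: {message}\n"
        "Standalone query:"
    )

def rewrite_turn(
    history: list[str],
    message: str,
    complete: Callable[[str], str],  # hypothetical LLM call: prompt in, text out
) -> RewrittenTurn:
    rewritten = complete(build_rewrite_prompt(history, message)).strip()
    # If the model returns nothing usable, search with the raw message instead.
    return RewrittenTurn(raw_message=message, search_query=rewritten or message)

history = [
    "user: My replacement order is delayed.",
    "assistant: I can check that. You also have a pending refund.",
]
fake_complete = lambda prompt: "status of delayed replacement order"
turn = rewrite_turn(history, "What about the other one?", fake_complete)
print(turn.search_query)
print(turn.raw_message)
```

Falling back to the raw message on an empty rewrite keeps the retriever working even when the rewriting call fails, at the cost of the original ambiguity.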

Symptom: HyDE improves recall on policy questions but breaks exact lookups

Cause: The hypothetical document can invent precise identifiers, dates, or prices. That invented detail may pull retrieval toward the wrong neighborhood. Fix: Route exact-match queries to normal hybrid retrieval. Use HyDE for conceptual or vocabulary-mismatched questions where a document-style proxy helps more than it hurts.

Symptom: Self-RAG-inspired code works in prompts but isn't true Self-RAG

Cause: True Self-RAG trains a generator to emit retrieval and critique tokens during generation. Separate prompted grading calls can mimic the control loop, but they don't create the same reflection-token model. Fix: Name the system honestly. Call it a prompted critique loop or CRAG-style evaluator unless you're hosting a model trained with Self-RAG reflection tokens.

Symptom: accuracy rises, but users feel the product is slow

Cause: Rewriting, HyDE generation, dense retrieval, sparse retrieval, reranking, grading, and final synthesis can become a long sequential path. Fix: Parallelize independent retrievals, cap reranker candidates, cache repeated rewrites or hypothetical documents, and measure retrieval and generation latency separately.
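As a rough sketch of the parallelization and capping advice, dense and sparse retrieval can run concurrently with `asyncio.gather`, with retrieval latency timed on its own. The two search functions are placeholders that sleep instead of calling a real index, and `max_candidates` is an assumed knob for limiting what reaches the reranker.

```python
import asyncio
import time

async def dense_search(query: str) -> list[str]:
    await asyncio.sleep(0.05)  # stands in for a vector-index call
    return ["returns policy", "refund timeline"]

async def sparse_search(query: str) -> list[str]:
    await asyncio.sleep(0.05)  # stands in for a keyword-index call
    return ["refund timeline", "label expiry"]

async def retrieve_parallel(query: str, max_candidates: int = 10) -> tuple[list[str], float]:
    start = time.perf_counter()
    # The two searches do not depend on each other, so they run concurrently.
    dense, sparse = await asyncio.gather(dense_search(query), sparse_search(query))
    retrieval_ms = (time.perf_counter() - start) * 1000
    # Deduplicate while preserving order, then cap what reaches the reranker.
    merged = list(dict.fromkeys(dense + sparse))[:max_candidates]
    return merged, retrieval_ms

candidates, retrieval_ms = asyncio.run(retrieve_parallel("refund for delayed order"))
print(candidates)
print(f"retrieval took {retrieval_ms:.0f} ms")
```

Because both placeholder calls wait at the same time, the measured retrieval time should be close to the slower of the two rather than their sum. Timing generation separately, in the same way, shows which stage dominates the user-visible delay.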

Practical implementation strategy

Implementing all these techniques at once is overkill. A progressive approach is best:

Phase 1: Robust baseline

Start with Query Rewriting and Hybrid Search (Dense + Keyword).

  • Why: Fixes basic vocabulary mismatch and conversational ambiguity.
  • Cost: Low (1 extra LLM call for rewriting).

Phase 2: Add reranking

Add a Cross-Encoder Reranker (e.g., Cohere Rerank or BGE-Reranker) after retrieval.

  • Why: Can improve precision of the final candidate set when initial retrieval has decent recall.
  • Cost: Moderate. Cross-encoders are slower than initial retrieval, so keep candidate count small.

Phase 3: Specialized handling (HyDE/router)

Use a Router to classify queries.

  • If query is conceptual or abstract (e.g., "Explain how customs clearance works for international orders"): Use HyDE, since vocabulary mismatch is likely.
  • If query is factual or precise (e.g., "What is my tracking number for order #48291?"): Use standard retrieval, since HyDE risks hallucinating the exact number.
  • Why: Optimizes for different query types without applying HyDE to everything.
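A router does not have to be a model. The sketch below uses hand-written patterns as a first cut; the patterns, the cue words, and the `route_query` name are assumptions for illustration, and a production router would be tuned and evaluated on real traffic. Anything that matches neither rule defaults to standard retrieval because it is the cheaper path.

```python
import re
from typing import Literal

Route = Literal["standard", "hyde"]

# Hypothetical signals that a query needs an exact match.
EXACT_MATCH_PATTERNS = [
    re.compile(r"#\s*\d+"),                                 # order references like "#48291"
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                   # ISO dates
    re.compile(r"\btracking (number|id)\b", re.IGNORECASE),
    re.compile(r"\$\s*\d"),                                 # prices
]

# Hypothetical cue words for conceptual questions.
CONCEPTUAL_CUE = re.compile(r"^\s*(explain|how|why|describe)\b", re.IGNORECASE)

def route_query(query: str) -> Route:
    # Exact-match signals win, since HyDE could invent the identifier.
    if any(pattern.search(query) for pattern in EXACT_MATCH_PATTERNS):
        return "standard"
    if CONCEPTUAL_CUE.search(query):
        return "hyde"
    return "standard"

print(route_query("Explain how customs clearance works for international orders"))
print(route_query("What is my tracking number for order #48291?"))
```

Checking exact-match signals first means a query like "How much is order #48291?" stays on standard retrieval even though it starts with a conceptual cue word.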

Phase 4: Add correction loops

If accuracy is still insufficient, add a prompted critique loop or a CRAG-style evaluator before investing in a true Self-RAG model.

  • Prompt the model or a lightweight evaluator to grade retrieved documents before answer generation.
  • If evidence is weak, trigger a rewrite, second retrieval pass, or web fallback.
  • Why: Captures much of the reliability gain without requiring reflection-token fine-tuning.
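The grade-then-retry idea can be sketched as a bounded loop. This is an assumed structure, not a reference implementation: `retrieve`, `grade`, and `rewrite` are hypothetical callables you would back with your retriever, an evaluator or prompted grader, and a rewriting call, and the `min_score` and `max_attempts` defaults are arbitrary.

```python
from typing import Callable

def retrieve_with_correction(
    query: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[str, list[str]], float],
    rewrite: Callable[[str], str],
    min_score: float = 0.5,
    max_attempts: int = 2,
) -> tuple[list[str], list[str]]:
    """Return (documents, queries_tried). Empty documents means evidence stayed weak."""
    tried: list[str] = []
    current = query
    for _ in range(max_attempts):
        tried.append(current)
        docs = retrieve(current)
        # Always grade against the original query, not the rewritten one.
        if docs and grade(query, docs) >= min_score:
            return docs, tried
        current = rewrite(current)
    return [], tried

# Toy stand-ins so the loop runs end to end.
def toy_retrieve(q: str) -> list[str]:
    if "customs" in q.lower():
        return ["Customs duty schedule for textile orders."]
    return ["Warehouse picking guide."]

def toy_grade(query: str, docs: list[str]) -> float:
    return 0.9 if any("customs" in d.lower() for d in docs) else 0.1

def toy_rewrite(q: str) -> str:
    return "customs duties for textile orders"

docs, tried = retrieve_with_correction("import fees for fabric", toy_retrieve, toy_grade, toy_rewrite)
print(docs)
print(tried)
```

Returning an empty list when every attempt fails lets the caller choose the next step, such as a web fallback or an explicit "no reliable evidence" answer, instead of generating from weak context.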

Latency is the production constraint in advanced RAG. A pipeline with rewriting, HyDE, retrieval, reranking, grading, and generation turns one answer into several sequential model and retrieval steps. Stream the final generation, fetch dense and sparse results in parallel, and cache reusable artifacts when traffic is repetitive.

Don't release a more elaborate route because it improves a few anecdotes. Compare supported-answer quality and latency on a labeled set, then release only paths that meet both requirements.

release-a-retrieval-route-from-evals.py

```python
evaluations = [
    {"route": "rewrite+hybrid", "supported_accuracy": 0.91, "p95_ms": 180},
    {"route": "hyde+rerank", "supported_accuracy": 0.94, "p95_ms": 260},
    {"route": "agentic-loop", "supported_accuracy": 0.95, "p95_ms": 710},
]
minimum_supported_accuracy = 0.93
maximum_p95_ms = 350

# Keep only routes that satisfy both the quality and the latency requirement.
eligible = [
    row for row in evaluations
    if row["supported_accuracy"] >= minimum_supported_accuracy
    and row["p95_ms"] <= maximum_p95_ms
]

if not eligible:
    # max() would raise on an empty list, so handle the no-winner case explicitly.
    print("no route meets both requirements")
else:
    released = max(eligible, key=lambda row: row["supported_accuracy"])
    print(f"released route: {released['route']}")
    print(f"supported_accuracy={released['supported_accuracy']:.2f} p95_ms={released['p95_ms']}")
```

Output

```
released route: hyde+rerank
supported_accuracy=0.94 p95_ms=260
```

Try it yourself

The best way to internalize these techniques is to apply them to a small, concrete dataset. Here's a focused exercise you can complete in under an hour.

Exercise: Build a query rewriter for a returns chatbot

Setup: Collect five real customer messages from a returns chatbot (or write realistic ones). Include at least one ambiguous message that needs conversation history, one complex comparison, and one vague keyword.

Step 1, Rewrite: Write a Python function that takes a customer message plus the last two turns of chat history and outputs a standalone query. Run it on your five messages and inspect the results. Does the rewritten query contain the full intent?

Step 2, Measure: For each original message, manually decide which of your internal policy documents should be retrieved. Then run the rewritten query through a simple dense-retrieval setup (even a small embedding model like all-MiniLM-L6-v2 against a dozen policy chunks). Count how many of the top-3 results match your manual gold set. The rewrite should improve hit rate for the ambiguous and vague cases.

Step 3, Diagnose: Pick one message where retrieval still fails. Is the problem vocabulary mismatch (try multi-query expansion), semantic asymmetry (try HyDE), or weak evidence (try a CRAG-style evaluator)? Implement the fix and measure again.

Expected outcome: Record which interventions improve top-3 evidence hits or catch weak retrieval on your examples, and which add latency without a gain. A small exercise may not reproduce paper results; its value is exposing the measurement loop.
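For the Measure step, a small helper makes the before-and-after comparison explicit. The sketch below assumes you have already run retrieval and recorded ranked chunk IDs per message; the IDs, the `top_k_matches` and `hit_rate` names, and the two sample messages are made up for illustration.

```python
def top_k_matches(ranked_ids: list[str], gold_ids: set[str], k: int = 3) -> int:
    # Number of top-k results that appear in the manual gold set.
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in gold_ids)

def hit_rate(runs: dict[str, list[str]], gold: dict[str, set[str]], k: int = 3) -> float:
    # Fraction of messages with at least one gold chunk in the top k.
    if not runs:
        return 0.0
    hits = sum(1 for msg_id, ranked in runs.items() if top_k_matches(ranked, gold[msg_id], k) > 0)
    return hits / len(runs)

# Hypothetical gold labels and retrieval results for two messages.
gold = {
    "m1": {"refund-policy"},
    "m2": {"replacement-policy", "shipping-delays"},
}
original = {
    "m1": ["faq-general", "store-hours", "gift-cards"],
    "m2": ["shipping-delays", "faq-general", "store-hours"],
}
rewritten = {
    "m1": ["refund-policy", "faq-general", "returns-window"],
    "m2": ["replacement-policy", "shipping-delays", "faq-general"],
}

print(f"original hit rate@3:  {hit_rate(original, gold):.2f}")
print(f"rewritten hit rate@3: {hit_rate(rewritten, gold):.2f}")
```

With only five messages the numbers are coarse, so treat them as a way to spot which cases changed rather than as a statistically meaningful score.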

Going deeper

Once you understand the mechanics, the important skill is knowing when each technique pays off and when it adds unnecessary complexity. Here are three critical trade-offs engineers face when moving from prototype to production.

When does HyDE justify its latency cost?

HyDE adds at least one extra generation call before retrieval. That cost is only worth it when measured retrieval quality improves enough to justify it. A smaller instruction model may be adequate for the hypothetical document, but evaluate it rather than assuming equivalence. Cache repeated proxy documents only when generation configuration and source policy make reuse valid. For exact-match lookups like tracking numbers, order IDs, or dates, skip HyDE.
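One way to make reuse conditional on configuration is to put that configuration into the cache key. The sketch below is an assumption about how such a cache could look: `hyde_cache_key`, `HydeCache`, the `policy_version` field, and the model name `small-instruct` are hypothetical, and `fake_generate` stands in for a real generation call.

```python
import hashlib
import json
from typing import Callable

def hyde_cache_key(query: str, model: str, temperature: float, policy_version: str) -> str:
    # Normalize case and whitespace so trivially different queries share an entry.
    normalized = " ".join(query.lower().split())
    payload = json.dumps(
        {"query": normalized, "model": model, "temperature": temperature, "policy_version": policy_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class HydeCache:
    def __init__(self, generate: Callable[[str], str], model: str, temperature: float, policy_version: str) -> None:
        self._generate = generate
        self._config = (model, temperature, policy_version)
        self._store: dict[str, str] = {}
        self.generation_calls = 0

    def get(self, query: str) -> str:
        key = hyde_cache_key(query, *self._config)
        if key not in self._store:
            # Cache miss: pay for one generation call and remember the result.
            self._store[key] = self._generate(query)
            self.generation_calls += 1
        return self._store[key]

def fake_generate(query: str) -> str:
    return f"Hypothetical passage about: {query}"

cache = HydeCache(fake_generate, model="small-instruct", temperature=0.0, policy_version="2024-01")
first = cache.get("How does customs clearance work?")
second = cache.get("  how does CUSTOMS clearance work? ")
print(cache.generation_calls)
```

Because the model, temperature, and policy version are part of the key, changing any of them produces new keys and the old hypothetical documents are never reused under the new configuration.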

When does Self-RAG degrade instead of help?

Self-RAG degrades when retrieval itself is weak or when the model's learned critique tokens stop correlating with real answer quality. It also raises inference cost because the model may retrieve on demand, branch over multiple passages, and spend extra decoding steps on critique tokens before choosing a continuation. For a simple FAQ bot, first measure a cheaper baseline such as rewrite plus reranking or a correction gate before committing to a specialized Self-RAG serving path.

Can you combine HyDE and Self-RAG?

As a system design, you can use HyDE to propose initial candidate passages, then let a true Self-RAG model score branches with IsRel, IsSup, and IsUse. That composition needs its own evaluation; neither mechanism guarantees the other improves it. If you don't have a reflection-token model, describe the composition as HyDE plus a CRAG-style evaluator or prompted support checks rather than Self-RAG.

Mastery check

Key concepts

  • standalone query rewriting
  • multi-query expansion
  • query decomposition
  • HyDE
  • Self-RAG reflection tokens
  • CRAG evaluator
  • agentic retrieval escalation
  • retrieval versus generation latency

Evaluation rubric

  • Foundational: Explains why rewrite, expansion, decomposition, HyDE, Self-RAG, and CRAG solve different failures instead of treating them as one interchangeable bag of tricks
  • Intermediate: Implements at least one rewrite path, one HyDE-style retrieval path, and one correction gate with clear boundaries between retrieval, grading, and answer generation
  • Advanced: Chooses the cheapest retrieval ladder that fits the failure mode and can defend latency, factuality, and exact-match routing trade-offs

Common pitfalls

  • Symptom: vague follow-up questions retrieve generic FAQ chunks. Cause: the retriever saw only the latest turn. Fix: rewrite the latest turn into a standalone query using recent history.
  • Symptom: HyDE improves conceptual queries but hurts tracking-number or price lookups. Cause: the hypothetical document invented precise facts and dragged retrieval off course. Fix: route exact-match queries to ordinary hybrid retrieval.
  • Symptom: a prompted grading loop is described as Self-RAG. Cause: the control logic looks similar at a distance. Fix: reserve the Self-RAG name for models trained to emit reflection tokens during generation.
  • Symptom: retrieval quality improves but user experience gets slower and more fragile. Cause: rewriting, HyDE, reranking, grading, and synthesis were stacked into one long sequential path. Fix: parallelize independent work, cap expensive stages, and measure retrieval and generation latency separately.
  • Symptom: teams add agentic retrieval to simple FAQ lookups. Cause: escalation happened because the technique sounded more advanced, not because the failure demanded it. Fix: climb the ladder only when one retrieval pass can't gather or validate the needed evidence.
Next Step
Continue to GraphRAG & Knowledge Graphs

There, you'll examine how Microsoft's GraphRAG architecture uses entity graphs, hierarchical community reports, and embeddings for corpus-wide relationship questions that pure vector retrieval can struggle with.

References

[1] Query Rewriting for Retrieval-Augmented Large Language Models. Ma, X., et al. · 2023 · arXiv preprint

[2] RAG-Fusion: a New Take on Retrieval-Augmented Generation. Rackauckas, Z. · 2024

[3] Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. Cormack, G. V., Clarke, C. L. A., & Buettcher, S. · 2009 · SIGIR '09

[4] Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. Zhou, D., et al. · 2022 · ICLR 2023

[5] Precise Zero-Shot Dense Retrieval without Relevance Labels. Gao, L., Ma, X., Lin, J., & Callan, J. · 2022 · arXiv preprint

[6] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. Asai, A., et al. · 2023 · arXiv preprint

[7] Corrective Retrieval Augmented Generation. Yan, S.-Q., et al. · 2024 · arXiv preprint

[8] Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions. Trivedi, H., Balasubramanian, N., Khot, T., & Sabharwal, A. · 2022