Learn how query rewriting, HyDE, Self-RAG, and Corrective RAG change retrieval control, and how to evaluate their cost and evidence quality.
Advanced retrieval-augmented generation (RAG) adds retrieval controls for queries that are ambiguous, underspecified, or mismatched with the document corpus. This chapter covers query rewriting, HyDE (Hypothetical Document Embeddings), Self-RAG (Self-reflective RAG), and correction loops, with attention to what each approach does and doesn't guarantee.
Vector database internals showed how an index finds nearby chunks quickly. This chapter starts one level above the index: how to turn a messy user request into the right retrieval work before the generator tries to answer.
Imagine a customer opens your order-tracking chat and types, "What about the other one?" They could mean the second item in their cart, the replacement shipment you sent yesterday, or the refund still pending. The retriever has too little context. Or imagine a warehouse clerk searching for "label" and getting barcode-label specs when they needed the return-label policy. This is the fundamental challenge of search: users rarely ask questions in the exact way that matches the documents they need.
In a basic RAG system, the retriever finds the documents most similar to the user's question, and the Large Language Model (LLM) tries to answer from them. This "naive" approach works for simple fact lookups, but it often fails on real traffic, where queries are ambiguous ("What about the other one?"), complex ("Compare X and Y"), or phrased differently than the source material.
For production search systems, adaptive retrieval logic is one option when measurements expose these failures. Instead of searching only the user's exact words, a route may reformulate queries, grade what it finds, or decide whether retrieval is needed at all.
A standard RAG implementation (often called "Naive RAG") typically embeds the user's raw query and fetches the top-k nearest neighbors from a vector database. This simple "retrieve-then-generate" pipeline suffers from systematic failure modes when queries are ambiguous, multi-part, or worded differently from the source material.
These failure modes are common in real applications. A useful production pipeline measures them separately, then adds the smallest retrieval intervention that repairs the observed failure rather than making every question pay for an elaborate loop.
User queries are rarely optimal for vector search. They're often short, lack context, or contain multiple distinct questions about orders, shipments, and refunds all at once.
Think of this like a support lead who turns a vague customer message into a precise search request before querying the policy base. Conversational queries often rely on implicit context. In rewrite-retrieve-read systems, the rewrite step turns the raw input into a retrieval-oriented query instead of treating the user's wording as sacred.[1] Ma et al. study that pattern with a dedicated rewrite stage in front of retrieval; in chat products, a pragmatic variant is to rewrite the latest turn into a standalone, search-optimized question before retrieval.[1]
Here's a concrete example. A customer sends two messages:
Customer: "My box arrived crushed."

Customer: "What do I do now?"
A stateless retriever searching for "What do I do now?" would pull generic help-desk articles. A rewrite step instead produces:
Standalone query: "How do I file a damage claim and request a replacement for a crushed package?"
In production, the rewrite model can be any instruction-tuned LLM behind a small interface. The copy-runnable example below uses a deterministic fake model so the parsing contract can be tested locally without API keys.
```python
from typing import Protocol, TypedDict


class ChatMessage(TypedDict):
    role: str
    content: str


class RewriteModel(Protocol):
    def rewrite(self, latest_query: str, history_text: str) -> str: ...


class FakeRewriteModel:
    """Deterministic stand-in for an instruction-tuned LLM."""

    def rewrite(self, latest_query: str, history_text: str) -> str:
        if "crushed" in history_text.lower() and "what do i do" in latest_query.lower():
            return "How do I file a damage claim and request a replacement for a crushed package?"
        return latest_query


def rewrite_query_with_history(
    query: str, chat_history: list[ChatMessage], model: RewriteModel
) -> str:
    # Flatten the history into "role: content" lines for the rewrite model.
    history_text = "\n".join(
        f"{message['role']}: {message['content']}" for message in chat_history
    )
    return model.rewrite(query, history_text).strip()


history: list[ChatMessage] = [
    {"role": "customer", "content": "My box arrived crushed."},
]

rewritten = rewrite_query_with_history("What do I do now?", history, FakeRewriteModel())
print(rewritten)
```

```text
How do I file a damage claim and request a replacement for a crushed package?
```

A single query might miss relevant documents due to vocabulary mismatch. Multi-Query Expansion generates synonymous queries to improve recall.
For example, if a user asks "How do you handle peak-season delivery delays?", relevant internal documents might use terms like "carrier surge capacity", "warehouse backorder policies", or "expedited shipping alternatives". To improve recall across different vocabularies, we can automatically generate variations of the query.
Once you fan out into several queries, you have several ranked result lists to merge. The common pattern is RAG-Fusion, which pairs LLM-generated query variants with Reciprocal Rank Fusion (RRF) to combine the lists into one ranking.[2] RRF ignores raw similarity scores and sums reciprocal ranks instead, so a document that lands near the top of several lists rises even if no single list ranked it first.[3] Concretely, a document d gets score sum over lists of 1 / (k + rank(d)), where k is a small constant (commonly 60) and rank starts at 1. Documents missing from a list contribute zero from that list. RRF is the same rank-merge trick used to fuse dense and sparse results in hybrid search, reused here to fuse query variants.
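As a worked form of the score described above, write $R$ for the set of ranked lists and $\mathrm{rank}_r(d)$ for the 1-based position of document $d$ in list $r$. These symbols are notation chosen here for illustration, not taken from the cited papers.

```latex
\[
\mathrm{RRF}(d) = \sum_{r \in R,\; d \in r} \frac{1}{k + \mathrm{rank}_r(d)}
\]
% Example with k = 60: a document ranked 1st in one list and 3rd in another
\[
\frac{1}{60 + 1} + \frac{1}{60 + 3} \approx 0.0164 + 0.0159 = 0.0323
\]
```

A document that appears in only one list, even at rank 1, scores about 0.0164, so it lands below the document that shows up in both lists.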
Here's how to implement multi-query expansion while keeping the boundary testable. The model returns one query per line, and the parser removes bullets or numbering so all variants can be searched in parallel.
```python
import re
from typing import Protocol


class QueryExpansionModel(Protocol):
    def expand(self, query: str, n: int) -> str: ...


class FakeExpansionModel:
    def expand(self, query: str, n: int) -> str:
        return "\n".join(
            [
                "Carrier surge capacity during holidays",
                "Warehouse backorder policies for high-volume periods",
                "Expedited shipping alternatives for delayed orders",
            ][:n]
        )


# Matches one leading list marker such as "- ", "* ", "1. ", or "2) ".
LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


def clean_query_line(line: str) -> str:
    # Strip only the list marker, so a query like "24-hour delivery rules" keeps its digits.
    return LIST_MARKER.sub("", line).strip()


def generate_multi_queries(
    query: str, model: QueryExpansionModel, n: int = 3
) -> list[str]:
    content = model.expand(query, n)
    variants = [clean_query_line(line) for line in content.splitlines()]
    return [variant for variant in variants if variant]


queries = generate_multi_queries(
    "How do you handle peak-season delivery delays?", FakeExpansionModel(), n=3
)

for query in queries:
    print(f"- {query}")
```

```text
- Carrier surge capacity during holidays
- Warehouse backorder policies for high-volume periods
- Expedited shipping alternatives for delayed orders
```

Generating variants isn't enough; the system must merge their result lists without assuming similarity scores from separate searches are directly comparable. RRF provides a deterministic rank-based merge:
```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], rank_constant: int = 60
) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, document_id in enumerate(ranking, start=1):
            scores[document_id] = scores.get(document_id, 0.0) + 1 / (rank_constant + rank)
    # Sort by descending score, breaking ties by document ID for determinism.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


ranked_lists = [
    ["carrier-capacity", "return-policy", "sla"],
    ["sla", "carrier-capacity", "backorder"],
    ["carrier-capacity", "expedited-options", "sla"],
]

for document_id, score in reciprocal_rank_fusion(ranked_lists)[:3]:
    print(f"{document_id}: {score:.4f}")
```

```text
carrier-capacity: 0.0489
sla: 0.0481
expedited-options: 0.0161
```

Complex questions often require multiple retrieval steps, as a single search may fail to gather all the necessary facts. Decomposition breaks a complex query into a series of simpler sub-queries that can be executed sequentially or in parallel.
For instance, consider the following analytical query:
"Compare standard and express shipping delivery times for fragile items."
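A standard retriever might struggle to find a single document containing this exact comparison. Instead, we decompose it into sub-questions such as:

- What is the standard shipping delivery time for fragile items?
- What is the express shipping delivery time for fragile items?
- What packaging and liability rules apply to fragile item shipments?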
This decomposition pattern is closely related to least-to-most prompting, which decomposes hard problems into simpler steps.[4] In RAG, teams reuse that idea for retrieval coverage rather than chain-of-thought supervision. Answering simpler questions first can gather explicit evidence for each sub-fact before synthesis, but it also adds queries and possible error propagation. Measure supported-answer accuracy and latency before releasing it.
Here's a simple implementation. A real model would produce the sub-questions, but the rest of the system should only depend on the line-oriented contract.
```python
import re
from typing import Protocol


class DecompositionModel(Protocol):
    def decompose(self, query: str) -> str: ...


class FakeDecompositionModel:
    def decompose(self, query: str) -> str:
        return "\n".join(
            [
                "What is the standard shipping delivery time for fragile items?",
                "What is the express shipping delivery time for fragile items?",
                "What packaging and liability rules apply to fragile item shipments?",
                "How do standard and express shipping compare given those facts?",
            ]
        )


# Matches one leading list marker such as "- ", "* ", "1. ", or "2) ".
LIST_MARKER = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


def decompose_query(query: str, model: DecompositionModel) -> list[str]:
    # Remove only the list marker so sub-questions that start with a number survive.
    lines = [
        LIST_MARKER.sub("", line).strip()
        for line in model.decompose(query).splitlines()
    ]
    return [line for line in lines if line]


sub_questions = decompose_query(
    "Compare standard and express shipping delivery times for fragile items.",
    FakeDecompositionModel(),
)

for index, question in enumerate(sub_questions, start=1):
    print(f"{index}. {question}")
```

```text
1. What is the standard shipping delivery time for fragile items?
2. What is the express shipping delivery time for fragile items?
3. What packaging and liability rules apply to fragile item shipments?
4. How do standard and express shipping compare given those facts?
```

Standard dense retrieval matches a query embedding to document embeddings. However, queries are often short and interrogative, while indexed passages are longer and declarative. This semantic asymmetry hurts retrieval quality.
HyDE (Hypothetical Document Embeddings) bridges this gap by generating a hypothetical document first, then using that generated text for retrieval.[5] In Gao et al., the model is prompted to "write a document that answers the question." Think of it like searching for a specific return-policy clause when you can't remember its exact wording. Instead of asking the knowledge base "What do we do about broken things?", you write a one-paragraph summary of what you remember the policy says and ask, "Where are documents that look like this paragraph?"
Here's a concrete e-commerce example. A customer asks:
Query: "Can I get a refund if my perishable item spoils after 48 hours?"
A standard dense retriever might embed the short question and pull generic "refund policy" articles that don't mention perishables. HyDE instead prompts the model to write a hypothetical policy paragraph:
Hypothetical document: "Perishable goods refunds: Items classified as perishable (fresh food, flowers, temperature-sensitive medicine) must be reported within 24 hours of delivery. After 24 hours, spoilage claims require photo evidence and a carrier delay affidavit..."
That generated paragraph is longer, declarative, and uses vocabulary like "perishable goods", "carrier delay affidavit", and "temperature-sensitive". It is an illustrative search proxy, not an answer: the generated deadline may be wrong, and only retrieved source text may support the final response.
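The original HyDE pipeline has three phases:[5]

1. Generate: an instruction-following LLM writes a hypothetical document that answers the query.
2. Encode: a dense encoder (Contriever in the paper) embeds the hypothetical document.
3. Retrieve: the resulting vector drives nearest-neighbor search over the real corpus, so only real documents are returned.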
The production version uses a real generator and dense encoder. This small runnable version uses keyword sets so the retrieval contract is visible: the query first becomes a document-like paragraph, then the retriever searches with that paragraph rather than the original question.
```python
from dataclasses import dataclass
from typing import Protocol


class HypotheticalDocGenerator(Protocol):
    def generate(self, query: str) -> str: ...


@dataclass(frozen=True)
class Chunk:
    id: str
    text: str


class FakeHyDEGenerator:
    def generate(self, query: str) -> str:
        return (
            "Perishable goods refunds require a spoilage claim, photo evidence, "
            "delivery timestamp review, and carrier delay notes."
        )


class KeywordRetriever:
    """Toy retriever that ranks chunks by shared lowercase terms."""

    def __init__(self, chunks: list[Chunk]) -> None:
        self.chunks = chunks

    def search(self, search_text: str, k: int = 2) -> list[Chunk]:
        query_terms = set(
            search_text.lower().replace(",", " ").replace(".", " ").split()
        )

        def score(chunk: Chunk) -> int:
            chunk_terms = set(
                chunk.text.lower().replace(",", " ").replace(".", " ").split()
            )
            return len(query_terms & chunk_terms)

        return sorted(self.chunks, key=score, reverse=True)[:k]


def hyde_retrieve(
    query: str, generator: HypotheticalDocGenerator, retriever: KeywordRetriever
) -> list[Chunk]:
    # Search with the generated paragraph, not with the original question.
    hypothetical_doc = generator.generate(query)
    return retriever.search(hypothetical_doc, k=2)


chunks = [
    Chunk("generic-refunds", "Refund requests require an order number and customer email."),
    Chunk(
        "perishable-refunds",
        "Perishable goods spoilage claims need photo evidence and delivery timestamp review.",
    ),
    Chunk("shipping-labels", "Return labels can be regenerated from the order portal."),
]

matches = hyde_retrieve(
    "Can I get a refund if my food spoils after delivery?",
    FakeHyDEGenerator(),
    KeywordRetriever(chunks),
)

match_ids = [chunk.id for chunk in matches]
print(f"retrieved: {match_ids}")
```

```text
retrieved: ['perishable-refunds', 'generic-refunds']
```

The HyDE flow is simple but important: change the search object first, then retrieve.
In the zero-shot experiments reported for HyDE, converting query text into hypothetical document text improved retrieval over the authors' Contriever baseline.[5] The mechanism can still fail on a new corpus, especially when the proxy invents a high-impact identifier or policy detail.
HyDE isn't a silver bullet, and its effectiveness depends heavily on the query type. It's particularly useful in the zero-shot setting studied in the paper, especially when there's a strong mismatch between how users ask a question and how documents phrase the answer.[5] That's why it tends to help with abstract or conceptual queries. The paper also evaluates non-English retrieval, which is multilingual rather than cross-lingual.[5]
In practice, many teams gate HyDE off for exact-match lookups such as IDs, dates, prices, or error codes. That part is an engineering inference from the mechanism, not a claim from the paper: if the pseudo-document invents a precise fact, retrieval can drift toward text that echoes the invention instead of the source chunk.
HyDE can sometimes "hallucinate" a direction so far off-base that it retrieves irrelevant documents. Monitor retrieval quality when deploying HyDE in production, and gate it away from exact identifiers, dates, prices, and error codes.
```python
import re

# Matches order numbers ("order #48291") or tracking codes of 8+ characters.
EXACT_LOOKUP = re.compile(r"\b(?:order\s*#?\d+|tracking\s+[A-Z0-9-]{8,})\b", re.IGNORECASE)


def retrieval_route(query: str) -> str:
    if EXACT_LOOKUP.search(query):
        return "hybrid_exact_preserving"
    return "hyde_candidate"


queries = [
    "What is the status of order #48291?",
    "How does customs clearance work for fragile imports?",
]

for query in queries:
    print(f"{retrieval_route(query)}: {query}")
```

```text
hybrid_exact_preserving: What is the status of order #48291?
hyde_candidate: How does customs clearance work for fragile imports?
```

Many retrieve-then-generate pipelines fetch context once without a model-generated critique step. Self-RAG instead fine-tunes a generator to emit special reflection tokens that control retrieval and score candidate generation segments.[6]
Self-RAG uses one retrieval token family and three critique token families:[6]
- Retrieve, with values Yes, No, or Continue. This decides whether the model should fetch evidence before generating the next segment.
- IsRel, with labels such as Relevant or Irrelevant. This scores whether a retrieved passage is helpful for the current query or segment.
- IsSup, with labels Fully supported, Partially supported, and No support. This checks whether the generated claim is grounded in retrieved evidence.
- IsUse, with utility scores from 1 to 5. This measures how useful the final response is for the user.

Paper examples serialize these as inline control tags such as [Retrieve=Yes], [Relevant], and [Fully supported]. The exact bracket syntax is less important than the four decision families.
To see these tokens in action, imagine the same perishable-refund query. A Self-RAG model might generate the following token stream:
- [Retrieve=Yes]: the model decides it needs evidence before answering.
- [Relevant]: the passage is useful for the current segment.
- [Fully supported]: the claim is grounded in the retrieved text.
- [IsUse=4]: the response is helpful but could be more detailed.

Without reflection tokens, a standard RAG pipeline might have retrieved the same passage without explicitly scoring passage relevance or claim support. Reflection tokens expose the model's predicted judgments for scoring and control; they aren't proof that a claim is true.
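Here is a minimal sketch of how application code might read such a stream, assuming the inline bracket tags shown above. The `parse_reflection_tags` helper, the `ReflectionTrace` fields, and the sample string are hypothetical. A real Self-RAG checkpoint emits these decisions as special vocabulary tokens with probabilities, so treat the regex approach as a logging aid rather than the paper's decoding procedure.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical label sets that mirror the token families described above.
RELEVANCE_LABELS = {"Relevant", "Irrelevant"}
SUPPORT_LABELS = {"Fully supported", "Partially supported", "No support"}
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


@dataclass
class ReflectionTrace:
    retrieve: Optional[str] = None
    relevance: Optional[str] = None
    support: Optional[str] = None
    utility: Optional[int] = None


def parse_reflection_tags(stream: str) -> ReflectionTrace:
    trace = ReflectionTrace()
    for tag in TAG_PATTERN.findall(stream):
        tag = tag.strip()
        if tag.startswith("Retrieve="):
            trace.retrieve = tag.split("=", 1)[1]
        elif tag.startswith("IsUse="):
            trace.utility = int(tag.split("=", 1)[1])
        elif tag in RELEVANCE_LABELS:
            trace.relevance = tag
        elif tag in SUPPORT_LABELS:
            trace.support = tag
        # Unrecognized bracketed text is ignored.
    return trace


def strip_reflection_tags(stream: str) -> str:
    # Removes every bracketed span, so it would also drop citation markers like [1].
    return " ".join(TAG_PATTERN.sub(" ", stream).split())


sample = (
    "[Retrieve=Yes] [Relevant] Perishable spoilage claims need photo evidence. "
    "[Fully supported] [IsUse=4]"
)
trace = parse_reflection_tags(sample)
print(trace)
print(strip_reflection_tags(sample))
```

Keeping the parsed trace next to the user-visible text makes it possible to log how often the model asked for retrieval or flagged weak support, without showing control tags to users.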
Self-RAG is more than "retrieve once, then critique at the end." At inference time it can emit a retrieval decision, retrieve top-k passages on demand, generate candidate segments conditioned on different passages in parallel, and score those branches with reflection-token probabilities.[6] That segment-level beam search is what makes Self-RAG distinct from a simple prompted guardrail loop.
Deploying a true Self-RAG system requires a generator specifically fine-tuned to emit these reflection tokens during generation. The paper releases 7B and 13B checkpoints trained that way.[6] During training, the authors use a separate critic model to insert reflection tokens offline, then train the final generator to emit those tokens itself at inference time.[6]
A prompted frontier model can imitate parts of this control loop, but that isn't the same system. Without reflection-token fine-tuning, you're building a Self-RAG-inspired agentic pipeline: separate routing, retrieval grading, and answer validation calls stitched together in application code.
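Here is a minimal sketch of that Self-RAG-inspired pipeline, assuming three separate components behind small interfaces. The `Router`, `Grader`, and `Validator` protocols, the keyword-based fakes, and the fallback message are hypothetical stand-ins for prompted model calls. Nothing here reproduces the trained reflection-token model.

```python
import re
from typing import Callable, Protocol

FALLBACK = "No reliable evidence found."


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class Router(Protocol):
    def needs_retrieval(self, query: str) -> bool: ...


class Grader(Protocol):
    def is_relevant(self, query: str, passage: str) -> bool: ...


class Validator(Protocol):
    def is_supported(self, answer: str, passages: list[str]) -> bool: ...


class GreetingRouter:
    def needs_retrieval(self, query: str) -> bool:
        # Hypothetical rule: greetings skip retrieval, everything else retrieves.
        return query.lower().strip("!.? ") not in {"hi", "hello", "thanks"}


class OverlapGrader:
    def is_relevant(self, query: str, passage: str) -> bool:
        return len(words(query) & words(passage)) >= 2


class SubstringValidator:
    def is_supported(self, answer: str, passages: list[str]) -> bool:
        return any(answer in passage for passage in passages)


def answer_with_critique(
    query: str,
    search: Callable[[str], list[str]],
    draft: Callable[[str, list[str]], str],
    router: Router,
    grader: Grader,
    validator: Validator,
) -> str:
    if not router.needs_retrieval(query):
        return draft(query, [])
    # Keep only the passages the grader accepts.
    passages = [p for p in search(query) if grader.is_relevant(query, p)]
    if not passages:
        return FALLBACK
    answer = draft(query, passages)
    # Release the draft only if the validator finds it in the evidence.
    return answer if validator.is_supported(answer, passages) else FALLBACK


CORPUS = [
    "Perishable spoilage claims need photo evidence within 24 hours.",
    "Return labels can be regenerated from the order portal.",
]


def fake_search(query: str) -> list[str]:
    return CORPUS


def fake_draft(query: str, passages: list[str]) -> str:
    # Stand-in generator: echoes the first passage, or greets when there is no context.
    return passages[0] if passages else "Hello! How can I help?"


components = (fake_search, fake_draft, GreetingRouter(), OverlapGrader(), SubstringValidator())
for question in ["Do perishable spoilage claims need photos?", "hello", "What is the warranty on forklifts?"]:
    print(f"{question} -> {answer_with_critique(question, *components)}")
```

Each stage is a separate call in application code, which is what distinguishes this loop from a generator that emits its own reflection tokens while decoding.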
Self-RAG cost comes from on-demand retrieval plus branching over multiple passages and scoring those branches, not just from inserting a few extra tokens. Evaluate it where support-aware generation is valuable enough to justify that additional serving path.
The scoring sketch below doesn't implement Self-RAG training. It shows how an inference service can rank branches once a trained model has supplied relevance, support, and utility probabilities; the weights are an explicit product policy.
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSegment:
    text: str
    relevance_probability: float
    support_probability: float
    utility_probability: float


def branch_score(candidate: CandidateSegment) -> float:
    # Product-policy weights: support counts three times as much as relevance or utility.
    return (
        0.2 * candidate.relevance_probability
        + 0.6 * candidate.support_probability
        + 0.2 * candidate.utility_probability
    )


candidates = [
    CandidateSegment("refund window is 24 hours", 0.92, 0.96, 0.81),
    CandidateSegment("refund window is 48 hours", 0.95, 0.28, 0.88),
]
winner = max(candidates, key=branch_score)

print(f"chosen segment: {winner.text}")
print(f"score: {branch_score(winner):.3f}")
```

```text
chosen segment: refund window is 24 hours
score: 0.922
```

Corrective Retrieval Augmented Generation (CRAG) focuses on correction after imperfect retrieval. It adds a lightweight Retrieval Evaluator that scores retrieved question-document pairs and routes the request before generation.[7]
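In the paper, the evaluator is a lightweight T5-large model fine-tuned to score each retrieved question-document pair, then threshold those scores into one of three actions:[7]

- Correct: at least one retrieved document scores above the upper threshold, so the pipeline refines and uses the internal documents.
- Incorrect: every retrieved document scores below the lower threshold, so the pipeline discards them and falls back to web search.
- Ambiguous: anything in between, so the pipeline combines refined internal evidence with web results.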
Imagine a customer asks about "international customs duties for textile orders." The internal knowledge base has weak coverage. The evaluator might score the retrieved internal documents as Incorrect, triggering a web-search fallback. A production system would still need source allowlists and citation checks before trusting those web results. If internal documents are somewhat relevant but incomplete, the evaluator returns Ambiguous, and CRAG combines refined internal strips with web results.
The crucial distinction from Self-RAG is where correction happens. CRAG doesn't train the generator to emit reflection tokens. Instead, it inserts a separate evaluator between retrieval and generation and uses that evaluator to trigger correction paths.
```python
from typing import Literal

Decision = Literal["correct", "incorrect", "ambiguous"]


def crag_action(scores: list[float], lower: float = 0.2, upper: float = 0.7) -> Decision:
    if not scores:
        # Nothing was retrieved, so treat it as failed retrieval.
        return "incorrect"
    best_score = max(scores)
    if best_score > upper:
        # At least one document clears the upper threshold.
        return "correct"
    if best_score < lower:
        # Even the best document is below the lower threshold, so all are.
        return "incorrect"
    return "ambiguous"


print(f"strong retrieval: {crag_action([0.81, 0.15])}")
print(f"weak retrieval: {crag_action([0.10, 0.17])}")
print(f"uncertain retrieval: {crag_action([0.45, 0.09])}")
```

```text
strong retrieval: correct
weak retrieval: incorrect
uncertain retrieval: ambiguous
```
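Even relevant documents contain noise, so CRAG includes a decompose-then-recompose step: split each document into short strips, score each strip for relevance, and recombine only the strips that pass.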
Here's a simplified CRAG sketch in Python. The actual paper scores retrieved documents individually and then applies thresholds. For readability, the code below collapses that logic into a single classify helper. The refine_knowledge method then decomposes documents into smaller strips, scores those strips, and recomposes only the useful evidence before generation.
If you instantiate this class with real components and run crag.run("Customs duties for textile orders"), the evaluator might return "incorrect" because internal docs lack coverage. The pipeline would then call web_search.search(...) and pass the web results through refine_knowledge before generating the answer.
```python
from typing import Literal, Protocol

Decision = Literal["correct", "incorrect", "ambiguous"]


class SearchBackend(Protocol):
    def search(self, query: str, k: int = 5) -> list[str]: ...


class RetrievalEvaluator(Protocol):
    def classify(self, query: str, docs: list[str]) -> Decision: ...

    def is_relevant_strip(self, query: str, strip: str) -> bool: ...


class CorrectiveRAG:
    def __init__(
        self,
        vector_db: SearchBackend,
        evaluator_model: RetrievalEvaluator,
        web_search_tool: SearchBackend,
    ) -> None:
        self.vector_db = vector_db
        self.evaluator = evaluator_model
        self.web_search = web_search_tool

    def run(self, query: str) -> str:
        # Initial retrieval can be wrong because the private corpus is incomplete.
        retrieved_docs = self.vector_db.search(query, k=5)

        # The evaluator decides whether internal evidence is usable.
        decision = self.evaluator.classify(query, retrieved_docs)

        if decision == "correct":
            final_context = self.refine_knowledge(query, retrieved_docs)

        elif decision == "incorrect":
            web_results = self.web_search.search(query)
            final_context = self.refine_knowledge(query, web_results)

        else:  # ambiguous
            internal_context = self.refine_knowledge(query, retrieved_docs)
            web_context = self.refine_knowledge(query, self.web_search.search(query))
            final_context = internal_context + web_context

        return self.generate(query, final_context)

    def refine_knowledge(self, query: str, docs: list[str]) -> list[str]:
        # Decompose each document into strips and keep only the relevant ones.
        refined_strips = []
        for doc in docs:
            strips = self.chunk_into_strips(doc)
            for strip in strips:
                if self.evaluator.is_relevant_strip(query, strip):
                    refined_strips.append(strip)
        return refined_strips

    def chunk_into_strips(self, doc: str) -> list[str]:
        # Teaching version: sentence segmentation by period.
        return [segment.strip() for segment in doc.split('.') if segment.strip()]

    def generate(self, query: str, context: list[str]) -> str:
        if not context:
            return "No reliable evidence found."
        return f"Answer to '{query}' using: " + " ".join(context)


class FakeVectorDB:
    def search(self, query: str, k: int = 5) -> list[str]:
        return [
            "Warehouse picking guide. Use label printers in aisle 4.",
            "Domestic returns policy. Return labels expire after 30 days.",
        ][:k]


class FakeWebSearch:
    def search(self, query: str, k: int = 5) -> list[str]:
        return [
            "Official customs duty schedule for textile orders.",
            "Textile shipments require HS code review before import.",
        ][:k]


class FakeEvaluator:
    def classify(self, query: str, docs: list[str]) -> Decision:
        joined_docs = " ".join(docs).lower()
        if "customs" in joined_docs or "textile" in joined_docs:
            return "correct"
        return "incorrect"

    def is_relevant_strip(self, query: str, strip: str) -> bool:
        keywords = {"customs", "duty", "duties", "textile", "import"}
        strip_words = set(strip.lower().replace(",", " ").split())
        return bool(keywords & strip_words)


pipeline = CorrectiveRAG(FakeVectorDB(), FakeEvaluator(), FakeWebSearch())
answer = pipeline.run("international customs duties for textile orders")
print(answer)
```

```text
Answer to 'international customs duties for textile orders' using: Official customs duty schedule for textile orders Textile shipments require HS code review before import
```

Self-RAG and CRAG both introduce feedback, but at different boundaries: Self-RAG can make retrieval and critique decisions while generating segments, while CRAG routes after initial retrieval. An application can generalize feedback into agentic retrieval: instead of one fixed retrieve-then-generate pass, a model with tool access searches, reads results, tests whether evidence is sufficient, and either answers or searches again with a refined query.
The mechanism is older than the agent framing. Interleaving retrieval with reasoning across multiple hops is what IRCoT formalized: retrieve, reason a step, use that step to drive the next retrieval, and repeat until the chain of evidence is complete.[8] Multi-hop questions like "Which carrier handles our heaviest returns lane, and what is its peak-season SLA?" need exactly this, because the answer to the second part depends on resolving the first.
The engineering question is when to escalate from a single retrieve-then-generate pass to a correction gate or a full agentic loop, since each step up costs latency and tokens.
Escalate only when the failure mode and evaluation demand it. On a simple FAQ lookup, an agentic loop can add latency and failure paths without improving retrieved evidence.
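Here is a minimal sketch of the retrieve, reason, retrieve-again loop described above, with a hard hop cap so cost stays bounded. The `Reasoner` interface, the dictionary-backed search, and the carrier name and SLA value are hypothetical. IRCoT interleaves LLM-generated chain-of-thought sentences with retrieval, which the fake reasoner only imitates.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol


@dataclass(frozen=True)
class Step:
    answer: Optional[str]      # set when the evidence is sufficient
    next_query: Optional[str]  # set when another hop is needed


class Reasoner(Protocol):
    def step(self, question: str, evidence: list[str]) -> Step: ...


# Hypothetical two-document corpus keyed by the query that finds each document.
FAKE_INDEX = {
    "heaviest returns lane carrier": "The heaviest returns lane is handled by NorthFreight.",
    "NorthFreight peak-season SLA": "NorthFreight peak-season SLA is 5 business days.",
}


def fake_search(query: str) -> list[str]:
    return [FAKE_INDEX[query]] if query in FAKE_INDEX else []


class FakeReasoner:
    def step(self, question: str, evidence: list[str]) -> Step:
        joined = " ".join(evidence)
        if "SLA is" in joined:
            return Step(answer=joined, next_query=None)
        if "NorthFreight" in joined:
            # The first hop resolved the carrier, so the second hop asks about its SLA.
            return Step(answer=None, next_query="NorthFreight peak-season SLA")
        return Step(answer=None, next_query="heaviest returns lane carrier")


def iterative_retrieve(
    question: str,
    reasoner: Reasoner,
    search: Callable[[str], list[str]],
    max_hops: int = 3,
) -> tuple[str, int]:
    evidence: list[str] = []
    hops_used = 0
    for hops_used in range(max_hops + 1):
        step = reasoner.step(question, evidence)
        if step.answer is not None:
            return step.answer, hops_used
        if step.next_query is None or hops_used == max_hops:
            break
        evidence.extend(search(step.next_query))
    return "No reliable evidence found.", hops_used


QUESTION = "Which carrier handles our heaviest returns lane, and what is its peak-season SLA?"
answer, hops = iterative_retrieve(QUESTION, FakeReasoner(), fake_search)
print(f"hops={hops} answer={answer}")
```

The returned hop count is worth logging per request, since it is the quantity that drives the extra latency and token cost of the loop.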
Choose these techniques by failure mode, not by novelty. HyDE targets semantic mismatch. Self-RAG changes retrieval timing and branch scoring inside a trained generator. CRAG adds a separate correction gate after retrieval.
| Feature | Naive RAG | HyDE | Self-RAG | CRAG |
|---|---|---|---|---|
| Retrieval Trigger | Always | Always | Dynamic ([Retrieve]) | Always |
| Query Representation | Raw Query | Hypothetical Doc | Raw Query + Partial Generation | Raw Query |
| Retrieval Quality Check | None | None; changes query representation | Predicted relevance, support, and utility tokens | Retrieval evaluator confidence |
| External Search | No | No | No | Yes (on ambiguous/incorrect retrieval) |
| Primary Use Case | Simple Q&A | Abstract or vocabulary-mismatched queries | High-factuality generation with a specialized model | When internal retrieval quality is inconsistent |
| Latency | Low | Medium | High | Medium-High |
Advanced RAG helps only when it matches the failure. Watch for these symptoms before adding another model call.
Cause: The retriever sees only the latest turn, so it searches for "the other one" instead of the delayed replacement order or pending refund. Fix: Rewrite the latest turn into a standalone query using the last few conversation turns. Keep the raw user message for the generator, but search with the rewritten query.
Cause: The hypothetical document can invent precise identifiers, dates, or prices. That invented detail may pull retrieval toward the wrong neighborhood. Fix: Route exact-match queries to normal hybrid retrieval. Use HyDE for conceptual or vocabulary-mismatched questions where a document-style proxy helps more than it hurts.
Cause: True Self-RAG trains a generator to emit retrieval and critique tokens during generation. Separate prompted grading calls can mimic the control loop, but they don't create the same reflection-token model. Fix: Name the system honestly. Call it a prompted critique loop or CRAG-style evaluator unless you're hosting a model trained with Self-RAG reflection tokens.
Cause: Rewriting, HyDE generation, dense retrieval, sparse retrieval, reranking, grading, and final synthesis can become a long sequential path. Fix: Parallelize independent retrievals, cap reranker candidates, cache repeated rewrites or hypothetical documents, and measure retrieval and generation latency separately.
Implementing all these techniques at once is overkill. A progressive approach is best:
1. Start with Query Rewriting and Hybrid Search (Dense + Keyword).
2. Add a Cross-Encoder Reranker (e.g., Cohere Rerank or BGE-Reranker) after retrieval.
3. Use a Router to classify queries, for example sending exact-match lookups to hybrid search and conceptual questions to HyDE.
4. If accuracy is still insufficient, add a prompted critique loop or a CRAG-style evaluator before investing in a true Self-RAG model.
Latency is the production constraint in advanced RAG. A pipeline with rewriting, HyDE, retrieval, reranking, grading, and generation turns one answer into several sequential model and retrieval steps. Stream the final generation, fetch dense and sparse results in parallel, and cache reusable artifacts when traffic is repetitive.
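Here is a minimal sketch of two of those tactics, assuming dense and sparse retrieval are independent I/O-bound calls. `dense_search`, `sparse_search`, and `cached_rewrite` are hypothetical stand-ins with simulated delays; only `ThreadPoolExecutor` and `lru_cache` come from the Python standard library.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


def dense_search(query: str) -> list[str]:
    time.sleep(0.05)  # simulated vector-index latency
    return ["carrier-capacity", "sla"]


def sparse_search(query: str) -> list[str]:
    time.sleep(0.05)  # simulated keyword-index latency
    return ["sla", "backorder"]


@lru_cache(maxsize=1024)
def cached_rewrite(query: str) -> str:
    # Stand-in for a rewrite model call; identical inputs reuse the cached result.
    return query.strip().lower()


def parallel_retrieve(query: str) -> tuple[list[str], list[str]]:
    search_text = cached_rewrite(query)
    # The two searches do not depend on each other, so run them concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(dense_search, search_text)
        sparse_future = pool.submit(sparse_search, search_text)
        return dense_future.result(), sparse_future.result()


started = time.perf_counter()
dense_ids, sparse_ids = parallel_retrieve("Peak-season delivery delays")
elapsed_ms = (time.perf_counter() - started) * 1000
print(f"dense={dense_ids} sparse={sparse_ids} elapsed_ms={elapsed_ms:.0f}")
```

The cache key here is only the raw query string. A rewrite that depends on chat history would need the history in the key as well, otherwise two conversations could share a rewrite they should not.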
Don't release a more elaborate route because it improves a few anecdotes. Compare supported-answer quality and latency on a labeled set, then release only paths that meet both requirements.
```python
evaluations = [
    {"route": "rewrite+hybrid", "supported_accuracy": 0.91, "p95_ms": 180},
    {"route": "hyde+rerank", "supported_accuracy": 0.94, "p95_ms": 260},
    {"route": "agentic-loop", "supported_accuracy": 0.95, "p95_ms": 710},
]
minimum_supported_accuracy = 0.93
maximum_p95_ms = 350

# Keep only routes that satisfy both the quality and the latency requirement.
eligible = [
    row for row in evaluations
    if row["supported_accuracy"] >= minimum_supported_accuracy
    and row["p95_ms"] <= maximum_p95_ms
]
# max() raises ValueError when no route qualifies; real release tooling should handle that case.
released = max(eligible, key=lambda row: row["supported_accuracy"])
print(f"released route: {released['route']}")
print(f"supported_accuracy={released['supported_accuracy']:.2f} p95_ms={released['p95_ms']}")
```

```text
released route: hyde+rerank
supported_accuracy=0.94 p95_ms=260
```

The best way to internalize these techniques is to apply them to a small, concrete dataset. Here's a focused exercise you can complete in under an hour.
Setup: Collect five real customer messages from a returns chatbot (or write realistic ones). Include at least one ambiguous message that needs conversation history, one complex comparison, and one vague keyword.
Step 1, Rewrite: Write a Python function that takes a customer message plus the last two turns of chat history and outputs a standalone query. Run it on your five messages and inspect the results. Does the rewritten query contain the full intent?
Step 2, Measure: For each original message, manually decide which of your internal policy documents should be retrieved. Then run the rewritten query through a simple dense-retrieval setup (even a small embedding model like all-MiniLM-L6-v2 against a dozen policy chunks). Count how many of the top-3 results match your manual gold set. The rewrite should improve hit rate for the ambiguous and vague cases.
Step 3, Diagnose: Pick one message where retrieval still fails. Is the problem vocabulary mismatch (try multi-query expansion), semantic asymmetry (try HyDE), or weak evidence (try a CRAG-style evaluator)? Implement the fix and measure again.
Expected outcome: Record which interventions improve top-3 evidence hits or catch weak retrieval on your examples, and which add latency without a gain. A small exercise may not reproduce paper results; its value is exposing the measurement loop.
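Here is a minimal sketch of the Step 2 measurement, assuming you have recorded the gold chunk IDs and the ranked retrieval output for each message. The message IDs, chunk IDs, and the `hit_rate_at_k` helper are made-up placeholders for illustration, not results. The metric counts a message as a hit when at least one gold chunk appears in the top k.

```python
def hit_rate_at_k(
    gold: dict[str, set[str]], retrieved: dict[str, list[str]], k: int = 3
) -> float:
    """Fraction of messages whose top-k results contain at least one gold chunk."""
    if not gold:
        return 0.0
    hits = sum(
        1
        for message_id, gold_ids in gold.items()
        if gold_ids & set(retrieved.get(message_id, [])[:k])
    )
    return hits / len(gold)


# Hypothetical gold labels: which policy chunk each message should retrieve.
gold = {
    "m1": {"damage-claims"},
    "m2": {"return-labels"},
    "m3": {"perishable-refunds"},
}
# Hypothetical ranked results when searching with the raw message.
raw_results = {
    "m1": ["faq-general", "contact-us", "shipping-times"],
    "m2": ["return-labels", "barcode-specs", "faq-general"],
    "m3": ["generic-refunds", "faq-general", "contact-us", "perishable-refunds"],
}
# Hypothetical ranked results when searching with the rewritten query.
rewritten_results = {
    "m1": ["damage-claims", "replacements", "faq-general"],
    "m2": ["return-labels", "returns-policy", "faq-general"],
    "m3": ["generic-refunds", "perishable-refunds", "faq-general"],
}

print(f"raw hit rate@3: {hit_rate_at_k(gold, raw_results):.2f}")
print(f"rewritten hit rate@3: {hit_rate_at_k(gold, rewritten_results):.2f}")
```

Running the same function on every intervention keeps the comparison honest: the gold set stays fixed and only the retrieval output changes.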
Once you understand the mechanics, the important skill is knowing when each technique pays off and when it adds unnecessary complexity. Here are three critical trade-offs engineers face when moving from prototype to production.
HyDE adds at least one extra generation call before retrieval. That cost is only worth it when measured retrieval quality improves enough to justify it. A smaller instruction model may be adequate for the hypothetical document, but evaluate it rather than assuming equivalence. Cache repeated proxy documents only when generation configuration and source policy make reuse valid. For exact-match lookups like tracking numbers, order IDs, or dates, skip HyDE.
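One way to make that reuse condition explicit is to put every input that affects the proxy document into the cache key. This is a sketch under assumptions: the `GenerationConfig` fields, the `policy_version` string, and the in-memory dictionary are hypothetical, and a real deployment would choose its own invalidation signals.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class GenerationConfig:
    model_name: str
    prompt_version: str
    temperature: float


def proxy_cache_key(query: str, config: GenerationConfig, policy_version: str) -> str:
    payload = {
        "query": " ".join(query.lower().split()),  # normalize case and whitespace
        "config": asdict(config),
        "policy_version": policy_version,
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


_cache: dict[str, str] = {}


def cached_proxy(
    query: str,
    config: GenerationConfig,
    policy_version: str,
    generate: Callable[[str], str],
) -> str:
    key = proxy_cache_key(query, config, policy_version)
    if key not in _cache:
        _cache[key] = generate(query)
    return _cache[key]


calls: list[str] = []


def fake_generate(query: str) -> str:
    # Stand-in for the hypothetical-document model call.
    calls.append(query)
    return f"Hypothetical policy paragraph for: {query}"


config = GenerationConfig("small-instruct", "hyde-v1", 0.0)
cached_proxy("Perishable refund rules", config, "2024-06", fake_generate)
cached_proxy("  perishable   REFUND rules ", config, "2024-06", fake_generate)  # same key after normalization
cached_proxy("Perishable refund rules", config, "2024-07", fake_generate)  # new policy version, regenerate
print(f"generator calls: {len(calls)}")
```

Changing the model, prompt version, or policy version produces a new key, so stale proxy documents are never served after one of those inputs changes.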
Self-RAG degrades when retrieval itself is weak or when the model's learned critique tokens stop correlating with real answer quality. It also raises inference cost because the model may retrieve on demand, branch over multiple passages, and spend extra decoding steps on critique tokens before choosing a continuation. For a simple FAQ bot, first measure a cheaper baseline such as rewrite plus reranking or a correction gate before committing to a specialized Self-RAG serving path.
As a system design, you can use HyDE to propose initial candidate passages, then let a true Self-RAG model score branches with IsRel, IsSup, and IsUse. That composition needs its own evaluation; neither mechanism guarantees the other improves it. If you don't have a reflection-token model, describe the composition as HyDE plus a CRAG-style evaluator or prompted support checks rather than Self-RAG.
[1] Ma, X., et al. (2023). Query Rewriting for Retrieval-Augmented Large Language Models. arXiv preprint.
[2] Rackauckas, Z. (2024). RAG-Fusion: a New Take on Retrieval-Augmented Generation.
[3] Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR '09.
[4] Zhou, D., et al. (2022). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. ICLR 2023.
[5] Gao, L., Ma, X., Lin, J., & Callan, J. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels. arXiv preprint.
[6] Asai, A., et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv preprint.
[7] Yan, S.-Q., et al. (2024). Corrective Retrieval Augmented Generation. arXiv preprint.
[8] Trivedi, H., Balasubramanian, N., Khot, T., & Sabharwal, A. (2022). Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions.