Advanced RAG: HyDE & Self-RAG

Learn how query rewriting, HyDE, Self-RAG, and Corrective RAG change retrieval control, and how to evaluate their cost and evidence quality.

A developer opens an internal AI docs assistant and types, "What about the other one?" They could mean the backup API key discussed earlier, the second failed batch job, or the alternative embedding endpoint. An on-call engineer can hit a similar problem by searching for "quota" and getting billing limits instead of the rate-limit policy. Search fails when the user's words don't line up with the evidence they need.

Vector database internals showed how an index finds nearby chunks quickly. Now move one layer above the index: use retrieval-augmented generation (RAG) controls to rewrite messy requests, search with HyDE (Hypothetical Document Embeddings), critique generation with Self-RAG (Self-reflective RAG), or correct weak retrieval with Corrective Retrieval-Augmented Generation (CRAG). Add the smallest intervention that repairs a measured failure.

The problem with naive RAG

A standard RAG implementation (often called "Naive RAG") typically embeds the user's raw query and fetches the top-k nearest neighbors from a vector database. This simple "retrieve-then-generate" pipeline suffers from systematic failure modes:

The query is ambiguous: "What about the other one?" relies entirely on conversation history about a failed batch job, which a stateless retriever lacks.
The query and document are semantically misaligned: A user asks "How do I rotate a key?" but the documentation is titled "Credential Rotation Workflow". The embedding of the conversational question may not match the embedding of the formal security document.
The retriever returns irrelevant documents: Embedding geometry, query wording, approximate nearest-neighbor (ANN) settings, or corpus quality can all put SDK install notes above the rate-limit policy. If irrelevant chunks enter context, the generator may produce an unsupported answer.
The model doesn't know when to retrieve: Naive RAG retrieves for every query, even "Hi" or "What is 2+2?", wasting tokens and latency.

These failure modes are common in real applications. A useful production pipeline measures them separately, then adds the smallest retrieval intervention that repairs the observed failure rather than making every question pay for an elaborate loop.
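One way to measure those failure modes separately is to label each failed evaluation case with a single cause and count the labels. The sketch below assumes a hypothetical hand-labeled evaluation log; the label names and the `EvalCase` fields are illustrative, not a standard schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels that mirror the four failure modes listed above.
FAILURE_LABELS = {
    "ambiguous_query",
    "vocabulary_mismatch",
    "irrelevant_retrieval",
    "unneeded_retrieval",
}

@dataclass(frozen=True)
class EvalCase:
    query: str
    failure: Optional[str]  # None means the case passed.

def failure_breakdown(cases: list[EvalCase]) -> dict[str, float]:
    """Return each failure label's share of all evaluated cases."""
    if not cases:
        return {}
    counts = Counter(case.failure for case in cases if case.failure is not None)
    unknown = set(counts) - FAILURE_LABELS
    if unknown:
        raise ValueError(f"unknown failure labels: {sorted(unknown)}")
    return {label: counts[label] / len(cases) for label in sorted(counts)}

cases = [
    EvalCase("What about the other one?", "ambiguous_query"),
    EvalCase("How do I rotate a key?", "vocabulary_mismatch"),
    EvalCase("quota", "irrelevant_retrieval"),
    EvalCase("Hi", "unneeded_retrieval"),
    EvalCase("What do I do now?", "ambiguous_query"),
    EvalCase("What is the batch endpoint URL?", None),
]

for label, share in failure_breakdown(cases).items():
    print(f"{label}: {share:.0%}")
```

A breakdown like this points at the intervention to try first: a high share of ambiguous queries argues for rewriting, while a high share of unneeded retrievals argues for a retrieval gate.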

Query rewriting and decomposition

Figure: Query transformation diagram where a vague follow-up splits into rewrite, expansion, or decomposition before retrieval. Three search failures need three different transforms. Rewrite fixes ambiguity, expansion broadens recall, and decomposition turns one hard search into grounded hops.

User queries are rarely optimal for vector search. They're often short, lack context, or contain multiple distinct questions about endpoints, limits, errors, and credentials all at once.

1. Rewrite for disambiguation

An on-call lead often turns a vague Slack thread into a precise search request before querying runbooks and product docs. Query rewriting gives the model that same job. Conversational queries often rely on implicit context. Ma et al.'s Rewrite-Retrieve-Read pipeline adds a dedicated rewrite stage before retrieval instead of treating the user's wording as sacred [1] (Query Rewriting for Retrieval-Augmented Large Language Models, https://arxiv.org/abs/2305.14283). In a chat product, a practical variant is to rewrite the latest turn into a standalone, search-oriented question.

Here's a concrete example. A developer sends two messages:

Developer: "My API key leaked in a public issue."
Developer: "What do I do now?"

A stateless retriever searching for "What do I do now?" would pull generic onboarding articles. A rewrite step instead produces:

Standalone query: "How do I revoke and rotate a leaked API key and audit recent usage?"

In production, the rewrite model can be any instruction-tuned LLM behind a small interface. Start with a copy-runnable sketch that uses a deterministic fake model so the parsing contract can be tested locally without API keys.

1-rewrite-for-disambiguation.py

from typing import Protocol, TypedDict

class ChatMessage(TypedDict):
    role: str
    content: str

class RewriteModel(Protocol):
    def rewrite(self, latest_query: str, history_text: str) -> str: ...

class FakeRewriteModel:
    def rewrite(self, latest_query: str, history_text: str) -> str:
        if "api key leaked" in history_text.lower() and "what do i do" in latest_query.lower():
            return "How do I revoke and rotate a leaked API key and audit recent usage?"
        return latest_query

def rewrite_query_with_history(
    query: str, chat_history: list[ChatMessage], model: RewriteModel
) -> str:
    history_text = "\n".join(
        f"{message['role']}: {message['content']}" for message in chat_history
    )
    return model.rewrite(query, history_text).strip()

history = [
    {"role": "developer", "content": "My API key leaked in a public issue."},
]

rewritten = rewrite_query_with_history("What do I do now?", history, FakeRewriteModel())
print(rewritten)

Output

How do I revoke and rotate a leaked API key and audit recent usage?
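When a real model replaces the fake, the adapter mostly needs a prompt that pins the output to one standalone question, plus a parser that tolerates stray whitespace or quotes. The template below is a hypothetical starting point, not the prompt from Ma et al.; `build_rewrite_prompt`, `parse_rewrite`, and the template wording are assumptions for illustration.

```python
# Hypothetical prompt template; tune the wording against your own evaluation set.
REWRITE_TEMPLATE = """Rewrite the latest user message as one standalone search query.
Resolve pronouns and vague references using the conversation history.
Return only the rewritten query on a single line.

Conversation history:
{history}

Latest message:
{latest}

Standalone query:"""

def build_rewrite_prompt(latest_query: str, history_text: str) -> str:
    """Fill the template; an empty history is made explicit for the model."""
    history = history_text.strip() or "(no prior messages)"
    return REWRITE_TEMPLATE.format(history=history, latest=latest_query.strip())

def parse_rewrite(model_output: str, fallback: str) -> str:
    """Keep the first non-empty line; fall back to the original query if none."""
    for line in model_output.splitlines():
        cleaned = line.strip().strip('"')
        if cleaned:
            return cleaned
    return fallback

prompt = build_rewrite_prompt(
    "What do I do now?", "developer: My API key leaked in a public issue."
)
print(prompt)
print(parse_rewrite('\n"How do I revoke and rotate a leaked API key?"\n', "What do I do now?"))
```

Falling back to the original query keeps retrieval working when the model returns an empty or malformed response.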

2. Multi-query expansion

A single query might miss relevant documents due to vocabulary mismatch. Multi-Query Expansion generates synonymous queries to improve recall.

For example, if a user asks "How do you handle peak traffic throttling?", relevant internal documents might use terms like "rate-limit burst policy", "queue backpressure", or "autoscaling capacity". To improve recall across different vocabularies, the retriever can generate query variations automatically.

Once you fan out into several queries, you have several ranked result lists to merge. One documented pattern is RAG-Fusion, which pairs generated query variants with Reciprocal Rank Fusion (RRF) to combine the lists into one ranking [2] (RAG-Fusion: a New Take on Retrieval-Augmented Generation, https://arxiv.org/abs/2402.03367). RRF ignores raw similarity scores and sums reciprocal ranks instead, so a document that lands near the top of several lists rises even if no single list ranked it first [3] (Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods, https://dl.acm.org/doi/10.1145/1571941.1572114).

For a document $d$, the score is $\sum_{r \in R} 1 / (k + r(d))$. Here, $R$ is the set of rankings, $r(d)$ is the document's one-based position in one ranking, and $k$ is a constant that dampens the advantage of top-ranked documents. Cormack et al. fixed $k = 60$ after a pilot investigation [3]. A document missing from a returned list contributes zero from that list. RRF is the same rank-merge idea used to fuse dense and sparse results in hybrid search, reused here for query variants.

Implement multi-query expansion behind a testable boundary. The model returns one query per line, and the parser removes bullets or numbering so all variants can be searched in parallel.

2-multi-query-expansion.py

from typing import Protocol

class QueryExpansionModel(Protocol):
    def expand(self, query: str, n: int) -> str: ...

class FakeExpansionModel:
    def expand(self, query: str, n: int) -> str:
        return "\n".join(
            [
                "Rate-limit burst policy during launch traffic",
                "Queue backpressure controls for high-volume periods",
                "Autoscaling capacity for traffic spikes",
            ][:n]
        )

def clean_query_line(line: str) -> str:
    return line.strip().lstrip("-*0123456789. ").strip()

def generate_multi_queries(
    query: str, model: QueryExpansionModel, n: int = 3
) -> list[str]:
    content = model.expand(query, n)
    queries = [clean_query_line(line) for line in content.splitlines()]
    return [query for query in queries if query]

queries = generate_multi_queries(
    "How do you handle peak traffic throttling?", FakeExpansionModel(), n=3
)

for query in queries:
    print(f"- {query}")

Output

- Rate-limit burst policy during launch traffic
- Queue backpressure controls for high-volume periods
- Autoscaling capacity for traffic spikes
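Because the variants are independent, their searches can run concurrently. The sketch below uses the standard library's `ThreadPoolExecutor` with a hypothetical in-memory `search` function standing in for a real retriever call; the keyword index and document IDs are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical keyword index standing in for a network call to a retriever.
FAKE_INDEX = {
    "rate-limit": ["rate-limit-policy", "quota-increase"],
    "backpressure": ["backpressure", "rate-limit-policy"],
    "autoscaling": ["autoscaling", "rate-limit-policy"],
}

def search(query: str) -> list[str]:
    """Return the ranked IDs for the first index keyword found in the query."""
    lowered = query.lower()
    for keyword, ranked_ids in FAKE_INDEX.items():
        if keyword in lowered:
            return ranked_ids
    return []

def search_all(queries: list[str], max_workers: int = 4) -> list[list[str]]:
    """Search every variant concurrently; results keep the input order."""
    if not queries:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search, queries))

variants = [
    "Rate-limit burst policy during launch traffic",
    "Queue backpressure controls for high-volume periods",
    "Autoscaling capacity for traffic spikes",
]
ranked_lists = search_all(variants)

for variant, ranking in zip(variants, ranked_lists):
    print(f"{variant} -> {ranking}")
```

`pool.map` returns results in input order, so each ranking stays aligned with the query variant that produced it.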

Generating variants isn't enough; the system must merge their result lists without assuming similarity scores from separate searches are directly comparable. RRF provides a deterministic rank-based merge:

fuse-query-variants-with-rrf.py

def reciprocal_rank_fusion(rankings: list[list[str]], rank_constant: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, document_id in enumerate(ranking, start=1):
            scores[document_id] = scores.get(document_id, 0.0) + 1 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))

ranked_lists = [
    ["rate-limit-policy", "quota-increase", "autoscaling"],
    ["autoscaling", "rate-limit-policy", "backpressure"],
    ["rate-limit-policy", "batch-api", "autoscaling"],
]

for document_id, score in reciprocal_rank_fusion(ranked_lists)[:3]:
    print(f"{document_id}: {score:.4f}")

Output

rate-limit-policy: 0.0489
autoscaling: 0.0481
batch-api: 0.0161
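The top two scores in that output can be checked by hand with $k = 60$. The document `rate-limit-policy` sits at ranks 1, 2, and 1 across the three lists, and `autoscaling` sits at ranks 3, 1, and 3:

$$
\begin{aligned}
\text{score}(\texttt{rate-limit-policy}) &= \frac{1}{60+1} + \frac{1}{60+2} + \frac{1}{60+1} \approx 0.0489 \\
\text{score}(\texttt{autoscaling}) &= \frac{1}{60+3} + \frac{1}{60+1} + \frac{1}{60+3} \approx 0.0481
\end{aligned}
$$

`batch-api` appears once at rank 2, so it scores $1/62 \approx 0.0161$. `quota-increase` ties it at the same rank in a different list, and the sort key breaks that tie alphabetically by document ID.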

3. Query decomposition (least-to-most)

Complex questions often require multiple retrieval steps, as a single search may fail to gather all the necessary facts. Decomposition breaks a complex query into a series of simpler sub-queries that can be executed sequentially or in parallel.

For instance, consider the following analytical query:

"Compare synchronous and batch embedding API latency, limits, and retry behavior."

A standard retriever might struggle to find a single document containing this exact comparison. Instead, we decompose it into sub-questions:

"What latency does the synchronous embedding endpoint target?"
"What throughput and completion limits does the batch embedding endpoint have?"
"What retry behavior applies to each endpoint?"
"Compare synchronous and batch embeddings given the latency, limit, and retry facts."

This decomposition pattern is closely related to least-to-most prompting, which decomposes hard problems into simpler steps [4] (Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, https://arxiv.org/abs/2205.10625). In RAG, teams reuse that idea for retrieval coverage rather than chain-of-thought supervision. Answering simpler questions first can gather explicit evidence for each sub-fact before synthesis, but it also adds queries and possible error propagation. Measure supported-answer accuracy and latency before releasing it.

Here's a simple implementation. A real model would produce the sub-questions, but the rest of the system should only depend on the line-oriented contract.

3-query-decomposition-least-to-most.py

from typing import Protocol

class DecompositionModel(Protocol):
    def decompose(self, query: str) -> str: ...

class FakeDecompositionModel:
    def decompose(self, query: str) -> str:
        return "\n".join(
            [
                "What latency does the synchronous embedding endpoint target?",
                "What throughput and completion limits does the batch embedding endpoint have?",
                "What retry behavior applies to each endpoint?",
                "How do synchronous and batch embeddings compare given those facts?",
            ]
        )

def decompose_query(query: str, model: DecompositionModel) -> list[str]:
    lines = [
        line.strip().lstrip("-*0123456789. ").strip()
        for line in model.decompose(query).splitlines()
    ]
    return [line for line in lines if line]

sub_questions = decompose_query(
    "Compare synchronous and batch embedding API latency, limits, and retry behavior.",
    FakeDecompositionModel(),
)

for index, question in enumerate(sub_questions, start=1):
    print(f"{index}. {question}")

Output

1. What latency does the synchronous embedding endpoint target?
2. What throughput and completion limits does the batch embedding endpoint have?
3. What retry behavior applies to each endpoint?
4. How do synchronous and batch embeddings compare given those facts?
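Once the sub-questions exist, the retrieval hops still have to run and feed the final synthesis step. The sketch below assumes a hypothetical `FAKE_DOCS` lookup in place of a real retriever and treats the last sub-question as the synthesis prompt rather than a search; both choices are illustrative.

```python
from dataclasses import dataclass

# Hypothetical evidence store keyed by keyword; a real system would call a retriever.
FAKE_DOCS = {
    "latency": "Synchronous embeddings target interactive latency.",
    "limits": "Batch embedding jobs have separate throughput and completion limits.",
    "retry": "Both endpoints expect clients to retry with backoff.",
}

@dataclass(frozen=True)
class Hop:
    question: str
    evidence: list[str]

def retrieve(question: str) -> list[str]:
    """Return every fake passage whose keyword appears in the question."""
    lowered = question.lower()
    return [text for keyword, text in FAKE_DOCS.items() if keyword in lowered]

def run_hops(sub_questions: list[str]) -> tuple[list[Hop], str]:
    """Retrieve for all but the last sub-question; return hops and the synthesis question."""
    if not sub_questions:
        raise ValueError("at least one sub-question is required")
    *retrieval_questions, synthesis_question = sub_questions
    hops = [Hop(question, retrieve(question)) for question in retrieval_questions]
    return hops, synthesis_question

hops, synthesis = run_hops(
    [
        "What latency does the synchronous embedding endpoint target?",
        "What throughput and completion limits does the batch embedding endpoint have?",
        "What retry behavior applies to each endpoint?",
        "How do synchronous and batch embeddings compare given those facts?",
    ]
)

for hop in hops:
    status = "ok" if hop.evidence else "MISSING EVIDENCE"
    print(f"[{status}] {hop.question}")
print(f"synthesize: {synthesis}")
```

Flagging hops that return no evidence makes error propagation visible before the synthesis step runs on an incomplete set of facts.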

HyDE (Hypothetical Document Embeddings)

Standard dense retrieval matches a query embedding to document embeddings. Queries are often short and interrogative, while indexed passages are longer and declarative. HyDE changes how the query vector is built when that mismatch hurts retrieval.

HyDE (Hypothetical Document Embeddings) bridges this gap by generating one or more hypothetical documents, embedding those document-style proxies, and retrieving real corpus passages near the resulting vector [5] (Precise Zero-Shot Dense Retrieval without Relevance Labels, https://arxiv.org/abs/2212.10496). In Gao et al., the model is prompted to "write a document that answers the question." Suppose you need a specific throughput-limit clause but can't remember the exact wording. Instead of asking the knowledge base "Can we handle more traffic?", you write a one-paragraph summary of what you expect the policy to say and ask, "Where are documents that look like this paragraph?"

Here's a concrete API-doc example. A developer asks:

Query: "Can we increase embedding throughput during a launch week?"

A standard dense retriever might embed the short question and pull generic "embedding API overview" articles that don't mention launch traffic. HyDE instead prompts the model to write a hypothetical policy paragraph:

Hypothetical document: "Embedding throughput increases: teams can request a temporary tokens-per-minute quota increase, use the batch embedding endpoint for offline jobs, shard requests across approved projects, and apply exponential backoff when rate limits are returned..."

That generated paragraph is longer, declarative, and uses vocabulary like "tokens-per-minute quota", "batch embedding endpoint", and "exponential backoff". It's an illustrative search proxy, not an answer: any generated detail may be wrong, and only retrieved source text may support the final response.

How HyDE works

The original HyDE pipeline has three phases:

Generate: Prompt an instruction-tuned language model to sample one or more hypothetical passages for the query. The passages may contain fabricated details, but they can still capture relevance patterns that look like real documents [5].
Embed: Encode each sampled passage with a document encoder such as Contriever, then average the vectors to estimate the expected hypothetical-document embedding [5].
Retrieve: Search the corpus with that averaged vector. The paper's key intuition is that the encoder's dense bottleneck filters much of the fabricated detail while preserving the semantic neighborhood of relevant documents [5].

The production version uses a real generator and dense encoder. This small runnable version deliberately generates one proxy and uses keyword sets so the retrieval contract is visible: the query first becomes a document-like paragraph, then the retriever searches with that paragraph rather than the original question.

how-hyde-works.py

from dataclasses import dataclass
from typing import Protocol

class HypotheticalDocGenerator(Protocol):
    def generate(self, query: str) -> str: ...

@dataclass(frozen=True)
class Chunk:
    id: str
    text: str

class FakeHyDEGenerator:
    def generate(self, query: str) -> str:
        return (
            "Embedding throughput increases require a temporary tokens-per-minute "
            "quota request, batch embedding jobs, and exponential backoff for rate limits."
        )

class KeywordRetriever:
    def __init__(self, chunks: list[Chunk]) -> None:
        self.chunks = chunks

    def search(self, search_text: str, k: int = 2) -> list[Chunk]:
        query_terms = set(
            search_text.lower().replace(",", " ").replace(".", " ").split()
        )

        def score(chunk: Chunk) -> int:
            chunk_terms = set(
                chunk.text.lower().replace(",", " ").replace(".", " ").split()
            )
            return len(query_terms & chunk_terms)

        return sorted(self.chunks, key=score, reverse=True)[:k]

def hyde_retrieve(
    query: str, generator: HypotheticalDocGenerator, retriever: KeywordRetriever
) -> list[Chunk]:
    hypothetical_doc = generator.generate(query)
    return retriever.search(hypothetical_doc, k=2)

chunks = [
    Chunk("generic-embeddings", "The embedding API converts text into vectors for search."),
    Chunk(
        "throughput-quota",
        "Launch traffic needs a temporary tokens-per-minute quota request and approval.",
    ),
    Chunk("batch-embeddings", "Batch embedding jobs support offline workloads with retry backoff."),
]

matches = hyde_retrieve(
    "Can we increase embedding throughput during a launch week?",
    FakeHyDEGenerator(),
    KeywordRetriever(chunks),
)

match_ids = [chunk.id for chunk in matches]
print(f"retrieved: {match_ids}")

Output

retrieved: ['throughput-quota', 'batch-embeddings']
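The single-proxy version above skips the averaging step from the Embed phase. The sketch below adds it with a toy bag-of-words embedder over a shared vocabulary, which is an assumption made only so the example runs without a model; a real pipeline would call a dense encoder such as Contriever instead, and the two proxy paragraphs are hand-written stand-ins for sampled generations.

```python
import math

def tokenize(text: str) -> list[str]:
    return text.lower().replace(",", " ").replace(".", " ").split()

def build_vocabulary(texts: list[str]) -> dict[str, int]:
    """Assign one vector position to every distinct word in the given texts."""
    words = sorted({word for text in texts for word in tokenize(text)})
    return {word: index for index, word in enumerate(words)}

def toy_embed(text: str, vocabulary: dict[str, int]) -> list[float]:
    """L2-normalized word counts; a stand-in for a dense encoder, not a real one."""
    vector = [0.0] * len(vocabulary)
    for word in tokenize(text):
        if word in vocabulary:
            vector[vocabulary[word]] += 1.0
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm else vector

def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Average the sampled proxy embeddings component by component."""
    if not vectors:
        raise ValueError("at least one vector is required")
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(left: list[float], right: list[float]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    norms = math.sqrt(sum(a * a for a in left)) * math.sqrt(sum(b * b for b in right))
    return dot / norms if norms else 0.0

# Hand-written stand-ins for two sampled hypothetical documents.
proxies = [
    "Temporary tokens-per-minute quota request for launch traffic needs approval.",
    "Batch embedding jobs with retry backoff handle offline workloads.",
]
corpus = {
    "generic-embeddings": "The embedding API converts text into vectors for search.",
    "throughput-quota": "Launch traffic needs a temporary tokens-per-minute quota request and approval.",
    "batch-embeddings": "Batch embedding jobs support offline workloads with retry backoff.",
}

vocabulary = build_vocabulary(proxies + list(corpus.values()))
query_vector = mean_vector([toy_embed(proxy, vocabulary) for proxy in proxies])
ranked = sorted(
    corpus,
    key=lambda doc_id: cosine(query_vector, toy_embed(corpus[doc_id], vocabulary)),
    reverse=True,
)
print(ranked)
```

Averaging lets each sampled proxy pull the query vector toward a different relevant region, so both specific chunks outrank the generic overview here. The toy proxies share most of their words with the target chunks, which makes the effect easier to see than it would be with a real encoder.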

The HyDE flow is simple but important: change the search object first, then retrieve.

Figure: HyDE retrieval flow where a short question becomes a hypothetical document proxy, which retrieves real evidence before answer synthesis. HyDE changes the search object, not the grounding rule. The system embeds a generated proxy, retrieves real chunks near it, and answers only from retrieved evidence.

Across the paper's zero-shot experiments, HyDE improved retrieval over the underlying Contriever or mContriever baseline [5]. The mechanism can still fail on a new corpus, especially when a proxy invents a high-impact identifier or policy detail.

When to use HyDE

HyDE was designed for zero-shot retrieval without relevance labels. Gao et al. evaluate it on web search, BEIR, and multilingual Mr. TyDi tasks [5]. Treat transfer to your corpus as a hypothesis to test, especially when users phrase questions differently from stored documents.

In production, gate HyDE away from exact-match lookups such as IDs, dates, prices, or error codes. That's an engineering inference from the mechanism, not a claim from the paper: if a proxy invents a precise fact, retrieval can drift toward text that echoes the invention instead of the source chunk.

route-exact-lookups-around-hyde.py

import re

EXACT_LOOKUP = re.compile(r"\b(?:incident\s+[A-Z]+-\d+|error\s+[A-Z0-9-]{6,})\b", re.IGNORECASE)

def retrieval_route(query: str) -> str:
    if EXACT_LOOKUP.search(query):
        return "hybrid_exact_preserving"
    return "hyde_candidate"

queries = [
    "What happened in incident INC-48291?",
    "How should we plan embedding throughput for launch traffic?",
]

for query in queries:
    print(f"{retrieval_route(query)}: {query}")

Output

hybrid_exact_preserving: What happened in incident INC-48291?
hyde_candidate: How should we plan embedding throughput for launch traffic?

Self-RAG (Self-reflective RAG)

Many retrieve-then-generate pipelines fetch context once without a model-generated critique step. Self-RAG instead fine-tunes a generator to emit special reflection tokens that control retrieval and score candidate generation segments [6] (Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection, https://arxiv.org/abs/2310.11511).

Reflection tokens

Self-RAG uses one retrieval token family and three critique token families [6]:

Retrieve with values Yes, No, or Continue. This decides whether the model should fetch evidence before generating the next segment.
ISREL with labels such as Relevant or Irrelevant. This scores whether a retrieved passage is helpful for the current query or segment.
ISSUP with labels Fully supported, Partially supported, and No support. This checks whether the generated claim is grounded in retrieved evidence.
ISUSE with utility scores from 1 to 5. This measures how useful the final response is for the user.

Paper examples serialize these as inline control tags such as [Retrieve=Yes], [ISREL=Relevant], [ISSUP=Fully Supported], and [ISUSE=4]. The exact bracket syntax is less important than the four decision families, but each critique tag keeps its family name visible.

To see these tokens in action, imagine the same launch-throughput query. A Self-RAG model might generate the following token stream:

[Retrieve=Yes]: the model decides it needs evidence before answering.
It retrieves a passage: "Temporary tokens-per-minute increases require approval and backoff-aware clients..."
[ISREL=Relevant]: the passage is useful for the current segment.
The model generates: "For launch traffic, request a temporary TPM increase and keep exponential backoff enabled."
[ISSUP=Fully Supported]: the claim is grounded in the retrieved text.
[ISUSE=4]: the response is helpful but could be more detailed.

Without reflection tokens, a standard RAG pipeline might have retrieved the same passage without explicitly scoring passage relevance or claim support. Reflection tokens expose the model's predicted judgments for scoring and control; they aren't proof that a claim is true.
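If an application logs or routes on these tags, it needs to pull them out of the generated text. The sketch below parses the bracket syntax shown in this section with a regular expression; the tag names follow the four families above, while `parse_reflection_tags` and the lowercase normalization are assumptions for illustration, not the paper's tokenizer-level implementation.

```python
import re

# Matches inline tags such as [Retrieve=Yes] or [ISSUP=Fully Supported].
TAG_PATTERN = re.compile(r"\[(Retrieve|ISREL|ISSUP|ISUSE)=([^\]]+)\]")

def parse_reflection_tags(text: str) -> tuple[str, list[tuple[str, str]]]:
    """Return the text with tags removed plus the (family, value) pairs in order."""
    tags = [
        (family, value.strip().lower())
        for family, value in TAG_PATTERN.findall(text)
    ]
    # Replace each tag with a space, then collapse runs of whitespace.
    plain = " ".join(TAG_PATTERN.sub(" ", text).split())
    return plain, tags

stream = (
    "[Retrieve=Yes] [ISREL=Relevant] For launch traffic, request a temporary TPM "
    "increase and keep exponential backoff enabled. [ISSUP=Fully Supported] [ISUSE=4]"
)
plain, tags = parse_reflection_tags(stream)

print(plain)
for family, value in tags:
    print(f"{family}: {value}")
```

Keeping the family name next to each value preserves the distinction between passage relevance, claim support, and overall utility when the tags are logged.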

Architecture

Self-RAG is more than "retrieve once, then critique at the end." At inference time it can emit a retrieval decision, retrieve top-k passages on demand, generate candidate segments conditioned on different passages in parallel, and score those branches with reflection-token probabilities [6]. That segment-level beam search is what makes Self-RAG distinct from a simple prompted guardrail loop.

Figure: Self-RAG generation path where a segment decides to retrieve passages, scores competing branches with critique tokens, and continues with the best-supported branch. Self-RAG folds retrieval into decoding. Each segment can fetch passages, score competing continuations, and keep only the strongest supported branch.

Implementation note

Deploying a true Self-RAG system requires a generator specifically fine-tuned to emit these reflection tokens during generation. The paper releases 7B and 13B checkpoints trained that way [6]. Before generator training, the authors use a separate critic model to insert reflection tokens into supervised examples offline, then train the final generator to emit those tokens itself at inference time [6].

A prompted frontier model can imitate parts of this control loop, but that isn't the same system. Without reflection-token fine-tuning, you're building a Self-RAG-inspired agentic pipeline: separate routing, retrieval grading, and answer validation calls stitched together in application code.

Self-RAG cost comes from on-demand retrieval plus branching over multiple passages and scoring those branches, not from a few extra control tokens. Evaluate it where support-aware generation is worth that additional serving path.

This scoring sketch doesn't implement Self-RAG training. It shows how an inference service can rank branches once a trained model has supplied relevance, support, and utility probabilities; the weights are an explicit product policy.

score-self-rag-branches.py

from dataclasses import dataclass

@dataclass(frozen=True)
class CandidateSegment:
    text: str
    relevance_probability: float
    support_probability: float
    utility_probability: float

def branch_score(candidate: CandidateSegment) -> float:
    return (
        0.2 * candidate.relevance_probability
        + 0.6 * candidate.support_probability
        + 0.2 * candidate.utility_probability
    )

candidates = [
    CandidateSegment("temporary TPM increase requires approval", 0.92, 0.96, 0.81),
    CandidateSegment("all launch traffic is unlimited", 0.95, 0.28, 0.88),
]
winner = max(candidates, key=branch_score)

print(f"chosen segment: {winner.text}")
print(f"score: {branch_score(winner):.3f}")

Output

chosen segment: temporary TPM increase requires approval
score: 0.922

CRAG (Corrective RAG)

Corrective Retrieval-Augmented Generation (CRAG) focuses on correction after imperfect retrieval. It adds a lightweight Retrieval Evaluator that scores retrieved question-document pairs and routes the request before generation [7] (Corrective Retrieval Augmented Generation, https://arxiv.org/abs/2401.15884).

The evaluator

In the paper, the evaluator is a lightweight T5-large model fine-tuned to score each retrieved question-document pair. Those scores are then thresholded into one of three actions [7]:

Correct: At least one retrieved document clears the upper threshold. Action: refine internal knowledge and answer from it.
Incorrect: All retrieved documents fall below the lower threshold. Action: discard them and fall back to web search.
Ambiguous: The scores land between those two cases. Action: combine refined internal evidence with web results.

For example, a developer asks about the "current embedding TPM increase process," and the internal knowledge base has weak coverage. The evaluator might score the retrieved internal documents as Incorrect, triggering a web-search fallback. A production system would still need source allowlists and citation checks before trusting those web results. If internal documents are somewhat relevant but incomplete, the evaluator returns Ambiguous, and CRAG combines refined internal strips with web results.

The distinction from Self-RAG is where correction happens. CRAG doesn't train the generator to emit reflection tokens. Instead, it inserts a separate evaluator between retrieval and generation and uses that evaluator to trigger correction paths.

route-crag-evaluator-scores.py

from typing import Literal

Decision = Literal["correct", "incorrect", "ambiguous"]

def crag_action(scores: list[float], lower: float = 0.2, upper: float = 0.7) -> Decision:
    best_score = max(scores)
    if best_score > upper:
        return "correct"
    if best_score < lower:
        return "incorrect"
    return "ambiguous"

print(f"strong retrieval: {crag_action([0.81, 0.15])}")
print(f"weak retrieval: {crag_action([0.10, 0.17])}")
print(f"uncertain retrieval: {crag_action([0.45, 0.09])}")

Output

strong retrieval: correct
weak retrieval: incorrect
uncertain retrieval: ambiguous

Figure: Corrective RAG flow where an evaluator after retrieval routes evidence into internal answer, mixed repair path, or web fallback. CRAG adds routing after retrieval. The evaluator chooses an internal answer, a mixed repair path, or a web fallback before generation.

Knowledge refinement

Even relevant documents contain noise. CRAG includes a decompose-then-recompose step:

Break the document into fine-grained strips (sentences or small chunks).
Score each strip for relevance.
Concatenate only the relevant strips.
Pass this "refined knowledge" to the generator.

Here's a simplified CRAG sketch in Python. The actual paper scores retrieved documents individually and then applies thresholds. For readability, this sketch collapses that logic into a single classify helper. The refine_knowledge method then decomposes documents into smaller strips, scores those strips, and recomposes only the useful evidence before generation.

If you instantiate this class with real components and run crag.run("current embedding TPM increase process"), the evaluator might return "incorrect" because internal docs lack coverage. The pipeline would then call web_search.search(...) and pass the web results through refine_knowledge before generating the answer.

knowledge-refinement.py

from typing import Literal, Protocol

Decision = Literal["correct", "incorrect", "ambiguous"]

class SearchBackend(Protocol):
    def search(self, query: str, k: int = 5) -> list[str]: ...

class RetrievalEvaluator(Protocol):
    def classify(self, query: str, docs: list[str]) -> Decision: ...

    def is_relevant_strip(self, query: str, strip: str) -> bool: ...

class CorrectiveRAG:
    def __init__(
        self,
        vector_db: SearchBackend,
        evaluator_model: RetrievalEvaluator,
        web_search_tool: SearchBackend,
    ) -> None:
        self.vector_db = vector_db
        self.evaluator = evaluator_model
        self.web_search = web_search_tool

    def run(self, query: str) -> str:
        # Initial retrieval can be wrong because the private corpus is incomplete.
        retrieved_docs = self.vector_db.search(query, k=5)

        # The evaluator decides whether internal evidence is usable.
        decision = self.evaluator.classify(query, retrieved_docs)

        if decision == "correct":
            final_context = self.refine_knowledge(query, retrieved_docs)

        elif decision == "incorrect":
            web_results = self.web_search.search(query)
            final_context = self.refine_knowledge(query, web_results)

        else:  # ambiguous
            internal_context = self.refine_knowledge(query, retrieved_docs)
            web_context = self.refine_knowledge(query, self.web_search.search(query))
            final_context = internal_context + web_context

        return self.generate(query, final_context)

    def refine_knowledge(self, query: str, docs: list[str]) -> list[str]:
        refined_strips = []
        for doc in docs:
            strips = self.chunk_into_strips(doc)
            for strip in strips:
                if self.evaluator.is_relevant_strip(query, strip):
                    refined_strips.append(strip)
        return refined_strips

    def chunk_into_strips(self, doc: str) -> list[str]:
        # Teaching version: naive split on periods. This also splits decimals such as 0.8, so use a real sentence segmenter in production.
        return [segment.strip() for segment in doc.split('.') if segment.strip()]

    def generate(self, query: str, context: list[str]) -> str:
        if not context:
            return "No reliable evidence found."
        return f"Answer to '{query}' using: " + " ".join(context)

class FakeVectorDB:
    def search(self, query: str, k: int = 5) -> list[str]:
        return [
            "Old SDK install guide. Pin client version 0.8 for legacy projects.",
            "Deprecated quota note. Manual review was required for all increases.",
        ][:k]

class FakeWebSearch:
    def search(self, query: str, k: int = 5) -> list[str]:
        return [
            "Official API limit guide. Launch-week TPM increases require approval.",
            "Embedding clients should use exponential backoff after rate-limit errors.",
        ][:k]

class FakeEvaluator:
    def classify(self, query: str, docs: list[str]) -> Decision:
        joined_docs = " ".join(docs).lower()
        if "tpm" in joined_docs or "rate-limit" in joined_docs:
            return "correct"
        return "incorrect"

    def is_relevant_strip(self, query: str, strip: str) -> bool:
        keywords = {"tpm", "quota", "rate-limit", "embedding", "backoff"}
        strip_words = set(strip.lower().replace(",", " ").split())
        return bool(keywords & strip_words)

pipeline = CorrectiveRAG(FakeVectorDB(), FakeEvaluator(), FakeWebSearch())
answer = pipeline.run("current embedding TPM increase process")
print(answer)

Output

Answer to 'current embedding TPM increase process' using: Launch-week TPM increases require approval Embedding clients should use exponential backoff after rate-limit errors

Agentic and iterative retrieval

Self-RAG and CRAG both introduce feedback, but at different boundaries: Self-RAG can make retrieval and critique decisions while generating segments, while CRAG routes after initial retrieval. An application can generalize feedback into agentic retrieval: instead of one fixed retrieve-then-generate pass, a model with tool access searches, reads results, tests whether evidence is sufficient, and either answers or searches again with a refined query.

The mechanism is older than the agent framing. IRCoT studies one version of this pattern: retrieve, reason a step, use that step to drive the next retrieval, and repeat until the chain of evidence is complete. [8] Multi-hop questions like "Which embedding endpoint has the highest p95 latency, and what retry budget applies to that endpoint?" need exactly this, because the answer to the second part depends on resolving the first.
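A minimal sketch of that interleaved loop, assuming a toy keyword retriever and a hand-written `next_step` function standing in for the model's reasoning step. The corpus contents and every name here (`CORPUS`, `retrieve`, `next_step`, `iterative_retrieve`) are hypothetical; this illustrates the control flow, not the IRCoT implementation.

```python
from typing import Optional

# Toy corpus; the contents are invented for illustration.
CORPUS = {
    "latency-report": "Endpoint embed-large has the highest p95 latency.",
    "retry-policy": "Retry budget for embed-large is 3 attempts per request.",
    "install": "Install the SDK with the package manager of your choice.",
}


def retrieve(query: str) -> list[str]:
    """Return passages that share at least two lowercase words with the query."""
    query_words = set(query.lower().split())
    hits = []
    for text in CORPUS.values():
        passage_words = set(text.lower().rstrip(".").split())
        if len(query_words & passage_words) >= 2:
            hits.append(text)
    return hits


def next_step(evidence: list[str]) -> Optional[str]:
    """Stand-in for a reasoning step: return the next query, or None when done."""
    joined = " ".join(evidence).lower()
    if "retry budget" in joined:
        return None  # both facts are present, so stop searching
    endpoint = next((w for w in joined.split() if w.startswith("embed-")), None)
    if endpoint is None:
        return "highest p95 latency endpoint"  # first hop: resolve the endpoint
    return f"retry budget {endpoint}"  # second hop depends on the first answer


def iterative_retrieve(max_hops: int = 3) -> list[str]:
    evidence: list[str] = []
    for _ in range(max_hops):  # hard cap so the loop can't run forever
        query = next_step(evidence)
        if query is None:
            break
        for passage in retrieve(query):
            if passage not in evidence:
                evidence.append(passage)
    return evidence


evidence = iterative_retrieve()
for passage in evidence:
    print(passage)
```

The second query can't be written until the first hop names the endpoint, which is why a single retrieval pass on the original question is not enough here. The `max_hops` cap is the cheapest guard against a loop that never decides its evidence is sufficient.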

The engineering question is when to escalate up this ladder, since each rung costs latency and tokens:

Naive RAG for single-fact lookups answerable by one retrieval.
Query rewriting plus hybrid search and reranking for conversational ambiguity and vocabulary mismatch. This is a practical baseline to evaluate before more expensive routes.
HyDE behind a router for abstract or vocabulary-mismatched queries where a document-style proxy helps.
A correction gate (CRAG-style evaluator or prompted critique) when retrieval quality is inconsistent and a single bad context causes user-visible errors.
Iterative or agentic retrieval for genuinely multi-hop questions, where one retrieval pass can't gather all the facts.

Escalate only when the failure mode and evaluation demand it. On a simple FAQ lookup, an agentic loop can add latency and failure paths without improving retrieved evidence.

Comparison of advanced techniques

Choose these techniques by failure mode, not by novelty. HyDE targets semantic mismatch. Self-RAG changes retrieval timing and branch scoring inside a trained generator. CRAG adds a separate correction gate after retrieval.

Figure: Advanced RAG comparison showing HyDE rewriting the search object before retrieval, Self-RAG scoring branches during generation, and CRAG adding a correction gate after retrieval. These systems intervene at different boundaries: HyDE changes what gets searched, Self-RAG scores continuations while decoding, and CRAG routes weak retrieval before answer synthesis.

| Feature | Naive RAG | HyDE | Self-RAG | CRAG |
| --- | --- | --- | --- | --- |
| Retrieval trigger | Always | Always | Dynamic (`[Retrieve]`) | Always |
| Query representation | Raw query | Hypothetical document | Raw query + partial generation | Raw query |
| Retrieval quality check | None | None; changes query representation | Predicted relevance, support, and utility tokens | Retrieval evaluator confidence |
| External search | No | No | No | Yes (on ambiguous/incorrect retrieval) |
| Primary use case | Simple Q&A | Abstract or vocabulary-mismatched queries | High-factuality generation with a specialized model | When internal retrieval quality is inconsistent |
| Latency | Low | Medium | High | Medium-High |

Production failure modes

Advanced RAG helps only when it matches the failure. Watch for these symptoms before adding another model call.

Symptom: the query is vague, but the answer needs history

Cause: The retriever sees only the latest turn, so it searches for "the other one" instead of the backup API key or failed batch job.
Fix: Rewrite the latest turn into a standalone query using the last few conversation turns. Keep the raw user message for the generator, but search with the rewritten query.
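A minimal sketch of that fix, assuming `llm` is any callable that maps a prompt string to a completion string. The prompt wording, the `fake_llm` stub, and the sample conversation are hypothetical and exist only so the example runs without a model call.

```python
from typing import Callable

REWRITE_PROMPT = (
    "Rewrite the final user message as a standalone search query.\n"
    "Conversation:\n{history}\n"
    "Final message: {message}\n"
    "Standalone query:"
)


def rewrite_for_search(
    message: str,
    history: list[str],
    llm: Callable[[str], str],
    max_turns: int = 4,
) -> dict[str, str]:
    """Return both queries: the rewrite drives retrieval, the raw text goes to the generator."""
    recent = history[-max_turns:]  # only the last few turns
    prompt = REWRITE_PROMPT.format(history="\n".join(recent), message=message)
    rewritten = llm(prompt).strip()
    # Fall back to the raw message if the model returns nothing usable.
    return {"search_query": rewritten or message, "generator_query": message}


def fake_llm(prompt: str) -> str:
    # Hypothetical stub standing in for a real model call.
    if "backup API key" in prompt:
        return "When will the backup API key be rotated?"
    return ""


history = [
    "user: Did the primary API key rotation finish?",
    "assistant: Yes. The backup API key has not been rotated yet.",
]
queries = rewrite_for_search("What about the other one?", history, fake_llm)
print(queries["search_query"])
print(queries["generator_query"])
```

Returning both strings keeps the boundary explicit: the retriever never sees "the other one", and the generator still answers the question the user actually typed.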

Symptom: HyDE improves recall on policy questions but breaks exact lookups

Cause: The hypothetical document can invent precise identifiers, dates, or prices. That invented detail may pull retrieval toward the wrong neighborhood.
Fix: Route exact-match queries to normal hybrid retrieval. Use HyDE for conceptual or vocabulary-mismatched questions where a document-style proxy helps more than it hurts.

Symptom: Self-RAG-inspired code works in prompts but isn't true Self-RAG

Cause: True Self-RAG trains a generator to emit retrieval and critique tokens during generation. Separate prompted grading calls can mimic the control loop, but they don't create the same reflection-token model.
Fix: Name the system honestly. Call it a prompted critique loop or CRAG-style evaluator unless you're hosting a model trained with Self-RAG reflection tokens.

Symptom: accuracy rises, but users feel the product is slow

Cause: Rewriting, HyDE generation, dense retrieval, sparse retrieval, reranking, grading, and final synthesis can become a long sequential path.
Fix: Parallelize independent retrievals, cap reranker candidates, cache repeated rewrites or hypothetical documents, and measure retrieval and generation latency separately.
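One way to run independent retrievals concurrently and time the retrieval stage on its own is sketched here with `asyncio`. The two search functions are hypothetical stand-ins that sleep to simulate network latency; real clients would replace them.

```python
import asyncio
import time


async def dense_search(query: str) -> list[str]:
    await asyncio.sleep(0.05)  # simulated network latency
    return [f"dense hit for {query}"]


async def sparse_search(query: str) -> list[str]:
    await asyncio.sleep(0.05)  # simulated network latency
    return [f"sparse hit for {query}"]


async def retrieve_parallel(query: str) -> tuple[list[str], dict[str, float]]:
    """Run both retrievers concurrently and report retrieval time separately."""
    start = time.perf_counter()
    dense, sparse = await asyncio.gather(dense_search(query), sparse_search(query))
    elapsed_ms = (time.perf_counter() - start) * 1000
    return dense + sparse, {"retrieval_ms": elapsed_ms}


docs, timings = asyncio.run(retrieve_parallel("embedding rate limit"))
print(docs)
print(f"retrieval_ms={timings['retrieval_ms']:.0f}")
```

Because the two calls don't depend on each other, the stage takes roughly as long as the slower retriever rather than the sum of both. Recording `retrieval_ms` apart from generation time shows which stage to optimize first.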

Practical implementation strategy

Implementing all of these techniques at once adds latency and failure paths before you know which ones your failures need. Add them in stages:

Phase 1: Stable baseline

Start with Query Rewriting and Hybrid Search (dense retrieval + sparse keyword retrieval).

Why: Rewriting addresses conversational ambiguity, and hybrid search covers vocabulary mismatch that dense retrieval alone can miss.
Cost: Low (1 extra LLM call for rewriting).
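Hybrid search needs a way to merge the dense and sparse rankings. One common choice is Reciprocal Rank Fusion (RRF), sketched here under the assumption that each retriever returns an ordered list of document IDs. The IDs are invented, and `k = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)), with rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_ranking = ["rate-limit-policy", "sdk-install", "billing-limits"]
sparse_ranking = ["billing-limits", "rate-limit-policy", "quota-faq"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

RRF uses only rank positions, so it doesn't need the dense and sparse scores to be on a comparable scale. Documents that both retrievers rank well rise to the top.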

Phase 2: Add reranking

Add a cross-encoder reranker after retrieval.

Why: Can improve precision of the final candidate set when initial retrieval has decent recall.
Cost: Moderate. Cross-encoders are slower than initial retrieval, so keep candidate count small.
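The candidate cap can be sketched independently of any particular reranker. In this example `overlap_score` is a hypothetical word-overlap scorer standing in for a cross-encoder, and the candidate strings are invented.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    max_candidates: int = 20,
    top_n: int = 3,
) -> list[str]:
    """Score only the first max_candidates results, then keep the top_n."""
    capped = candidates[:max_candidates]  # bounds the number of expensive scoring calls
    ranked = sorted(capped, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_n]


def overlap_score(query: str, doc: str) -> float:
    # Hypothetical stand-in for a cross-encoder relevance score.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


candidates = [
    "SDK install guide",
    "Billing limits overview",
    "Embedding rate limit policy for launch week",
    "Rate limit errors and backoff",
]
query = "embedding rate limit policy"
print(rerank(query, candidates, overlap_score, max_candidates=4, top_n=2))
print(rerank(query, candidates, overlap_score, max_candidates=2, top_n=2))
```

The second call shows the trade-off: with a cap of 2, the relevant passages never reach the reranker, so reranking can't rescue weak first-stage recall. Choose the cap from measured recall at that depth, not from latency alone.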

Phase 3: Specialized handling (HyDE/router)

Use a Router to classify queries.

If query is conceptual or abstract (e.g., "Explain launch-week embedding throughput planning"): Use HyDE, since vocabulary mismatch is likely.
If query is factual or precise (e.g., "What happened in incident INC-48291?"): Use standard retrieval, since HyDE risks hallucinating the exact identifier.
Why: Optimizes for different query types without applying HyDE to everything.
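A rule-based router is one inexpensive starting point before training a classifier. The sketch below assumes exact-match lookups can be spotted by surface patterns. The patterns and the `route_query` name are hypothetical and need tuning to your own identifiers.

```python
import re

# Hypothetical patterns for exact-match signals.
EXACT_MATCH_PATTERNS = [
    re.compile(r"\b[A-Z]{2,}-\d+\b"),      # ticket or incident IDs such as INC-48291
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # ISO dates
    re.compile(r"\$\d"),                   # prices
    re.compile(r"\b(?:error|code)\s+\d{3,}\b", re.IGNORECASE),  # numeric error codes
]


def route_query(query: str) -> str:
    """Return 'standard' for exact-match lookups and 'hyde' otherwise."""
    if any(pattern.search(query) for pattern in EXACT_MATCH_PATTERNS):
        return "standard"
    # Simplification: everything without an exact-match signal is treated as conceptual.
    return "hyde"


for example in [
    "What happened in incident INC-48291?",
    "Explain launch-week embedding throughput planning",
]:
    print(route_query(example), "<-", example)
```

Defaulting every other query to HyDE is a simplification. A production router would likely add a third, plain-retrieval route and would be checked against labeled queries before release.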

Phase 4: Add correction loops

If accuracy is still insufficient, add a prompted critique loop or a CRAG-style evaluator before investing in a true Self-RAG model.

Prompt the model or a lightweight evaluator to grade retrieved documents before answer generation.
If evidence is weak, trigger a rewrite, second retrieval pass, or web fallback.
Why: Can capture part of the reliability gain without reflection-token fine-tuning; confirm the gain on your own evaluation set.

Latency is the production constraint in advanced RAG. A pipeline with rewriting, HyDE, retrieval, reranking, grading, and generation turns one answer into several sequential model and retrieval steps. Stream the final generation, fetch dense and sparse results in parallel, and cache reusable artifacts when traffic is repetitive.

Don't release a more elaborate route because it improves a few anecdotes. Compare supported-answer quality and latency on a labeled set, then release only paths that meet both requirements.

release-a-retrieval-route-from-evals.py

evaluations = [
    {"route": "rewrite+hybrid", "supported_accuracy": 0.91, "p95_ms": 180},
    {"route": "hyde+rerank", "supported_accuracy": 0.94, "p95_ms": 260},
    {"route": "agentic-loop", "supported_accuracy": 0.95, "p95_ms": 710},
]
minimum_supported_accuracy = 0.93
maximum_p95_ms = 350

eligible = [
    row for row in evaluations
    if row["supported_accuracy"] >= minimum_supported_accuracy
    and row["p95_ms"] <= maximum_p95_ms
]
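if not eligible:
    # max() raises ValueError on an empty list, so stop with a clear message instead.
    raise SystemExit("no route meets both thresholds; keep the current baseline")
released = max(eligible, key=lambda row: row["supported_accuracy"])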
print(f"released route: {released['route']}")
print(f"supported_accuracy={released['supported_accuracy']:.2f} p95_ms={released['p95_ms']}")

Output

released route: hyde+rerank
supported_accuracy=0.94 p95_ms=260

Try it yourself

Apply these techniques to a small, concrete dataset. Here's a focused exercise you can complete in under an hour.

Exercise: Build a query rewriter for a developer-support assistant

Setup: Collect five real developer-support messages from an API docs assistant (or write realistic ones). Include at least one ambiguous message that needs conversation history, one complex comparison, and one vague keyword.

Step 1, Rewrite: Write a Python function that takes a developer message plus the last two turns of chat history and outputs a standalone query. Run it on your five messages and inspect the results. Does the rewritten query contain the full intent?

Step 2, Measure: For each original message, manually decide which of your internal policy documents should be retrieved. Then run the rewritten query through a simple dense-retrieval setup (even a small embedding model like all-MiniLM-L6-v2 against a dozen policy chunks). Count how many of the top-3 results match your manual gold set. The rewrite should improve hit rate for the ambiguous and vague cases.

Step 3, Diagnose: Pick one message where retrieval still fails. Is the problem vocabulary mismatch (try multi-query expansion), semantic asymmetry (try HyDE), or weak evidence (try a CRAG-style evaluator)? Implement the fix and measure again.

Expected outcome: Record which interventions improve top-3 evidence hits or catch weak retrieval on your examples, and which add latency without a gain. A small exercise may not reproduce paper results; its value is exposing the measurement loop.
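The measurement in Step 2 can be scored with a few lines of code. This sketch assumes you have stored, for each message, the ranked document IDs from retrieval and your manual gold set. The IDs and results are invented placeholders, not measurements.

```python
def hit_rate_at_k(
    results: dict[str, list[str]],
    gold: dict[str, set[str]],
    k: int = 3,
) -> float:
    """Fraction of queries whose top-k results contain at least one gold document."""
    if not results:
        return 0.0
    hits = sum(1 for query, ranked in results.items() if gold[query] & set(ranked[:k]))
    return hits / len(results)


# Placeholder data: replace with your own messages, gold sets, and retrieval output.
gold = {"q1": {"key-rotation"}, "q2": {"rate-limit-policy"}}
raw_results = {
    "q1": ["faq", "sdk-install", "billing"],
    "q2": ["billing", "rate-limit-policy", "faq"],
}
rewritten_results = {
    "q1": ["key-rotation", "faq", "billing"],
    "q2": ["rate-limit-policy", "billing", "faq"],
}

print(f"raw hit rate@3:       {hit_rate_at_k(raw_results, gold):.2f}")
print(f"rewritten hit rate@3: {hit_rate_at_k(rewritten_results, gold):.2f}")
```

Scoring the raw and rewritten queries against the same gold set is what makes the comparison meaningful. With only five messages, treat the numbers as a diagnostic rather than a benchmark.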

When each technique pays off

Once you understand the mechanics, the skill is knowing when each intervention earns its latency cost and when it adds unnecessary complexity. These three trade-offs show up often when engineers move from prototype to production.

When does HyDE justify its latency cost?

HyDE adds at least one extra generation call before retrieval. That cost is only worth it when measured retrieval quality improves enough to justify it. A smaller instruction model may be adequate for the hypothetical document, but evaluate it rather than assuming equivalence. Cache repeated proxy documents only when generation configuration and source policy make reuse valid. For exact-match lookups like incident IDs, error codes, or dates, skip HyDE.
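A sketch of a cache for hypothetical documents that keys on both the normalized query and the generation configuration, so a change in model, temperature, or prompt version never reuses a stale proxy. The class name, config fields, and the `small-instruct` model name are hypothetical.

```python
import hashlib
import json
from typing import Callable


class HypotheticalDocCache:
    def __init__(self, generate: Callable[[str], str], config: dict):
        self.generate = generate
        self.config = config  # e.g. model, temperature, prompt version
        self._store: dict[str, str] = {}
        self.misses = 0

    def _key(self, query: str) -> str:
        # Normalize case and whitespace, then hash the query together with the config.
        payload = {"query": " ".join(query.lower().split()), "config": self.config}
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def get(self, query: str) -> str:
        key = self._key(query)
        if key not in self._store:
            self.misses += 1  # one generation call per distinct key
            self._store[key] = self.generate(query)
        return self._store[key]


def fake_generate(query: str) -> str:
    # Hypothetical stand-in for the model call that writes the proxy document.
    return f"Hypothetical policy passage about {query}"


config = {"model": "small-instruct", "temperature": 0, "prompt_version": "v1"}
cache = HypotheticalDocCache(fake_generate, config)
first = cache.get("Rate limit policy")
second = cache.get("rate  limit policy")
print(first == second, cache.misses)
```

Whether reuse is valid at all still depends on source policy: if the underlying corpus or access rules differ per user, the key would also need to carry that scope.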

When does Self-RAG degrade instead of help?

Self-RAG degrades when retrieval itself is weak or when the model's learned critique tokens stop correlating with real answer quality. It also raises inference cost because the model may retrieve on demand, branch over multiple passages, and spend extra decoding steps on critique tokens before choosing a continuation. For a simple FAQ bot, first measure a cheaper baseline such as rewrite plus reranking or a correction gate before committing to a specialized Self-RAG serving path.

Can you combine HyDE and Self-RAG?

As a system design, you can use HyDE to propose initial candidate passages, then let a true Self-RAG model score branches with ISREL, ISSUP, and ISUSE. That composition needs its own evaluation; neither mechanism guarantees the other improves it. If you don't have a reflection-token model, describe the composition as HyDE plus a CRAG-style evaluator or prompted support checks rather than Self-RAG.

Mastery check

Key concepts

standalone query rewriting
multi-query expansion
query decomposition
HyDE
Self-RAG reflection tokens
CRAG evaluator
agentic retrieval escalation
retrieval versus generation latency

Evaluation rubric

Foundational: Explains why rewrite, expansion, decomposition, HyDE, Self-RAG, and CRAG solve different failures instead of treating them as one interchangeable bag of tricks
Intermediate: Implements at least one rewrite path, one HyDE-style retrieval path, and one correction gate with clear boundaries between retrieval, grading, and answer generation
Advanced: Chooses the cheapest retrieval ladder that fits the failure mode and can defend latency, factuality, and exact-match routing trade-offs

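Follow-up questions

A user asks "What about the other one?" after discussing a backup API key and a failed batch job. Which retrieval fix should you try first, and why?
When does multi-query expansion help more than HyDE?
What makes a pipeline truly Self-RAG instead of a prompted critique loop?
Why can CRAG help even if you never train a reflection-token model?
You need two facts, and the second query depends on the first answer. Which rung of the retrieval ladder fits best?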

Common pitfalls

Symptom: vague follow-up questions retrieve generic FAQ chunks. Cause: the retriever saw only the latest turn. Fix: rewrite the latest turn into a standalone query using recent history.
Symptom: HyDE improves conceptual queries but hurts incident-ID or price lookups. Cause: the hypothetical document invented precise facts and dragged retrieval off course. Fix: route exact-match queries to ordinary hybrid retrieval.
Symptom: a prompted grading loop is described as Self-RAG. Cause: the control logic looks similar at a distance. Fix: reserve the Self-RAG name for models trained to emit reflection tokens during generation.
Symptom: retrieval quality improves but user experience gets slower and more fragile. Cause: rewriting, HyDE, reranking, grading, and synthesis were stacked into one long sequential path. Fix: parallelize independent work, cap expensive stages, and measure retrieval and generation latency separately.
Symptom: teams add agentic retrieval to simple FAQ lookups. Cause: escalation happened because the technique sounded more advanced, not because the failure demanded it. Fix: climb the ladder only when one retrieval pass can't gather or validate the needed evidence.

Next Step

Continue to GraphRAG & Knowledge Graphs

There, you'll examine how Microsoft's GraphRAG architecture uses entity graphs, hierarchical community reports, and embeddings for corpus-wide relationship questions that pure vector retrieval can struggle with.

References

[1] Query Rewriting for Retrieval-Augmented Large Language Models.
Ma, X., et al. · 2023 · arXiv preprint

[2] RAG-Fusion: a New Take on Retrieval-Augmented Generation.
Rackauckas, Z. · 2024

[3] Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.
Cormack, G. V., Clarke, C. L. A., & Buettcher, S. · 2009 · SIGIR '09

[4] Least-to-Most Prompting Enables Complex Reasoning in Large Language Models.
Zhou, D., et al. · 2022 · ICLR 2023

[5] Precise Zero-Shot Dense Retrieval without Relevance Labels.
Gao, L., Ma, X., Lin, J., & Callan, J. · 2022 · arXiv preprint

[6] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.
Asai, A., et al. · 2023 · arXiv preprint

[7] Corrective Retrieval Augmented Generation.
Yan, S.-Q., et al. · 2024 · arXiv preprint

[8] Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions.
Trivedi, H., Balasubramanian, N., Khot, T., & Sabharwal, A. · 2022 · https://arxiv.org/abs/2212.10509
