Hybrid Search: Dense + Sparse

Upgrade a permission-safe RAG retriever with BM25, semantic scores, rank fusion, and recall gates for exact codes and paraphrased policy questions.

The production RAG lesson built policy-answerer-v1 around a hard rule: only permitted, current evidence may reach an answer. Its simple term-overlap retriever was easy to audit, but it misses the right policy when a support specialist asks to "refresh an expired machine credential" and the policy says "stale service-account key rotation."

Upgrade only that retrieval lane. You'll build policy-answerer-v2. The code KROT-14 gives sparse retrieval an exact signal; dense retrieval catches paraphrased meaning; and Reciprocal Rank Fusion (RRF) merges candidate lists. The authorization, freshness, citation, and abstention contract doesn't change.

One retriever can't cover both queries

Luna, an EU support specialist, needs the same policy for two different searches:

Query	Useful signal	Required evidence
`KROT-14`	Exact policy code	`eu-key-rotation-v2-rule`
`refresh expired machine credential`	Meaning close to "stale service-account key rotation"	`eu-key-rotation-v2-rule`

A word-matching index has a decisive clue for the first query and no shared vocabulary for the second. A semantic encoder can represent the second query near the policy, but an unfamiliar internal identifier may carry little useful semantic signal. Neither failure says one method is bad. They solve different recall problems.

Figure: Authorization-first hybrid retrieval overview. Current permitted records enter both the BM25 and dense lanes, blocked records stay outside the search boundary, and only permitted candidates reach fused ranking. Filter to current permitted records first, let sparse and dense search recover different misses, then fuse only permitted candidate IDs.

The safe online order is visible in the figure: define the current permitted corpus first, run both retrieval lanes only over that corpus, then fuse candidate IDs. Fusion improves recall; it doesn't widen access.

Recreate the permitted candidate universe

The lab reuses the policy shape from the previous lesson. It adds a diagnostic policy code, KROT-14, so an exact-identifier query has an unambiguous expected result. Two tempting records remain in storage but must not be searchable for Luna: a superseded revision and a restricted admin-only rule.

permitted-candidates.py

from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from math import log, sqrt
import re

@dataclass(frozen=True)
class PolicyChunk:
    chunk_id: str
    document_id: str
    parent_id: str
    version: str
    region: str
    acl_tag: str
    effective_from: date
    effective_to: date | None
    text: str

@dataclass(frozen=True)
class Caller:
    actor_id: str
    region: str
    acl_tags: frozenset[str]

EVAL_DATE = date(2026, 5, 27)
LUNA = Caller("luna-48291", "EU", frozenset({"support:eu"}))
CHUNKS = [
    PolicyChunk(
        "eu-key-rotation-v2-rule",
        "eu-access",
        "eu-access-v2",
        "eu-access/2026-04-01",
        "EU",
        "support:eu",
        date(2026, 4, 1),
        None,
        (
            "Rule KROT-14. Stale service-account keys qualify for automated "
            "rotation within 14 days when a risk signal arrives within 48 hours."
        ),
    ),
    PolicyChunk(
        "eu-key-rotation-v1-rule",
        "eu-access",
        "eu-access-v1",
        "eu-access/2025-02-01",
        "EU",
        "support:eu",
        date(2025, 2, 1),
        date(2026, 3, 31),
        "Rule KROT-14. Stale service-account keys require manual rotation within 30 days.",
    ),
    PolicyChunk(
        "restricted-admin-key-rotation",
        "admin-override-terms",
        "admin-override-terms",
        "admin-override/2026-05-01",
        "EU",
        "security:admins",
        date(2026, 5, 1),
        None,
        "ADMIN-KROT-1. Security admins may run emergency key rotation.",
    ),
    PolicyChunk(
        "eu-session-timeout-v1-rule",
        "eu-session",
        "eu-session-v1",
        "eu-session/2026-01-03",
        "EU",
        "support:eu",
        date(2026, 1, 3),
        None,
        "Idle browser sessions expire after 30 days of inactivity.",
    ),
    PolicyChunk(
        "eu-audit-rebuild-v1",
        "eu-audit",
        "eu-audit-rebuild-v1",
        "eu-audit/2026-02-10",
        "EU",
        "support:eu",
        date(2026, 2, 10),
        None,
        "Rule AUD-7. Missing audit-log shards after ingestion failure qualify for replay rebuild.",
    ),
]

def is_current(chunk: PolicyChunk, on_date: date) -> bool:
    return chunk.effective_from <= on_date and (
        chunk.effective_to is None or on_date <= chunk.effective_to
    )

def permitted_chunks(
    caller: Caller,
    chunks: list[PolicyChunk],
    on_date: date,
) -> list[PolicyChunk]:
    return [
        chunk
        for chunk in chunks
        if chunk.region == caller.region
        and chunk.acl_tag in caller.acl_tags
        and is_current(chunk, on_date)
    ]

permitted = permitted_chunks(LUNA, CHUNKS, EVAL_DATE)
permitted_ids = [chunk.chunk_id for chunk in permitted]
print("Permitted current ids:", permitted_ids)
assert "eu-key-rotation-v2-rule" in permitted_ids
assert "eu-key-rotation-v1-rule" not in permitted_ids
assert "restricted-admin-key-rotation" not in permitted_ids

Output

Permitted current ids: ['eu-key-rotation-v2-rule', 'eu-session-timeout-v1-rule', 'eu-audit-rebuild-v1']

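Every ranking lane below searches permitted, not CHUNKS. hybrid_rank accepts the raw list only so that it can apply the same permitted_chunks filter before either lane runs. This isn't a presentation detail: it's the API boundary that prevents a new ranking algorithm from weakening the service contract.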

The fixed EVAL_DATE keeps replay behavior stable. The chunk shape also preserves document_id and parent_id from the previous lesson, because citation packing stays the same even though ranking changes.
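The freshness window is inclusive on both ends, which matters on the day one revision hands over to the next. The sketch below is a simplified variant of the lab's is_current that takes bare dates instead of a PolicyChunk; the helper name in_window and the two window constants are illustrative, with dates copied from the v1 and v2 fixtures.

```python
from __future__ import annotations

from datetime import date

def in_window(effective_from: date, effective_to: date | None, on_date: date) -> bool:
    # Same inclusive comparison as the lab's is_current, applied to bare dates.
    return effective_from <= on_date and (
        effective_to is None or on_date <= effective_to
    )

V1_WINDOW = (date(2025, 2, 1), date(2026, 3, 31))  # superseded revision
V2_WINDOW = (date(2026, 4, 1), None)               # current revision, open-ended

# Check the last day of v1, the first day of v2, and the lab's evaluation date.
for day in (date(2026, 3, 31), date(2026, 4, 1), date(2026, 5, 27)):
    current = [
        name
        for name, window in (("v1", V1_WINDOW), ("v2", V2_WINDOW))
        if in_window(*window, day)
    ]
    print(day.isoformat(), "->", current)
```

With these windows, exactly one revision is current on each side of the handover, so a replay pinned to a fixed date always sees the same searchable set.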

evidence-boundary-regression.py

blocked_ids = sorted(
    chunk.chunk_id
    for chunk in CHUNKS
    if chunk.chunk_id not in permitted_ids
)
print("Searchable by Luna:", permitted_ids)
print("Stored but blocked:", blocked_ids)
assert blocked_ids == ["eu-key-rotation-v1-rule", "restricted-admin-key-rotation"]

Output

Searchable by Luna: ['eu-key-rotation-v2-rule', 'eu-session-timeout-v1-rule', 'eu-audit-rebuild-v1']
Stored but blocked: ['eu-key-rotation-v1-rule', 'restricted-admin-key-rotation']

This test keeps a deliberately attractive hidden policy in storage. Later ranking changes fail loudly if they accidentally widen the searchable set.

Build the sparse lane with BM25

Sparse retrieval represents a document by vocabulary terms. Most coordinates are zero because a short policy chunk uses only a small part of the vocabulary. BM25 ranks a document higher when it shares rare query terms, while limiting the reward for repeated terms and compensating for unusually long documents.[1] The lab below uses the common Lucene-style IDF form with a leading 1 + inside the logarithm, which is the practical formula in many search engines rather than the classic Okapi IDF alone.

For a query term $t$ and document $d$, the lab computes:

$$\operatorname{BM25}(q,d)=\sum_{t \in q}\operatorname{IDF}(t)\,\frac{f(t,d)\,(k_1+1)}{f(t,d)+k_1\left(1-b+b\,\lvert d\rvert/\operatorname{avgdl}\right)}$$

Here, $f(t,d)$ is the term count in the chunk, $\lvert d\rvert$ is the chunk length in tokens, and $\operatorname{avgdl}$ is the average chunk length across the searchable corpus. $k_1$ controls term-frequency saturation; $b$ controls length normalization. The exact identifier krot-14 occurs only in the relevant current chunk, so it receives strong lexical weight.
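To make the two knobs concrete, the sketch below evaluates the IDF term and the term-frequency factor in isolation. The helper names lucene_idf and tf_factor are illustrative and not part of the lab; the three-document corpus size matches Luna's permitted set, and the chunk lengths are made-up values.

```python
from math import log

def lucene_idf(total_docs: int, containing_docs: int) -> float:
    """IDF with the leading 1 + inside the log, as in the lab's BM25 lane."""
    return log(1 + (total_docs - containing_docs + 0.5) / (containing_docs + 0.5))

def tf_factor(term_count: int, doc_len: float, avgdl: float,
              k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency part of BM25 for one term in one document."""
    return term_count * (k1 + 1) / (term_count + k1 * (1 - b + b * doc_len / avgdl))

# A term found in 1 of 3 chunks is rarer than one found in 2 of 3, so its IDF is larger.
rare_idf = lucene_idf(3, 1)
common_idf = lucene_idf(3, 2)
print(f"IDF rare={rare_idf:.3f} common={common_idf:.3f}")

# Saturation: each repeated occurrence adds less, and the factor stays below k1 + 1.
factors = [tf_factor(count, doc_len=10, avgdl=10) for count in (1, 2, 5, 50)]
print("TF factors:", [round(value, 3) for value in factors])

# Length normalization: one match counts for less in a longer-than-average chunk.
short_factor = tf_factor(1, doc_len=5, avgdl=10)
long_factor = tf_factor(1, doc_len=20, avgdl=10)
print(f"short={short_factor:.3f} long={long_factor:.3f}")
```

Setting b to 0 would remove the length effect entirely, and a larger k1 would let repeated terms keep adding weight for longer.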

The small analyzer below keeps hyphenated rule codes intact and removes common function words. Without that stopword rule, a query containing only a shared word such as "a" could appear to retrieve an unrelated policy.

bm25-lane.py

TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
STOPWORDS = {"a", "an", "the", "after", "for", "of", "is", "within", "when"}

def tokens(text: str) -> list[str]:
    return [
        token
        for token in TOKEN_RE.findall(text.lower())
        if token not in STOPWORDS
    ]

def bm25_rank(
    query: str,
    chunks: list[PolicyChunk],
    top_k: int = 2,
    k1: float = 1.2,
    b: float = 0.75,
) -> list[tuple[PolicyChunk, float]]:
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    if not chunks:
        return []

    doc_tokens = {chunk.chunk_id: tokens(chunk.text) for chunk in chunks}
    avgdl = sum(len(value) for value in doc_tokens.values()) / len(chunks)
    query_terms = tokens(query)
    ranked: list[tuple[PolicyChunk, float]] = []

    for chunk in chunks:
        document = doc_tokens[chunk.chunk_id]
        score = 0.0
        for term in query_terms:
            term_count = document.count(term)
            if term_count == 0:
                continue
            containing_docs = sum(term in value for value in doc_tokens.values())
            idf = log(1 + (len(chunks) - containing_docs + 0.5) / (containing_docs + 0.5))
            numerator = term_count * (k1 + 1)
            denominator = term_count + k1 * (1 - b + b * len(document) / avgdl)
            score += idf * numerator / denominator
        if score > 0:
            ranked.append((chunk, score))

    return sorted(ranked, key=lambda item: (-item[1], item[0].chunk_id))[:top_k]

EXACT = "KROT-14"
PARAPHRASE = "refresh expired machine credential"
bm25_exact = bm25_rank(EXACT, permitted)
bm25_paraphrase = bm25_rank(PARAPHRASE, permitted)

print("BM25 exact:", [chunk.chunk_id for chunk, _ in bm25_exact])
print("BM25 paraphrase:", [chunk.chunk_id for chunk, _ in bm25_paraphrase])
assert bm25_exact[0][0].chunk_id == "eu-key-rotation-v2-rule"
assert bm25_paraphrase == []

Output

BM25 exact: ['eu-key-rotation-v2-rule']
BM25 paraphrase: []

BM25 did its job. It recovered the policy from its code and openly failed when the user used no policy vocabulary. A real evaluation set needs both query types; otherwise the lexical lane can look perfect while customers miss evidence.
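The exact-code hit depends on the analyzer as much as on the scoring formula. As a minimal comparison, the sketch below runs the lab's hyphen-aware pattern next to a naive pattern that splits on hyphens; the naive pattern is a hypothetical alternative introduced only for contrast.

```python
import re

HYPHEN_AWARE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")  # same pattern as the lab
NAIVE = re.compile(r"[a-z0-9]+")  # hypothetical alternative that splits on hyphens

text = (
    "Rule KROT-14. Stale service-account keys qualify for automated "
    "rotation within 14 days."
)

hyphen_tokens = HYPHEN_AWARE.findall(text.lower())
naive_tokens = NAIVE.findall(text.lower())

print("Hyphen-aware:", hyphen_tokens)
print("Naive:", naive_tokens)
# The naive analyzer turns the rule code into "krot" and "14", and that "14"
# is indistinguishable from the "14" in "14 days".
print("Count of '14' (naive):", naive_tokens.count("14"))
```

Under the naive analyzer, a query for KROT-14 would partly match any chunk that mentions 14 of anything, which weakens the exact signal the sparse lane is supposed to provide.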

BM25 isn't the only sparse option. SPLADE learns sparse expansion weights, so a chunk can gain indexable related terms while preserving sparse retrieval infrastructure.[2] That can improve vocabulary mismatch cases, but it doesn't turn sparse retrieval into an authorization layer or guarantee better recall on security-policy queries. Evaluate a SPLADE candidate against the same permitted corpus, held-out required-evidence IDs, latency budget, and hidden-source exclusions before replacing BM25.

diagnose-sparse-miss.py

required_text = next(
    chunk.text for chunk in permitted if chunk.chunk_id == "eu-key-rotation-v2-rule"
)
exact_overlap = sorted(set(tokens(EXACT)) & set(tokens(required_text)))
paraphrase_overlap = sorted(set(tokens(PARAPHRASE)) & set(tokens(required_text)))
print("Exact overlap:", exact_overlap)
print("Paraphrase overlap:", paraphrase_overlap)
assert exact_overlap == ["krot-14"]
assert paraphrase_overlap == []

Output

Exact overlap: ['krot-14']
Paraphrase overlap: []

Add a dense semantic lane

A dense retriever encodes queries and chunks as compact vectors, then retrieves chunks with high similarity. Dense Passage Retrieval (DPR), for example, uses separate encoders for questions and passages so passage representations can be indexed before requests arrive.[3]

Downloading and training an encoder would hide the retrieval mechanics in this lab. Instead, the next cell uses frozen three-dimensional vectors as test fixtures. Read them as outputs already produced by an embedding model:

Dimension	Meaning in this fixture
1	Key-rotation intent
2	Session timeout intent
3	Audit-log rebuild intent

This fixture is deliberately honest about one failure: the internal code KROT-14 maps to the zero vector, so it carries no semantic signal by itself. The paraphrase does.

The lab ranks those vectors with cosine similarity. Vectors pointing in a similar direction score closer to 1; their raw length doesn't decide the result.
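A small check makes the direction-versus-length point concrete. The vectors below are illustrative fixtures in the same three-dimensional space; the dot helper is added only for contrast and isn't used by the lab.

```python
from math import isclose, sqrt

def cosine(left, right):
    left_norm = sqrt(sum(value * value for value in left))
    right_norm = sqrt(sum(value * value for value in right))
    if left_norm == 0 or right_norm == 0:
        return 0.0  # a zero vector has no direction, so treat it as no match
    return sum(a * b for a, b in zip(left, right)) / (left_norm * right_norm)

def dot(left, right):
    return sum(a * b for a, b in zip(left, right))

query = (0.98, 0.05, 0.00)
unit_doc = (1.0, 0.0, 0.0)    # key-rotation direction
long_doc = (10.0, 0.0, 0.0)   # same direction, ten times the length
other_doc = (0.0, 1.0, 0.0)   # session-timeout direction

# Cosine ignores vector length; the raw dot product does not.
print("cosine unit vs long:", cosine(query, unit_doc), cosine(query, long_doc))
print("dot unit vs long:", dot(query, unit_doc), dot(query, long_doc))
print("cosine other direction:", cosine(query, other_doc))
```

Scaling a document vector changes its dot product with the query but leaves its cosine similarity essentially unchanged, which is why the fixture can use unnormalized query vectors.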

dense-lane.py

Vector = tuple[float, float, float]

DOCUMENT_VECTORS: dict[str, Vector] = {
    "eu-key-rotation-v2-rule": (1.00, 0.00, 0.00),
    "eu-session-timeout-v1-rule": (0.00, 1.00, 0.00),
    "eu-audit-rebuild-v1": (0.00, 0.00, 1.00),
}
QUERY_VECTORS: dict[str, Vector] = {
    EXACT: (0.00, 0.00, 0.00),
    PARAPHRASE: (0.98, 0.05, 0.00),
    "stale service-account key rotation within 14 days": (0.96, 0.15, 0.02),
}

def cosine(left: Vector, right: Vector) -> float:
    left_norm = sqrt(sum(value * value for value in left))
    right_norm = sqrt(sum(value * value for value in right))
    if left_norm == 0 or right_norm == 0:
        return 0.0
    return sum(a * b for a, b in zip(left, right)) / (left_norm * right_norm)

def dense_rank(
    query: str,
    chunks: list[PolicyChunk],
    top_k: int = 2,
) -> list[tuple[PolicyChunk, float]]:
    query_vector = QUERY_VECTORS.get(query, (0.0, 0.0, 0.0))
    ranked = [
        (chunk, cosine(query_vector, DOCUMENT_VECTORS[chunk.chunk_id]))
        for chunk in chunks
    ]
    return sorted(
        [(chunk, score) for chunk, score in ranked if score > 0],
        key=lambda item: (-item[1], item[0].chunk_id),
    )[:top_k]

dense_exact = dense_rank(EXACT, permitted)
dense_paraphrase = dense_rank(PARAPHRASE, permitted)
print("Dense exact:", [chunk.chunk_id for chunk, _ in dense_exact])
print("Dense paraphrase:", [chunk.chunk_id for chunk, _ in dense_paraphrase])
assert dense_exact == []
assert dense_paraphrase[0][0].chunk_id == "eu-key-rotation-v2-rule"

Output

Dense exact: []
Dense paraphrase: ['eu-key-rotation-v2-rule', 'eu-session-timeout-v1-rule']

The fixture doesn't claim that every production encoder misses every identifier. It establishes a regression case: this chosen encoder representation doesn't recover the code-only query, so deleting the sparse lane would fail a known requirement.

Fuse candidates without mixing score scales

BM25 scores and cosine similarities don't share units. A BM25 value reflects term statistics in this index; a cosine value reflects vector alignment. Adding raw values can let whichever scale is numerically larger control the order.
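The sketch below shows the scale problem with two chunks. All scores and chunk IDs are hypothetical values chosen for illustration; they aren't produced by the lab's rankers.

```python
# Hypothetical raw scores for two chunks; values are illustrative only.
bm25_scores = {"chunk-a": 7.4, "chunk-b": 6.5}      # sparse lane slightly prefers chunk-a
cosine_scores = {"chunk-a": 0.31, "chunk-b": 0.93}  # dense lane strongly prefers chunk-b

def order(scores: dict[str, float]) -> list[str]:
    # Highest score first, chunk ID as a deterministic tie-break.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))

raw_sum = {
    chunk_id: bm25_scores[chunk_id] + cosine_scores[chunk_id]
    for chunk_id in bm25_scores
}

bm25_order = order(bm25_scores)
dense_order = order(cosine_scores)
raw_order = order(raw_sum)

print("BM25 order:", bm25_order)
print("Dense order:", dense_order)
print("Raw-sum order:", raw_order)
```

The raw sum simply reproduces the BM25 order: a gap of 0.9 on the BM25 scale outweighs a gap of 0.62 on the cosine scale, even though 0.62 spans most of the cosine range.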

Reciprocal Rank Fusion avoids that comparison. It contributes $1 / (k + r)$ for each rank $r$ at which a chunk appears:

$$\operatorname{RRF}(d)=\sum_{\text{lane } l}\frac{1}{k+\operatorname{rank}_l(d)}$$

We use k=60, the setting reported in the original RRF experiments, as a starting value rather than a universal optimum.[4] A chunk found by both lanes gains two contributions; a strong result found by one lane remains eligible.

For the shared-language query "stale service-account key rotation within 14 days," eu-key-rotation-v2-rule ranks first in both lanes, so its fused score is 1 / 61 + 1 / 61 ≈ 0.0328. The session-timeout distractor ranks second in both lanes, so its score is 1 / 62 + 1 / 62 ≈ 0.0323. These values aren't probabilities. They are rank-based scores used to order the fused candidate set.
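The same arithmetic can be checked directly. The sketch below recomputes those two scores from their lane ranks and adds one hypothetical chunk, single-lane-winner, that only one lane returns at rank 1; that third entry is an assumption added to show how agreement between lanes compares with a single first place.

```python
RRF_K = 60

# Rank of each chunk in each lane; a lane that did not return the chunk is omitted.
lane_ranks = {
    "eu-key-rotation-v2-rule": {"bm25": 1, "dense": 1},
    "eu-session-timeout-v1-rule": {"bm25": 2, "dense": 2},
    "single-lane-winner": {"dense": 1},  # hypothetical chunk, illustration only
}

fused = {
    chunk_id: sum(1 / (RRF_K + rank) for rank in ranks.values())
    for chunk_id, ranks in lane_ranks.items()
}

for chunk_id, score in sorted(fused.items(), key=lambda item: (-item[1], item[0])):
    print(f"{chunk_id}: {score:.4f}")
```

With k=60, second place in both lanes (about 0.0323) outscores first place in a single lane (about 0.0164), so RRF at this setting rewards agreement between lanes more than one strong vote.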

Figure: Reciprocal Rank Fusion overview. The same two documents place first and second in both the BM25 and dense lists, and the fused ranking keeps that shared order. RRF combines positions, not raw score units, so shared first-place evidence stays first after fusion.

rrf-fusion.py

RRF_K = 60

def reciprocal_rank_fusion(
    result_lists: list[list[tuple[PolicyChunk, float]]],
    k: int = RRF_K,
) -> list[tuple[PolicyChunk, float]]:
    if k <= 0:
        raise ValueError("k must be positive")
    by_id: dict[str, PolicyChunk] = {}
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, (chunk, _raw_score) in enumerate(results, start=1):
            by_id[chunk.chunk_id] = chunk
            scores[chunk.chunk_id] = scores.get(chunk.chunk_id, 0.0) + 1 / (k + rank)
    return sorted(
        [(by_id[chunk_id], score) for chunk_id, score in scores.items()],
        key=lambda item: (-item[1], item[0].chunk_id),
    )

def hybrid_rank(
    query: str,
    caller: Caller,
    chunks: list[PolicyChunk],
    top_k: int = 2,
    candidate_pool: int | None = None,
) -> list[tuple[PolicyChunk, float]]:
    searchable = permitted_chunks(caller, chunks, EVAL_DATE)
    # Fuse from a deeper per-lane pool than the final context budget.
    pool = candidate_pool if candidate_pool is not None else max(50, top_k)
    pool = min(pool, len(searchable)) if searchable else top_k
    fused = reciprocal_rank_fusion(
        [bm25_rank(query, searchable, pool), dense_rank(query, searchable, pool)]
    )
    return fused[:top_k]

SHARED_WORDS = "stale service-account key rotation within 14 days"
for query in [EXACT, PARAPHRASE, SHARED_WORDS]:
    hits = hybrid_rank(query, LUNA, CHUNKS)
    print(query, "->", [chunk.chunk_id for chunk, _ in hits])

shared_fused = hybrid_rank(SHARED_WORDS, LUNA, CHUNKS)
assert shared_fused[0][0].chunk_id == "eu-key-rotation-v2-rule"
assert shared_fused[0][1] == 2 / 61

Output

KROT-14 -> ['eu-key-rotation-v2-rule']
refresh expired machine credential -> ['eu-key-rotation-v2-rule', 'eu-session-timeout-v1-rule']
stale service-account key rotation within 14 days -> ['eu-key-rotation-v2-rule', 'eu-session-timeout-v1-rule']

RRF doesn't manufacture relevance. It makes the two candidate sources interoperable. If both lanes miss the right chunk, a fused list will still be wrong.

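Don't set each lane's depth equal to the final context budget. If top_k is 2 and each lane returns only 2 hits, a chunk ranked third in both lanes never reaches fusion, even though its combined score (2 / 63) would beat a chunk ranked first in only one lane (1 / 61). Retrieve a deeper candidate pool per lane, fuse, then keep top_k.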
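The effect of pool depth can be sketched with ID-only lists. The rrf_ids helper and the chunk IDs below are hypothetical; the helper follows the same formula as the lab's reciprocal_rank_fusion but works on plain strings so that it runs on its own.

```python
def rrf_ids(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists with RRF and return IDs, best first."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1 / (k + rank)
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))

# Hypothetical lane outputs: "shared" sits at rank 3 in both lanes.
sparse_full = ["s1", "s2", "shared", "s4"]
dense_full = ["d1", "d2", "shared", "d4"]
top_k = 2

# Shallow: each lane is cut to top_k before fusion, so "shared" is never seen.
shallow = rrf_ids([sparse_full[:top_k], dense_full[:top_k]])[:top_k]
# Deep: fuse the full per-lane pools, then apply the context budget.
deep = rrf_ids([sparse_full, dense_full])[:top_k]

print("Shallow pool:", shallow)
print("Deep pool:", deep)
```

The only difference between the two calls is where the cut happens, which is why hybrid_rank takes a separate candidate_pool argument instead of reusing top_k for both purposes.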

Attack the evidence boundary through fusion

The stored admin-only policy contains a unique code. If the permission check ran after retrieval instead of before it, BM25 would have an easy hidden hit to surface. A hybrid implementation must return nothing for Luna's request for that code.

hidden-source-attack.py

NO_ACCESS = Caller("visitor-9000", "APAC", frozenset())
attack_hits = hybrid_rank("ADMIN-KROT-1", LUNA, CHUNKS)
attack_ids = [chunk.chunk_id for chunk, _ in attack_hits]
no_access_hits = hybrid_rank(EXACT, NO_ACCESS, CHUNKS)
print("Visible candidates for hidden code:", attack_ids)
print("Visible candidates without corpus access:", no_access_hits)
assert "restricted-admin-key-rotation" not in attack_ids
assert attack_ids == []
assert no_access_hits == []

Output

Visible candidates for hidden code: []
Visible candidates without corpus access: []

Gate the retriever on recall and safety

In the previous lesson, answer quality depended on retrieving current permitted evidence. That means the retrieval upgrade needs its own release cases before you measure generated text.

Recall@2 answers a narrow question: for each supported query, did the correct permitted chunk appear in the first two candidates? It doesn't say whether the evidence order is perfect or whether the final answer is faithful. Those are later checks. Here, recall exposes whether the generator even gets a chance to see the right policy.
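A short sketch shows what Recall@k does and doesn't see. It pairs a generic recall_at_k with reciprocal rank, an order-sensitive metric that this lesson doesn't otherwise use; both helper names and the two rankings are illustrative assumptions.

```python
def recall_at_k(ranked_ids: list[str], expected_id: str, k: int) -> float:
    """1.0 if the expected chunk is within the first k candidates, else 0.0."""
    if k <= 0:
        raise ValueError("k must be positive")
    return 1.0 if expected_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids: list[str], expected_id: str) -> float:
    """1 / position of the expected chunk, or 0.0 if it is absent."""
    for position, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == expected_id:
            return 1 / position
    return 0.0

EXPECTED = "eu-key-rotation-v2-rule"
evidence_first = [EXPECTED, "eu-session-timeout-v1-rule"]
distractor_first = ["eu-session-timeout-v1-rule", EXPECTED]

for name, ranking in [("evidence first", evidence_first),
                      ("distractor first", distractor_first)]:
    print(
        name,
        "Recall@2:", recall_at_k(ranking, EXPECTED, 2),
        "RR:", reciprocal_rank(ranking, EXPECTED),
    )
```

Both rankings pass Recall@2, yet only one puts the evidence first. That gap is the ordering problem left for the reranking stage.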

Figure: Hybrid retrieval release gate. Three frozen queries test exact-code, paraphrase, and shared-language recall across the BM25, dense, and hybrid lanes, while hidden and superseded attack cases must stay excluded before release. Freeze three positive recall cases, verify that hybrid repairs both single-lane misses, and block release if hidden or superseded sources reappear anywhere.

retrieval-release-gate.py

@dataclass(frozen=True)
class RetrievalCase:
    name: str
    query: str
    expected_chunk_id: str

CASES = [
    RetrievalCase("exact-code", EXACT, "eu-key-rotation-v2-rule"),
    RetrievalCase("paraphrase", PARAPHRASE, "eu-key-rotation-v2-rule"),
    RetrievalCase("shared-language", SHARED_WORDS, "eu-key-rotation-v2-rule"),
]

def recall_at_2(rank_fn) -> float:
    recovered = 0
    for case in CASES:
        ids = [chunk.chunk_id for chunk, _ in rank_fn(case.query)]
        recovered += case.expected_chunk_id in ids[:2]
    return recovered / len(CASES)

bm25_recall = recall_at_2(lambda query: bm25_rank(query, permitted))
dense_recall = recall_at_2(lambda query: dense_rank(query, permitted))
hybrid_recall = recall_at_2(lambda query: hybrid_rank(query, LUNA, CHUNKS))

restricted_attack = hybrid_rank("ADMIN-KROT-1", LUNA, CHUNKS)
superseded_attack = hybrid_rank(
    "KROT-14 key rotation manual 30 days",
    LUNA,
    CHUNKS,
)
restricted_ids = [chunk.chunk_id for chunk, _ in restricted_attack]
superseded_ids = [chunk.chunk_id for chunk, _ in superseded_attack]
safety_pass = (
    "restricted-admin-key-rotation" not in restricted_ids
    and "eu-key-rotation-v1-rule" not in superseded_ids
)

print(f"BM25 Recall@2: {bm25_recall:.2f}")
print(f"Dense Recall@2: {dense_recall:.2f}")
print(f"Hybrid RRF Recall@2: {hybrid_recall:.2f}")
print("Restricted attack ids:", restricted_ids)
print("Superseded attack ids:", superseded_ids)
print("Safety gate:", safety_pass)
assert hybrid_recall == 1.0
assert hybrid_recall > bm25_recall
assert hybrid_recall > dense_recall
assert restricted_ids == []
assert "eu-key-rotation-v1-rule" not in superseded_ids
assert safety_pass

Output

BM25 Recall@2: 0.67
Dense Recall@2: 0.67
Hybrid RRF Recall@2: 1.00
Restricted attack ids: []
Superseded attack ids: ['eu-key-rotation-v2-rule', 'eu-session-timeout-v1-rule']
Safety gate: True

These three fixtures demonstrate complementary failures; they don't prove an offline lift for a production corpus. A release decision needs a held-out set drawn from real support requests, including exact codes, paraphrases, unsupported questions, languages served by the product, and attempts to request hidden policies.

Gate	What to freeze	Failure meaning
Permitted Recall@k	Query and required current chunk ID	Correct evidence never reaches context selection
Restricted-source exclusion	Queries that strongly match hidden chunks	Retriever boundary is unsafe
Superseded-source exclusion	Queries matching old policy wording	Freshness filter regressed
Abstention cases	Questions with no permitted supporting evidence	Retrieval or answer layer overreaches

Trace each lane before adding reranking

When a final answer is wrong, you need to tell apart three failures:

Failure	Trace evidence	Next repair
Retrieval miss	Expected chunk absent from sparse, dense, and fused candidates	Improve indexing, encoder, query handling, or fusion
Fusion ordering issue	Expected chunk exists in a lane but falls below context budget	Tune fusion on held-out labels
Later precision issue	Correct chunk is in fused candidates but distractors rank above it	Add and evaluate the reranking stage in the next lesson

Store IDs, ranks, model and index versions, fusion settings, and timing in a trace. Don't log policy text in a broad diagnostic event.

hybrid-retrieval-trace.py

def trace_hybrid_request(
    query: str,
    query_kind: str,
    caller: Caller,
) -> dict[str, object]:
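    searchable = permitted_chunks(caller, CHUNKS, EVAL_DATE)
    # Match hybrid_rank's per-lane pool depth so the trace mirrors serving behavior.
    pool = max(1, min(50, len(searchable)))
    sparse = bm25_rank(query, searchable, pool)
    dense = dense_rank(query, searchable, pool)
    fused = reciprocal_rank_fusion([sparse, dense])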
    return {
        "versions": {
            "retriever": "policy-retriever-v2",
            "index": "policy-index/2026-05-27",
            "sparse": "bm25-tokenizer-v1",
            "dense": "fixture-embeddings-v1",
            "fusion": f"rrf-k{RRF_K}",
        },
        "query_kind": query_kind,
        "sparse_ids": [chunk.chunk_id for chunk, _ in sparse],
        "dense_ids": [chunk.chunk_id for chunk, _ in dense],
        "fused_ids": [chunk.chunk_id for chunk, _ in fused[:2]],
        # Fixture timings for this lab; a real trace records measured durations.
        "timings_ms": {"authorize": 2, "bm25": 4, "dense": 11, "fusion": 1},
    }

trace = trace_hybrid_request(PARAPHRASE, "paraphrase-regression", LUNA)
stores_raw_policy_text = any(
    chunk.text in str(trace)
    for chunk in CHUNKS
)
print("Versions:", trace["versions"])
print("Sparse ids:", trace["sparse_ids"])
print("Dense ids:", trace["dense_ids"])
print("Fused ids:", trace["fused_ids"])
print("Trace stores raw policy text:", stores_raw_policy_text)
assert trace["fused_ids"][0] == "eu-key-rotation-v2-rule"
assert "eu-session-timeout-v1-rule" in trace["fused_ids"]
assert "restricted-admin-key-rotation" not in str(trace)
assert not stores_raw_policy_text

Output

Versions: {'retriever': 'policy-retriever-v2', 'index': 'policy-index/2026-05-27', 'sparse': 'bm25-tokenizer-v1', 'dense': 'fixture-embeddings-v1', 'fusion': 'rrf-k60'}
Sparse ids: []
Dense ids: ['eu-key-rotation-v2-rule', 'eu-session-timeout-v1-rule']
Fused ids: ['eu-key-rotation-v2-rule', 'eu-session-timeout-v1-rule']
Trace stores raw policy text: False

The correct evidence is present, but the semantic lane also kept a session-timeout distractor. That's the boundary between retrieval and reranking: retrieval satisfies candidate recall; a reranker decides whether a distractor should remain near context. The trace can also preserve the latency budget created in the production RAG lesson. Two retrieval lanes add work, so the release check should report that cost explicitly.

If a context budget is being wasted by several near-duplicate candidates, Maximal Marginal Relevance (MMR) is one selection strategy: choose a relevant result while penalizing candidates too similar to what has already been selected.[5] MMR handles diversity in an existing permitted candidate set. It doesn't retrieve a missing policy and doesn't replace a cross-encoder that must compare query relevance precisely.
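A minimal MMR sketch over two-dimensional fixture vectors follows. The function name mmr_select, the candidate IDs, the vectors, and the choice of cosine for both similarity terms are assumptions made for illustration; the lab itself does not implement MMR.

```python
from math import sqrt

def cosine(left, right):
    left_norm = sqrt(sum(value * value for value in left))
    right_norm = sqrt(sum(value * value for value in right))
    if left_norm == 0 or right_norm == 0:
        return 0.0
    return sum(a * b for a, b in zip(left, right)) / (left_norm * right_norm)

def mmr_select(query, candidates, budget, lam=0.5):
    """Greedy MMR: lam * relevance - (1 - lam) * max similarity to already selected."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < budget:
        def mmr_score(chunk_id):
            relevance = cosine(query, remaining[chunk_id])
            redundancy = max(
                (cosine(remaining[chunk_id], candidates[chosen]) for chosen in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        # Iterating sorted IDs makes ties resolve deterministically.
        best = max(sorted(remaining), key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected

query = (1.0, 0.3)
candidates = {
    "rotation-rule": (0.99, 0.05),
    "rotation-rule-copy": (1.0, 0.0),   # near-duplicate of rotation-rule
    "escalation-rule": (0.6, 0.8),      # less relevant, but different content
}

print("Relevance only:", mmr_select(query, candidates, budget=2, lam=1.0))
print("MMR, lam=0.5:", mmr_select(query, candidates, budget=2, lam=0.5))
```

With lam=1.0 the near-duplicate takes the second slot; with lam=0.5 the redundancy penalty hands that slot to the different chunk. Every candidate still comes from the list passed in, so MMR can't surface anything the retrieval lanes missed.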

retrieval-latency-gate.py

RETRIEVAL_BUDGET_MS = {"authorize": 10, "bm25": 12, "dense": 40, "fusion": 8}

def exceeded_retrieval_budgets(timings: dict[str, int]) -> list[str]:
    return [
        stage
        for stage, budget in RETRIEVAL_BUDGET_MS.items()
        if stage not in timings or timings[stage] > budget
    ]

healthy = trace["timings_ms"]
missing_dense = {
    stage: elapsed
    for stage, elapsed in healthy.items()
    if stage != "dense"
}
print("Healthy exceeded:", exceeded_retrieval_budgets(healthy))
print("Missing timing exceeded:", exceeded_retrieval_budgets(missing_dense))
assert exceeded_retrieval_budgets(healthy) == []
assert exceeded_retrieval_budgets(missing_dense) == ["dense"]

Output

Healthy exceeded: []
Missing timing exceeded: ['dense']

Should you tune weights instead?

RRF is a good first implementation because it doesn't require calibrating unrelated score scales. It isn't an automatic winner. Bruch et al. found that RRF can be sensitive to its parameter and that a tuned convex combination can outperform it in their tested settings.[6] If you have enough labeled queries, compare it against normalized weighted fusion:

$$\operatorname{score}_{\text{hybrid}}(d)=\alpha\,\widetilde{\operatorname{score}}_{\text{dense}}(d)+(1-\alpha)\,\widetilde{\operatorname{score}}_{\text{sparse}}(d)$$

The tildes denote scores normalized within their lanes before fusion. That comparison is an evaluation task, not a reason to guess an alpha in production. Keep a fixed held-out split, version the encoder and index, report Recall@k and latency for every candidate, and retain RRF if a tuned method doesn't hold up out of sample.
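For the comparison run, a convex combination might look like the sketch below. It assumes min-max normalization within each lane and a normalized score of 0 for a chunk that a lane didn't return; both are illustrative choices rather than the method of the cited paper or of this lab, and the scores and chunk IDs are made up.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one lane's scores to [0, 1]; a constant lane maps to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (value - low) / (high - low) for chunk_id, value in scores.items()}

def convex_fusion(
    dense: dict[str, float],
    sparse: dict[str, float],
    alpha: float,
) -> list[tuple[str, float]]:
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    fused = {
        # Assumption: a chunk missing from a lane contributes 0 from that lane.
        chunk_id: alpha * dense_n.get(chunk_id, 0.0)
        + (1 - alpha) * sparse_n.get(chunk_id, 0.0)
        for chunk_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical lane scores for three permitted chunks.
dense_scores = {"a": 0.9, "b": 0.6, "c": 0.2}
sparse_scores = {"b": 8.0, "c": 3.0}

for alpha in (0.0, 0.5, 1.0):
    ranking = [chunk_id for chunk_id, _ in convex_fusion(dense_scores, sparse_scores, alpha)]
    print(f"alpha={alpha}: {ranking}")
```

The order changes with alpha, which is exactly why alpha has to be chosen on a fixed labeled split and rechecked whenever the encoder or index changes.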

Build it yourself

Extend policy-answerer-v2 without weakening its contract:

Add at least eight permitted positive queries: exact codes, natural-language paraphrases, and mixed queries.
Add at least four negative or adversarial queries: restricted admin-only rules, superseded policy wording, wrong region, and absent evidence.
Replace the frozen dense vectors with embeddings produced by your chosen bi-encoder, recording the model version.
Compare BM25, dense, and RRF using permitted Recall@k; record p50 and p95 retrieval latency.
Save a compact trace for each failed case containing candidate IDs, lane ranks, versions, and timings, but not restricted content.
Keep the fused top candidates as the input artifact for the next lesson's reranker.

The important artifact isn't a search demo. It's a retrieval report showing which evidence questions are recovered, which must abstain, and which source boundaries remain enforced after the upgrade.

Mastery check

You're ready to use hybrid retrieval in a RAG system when you can:

Explain why BM25 recovers rare identifiers and why dense retrieval can recover paraphrases.
Explain when learned sparse expansion such as SPLADE is a candidate alternative to BM25.
Implement BM25 scoring and RRF fusion over permitted candidate records.
Explain why MMR diversifies selected evidence but doesn't repair a retrieval miss.
Explain why BM25 scores and cosine similarities must not be added without calibration.
Treat frozen semantic vectors as a test fixture, not proof that one production embedding model behaves the same way.
Evaluate retrieval with expected evidence IDs before evaluating generated answers.
Preserve authorization and freshness filtering ahead of both retrieval lanes.
Produce a lane-by-lane trace that makes a missed chunk diagnosable.

Evaluation rubric

Level	Evidence in your submission
Foundational	Correctly ranks `KROT-14` with BM25 and explains its term-based signal
Applied	Recovers the paraphrased key-rotation question through dense retrieval and fuses candidates with RRF
Strong	Reports BM25-only, dense-only, and hybrid Recall@k on labeled positive cases plus negative safety gates
Production-ready	Uses a versioned encoder and index, measures latency, and proves restricted or superseded policies never enter fused candidates

Common pitfalls

Symptom	Likely cause	Repair
Rule-code lookup returns a generic policy	Dense-only search lost a rare identifier	Keep or restore the lexical lane and add code queries to the release set
Paraphrased question returns nothing	Sparse-only search requires the policy's exact wording	Add a dense lane and test semantic queries against required evidence IDs
Fused ranking changes wildly after an encoder update	Raw scores or tuned weights no longer have the same calibration	Compare against RRF and retune only on a fixed labeled split
Hidden admin-only rule appears in any candidate trace	Retrieval ran before authorization filtering	Restrict the searchable candidate universe before either lane executes
Team blames generation for an unsupported answer	Retrieval evidence IDs were never evaluated	Measure permitted Recall@k and abstention before scoring final text

Next Step

Continue to Reranking and Cross-Encoders for RAG

You now have a permission-safe hybrid retriever that recovers exact codes and semantic paraphrases into a measured candidate set. Next you'll add a slower precision stage that reorders those candidates before they enter the generator context.

References

[1] Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval. https://doi.org/10.1561/1500000019

[2] Formal, T., et al. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. SIGIR 2021. https://arxiv.org/abs/2107.05720

[3] Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. https://arxiv.org/abs/2004.04906

[4] Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR '09. https://dl.acm.org/doi/10.1145/1571941.1572114

[5] Carbonell, J., & Goldstein, J. (1998). The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998. https://dl.acm.org/doi/10.1145/290941.291025

[6] Bruch, S., Gai, S., & Ingber, A. (2023). An Analysis of Fusion Functions for Hybrid Retrieval. ACM Transactions on Information Systems. https://arxiv.org/abs/2210.11934

One retriever can't cover both queries: Why must the permission filter run before both retrieval lanes rather than after fusion?

Build the sparse lane with BM25: If BM25 returns no result for "refresh expired machine credential," should you lower an authorization or freshness filter?

Fuse candidates without mixing score scales: Why shouldn't you add a BM25 score directly to a cosine-similarity score?

Gate the retriever on recall and safety: Hybrid reaches Recall@2 of 1.00 on this tiny set. Can you say hybrid search is better for all security-policy queries?

Common pitfalls: What remains for the next lesson if hybrid retrieval already recovers the right evidence?