Upgrade a permission-safe RAG retriever with BM25, semantic scores, rank fusion, and recall gates for exact codes and paraphrased policy questions.
The production RAG lesson built policy-answerer-v1 around a hard rule: only permitted, current evidence may reach an answer. Its simple term-overlap retriever was easy to audit, but it misses a customer who asks for a "swap for a broken reconditioned notebook" when the policy says "replacement for a damaged refurbished laptop."
This lesson upgrades only that retrieval lane. You'll build policy-answerer-v2. The code RPL-14 gives sparse retrieval an exact signal; dense retrieval catches paraphrased meaning; and Reciprocal Rank Fusion (RRF) merges candidate lists. The authorization, freshness, citation, and abstention contract doesn't change.
Luna, an EU support specialist, needs the same policy for two different searches:
| Query | Useful signal | Required evidence |
|---|---|---|
| RPL-14 | Exact policy code | eu-refurb-v2-rule |
| swap a broken reconditioned notebook | Meaning close to "replacement for a damaged refurbished laptop" | eu-refurb-v2-rule |
A word-matching index has a decisive clue for the first query and no shared vocabulary for the second. A semantic encoder can represent the second query near the policy, but an unfamiliar internal identifier may carry little useful semantic signal. Neither failure says one method is bad. They solve different recall problems.
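The safe online order is:

1. Filter stored chunks to those the caller is permitted to read and that are current on the evaluation date.
2. Run the sparse and dense rankers over that permitted set only.
3. Merge the two ranked lists with RRF.
4. Pass the top fused candidates on to the unchanged citation and abstention steps.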
The lab reuses the policy shape from the previous lesson. It adds diagnostic policy code RPL-14 so an exact-identifier query has an unambiguous expected result. Two tempting records remain in storage but must not be searchable for Luna: a superseded revision and a restricted merchant rule.
```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from math import log, sqrt
import re

@dataclass(frozen=True)
class PolicyChunk:
    chunk_id: str
    document_id: str
    parent_id: str
    version: str
    region: str
    acl_tag: str
    effective_from: date
    effective_to: date | None
    text: str

@dataclass(frozen=True)
class Caller:
    actor_id: str
    region: str
    acl_tags: frozenset[str]

EVAL_DATE = date(2026, 5, 27)
LUNA = Caller("luna-48291", "EU", frozenset({"support:eu"}))
CHUNKS = [
    PolicyChunk(
        "eu-refurb-v2-rule",
        "eu-electronics",
        "eu-electronics-v2",
        "eu-electronics/2026-04-01",
        "EU",
        "support:eu",
        date(2026, 4, 1),
        None,
        (
            "Rule RPL-14. Damaged refurbished laptops qualify for replacement "
            "within 14 days of delivery when damage is reported within 48 hours."
        ),
    ),
    PolicyChunk(
        "eu-refurb-v1-rule",
        "eu-electronics",
        "eu-electronics-v1",
        "eu-electronics/2025-02-01",
        "EU",
        "support:eu",
        date(2025, 2, 1),
        date(2026, 3, 31),
        "Rule RPL-14. Damaged refurbished laptops qualify for return within 30 days.",
    ),
    PolicyChunk(
        "merchant-vip-refurb",
        "merchant-vip-terms",
        "merchant-vip-terms",
        "merchant-vip/2026-05-01",
        "EU",
        "merchant:vip-ops",
        date(2026, 5, 1),
        None,
        "VIP-RPL-1. Damaged refurbished laptops receive immediate refund.",
    ),
    PolicyChunk(
        "eu-footwear-v1-rule",
        "eu-footwear",
        "eu-footwear-v1",
        "eu-footwear/2026-01-03",
        "EU",
        "support:eu",
        date(2026, 1, 3),
        None,
        "Unworn footwear may be returned within 30 days of delivery.",
    ),
    PolicyChunk(
        "eu-carrier-loss-v1",
        "eu-carrier",
        "eu-carrier-loss-v1",
        "eu-carrier/2026-02-10",
        "EU",
        "support:eu",
        date(2026, 2, 10),
        None,
        "Rule CLM-7. A lost parcel after carrier pickup qualifies for refund.",
    ),
]

def is_current(chunk: PolicyChunk, on_date: date) -> bool:
    return chunk.effective_from <= on_date and (
        chunk.effective_to is None or on_date <= chunk.effective_to
    )

def permitted_chunks(
    caller: Caller,
    chunks: list[PolicyChunk],
    on_date: date,
) -> list[PolicyChunk]:
    return [
        chunk
        for chunk in chunks
        if chunk.region == caller.region
        and chunk.acl_tag in caller.acl_tags
        and is_current(chunk, on_date)
    ]

permitted = permitted_chunks(LUNA, CHUNKS, EVAL_DATE)
permitted_ids = [chunk.chunk_id for chunk in permitted]
print("Permitted current ids:", permitted_ids)
assert "eu-refurb-v2-rule" in permitted_ids
assert "eu-refurb-v1-rule" not in permitted_ids
assert "merchant-vip-refurb" not in permitted_ids
```

```text
Permitted current ids: ['eu-refurb-v2-rule', 'eu-footwear-v1-rule', 'eu-carrier-loss-v1']
```

Every ranker below receives `permitted`, not `CHUNKS`. This isn't a presentation detail: it is the API boundary that prevents a new ranking algorithm from weakening the service contract.
The fixed EVAL_DATE keeps replay behavior stable. The chunk shape also preserves document_id and parent_id from the previous lesson, even though this chapter changes ranking rather than citation packing.
```python
blocked_ids = sorted(
    chunk.chunk_id
    for chunk in CHUNKS
    if chunk.chunk_id not in permitted_ids
)
print("Searchable by Luna:", permitted_ids)
print("Stored but blocked:", blocked_ids)
assert blocked_ids == ["eu-refurb-v1-rule", "merchant-vip-refurb"]
```

```text
Searchable by Luna: ['eu-refurb-v2-rule', 'eu-footwear-v1-rule', 'eu-carrier-loss-v1']
Stored but blocked: ['eu-refurb-v1-rule', 'merchant-vip-refurb']
```

This test keeps a deliberately attractive hidden policy in storage. Later ranking changes fail loudly if they accidentally widen the searchable set.
Sparse retrieval represents a document by vocabulary terms. Most coordinates are zero because a short policy chunk uses only a small part of the vocabulary. BM25 ranks a document higher when it shares rare query terms, while limiting the reward for repeated terms and compensating for unusually long documents.[1]
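For a query term $t$ and document $d$, the lab computes:

$$
\text{score}(t, d) = \text{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)},
\qquad
\text{IDF}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)
$$

Here, $f(t, d)$ is the term count in the chunk, $|d|$ is chunk length in tokens, and avgdl is the corpus average. $N$ is the number of searchable chunks and $n_t$ is the number of those chunks that contain $t$. $k_1$ controls term-frequency saturation; $b$ controls length normalization. A chunk's score for a query is the sum of these values over the query's terms. The exact identifier rpl-14 occurs only in the relevant current chunk, so it receives strong lexical weight.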
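To make the saturation and length effects concrete, here is a minimal sketch that evaluates the same per-term formula for a few term counts. The helper name `bm25_term_score` and the statistics passed to it (three chunks, one containing the term, a 14-token chunk against an average of 10) are illustrative assumptions, not values reported by the lab.

```python
from math import log

def bm25_term_score(
    tf: int,
    doc_len: int,
    avgdl: float,
    n_docs: int,
    n_containing: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Score one query term against one chunk (hypothetical helper)."""
    idf = log(1 + (n_docs - n_containing + 0.5) / (n_containing + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))

# Assumed statistics: 3 chunks, the term appears in exactly 1 of them.
for tf in (1, 2, 5, 50):
    score = bm25_term_score(tf, doc_len=14, avgdl=10, n_docs=3, n_containing=1)
    print(f"tf={tf:>2}  score={score:.3f}")

# The same term in a shorter chunk scores higher (length normalization).
short = bm25_term_score(1, doc_len=8, avgdl=10, n_docs=3, n_containing=1)
long_ = bm25_term_score(1, doc_len=14, avgdl=10, n_docs=3, n_containing=1)
print("shorter chunk scores higher:", short > long_)
```

The score rises with each repetition but never exceeds IDF times (k1 + 1), which is why repeating a term many times cannot overwhelm a rare-term match.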
The small analyzer below keeps hyphenated rule codes intact and removes common function words. Without that stopword rule, a query containing only a shared word such as "a" could appear to retrieve an unrelated policy.
```python
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
STOPWORDS = {"a", "an", "the", "after", "for", "of", "is", "within", "when"}

def tokens(text: str) -> list[str]:
    return [
        token
        for token in TOKEN_RE.findall(text.lower())
        if token not in STOPWORDS
    ]

def bm25_rank(
    query: str,
    chunks: list[PolicyChunk],
    top_k: int = 2,
    k1: float = 1.2,
    b: float = 0.75,
) -> list[tuple[PolicyChunk, float]]:
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    if not chunks:
        return []

    doc_tokens = {chunk.chunk_id: tokens(chunk.text) for chunk in chunks}
    avgdl = sum(len(value) for value in doc_tokens.values()) / len(chunks)
    query_terms = tokens(query)
    ranked: list[tuple[PolicyChunk, float]] = []

    for chunk in chunks:
        document = doc_tokens[chunk.chunk_id]
        score = 0.0
        for term in query_terms:
            term_count = document.count(term)
            if term_count == 0:
                continue
            containing_docs = sum(term in value for value in doc_tokens.values())
            idf = log(1 + (len(chunks) - containing_docs + 0.5) / (containing_docs + 0.5))
            numerator = term_count * (k1 + 1)
            denominator = term_count + k1 * (1 - b + b * len(document) / avgdl)
            score += idf * numerator / denominator
        if score > 0:
            ranked.append((chunk, score))

    return sorted(ranked, key=lambda item: (-item[1], item[0].chunk_id))[:top_k]

EXACT = "RPL-14"
PARAPHRASE = "swap a broken reconditioned notebook"
bm25_exact = bm25_rank(EXACT, permitted)
bm25_paraphrase = bm25_rank(PARAPHRASE, permitted)

print("BM25 exact:", [chunk.chunk_id for chunk, _ in bm25_exact])
print("BM25 paraphrase:", [chunk.chunk_id for chunk, _ in bm25_paraphrase])
assert bm25_exact[0][0].chunk_id == "eu-refurb-v2-rule"
assert bm25_paraphrase == []
```

```text
BM25 exact: ['eu-refurb-v2-rule']
BM25 paraphrase: []
```

BM25 did its job. It recovered the policy from its code and openly failed when the user used no policy vocabulary. A real evaluation set needs both query types; otherwise the lexical lane can look perfect while customers miss evidence.
BM25 is not the only sparse option. SPLADE learns sparse expansion weights, so a chunk can gain indexable related terms while preserving sparse retrieval infrastructure.[2] That can improve vocabulary mismatch cases, but it doesn't turn sparse retrieval into an authorization layer or guarantee better recall on ShopFlow queries. Evaluate a SPLADE candidate against the same permitted corpus, held-out required-evidence IDs, latency budget, and hidden-source exclusions before replacing BM25.
```python
required_text = next(
    chunk.text for chunk in permitted if chunk.chunk_id == "eu-refurb-v2-rule"
)
exact_overlap = sorted(set(tokens(EXACT)) & set(tokens(required_text)))
paraphrase_overlap = sorted(set(tokens(PARAPHRASE)) & set(tokens(required_text)))
print("Exact overlap:", exact_overlap)
print("Paraphrase overlap:", paraphrase_overlap)
assert exact_overlap == ["rpl-14"]
assert paraphrase_overlap == []
```

```text
Exact overlap: ['rpl-14']
Paraphrase overlap: []
```

A dense retriever encodes queries and chunks as compact vectors, then retrieves chunks with high similarity. Dense Passage Retrieval (DPR), for example, uses separate encoders for questions and passages so passage representations can be indexed before requests arrive.[3]
Downloading and training an encoder would hide the retrieval mechanics in this lab. Instead, the next cell uses frozen three-dimensional vectors as test fixtures. Read them as outputs already produced by an embedding model:
| Dimension | Meaning in this fixture |
|---|---|
| 1 | Refurbished-device replacement intent |
| 2 | Footwear return intent |
| 3 | Lost-carrier refund intent |
This fixture is deliberately honest about one failure: the internal code RPL-14 has no semantic vector by itself. The paraphrase does.
```python
Vector = tuple[float, float, float]

DOCUMENT_VECTORS: dict[str, Vector] = {
    "eu-refurb-v2-rule": (1.00, 0.00, 0.00),
    "eu-footwear-v1-rule": (0.00, 1.00, 0.00),
    "eu-carrier-loss-v1": (0.00, 0.00, 1.00),
}
QUERY_VECTORS: dict[str, Vector] = {
    EXACT: (0.00, 0.00, 0.00),
    PARAPHRASE: (0.98, 0.05, 0.00),
    "damaged refurbished laptop replacement after delivery": (0.96, 0.15, 0.02),
}

def cosine(left: Vector, right: Vector) -> float:
    left_norm = sqrt(sum(value * value for value in left))
    right_norm = sqrt(sum(value * value for value in right))
    if left_norm == 0 or right_norm == 0:
        return 0.0
    return sum(a * b for a, b in zip(left, right)) / (left_norm * right_norm)

def dense_rank(
    query: str,
    chunks: list[PolicyChunk],
    top_k: int = 2,
) -> list[tuple[PolicyChunk, float]]:
    query_vector = QUERY_VECTORS.get(query, (0.0, 0.0, 0.0))
    ranked = [
        (chunk, cosine(query_vector, DOCUMENT_VECTORS[chunk.chunk_id]))
        for chunk in chunks
    ]
    return sorted(
        [(chunk, score) for chunk, score in ranked if score > 0],
        key=lambda item: (-item[1], item[0].chunk_id),
    )[:top_k]

dense_exact = dense_rank(EXACT, permitted)
dense_paraphrase = dense_rank(PARAPHRASE, permitted)
print("Dense exact:", [chunk.chunk_id for chunk, _ in dense_exact])
print("Dense paraphrase:", [chunk.chunk_id for chunk, _ in dense_paraphrase])
assert dense_exact == []
assert dense_paraphrase[0][0].chunk_id == "eu-refurb-v2-rule"
```

```text
Dense exact: []
Dense paraphrase: ['eu-refurb-v2-rule', 'eu-footwear-v1-rule']
```

The fixture doesn't claim that every production encoder misses every identifier. It establishes a regression case: this chosen encoder representation doesn't recover the code-only query, so deleting the sparse lane would fail a known requirement.
BM25 scores and cosine similarities don't share units. A BM25 value reflects term statistics in this index; a cosine value reflects vector alignment. Adding raw values can let whichever scale is numerically larger control the order.
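A small sketch of that scale problem, using invented scores for two hypothetical chunks (`chunk-a`, `chunk-b`). The numbers are assumptions chosen only to show the effect: the sparse lane slightly prefers one chunk, the dense lane strongly prefers the other, and the raw sum simply follows the lane with the larger numeric range.

```python
# Invented raw scores for two hypothetical permitted chunks.
bm25_scores = {"chunk-a": 8.0, "chunk-b": 7.0}      # unbounded, index-dependent
cosine_scores = {"chunk-a": 0.20, "chunk-b": 0.90}  # bounded by 1.0

def order(scores: dict[str, float]) -> list[str]:
    """Return chunk ids from highest to lowest score, ties broken by id."""
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))

raw_sum = {
    chunk_id: bm25_scores[chunk_id] + cosine_scores[chunk_id]
    for chunk_id in bm25_scores
}

print("BM25 order:   ", order(bm25_scores))
print("Cosine order: ", order(cosine_scores))
print("Raw-sum order:", order(raw_sum))
```

With these numbers the raw-sum order matches the BM25 order even though the dense lane disagrees by a wide margin, which is the failure a rank-based method sidesteps.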
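Reciprocal Rank Fusion avoids that comparison. For a chunk $d$, every ranked list $r$ in which it appears contributes the reciprocal of a constant $k$ plus the chunk's rank in that list:

$$
\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}
$$

Ranks start at 1, and a list that doesn't contain $d$ contributes nothing.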
We use k=60, the setting reported in the original RRF experiments, as a starting value rather than a universal optimum.[4] A chunk found by both lanes gains two contributions; a strong result found by one lane remains eligible.
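The arithmetic below is a sketch of why k is worth treating as a tunable setting. It compares a chunk ranked first by only one lane against a chunk ranked fourth by both lanes; the helper `rrf_score` and the rank positions are illustrative assumptions, not results from the lab.

```python
def rrf_score(ranks: list[int], k: int) -> float:
    """Sum reciprocal-rank contributions for one chunk across lanes."""
    return sum(1 / (k + rank) for rank in ranks)

for k in (1, 60):
    single_lane_top = rrf_score([1], k)       # rank 1 in one lane only
    both_lanes_fourth = rrf_score([4, 4], k)  # rank 4 in both lanes
    winner = "both-lane chunk" if both_lanes_fourth > single_lane_top else "single-lane chunk"
    print(f"k={k:>2}  single={single_lane_top:.4f}  both={both_lanes_fourth:.4f}  -> {winner}")
```

With k=60 the contributions are nearly flat across ranks, so agreement between lanes outweighs one top position. With a small k the top rank dominates. Which behavior suits a corpus is a question for labeled evaluation.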
```python
RRF_K = 60

def reciprocal_rank_fusion(
    result_lists: list[list[tuple[PolicyChunk, float]]],
    k: int = RRF_K,
) -> list[tuple[PolicyChunk, float]]:
    if k <= 0:
        raise ValueError("k must be positive")
    by_id: dict[str, PolicyChunk] = {}
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, (chunk, _raw_score) in enumerate(results, start=1):
            by_id[chunk.chunk_id] = chunk
            scores[chunk.chunk_id] = scores.get(chunk.chunk_id, 0.0) + 1 / (k + rank)
    return sorted(
        [(by_id[chunk_id], score) for chunk_id, score in scores.items()],
        key=lambda item: (-item[1], item[0].chunk_id),
    )

def hybrid_rank(
    query: str,
    caller: Caller,
    chunks: list[PolicyChunk],
    top_k: int = 2,
) -> list[tuple[PolicyChunk, float]]:
    # Authorization and freshness filtering happen before either lane runs.
    searchable = permitted_chunks(caller, chunks, EVAL_DATE)
    fused = reciprocal_rank_fusion(
        [bm25_rank(query, searchable, top_k), dense_rank(query, searchable, top_k)]
    )
    return fused[:top_k]

SHARED_WORDS = "damaged refurbished laptop replacement after delivery"
for query in [EXACT, PARAPHRASE, SHARED_WORDS]:
    hits = hybrid_rank(query, LUNA, CHUNKS)
    print(query, "->", [chunk.chunk_id for chunk, _ in hits])

shared_fused = hybrid_rank(SHARED_WORDS, LUNA, CHUNKS)
assert shared_fused[0][0].chunk_id == "eu-refurb-v2-rule"
assert shared_fused[0][1] == 2 / 61
```

```text
RPL-14 -> ['eu-refurb-v2-rule']
swap a broken reconditioned notebook -> ['eu-refurb-v2-rule', 'eu-footwear-v1-rule']
damaged refurbished laptop replacement after delivery -> ['eu-refurb-v2-rule', 'eu-footwear-v1-rule']
```

RRF doesn't manufacture relevance. It makes the two candidate sources interoperable. If both lanes miss the right chunk, a fused list will still be wrong.
The stored VIP policy contains a unique code. If the permission boundary moved after retrieval, BM25 would have an easy hidden hit to surface. A hybrid implementation must return nothing for Luna's request for that code.
```python
NO_ACCESS = Caller("visitor-9000", "APAC", frozenset())
attack_hits = hybrid_rank("VIP-RPL-1", LUNA, CHUNKS)
attack_ids = [chunk.chunk_id for chunk, _ in attack_hits]
no_access_hits = hybrid_rank(EXACT, NO_ACCESS, CHUNKS)
print("Visible candidates for hidden code:", attack_ids)
print("Visible candidates without corpus access:", no_access_hits)
assert "merchant-vip-refurb" not in attack_ids
assert attack_ids == []
assert no_access_hits == []
```

```text
Visible candidates for hidden code: []
Visible candidates without corpus access: []
```

In the previous lesson, answer quality depended on retrieving current permitted evidence. That means the retrieval upgrade needs its own release cases before you measure generated prose.
Recall@2 answers a narrow question: for each supported query, did the correct permitted chunk appear in the first two candidates? It does not say whether the evidence order is perfect or whether the final answer is faithful. Those are later checks. Here, recall exposes whether the generator even gets a chance to see the right policy.
```python
@dataclass(frozen=True)
class RetrievalCase:
    name: str
    query: str
    expected_chunk_id: str

CASES = [
    RetrievalCase("exact-code", EXACT, "eu-refurb-v2-rule"),
    RetrievalCase("paraphrase", PARAPHRASE, "eu-refurb-v2-rule"),
    RetrievalCase("shared-language", SHARED_WORDS, "eu-refurb-v2-rule"),
]

def recall_at_2(rank_fn) -> float:
    recovered = 0
    for case in CASES:
        ids = [chunk.chunk_id for chunk, _ in rank_fn(case.query)]
        recovered += case.expected_chunk_id in ids[:2]
    return recovered / len(CASES)

bm25_recall = recall_at_2(lambda query: bm25_rank(query, permitted))
dense_recall = recall_at_2(lambda query: dense_rank(query, permitted))
hybrid_recall = recall_at_2(lambda query: hybrid_rank(query, LUNA, CHUNKS))

requested_hidden_rule = hybrid_rank("VIP-RPL-1", LUNA, CHUNKS)
visible_ids = [chunk.chunk_id for chunk, _ in requested_hidden_rule]
safety_pass = (
    "merchant-vip-refurb" not in visible_ids
    and "eu-refurb-v1-rule" not in visible_ids
)

print(f"BM25 Recall@2: {bm25_recall:.2f}")
print(f"Dense Recall@2: {dense_recall:.2f}")
print(f"Hybrid RRF Recall@2: {hybrid_recall:.2f}")
print("Safety gate:", safety_pass)
assert hybrid_recall == 1.0
assert hybrid_recall > bm25_recall
assert hybrid_recall > dense_recall
assert safety_pass
```

```text
BM25 Recall@2: 0.67
Dense Recall@2: 0.67
Hybrid RRF Recall@2: 1.00
Safety gate: True
```

These three fixtures demonstrate complementary failures; they don't prove an offline lift for a production corpus. A release decision needs a held-out set drawn from real support requests, including exact codes, paraphrases, unsupported questions, languages served by the product, and attempts to request hidden policies.
| Gate | What to freeze | Failure meaning |
|---|---|---|
| Permitted Recall@k | Query and required current chunk ID | Correct evidence never reaches context selection |
| Restricted-source exclusion | Queries that strongly match hidden chunks | Retriever boundary is unsafe |
| Superseded-source exclusion | Queries matching old policy wording | Freshness filter regressed |
| Abstention cases | Questions with no permitted supporting evidence | Retrieval or answer layer overreaches |
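One way to freeze the four gates is as data plus a single checker, sketched below. Everything here is an assumption made for illustration: the `GateCase` shape, the `run_gates` helper, the two extra queries, and the canned `fixture_ranker`, which stands in for a real hybrid ranker so the block runs on its own.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class GateCase:
    gate: str
    query: str
    required_id: str | None  # None: the retriever should return no candidates
    forbidden_ids: frozenset[str] = frozenset()

def run_gates(
    cases: list[GateCase],
    rank_fn: Callable[[str], list[str]],
    k: int = 2,
) -> dict[str, bool]:
    """Return pass/fail per gate; one failing case fails its whole gate."""
    passed: dict[str, bool] = {}
    for case in cases:
        ids = rank_fn(case.query)[:k]
        ok = not (set(ids) & case.forbidden_ids)
        if case.required_id is None:
            ok = ok and not ids
        else:
            ok = ok and case.required_id in ids
        passed[case.gate] = passed.get(case.gate, True) and ok
    return passed

HIDDEN = frozenset({"merchant-vip-refurb", "eu-refurb-v1-rule"})
GATE_CASES = [
    GateCase("permitted-recall", "RPL-14", "eu-refurb-v2-rule", HIDDEN),
    GateCase("restricted-exclusion", "VIP-RPL-1", None, HIDDEN),
    GateCase("superseded-exclusion", "refurbished laptop return within 30 days",
             "eu-refurb-v2-rule", HIDDEN),
    GateCase("abstention", "gift card expiry", None, HIDDEN),
]

# Canned results standing in for a real ranker (assumed, not measured).
CANNED = {
    "RPL-14": ["eu-refurb-v2-rule"],
    "refurbished laptop return within 30 days": ["eu-refurb-v2-rule", "eu-footwear-v1-rule"],
}

def fixture_ranker(query: str) -> list[str]:
    return CANNED.get(query, [])

print(run_gates(GATE_CASES, fixture_ranker))
```

Keeping required and forbidden IDs in the same case lets a single query prove both recall and exclusion, so a ranker that finds the right chunk while leaking a hidden one still fails.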
When a final answer is wrong, you need to tell apart three failures:
| Failure | Trace evidence | Next repair |
|---|---|---|
| Retrieval miss | Expected chunk absent from sparse, dense, and fused candidates | Improve indexing, encoder, query handling, or fusion |
| Fusion ordering issue | Expected chunk exists in a lane but falls below context budget | Tune fusion on held-out labels |
| Later precision issue | Correct chunk is in fused candidates but distractors rank above it | Add and evaluate the reranking stage in the next lesson |
Store IDs, ranks, model and index versions, fusion settings, and timing. Don't log policy text in a broad diagnostic event.
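The triage table above can be turned into a small check over stored IDs. This is a sketch under assumptions: the function name `classify_failure`, the label strings, and the `context_budget` parameter (how many fused candidates reach context) are hypothetical, and the inputs are ID lists only, in line with the advice not to log policy text.

```python
def classify_failure(
    expected_id: str,
    sparse_ids: list[str],
    dense_ids: list[str],
    fused_ids: list[str],
    context_budget: int,
) -> str:
    """Map trace IDs to one of the three failure classes, or 'retrieval ok'."""
    in_a_lane = expected_id in sparse_ids or expected_id in dense_ids
    if not in_a_lane and expected_id not in fused_ids:
        return "retrieval miss"
    if expected_id not in fused_ids[:context_budget]:
        return "fusion ordering issue"
    if fused_ids[0] != expected_id:
        return "later precision issue"
    return "retrieval ok"

# Hypothetical traces for the same expected chunk.
expected = "eu-refurb-v2-rule"
print(classify_failure(expected, [], [], [], 2))
print(classify_failure(
    expected,
    ["eu-carrier-loss-v1", "eu-footwear-v1-rule", expected],
    ["eu-footwear-v1-rule", "eu-carrier-loss-v1"],
    ["eu-footwear-v1-rule", "eu-carrier-loss-v1", expected],
    2,
))
print(classify_failure(
    expected,
    [expected],
    ["eu-footwear-v1-rule", expected],
    ["eu-footwear-v1-rule", expected],
    2,
))
```

The order of the checks matters: a chunk that no lane found cannot be a fusion problem, and a chunk that never reached the context budget cannot be a reranking problem.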
```python
def trace_hybrid_request(
    query: str,
    query_kind: str,
    caller: Caller,
) -> dict[str, object]:
    searchable = permitted_chunks(caller, CHUNKS, EVAL_DATE)
    sparse = bm25_rank(query, searchable)
    dense = dense_rank(query, searchable)
    fused = reciprocal_rank_fusion([sparse, dense])
    return {
        "versions": {
            "retriever": "policy-retriever-v2",
            "index": "policy-index/2026-05-27",
            "sparse": "bm25-tokenizer-v1",
            "dense": "fixture-embeddings-v1",
            "fusion": f"rrf-k{RRF_K}",
        },
        "query_kind": query_kind,
        "sparse_ids": [chunk.chunk_id for chunk, _ in sparse],
        "dense_ids": [chunk.chunk_id for chunk, _ in dense],
        "fused_ids": [chunk.chunk_id for chunk, _ in fused[:2]],
        "timings_ms": {"authorize": 2, "bm25": 4, "dense": 11, "fusion": 1},
    }

trace = trace_hybrid_request(PARAPHRASE, "paraphrase-regression", LUNA)
stores_raw_policy_text = any(
    chunk.text in str(trace)
    for chunk in CHUNKS
)
print("Versions:", trace["versions"])
print("Sparse ids:", trace["sparse_ids"])
print("Dense ids:", trace["dense_ids"])
print("Fused ids:", trace["fused_ids"])
print("Trace stores raw policy text:", stores_raw_policy_text)
assert trace["fused_ids"][0] == "eu-refurb-v2-rule"
assert "eu-footwear-v1-rule" in trace["fused_ids"]
assert "merchant-vip-refurb" not in str(trace)
assert not stores_raw_policy_text
```

```text
Versions: {'retriever': 'policy-retriever-v2', 'index': 'policy-index/2026-05-27', 'sparse': 'bm25-tokenizer-v1', 'dense': 'fixture-embeddings-v1', 'fusion': 'rrf-k60'}
Sparse ids: []
Dense ids: ['eu-refurb-v2-rule', 'eu-footwear-v1-rule']
Fused ids: ['eu-refurb-v2-rule', 'eu-footwear-v1-rule']
Trace stores raw policy text: False
```

The correct evidence is present, but the semantic lane also kept a footwear distractor. That is exactly the boundary between this lesson and the next: retrieval satisfies candidate recall; reranking decides whether a distractor should remain near context. The trace can also preserve the latency budget created in the production RAG lesson. Two retrieval lanes add work, so the release check should report that cost explicitly.
If a context budget is being wasted by several near-duplicate candidates, Maximum Marginal Relevance (MMR) is one selection strategy: choose a relevant result while penalizing candidates too similar to what has already been selected.[5] MMR handles diversity in an existing permitted candidate set. It doesn't retrieve a missing policy and doesn't replace a cross-encoder that must compare query relevance precisely.
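A minimal MMR sketch follows, assuming cosine similarity for both the relevance and redundancy terms and a trade-off weight `lam`. The two-dimensional vectors, the chunk IDs, and the helper names are invented for illustration; the candidate set is assumed to be already permission-filtered.

```python
from math import sqrt

Vector = tuple[float, ...]

def cosine(left: Vector, right: Vector) -> float:
    left_norm = sqrt(sum(v * v for v in left))
    right_norm = sqrt(sum(v * v for v in right))
    if left_norm == 0 or right_norm == 0:
        return 0.0
    return sum(a * b for a, b in zip(left, right)) / (left_norm * right_norm)

def mmr_select(
    query_vec: Vector,
    candidates: dict[str, Vector],
    top_k: int,
    lam: float = 0.5,
) -> list[str]:
    """Greedy MMR: lam * relevance - (1 - lam) * max similarity to picks."""
    selected: list[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < top_k:
        def mmr_score(chunk_id: str) -> float:
            relevance = cosine(query_vec, remaining[chunk_id])
            redundancy = max(
                (cosine(remaining[chunk_id], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(sorted(remaining), key=mmr_score)  # sorted ids: stable ties
        selected.append(best)
        del remaining[best]
    return selected

# Invented fixture: two near-duplicate chunks and one different chunk.
QUERY = (1.0, 0.2)
CANDIDATES = {
    "refurb-rule": (0.99, 0.10),
    "refurb-rule-near-duplicate": (1.00, 0.00),
    "footwear-rule": (0.60, 0.80),
}
print("lam=1.0:", mmr_select(QUERY, CANDIDATES, top_k=2, lam=1.0))
print("lam=0.5:", mmr_select(QUERY, CANDIDATES, top_k=2, lam=0.5))
```

With `lam=1.0` the selection is pure relevance and keeps both near-duplicates. With `lam=0.5` the second slot goes to the less similar chunk. Whether that trade is desirable depends on the context budget and on labeled evaluation, not on the fixture.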
```python
RETRIEVAL_BUDGET_MS = {"authorize": 10, "bm25": 12, "dense": 40, "fusion": 8}

def exceeded_retrieval_budgets(timings: dict[str, int]) -> list[str]:
    # A stage with no recorded timing counts as a failure, not a pass.
    return [
        stage
        for stage, budget in RETRIEVAL_BUDGET_MS.items()
        if stage not in timings or timings[stage] > budget
    ]

healthy = trace["timings_ms"]
missing_dense = {
    stage: elapsed
    for stage, elapsed in healthy.items()
    if stage != "dense"
}
print("Healthy exceeded:", exceeded_retrieval_budgets(healthy))
print("Missing timing exceeded:", exceeded_retrieval_budgets(missing_dense))
assert exceeded_retrieval_budgets(healthy) == []
assert exceeded_retrieval_budgets(missing_dense) == ["dense"]
```

```text
Healthy exceeded: []
Missing timing exceeded: ['dense']
```

RRF is a good first implementation because it doesn't require calibrating unrelated score scales. It isn't an automatic winner. Bruch et al. found that RRF can be sensitive to its parameter and that a tuned convex combination can outperform it in their tested settings.[6] If you have enough labeled queries, compare it against normalized weighted fusion:

$$
s_{\text{fused}}(d) = \alpha \, \tilde{s}_{\text{dense}}(d) + (1 - \alpha) \, \tilde{s}_{\text{sparse}}(d), \qquad 0 \le \alpha \le 1
$$

Here $\tilde{s}$ is a lane's score after normalization to a common range, and $\alpha$ sets how much weight the dense lane receives.
That comparison is an evaluation task, not a reason to guess an alpha in production. Keep a fixed held-out split, version the encoder and index, report Recall@k and latency for every candidate, and retain RRF if a tuned method doesn't hold up out of sample.
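For the comparison itself, a candidate implementation of normalized weighted fusion might look like the sketch below. It assumes min-max normalization within each lane and treats a chunk missing from a lane as scoring 0 there; both choices, the helper names, and the example scores are illustrative assumptions rather than the method from the cited paper.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one lane's scores to [0, 1]; a constant lane maps to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (s - low) / (high - low) for chunk_id, s in scores.items()}

def convex_fusion(
    sparse: dict[str, float],
    dense: dict[str, float],
    alpha: float,
) -> list[tuple[str, float]]:
    """alpha weights the dense lane, (1 - alpha) the sparse lane."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    fused = {
        chunk_id: alpha * dense_n.get(chunk_id, 0.0)
        + (1 - alpha) * sparse_n.get(chunk_id, 0.0)
        for chunk_id in set(sparse) | set(dense)
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Invented lane scores for hypothetical permitted chunks.
SPARSE = {"chunk-a": 8.0, "chunk-b": 7.0, "chunk-c": 2.0}
DENSE = {"chunk-a": 0.20, "chunk-b": 0.90, "chunk-d": 0.50}
for alpha in (0.0, 0.5, 1.0):
    print(alpha, [chunk_id for chunk_id, _ in convex_fusion(SPARSE, DENSE, alpha)])
```

Min-max scaling gives a lane's lowest-scoring candidate the same 0 as a chunk the lane never returned, which is one reason the normalization choice belongs in the evaluation alongside alpha.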
Extend policy-answerer-v2 without weakening its contract:
The important artifact is not a search demo. It is a retrieval report showing which evidence questions are recovered, which must abstain, and which source boundaries remain enforced after the upgrade.
You are ready to use hybrid retrieval in a RAG system when you can:
| Level | Evidence in your submission |
|---|---|
| Foundational | Correctly ranks RPL-14 with BM25 and explains its term-based signal |
| Applied | Recovers the paraphrased device-replacement question through dense retrieval and fuses candidates with RRF |
| Strong | Reports BM25-only, dense-only, and hybrid Recall@k on labeled positive cases plus negative safety gates |
| Production-ready | Uses a versioned encoder and index, measures latency, and proves restricted or superseded policies never enter fused candidates |
| Symptom | Likely cause | Repair |
|---|---|---|
| Rule-code lookup returns a generic policy | Dense-only search lost a rare identifier | Keep or restore the lexical lane and add code queries to the release set |
| Paraphrased question returns nothing | Sparse-only search requires the policy's exact wording | Add a dense lane and test semantic queries against required evidence IDs |
| Fused ranking changes wildly after an encoder update | Raw scores or tuned weights no longer have the same calibration | Compare against RRF and retune only on a fixed labeled split |
| Hidden merchant rule appears in any candidate trace | Retrieval ran before authorization filtering | Restrict the searchable candidate universe before either lane executes |
| Team blames generation for an unsupported answer | Retrieval evidence IDs were never evaluated | Measure permitted Recall@k and abstention before scoring final text |
[1] Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
[2] Formal, T., et al. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. SIGIR 2021.
[3] Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.
[4] Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR '09.
[5] Carbonell, J., & Goldstein, J. (1998). The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998.
[6] Bruch, S., Gai, S., & Ingber, A. (2023). An Analysis of Fusion Functions for Hybrid Retrieval. ACM Transactions on Information Systems.