Build and evaluate the evidence-selection stage of a support assistant with BM25, dense similarity, rank fusion, reranking, and approximate search audits.
In the last chapter, you inspected whether unlabeled embeddings formed meaningful neighborhoods. A support assistant needs a tougher guarantee: given one customer question, can it place the policy page that answers the question near the top? An LLM can't quote a rule that retrieval never places in its context.
We'll build that evidence-selection stage around one query and four help-center passages. You will implement lexical ranking with BM25, dense ranking with cosine similarity, rank fusion, retrieval evaluation, and an approximate-search failure test. By the end, a fluent wrong answer won't look like a mysterious model failure. You'll know how to inspect the evidence gate first.
Customer question:
How do I get a refund for an annual plan?
The support assistant may retrieve these passages:
| Doc | Passage | Relevance judgment |
|---|---|---|
d1 | "Annual plan refund policy. Request a refund within 30 days of purchase." | Direct answer. |
d2 | "Cancel during your first month and we will return your payment." | Relevant paraphrase. |
d3 | "Update your billing address in account settings." | Not relevant. |
d4 | "Refund status for duplicate charges. Refunds usually appear in 5 to 10 days." | Related word, wrong problem. |
The distinction between d2 and d4 matters. A page may mention refunds and still fail to answer this refund question. Retrieval evaluation needs judgments tied to a question, not a general document category.
Sparse retrieval treats documents as words and counts. It first builds an inverted index, a map from a term to documents containing that term. This is why exact policy phrases, order codes, and error identifiers can be found efficiently.
We'll lowercase words, remove punctuation, and discard a short stopword list so the query keeps three discriminating terms: refund, annual, and plan.
1import re
2from collections import defaultdict
3
4docs = {
5 "d1": "Annual plan refund policy. Request a refund within 30 days of purchase.",
6 "d2": "Cancel during your first month and we will return your payment.",
7 "d3": "Update your billing address in account settings.",
8 "d4": "Refund status for duplicate charges. Refunds usually appear in 5 to 10 days.",
9}
10query = "How do I get a refund for an annual plan?"
11stopwords = {"a", "an", "and", "do", "for", "get", "how", "i", "in", "the", "to", "within", "your"}
12
13def tokenize(text: str) -> list[str]:
14 return [word for word in re.findall(r"[a-z]+", text.lower()) if word not in stopwords]
15
16postings = defaultdict(list)
17for doc_id, passage in docs.items():
18 for term in sorted(set(tokenize(passage))):
19 postings[term].append(doc_id)
20
21query_terms = tokenize(query)
22print("query terms:", query_terms)
23for term in query_terms:
24 print(f"{term:6} -> {postings[term]}")1query terms: ['refund', 'annual', 'plan']
2refund -> ['d1', 'd4']
3annual -> ['d1']
4plan -> ['d1']d2 is judged relevant, but it doesn't appear in any of these three posting lists. That isn't a bug in lexical retrieval. It is the limitation we will need dense retrieval to cover.
BM25 ranks passages from term counts while preventing two easy distortions: repeating one word forever and rewarding long documents merely for having more chances to contain a word. For each query term, its score contribution is:
In this equation, is the count of term in passage ; is the passage length; is the average passage length; and inverse document frequency (IDF) gives more weight to rarer terms. controls term-frequency saturation and controls length normalization. This is the BM25 family presented by Robertson and Zaragoza.[1]
BM25 has implementation variants. The code below uses the always-positive IDF term , where is the number of passages and is the number containing the term. A query term still contributes a non-negative weight when it appears in most passages.
The first full ranker uses only Python data structures. It is short enough to inspect and precise enough to reproduce each score.
1import math
2import re
3from collections import Counter
4
5docs = {
6 "d1": "Annual plan refund policy. Request a refund within 30 days of purchase.",
7 "d2": "Cancel during your first month and we will return your payment.",
8 "d3": "Update your billing address in account settings.",
9 "d4": "Refund status for duplicate charges. Refunds usually appear in 5 to 10 days.",
10}
11query = "How do I get a refund for an annual plan?"
12stopwords = {"a", "an", "and", "do", "for", "get", "how", "i", "in", "the", "to", "within", "your"}
13
14def tokenize(text):
15 return [word for word in re.findall(r"[a-z]+", text.lower()) if word not in stopwords]
16
17tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
18query_terms = tokenize(query)
19average_length = sum(len(row) for row in tokens.values()) / len(tokens)
20document_frequency = Counter(term for row in tokens.values() for term in set(row))
21
22def bm25(doc_id, k1=1.2, b=0.75):
23 score = 0.0
24 counts = Counter(tokens[doc_id])
25 for term in query_terms:
26 frequency = counts[term]
27 if frequency == 0:
28 continue
29 df = document_frequency[term]
30 idf = math.log(1 + (len(docs) - df + 0.5) / (df + 0.5))
31 denominator = frequency + k1 * (1 - b + b * len(tokens[doc_id]) / average_length)
32 score += idf * frequency * (k1 + 1) / denominator
33 return score
34
35ranking = sorted(((doc_id, bm25(doc_id)) for doc_id in docs), key=lambda item: (-item[1], item[0]))
36for rank, (doc_id, score) in enumerate(ranking, start=1):
37 print(f"#{rank} {doc_id}: {score:.3f}")1#1 d1: 3.128
2#2 d4: 0.675
3#3 d2: 0.000
4#4 d3: 0.000BM25 should put d1 first: it contains all three exact query terms. It should rank d4 above d2 because d4 contains refund, although a human judgment says d2 is the better answer. This is a retrieval failure mode worth keeping visible.
Copying a query word repeatedly shouldn't make a weak page unbeatable. In BM25, term frequency raises the contribution but with diminishing returns. The next lab isolates that factor by holding IDF and length normalization fixed.
1def term_weight(frequency: int, k1: float = 1.2) -> float:
2 return frequency * (k1 + 1) / (frequency + k1)
3
4limit = 1.2 + 1
5for frequency in (1, 2, 4, 20):
6 weight = term_weight(frequency)
7 remaining = limit - weight
8 print(f"refund count={frequency:2}: tf weight={weight:.3f} gap to limit={remaining:.3f}")1refund count= 1: tf weight=1.000 gap to limit=1.200
2refund count= 2: tf weight=1.375 gap to limit=0.825
3refund count= 4: tf weight=1.692 gap to limit=0.508
4refund count=20: tf weight=2.075 gap to limit=0.125The weight grows toward the limit , but the remaining gap shrinks. BM25 still requires good tokenization and good documents; it just stops keyword stuffing from behaving like relevance.
The previous chapter taught you to audit vector geometry. Dense retrieval applies that geometry to a query: embed the query and each passage independently, then rank passages by similarity. Dense Passage Retrieval demonstrated this bi-encoder pattern for passage ranking, using query and passage representations scored by inner product.[2]
To see the mechanics, use three interpretable coordinates rather than a real embedding model:
| Item | Vector | Intended meaning |
|---|---|---|
| query | [1.0, 0.8, 0.0] | refund plus annual-plan intent |
d1 | [1.0, 0.4, 0.0] | exact refund page |
d2 | [0.9, 0.9, 0.0] | paraphrased refund page |
d3 | [0.0, 0.2, 1.0] | billing settings |
d4 | [0.4, 0.0, 0.3] | related refund status |
Cosine similarity compares direction:
For d2, the numerator is . Dividing by the two lengths gives approximately 0.994, even though d2 didn't use the literal word refund.
1import numpy as np
2
3vectors = {
4 "query": np.array([1.0, 0.8, 0.0]),
5 "d1": np.array([1.0, 0.4, 0.0]),
6 "d2": np.array([0.9, 0.9, 0.0]),
7 "d3": np.array([0.0, 0.2, 1.0]),
8 "d4": np.array([0.4, 0.0, 0.3]),
9}
10
11def cosine(left, right):
12 return float(left @ right / (np.linalg.norm(left) * np.linalg.norm(right)))
13
14ranking = sorted(
15 ((doc_id, cosine(vectors["query"], vector)) for doc_id, vector in vectors.items() if doc_id != "query"),
16 key=lambda item: (-item[1], item[0]),
17)
18for rank, (doc_id, score) in enumerate(ranking, start=1):
19 print(f"#{rank} {doc_id}: {score:.3f}")1#1 d2: 0.994
2#2 d1: 0.957
3#3 d4: 0.625
4#4 d3: 0.123Dense ranking promotes d2, the paraphrase that sparse ranking could not see. It also retains d4 as partially similar, so dense retrieval doesn't remove the need for judgments or reranking.
Raw dot product and cosine similarity produce the same order only when vector norms are controlled. A larger vector can win dot product despite pointing in a worse direction.
1import numpy as np
2
3query = np.array([1.0, 0.8, 0.0])
4candidates = {
5 "aligned_paraphrase": np.array([1.0, 0.8, 0.0]),
6 "large_partial_match": np.array([6.0, 0.0, 0.0]),
7}
8
9dot_ranking = sorted(candidates, key=lambda name: -(query @ candidates[name]))
10cosine_ranking = sorted(
11 candidates,
12 key=lambda name: -(query @ candidates[name]) / (np.linalg.norm(query) * np.linalg.norm(candidates[name])),
13)
14
15print("dot-product winner:", dot_ranking[0])
16print("cosine winner: ", cosine_ranking[0])1dot-product winner: large_partial_match
2cosine winner: aligned_paraphraseThis doesn't mean cosine is always right and dot product is always wrong. It means index construction, offline evaluation, and online queries must agree on the metric the embedding model is intended to use.
For this question, sparse and dense rankings contain complementary evidence:
| Lane | Ranking | Strength |
|---|---|---|
| BM25 | d1, d4, d2, d3 | protects exact terms such as annual plan |
| Dense cosine | d2, d1, d4, d3 | rescues the paraphrase return your payment |
Do not average a BM25 number such as 3.128 with a cosine number such as 0.994. They have unrelated units. Reciprocal rank fusion (RRF) uses rank positions instead:
Here, is the set of ranked lists and is document 's position in list . If one lane doesn't return , that lane contributes nothing for . Cormack, Clarke, and Buettcher proposed RRF and used k=60 as a fixed constant in their reported experiments.[3]
1from collections import defaultdict
2
3bm25 = ["d1", "d4", "d2", "d3"]
4dense = ["d2", "d1", "d4", "d3"]
5
6def rrf(rankings, k=60):
7 scores = defaultdict(float)
8 for ranking in rankings:
9 for rank, document in enumerate(ranking, start=1):
10 scores[document] += 1 / (k + rank)
11 return scores
12
13scores = rrf([bm25, dense])
14fused = sorted(scores, key=lambda document: (-scores[document], document))
15
16for rank, document in enumerate(fused, start=1):
17 print(f"#{rank} {document}: {scores[document]:.6f}")1#1 d1: 0.032522
2#2 d2: 0.032266
3#3 d4: 0.032002
4#4 d3: 0.031250The paraphrased answer d2 moves above the refund-status distractor d4. The loop accumulates scores across the union of returned documents, so a dense-only candidate isn't silently discarded. RRF didn't decide that d2 was relevant; your judgment table did. RRF only gave both retrieval lanes a fair way to contribute candidates.
A dense retriever encodes a query separately from its passages so passage vectors can be stored and searched cheaply. A cross-encoder reranker instead reads a query-passage pair together and predicts relevance for that pair. Nogueira and Cho showed this second-stage BERT reranking pattern on retrieved passages.[4]
Reranking improves ordering only if the correct passage survived candidate retrieval. For a paraphrased request such as q2 return my payment, the next check makes that boundary explicit with assumed reranker scores where d2 is the best answer.
1reranker_score = {"d1": 0.55, "d2": 0.96, "d4": 0.12}
2
3def rerank(candidates):
4 return sorted(candidates, key=lambda doc: -reranker_score[doc])
5
6for name, candidates in [
7 ("sparse-only top2", ["d1", "d4"]),
8 ("hybrid top3", ["d1", "d2", "d4"]),
9]:
10 ordered = rerank(candidates)
11 recovered = "d2" in ordered and ordered[0] == "d2"
12 print(f"{name:16}: top={ordered[0]} recovered_best={recovered}")1sparse-only top2: top=d1 recovered_best=False
2hybrid top3 : top=d2 recovered_best=TrueThis is why a retrieval cascade has two separate metrics: candidate recall for the first stage and final ordering quality after reranking.
To evaluate retrieval, create queries with relevance judgments. Two useful metrics are:
For multi-relevance or graded judgments, normalized Discounted Cumulative Gain (nDCG) is also common; BEIR reports nDCG@10 across diverse information-retrieval datasets.[5]
The mini evaluation below compares a sparse-only run with a hybrid run. q2 uses paraphrased wording, so the dense lane must preserve d2 near the top.
1gold = {
2 "q1 annual plan refund": {"d1", "d2"},
3 "q2 return my payment": {"d2"},
4 "q3 duplicate charge status": {"d4"},
5}
6sparse = {
7 "q1 annual plan refund": ["d1", "d4", "d2"],
8 "q2 return my payment": ["d3", "d4", "d2"],
9 "q3 duplicate charge status": ["d4", "d1", "d2"],
10}
11hybrid = {
12 "q1 annual plan refund": ["d1", "d2", "d4"],
13 "q2 return my payment": ["d2", "d1", "d4"],
14 "q3 duplicate charge status": ["d4", "d1", "d2"],
15}
16
17def evaluate(run, k=2):
18 hits = []
19 reciprocals = []
20 for question, ranking in run.items():
21 relevant = gold[question]
22 hits.append(any(doc in relevant for doc in ranking[:k]))
23 first_rank = next((rank for rank, doc in enumerate(ranking, start=1) if doc in relevant), None)
24 reciprocals.append(0.0 if first_rank is None else 1 / first_rank)
25 return sum(hits) / len(hits), sum(reciprocals) / len(reciprocals)
26
27for name, run in [("sparse", sparse), ("hybrid", hybrid)]:
28 hit_rate, mrr = evaluate(run)
29 print(f"{name:6}: hit_rate@2={hit_rate:.3f} mrr={mrr:.3f}")1sparse: hit_rate@2=0.667 mrr=0.778
2hybrid: hit_rate@2=1.000 mrr=1.000The lesson isn't that hybrid always wins. It's that an evaluated hard query reveals the weakness that an easy query hides.
Exact dense retrieval compares a query vector against every stored passage. Once a corpus grows large, an approximate nearest neighbor (ANN) index searches fewer candidates or uses compressed vectors to lower latency and memory cost.
Three building blocks appear often:
| Tool | How it avoids exact work | Risk to evaluate |
|---|---|---|
| Inverted file index (IVF) | searches only selected coarse vector partitions | correct neighbor lies in an unprobed partition |
| Hierarchical Navigable Small World (HNSW) | explores a proximity graph from long-range layers into local neighbors | limited exploration misses a closer point |
| Product quantization (PQ) | represents subvectors with codebook entries for compressed distance calculations | quantized distance changes the order |
IVF and PQ are described in Faiss's similarity-search work and PQ's original nearest-neighbor paper; HNSW uses a hierarchical navigable proximity graph for approximate search.[6][7][8]
An IVF index assigns documents to coarse partitions. At query time it searches the nearest partition, or several nearby partitions controlled by a probe budget. Near a partition boundary, the closest document may live in the next partition.
1import numpy as np
2
3documents = {
4 "generic_refund": np.array([0.0, 0.0]),
5 "annual_refund": np.array([5.1, 0.0]),
6 "shipping": np.array([9.0, 0.0]),
7}
8centroids = {0: np.array([0.0, 0.0]), 1: np.array([10.0, 0.0])}
9lists = {0: ["generic_refund"], 1: ["annual_refund", "shipping"]}
10query = np.array([4.9, 0.0])
11
12def nearest(candidates):
13 return min(candidates, key=lambda doc: np.linalg.norm(documents[doc] - query))
14
15exact = nearest(list(documents))
16ordered_lists = sorted(centroids, key=lambda bucket: np.linalg.norm(centroids[bucket] - query))
17one_probe = nearest(lists[ordered_lists[0]])
18two_probe = nearest(lists[ordered_lists[0]] + lists[ordered_lists[1]])
19
20print("exact nearest: ", exact)
21print("nprobe=1 result:", one_probe)
22print("nprobe=2 result:", two_probe)1exact nearest: annual_refund
2nprobe=1 result: generic_refund
3nprobe=2 result: annual_refundThe miss isn't a defect in vector meaning. It is an approximation decision. If exact search finds the correct evidence and ANN doesn't, tune or replace the index before changing embeddings.
One anecdote isn't an evaluation. Use a held-out query slice, compute exact nearest results, and report recall@1: how often an ANN configuration returns the same top passage.
1exact_top1 = ["annual_refund", "shipping_delay", "billing_address", "duplicate_charge"]
2ann_fast = ["generic_refund", "shipping_delay", "billing_address", "generic_refund"]
3ann_tuned = ["annual_refund", "shipping_delay", "billing_address", "duplicate_charge"]
4
5def recall_at_one(approximate, exact):
6 matches = sum(found == expected for found, expected in zip(approximate, exact))
7 return matches / len(exact)
8
9print(f"fast setting recall@1: {recall_at_one(ann_fast, exact_top1):.2f}")
10print(f"tuned setting recall@1: {recall_at_one(ann_tuned, exact_top1):.2f}")
11print("ship only after checking latency alongside recall")1fast setting recall@1: 0.50
2tuned setting recall@1: 1.00
3ship only after checking latency alongside recallLatency gains aren't free evidence. A production retriever needs both an accuracy report and a latency report for the index setting it serves.
Before letting an LLM answer from retrieved passages, inspect the stages separately:
| Stage | Question | Artifact |
|---|---|---|
| Judgments | Which passage really answers each question? | held-out query-to-passage labels |
| Sparse lane | Do exact policy terms and identifiers survive? | BM25 top-k report |
| Dense lane | Do paraphrases appear near the top? | hard paraphrase set |
| Fusion | Do both lanes contribute useful candidates? | fused ranking with lane attribution |
| Reranking | Is the best evidence ordered first? | reranked MRR or nDCG |
| ANN index | Did latency savings drop recall? | ANN-vs-exact recall and latency |
This is the first production habit of retrieval-augmented generation (RAG): evaluate the documents entering the prompt before blaming the generator for an unsupported answer.
bm25-from-scratch.py. Print each query term's document frequency and explain why annual and plan should carry more lexical evidence than refund.d5 to reciprocal-rank-fusion.py. Confirm that the fused output includes it even though BM25 never returned it.measure-hit-rate-and-mrr.py where the first relevant passage appears at rank 3. Predict how Hit Rate @ 2 and MRR change before running the script.[4.9, 0.0] to [5.2, 0.0]. Explain why nprobe=1 now recovers annual_refund.d5 contributes no reciprocal-rank term for it.3 misses Hit Rate @ 2 but contributes 1 / 3 to MRR. The metrics answer different questions.[5.2, 0.0], centroid 1 is closer than centroid 0, so the first probed list contains annual_refund. Probe failures depend on both partition assignment and query location.d1, d2, and d4 are treated differently for the same refund question, then reproduces BM25 and cosine rankings.| Symptom | Cause | Fix |
|---|---|---|
| Exact SKU or policy terms disappear after switching to dense-only search. | Semantic retrieval blurred decisive literal strings. | Retain a BM25 lane and evaluate hybrid recall. |
| A paraphrased answer never appears in the shortlist. | Sparse retrieval was asked to solve semantic matching alone. | Add dense retrieval and include paraphrase queries in evaluation. |
| Raw-score averaging makes one lane dominate fusion. | BM25 and cosine scores have unrelated units. | Use RRF or a calibrated learned combination. |
| Reranker scores look strong but answers remain unsupported. | Candidate retrieval already discarded the correct evidence. | Measure candidate Hit Rate before reranker quality. |
| Index tuning lowers latency while policies vanish from top-k. | ANN search skipped or distorted relevant vectors. | Gate ANN settings against exact-search recall and latency reports. |
Answer every question, then check your score. Score above 75% to mark this lesson complete.
9 questions remaining.
The Probabilistic Relevance Framework: BM25 and Beyond.
Robertson, S., & Zaragoza, H. · 2009 · Foundations and Trends in Information Retrieval
Dense Passage Retrieval for Open-Domain Question Answering.
Karpukhin, V., et al. · 2020 · EMNLP 2020
Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.
Cormack, G. V., Clarke, C. L. A., & Buettcher, S. · 2009 · SIGIR '09
Passage Re-ranking with BERT.
Nogueira, R. & Cho, K. · 2019 · arXiv preprint
BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.
Thakur, N., et al. · 2021 · NeurIPS 2021 Datasets and Benchmarks
Billion-scale similarity search with GPUs.
Johnson, J., Douze, M., & Jégou, H. · 2017 · arXiv preprint
Product Quantization for Nearest Neighbor Search.
Jégou, H., Douze, M., & Schmid, C. · 2011 · IEEE TPAMI 2011
Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs.
Malkov, Y. A., & Yashunin, D. A. · 2018 · IEEE Transactions on Pattern Analysis and Machine Intelligence