LeetLLM
LearnPracticeFeaturesBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Practice
  • Features
  • Blog

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 155 articles completed

🛠️Computing Foundations0/6
NumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models & Image GenerationReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnML Algorithms & EvaluationCore Retrieval Algorithms
🔍MediumRAG & Retrieval

Core Retrieval Algorithms

Build and evaluate the evidence-selection stage of a support assistant with BM25, dense similarity, rank fusion, reranking, and approximate search audits.

16 min read
Learning path
Step 34 of 155 in the full curriculum
Clustering and PCADecoding Algorithms

In the last chapter, you inspected whether unlabeled embeddings formed meaningful neighborhoods. A support assistant needs a tougher guarantee: given one customer question, can it place the policy page that answers the question near the top? An LLM can't quote a rule that retrieval never places in its context.

We'll build that evidence-selection stage around one query and four help-center passages. You will implement lexical ranking with BM25, dense ranking with cosine similarity, rank fusion, retrieval evaluation, and an approximate-search failure test. By the end, a fluent wrong answer won't look like a mysterious model failure. You'll know how to inspect the evidence gate first.

One Question, Four Possible Sources

Customer question:

How do I get a refund for an annual plan?

The support assistant may retrieve these passages:

DocPassageRelevance judgment
d1"Annual plan refund policy. Request a refund within 30 days of purchase."Direct answer.
d2"Cancel during your first month and we will return your payment."Relevant paraphrase.
d3"Update your billing address in account settings."Not relevant.
d4"Refund status for duplicate charges. Refunds usually appear in 5 to 10 days."Related word, wrong problem.

The distinction between d2 and d4 matters. A page may mention refunds and still fail to answer this refund question. Retrieval evaluation needs judgments tied to a question, not a general document category.

Query ranks candidates; the top two move into model context. Query ranks candidates; the top two move into model context.
Top-k filters candidates before model context.

Sparse Retrieval Starts With an Inverted Index

Sparse retrieval treats documents as words and counts. It first builds an inverted index, a map from a term to documents containing that term. This is why exact policy phrases, order codes, and error identifiers can be found efficiently.

We'll lowercase words, remove punctuation, and discard a short stopword list so the query keeps three discriminating terms: refund, annual, and plan.

inspect-the-inverted-index.py
1import re 2from collections import defaultdict 3 4docs = { 5 "d1": "Annual plan refund policy. Request a refund within 30 days of purchase.", 6 "d2": "Cancel during your first month and we will return your payment.", 7 "d3": "Update your billing address in account settings.", 8 "d4": "Refund status for duplicate charges. Refunds usually appear in 5 to 10 days.", 9} 10query = "How do I get a refund for an annual plan?" 11stopwords = {"a", "an", "and", "do", "for", "get", "how", "i", "in", "the", "to", "within", "your"} 12 13def tokenize(text: str) -> list[str]: 14 return [word for word in re.findall(r"[a-z]+", text.lower()) if word not in stopwords] 15 16postings = defaultdict(list) 17for doc_id, passage in docs.items(): 18 for term in sorted(set(tokenize(passage))): 19 postings[term].append(doc_id) 20 21query_terms = tokenize(query) 22print("query terms:", query_terms) 23for term in query_terms: 24 print(f"{term:6} -> {postings[term]}")
Output
1query terms: ['refund', 'annual', 'plan'] 2refund -> ['d1', 'd4'] 3annual -> ['d1'] 4plan -> ['d1']

d2 is judged relevant, but it doesn't appear in any of these three posting lists. That isn't a bug in lexical retrieval. It is the limitation we will need dense retrieval to cover.

BM25 Scores Lexical Evidence

BM25 ranks passages from term counts while preventing two easy distortions: repeating one word forever and rewarding long documents merely for having more chances to contain a word. For each query term, its score contribution is:

IDF(t)⋅f(t,d)(k1+1)f(t,d)+k1(1−b+b∣d∣/avgdl)\mathrm{IDF}(t) \cdot \frac{f(t,d)(k_1+1)} {f(t,d) + k_1(1-b+b|d|/\mathrm{avgdl})}IDF(t)⋅f(t,d)+k1​(1−b+b∣d∣/avgdl)f(t,d)(k1​+1)​

In this equation, f(t,d)f(t,d)f(t,d) is the count of term ttt in passage ddd; ∣d∣|d|∣d∣ is the passage length; avgdl\mathrm{avgdl}avgdl is the average passage length; and inverse document frequency (IDF) gives more weight to rarer terms. k1k_1k1​ controls term-frequency saturation and bbb controls length normalization. This is the BM25 family presented by Robertson and Zaragoza.[1]

BM25 has implementation variants. The code below uses the always-positive IDF term log⁡(1+(N−df+0.5)/(df+0.5))\log(1 + (N - \mathrm{df} + 0.5) / (\mathrm{df} + 0.5))log(1+(N−df+0.5)/(df+0.5)), where NNN is the number of passages and df\mathrm{df}df is the number containing the term. A query term still contributes a non-negative weight when it appears in most passages.

The first full ranker uses only Python data structures. It is short enough to inspect and precise enough to reproduce each score.

bm25-from-scratch.py
1import math 2import re 3from collections import Counter 4 5docs = { 6 "d1": "Annual plan refund policy. Request a refund within 30 days of purchase.", 7 "d2": "Cancel during your first month and we will return your payment.", 8 "d3": "Update your billing address in account settings.", 9 "d4": "Refund status for duplicate charges. Refunds usually appear in 5 to 10 days.", 10} 11query = "How do I get a refund for an annual plan?" 12stopwords = {"a", "an", "and", "do", "for", "get", "how", "i", "in", "the", "to", "within", "your"} 13 14def tokenize(text): 15 return [word for word in re.findall(r"[a-z]+", text.lower()) if word not in stopwords] 16 17tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()} 18query_terms = tokenize(query) 19average_length = sum(len(row) for row in tokens.values()) / len(tokens) 20document_frequency = Counter(term for row in tokens.values() for term in set(row)) 21 22def bm25(doc_id, k1=1.2, b=0.75): 23 score = 0.0 24 counts = Counter(tokens[doc_id]) 25 for term in query_terms: 26 frequency = counts[term] 27 if frequency == 0: 28 continue 29 df = document_frequency[term] 30 idf = math.log(1 + (len(docs) - df + 0.5) / (df + 0.5)) 31 denominator = frequency + k1 * (1 - b + b * len(tokens[doc_id]) / average_length) 32 score += idf * frequency * (k1 + 1) / denominator 33 return score 34 35ranking = sorted(((doc_id, bm25(doc_id)) for doc_id in docs), key=lambda item: (-item[1], item[0])) 36for rank, (doc_id, score) in enumerate(ranking, start=1): 37 print(f"#{rank} {doc_id}: {score:.3f}")
Output
1#1 d1: 3.128 2#2 d4: 0.675 3#3 d2: 0.000 4#4 d3: 0.000

BM25 should put d1 first: it contains all three exact query terms. It should rank d4 above d2 because d4 contains refund, although a human judgment says d2 is the better answer. This is a retrieval failure mode worth keeping visible.

Repetition Saturates Instead of Winning Automatically

Copying a query word repeatedly shouldn't make a weak page unbeatable. In BM25, term frequency raises the contribution but with diminishing returns. The next lab isolates that factor by holding IDF and length normalization fixed.

see-bm25-saturation.py
1def term_weight(frequency: int, k1: float = 1.2) -> float: 2 return frequency * (k1 + 1) / (frequency + k1) 3 4limit = 1.2 + 1 5for frequency in (1, 2, 4, 20): 6 weight = term_weight(frequency) 7 remaining = limit - weight 8 print(f"refund count={frequency:2}: tf weight={weight:.3f} gap to limit={remaining:.3f}")
Output
1refund count= 1: tf weight=1.000 gap to limit=1.200 2refund count= 2: tf weight=1.375 gap to limit=0.825 3refund count= 4: tf weight=1.692 gap to limit=0.508 4refund count=20: tf weight=2.075 gap to limit=0.125

The weight grows toward the limit k1+1=2.2k_1 + 1 = 2.2k1​+1=2.2, but the remaining gap shrinks. BM25 still requires good tokenization and good documents; it just stops keyword stuffing from behaving like relevance.

Dense Retrieval Recovers Paraphrases

The previous chapter taught you to audit vector geometry. Dense retrieval applies that geometry to a query: embed the query and each passage independently, then rank passages by similarity. Dense Passage Retrieval demonstrated this bi-encoder pattern for passage ranking, using query and passage representations scored by inner product.[2]

To see the mechanics, use three interpretable coordinates rather than a real embedding model:

ItemVectorIntended meaning
query[1.0, 0.8, 0.0]refund plus annual-plan intent
d1[1.0, 0.4, 0.0]exact refund page
d2[0.9, 0.9, 0.0]paraphrased refund page
d3[0.0, 0.2, 1.0]billing settings
d4[0.4, 0.0, 0.3]related refund status

Cosine similarity compares direction:

cos⁡(q,d)=q⋅d∥q∥2∥d∥2\cos(q,d) = \frac{q \cdot d}{\lVert q \rVert_2 \lVert d \rVert_2}cos(q,d)=∥q∥2​∥d∥2​q⋅d​

For d2, the numerator is 1.0(0.9)+0.8(0.9)=1.621.0(0.9) + 0.8(0.9)=1.621.0(0.9)+0.8(0.9)=1.62. Dividing by the two lengths gives approximately 0.994, even though d2 didn't use the literal word refund.

rank-with-cosine-similarity.py
1import numpy as np 2 3vectors = { 4 "query": np.array([1.0, 0.8, 0.0]), 5 "d1": np.array([1.0, 0.4, 0.0]), 6 "d2": np.array([0.9, 0.9, 0.0]), 7 "d3": np.array([0.0, 0.2, 1.0]), 8 "d4": np.array([0.4, 0.0, 0.3]), 9} 10 11def cosine(left, right): 12 return float(left @ right / (np.linalg.norm(left) * np.linalg.norm(right))) 13 14ranking = sorted( 15 ((doc_id, cosine(vectors["query"], vector)) for doc_id, vector in vectors.items() if doc_id != "query"), 16 key=lambda item: (-item[1], item[0]), 17) 18for rank, (doc_id, score) in enumerate(ranking, start=1): 19 print(f"#{rank} {doc_id}: {score:.3f}")
Output
1#1 d2: 0.994 2#2 d1: 0.957 3#3 d4: 0.625 4#4 d3: 0.123

Dense ranking promotes d2, the paraphrase that sparse ranking could not see. It also retains d4 as partially similar, so dense retrieval doesn't remove the need for judgments or reranking.

Similarity Must Match the Embedding Contract

Raw dot product and cosine similarity produce the same order only when vector norms are controlled. A larger vector can win dot product despite pointing in a worse direction.

dot-product-versus-cosine.py
1import numpy as np 2 3query = np.array([1.0, 0.8, 0.0]) 4candidates = { 5 "aligned_paraphrase": np.array([1.0, 0.8, 0.0]), 6 "large_partial_match": np.array([6.0, 0.0, 0.0]), 7} 8 9dot_ranking = sorted(candidates, key=lambda name: -(query @ candidates[name])) 10cosine_ranking = sorted( 11 candidates, 12 key=lambda name: -(query @ candidates[name]) / (np.linalg.norm(query) * np.linalg.norm(candidates[name])), 13) 14 15print("dot-product winner:", dot_ranking[0]) 16print("cosine winner: ", cosine_ranking[0])
Output
1dot-product winner: large_partial_match 2cosine winner: aligned_paraphrase

This doesn't mean cosine is always right and dot product is always wrong. It means index construction, offline evaluation, and online queries must agree on the metric the embedding model is intended to use.

Fuse Independent Retrieval Lanes

For this question, sparse and dense rankings contain complementary evidence:

LaneRankingStrength
BM25d1, d4, d2, d3protects exact terms such as annual plan
Dense cosined2, d1, d4, d3rescues the paraphrase return your payment

Do not average a BM25 number such as 3.128 with a cosine number such as 0.994. They have unrelated units. Reciprocal rank fusion (RRF) uses rank positions instead:

RRF(d)=∑ℓ∈L1k+rℓ(d)\mathrm{RRF}(d) = \sum_{\ell \in L} \frac{1}{k + r_{\ell}(d)}RRF(d)=ℓ∈L∑​k+rℓ​(d)1​

Here, LLL is the set of ranked lists and rℓ(d)r_{\ell}(d)rℓ​(d) is document ddd's position in list ℓ\ellℓ. If one lane doesn't return ddd, that lane contributes nothing for ddd. Cormack, Clarke, and Buettcher proposed RRF and used k=60 as a fixed constant in their reported experiments.[3]

BM25 and dense feed RRF. d2 rises above d4. BM25 and dense feed RRF. d2 rises above d4.
RRF merges rank positions. The dense paraphrase hit lifts d2 above d4.
reciprocal-rank-fusion.py
1from collections import defaultdict 2 3bm25 = ["d1", "d4", "d2", "d3"] 4dense = ["d2", "d1", "d4", "d3"] 5 6def rrf(rankings, k=60): 7 scores = defaultdict(float) 8 for ranking in rankings: 9 for rank, document in enumerate(ranking, start=1): 10 scores[document] += 1 / (k + rank) 11 return scores 12 13scores = rrf([bm25, dense]) 14fused = sorted(scores, key=lambda document: (-scores[document], document)) 15 16for rank, document in enumerate(fused, start=1): 17 print(f"#{rank} {document}: {scores[document]:.6f}")
Output
1#1 d1: 0.032522 2#2 d2: 0.032266 3#3 d4: 0.032002 4#4 d3: 0.031250

The paraphrased answer d2 moves above the refund-status distractor d4. The loop accumulates scores across the union of returned documents, so a dense-only candidate isn't silently discarded. RRF didn't decide that d2 was relevant; your judgment table did. RRF only gave both retrieval lanes a fair way to contribute candidates.

Reranking Spends Compute on the Shortlist

A dense retriever encodes a query separately from its passages so passage vectors can be stored and searched cheaply. A cross-encoder reranker instead reads a query-passage pair together and predicts relevance for that pair. Nogueira and Cho showed this second-stage BERT reranking pattern on retrieved passages.[4]

Reranking improves ordering only if the correct passage survived candidate retrieval. For a paraphrased request such as q2 return my payment, the next check makes that boundary explicit with assumed reranker scores where d2 is the best answer.

reranking-cannot-recover-a-missing-candidate.py
1reranker_score = {"d1": 0.55, "d2": 0.96, "d4": 0.12} 2 3def rerank(candidates): 4 return sorted(candidates, key=lambda doc: -reranker_score[doc]) 5 6for name, candidates in [ 7 ("sparse-only top2", ["d1", "d4"]), 8 ("hybrid top3", ["d1", "d2", "d4"]), 9]: 10 ordered = rerank(candidates) 11 recovered = "d2" in ordered and ordered[0] == "d2" 12 print(f"{name:16}: top={ordered[0]} recovered_best={recovered}")
Output
1sparse-only top2: top=d1 recovered_best=False 2hybrid top3 : top=d2 recovered_best=True

This is why a retrieval cascade has two separate metrics: candidate recall for the first stage and final ordering quality after reranking.

Measure Ranking Before Evaluating Answers

To evaluate retrieval, create queries with relevance judgments. Two useful metrics are:

  • Hit Rate @ K: fraction of questions with at least one relevant passage in the first KKK results.
  • Mean Reciprocal Rank (MRR): average of 1/r1/r1/r, where rrr is the rank of the first relevant passage.

For multi-relevance or graded judgments, normalized Discounted Cumulative Gain (nDCG) is also common; BEIR reports nDCG@10 across diverse information-retrieval datasets.[5]

The mini evaluation below compares a sparse-only run with a hybrid run. q2 uses paraphrased wording, so the dense lane must preserve d2 near the top.

measure-hit-rate-and-mrr.py
1gold = { 2 "q1 annual plan refund": {"d1", "d2"}, 3 "q2 return my payment": {"d2"}, 4 "q3 duplicate charge status": {"d4"}, 5} 6sparse = { 7 "q1 annual plan refund": ["d1", "d4", "d2"], 8 "q2 return my payment": ["d3", "d4", "d2"], 9 "q3 duplicate charge status": ["d4", "d1", "d2"], 10} 11hybrid = { 12 "q1 annual plan refund": ["d1", "d2", "d4"], 13 "q2 return my payment": ["d2", "d1", "d4"], 14 "q3 duplicate charge status": ["d4", "d1", "d2"], 15} 16 17def evaluate(run, k=2): 18 hits = [] 19 reciprocals = [] 20 for question, ranking in run.items(): 21 relevant = gold[question] 22 hits.append(any(doc in relevant for doc in ranking[:k])) 23 first_rank = next((rank for rank, doc in enumerate(ranking, start=1) if doc in relevant), None) 24 reciprocals.append(0.0 if first_rank is None else 1 / first_rank) 25 return sum(hits) / len(hits), sum(reciprocals) / len(reciprocals) 26 27for name, run in [("sparse", sparse), ("hybrid", hybrid)]: 28 hit_rate, mrr = evaluate(run) 29 print(f"{name:6}: hit_rate@2={hit_rate:.3f} mrr={mrr:.3f}")
Output
1sparse: hit_rate@2=0.667 mrr=0.778 2hybrid: hit_rate@2=1.000 mrr=1.000

The lesson isn't that hybrid always wins. It's that an evaluated hard query reveals the weakness that an easy query hides.

Approximate Search Is a Recall Decision

Exact dense retrieval compares a query vector against every stored passage. Once a corpus grows large, an approximate nearest neighbor (ANN) index searches fewer candidates or uses compressed vectors to lower latency and memory cost.

Three building blocks appear often:

ToolHow it avoids exact workRisk to evaluate
Inverted file index (IVF)searches only selected coarse vector partitionscorrect neighbor lies in an unprobed partition
Hierarchical Navigable Small World (HNSW)explores a proximity graph from long-range layers into local neighborslimited exploration misses a closer point
Product quantization (PQ)represents subvectors with codebook entries for compressed distance calculationsquantized distance changes the order

IVF and PQ are described in Faiss's similarity-search work and PQ's original nearest-neighbor paper; HNSW uses a hierarchical navigable proximity graph for approximate search.[6][7][8]

IVF probe, HNSW walk, and PQ compression trade speed for recall. Tuned settings match the exact top-1 baseline. IVF probe, HNSW walk, and PQ compression trade speed for recall. Tuned settings match the exact top-1 baseline.
Approximate search only ships after matching an exact-search recall gate.

Watch an IVF Probe Miss the Nearest Passage

An IVF index assigns documents to coarse partitions. At query time it searches the nearest partition, or several nearby partitions controlled by a probe budget. Near a partition boundary, the closest document may live in the next partition.

ivf-probe-can-miss-a-neighbor.py
1import numpy as np 2 3documents = { 4 "generic_refund": np.array([0.0, 0.0]), 5 "annual_refund": np.array([5.1, 0.0]), 6 "shipping": np.array([9.0, 0.0]), 7} 8centroids = {0: np.array([0.0, 0.0]), 1: np.array([10.0, 0.0])} 9lists = {0: ["generic_refund"], 1: ["annual_refund", "shipping"]} 10query = np.array([4.9, 0.0]) 11 12def nearest(candidates): 13 return min(candidates, key=lambda doc: np.linalg.norm(documents[doc] - query)) 14 15exact = nearest(list(documents)) 16ordered_lists = sorted(centroids, key=lambda bucket: np.linalg.norm(centroids[bucket] - query)) 17one_probe = nearest(lists[ordered_lists[0]]) 18two_probe = nearest(lists[ordered_lists[0]] + lists[ordered_lists[1]]) 19 20print("exact nearest: ", exact) 21print("nprobe=1 result:", one_probe) 22print("nprobe=2 result:", two_probe)
Output
1exact nearest: annual_refund 2nprobe=1 result: generic_refund 3nprobe=2 result: annual_refund

The miss isn't a defect in vector meaning. It is an approximation decision. If exact search finds the correct evidence and ANN doesn't, tune or replace the index before changing embeddings.

Put an Exact-Search Gate Around ANN Settings

One anecdote isn't an evaluation. Use a held-out query slice, compute exact nearest results, and report recall@1: how often an ANN configuration returns the same top passage.

audit-ann-recall-before-shipping.py
1exact_top1 = ["annual_refund", "shipping_delay", "billing_address", "duplicate_charge"] 2ann_fast = ["generic_refund", "shipping_delay", "billing_address", "generic_refund"] 3ann_tuned = ["annual_refund", "shipping_delay", "billing_address", "duplicate_charge"] 4 5def recall_at_one(approximate, exact): 6 matches = sum(found == expected for found, expected in zip(approximate, exact)) 7 return matches / len(exact) 8 9print(f"fast setting recall@1: {recall_at_one(ann_fast, exact_top1):.2f}") 10print(f"tuned setting recall@1: {recall_at_one(ann_tuned, exact_top1):.2f}") 11print("ship only after checking latency alongside recall")
Output
1fast setting recall@1: 0.50 2tuned setting recall@1: 1.00 3ship only after checking latency alongside recall

Latency gains aren't free evidence. A production retriever needs both an accuracy report and a latency report for the index setting it serves.

A Retrieval Review Checklist

Before letting an LLM answer from retrieved passages, inspect the stages separately:

StageQuestionArtifact
JudgmentsWhich passage really answers each question?held-out query-to-passage labels
Sparse laneDo exact policy terms and identifiers survive?BM25 top-k report
Dense laneDo paraphrases appear near the top?hard paraphrase set
FusionDo both lanes contribute useful candidates?fused ranking with lane attribution
RerankingIs the best evidence ordered first?reranked MRR or nDCG
ANN indexDid latency savings drop recall?ANN-vs-exact recall and latency

This is the first production habit of retrieval-augmented generation (RAG): evaluate the documents entering the prompt before blaming the generator for an unsupported answer.

Practice Tasks

  1. Add five refund-heavy distractors to bm25-from-scratch.py. Print each query term's document frequency and explain why annual and plan should carry more lexical evidence than refund.
  2. Add a dense-only document d5 to reciprocal-rank-fusion.py. Confirm that the fused output includes it even though BM25 never returned it.
  3. Add a fourth judged query to measure-hit-rate-and-mrr.py where the first relevant passage appears at rank 3. Predict how Hit Rate @ 2 and MRR change before running the script.
  4. Move the IVF fixture query from [4.9, 0.0] to [5.2, 0.0]. Explain why nprobe=1 now recovers annual_refund.
  5. Write a retrieval review record for a small policy corpus: judged queries, sparse top-k, dense top-k, fused candidates, reranked order, ANN recall against exact search, and latency.

Practice guidance

  1. Common terms should have larger document frequency and smaller IDF. The rare policy-specific terms should contribute more evidence when they match.
  2. Build fusion scores from the union of documents returned by every lane. A lane that omits d5 contributes no reciprocal-rank term for it.
  3. A relevant passage at rank 3 misses Hit Rate @ 2 but contributes 1 / 3 to MRR. The metrics answer different questions.
  4. At [5.2, 0.0], centroid 1 is closer than centroid 0, so the first probed list contains annual_refund. Probe failures depend on both partition assignment and query location.
  5. Keep stage-level artifacts separate. A final answer score can't tell you whether evidence vanished in sparse retrieval, dense retrieval, fusion, reranking, or ANN search.

Key Concepts

  • relevance judgments and evidence selection
  • inverted indexes and BM25
  • dense retrieval and cosine similarity
  • metric contracts for dot product versus cosine
  • reciprocal rank fusion (RRF)
  • reranking as a second-stage precision pass
  • Hit Rate @ K and Mean Reciprocal Rank (MRR)
  • ANN recall against exact-search baselines

Evaluation rubric

  • Foundational: Explains why d1, d2, and d4 are treated differently for the same refund question, then reproduces BM25 and cosine rankings.
  • Intermediate: Builds sparse, dense, and RRF stages and measures a judged retrieval run with Hit Rate and MRR.
  • Advanced: Diagnoses whether a missing policy page was lost by retrieval, reranking, or ANN approximation, and proposes an exact-baseline release gate.

Follow-up questions

Common pitfalls

SymptomCauseFix
Exact SKU or policy terms disappear after switching to dense-only search.Semantic retrieval blurred decisive literal strings.Retain a BM25 lane and evaluate hybrid recall.
A paraphrased answer never appears in the shortlist.Sparse retrieval was asked to solve semantic matching alone.Add dense retrieval and include paraphrase queries in evaluation.
Raw-score averaging makes one lane dominate fusion.BM25 and cosine scores have unrelated units.Use RRF or a calibrated learned combination.
Reranker scores look strong but answers remain unsupported.Candidate retrieval already discarded the correct evidence.Measure candidate Hit Rate before reranker quality.
Index tuning lowers latency while policies vanish from top-k.ANN search skipped or distorted relevant vectors.Gate ANN settings against exact-search recall and latency reports.
Complete the lesson

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1.A customer asks, "How do I get a refund for an annual plan?" BM25 ranks d4, "Refund status for duplicate charges," above d2, "Cancel during your first month and we will return your payment." What does this reveal?
2.With IDF and length normalization held fixed, BM25 uses tf weight f * (k1 + 1) / (f + k1). For k1 = 1.2, what happens as a page repeats refund from 1 to 20 times?
3.A query vector [1.0, 0.8, 0.0] is compared with d2 [0.9, 0.9, 0.0], a passage that says "return your payment" rather than "refund." Why can dense cosine retrieval rank d2 near the top?
4.A query vector is [1.0, 0.8, 0.0]. Candidate A is [1.0, 0.8, 0.0], and candidate B is [6.0, 0.0, 0.0]. If the embedding contract is directional similarity, what failure can raw dot product introduce?
5.A sparse lane returns BM25 scores such as 3.128, while a dense lane returns cosine similarities such as 0.994. Why use reciprocal rank fusion instead of averaging those raw scores?
6.A first-stage retriever returns only [d1, d4] for a paraphrased refund query. A cross-encoder reranker would score d2 highest if it received it, but d2 is absent from the candidate list. Why can't reranking recover d2?
7.For three judged queries, the relevant sets are q1: {d1, d2}, q2: {d2}, and q3: {d4}. A run returns q1: [d1, d4, d2], q2: [d3, d4, d2], and q3: [d4, d1, d2]. Hit Rate @ 2 is the fraction of queries with any relevant passage in the first two results. MRR is the mean reciprocal rank of the first relevant passage. What are the two metrics?
8.After switching dense retrieval to an ANN index, some correct policy pages vanish from top 1. The query and document embeddings are unchanged. Which audit isolates whether approximation damaged retrieval?
9.In a corpus, refund appears in most passages while annual appears in only a few. Passages A and B each contain both terms once, but A is much longer. With positive BM25 IDF and b > 0, which effect is expected?

9 questions remaining.

Next Step
Continue to Decoding Algorithms

You can now measure which evidence reaches an LLM prompt; next you will control how the model turns that evidence and its logits into generated tokens.

PreviousClustering and PCA
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

The Probabilistic Relevance Framework: BM25 and Beyond.

Robertson, S., & Zaragoza, H. · 2009 · Foundations and Trends in Information Retrieval

Dense Passage Retrieval for Open-Domain Question Answering.

Karpukhin, V., et al. · 2020 · EMNLP 2020

Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.

Cormack, G. V., Clarke, C. L. A., & Buettcher, S. · 2009 · SIGIR '09

Passage Re-ranking with BERT.

Nogueira, R. & Cho, K. · 2019 · arXiv preprint

BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.

Thakur, N., et al. · 2021 · NeurIPS 2021 Datasets and Benchmarks

Billion-scale similarity search with GPUs.

Johnson, J., Douze, M., & Jégou, H. · 2017 · arXiv preprint

Product Quantization for Nearest Neighbor Search.

Jégou, H., Douze, M., & Schmid, C. · 2011 · IEEE TPAMI 2011

Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs.

Malkov, Y. A., & Yashunin, D. A. · 2018 · IEEE Transactions on Pattern Analysis and Machine Intelligence