Hybrid Search: Dense + Sparse

Upgrade a permission-safe RAG retriever with BM25, semantic scores, rank fusion, and recall gates for exact codes and paraphrased policy questions.

Hybrid Search: Dense + Sparse Retrieval

The production RAG lesson built policy-answerer-v1 around a hard rule: only permitted, current evidence may reach an answer. Its simple term-overlap retriever was easy to audit, but it misses a customer who asks for a "swap for a broken reconditioned notebook" when the policy says "replacement for a damaged refurbished laptop."

This lesson upgrades only that retrieval lane. You'll build policy-answerer-v2. The code RPL-14 gives sparse retrieval an exact signal; dense retrieval catches paraphrased meaning; and Reciprocal Rank Fusion (RRF) merges candidate lists. The authorization, freshness, citation, and abstention contract doesn't change.

One retriever can't cover both queries

Luna, an EU support specialist, needs the same policy for two different searches:

| Query | Useful signal | Required evidence |
| --- | --- | --- |
| RPL-14 | Exact policy code | eu-refurb-v2-rule |
| swap a broken reconditioned notebook | Meaning close to "replacement for a damaged refurbished laptop" | eu-refurb-v2-rule |

A word-matching index has a decisive clue for the first query and no shared vocabulary for the second. A semantic encoder can represent the second query near the policy, but an unfamiliar internal identifier may carry little useful semantic signal. Neither failure says one method is bad. They solve different recall problems.

[Figure: Permission-safe hybrid retrieval. Current permitted evidence flows into BM25 and dense lanes, then RRF produces candidate IDs without exposing hidden policies.]
Hybrid retrieval runs only after the policy boundary has selected current evidence Luna may see. BM25 protects exact-code recall; dense search protects paraphrase recall; fusion never grants access to a hidden chunk.

The safe online order is:

[Diagram: Support question → Current and permitted chunks → BM25 candidates and Dense candidates.]

Recreate the permitted candidate universe

The lab reuses the policy shape from the previous lesson. It adds diagnostic policy code RPL-14 so an exact-identifier query has an unambiguous expected result. Two tempting records remain in storage but must not be searchable for Luna: a superseded revision and a restricted merchant rule.

permitted-candidates.py
```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from math import log, sqrt
import re

@dataclass(frozen=True)
class PolicyChunk:
    chunk_id: str
    document_id: str
    parent_id: str
    version: str
    region: str
    acl_tag: str
    effective_from: date
    effective_to: date | None
    text: str

@dataclass(frozen=True)
class Caller:
    actor_id: str
    region: str
    acl_tags: frozenset[str]

EVAL_DATE = date(2026, 5, 27)
LUNA = Caller("luna-48291", "EU", frozenset({"support:eu"}))
CHUNKS = [
    # Current EU rule: the evidence Luna should retrieve.
    PolicyChunk(
        "eu-refurb-v2-rule",
        "eu-electronics",
        "eu-electronics-v2",
        "eu-electronics/2026-04-01",
        "EU",
        "support:eu",
        date(2026, 4, 1),
        None,
        (
            "Rule RPL-14. Damaged refurbished laptops qualify for replacement "
            "within 14 days of delivery when damage is reported within 48 hours."
        ),
    ),
    # Superseded revision: same code, expired on 2026-03-31.
    PolicyChunk(
        "eu-refurb-v1-rule",
        "eu-electronics",
        "eu-electronics-v1",
        "eu-electronics/2025-02-01",
        "EU",
        "support:eu",
        date(2025, 2, 1),
        date(2026, 3, 31),
        "Rule RPL-14. Damaged refurbished laptops qualify for return within 30 days.",
    ),
    # Restricted merchant rule: current, but outside Luna's ACL tags.
    PolicyChunk(
        "merchant-vip-refurb",
        "merchant-vip-terms",
        "merchant-vip-terms",
        "merchant-vip/2026-05-01",
        "EU",
        "merchant:vip-ops",
        date(2026, 5, 1),
        None,
        "VIP-RPL-1. Damaged refurbished laptops receive immediate refund.",
    ),
    PolicyChunk(
        "eu-footwear-v1-rule",
        "eu-footwear",
        "eu-footwear-v1",
        "eu-footwear/2026-01-03",
        "EU",
        "support:eu",
        date(2026, 1, 3),
        None,
        "Unworn footwear may be returned within 30 days of delivery.",
    ),
    PolicyChunk(
        "eu-carrier-loss-v1",
        "eu-carrier",
        "eu-carrier-loss-v1",
        "eu-carrier/2026-02-10",
        "EU",
        "support:eu",
        date(2026, 2, 10),
        None,
        "Rule CLM-7. A lost parcel after carrier pickup qualifies for refund.",
    ),
]

def is_current(chunk: PolicyChunk, on_date: date) -> bool:
    # Both ends of the effective window are inclusive.
    return chunk.effective_from <= on_date and (
        chunk.effective_to is None or on_date <= chunk.effective_to
    )

def permitted_chunks(
    caller: Caller,
    chunks: list[PolicyChunk],
    on_date: date,
) -> list[PolicyChunk]:
    # Region, ACL tag, and freshness must all pass before a chunk is searchable.
    return [
        chunk
        for chunk in chunks
        if chunk.region == caller.region
        and chunk.acl_tag in caller.acl_tags
        and is_current(chunk, on_date)
    ]

permitted = permitted_chunks(LUNA, CHUNKS, EVAL_DATE)
permitted_ids = [chunk.chunk_id for chunk in permitted]
print("Permitted current ids:", permitted_ids)
assert "eu-refurb-v2-rule" in permitted_ids
assert "eu-refurb-v1-rule" not in permitted_ids
assert "merchant-vip-refurb" not in permitted_ids
```

Output

```
Permitted current ids: ['eu-refurb-v2-rule', 'eu-footwear-v1-rule', 'eu-carrier-loss-v1']
```

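Every retrieval lane below ranks only permitted, never the raw CHUNKS list. The hybrid entry point does accept CHUNKS, but it calls permitted_chunks before either lane runs. This isn't a presentation detail: it is the API boundary that prevents a new ranking algorithm from weakening the service contract.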

The fixed EVAL_DATE keeps replay behavior stable. The chunk shape also preserves document_id and parent_id from the previous lesson, even though this chapter changes ranking rather than citation packing.

evidence-boundary-regression.py
```python
# Everything in storage that Luna must not be able to search.
blocked_ids = sorted(
    chunk.chunk_id
    for chunk in CHUNKS
    if chunk.chunk_id not in permitted_ids
)
print("Searchable by Luna:", permitted_ids)
print("Stored but blocked:", blocked_ids)
assert blocked_ids == ["eu-refurb-v1-rule", "merchant-vip-refurb"]
```

Output

```
Searchable by Luna: ['eu-refurb-v2-rule', 'eu-footwear-v1-rule', 'eu-carrier-loss-v1']
Stored but blocked: ['eu-refurb-v1-rule', 'merchant-vip-refurb']
```

This test keeps a deliberately attractive hidden policy in storage. Later ranking changes fail loudly if they accidentally widen the searchable set.

Build the sparse lane with BM25

Sparse retrieval represents a document by vocabulary terms. Most coordinates are zero because a short policy chunk uses only a small part of the vocabulary. BM25 ranks a document higher when it shares rare query terms, while limiting the reward for repeated terms and compensating for unusually long documents.[1]

For a query term $t$ and document $d$, the lab computes:

$$\operatorname{BM25}(q,d)=\sum_{t \in q}\operatorname{IDF}(t)\,\frac{f(t,d)\,(k_1+1)}{f(t,d)+k_1\left(1-b+b\,\lvert d\rvert/\operatorname{avgdl}\right)}$$

Here, $f(t,d)$ is the term count in the chunk, $\lvert d\rvert$ is chunk length in tokens, and $\operatorname{avgdl}$ is the corpus average. $k_1$ controls term-frequency saturation; $b$ controls length normalization. The exact identifier rpl-14 occurs only in the relevant current chunk, so it receives strong lexical weight.
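To make the formula concrete, the sketch below scores a single query term against a single chunk. It assumes the IDF form log(1 + (N - n + 0.5) / (n + 0.5)), which is the variant the lab's bm25_rank code uses; bm25_term is a hypothetical helper name, not part of the lab. The inputs correspond to rpl-14 in Luna's three permitted chunks after stopword removal: one occurrence in a 14-token chunk, with an average chunk length of 10 tokens.

```python
from math import log


def bm25_term(f: int, n_docs: int, n_containing: int, doc_len: int,
              avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Contribution of one query term to one document's BM25 score."""
    if f == 0:
        return 0.0
    # Assumed IDF variant: log(1 + (N - n + 0.5) / (n + 0.5)).
    idf = log(1 + (n_docs - n_containing + 0.5) / (n_containing + 0.5))
    length_norm = 1 - b + b * doc_len / avgdl
    return idf * f * (k1 + 1) / (f + k1 * length_norm)


# rpl-14: one occurrence, present in 1 of 3 chunks, chunk length 14, avgdl 10.
once = bm25_term(f=1, n_docs=3, n_containing=1, doc_len=14, avgdl=10.0)
# Five occurrences score higher, but far less than five times higher (saturation).
five = bm25_term(f=5, n_docs=3, n_containing=1, doc_len=14, avgdl=10.0)
# The same count for a term found in every chunk earns much less weight.
common = bm25_term(f=1, n_docs=3, n_containing=3, doc_len=14, avgdl=10.0)

print(f"one occurrence:   {once:.4f}")
print(f"five occurrences: {five:.4f}")
print(f"in every chunk:   {common:.4f}")
```

The three numbers show the two effects the formula is built for: rarity raises a term's weight, and repetition is rewarded with diminishing returns.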

The small analyzer below keeps hyphenated rule codes intact and removes common function words. Without that stopword rule, a query containing only a shared word such as "a" could appear to retrieve an unrelated policy.

bm25-lane.py
```python
# Keep hyphenated identifiers such as rpl-14 as a single token.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
STOPWORDS = {"a", "an", "the", "after", "for", "of", "is", "within", "when"}

def tokens(text: str) -> list[str]:
    return [
        token
        for token in TOKEN_RE.findall(text.lower())
        if token not in STOPWORDS
    ]

def bm25_rank(
    query: str,
    chunks: list[PolicyChunk],
    top_k: int = 2,
    k1: float = 1.2,
    b: float = 0.75,
) -> list[tuple[PolicyChunk, float]]:
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    if not chunks:
        return []

    doc_tokens = {chunk.chunk_id: tokens(chunk.text) for chunk in chunks}
    avgdl = sum(len(value) for value in doc_tokens.values()) / len(chunks)
    query_terms = tokens(query)
    ranked: list[tuple[PolicyChunk, float]] = []

    for chunk in chunks:
        document = doc_tokens[chunk.chunk_id]
        score = 0.0
        for term in query_terms:
            term_count = document.count(term)
            if term_count == 0:
                continue
            containing_docs = sum(term in value for value in doc_tokens.values())
            idf = log(1 + (len(chunks) - containing_docs + 0.5) / (containing_docs + 0.5))
            numerator = term_count * (k1 + 1)
            denominator = term_count + k1 * (1 - b + b * len(document) / avgdl)
            score += idf * numerator / denominator
        # Chunks with no matching term are dropped rather than ranked at zero.
        if score > 0:
            ranked.append((chunk, score))

    # Highest score first; chunk_id breaks ties deterministically.
    return sorted(ranked, key=lambda item: (-item[1], item[0].chunk_id))[:top_k]

EXACT = "RPL-14"
PARAPHRASE = "swap a broken reconditioned notebook"
bm25_exact = bm25_rank(EXACT, permitted)
bm25_paraphrase = bm25_rank(PARAPHRASE, permitted)

print("BM25 exact:", [chunk.chunk_id for chunk, _ in bm25_exact])
print("BM25 paraphrase:", [chunk.chunk_id for chunk, _ in bm25_paraphrase])
assert bm25_exact[0][0].chunk_id == "eu-refurb-v2-rule"
assert bm25_paraphrase == []
```

Output

```
BM25 exact: ['eu-refurb-v2-rule']
BM25 paraphrase: []
```

BM25 did its job. It recovered the policy from its code and openly failed when the user used no policy vocabulary. A real evaluation set needs both query types; otherwise the lexical lane can look perfect while customers miss evidence.

BM25 is not the only sparse option. SPLADE learns sparse expansion weights, so a chunk can gain indexable related terms while preserving sparse retrieval infrastructure.[2] That can improve vocabulary mismatch cases, but it doesn't turn sparse retrieval into an authorization layer or guarantee better recall on ShopFlow queries. Evaluate a SPLADE candidate against the same permitted corpus, held-out required-evidence IDs, latency budget, and hidden-source exclusions before replacing BM25.

diagnose-sparse-miss.py
```python
required_text = next(
    chunk.text for chunk in permitted if chunk.chunk_id == "eu-refurb-v2-rule"
)
# Shared tokens between each query and the required chunk.
exact_overlap = sorted(set(tokens(EXACT)) & set(tokens(required_text)))
paraphrase_overlap = sorted(set(tokens(PARAPHRASE)) & set(tokens(required_text)))
print("Exact overlap:", exact_overlap)
print("Paraphrase overlap:", paraphrase_overlap)
assert exact_overlap == ["rpl-14"]
assert paraphrase_overlap == []
```

Output

```
Exact overlap: ['rpl-14']
Paraphrase overlap: []
```

Add a dense semantic lane

A dense retriever encodes queries and chunks as compact vectors, then retrieves chunks with high similarity. Dense Passage Retrieval (DPR), for example, uses separate encoders for questions and passages so passage representations can be indexed before requests arrive.[3]

Downloading and training an encoder would hide the retrieval mechanics in this lab. Instead, the next cell uses frozen three-dimensional vectors as test fixtures. Read them as outputs already produced by an embedding model:

| Dimension | Meaning in this fixture |
| --- | --- |
| 1 | Refurbished-device replacement intent |
| 2 | Footwear return intent |
| 3 | Lost-carrier refund intent |

This fixture is deliberately honest about one failure: the internal code RPL-14 is given an all-zero vector, so it carries no semantic signal by itself. The paraphrase does.

dense-lane.py
```python
Vector = tuple[float, float, float]

# Frozen fixtures standing in for embedding-model output.
DOCUMENT_VECTORS: dict[str, Vector] = {
    "eu-refurb-v2-rule": (1.00, 0.00, 0.00),
    "eu-footwear-v1-rule": (0.00, 1.00, 0.00),
    "eu-carrier-loss-v1": (0.00, 0.00, 1.00),
}
QUERY_VECTORS: dict[str, Vector] = {
    EXACT: (0.00, 0.00, 0.00),
    PARAPHRASE: (0.98, 0.05, 0.00),
    "damaged refurbished laptop replacement after delivery": (0.96, 0.15, 0.02),
}

def cosine(left: Vector, right: Vector) -> float:
    left_norm = sqrt(sum(value * value for value in left))
    right_norm = sqrt(sum(value * value for value in right))
    # A zero vector has no direction, so treat its similarity as 0.
    if left_norm == 0 or right_norm == 0:
        return 0.0
    return sum(a * b for a, b in zip(left, right)) / (left_norm * right_norm)

def dense_rank(
    query: str,
    chunks: list[PolicyChunk],
    top_k: int = 2,
) -> list[tuple[PolicyChunk, float]]:
    # Queries without a fixture vector fall back to the zero vector.
    query_vector = QUERY_VECTORS.get(query, (0.0, 0.0, 0.0))
    ranked = [
        (chunk, cosine(query_vector, DOCUMENT_VECTORS[chunk.chunk_id]))
        for chunk in chunks
    ]
    return sorted(
        [(chunk, score) for chunk, score in ranked if score > 0],
        key=lambda item: (-item[1], item[0].chunk_id),
    )[:top_k]

dense_exact = dense_rank(EXACT, permitted)
dense_paraphrase = dense_rank(PARAPHRASE, permitted)
print("Dense exact:", [chunk.chunk_id for chunk, _ in dense_exact])
print("Dense paraphrase:", [chunk.chunk_id for chunk, _ in dense_paraphrase])
assert dense_exact == []
assert dense_paraphrase[0][0].chunk_id == "eu-refurb-v2-rule"
```

Output

```
Dense exact: []
Dense paraphrase: ['eu-refurb-v2-rule', 'eu-footwear-v1-rule']
```

The fixture doesn't claim that every production encoder misses every identifier. It establishes a regression case: this chosen encoder representation doesn't recover the code-only query, so deleting the sparse lane would fail a known requirement.
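Because chunk vectors can be computed before requests arrive, one common arrangement is to normalize them once offline so the online step reduces to dot products. The sketch below illustrates that split with the same fixture values; normalize, dot, and index are hypothetical names for illustration, and a real deployment would use a vector index rather than a dictionary.

```python
from math import sqrt

Vector = tuple[float, ...]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length; a zero vector stays zero and matches nothing."""
    norm = sqrt(sum(x * x for x in v))
    return v if norm == 0 else tuple(x / norm for x in v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


# Offline: normalize chunk vectors once (values mirror the lab fixture).
index = {
    "eu-refurb-v2-rule": normalize((1.0, 0.0, 0.0)),
    "eu-footwear-v1-rule": normalize((0.0, 1.0, 0.0)),
    "eu-carrier-loss-v1": normalize((0.0, 0.0, 1.0)),
}

# Online: normalize the query vector, then score every chunk with a dot product.
# For unit vectors the dot product equals cosine similarity.
query = normalize((0.98, 0.05, 0.0))
scores = {chunk_id: dot(query, vec) for chunk_id, vec in index.items()}
ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))

for chunk_id in ranked:
    print(f"{chunk_id}: {scores[chunk_id]:.4f}")
```

The ordering matches the dense lane above because normalization changes vector length, not direction.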

Fuse candidates without mixing score scales

BM25 scores and cosine similarities don't share units. A BM25 value reflects term statistics in this index; a cosine value reflects vector alignment. Adding raw values can let whichever scale is numerically larger control the order.
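A small numeric sketch makes the scale problem visible. The scores below are hypothetical and chosen only for illustration: the dense lane strongly prefers chunk-b, while BM25 prefers chunk-a by a margin that is modest relative to its own values.

```python
# Hypothetical raw scores for two chunks; the values are illustrative only.
bm25_scores = {"chunk-a": 9.4, "chunk-b": 8.2}      # unbounded, depends on index statistics
cosine_scores = {"chunk-a": 0.21, "chunk-b": 0.93}  # bounded above by 1.0


def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Map each chunk ID to its 1-based position within one lane."""
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return {cid: position for position, cid in enumerate(ordered, start=1)}


# Adding raw values: the BM25 gap (1.2) outweighs the cosine gap (0.72).
raw_sum = {cid: bm25_scores[cid] + cosine_scores[cid] for cid in bm25_scores}
winner_by_sum = max(raw_sum, key=raw_sum.get)

print("Raw sums:", {cid: round(total, 2) for cid, total in raw_sum.items()})
print("Winner by raw sum:", winner_by_sum)
print("BM25 ranks:", ranks(bm25_scores))
print("Dense ranks:", ranks(cosine_scores))
```

The raw sum picks chunk-a even though the lanes disagree one vote each. The larger scale, not stronger evidence, decided the order.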

Reciprocal Rank Fusion avoids that comparison. It contributes $1/(k+r)$ for each rank $r$ at which a chunk appears:

$$\operatorname{RRF}(d)=\sum_{\text{lane } l}\frac{1}{k+\operatorname{rank}_l(d)}$$

We use k=60, the setting reported in the original RRF experiments, as a starting value rather than a universal optimum.[4] A chunk found by both lanes gains two contributions; a strong result found by one lane remains eligible.
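A short calculation shows what the constant does. The sketch below compares a chunk ranked first in only one lane with a chunk ranked fourth in both lanes, under two values of k; rrf_score is a hypothetical helper, and the rank positions are made up for illustration.

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """Sum 1 / (k + rank) over the lanes in which a chunk appears."""
    return sum(1 / (k + rank) for rank in ranks)


for k in (60, 1):
    single_top = rrf_score([1], k)     # rank 1 in one lane, absent from the other
    agreed_low = rrf_score([4, 4], k)  # rank 4 in both lanes
    winner = "agreed_low" if agreed_low > single_top else "single_top"
    print(f"k={k}: single_top={single_top:.4f} agreed_low={agreed_low:.4f} -> {winner}")
```

With k=60 the contributions of nearby ranks are close together, so agreement between lanes outweighs a single first place. With k=1 the first rank dominates. That sensitivity is one reason to treat k as a setting to validate rather than a constant to trust.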

[Figure: RRF worked example. BM25 and dense ranks for a refurbished laptop replacement query contribute rank-based values to a fused evidence list.]
RRF combines ordering rather than incompatible scores. Shared evidence receives votes from both lanes, while an exact-code-only or paraphrase-only hit can still survive into the candidate set.
rrf-fusion.py
```python
RRF_K = 60

def reciprocal_rank_fusion(
    result_lists: list[list[tuple[PolicyChunk, float]]],
    k: int = RRF_K,
) -> list[tuple[PolicyChunk, float]]:
    if k <= 0:
        raise ValueError("k must be positive")
    by_id: dict[str, PolicyChunk] = {}
    scores: dict[str, float] = {}
    for results in result_lists:
        # Only the position in each lane matters; raw lane scores are ignored.
        for rank, (chunk, _raw_score) in enumerate(results, start=1):
            by_id[chunk.chunk_id] = chunk
            scores[chunk.chunk_id] = scores.get(chunk.chunk_id, 0.0) + 1 / (k + rank)
    return sorted(
        [(by_id[chunk_id], score) for chunk_id, score in scores.items()],
        key=lambda item: (-item[1], item[0].chunk_id),
    )

def hybrid_rank(
    query: str,
    caller: Caller,
    chunks: list[PolicyChunk],
    top_k: int = 2,
) -> list[tuple[PolicyChunk, float]]:
    # Authorization and freshness filtering run before either lane.
    searchable = permitted_chunks(caller, chunks, EVAL_DATE)
    fused = reciprocal_rank_fusion(
        [bm25_rank(query, searchable, top_k), dense_rank(query, searchable, top_k)]
    )
    return fused[:top_k]

SHARED_WORDS = "damaged refurbished laptop replacement after delivery"
for query in [EXACT, PARAPHRASE, SHARED_WORDS]:
    hits = hybrid_rank(query, LUNA, CHUNKS)
    print(query, "->", [chunk.chunk_id for chunk, _ in hits])

shared_fused = hybrid_rank(SHARED_WORDS, LUNA, CHUNKS)
assert shared_fused[0][0].chunk_id == "eu-refurb-v2-rule"
# Rank 1 in both lanes: 1/61 + 1/61.
assert shared_fused[0][1] == 2 / 61
```

Output

```
RPL-14 -> ['eu-refurb-v2-rule']
swap a broken reconditioned notebook -> ['eu-refurb-v2-rule', 'eu-footwear-v1-rule']
damaged refurbished laptop replacement after delivery -> ['eu-refurb-v2-rule', 'eu-footwear-v1-rule']
```

RRF doesn't manufacture relevance. It makes the two candidate sources interoperable. If both lanes miss the right chunk, a fused list will still be wrong.

Attack the evidence boundary through fusion

The stored VIP policy contains a unique code. If the permission boundary moved after retrieval, BM25 would have an easy hidden hit to surface. A hybrid implementation must return nothing for Luna's request for that code.

hidden-source-attack.py
```python
# A caller with no matching region and no ACL tags.
NO_ACCESS = Caller("visitor-9000", "APAC", frozenset())
attack_hits = hybrid_rank("VIP-RPL-1", LUNA, CHUNKS)
attack_ids = [chunk.chunk_id for chunk, _ in attack_hits]
no_access_hits = hybrid_rank(EXACT, NO_ACCESS, CHUNKS)
print("Visible candidates for hidden code:", attack_ids)
print("Visible candidates without corpus access:", no_access_hits)
assert "merchant-vip-refurb" not in attack_ids
assert attack_ids == []
assert no_access_hits == []
```

Output

```
Visible candidates for hidden code: []
Visible candidates without corpus access: []
```
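For contrast, the sketch below runs the same hidden-code query in the unsafe order, searching everything and filtering afterwards. It is a toy: the records are reduced to three fields, and match is a naive substring check rather than the lab's BM25 lane. All names in it are hypothetical.

```python
# Toy records: (chunk_id, acl_tag, text). IDs mirror the lab; the matcher is a
# deliberately naive substring check, not the lab's BM25 lane.
RECORDS = [
    ("eu-refurb-v2-rule", "support:eu", "rule rpl-14 damaged refurbished laptops"),
    ("merchant-vip-refurb", "merchant:vip-ops", "vip-rpl-1 damaged refurbished laptops"),
]
CALLER_TAGS = {"support:eu"}


def match(query: str, records: list[tuple[str, str, str]]) -> list[str]:
    return [chunk_id for chunk_id, _tag, text in records if query.lower() in text]


def allowed(records: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    return [record for record in records if record[1] in CALLER_TAGS]


# Unsafe order: search everything, then filter the results.
lane_candidates_unsafe = match("VIP-RPL-1", RECORDS)
allowed_ids = {record[0] for record in allowed(RECORDS)}
final_unsafe = [cid for cid in lane_candidates_unsafe if cid in allowed_ids]

# Safe order: restrict the candidate universe, then search.
lane_candidates_safe = match("VIP-RPL-1", allowed(RECORDS))

print("Unsafe lane candidates:", lane_candidates_unsafe)
print("Unsafe final result:", final_unsafe)
print("Safe lane candidates:", lane_candidates_safe)
```

Both orders return an empty final result here, but only the safe order keeps the hidden ID out of the lane's candidate list, which is what traces, logs, and top-k truncation operate on.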

Gate the retriever on recall and safety

In the previous lesson, answer quality depended on retrieving current permitted evidence. That means the retrieval upgrade needs its own release cases before you measure generated prose.

Recall@2 answers a narrow question: for each supported query, did the correct permitted chunk appear in the first two candidates? It does not say whether the evidence order is perfect or whether the final answer is faithful. Those are later checks. Here, recall exposes whether the generator even gets a chance to see the right policy.

[Figure: Retrieval release gate comparing BM25, dense, and hybrid recall on three evidence cases while requiring hidden-source safety.]
The lab's measured claim is narrow: hybrid recovers all three frozen permitted-evidence cases while each individual lane misses its known counterexample. The safety gate separately checks that restricted and stale chunks do not appear when the hidden code is requested.
retrieval-release-gate.py
```python
@dataclass(frozen=True)
class RetrievalCase:
    name: str
    query: str
    expected_chunk_id: str

CASES = [
    RetrievalCase("exact-code", EXACT, "eu-refurb-v2-rule"),
    RetrievalCase("paraphrase", PARAPHRASE, "eu-refurb-v2-rule"),
    RetrievalCase("shared-language", SHARED_WORDS, "eu-refurb-v2-rule"),
]

def recall_at_2(rank_fn) -> float:
    # Fraction of cases whose expected chunk appears in the first two candidates.
    recovered = 0
    for case in CASES:
        ids = [chunk.chunk_id for chunk, _ in rank_fn(case.query)]
        recovered += case.expected_chunk_id in ids[:2]
    return recovered / len(CASES)

bm25_recall = recall_at_2(lambda query: bm25_rank(query, permitted))
dense_recall = recall_at_2(lambda query: dense_rank(query, permitted))
hybrid_recall = recall_at_2(lambda query: hybrid_rank(query, LUNA, CHUNKS))

# Safety gate: request the hidden merchant code through the hybrid path.
requested_hidden_rule = hybrid_rank("VIP-RPL-1", LUNA, CHUNKS)
visible_ids = [chunk.chunk_id for chunk, _ in requested_hidden_rule]
safety_pass = (
    "merchant-vip-refurb" not in visible_ids
    and "eu-refurb-v1-rule" not in visible_ids
)

print(f"BM25 Recall@2: {bm25_recall:.2f}")
print(f"Dense Recall@2: {dense_recall:.2f}")
print(f"Hybrid RRF Recall@2: {hybrid_recall:.2f}")
print("Safety gate:", safety_pass)
assert hybrid_recall == 1.0
assert hybrid_recall > bm25_recall
assert hybrid_recall > dense_recall
assert safety_pass
```

Output

```
BM25 Recall@2: 0.67
Dense Recall@2: 0.67
Hybrid RRF Recall@2: 1.00
Safety gate: True
```

These three fixtures demonstrate complementary failures; they don't prove an offline lift for a production corpus. A release decision needs a held-out set drawn from real support requests, including exact codes, paraphrases, unsupported questions, languages served by the product, and attempts to request hidden policies.

| Gate | What to freeze | Failure meaning |
| --- | --- | --- |
| Permitted Recall@k | Query and required current chunk ID | Correct evidence never reaches context selection |
| Restricted-source exclusion | Queries that strongly match hidden chunks | Retriever boundary is unsafe |
| Superseded-source exclusion | Queries matching old policy wording | Freshness filter regressed |
| Abstention cases | Questions with no permitted supporting evidence | Retrieval or answer layer overreaches |
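One way to keep these gates executable is to freeze each row as data and run it against whatever retriever is under test. The sketch below is an assumption about how that could look: GateCase, run_gates, the stub retrievers, and the negative query strings are all hypothetical, and the stubs return hard-coded IDs instead of calling the lab's hybrid retriever.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GateCase:
    name: str
    query: str
    required: frozenset[str] = frozenset()   # IDs that must appear in the top k
    forbidden: frozenset[str] = frozenset()  # IDs that must never appear
    expect_empty: bool = False               # abstention: no candidates at all


def run_gates(retrieve: Callable[[str], list[str]],
              cases: list[GateCase], k: int = 2) -> list[str]:
    """Return the names of failed cases; an empty list means every gate passed."""
    failures = []
    for case in cases:
        ids = retrieve(case.query)
        if not case.required <= set(ids[:k]):
            failures.append(case.name)
        elif case.forbidden & set(ids):
            failures.append(case.name)
        elif case.expect_empty and ids:
            failures.append(case.name)
    return failures


GATE_CASES = [
    GateCase("permitted-recall", "RPL-14", required=frozenset({"eu-refurb-v2-rule"})),
    GateCase("restricted-exclusion", "VIP-RPL-1", forbidden=frozenset({"merchant-vip-refurb"})),
    GateCase("superseded-exclusion", "refurbished return within 30 days",
             forbidden=frozenset({"eu-refurb-v1-rule"})),
    GateCase("abstention", "warranty for drones", expect_empty=True),
]


def stub_retrieve(query: str) -> list[str]:
    # Stand-in for the retriever under test; answers are hard-coded.
    return {"RPL-14": ["eu-refurb-v2-rule"]}.get(query, [])


def leaky_retrieve(query: str) -> list[str]:
    # Simulates a regression where a restricted chunk reaches the candidates.
    if query == "VIP-RPL-1":
        return ["merchant-vip-refurb"]
    return stub_retrieve(query)


print("Healthy failures:", run_gates(stub_retrieve, GATE_CASES))
print("Leaky failures:", run_gates(leaky_retrieve, GATE_CASES))
```

Swapping the stub for a function that wraps the real retriever and returns chunk IDs would turn the table into a release check that names the gate that regressed.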

Trace each lane before adding reranking

When a final answer is wrong, you need to tell apart three failures:

| Failure | Trace evidence | Next repair |
| --- | --- | --- |
| Retrieval miss | Expected chunk absent from sparse, dense, and fused candidates | Improve indexing, encoder, query handling, or fusion |
| Fusion ordering issue | Expected chunk exists in a lane but falls below context budget | Tune fusion on held-out labels |
| Later precision issue | Correct chunk is in fused candidates but distractors rank above it | Add and evaluate the reranking stage in the next lesson |

Store IDs, ranks, model and index versions, fusion settings, and timing. Don't log policy text in a broad diagnostic event.

hybrid-retrieval-trace.py
```python
def trace_hybrid_request(
    query: str,
    query_kind: str,
    caller: Caller,
) -> dict[str, object]:
    searchable = permitted_chunks(caller, CHUNKS, EVAL_DATE)
    sparse = bm25_rank(query, searchable)
    dense = dense_rank(query, searchable)
    fused = reciprocal_rank_fusion([sparse, dense])
    # Record IDs, versions, and timings only; never the policy text.
    return {
        "versions": {
            "retriever": "policy-retriever-v2",
            "index": "policy-index/2026-05-27",
            "sparse": "bm25-tokenizer-v1",
            "dense": "fixture-embeddings-v1",
            "fusion": f"rrf-k{RRF_K}",
        },
        "query_kind": query_kind,
        "sparse_ids": [chunk.chunk_id for chunk, _ in sparse],
        "dense_ids": [chunk.chunk_id for chunk, _ in dense],
        "fused_ids": [chunk.chunk_id for chunk, _ in fused[:2]],
        # Fixed fixture timings; a real trace would measure each stage.
        "timings_ms": {"authorize": 2, "bm25": 4, "dense": 11, "fusion": 1},
    }

trace = trace_hybrid_request(PARAPHRASE, "paraphrase-regression", LUNA)
stores_raw_policy_text = any(
    chunk.text in str(trace)
    for chunk in CHUNKS
)
print("Versions:", trace["versions"])
print("Sparse ids:", trace["sparse_ids"])
print("Dense ids:", trace["dense_ids"])
print("Fused ids:", trace["fused_ids"])
print("Trace stores raw policy text:", stores_raw_policy_text)
assert trace["fused_ids"][0] == "eu-refurb-v2-rule"
assert "eu-footwear-v1-rule" in trace["fused_ids"]
assert "merchant-vip-refurb" not in str(trace)
assert not stores_raw_policy_text
```

Output

```
Versions: {'retriever': 'policy-retriever-v2', 'index': 'policy-index/2026-05-27', 'sparse': 'bm25-tokenizer-v1', 'dense': 'fixture-embeddings-v1', 'fusion': 'rrf-k60'}
Sparse ids: []
Dense ids: ['eu-refurb-v2-rule', 'eu-footwear-v1-rule']
Fused ids: ['eu-refurb-v2-rule', 'eu-footwear-v1-rule']
Trace stores raw policy text: False
```

The correct evidence is present, but the semantic lane also kept a footwear distractor. That is exactly the boundary between this lesson and the next: retrieval satisfies candidate recall; reranking decides whether a distractor should remain near context. The trace can also preserve the latency budget created in the production RAG lesson. Two retrieval lanes add work, so the release check should report that cost explicitly.

If a context budget is being wasted by several near-duplicate candidates, Maximum Marginal Relevance (MMR) is one selection strategy: choose a relevant result while penalizing candidates too similar to what has already been selected.[5] MMR handles diversity in an existing permitted candidate set. It doesn't retrieve a missing policy and doesn't replace a cross-encoder that must compare query relevance precisely.
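A minimal greedy MMR sketch follows, assuming cosine similarity for both relevance and redundancy and a trade-off weight lam. The vectors and chunk names are hypothetical fixtures, and mmr_select is an illustrative helper rather than part of the lab.

```python
from math import sqrt


def cosine(a, b):
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 0.0 if norm == 0 else sum(x * y for x, y in zip(a, b)) / norm


def mmr_select(query_vec, candidates, budget, lam=0.5):
    """Greedy MMR: lam * relevance - (1 - lam) * max similarity to selected items."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < budget:
        def mmr_score(cid):
            relevance = cosine(query_vec, remaining[cid])
            redundancy = max(
                (cosine(remaining[cid], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        # Sorting the IDs first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical permitted candidates: two near-duplicates and one different chunk.
candidates = {
    "policy-a": (1.0, 0.0, 0.0),
    "policy-a-near-duplicate": (0.98, 0.0, 0.2),
    "policy-b": (0.3, 0.9, 0.0),
}
query_vec = (1.0, 0.2, 0.0)

print("lam=1.0:", mmr_select(query_vec, candidates, budget=2, lam=1.0))
print("lam=0.5:", mmr_select(query_vec, candidates, budget=2, lam=0.5))
```

With lam=1.0 the selection is pure relevance and both near-duplicates fill the budget. With lam=0.5 the second slot goes to the different chunk. Note that the diverse pick is also the less relevant one, which is why MMR complements a precise relevance model instead of replacing it.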

retrieval-latency-gate.py
```python
RETRIEVAL_BUDGET_MS = {"authorize": 10, "bm25": 12, "dense": 40, "fusion": 8}

def exceeded_retrieval_budgets(timings: dict[str, int]) -> list[str]:
    # A missing timing counts as a failure, not as a pass.
    return [
        stage
        for stage, budget in RETRIEVAL_BUDGET_MS.items()
        if stage not in timings or timings[stage] > budget
    ]

healthy = trace["timings_ms"]
missing_dense = {
    stage: elapsed
    for stage, elapsed in healthy.items()
    if stage != "dense"
}
print("Healthy exceeded:", exceeded_retrieval_budgets(healthy))
print("Missing timing exceeded:", exceeded_retrieval_budgets(missing_dense))
assert exceeded_retrieval_budgets(healthy) == []
assert exceeded_retrieval_budgets(missing_dense) == ["dense"]
```

Output

```
Healthy exceeded: []
Missing timing exceeded: ['dense']
```

Should you tune weights instead?

RRF is a good first implementation because it doesn't require calibrating unrelated score scales. It isn't an automatic winner. Bruch et al. found that RRF can be sensitive to its parameter and that a tuned convex combination can outperform it in their tested settings.[6] If you have enough labeled queries, compare it against normalized weighted fusion:

$$\operatorname{score}_{\text{hybrid}}(d)=\alpha\,\operatorname{score}_{\text{dense}}(d)+(1-\alpha)\,\operatorname{score}_{\text{sparse}}(d)$$

That comparison is an evaluation task, not a reason to guess an alpha in production. Keep a fixed held-out split, version the encoder and index, report Recall@k and latency for every candidate, and retain RRF if a tuned method doesn't hold up out of sample.
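As a starting point for that comparison, here is a sketch of normalized weighted fusion. It assumes per-query min-max normalization, which is only one of several possible choices, and it treats a chunk missing from a lane as scoring 0 in that lane. The lane scores are hypothetical, and min_max and convex_fusion are illustrative names.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one lane's scores to [0, 1]. Assumption: a lane whose scores are
    all equal (including a single hit) maps every candidate to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {cid: 1.0 for cid in scores}
    return {cid: (value - low) / (high - low) for cid, value in scores.items()}


def convex_fusion(dense: dict[str, float], sparse: dict[str, float],
                  alpha: float) -> list[str]:
    """Order chunk IDs by alpha * dense + (1 - alpha) * sparse on normalized scores."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    fused = {
        cid: alpha * dense_n.get(cid, 0.0) + (1 - alpha) * sparse_n.get(cid, 0.0)
        for cid in set(dense_n) | set(sparse_n)
    }
    return sorted(fused, key=lambda cid: (-fused[cid], cid))


# Hypothetical lane scores for one query.
dense_scores = {"a": 0.92, "b": 0.60, "c": 0.35}
sparse_scores = {"b": 7.1, "c": 6.8, "d": 2.0}

for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha}:", convex_fusion(dense_scores, sparse_scores, alpha))
```

The order changes with alpha, so the value has to be selected on labeled held-out queries and then frozen alongside the encoder and index versions.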

Build it yourself

Extend policy-answerer-v2 without weakening its contract:

  1. Add at least eight permitted positive queries: exact codes, natural-language paraphrases, and mixed queries.
  2. Add at least four negative or adversarial queries: restricted merchant rules, superseded policy wording, wrong region, and absent evidence.
  3. Replace the frozen dense vectors with embeddings produced by your chosen bi-encoder, recording the model version.
  4. Compare BM25, dense, and RRF using permitted Recall@k; record p50 and p95 retrieval latency.
  5. Save a compact trace for each failed case containing candidate IDs, lane ranks, versions, and timings, but not restricted content.
  6. Keep the fused top candidates as the input artifact for the next lesson's reranker.

The important artifact is not a search demo. It is a retrieval report showing which evidence questions are recovered, which must abstain, and which source boundaries remain enforced after the upgrade.
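For step 4, the latency half of the report needs a percentile definition that stays fixed across runs. The sketch below uses the nearest-rank method on a list of hypothetical per-request timings; percentile is an illustrative helper, and other percentile definitions interpolate and can give slightly different values.

```python
from math import ceil


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical end-to-end retrieval latencies in milliseconds, one per request.
latencies_ms = [14, 15, 15, 16, 16, 17, 18, 18, 19, 21,
                22, 22, 23, 24, 25, 27, 29, 31, 38, 64]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"p50={p50} ms, p95={p95} ms, max={max(latencies_ms)} ms")
```

Reporting p95 next to p50 keeps a slow tail visible, such as the single 64 ms request here, which an average would hide.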

Mastery check

You are ready to use hybrid retrieval in a RAG system when you can:

  • Explain why BM25 recovers rare identifiers and why dense retrieval can recover paraphrases.
  • Explain when learned sparse expansion such as SPLADE is a candidate alternative to BM25.
  • Implement BM25 scoring and RRF fusion over permitted candidate records.
  • Explain why MMR diversifies selected evidence but doesn't repair a retrieval miss.
  • Explain why BM25 scores and cosine similarities must not be added without calibration.
  • Treat frozen semantic vectors as a test fixture, not proof that one production embedding model behaves the same way.
  • Evaluate retrieval with expected evidence IDs before evaluating generated answers.
  • Preserve authorization and freshness filtering ahead of both retrieval lanes.
  • Produce a lane-by-lane trace that makes a missed chunk diagnosable.

Evaluation rubric

| Level | Evidence in your submission |
| --- | --- |
| Foundational | Correctly ranks RPL-14 with BM25 and explains its term-based signal |
| Applied | Recovers the paraphrased device-replacement question through dense retrieval and fuses candidates with RRF |
| Strong | Reports BM25-only, dense-only, and hybrid Recall@k on labeled positive cases plus negative safety gates |
| Production-ready | Uses a versioned encoder and index, measures latency, and proves restricted or superseded policies never enter fused candidates |

Common pitfalls

| Symptom | Likely cause | Repair |
| --- | --- | --- |
| Rule-code lookup returns a generic policy | Dense-only search lost a rare identifier | Keep or restore the lexical lane and add code queries to the release set |
| Paraphrased question returns nothing | Sparse-only search requires the policy's exact wording | Add a dense lane and test semantic queries against required evidence IDs |
| Fused ranking changes wildly after an encoder update | Raw scores or tuned weights no longer have the same calibration | Compare against RRF and retune only on a fixed labeled split |
| Hidden merchant rule appears in any candidate trace | Retrieval ran before authorization filtering | Restrict the searchable candidate universe before either lane executes |
| Team blames generation for an unsupported answer | Retrieval evidence IDs were never evaluated | Measure permitted Recall@k and abstention before scoring final text |

Next Step
Continue to Reranking and Cross-Encoders for RAG

You now have a permission-safe hybrid retriever that recovers exact codes and semantic paraphrases into a measured candidate set. Next you will add a slower precision stage that reorders those candidates before they enter the generator context.

References

[1] Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.

[2] Formal, T., et al. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. SIGIR 2021.

[3] Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.

[4] Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR '09.

[5] Carbonell, J., & Goldstein, J. (1998). The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998.

[6] Bruch, S., Gai, S., & Ingber, A. (2023). An Analysis of Fusion Functions for Hybrid Retrieval. ACM Transactions on Information Systems.