LeetLLM

Reranking and Cross-Encoders for RAG

Turn a permission-safe hybrid candidate list into precise context using cross-encoder reasoning, ordering metrics, latency gates, and traceable evidence selection.

The hybrid-search lesson built policy-answerer-v2 for retrieval-augmented generation (RAG): only current, permitted policy chunks enter retrieval, then sparse and dense ranks are fused. That fixed recall. It didn't guarantee that the best chunk fits into a small generator context window.

policy-answerer-v3 adds the missing reranking step. Luna is answering a developer's question:

Can a service account use the legacy token endpoint for 10 more days if audit logging is enabled?

Hybrid retrieval has already found the current rule api-token-legacy-v2-rule, but a nearby troubleshooting note appears above it. You'll add a reranking stage that reorders only the permitted candidates, admits supported evidence into a two-chunk maximum context budget, and emits a release trace for evaluation.

Figure: Reranking overview. The same permitted hybrid candidate list feeds pair scoring, the supported legacy-token rule moves from rank three to rank one, and the two-chunk context budget keeps supported evidence while blocked or stale records remain outside scoring.
Reranking keeps candidate membership fixed, moves the supported legacy-token rule into the top slot, and never scores blocked or stale records.

The boundary: retrieve candidates, then improve their order

A two-stage retriever separates two problems:

| Stage | Question | Optimized signal | Must never change |
| --- | --- | --- | --- |
| Hybrid retrieval | Is useful evidence in the candidate set? | Recall over current permitted chunks | Authorization and freshness boundary |
| Reranking | Which retrieved chunks best answer this query? | Precision near the top | Candidate membership |
| Generation | What answer can be supported? | Grounded response with citations | Selected evidence only |

The reranker can't restore a missing document. It must stay inside the same permission boundary: never search a restricted or superseded document as a shortcut.

The runtime contract is compact: accept the permitted hybrid candidates, score each query-chunk pair, admit only score-gated top context, and write an evidence-level trace.

The lab starts from a small fixture representing the hybrid output from the previous lesson. api-token-legacy-v2-rule is present, so recall succeeded. It's at rank 3, so a two-chunk context window would still omit the answer.

permitted-hybrid-candidates.py
from __future__ import annotations

from dataclasses import dataclass
from math import log2

# One retrieved chunk plus the identity fields that citation packing needs later.
@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    document_id: str
    parent_id: str
    version: str
    permitted: bool
    current: bool
    first_stage_rank: int
    text: str

QUERY = (
    "legacy token endpoint for service account during 10 day migration "
    "with audit logging enabled"
)
TARGET_ID = "api-token-legacy-v2-rule"
CANDIDATES = [
    Candidate(
        "api-token-troubleshooting-v1",
        "api-token-troubleshooting",
        "api-token-troubleshooting-v1",
        "api-token-troubleshooting/2026-04-20",
        True,
        True,
        1,
        (
            "Legacy token endpoint errors can be inspected during migration. "
            "This note does not authorize temporary access."
        ),
    ),
    Candidate(
        "api-password-reset-v1",
        "api-password-reset",
        "api-password-reset-v1",
        "api-password-reset/2026-01-03",
        True,
        True,
        2,
        "Password reset tokens expire after 30 minutes.",
    ),
    Candidate(
        "api-token-legacy-v2-rule",
        "api-auth",
        "api-auth-v2",
        "api-auth/2026-04-01",
        True,
        True,
        3,
        (
            "Rule AUTH-14. Service accounts may use the legacy token endpoint "
            "within 14 days of deprecation when audit logging is enabled."
        ),
    ),
    Candidate(
        "api-audit-export-v1",
        "api-audit",
        "api-audit-v1",
        "api-audit/2026-03-12",
        True,
        True,
        4,
        "Audit logs can be exported within 14 days.",
    ),
]

# Recall succeeded (the target is present), but it sits at rank 3.
first_stage = sorted(CANDIDATES, key=lambda candidate: candidate.first_stage_rank)
first_stage_ids = [candidate.chunk_id for candidate in first_stage]
print("Hybrid order:", first_stage_ids)
assert TARGET_ID in first_stage_ids
assert first_stage_ids.index(TARGET_ID) + 1 == 3

Output

Hybrid order: ['api-token-troubleshooting-v1', 'api-password-reset-v1', 'api-token-legacy-v2-rule', 'api-audit-export-v1']

The candidate record keeps document_id and parent_id from the earlier RAG pipeline. Reranking changes evidence order, not the source identity that citation packing needs later.

Keep the evidence boundary executable

The store also contains records Luna may not use: an admin-only rule and an expired policy revision. A regression test should prove that neither enters reranking, even though either might be attractive for the query.

reranker-boundary-regression.py
# The store also holds an admin-only record and an expired revision.
SOURCE_STORE = CANDIDATES + [
    Candidate(
        "admin-token-legacy",
        "admin-token-terms",
        "admin-token-terms",
        "admin-token/2026-05-01",
        False,
        True,
        0,
        "Admin service accounts receive immediate legacy token access.",
    ),
    Candidate(
        "api-token-legacy-v1-rule",
        "api-auth",
        "api-auth-v1",
        "api-auth/2025-02-01",
        True,
        False,
        0,
        "Service accounts may use the legacy token endpoint within 30 days.",
    ),
]
FIRST_STAGE_ID_SET = set(first_stage_ids)

def rerankable_candidates(store: list[Candidate]) -> list[Candidate]:
    # Keep only permitted, current chunks that first-stage retrieval returned.
    return [
        candidate
        for candidate in store
        if (
            candidate.permitted
            and candidate.current
            and candidate.chunk_id in FIRST_STAGE_ID_SET
        )
    ]

rerankable = rerankable_candidates(SOURCE_STORE)
blocked_ids = sorted(
    candidate.chunk_id
    for candidate in SOURCE_STORE
    if candidate not in rerankable
)
print("Rerankable:", [candidate.chunk_id for candidate in rerankable])
print("Blocked:", blocked_ids)
assert rerankable == CANDIDATES
assert blocked_ids == ["admin-token-legacy", "api-token-legacy-v1-rule"]

Output

Rerankable: ['api-token-troubleshooting-v1', 'api-password-reset-v1', 'api-token-legacy-v2-rule', 'api-audit-export-v1']
Blocked: ['admin-token-legacy', 'api-token-legacy-v1-rule']

The allowlist is the first-stage chunk-ID set, not a fresh corpus scan. Stable IDs keep candidate membership explicit even if records are loaded from another store layer.

With no reranker, the generator receives the first two hybrid candidates. That context contains a troubleshooting note that explicitly says it does not authorize temporary access, plus a password-reset policy. The correct legacy-token rule is present in retrieval results but absent from context.

context-before-reranking.py
CONTEXT_BUDGET = 2

def pack_context(candidates: list[Candidate], budget: int) -> list[str]:
    # Without reranking, context is simply the first `budget` candidates.
    return [candidate.chunk_id for candidate in candidates[:budget]]

context_before = pack_context(first_stage, CONTEXT_BUDGET)
print("Context before rerank:", context_before)
print("Target reaches generation:", TARGET_ID in context_before)
assert TARGET_ID not in context_before

Output

Context before rerank: ['api-token-troubleshooting-v1', 'api-password-reset-v1']
Target reaches generation: False

Why a cross-encoder helps

A bi-encoder encodes a query and each chunk separately, then compares stored document vectors to a query vector. Separately encoded representations make large-scale retrieval practical because document embeddings can be indexed before a request arrives.[1]

A cross-encoder receives the query and one candidate chunk together and returns a relevance score for that pair. Passage reranking with BERT applies this joint scoring only after an initial retrieval stage, because request-time scoring for each pair is more expensive than searching a prebuilt index.[2]

That distinction matters here. The troubleshooting note shares legacy token endpoint and migration language with Luna's query, but it also says it does not authorize temporary access. The legacy-token rule must match the requested endpoint, temporary-access remedy, migration window, and audit condition together.

The next fixture is deliberately transparent. It isn't a trained transformer, and it doesn't claim real model accuracy. It makes the pairwise requirements inspectable before you plug in a model endpoint.

inspect-pairwise-requirements.py
@dataclass(frozen=True)
class Request:
    endpoint: str
    remedy: str
    days_since_deprecation: int
    audit_enabled: bool

REQUEST = Request(
    endpoint="legacy token endpoint",
    remedy="temporary access",
    days_since_deprecation=10,
    audit_enabled=True,
)
requirements = [
    REQUEST.endpoint,
    REQUEST.remedy,
    "migration window covers 10 days",
    "audit logging is enabled",
]
print("Pairwise requirements:", requirements)
assert REQUEST.days_since_deprecation <= 14
assert REQUEST.audit_enabled

Output

Pairwise requirements: ['legacy token endpoint', 'temporary access', 'migration window covers 10 days', 'audit logging is enabled']

The score below rewards a chunk only when its policy can support the developer's constraints. It also applies an explicit contradiction penalty. A learned cross-encoder would learn a scoring function from labeled query-passage pairs; the fixture gives the lab a stable expected result.

pairwise-reranker.py
@dataclass(frozen=True)
class PairScore:
    candidate: Candidate
    score: int
    reasons: tuple[str, ...]

class ConstraintAwarePairScorer:
    def score(self, request: Request, candidate: Candidate) -> PairScore:
        text = candidate.text.lower()
        score = 0
        reasons: list[str] = []

        if request.endpoint in text:
            score += 2
            reasons.append("endpoint")
        if request.remedy in text:
            score += 3
            reasons.append("remedy")
        if "service accounts" in text:
            score += 1
            reasons.append("principal")
        if "within 14 days" in text and request.days_since_deprecation <= 14:
            score += 2
            reasons.append("migration-window")
        if "audit logging is enabled" in text and request.audit_enabled:
            score += 2
            reasons.append("audit-condition")
        # An explicit denial outweighs any surface overlap with the query.
        if "does not authorize temporary access" in text:
            score -= 6
            reasons.append("contradiction")
        return PairScore(candidate, score, tuple(reasons))

def rerank(
    request: Request,
    candidates: list[Candidate],
    scorer: ConstraintAwarePairScorer,
) -> list[PairScore]:
    scored = [scorer.score(request, candidate) for candidate in candidates]
    # Highest score first; ties fall back to the earlier first-stage rank.
    return sorted(
        scored,
        key=lambda result: (result.score, -result.candidate.first_stage_rank),
        reverse=True,
    )

reranked = rerank(REQUEST, rerankable, ConstraintAwarePairScorer())
reranked_ids = [result.candidate.chunk_id for result in reranked]
for result in reranked:
    print(result.candidate.chunk_id, result.score, result.reasons)
assert reranked_ids[0] == TARGET_ID
assert rerank(REQUEST, [], ConstraintAwarePairScorer()) == []

Output

api-token-legacy-v2-rule 7 ('endpoint', 'principal', 'migration-window', 'audit-condition')
api-audit-export-v1 2 ('migration-window',)
api-password-reset-v1 0 ()
api-token-troubleshooting-v1 -1 ('endpoint', 'remedy', 'contradiction')
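
When a trained model replaces the fixture, the reranking seam can keep the same shape: score each supplied pair, sort, and leave membership alone. The sketch below is one hedged way to express that seam. `PairScorer`, `KeywordOverlapScorer`, and `rerank_texts` are hypothetical names introduced for illustration, and the word-overlap scorer is only a stand-in for a model call, not a real cross-encoder.

```python
from __future__ import annotations

from typing import Protocol


class PairScorer(Protocol):
    """Anything that scores one (query, chunk) pair; higher means more relevant."""

    def score_pair(self, query: str, chunk_text: str) -> float: ...


class KeywordOverlapScorer:
    """Stand-in scorer (assumption): share of query words found in the chunk."""

    def score_pair(self, query: str, chunk_text: str) -> float:
        query_words = set(query.lower().split())
        chunk_words = set(chunk_text.lower().split())
        if not query_words:
            return 0.0
        return len(query_words & chunk_words) / len(query_words)


def rerank_texts(
    query: str, chunks: dict[str, str], scorer: PairScorer
) -> list[tuple[str, float]]:
    """Score every supplied chunk and sort by score; membership never changes."""
    scored = [
        (chunk_id, scorer.score_pair(query, text))
        for chunk_id, text in chunks.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical two-chunk input, already filtered for permission and freshness.
chunks = {
    "reset": "Password reset tokens expire after 30 minutes.",
    "rule": "Service accounts may use the legacy token endpoint "
            "when audit logging is enabled.",
}
ranked = rerank_texts(
    "legacy token endpoint audit logging", chunks, KeywordOverlapScorer()
)
print([chunk_id for chunk_id, _ in ranked])
```

Because the function only reorders the chunks it was handed, swapping the scorer for a model-backed one would not widen the evidence boundary.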

Now context selection changes for the right reason: the candidate set stays fixed while the supported rule moves to rank 1. A context budget is a maximum, not a quota, so low-scoring near matches aren't admitted merely to fill space.

context-after-reranking.py
reranked_candidates = [result.candidate for result in reranked]
MIN_CONTEXT_SCORE = 5
# The budget is a maximum: only chunks that clear the score gate are admitted.
selected_after = [
    result.candidate
    for result in reranked
    if result.score >= MIN_CONTEXT_SCORE
][:CONTEXT_BUDGET]
context_after = [candidate.chunk_id for candidate in selected_after]
print("Context before:", context_before)
print("Context after:", context_after)
assert TARGET_ID in context_after
assert "api-token-troubleshooting-v1" not in context_after
assert set(reranked_ids) == set(first_stage_ids)

Output

Context before: ['api-token-troubleshooting-v1', 'api-password-reset-v1']
Context after: ['api-token-legacy-v2-rule']

Measure ordering, not vibes

For this stage, keep evaluation narrow:

  1. Recall@candidate_k checks whether first-stage retrieval gave the reranker a chance.
  2. MRR (Mean Reciprocal Rank) averages how early the first relevant chunk appears across a release suite.
  3. NDCG@context_k (Normalized Discounted Cumulative Gain) checks whether relevant chunks fit near the top of the context budget.[3]

For this one request, reciprocal rank (RR) is easy to read: rank 3 gives 1 / 3; rank 1 gives 1. MRR is the mean of those per-request values. NDCG supports graded relevance; this lab uses binary relevance, where a relevant chunk has rel_i = 1 and an irrelevant chunk has rel_i = 0:

$$
\operatorname{DCG}_k = \sum_{i=1}^{k} \frac{2^{rel_i}-1}{\log_2(i+1)} \qquad \operatorname{NDCG}_k = \frac{\operatorname{DCG}_k}{\operatorname{IDCG}_k}
$$
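
As a worked instance for this lab's single request, take k = 2 with one relevant chunk and binary relevance. The ideal ordering puts that chunk at rank 1, so the normalizer is 1. Before reranking, both top-two slots hold irrelevant chunks; after reranking, the relevant chunk is at rank 1:

$$
\operatorname{IDCG}_2 = \frac{2^{1}-1}{\log_2(1+1)} = 1
$$

$$
\text{before: } \operatorname{DCG}_2 = \frac{2^{0}-1}{\log_2 2} + \frac{2^{0}-1}{\log_2 3} = 0 \quad\Rightarrow\quad \operatorname{NDCG}_2 = 0
$$

$$
\text{after: } \operatorname{DCG}_2 = \frac{2^{1}-1}{\log_2 2} + \frac{2^{0}-1}{\log_2 3} = 1 \quad\Rightarrow\quad \operatorname{NDCG}_2 = 1
$$
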
ordering-gate.py
RELEVANT_IDS = {TARGET_ID}

def reciprocal_rank(ids: list[str], relevant_ids: set[str]) -> float:
    # 1 / rank of the first relevant chunk, or 0.0 when none is ranked.
    for rank, chunk_id in enumerate(ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ids: list[str], relevant_ids: set[str], k: int) -> float:
    # Binary relevance: the gain is 2**1 - 1 = 1 for a hit and 2**0 - 1 = 0 otherwise.
    dcg = sum(
        (2 ** int(chunk_id in relevant_ids) - 1) / log2(rank + 1)
        for rank, chunk_id in enumerate(ids[:k], start=1)
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum((2**1 - 1) / log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

before_rr = reciprocal_rank(first_stage_ids, RELEVANT_IDS)
after_rr = reciprocal_rank(reranked_ids, RELEVANT_IDS)
before_ndcg = ndcg_at_k(first_stage_ids, RELEVANT_IDS, CONTEXT_BUDGET)
after_ndcg = ndcg_at_k(reranked_ids, RELEVANT_IDS, CONTEXT_BUDGET)
print(f"RR: {before_rr:.2f} -> {after_rr:.2f}")
print(f"NDCG@{CONTEXT_BUDGET}: {before_ndcg:.2f} -> {after_ndcg:.2f}")
assert after_rr > before_rr
assert before_ndcg == 0.0 and after_ndcg == 1.0

Output

RR: 0.33 -> 1.00
NDCG@2: 0.00 -> 1.00
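
The fixture reports reciprocal rank for one request, while MRR averages that value over a release suite. A minimal sketch of that averaging step follows. The three-request `SUITE` and its single-letter chunk IDs are hypothetical, and `mean_reciprocal_rank` is an illustrative name rather than part of the lab code.

```python
from statistics import mean


def reciprocal_rank(ids: list[str], relevant_ids: set[str]) -> float:
    """Return 1/rank of the first relevant ID, or 0.0 when none is ranked."""
    for rank, chunk_id in enumerate(ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(suite: list[tuple[list[str], set[str]]]) -> float:
    """Average the per-request reciprocal ranks; an empty suite scores 0.0."""
    if not suite:
        return 0.0
    return mean(reciprocal_rank(ids, relevant) for ids, relevant in suite)


# Hypothetical suite: (ranked chunk IDs, relevant chunk IDs) per request.
SUITE = [
    (["a", "b", "c"], {"a"}),  # first relevant chunk at rank 1 -> 1.0
    (["d", "e", "f"], {"e"}),  # first relevant chunk at rank 2 -> 0.5
    (["g", "h", "i"], {"z"}),  # relevant chunk never ranked    -> 0.0
]
mrr = mean_reciprocal_rank(SUITE)
print(f"MRR: {mrr:.2f}")
```

The third request contributes zero rather than being dropped, so a recall failure lowers MRR instead of hiding behind the requests that succeeded.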

Don't confuse improved evidence ordering with a correct generated answer. The next lesson will evaluate faithfulness and citation agreement after context reaches the generator.

Candidate count sets the ceiling and the bill

If the reranker receives only the first two candidates, it can't select api-token-legacy-v2-rule: the rule was cut before pair scoring. Use a candidate-recall gate before comparing reranker models.

candidate-count-gate.py
def target_present(candidates: list[Candidate]) -> bool:
    return TARGET_ID in [candidate.chunk_id for candidate in candidates]

RERANK_CANDIDATE_BUDGET = 3
too_small = first_stage[:2]  # cuts the target before pair scoring
release_input = first_stage[:RERANK_CANDIDATE_BUDGET]
limited_ids = [result.candidate.chunk_id for result in rerank(
    REQUEST, too_small, ConstraintAwarePairScorer()
)]
release_reranked = rerank(REQUEST, release_input, ConstraintAwarePairScorer())
release_reranked_ids = [result.candidate.chunk_id for result in release_reranked]
print("top-2 includes target:", target_present(too_small), limited_ids)
print("top-3 includes target:", target_present(release_input), release_reranked_ids)
assert not target_present(too_small)
assert release_reranked_ids[0] == TARGET_ID

Output

top-2 includes target: False ['api-password-reset-v1', 'api-token-troubleshooting-v1']
top-3 includes target: True ['api-token-legacy-v2-rule', 'api-password-reset-v1', 'api-token-troubleshooting-v1']

A cross-encoder scores every query-candidate pair online. This fixture uses a deliberately sequential cost model so the candidate-count tradeoff stays visible. It isn't a p95 estimator or a hardware benchmark: production serving may batch pairs, and percentiles must be measured end to end.

latency-budget-gate.py
SEQUENTIAL_PAIR_COST_MS = 7.5
REQUEST_OVERHEAD_MS = 4.0
FIXTURE_LATENCY_BUDGET_MS = 30.0

def estimated_fixture_latency_ms(candidate_count: int) -> float:
    # Sequential model: fixed overhead plus one pair cost per candidate.
    return REQUEST_OVERHEAD_MS + SEQUENTIAL_PAIR_COST_MS * candidate_count

release_fixture_latency = estimated_fixture_latency_ms(len(release_input))
too_wide_fixture_latency = estimated_fixture_latency_ms(len(first_stage))
print(
    f"top-3 fixture latency: {release_fixture_latency:.1f} ms, "
    f"pass={release_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS}"
)
print(
    f"top-4 fixture latency: {too_wide_fixture_latency:.1f} ms, "
    f"pass={too_wide_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS}"
)
assert release_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS
assert too_wide_fixture_latency > FIXTURE_LATENCY_BUDGET_MS

Output

top-3 fixture latency: 26.5 ms, pass=True
top-4 fixture latency: 34.0 ms, pass=False

The lab result is intentionally specific: retrieve at least three candidates, rerank three, and pass at most two chunks that clear the score gate. A real service chooses these numbers from held-out traces and measured end-to-end p95 latency by candidate count, input length, and batch strategy, not from a generic default.
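
One way to turn held-out measurements into a candidate count is to pick the smallest count that clears both the recall gate and the latency gate. The sketch below assumes such measurements already exist. Every number in `MEASUREMENTS` is invented for illustration, and `CandidateCountMeasurement` and `smallest_passing_count` are hypothetical names, not part of the lab.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateCountMeasurement:
    candidate_count: int
    recall_at_k: float     # share of held-out requests with the gold chunk in top k
    p95_latency_ms: float  # measured end to end, not estimated


def smallest_passing_count(
    measurements: list[CandidateCountMeasurement],
    min_recall: float,
    latency_budget_ms: float,
) -> int | None:
    """Return the smallest candidate count that clears both gates, else None."""
    passing = [
        m.candidate_count
        for m in measurements
        if m.recall_at_k >= min_recall and m.p95_latency_ms <= latency_budget_ms
    ]
    return min(passing, default=None)


# Hypothetical measurements; real values come from held-out traces and load tests.
MEASUREMENTS = [
    CandidateCountMeasurement(5, 0.81, 38.0),
    CandidateCountMeasurement(10, 0.93, 61.0),
    CandidateCountMeasurement(20, 0.96, 112.0),
    CandidateCountMeasurement(50, 0.97, 270.0),
]
chosen = smallest_passing_count(MEASUREMENTS, min_recall=0.90, latency_budget_ms=150.0)
print("Chosen candidate count:", chosen)
```

Returning `None` when nothing passes keeps the failure explicit: the fix is then in retrieval or serving, not in quietly relaxing a gate.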

Choose an interaction design deliberately

Three ranking designs differ in when the query and chunk can interact:

| Design | Stored before request | Request-time work | Role in this pipeline |
| --- | --- | --- | --- |
| Bi-encoder retrieval | One embedding per chunk | Compare query vector to indexed chunk vectors | Build a broad candidate set |
| Cross-encoder reranking | No query-specific pair score | Jointly score each retrieved query-chunk pair | Precision stage here |
| ColBERT late interaction | Contextual token vectors per chunk | Match query token vectors against chunk token vectors | Alternative to evaluate under a tighter latency budget |

ColBERT keeps document-side representations indexable while applying token-level late interaction at request time.[4] For each query token, MaxSim keeps its highest similarity to any document token, then sums those per-token maxima into a relevance score. It isn't an automatic upgrade. Compare it against a cross-encoder on the same held-out candidate lists, ordering metric, memory cost, and p95 latency.
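
The MaxSim step described above can be sketched in a few lines of plain Python. This is a toy illustration, not ColBERT itself: the 2-D token vectors are invented, a plain dot product is assumed as the similarity function, and `dot` and `maxsim_score` are hypothetical helper names.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(
    query_vecs: list[list[float]], doc_vecs: list[list[float]]
) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    if not doc_vecs:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hypothetical token vectors: two query tokens, two candidate documents.
QUERY_TOKENS = [[1.0, 0.0], [0.0, 1.0]]
DOC_A = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # has a strong match for both tokens
DOC_B = [[0.9, 0.1], [0.6, 0.0]]              # matches only the first token well

score_a = maxsim_score(QUERY_TOKENS, DOC_A)
score_b = maxsim_score(QUERY_TOKENS, DOC_B)
print(f"DOC_A: {score_a:.2f}  DOC_B: {score_b:.2f}")
```

Each query token contributes only its single best match, which is why document-side token vectors can be indexed ahead of time while the matching itself still happens per request.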

Figure: Three-way interaction comparison. A bi-encoder compresses query and chunk separately into one vector each, a cross-encoder lets all query and chunk tokens interact in one joint sequence, and ColBERT keeps chunk token vectors indexed while matching query tokens online with MaxSim.
Bi-encoders interact only after separate compression, cross-encoders interact before scoring, and ColBERT keeps document tokens indexed while matching query tokens online.

Emit the trace that evaluation needs

Shipping a reranker without evidence-level traces makes failures ambiguous. Record first-stage IDs, actual reranker input, candidate source identity, policy versions, pair scores, model versions, selected context IDs, and release gates. A document version is part of correctness: a well-ranked stale rule is still unsafe context.

reranking-release-trace.py
SCORER_VERSION = "fixture-cross-encoder-v1"
selected_context = [
    result.candidate
    for result in release_reranked
    if result.score >= MIN_CONTEXT_SCORE
][:CONTEXT_BUDGET]
release_trace = {
    "query_id": "api-token-legacy-access-001",
    "versions": {
        "retriever": "policy-retriever-v2",
        "index": "policy-index/2026-05-27",
        "sparse": "bm25-tokenizer-v1",
        "dense": "fixture-embeddings-v1",
        "fusion": "rrf-k60",
        "reranker": SCORER_VERSION,
    },
    "first_stage_ids": first_stage_ids,
    "rerank_input_ids": [candidate.chunk_id for candidate in release_input],
    "reranked_ids": release_reranked_ids,
    # Identity, version, and scores only: no raw policy text is stored.
    "candidate_records": [
        {
            "chunk_id": result.candidate.chunk_id,
            "document_id": result.candidate.document_id,
            "parent_id": result.candidate.parent_id,
            "version": result.candidate.version,
            "first_stage_rank": result.candidate.first_stage_rank,
            "rerank_score": result.score,
            "reasons": result.reasons,
        }
        for result in release_reranked
    ],
    "selected_context_ids": [candidate.chunk_id for candidate in selected_context],
    "selected_versions": [candidate.version for candidate in selected_context],
    "gates": {
        "permitted_and_current": all(
            candidate.permitted and candidate.current for candidate in release_input
        ),
        "target_in_context": TARGET_ID in [
            candidate.chunk_id for candidate in selected_context
        ],
        "ordering_lift": after_ndcg > before_ndcg,
        "latency_budget": release_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS,
    },
}
stores_raw_policy_text = any(
    candidate.text in str(release_trace)
    for candidate in SOURCE_STORE
)
print("Versions:", release_trace["versions"])
print("Rerank input:", release_trace["rerank_input_ids"])
print("Selected context:", release_trace["selected_context_ids"])
print("Trace stores raw policy text:", stores_raw_policy_text)
print("Gates:", release_trace["gates"])
assert set(release_trace["reranked_ids"]) == set(release_trace["rerank_input_ids"])
assert not stores_raw_policy_text
assert all(release_trace["gates"].values())

Output

Versions: {'retriever': 'policy-retriever-v2', 'index': 'policy-index/2026-05-27', 'sparse': 'bm25-tokenizer-v1', 'dense': 'fixture-embeddings-v1', 'fusion': 'rrf-k60', 'reranker': 'fixture-cross-encoder-v1'}
Rerank input: ['api-token-troubleshooting-v1', 'api-password-reset-v1', 'api-token-legacy-v2-rule']
Selected context: ['api-token-legacy-v2-rule']
Trace stores raw policy text: False
Gates: {'permitted_and_current': True, 'target_in_context': True, 'ordering_lift': True, 'latency_budget': True}

This trace proves that reranking improved the evidence presented to generation. It doesn't yet prove that an answer states the rule faithfully or cites it correctly. That's exactly the boundary for the next lesson.

Production checks

Before releasing a real reranker:

| Gate | Evidence to log | Failure response |
| --- | --- | --- |
| Candidate recall | Gold chunk ID in first-stage top k | Fix retrieval or raise measured candidate budget |
| Authorization and freshness | ACL decision, chunk version, effective date | Reject request context; never score blocked evidence |
| Ordering lift | Before/after MRR or NDCG on held-out traces | Retrain, replace, or remove reranker |
| Serving budget | Candidate count, token length, model version, p95 latency | Batch, cap input, or choose a tested alternative |
| Downstream grounding | Selected chunk IDs and generated citations | Evaluate in the next pipeline stage |

One attractive mistake is caching a reranker result without its source version. If api-token-legacy-v2-rule changes, a cached score for an older chunk can keep stale evidence at rank 1. Key caches and traces by chunk checksum or policy version, then rerun golden traces after source updates.
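
A minimal sketch of a version-aware cache key follows, assuming a SHA-256 checksum of the chunk text stands in for the policy version. `chunk_checksum` and `rerank_cache_key` are hypothetical helpers, and the revised rule text (`new_text`) is invented purely to show a source change.

```python
import hashlib


def chunk_checksum(text: str) -> str:
    """Content checksum: changes whenever the chunk text changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def rerank_cache_key(
    query: str, chunk_id: str, chunk_text: str, scorer_version: str
) -> str:
    """Key a cached pair score by scorer version, chunk identity, and content."""
    parts = [scorer_version, chunk_id, chunk_checksum(chunk_text), query]
    # A unit-separator character keeps the joined fields unambiguous.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


cache: dict[str, int] = {}
query = "legacy token endpoint for service account"
old_text = "Service accounts may use the legacy token endpoint within 14 days."
new_text = "Service accounts may use the legacy token endpoint within 7 days."  # invented revision

old_key = rerank_cache_key(query, "api-token-legacy-v2-rule", old_text, "fixture-cross-encoder-v1")
cache[old_key] = 7  # score cached for the old chunk content

new_key = rerank_cache_key(query, "api-token-legacy-v2-rule", new_text, "fixture-cross-encoder-v1")
print("Stale score reused:", new_key in cache)
```

Because the checksum is part of the key, an edited chunk misses the cache and is rescored, even though its chunk ID is unchanged.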

Mastery check

You're ready to add a reranker to a production RAG pipeline when you can:

  • Preserve authorization and freshness filtering before any pair scoring.
  • Explain why first-stage recall and final evidence order need separate gates.
  • Compare bi-encoder retrieval, cross-encoder reranking, and ColBERT-style late interaction.
  • Measure ordering improvement with MRR or NDCG inside the context budget.
  • Choose candidate count using candidate recall and measured end-to-end latency.
  • Emit a trace suitable for downstream faithfulness and citation evaluation.

Evaluation rubric

| Level | Evidence in submission |
| --- | --- |
| Foundational | Preserves authorization and freshness filtering before pair scoring. |
| Applied | Reranks only first-stage candidates and proves ordering lift with MRR or NDCG. |
| Strong | Chooses candidate count using candidate recall and measured end-to-end latency. |
| Production-ready | Emits versioned evidence traces for downstream faithfulness and citation evaluation. |

Common pitfalls

Reranking is used to hide recall failure

  • Symptom: Teams swap reranker models, but a gold policy chunk never appears in final context.
  • Cause: First-stage retrieval did not include the chunk in its candidate set.
  • Fix: Gate on Recall@candidate_k first, then evaluate ordering only on candidate sets that contain the answer.
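
The fix above can be sketched as a two-step gate: compute Recall@candidate_k over a suite, then keep only the requests whose gold chunk survived retrieval for ordering evaluation. The suite, chunk IDs, and function names below are hypothetical.

```python
from __future__ import annotations

Case = tuple[list[str], str]  # (first-stage ranked chunk IDs, gold chunk ID)


def recall_at_k(suite: list[Case], k: int) -> float:
    """Share of requests whose gold chunk appears in the first-stage top k."""
    if not suite:
        return 0.0
    hits = sum(1 for ids, gold in suite if gold in ids[:k])
    return hits / len(suite)


def rerankable_cases(suite: list[Case], k: int) -> list[Case]:
    """Requests where reranking has a chance: gold chunk is inside the top k."""
    return [(ids, gold) for ids, gold in suite if gold in ids[:k]]


# Hypothetical suite of three requests.
SUITE: list[Case] = [
    (["a", "b", "c", "d"], "c"),  # gold at rank 3
    (["e", "f", "g", "h"], "h"),  # gold at rank 4
    (["i", "j", "k", "l"], "z"),  # gold never retrieved
]
recall_3 = recall_at_k(SUITE, 3)
recall_4 = recall_at_k(SUITE, 4)
print(f"Recall@3: {recall_3:.2f}  Recall@4: {recall_4:.2f}")
print("Cases eligible for ordering metrics at k=4:", len(rerankable_cases(SUITE, 4)))
```

The third request stays a retrieval failure at any k in this suite, so no reranker comparison should be expected to fix it.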

Candidate count grows without a serving measurement

  • Symptom: Ordering metrics improve while request latency exceeds the product budget.
  • Cause: Each additional cross-encoder pair consumes request-time inference.
  • Fix: Benchmark p95 latency by candidate count and input length, then choose the smallest candidate set that clears recall and ordering gates.

Evidence order improves but grounding remains untested

  • Symptom: NDCG rises, but users still receive unsupported answers.
  • Cause: Reranking evaluates selected context, not whether generation follows it.
  • Fix: Carry the reranking trace into answer faithfulness and citation checks in the next evaluation stage.

Mastery Check

1. The source store contains the first-stage candidates plus an admin-only legacy-token rule and an expired API rule. Which set may be sent to pair scoring?
2. Luna's candidates include a troubleshooting note that shares legacy-token and migration language, but says it does not authorize temporary access. The legacy-token rule is also in the permitted candidate list. What should the pairwise reranker favor?
3. After reranking, scores are: api-token-legacy-v2-rule 7, api-audit-export-v1 2, api-password-reset-v1 0, and api-token-troubleshooting-v1 -1. The context budget is 2 and MIN_CONTEXT_SCORE is 5. What should be passed to generation?
4. In the fixture, the only relevant chunk is rank 3 before reranking and rank 1 after reranking. The context budget is 2 and relevance is binary. Which metric outcome is correct?
5. The target is first-stage rank 3. The sequential fixture latency is 4.0 ms plus 7.5 ms per candidate, with a 30.0 ms gate. Which choice is justified by this fixture, and what remains before using a production budget?
6. Which statement correctly distinguishes the three interaction designs for a permission-safe retrieval pipeline?
7. A cached reranker score keeps an older api-token rule at rank 1 after the policy source has changed. What should the release design do?
8. A reranker release passes ordering metrics, but a later evaluation must debug why selected evidence changed after a policy update. Which trace design supports that investigation without dumping raw policy text?

`policy-answerer-v3` now selects evidence precisely; the next lesson tests whether generation uses that evidence faithfully.

References

[1] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

[2] Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv preprint.

[3] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

[4] Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.