Turn a permission-safe hybrid candidate list into precise context using cross-encoder reasoning, ordering metrics, latency gates, and traceable evidence selection.
The hybrid-search lesson built policy-answerer-v2 for retrieval-augmented generation (RAG): only current, permitted policy chunks enter retrieval, then sparse and dense ranks are fused. That fixed recall. It didn't guarantee that the best chunk fits into a small generator context window.
policy-answerer-v3 adds the missing reranking step. Luna is answering a developer's question:
Can a service account use the legacy token endpoint for 10 more days if audit logging is enabled?
Hybrid retrieval has already found the current rule api-token-legacy-v2-rule, but a nearby troubleshooting note appears above it. You'll add a reranking stage that reorders only the permitted candidates, admits supported evidence into a two-chunk maximum context budget, and emits a release trace for evaluation.
A two-stage retriever separates two problems:
| Stage | Question | Optimized signal | Must never change |
|---|---|---|---|
| Hybrid retrieval | Is useful evidence in the candidate set? | Recall over current permitted chunks | Authorization and freshness boundary |
| Reranking | Which retrieved chunks best answer this query? | Precision near the top | Candidate membership |
| Generation | What answer can be supported? | Grounded response with citations | Selected evidence only |
The reranker can't restore a missing document. It must stay inside the same permission boundary: never search a restricted or superseded document as a shortcut.
The runtime contract is compact: accept the permitted hybrid candidates, score each query-chunk pair, admit only the top-ranked chunks that clear a score gate, and write an evidence-level trace.
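The "must never change candidate membership" rule can be made executable with a small invariant check on every reranker output. The sketch below is illustrative only: `check_rerank_contract` and its arguments are hypothetical names, not part of the lab code.

```python
from collections import Counter


def check_rerank_contract(input_ids: list[str], output_ids: list[str]) -> None:
    """Raise ValueError unless output_ids is a pure reordering of input_ids.

    Hypothetical helper; the name and signature are assumptions for illustration.
    """
    added = Counter(output_ids) - Counter(input_ids)
    dropped = Counter(input_ids) - Counter(output_ids)
    if added:
        raise ValueError(f"reranker introduced candidates: {sorted(added)}")
    if dropped:
        raise ValueError(f"reranker dropped candidates: {sorted(dropped)}")


# A pure reordering passes silently.
check_rerank_contract(["a", "b", "c"], ["c", "a", "b"])

# A reranker that pulls in an outside chunk is rejected.
try:
    check_rerank_contract(["a", "b"], ["b", "a", "restricted"])
except ValueError as error:
    print(error)
```

Running a check like this before context packing turns a membership violation into a hard failure instead of a silent permission leak.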
The lab starts from a small fixture representing the hybrid output from the previous lesson. api-token-legacy-v2-rule is present, so recall succeeded. It's at rank 3, so a two-chunk context window would still omit the answer.
```python
from __future__ import annotations

from dataclasses import dataclass
from math import log2

@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    document_id: str
    parent_id: str
    version: str
    permitted: bool
    current: bool
    first_stage_rank: int
    text: str

QUERY = (
    "legacy token endpoint for service account during 10 day migration "
    "with audit logging enabled"
)
TARGET_ID = "api-token-legacy-v2-rule"
CANDIDATES = [
    Candidate(
        "api-token-troubleshooting-v1",
        "api-token-troubleshooting",
        "api-token-troubleshooting-v1",
        "api-token-troubleshooting/2026-04-20",
        True,
        True,
        1,
        (
            "Legacy token endpoint errors can be inspected during migration. "
            "This note does not authorize temporary access."
        ),
    ),
    Candidate(
        "api-password-reset-v1",
        "api-password-reset",
        "api-password-reset-v1",
        "api-password-reset/2026-01-03",
        True,
        True,
        2,
        "Password reset tokens expire after 30 minutes.",
    ),
    Candidate(
        "api-token-legacy-v2-rule",
        "api-auth",
        "api-auth-v2",
        "api-auth/2026-04-01",
        True,
        True,
        3,
        (
            "Rule AUTH-14. Service accounts may use the legacy token endpoint "
            "within 14 days of deprecation when audit logging is enabled."
        ),
    ),
    Candidate(
        "api-audit-export-v1",
        "api-audit",
        "api-audit-v1",
        "api-audit/2026-03-12",
        True,
        True,
        4,
        "Audit logs can be exported within 14 days.",
    ),
]

first_stage = sorted(CANDIDATES, key=lambda candidate: candidate.first_stage_rank)
first_stage_ids = [candidate.chunk_id for candidate in first_stage]
print("Hybrid order:", first_stage_ids)
assert TARGET_ID in first_stage_ids
assert first_stage_ids.index(TARGET_ID) + 1 == 3
```

```text
Hybrid order: ['api-token-troubleshooting-v1', 'api-password-reset-v1', 'api-token-legacy-v2-rule', 'api-audit-export-v1']
```

The candidate record keeps document_id and parent_id from the earlier RAG pipeline. Reranking changes evidence order, not the source identity that citation packing needs later.
The store also contains records Luna may not use: an admin-only rule and an expired policy revision. A regression test should prove that neither enters reranking, even though either might be attractive for the query.
```python
SOURCE_STORE = CANDIDATES + [
    Candidate(
        "admin-token-legacy",
        "admin-token-terms",
        "admin-token-terms",
        "admin-token/2026-05-01",
        False,
        True,
        0,
        "Admin service accounts receive immediate legacy token access.",
    ),
    Candidate(
        "api-token-legacy-v1-rule",
        "api-auth",
        "api-auth-v1",
        "api-auth/2025-02-01",
        True,
        False,
        0,
        "Service accounts may use the legacy token endpoint within 30 days.",
    ),
]
FIRST_STAGE_ID_SET = set(first_stage_ids)

def rerankable_candidates(store: list[Candidate]) -> list[Candidate]:
    return [
        candidate
        for candidate in store
        if (
            candidate.permitted
            and candidate.current
            and candidate.chunk_id in FIRST_STAGE_ID_SET
        )
    ]

rerankable = rerankable_candidates(SOURCE_STORE)
blocked_ids = sorted(
    candidate.chunk_id
    for candidate in SOURCE_STORE
    if candidate not in rerankable
)
print("Rerankable:", [candidate.chunk_id for candidate in rerankable])
print("Blocked:", blocked_ids)
assert rerankable == CANDIDATES
assert blocked_ids == ["admin-token-legacy", "api-token-legacy-v1-rule"]
```

```text
Rerankable: ['api-token-troubleshooting-v1', 'api-password-reset-v1', 'api-token-legacy-v2-rule', 'api-audit-export-v1']
Blocked: ['admin-token-legacy', 'api-token-legacy-v1-rule']
```

The allowlist is the first-stage chunk-ID set, not a fresh corpus scan. Stable IDs keep candidate membership explicit even if records are loaded from another store layer.
With no reranker, the generator receives the first two hybrid candidates. That context contains a troubleshooting note that explicitly says it does not authorize temporary access, plus a password-reset policy. The correct legacy-token rule is present in the retrieval results but absent from the context.
```python
CONTEXT_BUDGET = 2

def pack_context(candidates: list[Candidate], budget: int) -> list[str]:
    return [candidate.chunk_id for candidate in candidates[:budget]]

context_before = pack_context(first_stage, CONTEXT_BUDGET)
print("Context before rerank:", context_before)
print("Target reaches generation:", TARGET_ID in context_before)
assert TARGET_ID not in context_before
```

```text
Context before rerank: ['api-token-troubleshooting-v1', 'api-password-reset-v1']
Target reaches generation: False
```

A bi-encoder encodes a query and each chunk separately, then compares stored document vectors to a query vector. Separately encoded representations make large-scale retrieval practical because document embeddings can be indexed before a request arrives.[1]
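The "encode separately, compare later" pattern can be sketched with a toy encoder. This is a minimal illustration that assumes a bag-of-words `encode` function standing in for a trained embedding model. `encode`, `cosine`, `INDEX`, and `search` are hypothetical names, and the scores say nothing about real model quality.

```python
from __future__ import annotations

import re
from collections import Counter
from math import sqrt


def encode(text: str) -> Counter[str]:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(left: Counter[str], right: Counter[str]) -> float:
    dot = sum(count * right[token] for token, count in left.items())
    norm = sqrt(sum(c * c for c in left.values())) * sqrt(
        sum(c * c for c in right.values())
    )
    return dot / norm if norm else 0.0


# Index time: chunk vectors are computed once, before any query arrives.
CHUNKS = {
    "rule": (
        "Rule AUTH-14. Service accounts may use the legacy token endpoint "
        "within 14 days of deprecation when audit logging is enabled."
    ),
    "reset": "Password reset tokens expire after 30 minutes.",
}
INDEX = {chunk_id: encode(text) for chunk_id, text in CHUNKS.items()}


def search(query: str) -> list[tuple[str, float]]:
    # Request time: only the query is encoded; stored chunk vectors are reused.
    query_vector = encode(query)
    scored = [
        (chunk_id, cosine(query_vector, vector))
        for chunk_id, vector in INDEX.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


print(search("legacy token endpoint for service account"))
```

The query never sees the chunk text during encoding, which is what keeps the index reusable and also what limits how precisely a bi-encoder can judge a specific query-chunk pair.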
A cross-encoder receives the query and one candidate chunk together and returns a relevance score for that pair. Passage reranking with BERT applies this joint scoring only after an initial retrieval stage, because request-time scoring for each pair is more expensive than searching a prebuilt index.[2]
That distinction matters here. The troubleshooting note shares "legacy token endpoint" and "migration" with Luna's query, but it also states that it does not authorize temporary access. The legacy-token rule must match the requested endpoint, temporary-access remedy, migration window, and audit condition together.
The next fixture is deliberately transparent. It isn't a trained transformer, and it doesn't claim real model accuracy. It makes the pairwise requirements inspectable before you plug in a model endpoint.
```python
@dataclass(frozen=True)
class Request:
    endpoint: str
    remedy: str
    days_since_deprecation: int
    audit_enabled: bool

REQUEST = Request(
    endpoint="legacy token endpoint",
    remedy="temporary access",
    days_since_deprecation=10,
    audit_enabled=True,
)
requirements = [
    REQUEST.endpoint,
    REQUEST.remedy,
    "migration window covers 10 days",
    "audit logging is enabled",
]
print("Pairwise requirements:", requirements)
assert REQUEST.days_since_deprecation <= 14
assert REQUEST.audit_enabled
```

```text
Pairwise requirements: ['legacy token endpoint', 'temporary access', 'migration window covers 10 days', 'audit logging is enabled']
```

The score below rewards a chunk only when its policy can support the developer's constraints. It also applies an explicit contradiction penalty. A learned cross-encoder would learn a scoring function from labeled query-passage pairs; the fixture gives the lab a stable expected result.
```python
@dataclass(frozen=True)
class PairScore:
    candidate: Candidate
    score: int
    reasons: tuple[str, ...]

class ConstraintAwarePairScorer:
    def score(self, request: Request, candidate: Candidate) -> PairScore:
        text = candidate.text.lower()
        score = 0
        reasons: list[str] = []

        if request.endpoint in text:
            score += 2
            reasons.append("endpoint")
        if request.remedy in text:
            score += 3
            reasons.append("remedy")
        if "service accounts" in text:
            score += 1
            reasons.append("principal")
        if "within 14 days" in text and request.days_since_deprecation <= 14:
            score += 2
            reasons.append("migration-window")
        if "audit logging is enabled" in text and request.audit_enabled:
            score += 2
            reasons.append("audit-condition")
        if "does not authorize temporary access" in text:
            score -= 6
            reasons.append("contradiction")
        return PairScore(candidate, score, tuple(reasons))

def rerank(
    request: Request,
    candidates: list[Candidate],
    scorer: ConstraintAwarePairScorer,
) -> list[PairScore]:
    scored = [scorer.score(request, candidate) for candidate in candidates]
    return sorted(
        scored,
        key=lambda result: (result.score, -result.candidate.first_stage_rank),
        reverse=True,
    )

reranked = rerank(REQUEST, rerankable, ConstraintAwarePairScorer())
reranked_ids = [result.candidate.chunk_id for result in reranked]
for result in reranked:
    print(result.candidate.chunk_id, result.score, result.reasons)
assert reranked_ids[0] == TARGET_ID
assert rerank(REQUEST, [], ConstraintAwarePairScorer()) == []
```

```text
api-token-legacy-v2-rule 7 ('endpoint', 'principal', 'migration-window', 'audit-condition')
api-audit-export-v1 2 ('migration-window',)
api-password-reset-v1 0 ()
api-token-troubleshooting-v1 -1 ('endpoint', 'remedy', 'contradiction')
```

Now context selection changes for the right reason: the candidate set stays fixed while the supported rule moves to rank 1. A context budget is a maximum, not a quota, so low-scoring near matches aren't admitted merely to fill space.
```python
reranked_candidates = [result.candidate for result in reranked]
MIN_CONTEXT_SCORE = 5
selected_after = [
    result.candidate
    for result in reranked
    if result.score >= MIN_CONTEXT_SCORE
][:CONTEXT_BUDGET]
context_after = [candidate.chunk_id for candidate in selected_after]
print("Context before:", context_before)
print("Context after:", context_after)
assert TARGET_ID in context_after
assert "api-token-troubleshooting-v1" not in context_after
assert set(reranked_ids) == set(first_stage_ids)
```

```text
Context before: ['api-token-troubleshooting-v1', 'api-password-reset-v1']
Context after: ['api-token-legacy-v2-rule']
```

For this stage, keep evaluation narrow:
- MRR (Mean Reciprocal Rank) averages how early the first relevant chunk appears across a release suite.
- NDCG@context_k (Normalized Discounted Cumulative Gain) checks whether relevant chunks fit near the top of the context budget.[3]

For this one request, reciprocal rank (RR) is easy to read: rank 3 gives 1 / 3; rank 1 gives 1. MRR is the mean of those per-request values. NDCG supports graded relevance; this lab uses binary relevance, where a relevant chunk has rel_i = 1 and an irrelevant chunk has rel_i = 0.
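Written out, with rank_q the position of the first relevant chunk for request q in a suite Q (and RR_q = 0 when no relevant chunk is returned), the metrics take the following form. The DCG shown is the exponential-gain variant, which matches the `2 ** rel - 1` term in this lab's code; a linear-gain variant also exists and gives the same values under binary relevance.

```latex
\begin{aligned}
\mathrm{RR}_q &= \frac{1}{\mathrm{rank}_q}, &
\mathrm{MRR} &= \frac{1}{|Q|} \sum_{q \in Q} \mathrm{RR}_q, \\
\mathrm{DCG@}k &= \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, &
\mathrm{NDCG@}k &= \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}.
\end{aligned}
```

Here IDCG@k is the DCG@k of the ideal ordering. For this request, k = 2 and there is one relevant chunk, so IDCG@2 = 1 / log2(2) = 1. The hybrid order places the relevant chunk at rank 3, outside the top two, so DCG@2 = 0 and NDCG@2 = 0. With the relevant chunk at rank 1, DCG@2 = 1 and NDCG@2 = 1.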
```python
RELEVANT_IDS = {TARGET_ID}

def reciprocal_rank(ids: list[str], relevant_ids: set[str]) -> float:
    for rank, chunk_id in enumerate(ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ids: list[str], relevant_ids: set[str], k: int) -> float:
    dcg = sum(
        (2 ** int(chunk_id in relevant_ids) - 1) / log2(rank + 1)
        for rank, chunk_id in enumerate(ids[:k], start=1)
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum((2**1 - 1) / log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

before_rr = reciprocal_rank(first_stage_ids, RELEVANT_IDS)
after_rr = reciprocal_rank(reranked_ids, RELEVANT_IDS)
before_ndcg = ndcg_at_k(first_stage_ids, RELEVANT_IDS, CONTEXT_BUDGET)
after_ndcg = ndcg_at_k(reranked_ids, RELEVANT_IDS, CONTEXT_BUDGET)
print(f"RR: {before_rr:.2f} -> {after_rr:.2f}")
print(f"NDCG@{CONTEXT_BUDGET}: {before_ndcg:.2f} -> {after_ndcg:.2f}")
assert after_rr > before_rr
assert before_ndcg == 0.0 and after_ndcg == 1.0
```

```text
RR: 0.33 -> 1.00
NDCG@2: 0.00 -> 1.00
```

Don't confuse improved evidence ordering with a correct generated answer. The next lesson will evaluate faithfulness and citation agreement after context reaches the generator.
If the reranker receives only the first two candidates, it can't select api-token-legacy-v2-rule: the rule was cut before pair scoring. Use a candidate-recall gate before comparing reranker models.
```python
def target_present(candidates: list[Candidate]) -> bool:
    return TARGET_ID in [candidate.chunk_id for candidate in candidates]

RERANK_CANDIDATE_BUDGET = 3
too_small = first_stage[:2]
release_input = first_stage[:RERANK_CANDIDATE_BUDGET]
limited_ids = [result.candidate.chunk_id for result in rerank(
    REQUEST, too_small, ConstraintAwarePairScorer()
)]
release_reranked = rerank(REQUEST, release_input, ConstraintAwarePairScorer())
release_reranked_ids = [result.candidate.chunk_id for result in release_reranked]
print("top-2 includes target:", target_present(too_small), limited_ids)
print("top-3 includes target:", target_present(release_input), release_reranked_ids)
assert not target_present(too_small)
assert release_reranked_ids[0] == TARGET_ID
```

```text
top-2 includes target: False ['api-password-reset-v1', 'api-token-troubleshooting-v1']
top-3 includes target: True ['api-token-legacy-v2-rule', 'api-password-reset-v1', 'api-token-troubleshooting-v1']
```

A cross-encoder scores every query-candidate pair online. This fixture uses a deliberately sequential cost model so the candidate-count tradeoff stays visible. It isn't a p95 estimator or a hardware benchmark: production serving may batch pairs, and percentiles must be measured end to end.
```python
SEQUENTIAL_PAIR_COST_MS = 7.5
REQUEST_OVERHEAD_MS = 4.0
FIXTURE_LATENCY_BUDGET_MS = 30.0

def estimated_fixture_latency_ms(candidate_count: int) -> float:
    return REQUEST_OVERHEAD_MS + SEQUENTIAL_PAIR_COST_MS * candidate_count

release_fixture_latency = estimated_fixture_latency_ms(len(release_input))
too_wide_fixture_latency = estimated_fixture_latency_ms(len(first_stage))
print(
    f"top-3 fixture latency: {release_fixture_latency:.1f} ms, "
    f"pass={release_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS}"
)
print(
    f"top-4 fixture latency: {too_wide_fixture_latency:.1f} ms, "
    f"pass={too_wide_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS}"
)
assert release_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS
assert too_wide_fixture_latency > FIXTURE_LATENCY_BUDGET_MS
```

```text
top-3 fixture latency: 26.5 ms, pass=True
top-4 fixture latency: 34.0 ms, pass=False
```

The lab result is intentionally specific: retrieve at least three candidates, rerank three, and pass at most two chunks to generation. A real service chooses these numbers from held-out traces and measured end-to-end p95 latency by candidate count, input length, and batch strategy, not from a generic default.
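Choosing the candidate count from measurements can be sketched as a small selection over held-out results. This is a minimal illustration under stated assumptions: `BudgetMeasurement` and `choose_candidate_budget` are hypothetical names, and the numbers in `MEASURED` are invented for the example rather than taken from any real trace.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetMeasurement:
    candidate_k: int
    recall_at_k: float     # share of held-out requests whose gold chunk is in the top k
    p95_latency_ms: float  # measured end to end, not estimated


def choose_candidate_budget(
    measurements: list[BudgetMeasurement],
    min_recall: float,
    max_p95_ms: float,
) -> int | None:
    """Return the smallest k that meets both gates, or None if no k qualifies."""
    passing = [
        m.candidate_k
        for m in measurements
        if m.recall_at_k >= min_recall and m.p95_latency_ms <= max_p95_ms
    ]
    return min(passing) if passing else None


# Illustrative numbers only; real values come from held-out traces.
MEASURED = [
    BudgetMeasurement(2, 0.71, 19.0),
    BudgetMeasurement(3, 0.94, 26.5),
    BudgetMeasurement(4, 0.97, 34.0),
]
print(choose_candidate_budget(MEASURED, min_recall=0.90, max_p95_ms=30.0))  # prints 3
```

Returning `None` when no count satisfies both gates keeps the decision explicit: the fix is then in retrieval or serving, not in quietly relaxing a threshold.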
Three ranking designs differ in when the query and chunk can interact:
| Design | Stored before request | Request-time work | Role in this pipeline |
|---|---|---|---|
| Bi-encoder retrieval | One embedding per chunk | Compare query vector to indexed chunk vectors | Build a broad candidate set |
| Cross-encoder reranking | No query-specific pair score | Jointly score each retrieved query-chunk pair | Precision stage here |
| ColBERT late interaction | Contextual token vectors per chunk | Match query token vectors against chunk token vectors | Alternative to evaluate under a tighter latency budget |
ColBERT keeps document-side representations indexable while applying token-level late interaction at request time.[4] For each query token, MaxSim keeps its highest similarity to any document token, then sums those per-token maxima into a relevance score. It isn't an automatic upgrade. Compare it against a cross-encoder on the same held-out candidate lists, ordering metric, memory cost, and p95 latency.
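The MaxSim step described above can be sketched in plain Python. This is a minimal illustration that assumes cosine similarity between token vectors and uses hand-made 2-D vectors; it is not the ColBERT implementation, and `maxsim_score` is a hypothetical name.

```python
from math import sqrt


def cosine(left: list[float], right: list[float]) -> float:
    dot = sum(a * b for a, b in zip(left, right))
    norm = sqrt(sum(a * a for a in left)) * sqrt(sum(b * b for b in right))
    return dot / norm if norm else 0.0


def maxsim_score(
    query_tokens: list[list[float]],
    doc_tokens: list[list[float]],
) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    if not doc_tokens:
        return 0.0
    return sum(
        max(cosine(query_vec, doc_vec) for doc_vec in doc_tokens)
        for query_vec in query_tokens
    )


# Hand-made 2-D token vectors, for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # covers only the first query token
print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

The document token vectors can be stored ahead of time, so only the query tokens and the max-and-sum step are computed per request.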
Shipping a reranker without evidence-level traces makes failures ambiguous. Record first-stage IDs, actual reranker input, candidate source identity, policy versions, pair scores, model versions, selected context IDs, and release gates. A document version is part of correctness: a well-ranked stale rule is still unsafe context.
```python
SCORER_VERSION = "fixture-cross-encoder-v1"
selected_context = [
    result.candidate
    for result in release_reranked
    if result.score >= MIN_CONTEXT_SCORE
][:CONTEXT_BUDGET]
release_trace = {
    "query_id": "api-token-legacy-access-001",
    "versions": {
        "retriever": "policy-retriever-v2",
        "index": "policy-index/2026-05-27",
        "sparse": "bm25-tokenizer-v1",
        "dense": "fixture-embeddings-v1",
        "fusion": "rrf-k60",
        "reranker": SCORER_VERSION,
    },
    "first_stage_ids": first_stage_ids,
    "rerank_input_ids": [candidate.chunk_id for candidate in release_input],
    "reranked_ids": release_reranked_ids,
    "candidate_records": [
        {
            "chunk_id": result.candidate.chunk_id,
            "document_id": result.candidate.document_id,
            "parent_id": result.candidate.parent_id,
            "version": result.candidate.version,
            "first_stage_rank": result.candidate.first_stage_rank,
            "rerank_score": result.score,
            "reasons": result.reasons,
        }
        for result in release_reranked
    ],
    "selected_context_ids": [candidate.chunk_id for candidate in selected_context],
    "selected_versions": [candidate.version for candidate in selected_context],
    "gates": {
        "permitted_and_current": all(
            candidate.permitted and candidate.current for candidate in release_input
        ),
        "target_in_context": TARGET_ID in [
            candidate.chunk_id for candidate in selected_context
        ],
        "ordering_lift": after_ndcg > before_ndcg,
        "latency_budget": release_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS,
    },
}
stores_raw_policy_text = any(
    candidate.text in str(release_trace)
    for candidate in SOURCE_STORE
)
print("Versions:", release_trace["versions"])
print("Rerank input:", release_trace["rerank_input_ids"])
print("Selected context:", release_trace["selected_context_ids"])
print("Trace stores raw policy text:", stores_raw_policy_text)
print("Gates:", release_trace["gates"])
assert set(release_trace["reranked_ids"]) == set(release_trace["rerank_input_ids"])
assert not stores_raw_policy_text
assert all(release_trace["gates"].values())
```

```text
Versions: {'retriever': 'policy-retriever-v2', 'index': 'policy-index/2026-05-27', 'sparse': 'bm25-tokenizer-v1', 'dense': 'fixture-embeddings-v1', 'fusion': 'rrf-k60', 'reranker': 'fixture-cross-encoder-v1'}
Rerank input: ['api-token-troubleshooting-v1', 'api-password-reset-v1', 'api-token-legacy-v2-rule']
Selected context: ['api-token-legacy-v2-rule']
Trace stores raw policy text: False
Gates: {'permitted_and_current': True, 'target_in_context': True, 'ordering_lift': True, 'latency_budget': True}
```

This trace proves that reranking improved the evidence presented to generation. It doesn't yet prove that an answer states the rule faithfully or cites it correctly. That's exactly the boundary for the next lesson.
Before releasing a real reranker:
| Gate | Evidence to log | Failure response |
|---|---|---|
| Candidate recall | Gold chunk ID in first-stage top k | Fix retrieval or raise measured candidate budget |
| Authorization and freshness | ACL decision, chunk version, effective date | Reject request context; never score blocked evidence |
| Ordering lift | Before/after MRR or NDCG on held-out traces | Retrain, replace, or remove reranker |
| Serving budget | Candidate count, token length, model version, p95 latency | Batch, cap input, or choose a tested alternative |
| Downstream grounding | Selected chunk IDs and generated citations | Evaluate in the next pipeline stage |
One attractive mistake is caching a reranker result without its source version. If api-token-legacy-v2-rule changes, a cached score for an older chunk can keep stale evidence at rank 1. Key caches and traces by chunk checksum or policy version, then rerun golden traces after source updates.
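A version-aware cache key might look like the sketch below. It assumes a SHA-256 checksum of the chunk text plus explicit policy and scorer versions; `rerank_cache_key` and its parameters are hypothetical names, not part of the lab code.

```python
import hashlib


def rerank_cache_key(
    query: str,
    chunk_id: str,
    chunk_text: str,
    policy_version: str,
    scorer_version: str,
) -> str:
    """Hypothetical cache key: a change to source text or any version yields a new key."""
    chunk_checksum = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    parts = [query, chunk_id, chunk_checksum, policy_version, scorer_version]
    # Join with a unit-separator character, then hash to a fixed-length key.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


old_key = rerank_cache_key(
    "legacy token endpoint access",
    "api-token-legacy-v2-rule",
    "Service accounts may use the legacy token endpoint within 14 days.",
    "api-auth/2026-04-01",
    "fixture-cross-encoder-v1",
)
new_key = rerank_cache_key(
    "legacy token endpoint access",
    "api-token-legacy-v2-rule",
    "Service accounts may use the legacy token endpoint within 7 days.",  # edited rule text
    "api-auth/2026-04-01",
    "fixture-cross-encoder-v1",
)
print(old_key != new_key)  # True: the edited text cannot reuse the old cached score
```

Including the checksum covers the case where the text changes but the version label is not bumped, so a stale score cannot be served under an unchanged chunk ID.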
You're ready to add a reranker to a production RAG pipeline when you can:
| Level | Evidence in submission |
|---|---|
| Foundational | Preserves authorization and freshness filtering before pair scoring. |
| Applied | Reranks only first-stage candidates and proves ordering lift with MRR or NDCG. |
| Strong | Chooses candidate count using candidate recall and measured end-to-end latency. |
| Production-ready | Emits versioned evidence traces for downstream faithfulness and citation evaluation. |
Check Recall@candidate_k first, then evaluate ordering only on candidate sets that contain the answer.
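The recall-first evaluation order can be sketched over a small suite. This is a minimal illustration that assumes one gold chunk per request; `evaluate_suite` and the request IDs are hypothetical.

```python
from statistics import mean


def reciprocal_rank(ids: list[str], gold_id: str) -> float:
    return 1.0 / (ids.index(gold_id) + 1) if gold_id in ids else 0.0


def evaluate_suite(
    first_stage: dict[str, list[str]],
    reranked: dict[str, list[str]],
    gold: dict[str, str],
    candidate_k: int,
) -> dict[str, float]:
    """Report candidate recall, then MRR only over requests whose gold chunk was retrieved."""
    covered = [
        query_id
        for query_id, ids in first_stage.items()
        if gold[query_id] in ids[:candidate_k]
    ]
    recall = len(covered) / len(first_stage) if first_stage else 0.0
    mrr = (
        mean(reciprocal_rank(reranked[query_id], gold[query_id]) for query_id in covered)
        if covered
        else 0.0
    )
    return {"recall_at_k": recall, "mrr_on_covered": mrr}


# Two illustrative requests: q2's gold chunk never reached the candidate set.
first_stage = {"q1": ["a", "b", "gold1"], "q2": ["x", "y", "z"]}
reranked = {"q1": ["gold1", "a", "b"], "q2": ["y", "x", "z"]}
gold = {"q1": "gold1", "q2": "gold2"}
print(evaluate_suite(first_stage, reranked, gold, candidate_k=3))
```

Reporting the two numbers separately keeps a retrieval miss (q2) from being misread as a reranker failure.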
[1] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

[2] Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv preprint.

[3] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

[4] Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.