Reranking and Cross-Encoders for RAG

Turn a permission-safe hybrid candidate list into precise context using cross-encoder reasoning, ordering metrics, latency gates, and traceable evidence selection.

The hybrid-search lesson built policy-answerer-v2 for retrieval-augmented generation (RAG): only current, permitted policy chunks enter retrieval, then sparse and dense ranks are fused. That fixed recall. It didn't guarantee that the best chunk fits into a small generator context window.

policy-answerer-v3 adds the missing reranking step. Luna is answering a developer's question:

Can a service account use the legacy token endpoint for 10 more days if audit logging is enabled?

Hybrid retrieval has already found the current rule api-token-legacy-v2-rule, but a nearby troubleshooting note appears above it. You'll add a reranking stage that reorders only the permitted candidates, admits supported evidence into a two-chunk maximum context budget, and emits a release trace for evaluation.

Figure: Reranking keeps candidate membership fixed, moves the supported legacy-token rule from rank three to rank one, and never scores blocked or stale records; the two-chunk context budget then admits only supported evidence.

The boundary: retrieve candidates, then improve their order

A two-stage retriever separates two problems:

Stage	Question	Optimized signal	Must never change
Hybrid retrieval	Is useful evidence in the candidate set?	Recall over current permitted chunks	Authorization and freshness boundary
Reranking	Which retrieved chunks best answer this query?	Precision near the top	Candidate membership
Generation	What answer can be supported?	Grounded response with citations	Selected evidence only

The reranker can't restore a missing document. It must stay inside the same permission boundary: never search a restricted or superseded document as a shortcut.

The runtime contract is compact: accept the permitted hybrid candidates, score each query-chunk pair, admit only score-gated top context, and write an evidence-level trace.

The lab starts from a small fixture representing the hybrid output from the previous lesson. api-token-legacy-v2-rule is present, so recall succeeded. It's at rank 3, so a two-chunk context window would still omit the answer.

permitted-hybrid-candidates.py

from __future__ import annotations

from dataclasses import dataclass
from math import log2

@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    document_id: str
    parent_id: str
    version: str
    permitted: bool
    current: bool
    first_stage_rank: int
    text: str

QUERY = (
    "legacy token endpoint for service account during 10 day migration "
    "with audit logging enabled"
)
TARGET_ID = "api-token-legacy-v2-rule"
CANDIDATES = [
    Candidate(
        "api-token-troubleshooting-v1",
        "api-token-troubleshooting",
        "api-token-troubleshooting-v1",
        "api-token-troubleshooting/2026-04-20",
        True,
        True,
        1,
        (
            "Legacy token endpoint errors can be inspected during migration. "
            "This note does not authorize temporary access."
        ),
    ),
    Candidate(
        "api-password-reset-v1",
        "api-password-reset",
        "api-password-reset-v1",
        "api-password-reset/2026-01-03",
        True,
        True,
        2,
        "Password reset tokens expire after 30 minutes.",
    ),
    Candidate(
        "api-token-legacy-v2-rule",
        "api-auth",
        "api-auth-v2",
        "api-auth/2026-04-01",
        True,
        True,
        3,
        (
            "Rule AUTH-14. Service accounts may use the legacy token endpoint "
            "within 14 days of deprecation when audit logging is enabled."
        ),
    ),
    Candidate(
        "api-audit-export-v1",
        "api-audit",
        "api-audit-v1",
        "api-audit/2026-03-12",
        True,
        True,
        4,
        "Audit logs can be exported within 14 days.",
    ),
]

first_stage = sorted(CANDIDATES, key=lambda candidate: candidate.first_stage_rank)
first_stage_ids = [candidate.chunk_id for candidate in first_stage]
print("Hybrid order:", first_stage_ids)
assert TARGET_ID in first_stage_ids
assert first_stage_ids.index(TARGET_ID) + 1 == 3

Output

Hybrid order: ['api-token-troubleshooting-v1', 'api-password-reset-v1', 'api-token-legacy-v2-rule', 'api-audit-export-v1']

The candidate record keeps document_id and parent_id from the earlier RAG pipeline. Reranking changes evidence order, not the source identity that citation packing needs later.

Keep the evidence boundary executable

The store also contains records Luna may not use: an admin-only rule and an expired policy revision. A regression test should prove that neither enters reranking, even though either might be attractive for the query.

reranker-boundary-regression.py

SOURCE_STORE = CANDIDATES + [
    Candidate(
        "admin-token-legacy",
        "admin-token-terms",
        "admin-token-terms",
        "admin-token/2026-05-01",
        False,
        True,
        0,
        "Admin service accounts receive immediate legacy token access.",
    ),
    Candidate(
        "api-token-legacy-v1-rule",
        "api-auth",
        "api-auth-v1",
        "api-auth/2025-02-01",
        True,
        False,
        0,
        "Service accounts may use the legacy token endpoint within 30 days.",
    ),
]
FIRST_STAGE_ID_SET = set(first_stage_ids)

def rerankable_candidates(store: list[Candidate]) -> list[Candidate]:
    return [
        candidate
        for candidate in store
        if (
            candidate.permitted
            and candidate.current
            and candidate.chunk_id in FIRST_STAGE_ID_SET
        )
    ]

rerankable = rerankable_candidates(SOURCE_STORE)
blocked_ids = sorted(
    candidate.chunk_id
    for candidate in SOURCE_STORE
    if candidate not in rerankable
)
print("Rerankable:", [candidate.chunk_id for candidate in rerankable])
print("Blocked:", blocked_ids)
assert rerankable == CANDIDATES
assert blocked_ids == ["admin-token-legacy", "api-token-legacy-v1-rule"]

Output

Rerankable: ['api-token-troubleshooting-v1', 'api-password-reset-v1', 'api-token-legacy-v2-rule', 'api-audit-export-v1']
Blocked: ['admin-token-legacy', 'api-token-legacy-v1-rule']

The allowlist is the first-stage chunk-ID set, not a fresh corpus scan. Stable IDs keep candidate membership explicit even if records are loaded from another store layer.

With no reranker, the generator receives the first two hybrid candidates. That context contains a troubleshooting note that explicitly says it does not authorize temporary access, plus a password-reset policy. The correct legacy-token rule is present in the retrieval results but absent from the context.

context-before-reranking.py

CONTEXT_BUDGET = 2

def pack_context(candidates: list[Candidate], budget: int) -> list[str]:
    return [candidate.chunk_id for candidate in candidates[:budget]]

context_before = pack_context(first_stage, CONTEXT_BUDGET)
print("Context before rerank:", context_before)
print("Target reaches generation:", TARGET_ID in context_before)
assert TARGET_ID not in context_before

Output

Context before rerank: ['api-token-troubleshooting-v1', 'api-password-reset-v1']
Target reaches generation: False

Why a cross-encoder helps

A bi-encoder encodes a query and each chunk separately, then compares stored document vectors to a query vector. Separately encoded representations make large-scale retrieval practical because document embeddings can be indexed before a request arrives.^[1]
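The index-then-compare pattern can be sketched without a model. The block below is a minimal sketch, assuming a toy bag-of-words count vector in place of a learned embedding; `embed`, `cosine`, `CHUNK_INDEX`, and `search` are hypothetical names, and the scores say nothing about real bi-encoder quality. The chunk texts are copied from the lab fixture.

```python
from __future__ import annotations

import re
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    # Assumption: a bag-of-words count vector stands in for a learned embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


CHUNK_TEXTS = {
    "api-token-troubleshooting-v1": (
        "Legacy token endpoint errors can be inspected during migration. "
        "This note does not authorize temporary access."
    ),
    "api-password-reset-v1": "Password reset tokens expire after 30 minutes.",
    "api-token-legacy-v2-rule": (
        "Rule AUTH-14. Service accounts may use the legacy token endpoint "
        "within 14 days of deprecation when audit logging is enabled."
    ),
}

# Offline step: chunk vectors are computed once, before any request arrives.
CHUNK_INDEX = {chunk_id: embed(text) for chunk_id, text in CHUNK_TEXTS.items()}


def search(query: str, k: int) -> list[tuple[str, float]]:
    query_vector = embed(query)  # the only encoding done at request time
    scored = [
        (chunk_id, cosine(query_vector, chunk_vector))
        for chunk_id, chunk_vector in CHUNK_INDEX.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


hits = search(
    "legacy token endpoint for service account during 10 day migration "
    "with audit logging enabled",
    k=3,
)
for chunk_id, score in hits:
    print(f"{chunk_id}: {score:.3f}")
```

Nothing in this scoring path reads the query and a chunk together: each side is reduced to a vector first, and only the vectors meet.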

A cross-encoder receives the query and one candidate chunk together and returns a relevance score for that pair. Passage reranking with BERT applies this joint scoring only after an initial retrieval stage, because request-time scoring for each pair is more expensive than searching a prebuilt index.^[2]

That distinction matters here. The troubleshooting note shares legacy token endpoint, migration, and temporary access with Luna's request, but it also says it does not authorize temporary access. The legacy-token rule has to match the requested endpoint, the temporary-access remedy, the migration window, and the audit condition together.

The next fixture is deliberately transparent. It isn't a trained transformer, and it doesn't claim real model accuracy. It makes the pairwise requirements inspectable before you plug in a model endpoint.

inspect-pairwise-requirements.py

@dataclass(frozen=True)
class Request:
    endpoint: str
    remedy: str
    days_since_deprecation: int
    audit_enabled: bool

REQUEST = Request(
    endpoint="legacy token endpoint",
    remedy="temporary access",
    days_since_deprecation=10,
    audit_enabled=True,
)
requirements = [
    REQUEST.endpoint,
    REQUEST.remedy,
    "migration window covers 10 days",
    "audit logging is enabled",
]
print("Pairwise requirements:", requirements)
assert REQUEST.days_since_deprecation <= 14
assert REQUEST.audit_enabled

Output

Pairwise requirements: ['legacy token endpoint', 'temporary access', 'migration window covers 10 days', 'audit logging is enabled']

The score below rewards a chunk only when its policy can support the developer's constraints. It also applies an explicit contradiction penalty. A learned cross-encoder would learn a scoring function from labeled query-passage pairs; the fixture gives the lab a stable expected result.

pairwise-reranker.py

@dataclass(frozen=True)
class PairScore:
    candidate: Candidate
    score: int
    reasons: tuple[str, ...]

class ConstraintAwarePairScorer:
    def score(self, request: Request, candidate: Candidate) -> PairScore:
        text = candidate.text.lower()
        score = 0
        reasons: list[str] = []

        if request.endpoint in text:
            score += 2
            reasons.append("endpoint")
        if request.remedy in text:
            score += 3
            reasons.append("remedy")
        if "service accounts" in text:
            score += 1
            reasons.append("principal")
        if "within 14 days" in text and request.days_since_deprecation <= 14:
            score += 2
            reasons.append("migration-window")
        if "audit logging is enabled" in text and request.audit_enabled:
            score += 2
            reasons.append("audit-condition")
        if "does not authorize temporary access" in text:
            score -= 6
            reasons.append("contradiction")
        return PairScore(candidate, score, tuple(reasons))

def rerank(
    request: Request,
    candidates: list[Candidate],
    scorer: ConstraintAwarePairScorer,
) -> list[PairScore]:
    scored = [scorer.score(request, candidate) for candidate in candidates]
    return sorted(
        scored,
        key=lambda result: (result.score, -result.candidate.first_stage_rank),
        reverse=True,
    )

reranked = rerank(REQUEST, rerankable, ConstraintAwarePairScorer())
reranked_ids = [result.candidate.chunk_id for result in reranked]
for result in reranked:
    print(result.candidate.chunk_id, result.score, result.reasons)
assert reranked_ids[0] == TARGET_ID
assert rerank(REQUEST, [], ConstraintAwarePairScorer()) == []

Output

api-token-legacy-v2-rule 7 ('endpoint', 'principal', 'migration-window', 'audit-condition')
api-audit-export-v1 2 ('migration-window',)
api-password-reset-v1 0 ()
api-token-troubleshooting-v1 -1 ('endpoint', 'remedy', 'contradiction')

Now context selection changes for the right reason: the candidate set stays fixed while the supported rule moves to rank 1. A context budget is a maximum, not a quota, so low-scoring near matches aren't admitted merely to fill space.

context-after-reranking.py

reranked_candidates = [result.candidate for result in reranked]
MIN_CONTEXT_SCORE = 5
selected_after = [
    result.candidate
    for result in reranked
    if result.score >= MIN_CONTEXT_SCORE
][:CONTEXT_BUDGET]
context_after = [candidate.chunk_id for candidate in selected_after]
print("Context before:", context_before)
print("Context after:", context_after)
assert TARGET_ID in context_after
assert "api-token-troubleshooting-v1" not in context_after
assert set(reranked_ids) == set(first_stage_ids)

Output

Context before: ['api-token-troubleshooting-v1', 'api-password-reset-v1']
Context after: ['api-token-legacy-v2-rule']
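When a learned model eventually replaces the fixture scorer, one way to keep the selection logic unchanged is to hide the scorer behind a small interface. The following is a sketch under stated assumptions: `PairScorer`, `KeywordOverlapScorer`, and `rerank_ids` are hypothetical names, and the stub scorer only exercises the interface; it is not a trained cross-encoder.

```python
from __future__ import annotations

from typing import Protocol, Sequence


class PairScorer(Protocol):
    """Hypothetical interface: one relevance score per (query, chunk text) pair."""

    def score_pairs(self, query: str, texts: Sequence[str]) -> list[float]: ...


class KeywordOverlapScorer:
    """Stub used only to exercise the interface; not a trained model."""

    def score_pairs(self, query: str, texts: Sequence[str]) -> list[float]:
        query_terms = set(query.lower().split())
        return [float(len(query_terms & set(text.lower().split()))) for text in texts]


def rerank_ids(
    query: str,
    chunks: dict[str, str],
    scorer: PairScorer,
    min_score: float,
    budget: int,
) -> list[str]:
    ids = list(chunks)  # dict order is treated as the first-stage order
    scores = scorer.score_pairs(query, [chunks[chunk_id] for chunk_id in ids])
    # sorted() is stable, so tied scores keep their first-stage order.
    ranked = sorted(zip(ids, scores), key=lambda pair: pair[1], reverse=True)
    # The budget is a maximum: chunks below the threshold are never admitted.
    return [chunk_id for chunk_id, score in ranked if score >= min_score][:budget]


chunks = {
    "a": "legacy token endpoint rule",
    "b": "password reset",
    "c": "legacy endpoint",
}
print(rerank_ids("legacy token endpoint", chunks, KeywordOverlapScorer(), 2.0, 2))
```

A model's scores will not share the fixture's integer scale, so a threshold such as `MIN_CONTEXT_SCORE` would need to be recalibrated on held-out traces for whichever scorer sits behind the interface.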

Measure ordering, not vibes

For this stage, keep evaluation narrow:

Recall@candidate_k checks whether first-stage retrieval gave the reranker a chance.
MRR (Mean Reciprocal Rank) averages how early the first relevant chunk appears across a release suite.
NDCG@context_k (Normalized Discounted Cumulative Gain) checks whether relevant chunks rank near the top, inside the context budget.^[3]

For this one request, reciprocal rank (RR) is easy to read: rank 3 gives 1 / 3; rank 1 gives 1. MRR is the mean of those per-request values. NDCG supports graded relevance; this lab uses binary relevance, where a relevant chunk has rel_i = 1 and an irrelevant chunk has rel_i = 0:

\operatorname{DCG}_k = \sum_{i=1}^{k} \frac{2^{rel_i}-1}{\log_2(i+1)} \qquad \operatorname{NDCG}_k = \frac{\operatorname{DCG}_k}{\operatorname{IDCG}_k}
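The same formula with graded relevance can be checked on a small worked case. This is a minimal sketch, assuming hypothetical grades (3 = states the rule, 2 = partial support, 0 = irrelevant) and assuming the ideal ordering is taken over the same judged list; `dcg_at_k` and `graded_ndcg_at_k` are illustrative names.

```python
from math import log2


def dcg_at_k(gains: list[int], k: int) -> float:
    # Gain 2^rel - 1, discounted by log2(rank + 1), as in the formula above.
    return sum(
        (2 ** rel - 1) / log2(rank + 1)
        for rank, rel in enumerate(gains[:k], start=1)
    )


def graded_ndcg_at_k(gains: list[int], k: int) -> float:
    # Assumption: the ideal list is the same judged gains, sorted best-first.
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal else 0.0


# Hypothetical grades for a ranked list of three chunks.
ranked_gains = [3, 0, 2]
print(f"DCG@3 = {dcg_at_k(ranked_gains, 3):.2f}")
print(f"NDCG@3 = {graded_ndcg_at_k(ranked_gains, 3):.3f}")
```

Here DCG@3 is 7 + 0 + 1.5 = 8.5, and the ideal order [3, 2, 0] gives 7 + 3 / log2(3), roughly 8.89, so NDCG@3 is about 0.96. With binary relevance every gain is either 0 or 1, which is the case the lab uses.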

ordering-gate.py

RELEVANT_IDS = {TARGET_ID}

def reciprocal_rank(ids: list[str], relevant_ids: set[str]) -> float:
    for rank, chunk_id in enumerate(ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ids: list[str], relevant_ids: set[str], k: int) -> float:
    dcg = sum(
        (2 ** int(chunk_id in relevant_ids) - 1) / log2(rank + 1)
        for rank, chunk_id in enumerate(ids[:k], start=1)
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum((2**1 - 1) / log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

before_rr = reciprocal_rank(first_stage_ids, RELEVANT_IDS)
after_rr = reciprocal_rank(reranked_ids, RELEVANT_IDS)
before_ndcg = ndcg_at_k(first_stage_ids, RELEVANT_IDS, CONTEXT_BUDGET)
after_ndcg = ndcg_at_k(reranked_ids, RELEVANT_IDS, CONTEXT_BUDGET)
print(f"RR: {before_rr:.2f} -> {after_rr:.2f}")
print(f"NDCG@{CONTEXT_BUDGET}: {before_ndcg:.2f} -> {after_ndcg:.2f}")
assert after_rr > before_rr
assert before_ndcg == 0.0 and after_ndcg == 1.0

Output

RR: 0.33 -> 1.00
NDCG@2: 0.00 -> 1.00
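The per-request RR extends to a suite-level MRR by averaging. The sketch below assumes a hypothetical three-request release suite with made-up chunk IDs; `mean_reciprocal_rank` and `SUITE` are illustrative names, not part of the lab.

```python
from __future__ import annotations


def reciprocal_rank(ids: list[str], relevant_ids: set[str]) -> float:
    for rank, chunk_id in enumerate(ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(suite: list[tuple[list[str], set[str]]]) -> float:
    if not suite:
        return 0.0
    return sum(reciprocal_rank(ids, relevant) for ids, relevant in suite) / len(suite)


# Hypothetical suite: (ranked chunk IDs, relevant chunk IDs) per request.
SUITE = [
    (["rule-a", "note-a"], {"rule-a"}),           # first relevant at rank 1
    (["note-b", "faq-b", "rule-b"], {"rule-b"}),  # first relevant at rank 3
    (["note-c", "faq-c"], {"rule-c"}),            # relevant chunk never retrieved
]
mrr = mean_reciprocal_rank(SUITE)
print(f"MRR: {mrr:.3f}")
```

The third request contributes 0 because its relevant chunk never reached the ranked list. That is a recall failure, which no reranker can repair, so it helps to report it separately from ordering quality.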

Don't confuse improved evidence ordering with a correct generated answer. The next lesson will evaluate faithfulness and citation agreement after context reaches the generator.

Candidate count sets the ceiling and the bill

If the reranker receives only the first two candidates, it can't select api-token-legacy-v2-rule: the rule was cut before pair scoring. Use a candidate-recall gate before comparing reranker models.

candidate-count-gate.py

def target_present(candidates: list[Candidate]) -> bool:
    return TARGET_ID in [candidate.chunk_id for candidate in candidates]

RERANK_CANDIDATE_BUDGET = 3
too_small = first_stage[:2]
release_input = first_stage[:RERANK_CANDIDATE_BUDGET]
limited_ids = [result.candidate.chunk_id for result in rerank(
    REQUEST, too_small, ConstraintAwarePairScorer()
)]
release_reranked = rerank(REQUEST, release_input, ConstraintAwarePairScorer())
release_reranked_ids = [result.candidate.chunk_id for result in release_reranked]
print("top-2 includes target:", target_present(too_small), limited_ids)
print("top-3 includes target:", target_present(release_input), release_reranked_ids)
assert not target_present(too_small)
assert release_reranked_ids[0] == TARGET_ID

Output

top-2 includes target: False ['api-password-reset-v1', 'api-token-troubleshooting-v1']
top-3 includes target: True ['api-token-legacy-v2-rule', 'api-password-reset-v1', 'api-token-troubleshooting-v1']
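The single-target check generalizes to Recall@candidate_k over a suite. This is a minimal sketch, assuming a hypothetical labeled suite with made-up chunk IDs; `recall_at_k` and `smallest_k_meeting` are illustrative helpers for picking the smallest candidate budget that meets a recall target.

```python
from __future__ import annotations


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def smallest_k_meeting(
    suite: list[tuple[list[str], set[str]]],
    target: float,
    max_k: int,
) -> int | None:
    # Walk candidate budgets upward and stop at the first one that meets the target.
    for k in range(1, max_k + 1):
        mean_recall = sum(recall_at_k(ids, rel, k) for ids, rel in suite) / len(suite)
        if mean_recall >= target:
            return k
    return None  # no budget up to max_k meets the target: fix retrieval instead


# Hypothetical first-stage rankings with one gold chunk per request.
SUITE = [
    (["note-a", "faq-a", "rule-a", "misc-a"], {"rule-a"}),
    (["rule-b", "note-b", "faq-b", "misc-b"], {"rule-b"}),
    (["note-c", "rule-c", "faq-c", "misc-c"], {"rule-c"}),
]
print("smallest k for full recall:", smallest_k_meeting(SUITE, 1.0, 4))
```

In this made-up suite the gold chunks sit at ranks 3, 1, and 2, so a budget of three is the smallest that gives every request a chance.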

A cross-encoder scores every query-candidate pair online. This fixture uses a deliberately sequential cost model so the candidate-count tradeoff stays visible. It isn't a p95 estimator or a hardware benchmark: production serving may batch pairs, and percentiles must be measured end to end.

latency-budget-gate.py

SEQUENTIAL_PAIR_COST_MS = 7.5
REQUEST_OVERHEAD_MS = 4.0
FIXTURE_LATENCY_BUDGET_MS = 30.0

def estimated_fixture_latency_ms(candidate_count: int) -> float:
    return REQUEST_OVERHEAD_MS + SEQUENTIAL_PAIR_COST_MS * candidate_count

release_fixture_latency = estimated_fixture_latency_ms(len(release_input))
too_wide_fixture_latency = estimated_fixture_latency_ms(len(first_stage))
print(
    f"top-3 fixture latency: {release_fixture_latency:.1f} ms, "
    f"pass={release_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS}"
)
print(
    f"top-4 fixture latency: {too_wide_fixture_latency:.1f} ms, "
    f"pass={too_wide_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS}"
)
assert release_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS
assert too_wide_fixture_latency > FIXTURE_LATENCY_BUDGET_MS

Output

top-3 fixture latency: 26.5 ms, pass=True
top-4 fixture latency: 34.0 ms, pass=False

The lab result is intentionally specific: retrieve at least three candidates, rerank three, and admit at most two chunks into context. A real service chooses these numbers from held-out traces and measured end-to-end p95 latency by candidate count, input length, and batch strategy, not from a generic default.
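A measured gate replaces the sequential cost model with observed samples. The block below is a sketch, assuming hypothetical latency samples per candidate count and a nearest-rank p95; `p95_nearest_rank`, `MEASURED_MS`, and `BUDGET_MS` are illustrative, and the numbers are invented rather than benchmark results.

```python
from __future__ import annotations


def p95_nearest_rank(samples_ms: list[float]) -> float:
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) in integer arithmetic, to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical end-to-end latencies (ms), keyed by reranker candidate count.
MEASURED_MS = {
    3: [20.0] * 18 + [25.0, 40.0],
    4: [28.0] * 18 + [33.0, 50.0],
}
BUDGET_MS = 30.0  # assumed product budget


def counts_within_budget(measured: dict[int, list[float]], budget_ms: float) -> list[int]:
    return sorted(
        count
        for count, samples in measured.items()
        if p95_nearest_rank(samples) <= budget_ms
    )


for count, samples in sorted(MEASURED_MS.items()):
    print(f"k={count}: p95={p95_nearest_rank(samples):.1f} ms")
print("Counts within budget:", counts_within_budget(MEASURED_MS, BUDGET_MS))
```

The latency gate only bounds the candidate count from above. The candidate-recall gate bounds it from below, and a release needs a count that satisfies both.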

Choose an interaction design deliberately

Three ranking designs differ in when the query and chunk can interact:

Design	Stored before request	Request-time work	Role in this pipeline
Bi-encoder retrieval	One embedding per chunk	Compare query vector to indexed chunk vectors	Build a broad candidate set
Cross-encoder reranking	No query-specific pair score	Jointly score each retrieved query-chunk pair	Precision stage here
ColBERT late interaction	Contextual token vectors per chunk	Match query token vectors against chunk token vectors	Alternative to evaluate under a tighter latency budget

ColBERT keeps document-side representations indexable while applying token-level late interaction at request time.^[4] For each query token, MaxSim keeps its highest similarity to any document token, then sums those per-token maxima into a relevance score. It isn't an automatic upgrade. Compare it against a cross-encoder on the same held-out candidate lists, ordering metric, memory cost, and p95 latency.
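The MaxSim rule described above fits in a few lines. This is a toy sketch, assuming hand-written two-dimensional unit vectors in place of learned token embeddings (so the dot product equals cosine similarity); `maxsim_score`, `QUERY_TOKENS`, and `CHUNK_TOKENS` are hypothetical names.

```python
from __future__ import annotations

from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: list[Vector], chunk_tokens: list[Vector]) -> float:
    if not chunk_tokens:
        return 0.0
    # For each query token keep its best match among chunk tokens, then sum.
    return sum(
        max(dot(query_vec, chunk_vec) for chunk_vec in chunk_tokens)
        for query_vec in query_tokens
    )


# Toy unit vectors; a real system would load chunk token vectors from an index.
QUERY_TOKENS = [(1.0, 0.0), (0.0, 1.0)]
CHUNK_TOKENS = [(1.0, 0.0), (0.6, 0.8)]
score = maxsim_score(QUERY_TOKENS, CHUNK_TOKENS)
print(f"MaxSim score: {score:.1f}")
```

The first query token matches a chunk token exactly (1.0), the second finds its best match at 0.8, and the score is their sum, 1.8. Only the query side is encoded at request time; the chunk token vectors are the part that can be stored ahead of time.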

Figure: Bi-encoders compress the query and each chunk separately into one vector each and interact only afterward, cross-encoders let all query and chunk tokens interact in one joint sequence before scoring, and ColBERT keeps chunk token vectors indexed while matching query tokens online with MaxSim.

Emit the trace that evaluation needs

Shipping a reranker without evidence-level traces makes failures ambiguous. Record first-stage IDs, actual reranker input, candidate source identity, policy versions, pair scores, model versions, selected context IDs, and release gates. A document version is part of correctness: a well-ranked stale rule is still unsafe context.

reranking-release-trace.py

SCORER_VERSION = "fixture-cross-encoder-v1"
selected_context = [
    result.candidate
    for result in release_reranked
    if result.score >= MIN_CONTEXT_SCORE
][:CONTEXT_BUDGET]
release_trace = {
    "query_id": "api-token-legacy-access-001",
    "versions": {
        "retriever": "policy-retriever-v2",
        "index": "policy-index/2026-05-27",
        "sparse": "bm25-tokenizer-v1",
        "dense": "fixture-embeddings-v1",
        "fusion": "rrf-k60",
        "reranker": SCORER_VERSION,
    },
    "first_stage_ids": first_stage_ids,
    "rerank_input_ids": [candidate.chunk_id for candidate in release_input],
    "reranked_ids": release_reranked_ids,
    "candidate_records": [
        {
            "chunk_id": result.candidate.chunk_id,
            "document_id": result.candidate.document_id,
            "parent_id": result.candidate.parent_id,
            "version": result.candidate.version,
            "first_stage_rank": result.candidate.first_stage_rank,
            "rerank_score": result.score,
            "reasons": result.reasons,
        }
        for result in release_reranked
    ],
    "selected_context_ids": [candidate.chunk_id for candidate in selected_context],
    "selected_versions": [candidate.version for candidate in selected_context],
    "gates": {
        "permitted_and_current": all(
            candidate.permitted and candidate.current for candidate in release_input
        ),
        "target_in_context": TARGET_ID in [
            candidate.chunk_id for candidate in selected_context
        ],
        "ordering_lift": after_ndcg > before_ndcg,
        "latency_budget": release_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS,
    },
}
stores_raw_policy_text = any(
    candidate.text in str(release_trace)
    for candidate in SOURCE_STORE
)
print("Versions:", release_trace["versions"])
print("Rerank input:", release_trace["rerank_input_ids"])
print("Selected context:", release_trace["selected_context_ids"])
print("Trace stores raw policy text:", stores_raw_policy_text)
print("Gates:", release_trace["gates"])
assert set(release_trace["reranked_ids"]) == set(release_trace["rerank_input_ids"])
assert not stores_raw_policy_text
assert all(release_trace["gates"].values())

Output

Versions: {'retriever': 'policy-retriever-v2', 'index': 'policy-index/2026-05-27', 'sparse': 'bm25-tokenizer-v1', 'dense': 'fixture-embeddings-v1', 'fusion': 'rrf-k60', 'reranker': 'fixture-cross-encoder-v1'}
Rerank input: ['api-token-troubleshooting-v1', 'api-password-reset-v1', 'api-token-legacy-v2-rule']
Selected context: ['api-token-legacy-v2-rule']
Trace stores raw policy text: False
Gates: {'permitted_and_current': True, 'target_in_context': True, 'ordering_lift': True, 'latency_budget': True}

This trace shows that reranking improved the evidence presented to generation. It doesn't yet prove that an answer states the rule faithfully or cites it correctly. That's exactly the boundary for the next lesson.

Production checks

Before releasing a real reranker:

Gate	Evidence to log	Failure response
Candidate recall	Gold chunk ID in first-stage top `k`	Fix retrieval or raise measured candidate budget
Authorization and freshness	ACL decision, chunk version, effective date	Reject request context; never score blocked evidence
Ordering lift	Before/after MRR or NDCG on held-out traces	Retrain, replace, or remove reranker
Serving budget	Candidate count, token length, model version, p95 latency	Batch, cap input, or choose a tested alternative
Downstream grounding	Selected chunk IDs and generated citations	Evaluate in the next pipeline stage

One attractive mistake is caching a reranker result without its source version. If api-token-legacy-v2-rule changes, a cached score for an older chunk can keep stale evidence at rank 1. Key caches and traces by chunk checksum or policy version, then rerun golden traces after source updates.
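One way to make that keying concrete is to fold the chunk checksum, policy version, and scorer version into the cache key. The following is a sketch under stated assumptions: `rerank_cache_key` and its field layout are hypothetical, and SHA-256 is simply one reasonable checksum choice.

```python
from __future__ import annotations

import hashlib


def rerank_cache_key(
    query: str,
    chunk_id: str,
    chunk_text: str,
    policy_version: str,
    scorer_version: str,
) -> str:
    # The checksum changes whenever the chunk text changes, even if the ID does not.
    chunk_checksum = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    # Unit separator keeps adjacent fields from running together.
    material = "\x1f".join(
        [query, chunk_id, chunk_checksum, policy_version, scorer_version]
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


old_key = rerank_cache_key(
    "legacy token endpoint for service account",
    "api-token-legacy-v2-rule",
    "Service accounts may use the legacy token endpoint within 14 days.",
    "api-auth/2026-04-01",
    "fixture-cross-encoder-v1",
)
new_key = rerank_cache_key(
    "legacy token endpoint for service account",
    "api-token-legacy-v2-rule",
    "Service accounts may use the legacy token endpoint within 7 days.",  # edited text
    "api-auth/2026-04-01",
    "fixture-cross-encoder-v1",
)
print("Key changed after source edit:", old_key != new_key)
```

Because the edited chunk text produces a different key, a score cached for the older text is never looked up for the new one, even when the chunk ID and version label were left unchanged by mistake.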

Mastery check

You're ready to add a reranker to a production RAG pipeline when you can:

Preserve authorization and freshness filtering before any pair scoring.
Explain why first-stage recall and final evidence order need separate gates.
Compare bi-encoder retrieval, cross-encoder reranking, and ColBERT-style late interaction.
Measure ordering improvement with MRR or NDCG inside the context budget.
Choose candidate count using candidate recall and measured end-to-end latency.
Emit a trace suitable for downstream faithfulness and citation evaluation.

Evaluation rubric

Level	Evidence in submission
Foundational	Preserves authorization and freshness filtering before pair scoring.
Applied	Reranks only first-stage candidates and proves ordering lift with MRR or NDCG.
Strong	Chooses candidate count using candidate recall and measured end-to-end latency.
Production-ready	Emits versioned evidence traces for downstream faithfulness and citation evaluation.

Common pitfalls

Reranking is used to hide recall failure

Symptom: Teams swap reranker models, but a gold policy chunk never appears in final context.
Cause: First-stage retrieval did not include the chunk in its candidate set.
Fix: Gate on Recall@candidate_k first, then evaluate ordering only on candidate sets that contain the answer.

Candidate count grows without a serving measurement

Symptom: Ordering metrics improve while request latency exceeds the product budget.
Cause: Each additional cross-encoder pair consumes request-time inference.
Fix: Benchmark p95 latency by candidate count and input length, then choose the smallest candidate set that clears recall and ordering gates.

Evidence order improves but grounding remains untested

Symptom: NDCG rises, but users still receive unsupported answers.
Cause: Reranking evaluates selected context, not whether generation follows it.
Fix: Carry the reranking trace into answer faithfulness and citation checks in the next evaluation stage.

Next Step

Continue to RAG Evaluation for Reliable Answers

`policy-answerer-v3` now selects evidence precisely; the next lesson tests whether generation uses that evidence faithfully.

References

1. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
2. Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv preprint.
3. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
4. Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.

    return sorted(
        scored,
        key=lambda result: (result.score, -result.candidate.first_stage_rank),
        reverse=True,
    )

reranked = rerank(REQUEST, rerankable, ConstraintAwarePairScorer())
reranked_ids = [result.candidate.chunk_id for result in reranked]
for result in reranked:
    print(result.candidate.chunk_id, result.score, result.reasons)
assert reranked_ids[0] == TARGET_ID
assert rerank(REQUEST, [], ConstraintAwarePairScorer()) == []

Output

api-token-legacy-v2-rule 7 ('endpoint', 'principal', 'migration-window', 'audit-condition')
api-audit-export-v1 2 ('migration-window',)
api-password-reset-v1 0 ()
api-token-troubleshooting-v1 -1 ('endpoint', 'remedy', 'contradiction')

context-after-reranking.py

reranked_candidates = [result.candidate for result in reranked]
MIN_CONTEXT_SCORE = 5
selected_after = [
    result.candidate
    for result in reranked
    if result.score >= MIN_CONTEXT_SCORE
][:CONTEXT_BUDGET]
context_after = [candidate.chunk_id for candidate in selected_after]
print("Context before:", context_before)
print("Context after:", context_after)
assert TARGET_ID in context_after
assert "api-token-troubleshooting-v1" not in context_after
assert set(reranked_ids) == set(first_stage_ids)

Output

Context before: ['api-token-troubleshooting-v1', 'api-password-reset-v1']
Context after: ['api-token-legacy-v2-rule']

Measure ordering, not vibes

For this stage, keep evaluation narrow:

Recall@candidate_k checks whether first-stage retrieval gave the reranker a chance.
MRR (Mean Reciprocal Rank) averages how early the first relevant chunk appears across a release suite.
NDCG@context_k (Normalized Discounted Cumulative Gain) checks whether relevant chunks fit near the top of the context budget. [3]

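$$\operatorname{DCG}_k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i}-1}{\log_2(i+1)} \qquad \operatorname{NDCG}_k = \frac{\operatorname{DCG}_k}{\operatorname{IDCG}_k}$$

Here $\mathrm{rel}_i$ is the relevance grade of the chunk at rank $i$ (0 or 1 in this lesson's fixture), and $\operatorname{IDCG}_k$ is the $\operatorname{DCG}_k$ of the ideal ordering, so $\operatorname{NDCG}_k$ falls between 0 and 1.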
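As a worked check of the formula with binary relevance, consider a single relevant chunk and a context budget of 2. At rank 2 the gain is discounted by log2(3), so NDCG@2 is about 0.63. At rank 3 the chunk falls outside the budget and NDCG@2 is 0. The sketch below uses hypothetical relevance lists rather than the lesson's candidates.

```python
from math import log2

def dcg(relevances: list[int], k: int) -> float:
    """DCG_k with gain 2^rel - 1 and discount log2(rank + 1)."""
    return sum(
        (2 ** rel - 1) / log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg(relevances: list[int], k: int) -> float:
    """Normalize by the DCG of the same grades sorted best-first."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0

# One relevant chunk placed at rank 1, 2, or 3 (hypothetical orderings).
for label, relevances in [
    ("rank 1", [1, 0, 0]),
    ("rank 2", [0, 1, 0]),
    ("rank 3", [0, 0, 1]),
]:
    print(label, round(ndcg(relevances, k=2), 4))
```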

ordering-gate.py

from math import log2  # used by ndcg_at_k below

RELEVANT_IDS = {TARGET_ID}

def reciprocal_rank(ids: list[str], relevant_ids: set[str]) -> float:
    for rank, chunk_id in enumerate(ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ids: list[str], relevant_ids: set[str], k: int) -> float:
    dcg = sum(
        (2 ** int(chunk_id in relevant_ids) - 1) / log2(rank + 1)
        for rank, chunk_id in enumerate(ids[:k], start=1)
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum((2**1 - 1) / log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

before_rr = reciprocal_rank(first_stage_ids, RELEVANT_IDS)
after_rr = reciprocal_rank(reranked_ids, RELEVANT_IDS)
before_ndcg = ndcg_at_k(first_stage_ids, RELEVANT_IDS, CONTEXT_BUDGET)
after_ndcg = ndcg_at_k(reranked_ids, RELEVANT_IDS, CONTEXT_BUDGET)
print(f"RR: {before_rr:.2f} -> {after_rr:.2f}")
print(f"NDCG@{CONTEXT_BUDGET}: {before_ndcg:.2f} -> {after_ndcg:.2f}")
assert after_rr > before_rr
assert before_ndcg == 0.0 and after_ndcg == 1.0

Output

RR: 0.33 -> 1.00
NDCG@2: 0.00 -> 1.00
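The gate above reports reciprocal rank for one query. MRR is the mean of that value across a suite. A minimal sketch, assuming a hypothetical suite of `(ranked IDs, relevant IDs)` pairs with placeholder chunk IDs:

```python
def reciprocal_rank(ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none appears."""
    for rank, chunk_id in enumerate(ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(suite: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over all queries; an empty suite scores 0.0."""
    if not suite:
        return 0.0
    return sum(reciprocal_rank(ids, rel) for ids, rel in suite) / len(suite)

# Hypothetical release suite, one entry per query.
suite = [
    (["a", "b", "c"], {"a"}),       # RR = 1
    (["d", "e", "f"], {"e"}),       # RR = 1/2
    (["g", "h", "i"], {"z"}),       # RR = 0, relevant chunk never retrieved
    (["j", "k", "l", "m"], {"m"}),  # RR = 1/4
]
print(f"MRR: {mean_reciprocal_rank(suite):.4f}")
```

The third query contributes zero, which is why a low MRR can reflect a recall problem as much as an ordering problem.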

Don't confuse improved evidence ordering with a correct generated answer. The next lesson will evaluate faithfulness and citation agreement after context reaches the generator.

Candidate count sets the ceiling and the bill

If the reranker receives only the first two candidates, it can't select api-token-legacy-v2-rule: the rule was cut before pair scoring. Use a candidate-recall gate before comparing reranker models.

candidate-count-gate.py

def target_present(candidates: list[Candidate]) -> bool:
    return TARGET_ID in [candidate.chunk_id for candidate in candidates]

RERANK_CANDIDATE_BUDGET = 3
too_small = first_stage[:2]
release_input = first_stage[:RERANK_CANDIDATE_BUDGET]
limited_ids = [result.candidate.chunk_id for result in rerank(
    REQUEST, too_small, ConstraintAwarePairScorer()
)]
release_reranked = rerank(REQUEST, release_input, ConstraintAwarePairScorer())
release_reranked_ids = [result.candidate.chunk_id for result in release_reranked]
print("top-2 includes target:", target_present(too_small), limited_ids)
print("top-3 includes target:", target_present(release_input), release_reranked_ids)
assert not target_present(too_small)
assert release_reranked_ids[0] == TARGET_ID

Output

top-2 includes target: False ['api-password-reset-v1', 'api-token-troubleshooting-v1']
top-3 includes target: True ['api-token-legacy-v2-rule', 'api-password-reset-v1', 'api-token-troubleshooting-v1']

latency-budget-gate.py

SEQUENTIAL_PAIR_COST_MS = 7.5
REQUEST_OVERHEAD_MS = 4.0
FIXTURE_LATENCY_BUDGET_MS = 30.0

def estimated_fixture_latency_ms(candidate_count: int) -> float:
    return REQUEST_OVERHEAD_MS + SEQUENTIAL_PAIR_COST_MS * candidate_count

release_fixture_latency = estimated_fixture_latency_ms(len(release_input))
too_wide_fixture_latency = estimated_fixture_latency_ms(len(first_stage))
print(
    f"top-3 fixture latency: {release_fixture_latency:.1f} ms, "
    f"pass={release_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS}"
)
print(
    f"top-4 fixture latency: {too_wide_fixture_latency:.1f} ms, "
    f"pass={too_wide_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS}"
)
assert release_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS
assert too_wide_fixture_latency > FIXTURE_LATENCY_BUDGET_MS

Output

top-3 fixture latency: 26.5 ms, pass=True
top-4 fixture latency: 34.0 ms, pass=False
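The fixture estimates latency from fixed constants; it doesn't measure a deployed endpoint. A serving gate would normally rest on observed samples. The sketch below computes p95 with the nearest-rank definition over hypothetical load-test latencies. The sample values and the `percentile_nearest_rank` helper are illustrative assumptions.

```python
from math import ceil

def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct * n / 100)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical measured latencies (ms) for one candidate count.
samples_ms = [
    21.0, 22.5, 23.1, 24.0, 24.2, 24.8, 25.0, 25.3, 25.9, 26.1,
    26.4, 26.5, 27.0, 27.2, 27.8, 28.0, 28.4, 29.1, 33.7, 41.2,
]
BUDGET_MS = 30.0
p50 = percentile_nearest_rank(samples_ms, 50)
p95 = percentile_nearest_rank(samples_ms, 95)
print(f"p50={p50} ms, p95={p95} ms, pass={p95 <= BUDGET_MS}")
```

In this made-up sample the median clears the budget while p95 does not, which is the case a mean or a single-request estimate would hide.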

Choose an interaction design deliberately

Three ranking designs differ in when the query and chunk can interact:

Design	Stored before request	Request-time work	Role in this pipeline
Bi-encoder retrieval	One embedding per chunk	Compare query vector to indexed chunk vectors	Build a broad candidate set
Cross-encoder reranking	No query-specific pair score	Jointly score each retrieved query-chunk pair	Precision stage here
ColBERT late interaction	Contextual token vectors per chunk	Match query token vectors against chunk token vectors	Alternative to evaluate under a tighter latency budget
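The three rows differ mainly in the shape of the scoring function. The sketch below shows those shapes with hand-written toy vectors and a word-overlap stand-in for a trained pair model. None of it is a real encoder, and every name and value is an assumption made for illustration.

```python
from typing import Callable

Vector = list[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def bi_encoder_score(query_vec: Vector, chunk_vec: Vector) -> float:
    """One vector per side, so chunk vectors can be indexed before any request."""
    return dot(query_vec, chunk_vec)

def late_interaction_score(
    query_tokens: list[Vector], chunk_tokens: list[Vector]
) -> float:
    """ColBERT-style MaxSim: sum over query tokens of the best chunk-token match."""
    return sum(max(dot(q, d) for d in chunk_tokens) for q in query_tokens)

def cross_encoder_score(
    query: str, chunk: str, pair_model: Callable[[str, str], float]
) -> float:
    """The model reads both texts together, so the score exists only at request time."""
    return pair_model(query, chunk)

def toy_pair_model(query: str, chunk: str) -> float:
    # Stand-in for a trained cross-encoder: counts shared lowercase words.
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))

bi = bi_encoder_score([0.5, 0.5], [0.6, 0.4])
late = late_interaction_score(
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
)
cross = cross_encoder_score(
    "legacy token endpoint",
    "the legacy token endpoint is deprecated",
    toy_pair_model,
)
print(round(bi, 2), round(late, 2), cross)
```

Only the first two functions take inputs that can be computed for a chunk without knowing the query, which is why they can rely on stored chunk representations.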

Emit the trace that evaluation needs

reranking-release-trace.py

SCORER_VERSION = "fixture-cross-encoder-v1"
selected_context = [
    result.candidate
    for result in release_reranked
    if result.score >= MIN_CONTEXT_SCORE
][:CONTEXT_BUDGET]
release_trace = {
    "query_id": "api-token-legacy-access-001",
    "versions": {
        "retriever": "policy-retriever-v2",
        "index": "policy-index/2026-05-27",
        "sparse": "bm25-tokenizer-v1",
        "dense": "fixture-embeddings-v1",
        "fusion": "rrf-k60",
        "reranker": SCORER_VERSION,
    },
    "first_stage_ids": first_stage_ids,
    "rerank_input_ids": [candidate.chunk_id for candidate in release_input],
    "reranked_ids": release_reranked_ids,
    "candidate_records": [
        {
            "chunk_id": result.candidate.chunk_id,
            "document_id": result.candidate.document_id,
            "parent_id": result.candidate.parent_id,
            "version": result.candidate.version,
            "first_stage_rank": result.candidate.first_stage_rank,
            "rerank_score": result.score,
            "reasons": result.reasons,
        }
        for result in release_reranked
    ],
    "selected_context_ids": [candidate.chunk_id for candidate in selected_context],
    "selected_versions": [candidate.version for candidate in selected_context],
    "gates": {
        "permitted_and_current": all(
            candidate.permitted and candidate.current for candidate in release_input
        ),
        "target_in_context": TARGET_ID in [
            candidate.chunk_id for candidate in selected_context
        ],
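        # Measure lift on the release rerank input, not the earlier four-candidate run.
        "ordering_lift": ndcg_at_k(
            release_reranked_ids, RELEVANT_IDS, CONTEXT_BUDGET
        ) > before_ndcg,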
        "latency_budget": release_fixture_latency <= FIXTURE_LATENCY_BUDGET_MS,
    },
}
stores_raw_policy_text = any(
    candidate.text in str(release_trace)
    for candidate in SOURCE_STORE
)
print("Versions:", release_trace["versions"])
print("Rerank input:", release_trace["rerank_input_ids"])
print("Selected context:", release_trace["selected_context_ids"])
print("Trace stores raw policy text:", stores_raw_policy_text)
print("Gates:", release_trace["gates"])
assert set(release_trace["reranked_ids"]) == set(release_trace["rerank_input_ids"])
assert not stores_raw_policy_text
assert all(release_trace["gates"].values())

Output

Versions: {'retriever': 'policy-retriever-v2', 'index': 'policy-index/2026-05-27', 'sparse': 'bm25-tokenizer-v1', 'dense': 'fixture-embeddings-v1', 'fusion': 'rrf-k60', 'reranker': 'fixture-cross-encoder-v1'}
Rerank input: ['api-token-troubleshooting-v1', 'api-password-reset-v1', 'api-token-legacy-v2-rule']
Selected context: ['api-token-legacy-v2-rule']
Trace stores raw policy text: False
Gates: {'permitted_and_current': True, 'target_in_context': True, 'ordering_lift': True, 'latency_budget': True}

Production checks

Before releasing a real reranker:

Gate	Evidence to log	Failure response
Candidate recall	Gold chunk ID in first-stage top `k`	Fix retrieval or raise measured candidate budget
Authorization and freshness	ACL decision, chunk version, effective date	Reject request context; never score blocked evidence
Ordering lift	Before/after MRR or NDCG on held-out traces	Retrain, replace, or remove reranker
Serving budget	Candidate count, token length, model version, p95 latency	Batch, cap input, or choose a tested alternative
Downstream grounding	Selected chunk IDs and generated citations	Evaluate in the next pipeline stage
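The request-time rows of the table can be read as an ordered checklist. The sketch below returns the failure response for the first gate that doesn't pass. The gate keys, the choice to check authorization first, and the fail-closed treatment of missing results are assumptions of this sketch. The response strings are copied from the table. Downstream grounding is left out because it is evaluated in the next pipeline stage.

```python
from typing import Optional

# Order is an assumption: the evidence boundary is checked before quality gates.
GATE_RESPONSES = [
    ("authorization_and_freshness",
     "Reject request context; never score blocked evidence"),
    ("candidate_recall",
     "Fix retrieval or raise measured candidate budget"),
    ("ordering_lift",
     "Retrain, replace, or remove reranker"),
    ("serving_budget",
     "Batch, cap input, or choose a tested alternative"),
]

def first_failure(results: dict[str, bool]) -> Optional[tuple[str, str]]:
    """Return (gate, response) for the first failing gate, or None if all pass."""
    for gate, response in GATE_RESPONSES:
        # A gate with no recorded result is treated as failed (fail closed).
        if not results.get(gate, False):
            return gate, response
    return None

results = {
    "authorization_and_freshness": True,
    "candidate_recall": True,
    "ordering_lift": False,
    "serving_budget": False,
}
print(first_failure(results))
```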

Mastery check

You're ready to add a reranker to a production RAG pipeline when you can:

Preserve authorization and freshness filtering before any pair scoring.
Explain why first-stage recall and final evidence order need separate gates.
Compare bi-encoder retrieval, cross-encoder reranking, and ColBERT-style late interaction.
Measure ordering improvement with MRR or NDCG inside the context budget.
Choose candidate count using candidate recall and measured end-to-end latency.
Emit a trace suitable for downstream faithfulness and citation evaluation.

Evaluation rubric

Level	Evidence in submission
Foundational	Preserves authorization and freshness filtering before pair scoring.
Applied	Reranks only first-stage candidates and proves ordering lift with MRR or NDCG.
Strong	Chooses candidate count using candidate recall and measured end-to-end latency.
Production-ready	Emits versioned evidence traces for downstream faithfulness and citation evaluation.

Follow-up questions

Common pitfalls

Reranking is used to hide recall failure

Symptom: Teams swap reranker models, but a gold policy chunk never appears in final context.
Cause: First-stage retrieval did not include the chunk in its candidate set.
Fix: Gate on Recall@candidate_k first, then evaluate ordering only on candidate sets that contain the answer.
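A small triage helper can keep the two failure types apart before anyone swaps models. This is a sketch with hypothetical trace records and placeholder chunk IDs. The `classify` function and its labels are assumptions, not part of the lesson's code.

```python
from collections import Counter

def classify(
    first_stage_ids: list[str],
    reranked_ids: list[str],
    gold_id: str,
    candidate_k: int,
    context_k: int,
) -> str:
    """Label a query by the first stage that lost the gold chunk."""
    if gold_id not in first_stage_ids[:candidate_k]:
        return "recall_failure"    # the reranker never saw the chunk
    if gold_id not in reranked_ids[:context_k]:
        return "ordering_failure"  # seen, but ranked below the context budget
    return "ok"

# Hypothetical traces: (first-stage IDs, reranked IDs, gold chunk ID).
traces = [
    (["a", "b", "c"], ["c", "a", "b"], "c"),
    (["d", "e", "f"], ["d", "e", "f"], "f"),
    (["g", "h", "i"], ["h", "g", "i"], "z"),
]
labels = [
    classify(first, reranked, gold, candidate_k=3, context_k=2)
    for first, reranked, gold in traces
]
print(dict(Counter(labels)))
```

Only queries labeled `ok` or `ordering_failure` say anything about the reranker; `recall_failure` rows belong to the retrieval stage.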

Candidate count grows without a serving measurement

Symptom: Ordering metrics improve while request latency exceeds the product budget.
Cause: Each additional cross-encoder pair consumes request-time inference.
Fix: Benchmark p95 latency by candidate count and input length, then choose the smallest candidate set that clears recall and ordering gates.
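Once recall, ordering, and p95 latency have been measured per candidate count, the selection itself is a simple filter. A minimal sketch follows. Every number in `MEASUREMENTS` is hypothetical, and the thresholds are placeholders rather than recommendations.

```python
from typing import Optional

# Hypothetical held-out measurements; none of these numbers come from the lesson.
MEASUREMENTS = [
    {"candidate_k": 5, "recall": 0.82, "ndcg": 0.61, "p95_ms": 48.0},
    {"candidate_k": 10, "recall": 0.93, "ndcg": 0.74, "p95_ms": 81.0},
    {"candidate_k": 20, "recall": 0.97, "ndcg": 0.78, "p95_ms": 140.0},
    {"candidate_k": 50, "recall": 0.98, "ndcg": 0.79, "p95_ms": 330.0},
]

def smallest_passing_k(
    rows: list[dict],
    min_recall: float,
    min_ndcg: float,
    max_p95_ms: float,
) -> Optional[int]:
    """Smallest candidate count that clears all three gates, or None."""
    passing = [
        row["candidate_k"]
        for row in rows
        if row["recall"] >= min_recall
        and row["ndcg"] >= min_ndcg
        and row["p95_ms"] <= max_p95_ms
    ]
    return min(passing) if passing else None

chosen = smallest_passing_k(MEASUREMENTS, min_recall=0.90, min_ndcg=0.70, max_p95_ms=150.0)
print("chosen candidate_k:", chosen)
```

A `None` result means no candidate count satisfies the budget, which points to batching, input caps, or a different ranking design, not to a larger `k`.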

Evidence order improves but grounding remains untested

Symptom: NDCG rises, but users still receive unsupported answers.
Cause: Reranking evaluates selected context, not whether generation follows it.
Fix: Carry the reranking trace into answer faithfulness and citation checks in the next evaluation stage.
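The simplest check the trace enables is citation membership: every chunk ID the generator cites should be one that was actually selected. The sketch below assumes a hypothetical list of cited IDs produced by the generator. It tests membership only, not whether the answer text is faithful to the cited chunk.

```python
def unsupported_citations(
    selected_context_ids: list[str],
    cited_ids: list[str],
) -> list[str]:
    """Return cited chunk IDs that never reached the generator's context."""
    selected = set(selected_context_ids)
    return [chunk_id for chunk_id in cited_ids if chunk_id not in selected]

# The selected ID matches the lesson's trace; the cited IDs are hypothetical.
trace = {"selected_context_ids": ["api-token-legacy-v2-rule"]}
cited = ["api-token-legacy-v2-rule", "api-token-legacy-v1-rule"]
print(unsupported_citations(trace["selected_context_ids"], cited))
```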

Next Step

Continue to RAG Evaluation for Reliable Answers

`policy-answerer-v3` now selects evidence precisely; the next lesson tests whether generation uses that evidence faithfully.

References

[1] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.

Reimers, N., & Gurevych, I. · 2019 · EMNLP 2019

[2] Passage Re-ranking with BERT.

Nogueira, R., & Cho, K. · 2019 · arXiv preprint

[3] Introduction to Information Retrieval.

Manning, C. D., Raghavan, P., & Schütze, H. · 2008 · Cambridge University Press

[4] ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.

Khattab, O., & Zaharia, M. · 2020 · SIGIR 2020


Mastery Check

The boundary: retrieve candidates, then improve their order

Why does the reranker receive the permitted hybrid candidate list instead of every policy chunk?

Why a cross-encoder helps

The target policy is retrieved at rank 3 but the prompt accepts two chunks. Is this a recall failure or an ordering failure?

Candidate count sets the ceiling and the bill

Can this sequential fixture prove deployed p95 latency for a batched cross-encoder endpoint?

Choose an interaction design deliberately

Why can a document embedding be indexed before a request, while a cross-encoder relevance score can't?

Follow-up questions

Your relevant rule is absent from top 20 hybrid candidates. Will a better cross-encoder fix the request?

The relevant rule is in candidates but falls below the prompt's two-chunk budget. Which metric targets that failure?

A reranker ranks an expired rule first with a high score. Is that a quality win?
