Ranking and Recommendation Systems

Rank documents for a developer using candidate retrieval, relevance metrics, and feedback-loop safeguards.

An incident classifier chooses one action for one alert. A search ranker makes a different prediction: among thousands of documents, which few should appear first for a query or developer?

This is a ranking problem. Retrieval finds plausible candidates quickly; a ranking model orders them using query, document, freshness, and source-trust features. A recommendation feed uses the same shape without a typed query: developer and session context replace query features. In both cases, the displayed order becomes part of the future training data.

Figure: Ranking evidence dashboard showing eligibility, candidate recall, reranking, and click-position correction. Eligibility removes two blocked documents, a candidate budget of `5` restores all three relevant documents, reranking reaches NDCG@5 of `1.000`, and propensity correction shows why raw click-through rate isn't a relevance label.

Separate eligibility, retrieval, and ranking

Suppose a developer searches for retry idempotency key. The corpus contains one million documents. A feature-heavy model shouldn't score every document on every request.

Stage	Job	Example guard
eligibility filter	remove forbidden or stale documents	current, permitted source
candidate generation	recover perhaps 200 plausible documents	text or embedding retrieval
ranking	order candidates precisely	relevance, freshness, source trust
business policy	enforce final constraints	sponsored labels, diversity, safety

A document absent from the candidate set can't be rescued by a perfect ranker. A forbidden document shouldn't enter the candidate set at all. Candidate recall, policy filtering, and ranking quality are separate obligations.

Build a seven-document fixture. The relevance value is an offline human judgment for one frozen query. It belongs in evaluation data, not in online model features.

build-doc-search-fixture.py

from hashlib import sha256
from json import dumps
from math import log2

query = "retry idempotency key"
corpus = [
    {"id": "api-guide", "title": "retry idempotency API guide", "current": True, "permitted": True, "retrieval": 0.96, "query_fit": 0.98, "reliability": 0.88, "relevance": 3},
    {"id": "blog-post", "title": "general release blog", "current": True, "permitted": True, "retrieval": 0.91, "query_fit": 0.35, "reliability": 0.97, "relevance": 0},
    {"id": "stale-changelog", "title": "legacy changelog", "current": True, "permitted": True, "retrieval": 0.90, "query_fit": 0.15, "reliability": 0.99, "relevance": 0},
    {"id": "retry-runbook", "title": "retry policy runbook", "current": True, "permitted": True, "retrieval": 0.84, "query_fit": 0.92, "reliability": 0.98, "relevance": 2},
    {"id": "sdk-example", "title": "idempotency SDK example", "current": True, "permitted": True, "retrieval": 0.80, "query_fit": 0.86, "reliability": 0.96, "relevance": 2},
    {"id": "private-draft", "title": "private retry incident draft", "current": True, "permitted": False, "retrieval": 0.99, "query_fit": 0.99, "reliability": 0.99, "relevance": 3},
    {"id": "stale-guide", "title": "stale retry migration guide", "current": False, "permitted": True, "retrieval": 0.95, "query_fit": 0.94, "reliability": 0.70, "relevance": 2},
]

print("corpus rows:", len(corpus))
print("query:", query)

Output

corpus rows: 7
query: retry idempotency key

The corpus intentionally includes a private draft and a stale guide. Remove both before retrieval. A learned score can't override access policy.

filter-eligible-listings.py

eligible = [
    item
    for item in corpus
    if item["current"] and item["permitted"]
]
blocked = sorted(item["id"] for item in corpus if item not in eligible)

print("eligible:", [item["id"] for item in eligible])
print("blocked:", blocked)

Output

eligible: ['api-guide', 'blog-post', 'stale-changelog', 'retry-runbook', 'sdk-example']
blocked: ['private-draft', 'stale-guide']

Measure candidate recall before ordering quality

Candidate generation runs a cheap retrieval stage. This lab uses a precomputed retrieval score so you can focus on the contract between stages. A real system might combine lexical retrieval, embeddings, popularity, and personalized retrieval.

The judged relevant eligible documents are api-guide, retry-runbook, and sdk-example. Measure how many survive candidate generation.

measure-candidate-recall.py

relevant_ids = {
    item["id"]
    for item in eligible
    if item["relevance"] > 0
}

def candidates_at(limit):
    return sorted(
        eligible,
        key=lambda item: (-item["retrieval"], item["id"]),
    )[:limit]

def candidate_recall(items):
    returned_ids = {item["id"] for item in items}
    return len(returned_ids & relevant_ids) / len(relevant_ids)

for limit in (3, 5):
    items = candidates_at(limit)
    print(
        f"k={limit} ids={[item['id'] for item in items]} "
        f"recall={candidate_recall(items):.3f}"
    )

Output

k=3 ids=['api-guide', 'blog-post', 'stale-changelog'] recall=0.333
k=5 ids=['api-guide', 'blog-post', 'stale-changelog', 'retry-runbook', 'sdk-example'] recall=1.000

At k=3, retrieval keeps only one of three relevant eligible documents. No downstream ranker can recover the missing runbook or SDK example. Raising this tiny lab's candidate budget to 5 restores candidate recall before the expensive scorer runs.

Pause and predict: If the ranker becomes perfect while candidate budget stays at 3, can the runbook appear? No. It never reaches the ranker.
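The same contract can be turned around: rather than checking recall at a fixed budget, look for the smallest budget that reaches a recall target. The sketch below is self-contained and copies the retrieval scores and judged relevance of the five eligible documents from the lab. The helper names `recall_at` and `min_budget_for` are hypothetical, not part of the lab files.

```python
# Retrieval scores and offline relevance labels for the lab's eligible documents.
eligible = [
    {"id": "api-guide", "retrieval": 0.96, "relevance": 3},
    {"id": "blog-post", "retrieval": 0.91, "relevance": 0},
    {"id": "stale-changelog", "retrieval": 0.90, "relevance": 0},
    {"id": "retry-runbook", "retrieval": 0.84, "relevance": 2},
    {"id": "sdk-example", "retrieval": 0.80, "relevance": 2},
]

def recall_at(limit):
    """Fraction of judged-relevant documents kept by the top `limit` candidates."""
    relevant = {doc["id"] for doc in eligible if doc["relevance"] > 0}
    top = sorted(eligible, key=lambda doc: (-doc["retrieval"], doc["id"]))[:limit]
    return len({doc["id"] for doc in top} & relevant) / len(relevant)

def min_budget_for(target):
    """Smallest candidate budget whose recall meets `target`, or None if unreachable."""
    for limit in range(1, len(eligible) + 1):
        if recall_at(limit) >= target:
            return limit
    return None

recall_curve = {limit: round(recall_at(limit), 3) for limit in range(1, 6)}
print("recall curve:", recall_curve)
print("min budget for full recall:", min_budget_for(1.0))
```

On one query this only restates the k=3 versus k=5 comparison. Across a frozen query set, the same curve shows how much recall each extra candidate buys before the scorer's latency budget is spent.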

Rerank the surviving candidates

Keep five candidates. The first-stage retrieval order is fast but imprecise: blog-post and stale-changelog sit near the top on retrieval alone, yet their query_fit stays low for the idempotency intent.

The scorer below is hand-written so each feature remains visible. query_fit represents a richer query-document match feature. reliability represents source trust. Neither uses the offline relevance label.

score-and-rerank.py

candidates = candidates_at(5)

def rank_score(item):
    return 2.0 * item["query_fit"] + 0.6 * item["reliability"]

ranked = sorted(
    candidates,
    key=lambda item: (-rank_score(item), item["id"]),
)

print("retrieval order:", [item["id"] for item in candidates])
print("ranked order:", [item["id"] for item in ranked])
print("rank scores:", [(item["id"], round(rank_score(item), 3)) for item in ranked])

Output

retrieval order: ['api-guide', 'blog-post', 'stale-changelog', 'retry-runbook', 'sdk-example']
ranked order: ['api-guide', 'retry-runbook', 'sdk-example', 'blog-post', 'stale-changelog']
rank scores: [('api-guide', 2.488), ('retry-runbook', 2.428), ('sdk-example', 2.296), ('blog-post', 1.282), ('stale-changelog', 0.894)]

A production scorer learns weights or nonlinear interactions from labeled examples. The request path remains the same: retrieve cheaply, score the bounded candidate set, then apply policy.

A metric that values the top slots

In a ranked list, placing the best result first matters more than moving it from position 40 to position 39. Discounted Cumulative Gain (DCG) gives graded relevance near the top more weight:

$$\text{DCG@k} = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i+1)}$$

Here $rel_i$ is the judged relevance at position $i$. Position 1 has denominator $\log_2(2)=1$, so it keeps full weight. Lower positions receive progressively smaller weight. Normalized DCG (NDCG) divides by the score of the ideal ordering, producing a value between zero and one for a query. [1]

Calculate NDCG@5 for both orders. Build the ideal top five by sorting all candidate relevances before truncating; otherwise a relevant candidate below the evaluated slate could disappear from the denominator. The guard for best == 0 handles a query with no judged relevant documents instead of dividing by zero.

This lab computes reranker NDCG over documents that survived retrieval. That boundary is deliberate: candidate recall measures misses before ranking, while reranker NDCG measures ordering quality inside the bounded candidate set. A whole-system evaluation can also compute NDCG against the eligible judged corpus so retrieval misses reduce the final score.

compare-ndcg.py

def dcg(relevances):
    return sum(
        (2**relevance - 1) / log2(rank + 2)
        for rank, relevance in enumerate(relevances)
    )

def reranker_ndcg(items, limit):
    relevances = [item["relevance"] for item in items[:limit]]
    ideal = sorted(
        (item["relevance"] for item in items),
        reverse=True,
    )[:limit]
    best = dcg(ideal)
    return dcg(relevances) / best if best else 0.0

print("retrieval-order reranker ndcg@5:", round(reranker_ndcg(candidates, 5), 3))
print("ranked-order reranker ndcg@5:", round(reranker_ndcg(ranked, 5), 3))

Output

retrieval-order reranker ndcg@5: 0.91
ranked-order reranker ndcg@5: 1.0

The scorer lifts reranker NDCG@5 from 0.910 to 1.000. That doesn't prove the model will improve a live docs portal. It only proves this frozen judged query improved after reranking.
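The 0.910 figure can be checked by hand from the judged relevances. This is a sanity check on the arithmetic, not a new result. In retrieval order the relevances are (3, 0, 0, 2, 2), and the ideal order is (3, 2, 2, 0, 0). Values are rounded to three decimals.

```latex
\begin{aligned}
\text{DCG@5}_{\text{retrieval}} &= \frac{2^3-1}{\log_2 2} + \frac{2^0-1}{\log_2 3} + \frac{2^0-1}{\log_2 4}
  + \frac{2^2-1}{\log_2 5} + \frac{2^2-1}{\log_2 6} \\
  &\approx 7 + 0 + 0 + 1.292 + 1.161 = 9.453 \\
\text{DCG@5}_{\text{ideal}} &= \frac{2^3-1}{\log_2 2} + \frac{2^2-1}{\log_2 3} + \frac{2^2-1}{\log_2 4} + 0 + 0 \\
  &\approx 7 + 1.893 + 1.500 = 10.393 \\
\text{NDCG@5}_{\text{retrieval}} &\approx \frac{9.453}{10.393} \approx 0.910
\end{aligned}
```

The reranked order matches the ideal order, so its DCG equals the ideal DCG and the ratio is 1.000.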

One query isn't an evaluation set. Average reranker NDCG over a frozen query set, report slices such as new documents and languages, and inspect failed queries. Keep candidate recall separate so a strong reranker score can't hide retrieval loss.
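To see why candidate recall has to stay separate, compare the two NDCG boundaries on the starved candidate budget of 3. This is a minimal sketch that reuses the lab's judged relevances. The function `ndcg` and its `ideal_pool` parameter are assumptions introduced here so the ideal ordering can come either from the candidates alone or from the whole eligible judged corpus.

```python
from math import log2

# Offline relevance labels for the lab's five eligible documents.
judged = {
    "api-guide": 3,
    "blog-post": 0,
    "stale-changelog": 0,
    "retry-runbook": 2,
    "sdk-example": 2,
}

def dcg(relevances):
    # Position 1 gets denominator log2(2) = 1; later positions are discounted.
    return sum((2**rel - 1) / log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(shown_ids, ideal_pool, limit):
    """NDCG@limit where the ideal ordering is drawn from `ideal_pool`."""
    shown = [judged[doc_id] for doc_id in shown_ids[:limit]]
    ideal = sorted(ideal_pool, reverse=True)[:limit]
    best = dcg(ideal)
    return dcg(shown) / best if best else 0.0

# With a candidate budget of 3, only these documents ever reach the ranker.
slate = ["api-guide", "blog-post", "stale-changelog"]

reranker_view = ndcg(slate, [judged[doc_id] for doc_id in slate], 3)
system_view = ndcg(slate, list(judged.values()), 3)

print("reranker ndcg@3:", round(reranker_view, 3))
print("system ndcg@3:", round(system_view, 3))
```

The reranker-only score looks perfect because the ranker ordered what it was given as well as possible. The system-level score drops because the ideal slate includes the runbook and SDK example that retrieval discarded.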

Turn judgments into pair preferences

Pairwise learning-to-rank methods such as RankNet train from preferences so a relevant item scores above a less relevant one. [2] For this one query, create the preference pairs from offline judgments.

build-pair-preferences.py

pairs = [
    (preferred["id"], other["id"])
    for preferred in ranked
    for other in ranked
    if preferred["relevance"] > other["relevance"]
]

print("pair count:", len(pairs))
print("first pairs:", pairs[:5])

Output

pair count: 8
first pairs: [('api-guide', 'retry-runbook'), ('api-guide', 'sdk-example'), ('api-guide', 'blog-post'), ('api-guide', 'stale-changelog'), ('retry-runbook', 'blog-post')]

The pair ("api-guide", "blog-post") says the API guide should score higher for this query. A learned pairwise model uses features to reduce preference mistakes across many queries. The hand-written scorer stays useful as a transparent baseline.
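One way to make "preference mistakes" concrete is to count misordered pairs and compute a pairwise logistic loss, `log(1 + exp(-(s_preferred - s_other)))`, which is the form a RankNet-style cross-entropy takes when the target preference is certain. The sketch below is illustrative only: it evaluates two fixed scorers and trains nothing. The names `pairwise_report`, `hand_score`, and `retrieval_score` are hypothetical.

```python
from math import exp, log

# Features and offline labels copied from the lab's five candidates.
docs = {
    "api-guide": {"retrieval": 0.96, "query_fit": 0.98, "reliability": 0.88, "relevance": 3},
    "retry-runbook": {"retrieval": 0.84, "query_fit": 0.92, "reliability": 0.98, "relevance": 2},
    "sdk-example": {"retrieval": 0.80, "query_fit": 0.86, "reliability": 0.96, "relevance": 2},
    "blog-post": {"retrieval": 0.91, "query_fit": 0.35, "reliability": 0.97, "relevance": 0},
    "stale-changelog": {"retrieval": 0.90, "query_fit": 0.15, "reliability": 0.99, "relevance": 0},
}

# Preference pairs: the first document has strictly higher judged relevance.
pairs = [
    (a, b)
    for a in docs
    for b in docs
    if docs[a]["relevance"] > docs[b]["relevance"]
]

def hand_score(doc):
    return 2.0 * doc["query_fit"] + 0.6 * doc["reliability"]

def retrieval_score(doc):
    return doc["retrieval"]

def pairwise_report(score):
    """Return (misordered pair count, mean pairwise logistic loss) for a scorer."""
    misordered = 0
    total_loss = 0.0
    for preferred, other in pairs:
        margin = score(docs[preferred]) - score(docs[other])
        misordered += margin <= 0
        total_loss += log(1.0 + exp(-margin))
    return misordered, total_loss / len(pairs)

for name, scorer in (("retrieval", retrieval_score), ("hand-written", hand_score)):
    wrong, loss = pairwise_report(scorer)
    print(f"{name}: misordered={wrong} of {len(pairs)} mean_loss={loss:.3f}")
```

The misordered count is the easier number to compare here, because the two scorers use different scales and the loss depends on scale. A learned pairwise model would adjust its weights to push this loss down across many queries.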

Figure: Pipeline diagram with four stages: eligible corpus (freshness + policy), candidates (fast retrieval), ranker (query + document features), and top results (logged exposure).

Log exposure before reading clicks

The ranked slate is now ready to display. Persist one immutable impression row per exposed document before reading clicks. Later click and resolved events join back through request and document IDs. Without the slate, position, eligibility-policy version, candidate-set version, and ranker version, you can't reproduce what the developer saw.

write-impression-log.py

slate = ranked[:3]
impressions = [
    {
        "request_id": "req-1042",
        "query": query,
        "eligibility_policy": "docs-access-policy-v4",
        "candidate_set": "retrieval-v3",
        "ranker": "docs-ranker-v7",
        "document_id": item["id"],
        "position": position,
    }
    for position, item in enumerate(slate, start=1)
]
outcomes = [
    {
        "event_id": "evt-click-1042",
        "occurred_at": "2026-05-01T12:00:03Z",
        "request_id": "req-1042",
        "document_id": "api-guide",
        "event": "click",
    },
    {
        "event_id": "evt-resolved-1042",
        "occurred_at": "2026-05-01T12:04:18Z",
        "request_id": "req-1042",
        "document_id": "api-guide",
        "event": "resolved",
    },
]

print("slate:", [item["id"] for item in slate])
print("logged positions:", [
    (row["document_id"], row["position"])
    for row in impressions
])
print("later outcomes:", [
    (row["document_id"], row["event"])
    for row in outcomes
])

Output

slate: ['api-guide', 'retry-runbook', 'sdk-example']
logged positions: [('api-guide', 1), ('retry-runbook', 2), ('sdk-example', 3)]
later outcomes: [('api-guide', 'click'), ('api-guide', 'resolved')]

The first row doesn't mean api-guide is universally best. It means this request exposed it at position 1, and one developer clicked it.

Log enough context to interpret later outcomes:

Logged field	Reason
request and query ID	group displayed slate and join later outcomes
eligibility-policy version	know which documents were allowed
candidate set version	know what could have ranked
final positions	measure exposure
ranker version	attribute outcomes
timestamped click, save, resolved, bad-result events	distinguish curiosity from value
freshness and permission snapshot	reproduce context
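A minimal sketch of the join these fields enable follows, assuming outcomes are keyed by request ID and document ID as in the lab's rows. The helper `join_outcomes` is hypothetical, and the third outcome row is an invented orphan included only to show what a broken join looks like.

```python
# Trimmed impression rows: one per exposed document, written before any click.
impressions = [
    {"request_id": "req-1042", "document_id": "api-guide", "position": 1},
    {"request_id": "req-1042", "document_id": "retry-runbook", "position": 2},
    {"request_id": "req-1042", "document_id": "sdk-example", "position": 3},
]
outcomes = [
    {"request_id": "req-1042", "document_id": "api-guide", "event": "click"},
    {"request_id": "req-1042", "document_id": "api-guide", "event": "resolved"},
    # Hypothetical orphan: this document was never logged as exposed.
    {"request_id": "req-1042", "document_id": "private-draft", "event": "click"},
]

def join_outcomes(impression_rows, outcome_rows):
    """Attach events to their impression row; return unmatched outcomes separately."""
    by_key = {
        (row["request_id"], row["document_id"]): {**row, "events": []}
        for row in impression_rows
    }
    orphans = []
    for outcome in outcome_rows:
        key = (outcome["request_id"], outcome["document_id"])
        if key in by_key:
            by_key[key]["events"].append(outcome["event"])
        else:
            orphans.append(outcome)
    return list(by_key.values()), orphans

joined, orphans = join_outcomes(impressions, outcomes)
for row in joined:
    print(row["position"], row["document_id"], row["events"])
print("orphan outcomes:", [(o["document_id"], o["event"]) for o in orphans])
```

Exposed documents with no events stay in the joined output. Those rows are the non-clicks, and dropping them would hide the exposure that later bias correction depends on.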

Clicks are biased labels

When a model places a document first, that document receives more attention and therefore more clicks. Training directly on clicks rewards past ranking position as if it were document relevance. This is a feedback loop.

Take two documents with equal underlying appeal. The historical ranker always places same-quality-a first and same-quality-b second. Slot one is examined more often.

reproduce-position-bias.py

historical = [
    {"document": "same-quality-a", "position": 1, "impressions": 100, "clicks": 40},
    {"document": "same-quality-b", "position": 2, "impressions": 100, "clicks": 20},
]

for row in historical:
    raw_ctr = row["clicks"] / row["impressions"]
    print(
        f"{row['document']}: position={row['position']} "
        f"raw_ctr={raw_ctr:.2f}"
    )

Output

same-quality-a: position=1 raw_ctr=0.40
same-quality-b: position=2 raw_ctr=0.20

Naive training labels say document a is twice as attractive. The log doesn't justify that conclusion because document and position never vary independently.

A simple counterfactual correction weights each click by the inverse of the probability that its position was examined. The propensities below are deliberately fixed for the miniature example: slot one is examined with probability 1.0, slot two with probability 0.5.

correct-toy-position-bias.py

examination_propensity = {1: 1.0, 2: 0.5}

for row in historical:
    corrected_signal = (
        row["clicks"]
        / examination_propensity[row["position"]]
        / row["impressions"]
    )
    print(f"{row['document']}: corrected_signal={corrected_signal:.2f}")

Output

same-quality-a: corrected_signal=0.40
same-quality-b: corrected_signal=0.40

Both corrected signals become 0.40. The arithmetic exposes the idea, not a production estimator. Real propensities need careful estimation, often through randomized interventions or a validated click model. Very small propensities also create large, noisy weights, so a production estimator needs variance controls and evaluation. Joachims, Swaminathan, and Schnabel describe propensity-weighted learning-to-rank as a counterfactual correction for biased implicit feedback.
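Weight clipping is one common variance control: cap the inverse-propensity weight at a fixed maximum. The sketch below assumes a cap of 10 and adds a hypothetical ninth slot with propensity 0.01 and a single click. Both values are invented for illustration, and `clipped_ips_signal` is not part of the lab files.

```python
def clipped_ips_signal(clicks, impressions, propensity, max_weight=10.0):
    """Inverse-propensity-weighted click rate with the weight capped at max_weight."""
    weight = min(1.0 / propensity, max_weight)
    return clicks * weight / impressions

rows = [
    {"document": "same-quality-a", "position": 1, "impressions": 100, "clicks": 40, "propensity": 1.0},
    {"document": "same-quality-b", "position": 2, "impressions": 100, "clicks": 20, "propensity": 0.5},
    # Hypothetical deep slot: rarely examined, so one click carries a weight of 100.
    {"document": "deep-slot-doc", "position": 9, "impressions": 100, "clicks": 1, "propensity": 0.01},
]

for row in rows:
    unclipped = row["clicks"] / row["propensity"] / row["impressions"]
    clipped = clipped_ips_signal(row["clicks"], row["impressions"], row["propensity"])
    print(f"{row['document']}: unclipped={unclipped:.2f} clipped={clipped:.2f}")
```

Without the cap, a single click in the deep slot scores higher than forty clicks in slot one. The cap removes that swing at the cost of underweighting clicks from rarely examined positions, so it trades some bias for lower variance, and the cap itself needs evaluation.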

Protect users and the dataset

Ranking can narrow what developers see. Add constraints for stale documents, private drafts, source policy, and overconcentration. Monitor slices such as new documents, small doc sets, languages, and stale categories rather than accepting one portal-wide NDCG.

Policy must stay outside the learned score. Reproduce a bad integration that ranks the raw corpus without applying eligibility first.

reproduce-policy-bypass.py

unsafe_order = sorted(
    corpus,
    key=lambda item: (-rank_score(item), item["id"]),
)

print("unsafe first result:", unsafe_order[0]["id"])
print("permitted:", unsafe_order[0]["permitted"])

Output

unsafe first result: private-draft
permitted: False

The private draft ranks first because its query match is strong. Better model training won't fix this bug. Eligibility filtering must run before retrieval and remain covered by release checks.
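One way to keep the filter from being skipped is to make the ranking entry point apply eligibility itself and to add an independent check on any slate before it is shown. The sketch below uses three of the lab's documents. The names `rank_eligible` and `blocked_in` are hypothetical, and a real system would enforce access policy in the retrieval index as well.

```python
corpus = [
    {"id": "api-guide", "current": True, "permitted": True, "query_fit": 0.98, "reliability": 0.88},
    {"id": "private-draft", "current": True, "permitted": False, "query_fit": 0.99, "reliability": 0.99},
    {"id": "stale-guide", "current": False, "permitted": True, "query_fit": 0.94, "reliability": 0.70},
]

def is_eligible(item):
    return item["current"] and item["permitted"]

def rank_score(item):
    return 2.0 * item["query_fit"] + 0.6 * item["reliability"]

def rank_eligible(items):
    """Filter first, then score: callers cannot rank a blocked document by accident."""
    allowed = [item for item in items if is_eligible(item)]
    return sorted(allowed, key=lambda item: (-rank_score(item), item["id"]))

def blocked_in(slate):
    """Release check: IDs of any blocked documents that reached a slate."""
    return [item["id"] for item in slate if not is_eligible(item)]

safe_slate = rank_eligible(corpus)
unsafe_slate = sorted(corpus, key=lambda item: (-rank_score(item), item["id"]))

print("safe slate:", [item["id"] for item in safe_slate])
print("blocked in safe slate:", blocked_in(safe_slate))
print("blocked in unsafe slate:", blocked_in(unsafe_slate))
```

The check is deliberately separate from the ranker, so it still catches a leak if a later integration bypasses `rank_eligible`.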

Gate an offline candidate

A useful release packet includes offline query judgments, candidate recall, reranker NDCG, p95 scoring latency, diversity checks, impression and outcome schemas, and an A/B stopping rule. A list that looks good offline isn't allowed to silently write its own future training labels.

Use concrete gates for this candidate:

candidate recall at 5 must be 1.0
reranker NDCG@5 must be at least 0.95
p95 scorer latency must be at most 20 milliseconds
blocked documents must stay absent
impression rows must include fields needed for replay
outcome events must join a logged impression

check-offline-ranking-gates.py

latencies_ms = [11, 13, 12, 15, 14, 16, 13, 12, 17, 18]

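def percentile_nearest_rank(values, percentile):
    # Nearest-rank percentile: rank = ceil(n * percentile), clamped to at least 1.
    # Adding 0.999999 before truncating emulates ceil while tolerating tiny float error.
    rank = max(1, int(len(values) * percentile + 0.999999))
    return sorted(values)[rank - 1]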

required_impression_fields = {
    "request_id",
    "query",
    "eligibility_policy",
    "candidate_set",
    "ranker",
    "document_id",
    "position",
}
required_outcome_fields = {
    "event_id",
    "occurred_at",
    "request_id",
    "document_id",
    "event",
}
exposed_keys = {
    (row["request_id"], row["document_id"])
    for row in impressions
}
release_checks = {
    "candidate_recall_at_5": candidate_recall(candidates) >= 1.0,
    "reranker_ndcg_at_5": reranker_ndcg(ranked, 5) >= 0.95,
    "p95_latency_ms": percentile_nearest_rank(latencies_ms, 0.95) <= 20,
    "blocked_items_absent": not (
        {"private-draft", "stale-guide"}
        & {item["id"] for item in ranked}
    ),
    "impression_schema": all(
        required_impression_fields <= row.keys()
        for row in impressions
    ),
    "outcome_schema": all(
        required_outcome_fields <= row.keys()
        for row in outcomes
    ),
    "outcomes_join_impressions": all(
        (row["request_id"], row["document_id"]) in exposed_keys
        for row in outcomes
    ),
}

for name, passed in release_checks.items():
    print(f"{name}: {passed}")
print("release gate:", all(release_checks.values()))

Output

candidate_recall_at_5: True
reranker_ndcg_at_5: True
p95_latency_ms: True
blocked_items_absent: True
impression_schema: True
outcome_schema: True
outcomes_join_impressions: True
release gate: True

Passing offline checks earns a controlled online experiment, not an immediate global rollout. Engagement changes can reflect latency, layout, freshness, source trust, and ranking behavior. Randomized A/B assignment isolates the candidate change more reliably than comparing two time windows. [3] Predeclare minimum sample size, minimum duration, and safety stops before exposure starts so the team doesn't stop when a noisy result happens to look favorable. The miniature receipt below uses illustrative commitments; real values come from traffic, power analysis, and risk policy.
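As a rough illustration of where a minimum sample size can come from, the sketch below applies the standard normal-approximation formula for comparing two proportions. It assumes the primary metric can be treated as a binary "session resolved" rate, and the baseline of 0.30, the target of 0.32, the 5% significance level, and 80% power are all invented for the example. Because assignment is by developer, sessions from one developer are correlated, so a real analysis would probably need more than this estimate.

```python
from math import ceil
from statistics import NormalDist

def sessions_per_arm(baseline_rate, treatment_rate, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = (
        baseline_rate * (1 - baseline_rate)
        + treatment_rate * (1 - treatment_rate)
    )
    effect = treatment_rate - baseline_rate
    return ceil((z_alpha + z_power) ** 2 * variance / effect**2)

# Hypothetical rates: 30% of control sessions resolve, hoping to detect 32%.
needed = sessions_per_arm(0.30, 0.32)
print("sessions per arm:", needed)
print("sessions per arm for a 1-point lift:", sessions_per_arm(0.30, 0.31))
```

Under these assumptions the estimate comes out at several thousand sessions per arm. Halving the detectable lift roughly quadruples the requirement, which is why the threshold has to be fixed before exposure starts.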

Publish a receipt before exposure begins.

publish-ranking-candidate-receipt.py

receipt = {
    "artifact": "docs-ranker-v7",
    "eligibility_policy": "docs-access-policy-v4",
    "candidate_set": "retrieval-v3",
    "offline": {
        "candidate_recall_at_5": round(candidate_recall(candidates), 3),
        "reranker_ndcg_at_5": round(reranker_ndcg(ranked, 5), 3),
        "p95_latency_ms": percentile_nearest_rank(latencies_ms, 0.95),
    },
    "release_checks": release_checks,
    "required_checks_pass": all(release_checks.values()),
    "experiment": {
        "assignment_unit": "developer_id",
        "control": "docs-ranker-v6",
        "treatment": "docs-ranker-v7",
        "primary_metric": "successful_resolutions_per_session",
        "minimum_duration_days": 14,
        "minimum_requests_per_arm": 5000,
        "guardrails": [
            "bad_result_reports_per_session",
            "p95_latency_ms",
            "blocked_document_impressions",
        ],
        "safety_stops": {
            "blocked_document_impressions": 0,
            "p95_latency_ms": 20,
        },
    },
    "status": "candidate_for_ab_test" if all(release_checks.values()) else "blocked",
}
payload = dumps(receipt, sort_keys=True)
print(dumps(receipt, indent=2, sort_keys=True))
print("receipt sha256:", sha256(payload.encode()).hexdigest()[:12])

Output

{
  "artifact": "docs-ranker-v7",
  "candidate_set": "retrieval-v3",
  "eligibility_policy": "docs-access-policy-v4",
  "experiment": {
    "assignment_unit": "developer_id",
    "control": "docs-ranker-v6",
    "guardrails": [
      "bad_result_reports_per_session",
      "p95_latency_ms",
      "blocked_document_impressions"
    ],
    "minimum_duration_days": 14,
    "minimum_requests_per_arm": 5000,
    "primary_metric": "successful_resolutions_per_session",
    "safety_stops": {
      "blocked_document_impressions": 0,
      "p95_latency_ms": 20
    },
    "treatment": "docs-ranker-v7"
  },
  "offline": {
    "candidate_recall_at_5": 1.0,
    "p95_latency_ms": 18,
    "reranker_ndcg_at_5": 1.0
  },
  "release_checks": {
    "blocked_items_absent": true,
    "candidate_recall_at_5": true,
    "impression_schema": true,
    "outcome_schema": true,
    "outcomes_join_impressions": true,
    "p95_latency_ms": true,
    "reranker_ndcg_at_5": true
  },
  "required_checks_pass": true,
  "status": "candidate_for_ab_test"
}
receipt sha256: d0fe7347e924

The receipt binds eligibility policy, candidate generator, ranker, offline evidence, release checks, assignment unit, primary metric, minimum exposure, and guardrails. Your later analysis must keep these versions pinned.
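A small check can enforce that pinning at analysis time. The sketch below compares logged impression rows against the versions named in the receipt. The helper `version_mismatches` and the second request row are hypothetical, and the pinned ranker here is the treatment arm, so control-arm rows would need their own pin.

```python
# Versions taken from the receipt for the treatment arm.
pinned = {
    "eligibility_policy": "docs-access-policy-v4",
    "candidate_set": "retrieval-v3",
    "ranker": "docs-ranker-v7",
}

impression_rows = [
    {
        "request_id": "req-1042",
        "eligibility_policy": "docs-access-policy-v4",
        "candidate_set": "retrieval-v3",
        "ranker": "docs-ranker-v7",
    },
    # Hypothetical row logged after an unannounced ranker deploy.
    {
        "request_id": "req-2001",
        "eligibility_policy": "docs-access-policy-v4",
        "candidate_set": "retrieval-v3",
        "ranker": "docs-ranker-v8",
    },
]

def version_mismatches(rows, expected):
    """List (request_id, field, logged_value) for every row that drifts from the pin."""
    return [
        (row["request_id"], field, row.get(field))
        for row in rows
        for field, value in expected.items()
        if row.get(field) != value
    ]

drift = version_mismatches(impression_rows, pinned)
print("version drift:", drift)
```

Rows that drift from the pinned versions belong to a different experiment and should be excluded or analyzed separately rather than averaged in.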

Explain the slate without looking back

Practice

Reduce the candidate budget from 5 to 4. Predict which relevant document disappears, then rerun candidate recall and reranker NDCG. Explain why both metrics matter.
Change the blog-post relevance from 0 to 1. Recompute reranker NDCG and explain how graded judgments change the evaluation.
Add a timestamped save outcome event. Decide whether saves should replace resolved sessions as the primary online metric or remain a diagnostic signal.
Change the slot-two propensity from 0.5 to 0.25. Recompute the corrected signal and explain why an incorrect propensity model creates a different bias.
Add a diversity gate that allows at most two documents from one source in the top three. Extend the fixture with source_id, then reproduce one failure.

What strong answers show

Evidence	What a strong answer shows
candidate contract	separates eligibility and retrieval from learned reranking
offline evaluation	measures candidate recall, graded reranker NDCG, slices, and scorer latency
feedback safety	persists immutable impressions before joining later outcome events
debiasing judgment	treats toy propensity weighting as a concept demonstration, not an automatic production fix
experiment plan	pins versions, assignment unit, primary metric, minimum exposure, guardrails, and safety stops

When ranking breaks

Symptom	Cause	Fix
NDCG looks strong but desired document never appears	candidates lost recall	measure retrieval recall separately
Prohibited document appears first	policy filter ran after scoring or not at all	filter eligibility before retrieval and gate blocked impressions
Popular documents dominate forever	click-position loop	log exposure and use controlled evaluation
Corrected click score still looks suspicious	propensity estimate is wrong	validate click model or collect randomized intervention data
Corrected click score becomes unstable	tiny propensities create large inverse weights	add variance controls and evaluate estimator behavior
Relevance improves while bad-result reports rise	wrong online objective	pair engagement with success and bad-result guardrails

Next Step

Continue to Forecasting and Anomaly Detection

Ranking orders choices at one moment. Next you'll predict values over future time periods and distinguish recurring demand patterns from operational surprises.

References

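[1] Manning, C. D., Raghavan, P., Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. https://nlp.stanford.edu/IR-book/

[2] Burges, C. J. C., et al. (2005). Learning to Rank using Gradient Descent. ICML 2005. https://www.microsoft.com/en-us/research/publication/learning-to-rank-using-gradient-descent/

[3] Kohavi, R., Tang, D., Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. https://experimentguide.com/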


LearnProduction ML SystemsRanking and Recommendation Systems

📊MediumEvaluation & Benchmarks

Ranking and Recommendation Systems

Rank documents for a developer using candidate retrieval, relevance metrics, and feedback-loop safeguards.

14 min read

Learning path

Step 45 of 158 in the full curriculum

Gradient Boosted Trees in Production Forecasting and Anomaly Detection

An incident classifier chooses one action for one alert. A search ranker makes a different prediction: among thousands of documents, which few should appear first for a query or developer?

Separate eligibility, retrieval, and ranking

Suppose a developer searches for retry idempotency key. The corpus contains one million documents. A feature-heavy model shouldn't score every document on every request.

Stage	Job	Example guard
eligibility filter	remove forbidden or stale documents	current, permitted source
candidate generation	recover perhaps 200 plausible documents	text or embedding retrieval
ranking	order candidates precisely	relevance, freshness, source trust
business policy	enforce final constraints	sponsored labels, diversity, safety

Build a seven-document fixture. The relevance value is an offline human judgment for one frozen query. It belongs in evaluation data, not in online model features.

build-doc-search-fixture.py

from hashlib import sha256
from json import dumps
from math import log2

query = "retry idempotency key"
corpus = [
    {"id": "api-guide", "title": "retry idempotency API guide", "current": True, "permitted": True, "retrieval": 0.96, "query_fit": 0.98, "reliability": 0.88, "relevance": 3},
    {"id": "blog-post", "title": "general release blog", "current": True, "permitted": True, "retrieval": 0.91, "query_fit": 0.35, "reliability": 0.97, "relevance": 0},
    {"id": "stale-changelog", "title": "legacy changelog", "current": True, "permitted": True, "retrieval": 0.90, "query_fit": 0.15, "reliability": 0.99, "relevance": 0},
    {"id": "retry-runbook", "title": "retry policy runbook", "current": True, "permitted": True, "retrieval": 0.84, "query_fit": 0.92, "reliability": 0.98, "relevance": 2},
    {"id": "sdk-example", "title": "idempotency SDK example", "current": True, "permitted": True, "retrieval": 0.80, "query_fit": 0.86, "reliability": 0.96, "relevance": 2},
    {"id": "private-draft", "title": "private retry incident draft", "current": True, "permitted": False, "retrieval": 0.99, "query_fit": 0.99, "reliability": 0.99, "relevance": 3},
    {"id": "stale-guide", "title": "stale retry migration guide", "current": False, "permitted": True, "retrieval": 0.95, "query_fit": 0.94, "reliability": 0.70, "relevance": 2},
]

print("corpus rows:", len(corpus))
print("query:", query)

Output

corpus rows: 7
query: retry idempotency key

The corpus intentionally includes a private draft and a stale guide. Remove both before retrieval. A learned score can't override access policy.

filter-eligible-listings.py

eligible = [
    item
    for item in corpus
    if item["current"] and item["permitted"]
]
blocked = sorted(item["id"] for item in corpus if item not in eligible)

print("eligible:", [item["id"] for item in eligible])
print("blocked:", blocked)

Output

eligible: ['api-guide', 'blog-post', 'stale-changelog', 'retry-runbook', 'sdk-example']
blocked: ['private-draft', 'stale-guide']

Measure candidate recall before ordering quality

The judged relevant eligible documents are api-guide, retry-runbook, and sdk-example. Measure how many survive candidate generation.

measure-candidate-recall.py

relevant_ids = {
    item["id"]
    for item in eligible
    if item["relevance"] > 0
}

def candidates_at(limit):
    return sorted(
        eligible,
        key=lambda item: (-item["retrieval"], item["id"]),
    )[:limit]

def candidate_recall(items):
    returned_ids = {item["id"] for item in items}
    return len(returned_ids & relevant_ids) / len(relevant_ids)

for limit in (3, 5):
    items = candidates_at(limit)
    print(
        f"k={limit} ids={[item['id'] for item in items]} "
        f"recall={candidate_recall(items):.3f}"
    )

Output

k=3 ids=['api-guide', 'blog-post', 'stale-changelog'] recall=0.333
k=5 ids=['api-guide', 'blog-post', 'stale-changelog', 'retry-runbook', 'sdk-example'] recall=1.000

Pause and predict: If the ranker becomes perfect while candidate budget stays at 3, can the runbook appear? No. It never reaches the ranker.

Rerank the surviving candidates

score-and-rerank.py

candidates = candidates_at(5)

def rank_score(item):
    return 2.0 * item["query_fit"] + 0.6 * item["reliability"]

ranked = sorted(
    candidates,
    key=lambda item: (-rank_score(item), item["id"]),
)

print("retrieval order:", [item["id"] for item in candidates])
print("ranked order:", [item["id"] for item in ranked])
print("rank scores:", [(item["id"], round(rank_score(item), 3)) for item in ranked])

Output

retrieval order: ['api-guide', 'blog-post', 'stale-changelog', 'retry-runbook', 'sdk-example']
ranked order: ['api-guide', 'retry-runbook', 'sdk-example', 'blog-post', 'stale-changelog']
rank scores: [('api-guide', 2.488), ('retry-runbook', 2.428), ('sdk-example', 2.296), ('blog-post', 1.282), ('stale-changelog', 0.894)]

A production scorer learns weights or nonlinear interactions from labeled examples. The request path remains the same: retrieve cheaply, score the bounded candidate set, then apply policy.

A metric that values the top slots

In a ranked list, placing the best result first matters more than moving it from position 40 to position 39. Discounted Cumulative Gain (DCG) gives graded relevance near the top more weight:

\text{DCG@k} = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i+1)}

compare-ndcg.py

def dcg(relevances):
    return sum(
        (2**relevance - 1) / log2(rank + 2)
        for rank, relevance in enumerate(relevances)
    )

def reranker_ndcg(items, limit):
    relevances = [item["relevance"] for item in items[:limit]]
    ideal = sorted(
        (item["relevance"] for item in items),
        reverse=True,
    )[:limit]
    best = dcg(ideal)
    return dcg(relevances) / best if best else 0.0

print("retrieval-order reranker ndcg@5:", round(reranker_ndcg(candidates, 5), 3))
print("ranked-order reranker ndcg@5:", round(reranker_ndcg(ranked, 5), 3))

Output

retrieval-order reranker ndcg@5: 0.91
ranked-order reranker ndcg@5: 1.0

The scorer lifts reranker NDCG@5 from 0.910 to 1.000. That doesn't prove the model will improve a live docs portal. It only proves this frozen judged query improved after reranking.

Turn judgments into pair preferences

build-pair-preferences.py

pairs = [
    (preferred["id"], other["id"])
    for preferred in ranked
    for other in ranked
    if preferred["relevance"] > other["relevance"]
]

print("pair count:", len(pairs))
print("first pairs:", pairs[:5])

Output

pair count: 8
first pairs: [('api-guide', 'retry-runbook'), ('api-guide', 'sdk-example'), ('api-guide', 'blog-post'), ('api-guide', 'stale-changelog'), ('retry-runbook', 'blog-post')]

Log exposure before reading clicks

write-impression-log.py

slate = ranked[:3]
impressions = [
    {
        "request_id": "req-1042",
        "query": query,
        "eligibility_policy": "docs-access-policy-v4",
        "candidate_set": "retrieval-v3",
        "ranker": "docs-ranker-v7",
        "document_id": item["id"],
        "position": position,
    }
    for position, item in enumerate(slate, start=1)
]
outcomes = [
    {
        "event_id": "evt-click-1042",
        "occurred_at": "2026-05-01T12:00:03Z",
        "request_id": "req-1042",
        "document_id": "api-guide",
        "event": "click",
    },
    {
        "event_id": "evt-resolved-1042",
        "occurred_at": "2026-05-01T12:04:18Z",
        "request_id": "req-1042",
        "document_id": "api-guide",
        "event": "resolved",
    },
]

print("slate:", [item["id"] for item in slate])
print("logged positions:", [
    (row["document_id"], row["position"])
    for row in impressions
])
print("later outcomes:", [
    (row["document_id"], row["event"])
    for row in outcomes
])

Output

slate: ['api-guide', 'retry-runbook', 'sdk-example']
logged positions: [('api-guide', 1), ('retry-runbook', 2), ('sdk-example', 3)]
later outcomes: [('api-guide', 'click'), ('api-guide', 'resolved')]

The first row doesn't mean api-guide is universally best. It means this request exposed it at position 1, and one developer clicked it.

Log enough context to interpret later outcomes:

Logged field	Reason
request and query ID	group displayed slate and join later outcomes
eligibility-policy version	know which documents were allowed
candidate set version	know what could have ranked
final positions	measure exposure
ranker version	attribute outcomes
timestamped click, save, resolved, bad-result events	distinguish curiosity from value
freshness and permission snapshot	reproduce context
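
One way to use these fields is to join outcome events onto impression rows by `(request_id, document_id)`, so that every exposed document becomes a labeled row whether or not anything happened to it. The sketch below is a minimal illustration with trimmed copies of the rows above. The `label_impressions` helper is a hypothetical name, not part of the lesson's scripts.

```python
from collections import defaultdict

# Trimmed copies of the impression and outcome rows logged above.
impressions = [
    {"request_id": "req-1042", "document_id": "api-guide", "position": 1},
    {"request_id": "req-1042", "document_id": "retry-runbook", "position": 2},
    {"request_id": "req-1042", "document_id": "sdk-example", "position": 3},
]
outcomes = [
    {"request_id": "req-1042", "document_id": "api-guide", "event": "click"},
    {"request_id": "req-1042", "document_id": "api-guide", "event": "resolved"},
]

def label_impressions(impressions, outcomes):
    """Attach the sorted list of outcome events to each exposed document."""
    events = defaultdict(set)
    for row in outcomes:
        events[(row["request_id"], row["document_id"])].add(row["event"])
    return [
        {
            **row,
            "events": sorted(
                events.get((row["request_id"], row["document_id"]), set())
            ),
        }
        for row in impressions
    ]

labeled = label_impressions(impressions, outcomes)
for row in labeled:
    print(row["document_id"], row["position"], row["events"])
```

Rows with an empty event list are exposed-but-not-clicked examples. Without the impression log they would be indistinguishable from documents that were never shown.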

Clicks are biased labels

Take two documents with equal underlying appeal. The historical ranker always places same-quality-a first and same-quality-b second, and slot one is examined more often than slot two.

reproduce-position-bias.py

historical = [
    {"document": "same-quality-a", "position": 1, "impressions": 100, "clicks": 40},
    {"document": "same-quality-b", "position": 2, "impressions": 100, "clicks": 20},
]

for row in historical:
    raw_ctr = row["clicks"] / row["impressions"]
    print(
        f"{row['document']}: position={row['position']} "
        f"raw_ctr={raw_ctr:.2f}"
    )

Output

same-quality-a: position=1 raw_ctr=0.40
same-quality-b: position=2 raw_ctr=0.20

Naive training labels say document a is twice as attractive. The log doesn't justify that conclusion because document and position never vary independently.

correct-toy-position-bias.py

examination_propensity = {1: 1.0, 2: 0.5}

for row in historical:
    corrected_signal = (
        row["clicks"]
        / examination_propensity[row["position"]]
        / row["impressions"]
    )
    print(f"{row['document']}: corrected_signal={corrected_signal:.2f}")

Output

same-quality-a: corrected_signal=0.40
same-quality-b: corrected_signal=0.40
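
The toy correction is inverse propensity weighting: each click is divided by the assumed probability that its slot was examined. The equal corrected signals depend entirely on the assumed propensities of 1.0 and 0.5, which a real system would have to estimate, for example from randomized position swaps. The sketch below illustrates one common variance control, clipping small propensities. The position-8 row, its propensity, and the `min_propensity` floor are hypothetical values chosen for illustration.

```python
def clipped_ipw_signal(clicks, impressions, propensity, min_propensity=0.1):
    """Inverse-propensity-weighted click rate with a floor on the propensity.

    Clipping caps each click's weight at 1 / min_propensity. That trades
    some bias for lower variance when a slot is rarely examined.
    """
    if impressions == 0:
        return 0.0
    effective = max(propensity, min_propensity)
    return clicks / effective / impressions

# Hypothetical log rows: the deep slot has one click and a tiny propensity.
rows = [
    {"position": 1, "impressions": 100, "clicks": 40, "propensity": 1.0},
    {"position": 2, "impressions": 100, "clicks": 20, "propensity": 0.5},
    {"position": 8, "impressions": 100, "clicks": 1, "propensity": 0.02},
]

for row in rows:
    unclipped = row["clicks"] / row["propensity"] / row["impressions"]
    clipped = clipped_ipw_signal(
        row["clicks"], row["impressions"], row["propensity"]
    )
    print(
        f"position={row['position']} "
        f"unclipped={unclipped:.2f} clipped={clipped:.2f}"
    )
```

At position 8, each click moves the unclipped estimate by 0.50, so a single accidental click would outrank both documents in the top slots. With the assumed floor, the same click moves the clipped estimate by 0.10.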

Protect users and the dataset

Policy must stay outside the learned score. Reproduce a bad integration that ranks the raw corpus without applying eligibility first.

reproduce-policy-bypass.py

unsafe_order = sorted(
    corpus,
    key=lambda item: (-rank_score(item), item["id"]),
)

print("unsafe first result:", unsafe_order[0]["id"])
print("permitted:", unsafe_order[0]["permitted"])

Output

unsafe first result: private-draft
permitted: False

The private draft ranks first because its query match is strong. Better model training won't fix this bug. Eligibility filtering must run before retrieval and remain covered by release checks.
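
A minimal sketch of the safe ordering follows. It assumes a hypothetical four-document corpus: the `permitted` flag appears in the output above, while the `current` flag, the `match_score` values, and the helper names are assumptions standing in for the lesson's fixture and `rank_score`.

```python
# Hypothetical fixture: `current` and `match_score` are illustrative assumptions.
corpus = [
    {"id": "private-draft", "permitted": False, "current": True, "match_score": 3.1},
    {"id": "stale-guide", "permitted": True, "current": False, "match_score": 2.9},
    {"id": "api-guide", "permitted": True, "current": True, "match_score": 2.5},
    {"id": "blog-post", "permitted": True, "current": True, "match_score": 1.3},
]

def is_eligible(item):
    """Policy check that lives outside the learned score."""
    return item["permitted"] and item["current"]

def retrieve(items, budget):
    """Stand-in for cheap retrieval: keep the top `budget` items by match score."""
    return sorted(items, key=lambda item: (-item["match_score"], item["id"]))[:budget]

def safe_slate(corpus, budget):
    # Eligibility runs first, so blocked documents never reach retrieval.
    eligible = [item for item in corpus if is_eligible(item)]
    candidates = retrieve(eligible, budget)
    # Release-check style guard: fail loudly if a blocked item slips through.
    leaked = [item["id"] for item in candidates if not is_eligible(item)]
    if leaked:
        raise ValueError(f"blocked documents reached the slate: {leaked}")
    return candidates

print("safe order:", [item["id"] for item in safe_slate(corpus, budget=2)])
```

Filtering before retrieval also keeps blocked documents from consuming the candidate budget. With a budget of 2 in this fixture, retrieving first and filtering afterwards would leave an empty slate, because both top-scoring documents are blocked.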

Gate an offline candidate

Use concrete gates for this candidate:

candidate recall at 5 must be 1.0
reranker NDCG@5 must be at least 0.95
p95 scorer latency must be at most 20 milliseconds
blocked documents must stay absent
impression rows must include fields needed for replay
outcome events must join a logged impression

check-offline-ranking-gates.py

latencies_ms = [11, 13, 12, 15, 14, 16, 13, 12, 17, 18]

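def percentile_nearest_rank(values, percentile):
    # Nearest-rank percentile: take the ceil(n * percentile)-th smallest value.
    # Adding 0.999999 before truncating approximates ceil() while tolerating
    # tiny floating-point overshoot in n * percentile.
    rank = max(1, int(len(values) * percentile + 0.999999))
    return sorted(values)[rank - 1]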

required_impression_fields = {
    "request_id",
    "query",
    "eligibility_policy",
    "candidate_set",
    "ranker",
    "document_id",
    "position",
}
required_outcome_fields = {
    "event_id",
    "occurred_at",
    "request_id",
    "document_id",
    "event",
}
exposed_keys = {
    (row["request_id"], row["document_id"])
    for row in impressions
}
release_checks = {
    "candidate_recall_at_5": candidate_recall(candidates) >= 1.0,
    "reranker_ndcg_at_5": reranker_ndcg(ranked, 5) >= 0.95,
    "p95_latency_ms": percentile_nearest_rank(latencies_ms, 0.95) <= 20,
    "blocked_items_absent": not (
        {"private-draft", "stale-guide"}
        & {item["id"] for item in ranked}
    ),
    "impression_schema": all(
        required_impression_fields <= row.keys()
        for row in impressions
    ),
    "outcome_schema": all(
        required_outcome_fields <= row.keys()
        for row in outcomes
    ),
    "outcomes_join_impressions": all(
        (row["request_id"], row["document_id"]) in exposed_keys
        for row in outcomes
    ),
}

for name, passed in release_checks.items():
    print(f"{name}: {passed}")
print("release gate:", all(release_checks.values()))

Output

candidate_recall_at_5: True
reranker_ndcg_at_5: True
p95_latency_ms: True
blocked_items_absent: True
impression_schema: True
outcome_schema: True
outcomes_join_impressions: True
release gate: True

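Publish a receipt before exposure begins. The receipt pins the ranker, eligibility-policy, and candidate-set versions alongside the offline results, the release checks, and the experiment plan, so later outcomes can be traced to exactly what was approved.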

publish-ranking-candidate-receipt.py

from hashlib import sha256
from json import dumps

receipt = {
    "artifact": "docs-ranker-v7",
    "eligibility_policy": "docs-access-policy-v4",
    "candidate_set": "retrieval-v3",
    "offline": {
        "candidate_recall_at_5": round(candidate_recall(candidates), 3),
        "reranker_ndcg_at_5": round(reranker_ndcg(ranked, 5), 3),
        "p95_latency_ms": percentile_nearest_rank(latencies_ms, 0.95),
    },
    "release_checks": release_checks,
    "required_checks_pass": all(release_checks.values()),
    "experiment": {
        "assignment_unit": "developer_id",
        "control": "docs-ranker-v6",
        "treatment": "docs-ranker-v7",
        "primary_metric": "successful_resolutions_per_session",
        "minimum_duration_days": 14,
        "minimum_requests_per_arm": 5000,
        "guardrails": [
            "bad_result_reports_per_session",
            "p95_latency_ms",
            "blocked_document_impressions",
        ],
        "safety_stops": {
            "blocked_document_impressions": 0,
            "p95_latency_ms": 20,
        },
    },
    "status": "candidate_for_ab_test" if all(release_checks.values()) else "blocked",
}
payload = dumps(receipt, sort_keys=True)
print(dumps(receipt, indent=2, sort_keys=True))
print("receipt sha256:", sha256(payload.encode()).hexdigest()[:12])

Output

{
  "artifact": "docs-ranker-v7",
  "candidate_set": "retrieval-v3",
  "eligibility_policy": "docs-access-policy-v4",
  "experiment": {
    "assignment_unit": "developer_id",
    "control": "docs-ranker-v6",
    "guardrails": [
      "bad_result_reports_per_session",
      "p95_latency_ms",
      "blocked_document_impressions"
    ],
    "minimum_duration_days": 14,
    "minimum_requests_per_arm": 5000,
    "primary_metric": "successful_resolutions_per_session",
    "safety_stops": {
      "blocked_document_impressions": 0,
      "p95_latency_ms": 20
    },
    "treatment": "docs-ranker-v7"
  },
  "offline": {
    "candidate_recall_at_5": 1.0,
    "p95_latency_ms": 18,
    "reranker_ndcg_at_5": 1.0
  },
  "release_checks": {
    "blocked_items_absent": true,
    "candidate_recall_at_5": true,
    "impression_schema": true,
    "outcome_schema": true,
    "outcomes_join_impressions": true,
    "p95_latency_ms": true,
    "reranker_ndcg_at_5": true
  },
  "required_checks_pass": true,
  "status": "candidate_for_ab_test"
}
receipt sha256: d0fe7347e924
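
The receipt names `developer_id` as the assignment unit. One common way to honor that, sketched here under assumptions, is to hash the unit ID together with an experiment name so each developer lands in the same arm on every request. The `assign_arm` helper, the experiment name, the `dev-N` IDs, and the 50/50 split are hypothetical and not part of the receipt.

```python
from hashlib import sha256

def assign_arm(developer_id, experiment="docs-ranker-v7-vs-v6", treatment_share=0.5):
    """Deterministically map a developer to 'control' or 'treatment'."""
    digest = sha256(f"{experiment}:{developer_id}".encode()).hexdigest()
    # Use the first 32 bits of the digest as a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical developer IDs, used only to show the split and its stability.
arms = [assign_arm(f"dev-{n}") for n in range(1000)]
print("treatment share:", arms.count("treatment") / len(arms))
print("stable:", assign_arm("dev-42") == assign_arm("dev-42"))
```

Assigning by developer rather than by request keeps one person's sessions in a single arm, so a per-session metric such as successful_resolutions_per_session isn't mixed across rankers. Including the experiment name in the hash keeps assignments in one experiment from lining up with assignments in another.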

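Explain the slate without looking back

Why measure candidate recall before ranking NDCG?
Why aren't clicks direct proof of relevance?
Why can't the ranker remove prohibited documents by assigning low scores?
What should an online ranking experiment preserve?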

Practice

Reduce the candidate budget from 5 to 4. Predict which relevant document disappears, then rerun candidate recall and reranker NDCG. Explain why both metrics matter.
Change the blog-post relevance grade from 0 to 1. Recompute reranker NDCG and explain how graded judgments change the evaluation.
Add a timestamped save outcome event. Decide whether saves should replace resolved sessions as the primary online metric or remain a diagnostic signal.
Change the slot-two propensity from 0.5 to 0.25. Recompute the corrected signal and explain why an incorrect propensity model creates a different bias.
Add a diversity gate that allows at most two documents from one source in the top three. Extend the fixture with a source_id field, then reproduce one failure.

What strong answers show

Evidence	What a strong answer shows
candidate contract	separates eligibility and retrieval from learned reranking
offline evaluation	measures candidate recall, graded reranker NDCG, slices, and scorer latency
feedback safety	persists immutable impressions before joining later outcome events
debiasing judgment	treats toy propensity weighting as a concept demonstration, not an automatic production fix
experiment plan	pins versions, assignment unit, primary metric, minimum exposure, guardrails, and safety stops

When ranking breaks

Symptom	Cause	Fix
NDCG looks strong but desired document never appears	candidates lost recall	measure retrieval recall separately
Prohibited document appears first	policy filter ran after scoring or not at all	filter eligibility before retrieval and gate blocked impressions
Popular documents dominate forever	click-position loop	log exposure and use controlled evaluation
Corrected click score still looks suspicious	propensity estimate is wrong	validate click model or collect randomized intervention data
Corrected click score becomes unstable	tiny propensities create large inverse weights	add variance controls and evaluate estimator behavior
Relevance improves while bad-result reports rise	wrong online objective	pair engagement with success and bad-result guardrails

Next Step

Continue to Forecasting and Anomaly Detection

Ranking orders choices at one moment. Next you'll predict values over future time periods and distinguish recurring demand patterns from operational surprises.


