Capstone: Product Ranking

Ship a marketplace ranking candidate with eligibility-filtered retrieval, separate recall and NDCG gates, replayable exposure rows, and an A/B-ready rollback receipt.

The ETA project produced one guarded prediction for one parcel. This project makes a decision that shapes its own future data: which products shoppers see near the top of search results.

Your task is to ship a ranking candidate for searches such as "insulated delivery bag". It may reorder eligible, in-stock listings. It may not surface blocked listings or treat clicks as an unbiased truth signal.

Figure: Product-ranking release evidence for the label-printer query. Raw high-scoring listings P8 and P11 are rejected before retrieval: P8 is policy-blocked and out of stock, and P11 is unavailable in the shopper's region. Eligible products P5, P6, and P4 enter ranking. Control orders them P4, P5, P6 with NDCG 0.736; treatment orders them P5, P6, P4 with NDCG 1.000. An immutable P5 position-one impression precedes the click and purchase outcomes. The release receipt shows recall 1.000, ranker p95 latency 18 milliseconds, 12 of 12 gates passed, and candidate-for-A/B-test status.

Eligibility removes the two strongest raw matches before ranking. Treatment then improves the displayed order, while the immutable P5 impression preserves the exposure that makes later click and purchase events interpretable.

Specify the Ranking Surface

Define the scope tightly:

Contract	Decision
surface	marketplace search results
query set	frozen judged search queries plus online experiment traffic
eligibility	in-stock, deliverable region, policy-approved listing
offline metric	candidate recall@100 and NDCG@10
online primary metric	purchase conversion per search
guardrails	returns, latency, unsafe listings, seller concentration

Candidate generation and reranking are separate artifacts. The candidate component must retrieve relevant items reliably. The ranker can only adjust order inside that eligible set.

Learning-to-rank methods can learn pairwise preferences so a relevant product receives a higher score than a less relevant candidate. [1] That modeling detail doesn't excuse missing policy: eligibility filtering must run before scoring, and its version belongs in every impression log.
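As a rough illustration of the pairwise idea, the sketch below computes a RankNet-style logistic loss over one query's candidates. The helper name `pairwise_logistic_loss` and the toy inputs are assumptions for illustration, not this capstone's ranker; the two score lists happen to reuse the candidate and baseline scores of P4, P5, and P6 from the fixture.

```python
from itertools import combinations
from math import exp, log


def pairwise_logistic_loss(scores: list[float], relevance: list[int]) -> float:
    """Mean logistic loss over pairs whose judged relevance differs."""
    losses = []
    for i, j in combinations(range(len(scores)), 2):
        if relevance[i] == relevance[j]:
            continue  # tied judgments carry no preference
        # Orient the pair so `hi` is the more relevant item.
        hi, lo = (i, j) if relevance[i] > relevance[j] else (j, i)
        margin = scores[hi] - scores[lo]
        losses.append(log(1.0 + exp(-margin)))
    return sum(losses) / len(losses) if losses else 0.0


relevance = [1, 3, 2]               # judged relevance for P4, P5, P6
well_ordered = [0.45, 0.96, 0.84]   # scores that agree with the judgments
mis_ordered = [0.90, 0.72, 0.65]    # scores that put the weakest item first

loss_good = pairwise_logistic_loss(well_ordered, relevance)
loss_bad = pairwise_logistic_loss(mis_ordered, relevance)
print("well-ordered loss is lower:", loss_good < loss_bad)
```

A lower loss only says the scores respect the judged pairs. It says nothing about whether a listing was allowed to be scored in the first place.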

Figure: Ranking pipeline. A catalog snapshot (inventory + policy) feeds eligible candidates (recall@100), then ranker v1 (NDCG@10), then the displayed slate and impression log (position + assignment).

Build offline evidence first

Create a judged fixture set. Each query has eligible candidates and graded relevance from human review. Keep the fixtures, pipeline stages, and tests in a layout like this:

ranking-product/
  data/
    catalog_snapshot.jsonl
    judged_queries.jsonl
    eligibility_policy.json
  retrieval/
    candidate_generator.py
  ranking/
    ranker.py
    evaluate_ndcg.py
  experiments/
    impression_schema.json
    ab_plan.md
  tests/
    test_blocked_listing_never_surfaces.py
    test_ndcg_regression_gate.py

Required offline checks:

Check	Why it blocks release
eligible-only results	a relevant prohibited listing still can't display
candidate recall@100	the ranker can't repair missing products
NDCG@10 by query category	overall improvement may hide poor critical categories
scoring latency	a slower list damages the shopping experience
diversity/seller concentration	one seller with many feedback events shouldn't crowd out the rest of the catalog
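The slice check in the table above can be sketched as follows. The category names, per-query NDCG values, and the 0.90 threshold are invented for illustration; the point is only that a mean can clear a gate while one category fails it.

```python
from collections import defaultdict

# Hypothetical per-query results: (category, ndcg_at_10).
results = [
    ("packaging", 0.99),
    ("packaging", 0.98),
    ("packaging", 0.97),
    ("printers", 0.80),
]
THRESHOLD = 0.90  # assumed gate, not a value from this capstone

by_category: dict[str, list[float]] = defaultdict(list)
for category, score in results:
    by_category[category].append(score)

overall = sum(score for _, score in results) / len(results)
slice_means = {
    category: sum(scores) / len(scores)
    for category, scores in by_category.items()
}
failing = sorted(category for category, mean in slice_means.items() if mean < THRESHOLD)

print(f"overall={overall:.3f}", "failing slices:", failing)
```

Here the overall mean passes the assumed threshold while the `printers` slice does not, which is why the gate should be evaluated per category rather than on the average alone.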

Filter policy before retrieval

Start with the boundary that a model can't learn safely: listing eligibility. A prohibited or sold-out product may have an excellent text match and a high learned score. It still can't reach the ranker.

The compact fixture below keeps two judged searches in memory. Production might retrieve 100 candidates per query; this local receipt uses 3 so each listing stays visible.

01-marketplace-eligibility-contract.py

from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from math import ceil, log2
import json

@dataclass(frozen=True)
class Listing:
    product_id: str
    query: str
    relevance: int
    retrieval_score: float
    baseline_rank_score: float
    candidate_rank_score: float
    in_stock: bool
    policy_approved: bool
    deliverable_region: bool
    seller_id: str

CATALOG = [
    Listing("P1", "insulated-bag", 3, 0.96, 0.82, 0.97, True, True, True, "S1"),
    Listing("P2", "insulated-bag", 1, 0.88, 0.75, 0.62, True, True, True, "S2"),
    Listing("P3", "insulated-bag", 2, 0.84, 0.65, 0.90, True, True, True, "S3"),
    Listing("P9", "insulated-bag", 3, 0.99, 0.99, 0.99, True, False, True, "S9"),
    Listing("P10", "insulated-bag", 2, 0.95, 0.98, 0.96, False, True, True, "S1"),
    Listing("P4", "label-printer", 1, 0.74, 0.90, 0.45, True, True, True, "S4"),
    Listing("P5", "label-printer", 3, 0.93, 0.72, 0.96, True, True, True, "S5"),
    Listing("P6", "label-printer", 2, 0.86, 0.65, 0.84, True, True, True, "S6"),
    Listing("P8", "label-printer", 3, 0.98, 0.99, 0.99, False, False, True, "S8"),
    Listing("P11", "label-printer", 2, 0.97, 0.97, 0.97, True, True, False, "S11"),
]

QUERIES = ("insulated-bag", "label-printer")

def exclusion_reason(listing: Listing) -> str | None:
    if not listing.policy_approved:
        return "policy_block"
    if not listing.in_stock:
        return "out_of_stock"
    if not listing.deliverable_region:
        return "unavailable_region"
    return None

print("catalog listings:", len(CATALOG))
print("queries:", QUERIES)

Output

catalog listings: 10
queries: ('insulated-bag', 'label-printer')

02-filter-and-retrieve-candidates.py

def retrieve(query: str, budget: int = 3) -> list[Listing]:
    eligible = [
        listing for listing in CATALOG
        if listing.query == query and exclusion_reason(listing) is None
    ]
    return sorted(eligible, key=lambda listing: (-listing.retrieval_score, listing.product_id))[:budget]

def relevant_ids(query: str) -> set[str]:
    return {
        listing.product_id for listing in CATALOG
        if listing.query == query and exclusion_reason(listing) is None and listing.relevance >= 2
    }

def candidate_recall(query: str, candidates: list[Listing]) -> float:
    relevant = relevant_ids(query)
    if not relevant:
        raise ValueError(f"{query}: no relevant eligible judgments")
    return len(relevant & {listing.product_id for listing in candidates}) / len(relevant)

retrieved = {query: retrieve(query) for query in QUERIES}
rejected = {
    listing.product_id: exclusion_reason(listing)
    for listing in CATALOG
    if exclusion_reason(listing) is not None
}

for query, candidates in retrieved.items():
    print(query, [listing.product_id for listing in candidates], f"recall@3={candidate_recall(query, candidates):.3f}")
print("rejected:", rejected)

Output

insulated-bag ['P1', 'P2', 'P3'] recall@3=1.000
label-printer ['P5', 'P6', 'P4'] recall@3=1.000
rejected: {'P9': 'policy_block', 'P10': 'out_of_stock', 'P8': 'policy_block', 'P11': 'unavailable_region'}

Products P9 and P8 have the strongest scores for their searches, but policy excludes them. Product P10 is approved but out of stock. Product P11 can't ship to the shopper's region. Filtering first prevents the ranker from treating a business invariant as a preference it can trade away.

Measure Retrieval and Ordering Separately

Candidate recall asks whether relevant eligible products reached the ranker. Normalized discounted cumulative gain (NDCG) then asks whether stronger judgments appear near the top of the returned list. A ranker can't repair a relevant product that retrieval dropped.
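For reference, the gain and discount used by the `dcg` helper in this capstone can be written out. The worked case is the label-printer control order (P4, P5, P6, judged 1, 3, 2). The ideal DCG here is taken over the same returned items sorted by judged relevance, which matches the helper's behavior.

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}

\mathrm{DCG@3} = \frac{1}{\log_2 2} + \frac{7}{\log_2 3} + \frac{3}{\log_2 4}
\approx 1 + 4.417 + 1.5 = 6.917

\mathrm{IDCG@3} = \frac{7}{\log_2 2} + \frac{3}{\log_2 3} + \frac{1}{\log_2 4}
\approx 7 + 1.893 + 0.5 = 9.393

\mathrm{NDCG@3} \approx \frac{6.917}{9.393} \approx 0.736
```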

The next cell reranks each eligible candidate set and compares baseline and candidate NDCG. It also rechecks the invariant after scoring.

03-ndcg-ranking-helpers.py

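def dcg(rows: list[Listing]) -> float:
    # Exponential gain (2^relevance - 1) with a log2 position discount.
    # enumerate() is 0-based, so position 1 divides by log2(2) = 1.
    return sum((2**listing.relevance - 1) / log2(rank + 2) for rank, listing in enumerate(rows))

def ndcg(rows: list[Listing]) -> float:
    # Normalize by the best possible ordering of the same returned rows.
    ideal = sorted(rows, key=lambda listing: (-listing.relevance, listing.product_id))
    ideal_dcg = dcg(ideal)
    # If no row has positive gain, report 0.0 instead of dividing by zero.
    return dcg(rows) / ideal_dcg if ideal_dcg else 0.0

def rerank(rows: list[Listing], score_field: str) -> list[Listing]:
    # Sort by descending score; product_id breaks ties so replays are deterministic.
    return sorted(rows, key=lambda listing: (-getattr(listing, score_field), listing.product_id))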

baseline_ranked = {
    query: rerank(candidates, "baseline_rank_score")
    for query, candidates in retrieved.items()
}
candidate_ranked = {
    query: rerank(candidates, "candidate_rank_score")
    for query, candidates in retrieved.items()
}

print("queries ranked:", list(candidate_ranked))

04-compare-retrieval-and-ranking.py

for query in QUERIES:
    print(
        query,
        f"baseline_ndcg@3={ndcg(baseline_ranked[query]):.3f}",
        f"candidate_ndcg@3={ndcg(candidate_ranked[query]):.3f}",
    )

blocked_hits = [
    listing.product_id
    for rows in candidate_ranked.values()
    for listing in rows
    if exclusion_reason(listing) is not None
]
print("blocked hits:", blocked_hits)

Output

insulated-bag baseline_ndcg@3=0.972 candidate_ndcg@3=1.000
label-printer baseline_ndcg@3=0.736 candidate_ndcg@3=1.000
blocked hits: []

The candidate improves ordering for both query fixtures. The metric helpers also make two edge policies explicit: a query with no relevant eligible judgments is a fixture error for this capstone, while NDCG returns 0.0 when a ranked set has no positive gain. This still isn't a launch decision. Offline judgments support a controlled experiment, and the displayed slate creates the future data used to evaluate or retrain the system.

Log exposure before learning from outcomes

Every displayed slate should log:

Field	Why
`request_id`, `query`, `served_at`	group one displayed decision
`shopper_id`, `experiment_id`, `experiment_arm`	replay stable experiment assignment
`catalog_snapshot`, `eligibility_version`	show available choices
`candidate_version`, `ranker_version`	identify scoring path
`product_id`, `position`	preserve exposure

Clicks alone aren't reliable targets: top-ranked items receive attention because they are top-ranked. Joachims, Swaminathan, and Schnabel show why position bias makes direct click training suboptimal and derive a counterfactual correction for biased implicit feedback. [2] Use judged offline sets for regression detection, then use a controlled A/B experiment to compare user outcomes. Kohavi et al. describe online experimentation as a disciplined way to measure product changes. [3]
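One way to picture a correction for position bias is inverse propensity weighting: a click at a position that shoppers rarely examine counts for more than a click at position one. The sketch below is a simplified illustration and not the estimator from the cited paper. The examination propensities and the click log are invented, and in practice the propensities would have to be estimated.

```python
# Assumed probability that a shopper examines each position (illustrative only).
EXAMINATION_PROPENSITY = {1: 1.0, 2: 0.6, 3: 0.4}

# Hypothetical click log: (product_id, displayed position, clicked?).
click_log = [
    ("P4", 1, True),
    ("P4", 1, True),
    ("P5", 3, True),
    ("P5", 3, False),
]


def ips_click_weight(position: int, clicked: bool) -> float:
    """Weight a click by the inverse of its position's examination propensity."""
    return 1.0 / EXAMINATION_PROPENSITY[position] if clicked else 0.0


raw_clicks: dict[str, int] = {}
weighted_clicks: dict[str, float] = {}
for product_id, position, clicked in click_log:
    raw_clicks[product_id] = raw_clicks.get(product_id, 0) + int(clicked)
    weighted_clicks[product_id] = (
        weighted_clicks.get(product_id, 0.0) + ips_click_weight(position, clicked)
    )

print("raw:", raw_clicks)
print("weighted:", weighted_clicks)
```

Raw counts favor P4, but after weighting, the single position-three click on P5 outweighs P4's two position-one clicks. The weighting is only possible because the displayed position was logged with the impression.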

Clicks, purchases, and returns arrive later. Append them as outcome events with `event_id`, `occurred_at`, `request_id`, `product_id`, and `outcome`. Don't overwrite the served impression row. An A/B plan should state the allocation, a duration or sample-size rule, the primary metric, latency and return-rate guardrails, the prohibited-listing invariant, and stop conditions. A conversion lift that increases returns or policy violations isn't a successful ranking release.
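A minimal sketch of an append-only outcome log follows. The `OutcomeLog` class is hypothetical. It only illustrates two properties of the outcome events described above: events are appended rather than written into impression rows, and a retried `event_id` is ignored.

```python
class OutcomeLog:
    """Hypothetical append-only store keyed by event_id."""

    REQUIRED = {"event_id", "occurred_at", "request_id", "product_id", "outcome"}

    def __init__(self) -> None:
        self._events: list[dict[str, str]] = []
        self._seen: set[str] = set()

    def append(self, event: dict[str, str]) -> bool:
        """Append the event; return False if it is a duplicate retry."""
        missing = self.REQUIRED - set(event)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        if event["event_id"] in self._seen:
            return False
        self._seen.add(event["event_id"])
        self._events.append(dict(event))  # copy so callers cannot mutate history
        return True

    def for_request(self, request_id: str) -> list[dict[str, str]]:
        """Return copies of all events recorded for one served request."""
        return [dict(event) for event in self._events if event["request_id"] == request_id]


log = OutcomeLog()
click = {
    "event_id": "EVT-900",
    "occurred_at": "2026-05-01T12:00:08Z",
    "request_id": "REQ-501",
    "product_id": "P5",
    "outcome": "click",
}
first_write = log.append(click)
retry_write = log.append(click)
print(first_write, retry_write, len(log.for_request("REQ-501")))
```

The first write is accepted and the retry is rejected, so the request keeps exactly one click event.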

The final cells publish replayable impression rows, illustrative joined outcome events, and one candidate receipt. They don't invent live lift. They show that the offline evidence, exposure schema, stable assignment, outcome join, and rollback pointer are ready before traffic begins.

05-experiment-arm-config.py

CATALOG_SNAPSHOT = "catalog-2026-05-01"
ELIGIBILITY_VERSION = "market-eligibility-v1"
CANDIDATE_VERSION = "retrieval-v1"
RANKER_VERSION = "market-ranker-v1"
PREVIOUS_RANKER = "market-ranker-v0"
EXPERIMENT_ID = "ranking-ab-2026-05"

ARM_CONFIG = {
    "control": (PREVIOUS_RANKER, baseline_ranked),
    "treatment": (RANKER_VERSION, candidate_ranked),
}

def impression_rows(
    request_id: str,
    shopper_id: str,
    served_at: str,
    query: str,
    arm: str,
) -> list[dict[str, object]]:
    ranker_version, ranked_by_query = ARM_CONFIG[arm]
    return [
        {
            "request_id": request_id,
            "shopper_id": shopper_id,
            "served_at": served_at,
            "query": query,
            "catalog_snapshot": CATALOG_SNAPSHOT,
            "eligibility_version": ELIGIBILITY_VERSION,
            "candidate_version": CANDIDATE_VERSION,
            "ranker_version": ranker_version,
            "experiment_id": EXPERIMENT_ID,
            "experiment_arm": arm,
            "product_id": listing.product_id,
            "position": position,
        }
        for position, listing in enumerate(ranked_by_query[query], start=1)
    ]

print("experiment:", EXPERIMENT_ID, "arms:", list(ARM_CONFIG))
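In the cell above, the caller passes the `arm` in. Stable assignment is often implemented by hashing the assignment unit together with the experiment ID, so a returning shopper lands in the same arm without a lookup table. The sketch below assumes that approach. `assign_arm` is a hypothetical helper, and a real experimentation platform may bucket differently.

```python
import hashlib


def assign_arm(shopper_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a shopper to an arm for one experiment."""
    digest = hashlib.sha256(f"{experiment_id}:{shopper_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the digest as a value in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


shoppers = [f"SHOP-{n}" for n in range(1000)]
first_pass = [assign_arm(shopper, "ranking-ab-2026-05") for shopper in shoppers]
second_pass = [assign_arm(shopper, "ranking-ab-2026-05") for shopper in shoppers]

print("stable:", first_pass == second_pass)
print("treatment share:", first_pass.count("treatment") / len(first_pass))
```

Including the experiment ID in the hash keeps assignments independent across experiments, so the same shopper is not always in treatment.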

06-log-impression-rows.py

control_impressions = impression_rows("REQ-500", "SHOP-100", "2026-05-01T12:00:00Z", "label-printer", "control")
treatment_impressions = impression_rows("REQ-501", "SHOP-200", "2026-05-01T12:00:05Z", "label-printer", "treatment")
impressions = control_impressions + treatment_impressions
outcomes = [
    {
        "event_id": "EVT-900",
        "occurred_at": "2026-05-01T12:00:08Z",
        "request_id": "REQ-501",
        "product_id": "P5",
        "outcome": "click",
    },
    {
        "event_id": "EVT-901",
        "occurred_at": "2026-05-01T12:04:18Z",
        "request_id": "REQ-501",
        "product_id": "P5",
        "outcome": "purchase",
    },
]

print("impressions:", len(impressions), "outcomes:", len(outcomes))

07-join-outcomes-to-impressions.py

required_impression_fields = {
    "request_id", "shopper_id", "served_at", "query", "catalog_snapshot",
    "eligibility_version", "candidate_version", "ranker_version",
    "experiment_id", "experiment_arm", "product_id", "position",
}
required_outcome_fields = {"event_id", "occurred_at", "request_id", "product_id", "outcome"}
outcome_event_ids = [row["event_id"] for row in outcomes]
latencies_ms = [11, 13, 12, 15, 14, 16, 13, 12, 17, 18]

def nearest_rank(values: list[int], percentile: float) -> int:
    # Nearest-rank percentile: pick the sample at 1-based rank ceil(n * percentile).
    if not values:
        raise ValueError("latency sample must not be empty")
    rank = max(1, ceil(len(values) * percentile))
    return sorted(values)[rank - 1]

baseline_ndcg = sum(ndcg(rows) for rows in baseline_ranked.values()) / len(QUERIES)
candidate_ndcg = sum(ndcg(rows) for rows in candidate_ranked.values()) / len(QUERIES)
max_seller_count = max(
    max(Counter(listing.seller_id for listing in rows).values())
    for rows in candidate_ranked.values()
)
impression_index = {
    (row["request_id"], row["product_id"]): row
    for row in impressions
}
shopper_arms: dict[str, set[str]] = {}
for row in impressions:
    shopper_arms.setdefault(row["shopper_id"], set()).add(row["experiment_arm"])

def parse_utc(value: str) -> datetime:
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def outcome_follows_impression(row: dict[str, object]) -> bool:
    impression = impression_index.get((row["request_id"], row["product_id"]))
    return impression is not None and parse_utc(impression["served_at"]) < parse_utc(row["occurred_at"])

print("baseline_ndcg:", round(baseline_ndcg, 3), "candidate_ndcg:", round(candidate_ndcg, 3))

08-offline-release-gates.py

release_gates = {
    "candidate_recall_complete": all(candidate_recall(query, retrieved[query]) == 1.0 for query in QUERIES),
    "candidate_ndcg_improves": candidate_ndcg > baseline_ndcg,
    "ndcg_slices_at_least_0_95": all(ndcg(rows) >= 0.95 for rows in candidate_ranked.values()),
    "ranker_p95_latency_ms_at_most_20": nearest_rank(latencies_ms, 0.95) <= 20,
    "blocked_items_absent": not blocked_hits,
    "impression_schema": all(required_impression_fields <= row.keys() for row in impressions),
    "outcome_schema": all(required_outcome_fields <= row.keys() for row in outcomes),
    "outcome_event_ids_unique": len(outcome_event_ids) == len(set(outcome_event_ids)),
    "outcomes_join_impressions": all((row["request_id"], row["product_id"]) in impression_index for row in outcomes),
    "outcomes_follow_impressions": all(outcome_follows_impression(row) for row in outcomes),
    "one_arm_per_shopper": all(len(arms) == 1 for arms in shopper_arms.values()),
    "seller_concentration_at_most_2_per_slate": max_seller_count <= 2,
}

print("release_gates_pass:", all(release_gates.values()))

09-assemble-experiment-receipt.py

receipt = {
    "bundle_id": RANKER_VERSION,
    "previous_ranker": PREVIOUS_RANKER,
    "catalog_snapshot": CATALOG_SNAPSHOT,
    "eligibility_version": ELIGIBILITY_VERSION,
    "candidate_version": CANDIDATE_VERSION,
    "offline": {
        "baseline_ndcg_at_3": round(baseline_ndcg, 3),
        "candidate_ndcg_at_3": round(candidate_ndcg, 3),
        "ranker_p95_latency_ms": nearest_rank(latencies_ms, 0.95),
    },
    "release_gates": release_gates,
    "experiment": {
        "experiment_id": EXPERIMENT_ID,
        "assignment_unit": "shopper_id",
        "control": PREVIOUS_RANKER,
        "treatment": RANKER_VERSION,
        "allocation": {"control": 0.5, "treatment": 0.5},
        "minimum_eligible_searches": 10000,
        "primary_metric": "purchase_conversion_per_search",
        "guardrails": ["returns_per_purchase", "search_p95_latency_ms", "blocked_listing_impressions", "seller_concentration_at_3"],
        "stop_conditions": [
            "blocked_listing_impressions > 0",
            "search_p95_latency_ms > 120",
            "returns_per_purchase > control + 0.01",
        ],
    },
    "candidate_decision": "candidate_for_ab_test" if all(release_gates.values()) else "hold",
}

print("candidate_decision:", receipt["candidate_decision"])

10-publish-ranking-experiment-receipt.py

print("control slate:", [row["product_id"] for row in control_impressions])
print("treatment slate:", [row["product_id"] for row in treatment_impressions])
print("first treatment impression:", json.dumps(treatment_impressions[0], sort_keys=True))
print("first treatment outcome:", json.dumps(outcomes[0], sort_keys=True))
print(json.dumps(receipt, indent=2))

Output

control slate: ['P4', 'P5', 'P6']
treatment slate: ['P5', 'P6', 'P4']
first treatment impression: {"candidate_version": "retrieval-v1", "catalog_snapshot": "catalog-2026-05-01", "eligibility_version": "market-eligibility-v1", "experiment_arm": "treatment", "experiment_id": "ranking-ab-2026-05", "position": 1, "product_id": "P5", "query": "label-printer", "ranker_version": "market-ranker-v1", "request_id": "REQ-501", "served_at": "2026-05-01T12:00:05Z", "shopper_id": "SHOP-200"}
first treatment outcome: {"event_id": "EVT-900", "occurred_at": "2026-05-01T12:00:08Z", "outcome": "click", "product_id": "P5", "request_id": "REQ-501"}
{
  "bundle_id": "market-ranker-v1",
  "previous_ranker": "market-ranker-v0",
  "catalog_snapshot": "catalog-2026-05-01",
  "eligibility_version": "market-eligibility-v1",
  "candidate_version": "retrieval-v1",
  "offline": {
    "baseline_ndcg_at_3": 0.854,
    "candidate_ndcg_at_3": 1.0,
    "ranker_p95_latency_ms": 18
  },
  "release_gates": {
    "candidate_recall_complete": true,
    "candidate_ndcg_improves": true,
    "ndcg_slices_at_least_0_95": true,
    "ranker_p95_latency_ms_at_most_20": true,
    "blocked_items_absent": true,
    "impression_schema": true,
    "outcome_schema": true,
    "outcome_event_ids_unique": true,
    "outcomes_join_impressions": true,
    "outcomes_follow_impressions": true,
    "one_arm_per_shopper": true,
    "seller_concentration_at_most_2_per_slate": true
  },
  "experiment": {
    "experiment_id": "ranking-ab-2026-05",
    "assignment_unit": "shopper_id",
    "control": "market-ranker-v0",
    "treatment": "market-ranker-v1",
    "allocation": {
      "control": 0.5,
      "treatment": 0.5
    },
    "minimum_eligible_searches": 10000,
    "primary_metric": "purchase_conversion_per_search",
    "guardrails": [
      "returns_per_purchase",
      "search_p95_latency_ms",
      "blocked_listing_impressions",
      "seller_concentration_at_3"
    ],
    "stop_conditions": [
      "blocked_listing_impressions > 0",
      "search_p95_latency_ms > 120",
      "returns_per_purchase > control + 0.01"
    ]
  },
  "candidate_decision": "candidate_for_ab_test"
}

`candidate_for_ab_test` is intentionally narrower than launch approval. The local 20-millisecond gate measures ranker scoring only; the experiment's 120-millisecond stop condition applies to end-to-end search latency. The receipt says which ranker deserves controlled exposure, which immutable logs make later outcomes interpretable, and which previous alias remains available if guardrails fail.
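To make the rollback pointer concrete, here is a sketch of how the receipt's stop conditions might be evaluated against live guardrail readings. The `tripped_stop_conditions` helper, the metric readings, and the alias switch are assumptions for illustration; the thresholds mirror the receipt's `stop_conditions`.

```python
def tripped_stop_conditions(
    treatment: dict[str, float],
    control: dict[str, float],
) -> list[str]:
    """Return the names of stop conditions that the treatment arm violates."""
    tripped = []
    if treatment["blocked_listing_impressions"] > 0:
        tripped.append("blocked_listing_impressions")
    if treatment["search_p95_latency_ms"] > 120:
        tripped.append("search_p95_latency_ms")
    if treatment["returns_per_purchase"] > control["returns_per_purchase"] + 0.01:
        tripped.append("returns_per_purchase")
    return tripped


# Hypothetical guardrail readings partway through the experiment.
control_metrics = {"returns_per_purchase": 0.040}
treatment_metrics = {
    "blocked_listing_impressions": 0,
    "search_p95_latency_ms": 96.0,
    "returns_per_purchase": 0.062,
}

tripped = tripped_stop_conditions(treatment_metrics, control_metrics)
# Fall back to the previous ranker alias as soon as any stop condition trips.
active_ranker = "market-ranker-v0" if tripped else "market-ranker-v1"
print("tripped:", tripped, "serving:", active_ranker)
```

With these readings, latency and blocked listings are fine, but the return rate exceeds control by more than 0.01, so traffic goes back to the previous alias.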

Submission checklist

Artifact	Evidence
eligibility policy	blocked items can't enter ranking
candidate evaluation	recall measured before reranking
ranker evaluation	NDCG slices and latency recorded
impression schema	positions, versions, and assignment context logged before outcomes
outcome stream	later events append separately and join displayed products
experiment plan	stable assignment, metrics, guardrails, stop rules
rollback	stable ranker alias remains available

Practice: break the ranking contract

Use the runnable examples as a release harness. Change one condition at a time, predict the result, then rerun the examples.

Change P3's retrieval score from 0.84 to 0.10, then change the `retrieve()` default budget from 3 to 2. Which metric exposes the missing relevant product?
Set `P9.policy_approved` to True. Why isn't that a harmless relevance change?
Change the candidate score for P6 from 0.84 to 0.20. Which NDCG slice drops, and does it still clear the 0.95 slice gate?
Remove `position` from `impression_rows()`. Which receipt gate blocks experiment readiness?
Change all three eligible insulated-bag listings to seller S1. Which marketplace guardrail fails?
Use SHOP-100 for both control and treatment impressions. Which assignment gate fails?
Change the first outcome's `product_id` from P5 to P9. Which replay gates fail?
Append a retry with the same `event_id` as EVT-900. Which deduplication gate fails?

Mastery check

Evaluation rubric

Artifact	Strong submission demonstrates
ranking stack	explicit eligibility, candidate generation, reranking inputs, and replayable slate logs
evaluation	NDCG-style offline result with policy guardrails and slice review
experiment plan	immutable impressions, joined outcomes, stable assignment, guardrails, stopping rule, and rollback

Common failures

Symptom	Cause	Fix
Great NDCG while prohibited item appears	ranking runs before eligibility	filter first and test invariant
Strong ranker metric while relevant product never appears	candidate retrieval dropped it before reranking	gate candidate recall separately
New model increasingly favors former winners	clicks used without exposure evidence	log impressions and experiment arms
Outcome history changes after retries	mutable impression row overwritten with later status	append outcome events and join by request plus product
Same shopper appears in both A/B arms	assignment identity wasn't preserved	log shopper and experiment IDs before exposure
Conversion increases while customer value falls	returns absent from guardrails	gate promotion on downstream outcomes

Next Step

Continue to Capstone: Demand Forecasting

You can now ship a measured ranking surface whose outputs influence future data. Next you'll ship time-ordered forecasts and alerts for inventory and warehouse capacity.

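References

[1] Burges, C. J. C., et al. (2005). Learning to Rank using Gradient Descent. ICML 2005. https://www.microsoft.com/en-us/research/publication/learning-to-rank-using-gradient-descent/

[2] Joachims, T., Swaminathan, A., & Schnabel, T. (2017). Unbiased Learning-to-Rank with Biased Feedback. WSDM 2017. https://arxiv.org/abs/1608.04468

[3] Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. https://experimentguide.com/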

outcome_event_ids = [row["event_id"] for row in outcomes]
latencies_ms = [11, 13, 12, 15, 14, 16, 13, 12, 17, 18]

def nearest_rank(values: list[int], percentile: float) -> int:
    if not values:
        raise ValueError("latency sample must not be empty")
    rank = max(1, ceil(len(values) * percentile))
    return sorted(values)[rank - 1]

baseline_ndcg = sum(ndcg(rows) for rows in baseline_ranked.values()) / len(QUERIES)
candidate_ndcg = sum(ndcg(rows) for rows in candidate_ranked.values()) / len(QUERIES)
max_seller_count = max(
    max(Counter(listing.seller_id for listing in rows).values())
    for rows in candidate_ranked.values()
)
impression_index = {
    (row["request_id"], row["product_id"]): row
    for row in impressions
}
shopper_arms: dict[str, set[str]] = {}
for row in impressions:
    shopper_arms.setdefault(row["shopper_id"], set()).add(row["experiment_arm"])

def parse_utc(value: str) -> datetime:
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def outcome_follows_impression(row: dict[str, object]) -> bool:
    impression = impression_index.get((row["request_id"], row["product_id"]))
    return impression is not None and parse_utc(impression["served_at"]) < parse_utc(row["occurred_at"])

print("baseline_ndcg:", round(baseline_ndcg, 3), "candidate_ndcg:", round(candidate_ndcg, 3))
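
With only ten latency samples, the nearest-rank p95 is the largest sample, so a single slow request decides the gate. The sketch below repeats the nearest_rank() helper so it runs on its own. The 45 ms outlier is an invented value for illustration.

```python
from math import ceil

def nearest_rank(values: list[int], percentile: float) -> int:
    if not values:
        raise ValueError("latency sample must not be empty")
    # Nearest-rank: take the value at 1-based rank ceil(n * percentile).
    rank = max(1, ceil(len(values) * percentile))
    return sorted(values)[rank - 1]

latencies_ms = [11, 13, 12, 15, 14, 16, 13, 12, 17, 18]
p95 = nearest_rank(latencies_ms, 0.95)  # rank = ceil(10 * 0.95) = 10, the maximum
p50 = nearest_rank(latencies_ms, 0.50)  # rank = 5

# Hypothetical sample: replace the slowest request with a 45 ms outlier.
with_outlier = latencies_ms[:-1] + [45]
p95_outlier = nearest_rank(with_outlier, 0.95)

print("p50:", p50, "p95:", p95, "p95 with outlier:", p95_outlier)
print("gate passes:", p95 <= 20, "gate passes with outlier:", p95_outlier <= 20)
```

A larger latency sample would make the gate less sensitive to one request. The sample size needed is a judgment call that this project does not fix.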

08-offline-release-gates.py

release_gates = {
    "candidate_recall_complete": all(candidate_recall(query, retrieved[query]) == 1.0 for query in QUERIES),
    "candidate_ndcg_improves": candidate_ndcg > baseline_ndcg,
    "ndcg_slices_at_least_0_95": all(ndcg(rows) >= 0.95 for rows in candidate_ranked.values()),
    "ranker_p95_latency_ms_at_most_20": nearest_rank(latencies_ms, 0.95) <= 20,
    "blocked_items_absent": not blocked_hits,
    "impression_schema": all(required_impression_fields <= row.keys() for row in impressions),
    "outcome_schema": all(required_outcome_fields <= row.keys() for row in outcomes),
    "outcome_event_ids_unique": len(outcome_event_ids) == len(set(outcome_event_ids)),
    "outcomes_join_impressions": all((row["request_id"], row["product_id"]) in impression_index for row in outcomes),
    "outcomes_follow_impressions": all(outcome_follows_impression(row) for row in outcomes),
    "one_arm_per_shopper": all(len(arms) == 1 for arms in shopper_arms.values()),
    "seller_concentration_at_most_2_per_slate": max_seller_count <= 2,
}

print("release_gates_pass:", all(release_gates.values()))

09-assemble-experiment-receipt.py

receipt = {
    "bundle_id": RANKER_VERSION,
    "previous_ranker": PREVIOUS_RANKER,
    "catalog_snapshot": CATALOG_SNAPSHOT,
    "eligibility_version": ELIGIBILITY_VERSION,
    "candidate_version": CANDIDATE_VERSION,
    "offline": {
        "baseline_ndcg_at_3": round(baseline_ndcg, 3),
        "candidate_ndcg_at_3": round(candidate_ndcg, 3),
        "ranker_p95_latency_ms": nearest_rank(latencies_ms, 0.95),
    },
    "release_gates": release_gates,
    "experiment": {
        "experiment_id": EXPERIMENT_ID,
        "assignment_unit": "shopper_id",
        "control": PREVIOUS_RANKER,
        "treatment": RANKER_VERSION,
        "allocation": {"control": 0.5, "treatment": 0.5},
        "minimum_eligible_searches": 10000,
        "primary_metric": "purchase_conversion_per_search",
        "guardrails": ["returns_per_purchase", "search_p95_latency_ms", "blocked_listing_impressions", "seller_concentration_at_3"],
        "stop_conditions": [
            "blocked_listing_impressions > 0",
            "search_p95_latency_ms > 120",
            "returns_per_purchase > control + 0.01",
        ],
    },
    "candidate_decision": "candidate_for_ab_test" if all(release_gates.values()) else "hold",
}

print("candidate_decision:", receipt["candidate_decision"])

10-publish-ranking-experiment-receipt.py

print("control slate:", [row["product_id"] for row in control_impressions])
print("treatment slate:", [row["product_id"] for row in treatment_impressions])
print("first treatment impression:", json.dumps(treatment_impressions[0], sort_keys=True))
print("first treatment outcome:", json.dumps(outcomes[0], sort_keys=True))
print(json.dumps(receipt, indent=2))

Output

control slate: ['P4', 'P5', 'P6']
treatment slate: ['P5', 'P6', 'P4']
first treatment impression: {"candidate_version": "retrieval-v1", "catalog_snapshot": "catalog-2026-05-01", "eligibility_version": "market-eligibility-v1", "experiment_arm": "treatment", "experiment_id": "ranking-ab-2026-05", "position": 1, "product_id": "P5", "query": "label-printer", "ranker_version": "market-ranker-v1", "request_id": "REQ-501", "served_at": "2026-05-01T12:00:05Z", "shopper_id": "SHOP-200"}
first treatment outcome: {"event_id": "EVT-900", "occurred_at": "2026-05-01T12:00:08Z", "outcome": "click", "product_id": "P5", "request_id": "REQ-501"}
{
  "bundle_id": "market-ranker-v1",
  "previous_ranker": "market-ranker-v0",
  "catalog_snapshot": "catalog-2026-05-01",
  "eligibility_version": "market-eligibility-v1",
  "candidate_version": "retrieval-v1",
  "offline": {
    "baseline_ndcg_at_3": 0.854,
    "candidate_ndcg_at_3": 1.0,
    "ranker_p95_latency_ms": 18
  },
  "release_gates": {
    "candidate_recall_complete": true,
    "candidate_ndcg_improves": true,
    "ndcg_slices_at_least_0_95": true,
    "ranker_p95_latency_ms_at_most_20": true,
    "blocked_items_absent": true,
    "impression_schema": true,
    "outcome_schema": true,
    "outcome_event_ids_unique": true,
    "outcomes_join_impressions": true,
    "outcomes_follow_impressions": true,
    "one_arm_per_shopper": true,
    "seller_concentration_at_most_2_per_slate": true
  },
  "experiment": {
    "experiment_id": "ranking-ab-2026-05",
    "assignment_unit": "shopper_id",
    "control": "market-ranker-v0",
    "treatment": "market-ranker-v1",
    "allocation": {
      "control": 0.5,
      "treatment": 0.5
    },
    "minimum_eligible_searches": 10000,
    "primary_metric": "purchase_conversion_per_search",
    "guardrails": [
      "returns_per_purchase",
      "search_p95_latency_ms",
      "blocked_listing_impressions",
      "seller_concentration_at_3"
    ],
    "stop_conditions": [
      "blocked_listing_impressions > 0",
      "search_p95_latency_ms > 120",
      "returns_per_purchase > control + 0.01"
    ]
  },
  "candidate_decision": "candidate_for_ab_test"
}
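
The receipt names shopper_id as the assignment unit but doesn't show how arms are assigned. One common approach, sketched here as an assumption rather than as this project's method, is to hash the experiment and shopper IDs into a fixed number of buckets so the same shopper always lands in the same arm. `assign_arm` and the 10,000-bucket resolution are hypothetical choices.

```python
import hashlib

BUCKETS = 10_000  # assumed resolution for the allocation split

def assign_arm(experiment_id: str, shopper_id: str, treatment_share: float = 0.5) -> str:
    # Hash both IDs so assignments are independent across experiments.
    digest = hashlib.sha256(f"{experiment_id}:{shopper_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return "treatment" if bucket < treatment_share * BUCKETS else "control"

first = assign_arm("ranking-ab-2026-05", "SHOP-100")
second = assign_arm("ranking-ab-2026-05", "SHOP-100")
print("stable assignment:", first == second)

# Rough check of the split over synthetic shopper IDs.
arms = [assign_arm("ranking-ab-2026-05", f"SHOP-{n}") for n in range(10_000)]
print("treatment share:", arms.count("treatment") / len(arms))
```

Because the arm is a pure function of the two IDs, it can be recomputed during replay and compared with the logged experiment_arm.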

Submission checklist

Artifact	Evidence
eligibility policy	blocked items can't enter ranking
candidate evaluation	recall measured before reranking
ranker evaluation	NDCG slices and latency recorded
impression schema	positions, versions, and assignment context logged before outcomes
outcome stream	later events append separately and join displayed products
experiment plan	stable assignment, metrics, guardrails, stop rules
rollback	stable ranker alias remains available
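
The stop rules in the experiment plan are written as strings in the receipt. A minimal sketch of evaluating them against per-arm metrics follows. The metric values are invented, and `triggered_stops` is a hypothetical helper, not part of the harness above.

```python
def triggered_stops(control: dict[str, float], treatment: dict[str, float]) -> list[str]:
    # Each check mirrors one stop condition string from the receipt.
    stops = []
    if treatment["blocked_listing_impressions"] > 0:
        stops.append("blocked_listing_impressions > 0")
    if treatment["search_p95_latency_ms"] > 120:
        stops.append("search_p95_latency_ms > 120")
    if treatment["returns_per_purchase"] > control["returns_per_purchase"] + 0.01:
        stops.append("returns_per_purchase > control + 0.01")
    return stops

# Invented metrics for illustration only.
control_metrics = {"blocked_listing_impressions": 0, "search_p95_latency_ms": 95, "returns_per_purchase": 0.050}
healthy_treatment = {"blocked_listing_impressions": 0, "search_p95_latency_ms": 101, "returns_per_purchase": 0.052}
risky_treatment = {"blocked_listing_impressions": 0, "search_p95_latency_ms": 101, "returns_per_purchase": 0.075}

print("healthy:", triggered_stops(control_metrics, healthy_treatment))
print("risky:", triggered_stops(control_metrics, risky_treatment))
```

Any non-empty result would send traffic back to the previous ranker.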

Practice: break the ranking contract

Use the runnable examples as a release harness. Change one condition at a time, predict the result, then rerun the examples.

Change P3's retrieval score from 0.84 to 0.10, then change the retrieve() default budget from 3 to 2. Which metric exposes the missing relevant product?
Set P9.policy_approved to True. Why isn't that a harmless relevance change?
Change the candidate score for P6 from 0.84 to 0.20. Which NDCG slice fails?
Remove position from impression_rows(). Which receipt gate blocks experiment readiness?
Change all three eligible insulated-bag listings to seller S1. Which marketplace guardrail fails?
Use SHOP-100 for both the control and treatment impressions. Which assignment gate fails?
Change the first outcome's product_id from P5 to P9. Which replay gate fails?
Append a retry with the same event_id as EVT-900. Which deduplication gate fails?
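
To make the predict-then-rerun loop concrete, it can help to snapshot the gates before a change and compare them afterwards. The sketch below assumes both runs produce a dict shaped like release_gates. `changed_gates` is a hypothetical helper, and the example values are illustrative rather than the result of any exercise above.

```python
def changed_gates(before: dict[str, bool], after: dict[str, bool]) -> dict[str, tuple[bool, bool]]:
    # Report every gate present in both runs whose value differs, as (before, after).
    shared = sorted(before.keys() & after.keys())
    return {name: (before[name], after[name]) for name in shared if before[name] != after[name]}

# Illustrative snapshots only.
gates_before = {
    "candidate_recall_complete": True,
    "blocked_items_absent": True,
    "ranker_p95_latency_ms_at_most_20": True,
}
gates_after = dict(gates_before, ranker_p95_latency_ms_at_most_20=False)

print(changed_gates(gates_before, gates_after))
```

If more than one gate flips after a single change, the gates aren't independent and the interaction deserves a note in the receipt.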

Practice answer sketches

Mastery check

Evaluation rubric

Artifact	Strong submission demonstrates
ranking stack	explicit eligibility, candidate generation, reranking inputs, and replayable slate logs
evaluation	NDCG-style offline result with policy guardrails and slice review
experiment plan	immutable impressions, joined outcomes, stable assignment, guardrails, stopping rule, and rollback

Common failures

Symptom	Cause	Fix
High NDCG while a prohibited item appears	ranking runs before eligibility	filter first and test the invariant
Strong ranker metric while a relevant product never appears	candidate retrieval dropped it before reranking	gate candidate recall separately
New model increasingly favors former winners	clicks used without exposure evidence	log impressions and experiment arms
Outcome history changes after retries	mutable impression row overwritten with later status	append outcome events and join by request plus product
Same shopper appears in both A/B arms	assignment identity wasn't preserved	log shopper and experiment IDs before exposure
Conversion increases while customer value falls	returns absent from guardrails	gate promotion on downstream outcomes
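
The first row's fix, filtering before ranking and testing the invariant, can be sketched as follows. The `Item` class, the scores, and the flag names are hypothetical. Only the reasons P8 and P11 are ineligible come from the release evidence.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    product_id: str
    score: float  # hypothetical raw match score
    in_stock: bool
    policy_approved: bool
    deliverable: bool

def is_eligible(item: Item) -> bool:
    return item.in_stock and item.policy_approved and item.deliverable

def rank_eligible(items: list[Item], limit: int = 3) -> list[Item]:
    # Eligibility runs first, so a high score can never rescue a blocked listing.
    eligible = [item for item in items if is_eligible(item)]
    return sorted(eligible, key=lambda item: (-item.score, item.product_id))[:limit]

catalog = [
    Item("P8", 0.99, in_stock=False, policy_approved=False, deliverable=True),
    Item("P11", 0.97, in_stock=True, policy_approved=True, deliverable=False),
    Item("P5", 0.91, in_stock=True, policy_approved=True, deliverable=True),
    Item("P6", 0.84, in_stock=True, policy_approved=True, deliverable=True),
    Item("P4", 0.60, in_stock=True, policy_approved=True, deliverable=True),
]

slate = rank_eligible(catalog)
# The invariant: nothing ineligible reaches the displayed slate.
assert all(is_eligible(item) for item in slate)

# Ranking without the filter lets the two strongest raw matches through.
unfiltered = sorted(catalog, key=lambda item: (-item.score, item.product_id))[:3]
violations = [item.product_id for item in unfiltered if not is_eligible(item)]
print("slate:", [item.product_id for item in slate], "unfiltered violations:", violations)
```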

Next Step

Continue to Capstone: Demand Forecasting

You can now ship a measured ranking surface whose outputs influence future data. Next you'll ship time-ordered forecasts and alerts for inventory and warehouse capacity.


References

Learning to Rank using Gradient Descent.

Burges, C. J. C., et al. · 2005 · ICML 2005

Unbiased Learning-to-Rank with Biased Feedback

Joachims, T., Swaminathan, A., & Schnabel, T. · 2017 · WSDM 2017

Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing.

Kohavi, R., Tang, D., Xu, Y. · 2020

Practice answer sketches

Which metric catches dropped P3 after retrieval budget falls to 2?

Why isn't approving P9 a harmless relevance edit?

What happens when P6 candidate score falls from 0.84 to 0.20?

Which gate fails when impression rows drop position?

Which gate fails when three products in one displayed slate share seller S1?

Which gate fails when SHOP-100 appears in both experiment arms?

Which gate fails when outcome product P9 was never displayed for REQ-501?

Which gate fails when a retry reuses EVT-900?

Mastery check

Why does this capstone require candidate recall and NDCG separately?

Why is candidate_for_ab_test stronger than a demo but weaker than a launch?

Which log field protects later training from treating placement as relevance?

Why append outcomes separately instead of updating impression rows in place?
