Ship a marketplace ranking candidate with eligible retrieval, NDCG evidence, exposure logging, and A/B guardrails.
The ETA project produced one guarded prediction for one parcel. This project makes a decision that shapes its own future data: which products shoppers see near the top of search results.
Your task is to ship a ranking candidate for searches such as "insulated delivery bag". It may reorder eligible, in-stock listings. It may not surface blocked listings, disguise sponsored placement, or treat clicks as an unbiased truth signal.
Define the scope tightly:
| Contract | Decision |
|---|---|
| surface | marketplace search results |
| query set | frozen judged search queries plus online experiment traffic |
| eligibility | in-stock, deliverable region, policy-approved listing |
| offline metric | candidate recall@100 and NDCG@10 |
| online primary metric | purchase conversion per search |
| guardrails | returns, latency, unsafe listings, seller concentration |
Candidate generation and reranking are separate artifacts. The candidate component must retrieve relevant items reliably. The ranker can only adjust order inside that eligible set.
Learning-to-rank methods can learn pairwise preferences so a relevant product receives a higher score than a less relevant candidate.[1] That modeling detail doesn't excuse missing policy: eligibility filtering must run before scoring, and its version belongs in every impression log.
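The filter-before-score ordering can be sketched as follows. This is a minimal illustration, not the project's actual ranker: the `Listing` fields, the `ELIGIBILITY_VERSION` string, and the toy scores are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical version label; the project would read it from eligibility_policy.json.
ELIGIBILITY_VERSION = "policy-v1"

@dataclass(frozen=True)
class Listing:
    product_id: str
    in_stock: bool
    deliverable: bool
    policy_approved: bool

def is_eligible(listing):
    # Mirrors the three eligibility conditions in the scope table.
    return listing.in_stock and listing.deliverable and listing.policy_approved

def rank(listings, score):
    # Filter first, so a blocked listing never reaches the scoring function.
    eligible = [item for item in listings if is_eligible(item)]
    ordered = sorted(eligible, key=score, reverse=True)
    # The policy version travels with the slate so it can be logged per impression.
    return {
        "eligibility_version": ELIGIBILITY_VERSION,
        "results": [item.product_id for item in ordered],
    }

catalog = [
    Listing("P1", True, True, True),
    Listing("P9", True, True, False),  # policy-blocked
    Listing("P3", True, True, True),
]
# Toy scores: the blocked listing deliberately has the highest score.
toy_scores = {"P1": 0.4, "P9": 0.9, "P3": 0.7}
slate = rank(catalog, score=lambda item: toy_scores[item.product_id])
print(slate)
```

Even though `P9` has the highest toy score, it cannot appear in the slate, because the scoring function never sees it.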
Create a judged fixture set. Each query has eligible candidates and graded relevance from human review:
```
ranking-product/
  data/
    catalog_snapshot.jsonl
    judged_queries.jsonl
    eligibility_policy.json
  retrieval/
    candidate_generator.py
  ranking/
    ranker.py
    evaluate_ndcg.py
  experiments/
    impression_schema.json
    ab_plan.md
  tests/
    test_blocked_listing_never_surfaces.py
    test_ndcg_regression_gate.py
```
Required offline checks:
| Check | Why it blocks release |
|---|---|
| eligible-only results | a relevant prohibited listing still can't display |
| candidate recall@100 | the ranker can't repair missing products |
| NDCG@10 by query category | overall improvement may hide poor critical categories |
| scoring latency | a slower list damages the shopping experience |
| diversity/seller concentration | one feedback-rich seller shouldn't crowd out the rest of the catalog |
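The candidate recall check can be sketched in a few lines. The product IDs and the choice to skip queries with no judged-relevant items are assumptions made for illustration.

```python
def recall_at_k(retrieved, relevant, k=100):
    """Fraction of judged-relevant products found in the top-k retrieved candidates."""
    if not relevant:
        return None  # assumption: skip queries that have no judged-relevant items
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

# Hypothetical candidate list and judged-relevant set for one query.
retrieved = ["P1", "P7", "P3", "P2"]
relevant = {"P1", "P2", "P5"}

recall = recall_at_k(retrieved, relevant, k=100)
print("recall@100:", round(recall, 3))  # P5 was never retrieved, so recall is 2/3
```

Here `P5` is relevant but missing from the candidate list, so no reranking can place it on the page. That is why recall is measured before the ranker is evaluated.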
This compact evaluator checks that the candidate improves mean NDCG across two queries and never surfaces a blocked product. In the full artifact, judgments would live in versioned data files.
```python
from math import log2

# Each slate is a list of (product_id, graded_relevance) pairs in displayed order.
queries = [
    {
        "id": "insulated-bag",
        "blocked": {"P9"},
        "baseline": [("P1", 3), ("P2", 1), ("P3", 2)],
        "candidate": [("P1", 3), ("P3", 2), ("P2", 1)],
    },
    {
        "id": "label-printer",
        "blocked": {"P8"},
        "baseline": [("P4", 1), ("P5", 3), ("P6", 2)],
        "candidate": [("P5", 3), ("P6", 2), ("P4", 1)],
    },
]

def dcg(rows):
    # Exponential gain with a log2 position discount; rank is 0-based.
    return sum((2**rel - 1) / log2(rank + 2) for rank, (_, rel) in enumerate(rows))

def ndcg(rows):
    # The ideal order is built from the displayed rows only.
    ideal = sorted(rows, key=lambda row: row[1], reverse=True)
    ideal_dcg = dcg(ideal)
    # Guard against a slate where every judgment is 0.
    return dcg(rows) / ideal_dcg if ideal_dcg else 0.0

baseline = sum(ndcg(query["baseline"]) for query in queries) / len(queries)
candidate = sum(ndcg(query["candidate"]) for query in queries) / len(queries)
# Any blocked product in a candidate slate is a release blocker.
blocked_hits = [product for query in queries for product, _ in query["candidate"] if product in query["blocked"]]

decision = "eligible_for_ab_review" if candidate > baseline and not blocked_hits else "hold"
print("baseline:", round(baseline, 3))
print("candidate:", round(candidate, 3))
print("blocked hits:", blocked_hits)
print("decision:", decision)
```

```
baseline: 0.854
candidate: 1.0
blocked hits: []
decision: eligible_for_ab_review
```
This decision is intentionally limited. Offline judged queries support an online experiment, not broad promotion. Live users may respond differently from judges, and new ranking exposure changes the future outcome data.
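An overall mean can hide a category that got worse, which is why the offline checks ask for NDCG by query category. A minimal sketch of such a slice gate follows. The per-query NDCG values, the category names, and the `TOLERANCE` threshold are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-query NDCG@10 results, tagged with a query category.
rows = [
    {"category": "bags", "baseline": 0.80, "candidate": 0.95},
    {"category": "bags", "baseline": 0.70, "candidate": 0.90},
    {"category": "printers", "baseline": 0.90, "candidate": 0.75},
]
TOLERANCE = 0.01  # assumed maximum allowed drop in any single category

def mean(values):
    return sum(values) / len(values)

by_category = defaultdict(list)
for row in rows:
    by_category[row["category"]].append(row)

overall_delta = mean([row["candidate"] - row["baseline"] for row in rows])

# Collect every category whose mean NDCG dropped by more than the tolerance.
regressions = {}
for category, items in by_category.items():
    delta = mean([r["candidate"] for r in items]) - mean([r["baseline"] for r in items])
    if delta < -TOLERANCE:
        regressions[category] = round(delta, 3)

slice_decision = "hold" if regressions or overall_delta <= 0 else "eligible_for_ab_review"
print("overall delta:", round(overall_delta, 3))
print("regressed categories:", regressions)
print("decision:", slice_decision)
```

In this toy data the overall delta is positive, yet the candidate is held because the `printers` slice regressed.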
Every displayed slate should log:
| Field | Why |
|---|---|
| `request_id`, `query` | group one decision |
| `catalog_snapshot`, `eligibility_version` | show available choices |
| `candidate_version`, `ranker_version` | identify scoring path |
| `product_id`, `position` | preserve exposure |
| `clicked`, `purchased`, `returned` | distinguish shallow and useful outcomes |
| experiment arm | estimate treatment effect |
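One way to turn these fields into log rows is one record per displayed product. The sketch below is an assumption about shape only: the `ImpressionRow` class, the `experiment_arm` field name, and the example values are hypothetical and would need to match `impression_schema.json`.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class ImpressionRow:
    request_id: str
    query: str
    catalog_snapshot: str
    eligibility_version: str
    candidate_version: str
    ranker_version: str
    experiment_arm: str
    product_id: str
    position: int
    # Outcomes start as False and are updated when later events arrive.
    clicked: bool = False
    purchased: bool = False
    returned: bool = False

def slate_to_rows(context, product_ids):
    # One row per displayed product, with 1-based position to preserve exposure.
    return [
        ImpressionRow(product_id=pid, position=pos, **context)
        for pos, pid in enumerate(product_ids, start=1)
    ]

# Hypothetical request context shared by every row in the slate.
context = {
    "request_id": "req-001",
    "query": "insulated delivery bag",
    "catalog_snapshot": "snapshot-a",
    "eligibility_version": "policy-v1",
    "candidate_version": "cand-v1",
    "ranker_version": "ranker-v2",
    "experiment_arm": "treatment",
}
impressions = slate_to_rows(context, ["P1", "P3", "P2"])
for row in impressions:
    print(json.dumps(asdict(row)))
```

Logging every displayed row, including the ones nobody clicked, is what makes exposure recoverable later.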
Clicks alone are not reliable targets: top-ranked items receive attention because they are top-ranked. Use judged offline sets for regression detection, and use a controlled A/B experiment to compare user outcomes. Kohavi et al. describe online experimentation as a disciplined way to measure changes while monitoring guardrails and unexpected harm.[2]
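The position effect is visible once impressions are logged with positions. The following sketch uses made-up counts for one product shown in two different slots. The numbers are illustrative, not measured.

```python
from collections import defaultdict

# Hypothetical impression rows: (product_id, position, clicked).
# The same product is shown 100 times at position 1 and 100 times at position 3.
rows = [("P5", 1, i < 20) for i in range(100)] + [("P5", 3, i < 6) for i in range(100)]

shown = defaultdict(int)
clicks = defaultdict(int)
for product_id, position, clicked in rows:
    shown[(product_id, position)] += 1
    clicks[(product_id, position)] += int(clicked)

# Click-through rate per (product, position); this needs impressions, not only clicks.
ctr = {key: clicks[key] / shown[key] for key in shown}
for (product_id, position), rate in sorted(ctr.items()):
    print(product_id, "position", position, "CTR", rate)
```

In this toy log the product is unchanged, yet its click-through rate differs by slot. A model trained on raw clicks would read that gap as relevance.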
An A/B plan should state allocation, duration or sample-size rule, primary metric, latency and return-rate guardrails, prohibited-listing invariant, and stop conditions. A conversion lift that increases returns or policy violations is not a successful ranking release.
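A readout rule for such a plan might look like the sketch below. It assumes a fixed sample size chosen in advance and a pooled two-proportion z-test. The arm summaries, `ALPHA`, and both guardrail limits are hypothetical values, not recommendations.

```python
from math import erf, sqrt

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - normal_cdf)

# Hypothetical arm summaries collected at the planned sample size.
control = {"searches": 10_000, "purchases": 500, "return_rate": 0.08,
           "p95_latency_ms": 180, "blocked_impressions": 0}
treatment = {"searches": 10_000, "purchases": 580, "return_rate": 0.11,
             "p95_latency_ms": 185, "blocked_impressions": 0}

# Assumed limits; a real plan would fix these before the experiment starts.
ALPHA = 0.05
MAX_RETURN_RATE_INCREASE = 0.01
MAX_LATENCY_INCREASE_MS = 20

p_value = two_proportion_p_value(control["purchases"], control["searches"],
                                 treatment["purchases"], treatment["searches"])
lift = treatment["purchases"] / treatment["searches"] - control["purchases"] / control["searches"]

# Guardrails are checked before the primary metric is allowed to count.
failed = []
if treatment["blocked_impressions"] > 0:
    failed.append("prohibited_listing_invariant")
if treatment["return_rate"] - control["return_rate"] > MAX_RETURN_RATE_INCREASE:
    failed.append("return_rate")
if treatment["p95_latency_ms"] - control["p95_latency_ms"] > MAX_LATENCY_INCREASE_MS:
    failed.append("latency")

if failed:
    ab_decision = "hold"
elif lift > 0 and p_value < ALPHA:
    ab_decision = "promote"
else:
    ab_decision = "no_change"

print("lift:", round(lift, 4), "p-value:", round(p_value, 4))
print("failed guardrails:", failed)
print("decision:", ab_decision)
```

The treatment arm converts better in this toy data, but the return-rate guardrail fails, so the decision is `hold`.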
Before requesting the experiment, confirm that each artifact has its evidence:
| Artifact | Evidence |
|---|---|
| eligibility policy | blocked items can't enter ranking |
| candidate evaluation | recall measured before reranking |
| ranker evaluation | NDCG slices and latency recorded |
| impression schema | positions and versions logged |
| experiment plan | metrics, guardrails, stop rules |
| rollback | stable ranker alias remains available |
Use this rubric to review the submission:
| Artifact | Strong submission demonstrates |
|---|---|
| ranking stack | explicit eligibility, candidate generation, reranking inputs, and deterministic fallback |
| evaluation | NDCG-style offline result with policy guardrails and slice review |
| experiment plan | exposure logging, primary metric, guardrails, stopping rule, and rollback |
Common failure modes:
| Symptom | Cause | Fix |
|---|---|---|
| Great NDCG while prohibited item appears | ranking runs before eligibility | filter first and test invariant |
| New model increasingly favors former winners | clicks used without exposure evidence | log impressions and experiment arms |
| Conversion increases while customer value falls | returns absent from guardrails | gate promotion on downstream outcomes |
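The "filter first and test invariant" fix can be sketched as a test in the spirit of `tests/test_blocked_listing_never_surfaces.py`. The `search` function, the catalog, and the blocked set below are hypothetical stand-ins for the real ranking stack.

```python
import random

# Hypothetical catalog and policy-blocked set for the test fixture.
CATALOG = ["P1", "P2", "P3", "P8", "P9"]
BLOCKED = {"P8", "P9"}

def search(scores, blocked):
    # Stand-in for the ranking stack: eligibility runs before scoring.
    eligible = [product for product in CATALOG if product not in blocked]
    return sorted(eligible, key=lambda product: scores[product], reverse=True)

def test_blocked_listing_never_surfaces():
    rng = random.Random(0)  # fixed seed keeps the test deterministic
    for _ in range(200):
        scores = {product: rng.random() for product in CATALOG}
        # Adversarial case: blocked listings always get the highest score.
        for product in BLOCKED:
            scores[product] = 2.0
        slate = search(scores, BLOCKED)
        assert not BLOCKED & set(slate)
        assert set(slate) == set(CATALOG) - BLOCKED

test_blocked_listing_never_surfaces()
print("invariant holds")
```

The test gives blocked listings the best possible score on purpose, so it fails loudly if scoring ever runs before the eligibility filter.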