Rank products for a shopper using candidate retrieval, relevance metrics, and feedback-loop safeguards.
A delivery-risk classifier chooses one action for one parcel. A storefront model makes a different prediction: among thousands of products, which few items should appear first for a query or customer?
This is a ranking problem. Retrieval finds eligible candidates quickly; a ranking model orders them using query, product, inventory, price, and behavioral features. Recommendations use the same shape even without a search query: assemble candidates, score them, display a short list, then measure useful interaction rather than raw clicks alone.
Suppose a shopper searches for "insulated delivery bag". The catalog contains one million products. A feature-rich model shouldn't score every listing on every request, so the system narrows the catalog in stages:
| Stage | Job | Example guard |
|---|---|---|
| eligibility filter | remove forbidden or unavailable items | in stock, permitted region |
| candidate generation | recover perhaps 200 plausible products | text or embedding retrieval |
| ranking | order candidates precisely | relevance, price, fulfillment reliability |
| business policy | enforce final constraints | sponsored labels, diversity, safety |
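
The four stages above can be sketched as a chain of small functions. This is a minimal illustration, not a production design: the `Product` fields, the token-overlap retrieval, the `model_score` lookup, and the per-seller cap are all hypothetical stand-ins for real retrieval, a learned ranker, and real policy rules.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Product:
    product_id: str
    title: str
    in_stock: bool
    region_ok: bool
    seller: str

def eligible(products):
    # Eligibility filter: drop unavailable or forbidden items before any scoring.
    return [p for p in products if p.in_stock and p.region_ok]

def generate_candidates(query, products, limit=200):
    # Candidate generation: token overlap stands in for text or embedding
    # retrieval (an assumption made only to keep the sketch self-contained).
    terms = set(query.lower().split())
    scored = [(len(terms & set(p.title.lower().split())), p) for p in products]
    matches = [pair for pair in scored if pair[0] > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in matches[:limit]]

def rank(candidates, score_fn):
    # Ranking: the expensive scorer only sees the small candidate set.
    return sorted(candidates, key=score_fn, reverse=True)

def apply_policy(ranked, max_per_seller=2, slate_size=10):
    # Business policy: a per-seller cap as one simple diversity constraint.
    per_seller, slate = {}, []
    for p in ranked:
        if per_seller.get(p.seller, 0) < max_per_seller:
            slate.append(p)
            per_seller[p.seller] = per_seller.get(p.seller, 0) + 1
        if len(slate) == slate_size:
            break
    return slate

catalog = [
    Product("p1", "insulated delivery bag large", True, True, "s1"),
    Product("p2", "insulated lunch bag", True, True, "s1"),
    Product("p3", "delivery bag insulated xl", False, True, "s2"),  # out of stock
    Product("p4", "phone case", True, True, "s3"),
    Product("p5", "insulated delivery bag small", True, True, "s1"),
    Product("p6", "delivery backpack insulated", True, True, "s2"),
]
# Hypothetical ranker output; a real system would call a trained model here.
model_score = {"p1": 0.9, "p2": 0.4, "p5": 0.7, "p6": 0.8}

candidates = generate_candidates("insulated delivery bag", eligible(catalog))
slate = apply_policy(rank(candidates, lambda p: model_score[p.product_id]))
print([p.product_id for p in slate])
```

In this toy run the out-of-stock item never reaches retrieval, the unrelated phone case never reaches the ranker, and the seller cap drops the third listing from seller `s1`.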
A product absent from the candidate set can't be rescued by a perfect ranker. Candidate recall must therefore be measured separately from ranking quality. This resembles the retrieval and reranking pattern later used for document evidence, but the outcome here is a product exposure decision.
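Candidate recall can be computed before any ranking happens. The sketch below assumes each query has a set of judged-relevant product IDs; the query names and IDs are made up for illustration.

```python
def candidate_recall(candidate_ids, relevant_ids):
    # Fraction of judged-relevant products that survived candidate generation.
    relevant = set(relevant_ids)
    if not relevant:
        return None  # nothing judged relevant: leave this query out of the average
    return len(relevant & set(candidate_ids)) / len(relevant)

# Hypothetical judged queries: (retrieved candidates, judged-relevant products).
judged_queries = {
    "insulated delivery bag": (["p1", "p5", "p6"], ["p1", "p9"]),
    "thermal pizza bag": (["p7", "p8"], ["p7", "p8"]),
}

recalls = {q: candidate_recall(c, r) for q, (c, r) in judged_queries.items()}
usable = [value for value in recalls.values() if value is not None]
mean_recall = sum(usable) / len(usable)
print(recalls, mean_recall)
```

Here `p9` is relevant but was never retrieved, so no ranker could have shown it; that loss is visible in recall and invisible in a ranking metric computed only over the retrieved candidates.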
In a ranked list, placing the best result first matters more than moving it from position 40 to position 39. Discounted Cumulative Gain (DCG) gives high relevance near the top more weight:

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

Here $rel_i$ is the graded relevance at position $i$, counting positions from 1. Normalized DCG (NDCG) divides by the score of the ideal ordering, producing a value between zero and one for a query.
```python
from math import log2

def dcg(relevances):
    # Gain 2**rel - 1, discounted by log2(position + 1); rank is 0-based here.
    return sum((2**rel - 1) / log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the same labels sorted best-first.
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal)
    # A list with no relevant items has no meaningful ideal; report 0.0.
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

baseline = [3, 0, 2, 1]
candidate = [3, 2, 1, 0]
print("baseline ndcg:", round(ndcg(baseline), 3))
print("candidate ndcg:", round(ndcg(candidate), 3))
```

```text
baseline ndcg: 0.951
candidate ndcg: 1.0
```

The baseline still puts the most relevant bag first, so its score is high. The candidate moves the grade-2 and grade-1 items up to positions two and three, which matches the ideal ordering. This is why ranking evaluation needs graded judgments or trustworthy interactions, not a single accuracy bit.
Learning-to-rank work such as RankNet trains on pairwise preferences between documents or products so that relevant items outrank less relevant ones.[1] In production, pairwise learning is only one modeling choice. Your evidence contract still consists of the candidate-set version, feature snapshot, label policy, NDCG slices, latency, and online experiment plan.
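As a rough illustration of the pairwise idea, the sketch below computes a logistic loss on score differences for every pair in which one item has a higher graded label than the other. It is written in the spirit of pairwise learning to rank and is not a reproduction of the cited work; the item names, labels, and scores are hypothetical, and no model is trained here.

```python
from itertools import permutations
from math import exp, log1p

def pairwise_logistic_loss(score_preferred, score_other):
    # -log(sigmoid(score_preferred - score_other)), in a numerically stable form.
    x = score_other - score_preferred
    return max(x, 0.0) + log1p(exp(-abs(x)))

def query_pair_loss(scores, labels):
    # Average the loss over all ordered pairs (a, b) where a is judged better than b.
    losses = [
        pairwise_logistic_loss(scores[a], scores[b])
        for a, b in permutations(labels, 2)
        if labels[a] > labels[b]
    ]
    return sum(losses) / len(losses) if losses else 0.0

labels = {"bag_a": 3, "bag_b": 0, "bag_c": 2}            # graded relevance
good_scores = {"bag_a": 2.0, "bag_b": -1.0, "bag_c": 1.0}  # agrees with labels
bad_scores = {"bag_a": -1.0, "bag_b": 2.0, "bag_c": 1.0}   # inverts the best and worst

print(query_pair_loss(good_scores, labels))
print(query_pair_loss(bad_scores, labels))
```

The loss depends only on score differences within a query, so it pushes preferred items above less preferred ones without requiring calibrated absolute scores.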
When a model places an item first, it receives more attention and therefore more clicks. Training directly on clicks rewards past ranking position as if it were product relevance. This is a feedback loop.
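A small calculation shows how position can outweigh relevance in raw clicks. The sketch assumes a simple position-based click model, in which the chance of a click is the chance the shopper looks at a position multiplied by how attractive the item is once seen. The examination probabilities and attractiveness values are invented for illustration.

```python
# Hypothetical probability that a shopper looks at each position (0-based).
examination = [0.9, 0.6, 0.4, 0.25]
# Hypothetical probability of a click once the item has actually been seen.
attractiveness = {"popular_bag": 0.3, "better_bag": 0.5}

def expected_ctr(item, position):
    # Position-based click model: click = examined AND attractive.
    return examination[position] * attractiveness[item]

def exposure_adjusted(ctr, position):
    # Dividing by examination probability recovers attractiveness,
    # but only if the displayed position was logged.
    return ctr / examination[position]

top = expected_ctr("popular_bag", 0)   # shown first
low = expected_ctr("better_bag", 3)    # shown fourth
print(top, low)
print(exposure_adjusted(top, 0), exposure_adjusted(low, 3))
```

Under these assumed numbers the weaker item collects more than twice the clicks simply because it is shown first, and a model trained on raw clicks would keep it there.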
Log every impression before using interactions:
| Logged field | Reason |
|---|---|
| request and query ID | group displayed slate |
| candidate set version | know what could have ranked |
| final positions | measure exposure |
| ranker version | attribute outcomes |
| click, cart, purchase, return | distinguish curiosity from value |
| inventory and price snapshot | reproduce context |
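
One way to hold these fields together is a single record per displayed slate. The sketch below is an assumed schema, not a standard one: the class name, field names, and version strings are hypothetical, and a real system would likely add timestamps and user or session identifiers.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ImpressionLog:
    request_id: str
    query_id: str
    candidate_set_version: str   # what could have ranked
    ranker_version: str          # which model to attribute outcomes to
    positions: dict              # product_id -> displayed position (1-based)
    price_snapshot: dict         # product_id -> price at display time
    inventory_snapshot: dict     # product_id -> units in stock at display time
    outcomes: dict = field(default_factory=dict)  # product_id -> list of events

record = ImpressionLog(
    request_id="req-001",
    query_id="q-insulated-delivery-bag",
    candidate_set_version="cand-v3",
    ranker_version="ranker-v12",
    positions={"p1": 1, "p6": 2, "p5": 3},
    price_snapshot={"p1": 39.0, "p6": 54.5, "p5": 29.0},
    inventory_snapshot={"p1": 14, "p6": 3, "p5": 40},
)
# Outcomes arrive later and are attached to the same impression.
record.outcomes["p6"] = ["click", "cart", "purchase"]
record.outcomes["p1"] = ["click"]

line = json.dumps(asdict(record), sort_keys=True)
print(line)
```

Items that were shown but received no events still appear in `positions`, which is what lets later analysis separate "not clicked" from "never displayed".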
For a first portfolio system, evaluate with human relevance labels on a frozen query set, then define an A/B plan for online outcomes. Controlled online experiments are necessary because engagement changes can reflect latency, layout, price, stock, or ranking behavior; experiment design isolates the candidate change.[2]
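One common ingredient of an A/B plan is a stable assignment rule, so that a shopper sees the same ranker for the whole experiment. The sketch below hashes a user ID together with an experiment name; the function name, experiment name, and 50/50 split are assumptions for illustration, and the source does not prescribe a particular assignment scheme.

```python
import hashlib

def assign_arm(user_id, experiment, treatment_share=0.5):
    # Hash the experiment name with the user ID so different experiments
    # split the same users independently of each other.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform-looking value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

users = [f"user-{n}" for n in range(1000)]
arms = [assign_arm(u, "ranker-v12-vs-v11") for u in users]
print(arms.count("treatment"), arms.count("control"))
```

Because assignment depends only on the IDs, it can be recomputed at analysis time and checked against the ranker version recorded in the impression log.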
Recommendations can narrow what customers see. Add constraints for unavailable items, unsafe listings, seller policy, and overconcentration. Monitor slices such as new products, small sellers, languages, and low-inventory categories rather than accepting one marketplace-wide NDCG.
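Slice monitoring can start as a grouped average of per-query NDCG. In the sketch below the slice names and judged lists are hypothetical, and each list holds graded relevance in displayed order.

```python
from collections import defaultdict
from math import log2

def dcg(relevances):
    return sum((2**rel - 1) / log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

def ndcg_by_slice(rows):
    # rows: iterable of (slice_name, relevances_in_displayed_order)
    grouped = defaultdict(list)
    for slice_name, relevances in rows:
        grouped[slice_name].append(ndcg(relevances))
    return {name: sum(values) / len(values) for name, values in grouped.items()}

judged = [
    ("established", [3, 2, 1, 0]),
    ("established", [3, 0, 2, 1]),
    ("new_products", [0, 1, 3, 2]),  # best item buried at position three
]
per_slice = ndcg_by_slice(judged)
overall = sum(ndcg(rels) for _, rels in judged) / len(judged)
print(per_slice, overall)
```

With these made-up judgments the overall average sits well above the `new_products` slice, which is the kind of gap a single marketplace-wide number hides.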
A useful release packet includes offline query judgments, candidate recall, NDCG@10, p95 scoring latency, diversity checks, impression logging schema, and an A/B stopping rule. A list that looks good offline isn't allowed to silently write its own future training labels.
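Several items in the packet reduce to numbers that can be checked mechanically. The sketch below shows one way to truncate NDCG at 10, estimate p95 latency with the standard library, and compare metrics with thresholds. The threshold values and metric names are placeholders, not recommendations.

```python
from math import log2
from statistics import quantiles

def ndcg_at_k(relevances, k=10):
    # relevances: every judged item for the query, in displayed order.
    def dcg(rels):
        return sum((2**r - 1) / log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def p95(latencies_ms):
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(latencies_ms, n=100)[94]

def release_gate(metrics, minimums, maximums):
    # Return the names of every check that failed; an empty list means pass.
    failures = [name for name, floor in minimums.items() if metrics[name] < floor]
    failures += [name for name, cap in maximums.items() if metrics[name] > cap]
    return failures

metrics = {
    "candidate_recall": 0.93,
    "ndcg_at_10": 0.71,
    "p95_latency_ms": p95([20.0] * 90 + [150.0] * 10),
}
failed = release_gate(
    metrics,
    minimums={"candidate_recall": 0.90, "ndcg_at_10": 0.70},  # placeholder floors
    maximums={"p95_latency_ms": 120.0},                       # placeholder cap
)
print(failed)
```

Note that `ndcg_at_k` builds the ideal ordering from all judged items, so a highly relevant product displayed below position 10 lowers the score instead of being ignored.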
| Evidence | A production-ready answer demonstrates |
|---|---|
| candidate contract | separates eligibility and retrieval from learned reranking |
| evaluation | uses ranking metrics and product guardrails instead of classification accuracy alone |
| feedback safety | logs impressions and outcomes and addresses position-driven bias |
| Symptom | Cause | Fix |
|---|---|---|
| NDCG looks strong but desired item never appears | candidates lost recall | measure retrieval recall separately |
| Popular products dominate forever | click-position loop | log exposure and use controlled evaluation |
| Relevance improves while returns rise | wrong online objective | pair engagement with purchase and return guardrails |