Capstone: Product Ranking

Ship a marketplace ranking candidate with eligible retrieval, NDCG evidence, exposure logging, and A/B guardrails.

The ETA project produced one guarded prediction for one parcel. This project makes a decision that shapes its own future data: which products shoppers see near the top of search results.

Your task is to ship a ranking candidate for searches such as "insulated delivery bag". It may reorder eligible, in-stock listings. It may not surface blocked listings, disguise sponsored placement, or treat clicks as an unbiased truth signal.

Figure: Product-ranking capstone showing eligible catalog filtering, candidate retrieval, learned ordering, impression logging, offline relevance judging, and guarded online experiment.
The ranker earns exposure only after offline ordering evidence and an impression schema make future feedback measurable rather than self-confirming.

Specify the Ranking Surface

Define the scope tightly:

| Contract | Decision |
| --- | --- |
| surface | marketplace search results |
| query set | frozen judged search queries plus online experiment traffic |
| eligibility | in-stock, deliverable region, policy-approved listing |
| offline metric | candidate recall@100 and NDCG@10 |
| online primary metric | purchase conversion per search |
| guardrails | returns, latency, unsafe listings, seller concentration |

Candidate generation and reranking are separate artifacts. The candidate component must retrieve relevant items reliably. The ranker can only adjust order inside that eligible set.

Learning-to-rank methods can learn pairwise preferences so a relevant product receives a higher score than a less relevant candidate.[1] That modeling detail doesn't excuse missing policy: eligibility filtering must run before scoring, and its version belongs in every impression log.
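
The ordering constraint above (filter first, then score, and stamp the filter version on the result) can be sketched as follows. This is a minimal illustration, not the capstone's actual implementation: the `Listing` fields, the `ELIGIBILITY_VERSION` label, and the precomputed `score` standing in for a learned ranker are all assumptions made for the example.

```python
from dataclasses import dataclass

ELIGIBILITY_VERSION = "elig-v1"  # hypothetical version label


@dataclass(frozen=True)
class Listing:
    product_id: str
    in_stock: bool
    deliverable: bool
    policy_approved: bool
    score: float  # stand-in for a learned ranker score (assumption)


def is_eligible(listing: Listing) -> bool:
    # All three eligibility conditions from the contract must hold.
    return listing.in_stock and listing.deliverable and listing.policy_approved


def rank_slate(catalog: list[Listing]) -> dict:
    # Filter BEFORE scoring so a high score can never rescue a blocked listing.
    eligible = [item for item in catalog if is_eligible(item)]
    ordered = sorted(eligible, key=lambda item: item.score, reverse=True)
    return {
        "eligibility_version": ELIGIBILITY_VERSION,
        "slate": [item.product_id for item in ordered],
    }


catalog = [
    Listing("P1", True, True, True, 0.62),
    Listing("P9", True, True, False, 0.99),  # policy-blocked, highest score
    Listing("P3", True, True, True, 0.71),
    Listing("P7", False, True, True, 0.80),  # out of stock
]
result = rank_slate(catalog)
print(result)
```

Here `P9` has the highest score but never reaches the sort, which is the invariant a test such as `test_blocked_listing_never_surfaces.py` would be expected to pin down.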

Figure: Catalog snapshot (inventory + policy), Eligible candidates (recall@100), Ranker v1 (NDCG@10), and Displayed slate + impression log (purchases and returns).

Build Offline Evidence First

Create a judged fixture set. Each query has eligible candidates and graded relevance from human review:

```text
ranking-product/
  data/
    catalog_snapshot.jsonl
    judged_queries.jsonl
    eligibility_policy.json
  retrieval/
    candidate_generator.py
  ranking/
    ranker.py
    evaluate_ndcg.py
  experiments/
    impression_schema.json
    ab_plan.md
  tests/
    test_blocked_listing_never_surfaces.py
    test_ndcg_regression_gate.py
```

Required offline checks:

| Check | Why it blocks release |
| --- | --- |
| eligible-only results | a relevant prohibited listing still can't display |
| candidate recall@100 | the ranker can't repair missing products |
| NDCG@10 by query category | overall improvement may hide poor critical categories |
| scoring latency | a slower list damages shopping experience |
| diversity/seller concentration | one feedback-rich seller shouldn't crowd out catalog |
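
Two of these checks, candidate recall@100 and NDCG@10 by query category, can be sketched together. The fixture below is invented for illustration (query categories, relevance grades, and the `0.9` recall floor are all assumptions). It is built so that overall NDCG improves while one category regresses, which is the situation the slice check exists to catch.

```python
from collections import defaultdict
from math import log2


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Share of judged-relevant products that the candidate stage retrieved.
    if not relevant:
        raise ValueError("recall is undefined without relevant items")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def dcg(rels: list[int]) -> float:
    return sum((2**rel - 1) / log2(rank + 2) for rank, rel in enumerate(rels))


def ndcg_at_k(rels: list[int], k: int) -> float:
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


# Hypothetical judged fixture: relevance grades listed in displayed order.
judged = [
    {"category": "bags", "retrieved": ["P1", "P2", "P3"], "relevant": {"P1", "P3"},
     "baseline": [1, 3, 2], "candidate": [3, 2, 1]},
    {"category": "bags", "retrieved": ["P1", "P3", "P7"], "relevant": {"P3"},
     "baseline": [1, 2, 3], "candidate": [3, 2, 1]},
    {"category": "printers", "retrieved": ["P5", "P4"], "relevant": {"P5", "P6"},
     "baseline": [3, 2, 1], "candidate": [2, 3, 1]},
]

by_category = defaultdict(lambda: {"baseline": [], "candidate": []})
for query in judged:
    for arm in ("baseline", "candidate"):
        by_category[query["category"]][arm].append(ndcg_at_k(query[arm], 10))

regressed = sorted(
    category for category, scores in by_category.items()
    if mean(scores["candidate"]) < mean(scores["baseline"])
)
overall_baseline = mean([ndcg_at_k(q["baseline"], 10) for q in judged])
overall_candidate = mean([ndcg_at_k(q["candidate"], 10) for q in judged])
min_recall = min(recall_at_k(q["retrieved"], q["relevant"], 100) for q in judged)

# Assumed gate: no regressed category and a recall floor of 0.9.
release_ok = not regressed and min_recall >= 0.9
print("overall improved:", overall_candidate > overall_baseline)
print("regressed categories:", regressed)
print("min recall@100:", min_recall)
print("release ok:", release_ok)
```

In this fixture the overall average goes up, yet the gate holds the release because the `printers` slice regresses and one query is missing a relevant product at the candidate stage.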

Make the Ranking Gate Runnable

This compact evaluator checks that the candidate improves mean NDCG across two judged queries and never surfaces a blocked product. In the full artifact, judgments would live in versioned data files.

ranking-release-gate.py

```python
from math import log2

# Each row is (product_id, graded relevance from human review).
queries = [
    {
        "id": "insulated-bag",
        "blocked": {"P9"},
        "baseline": [("P1", 3), ("P2", 1), ("P3", 2)],
        "candidate": [("P1", 3), ("P3", 2), ("P2", 1)],
    },
    {
        "id": "label-printer",
        "blocked": {"P8"},
        "baseline": [("P4", 1), ("P5", 3), ("P6", 2)],
        "candidate": [("P5", 3), ("P6", 2), ("P4", 1)],
    },
]

def dcg(rows):
    # Exponential gain with a log2 position discount; rank is 0-based.
    return sum((2**rel - 1) / log2(rank + 2) for rank, (_, rel) in enumerate(rows))

def ndcg(rows):
    # The ideal ordering is built from the judged rows of this slate only.
    ideal_dcg = dcg(sorted(rows, key=lambda row: row[1], reverse=True))
    return dcg(rows) / ideal_dcg if ideal_dcg > 0 else 0.0

# Mean NDCG over all judged queries for each arm.
baseline = sum(ndcg(query["baseline"]) for query in queries) / len(queries)
candidate = sum(ndcg(query["candidate"]) for query in queries) / len(queries)
# Any blocked product in a candidate slate fails the gate outright.
blocked_hits = [product for query in queries for product, _ in query["candidate"] if product in query["blocked"]]

decision = "eligible_for_ab_review" if candidate > baseline and not blocked_hits else "hold"
print("baseline:", round(baseline, 3))
print("candidate:", round(candidate, 3))
print("blocked hits:", blocked_hits)
print("decision:", decision)
```

Output

```text
baseline: 0.854
candidate: 1.0
blocked hits: []
decision: eligible_for_ab_review
```

This decision is intentionally limited. Offline judged queries support an online experiment, not broad promotion. Live users may respond differently from judges, and new ranking exposure changes the future outcome data.

Log Exposure Before Learning From Outcomes

Every displayed slate should log:

| Field | Why |
| --- | --- |
| request_id, query | group one decision |
| catalog_snapshot, eligibility_version | show available choices |
| candidate_version, ranker_version | identify scoring path |
| product_id, position | preserve exposure |
| clicked, purchased, returned | distinguish shallow and useful outcomes |
| experiment arm | estimate treatment effect |
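
One way to picture these fields is as one log row per displayed product. The sketch below is an assumption about shape, not the contents of `impression_schema.json`: the `ImpressionRow` class, the `log_slate` helper, the `experiment_arm` field name, and all version strings are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ImpressionRow:
    request_id: str
    query: str
    catalog_snapshot: str
    eligibility_version: str
    candidate_version: str
    ranker_version: str
    experiment_arm: str
    product_id: str
    position: int
    # Outcomes start as False and would be filled in by a later join.
    clicked: bool = False
    purchased: bool = False
    returned: bool = False


def log_slate(request_id: str, query: str, slate: list[str],
              versions: dict[str, str], arm: str) -> list[ImpressionRow]:
    # One row per displayed product, with a 1-based position.
    return [
        ImpressionRow(request_id=request_id, query=query, experiment_arm=arm,
                      product_id=product_id, position=position, **versions)
        for position, product_id in enumerate(slate, start=1)
    ]


versions = {
    "catalog_snapshot": "2026-01-15",  # hypothetical identifiers
    "eligibility_version": "elig-v1",
    "candidate_version": "cand-v3",
    "ranker_version": "ranker-v1",
}
rows = log_slate("req-001", "insulated delivery bag", ["P3", "P1"], versions, "treatment")
for row in rows:
    print(json.dumps(asdict(row)))
```

Logging every displayed position, including products nobody clicked, is what lets later analysis separate "not shown" from "shown and ignored".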

Clicks alone are not reliable targets: top-ranked items receive attention because they are top-ranked. Use judged offline sets for regression detection, and use a controlled A/B experiment to compare user outcomes. Kohavi et al. describe online experimentation as a disciplined way to measure changes while monitoring guardrails and unexpected harm.[2]
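
A small worked example shows why raw click rates mislead. It uses a simple position-based click model, which is an assumption brought in for illustration and not something the capstone prescribes: the chance of a click is the chance the position is examined times the product's intrinsic appeal. The examination and appeal numbers are invented.

```python
# Assumed probability that a shopper examines each position.
examination = {1: 1.0, 2: 0.6, 3: 0.4}

# Assumed intrinsic appeal if examined; P1 is the most appealing product.
attractiveness = {"P2": 0.30, "P4": 0.20, "P1": 0.50}

# The current ranker happens to show P1 in the last position.
slate = ["P2", "P4", "P1"]

expected_ctr = {
    product_id: examination[position] * attractiveness[product_id]
    for position, product_id in enumerate(slate, start=1)
}

for product_id in slate:
    print(product_id, round(expected_ctr[product_id], 3))

# Raw clicks would rank P2 above P1 even though P1 is more appealing.
print("clicks favor P2 over P1:", expected_ctr["P2"] > expected_ctr["P1"])
```

Training on these clicks would reinforce the existing order. In this toy setup the examination probabilities are known, so the bias could be divided out; in practice they are unknown, which is one reason the text leans on judged sets and controlled experiments.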

An A/B plan should state allocation, duration or sample-size rule, primary metric, latency and return-rate guardrails, prohibited-listing invariant, and stop conditions. A conversion lift that increases returns or policy violations is not a successful ranking release.
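
The plan's decision logic can be sketched as an ordered set of checks: the prohibited-listing invariant first, then the sample-size rule, then guardrails, and only then the primary metric. Everything concrete here is assumed for illustration, including the `ArmSummary` fields, the thresholds, and the arm counts. The sketch compares point estimates only; a real plan would also require a significance test or confidence interval before acting on a lift.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmSummary:
    searches: int
    purchases: int
    returns: int
    p95_latency_ms: float
    blocked_impressions: int

    @property
    def conversion(self) -> float:
        return self.purchases / self.searches

    @property
    def return_rate(self) -> float:
        return self.returns / self.purchases if self.purchases else 0.0


def decide(control: ArmSummary, treatment: ArmSummary,
           min_searches: int = 5000,  # hypothetical sample-size rule
           max_return_rate_increase: float = 0.005,  # hypothetical guardrail
           max_latency_increase_ms: float = 20.0) -> str:  # hypothetical guardrail
    if treatment.blocked_impressions > 0:
        return "stop: prohibited listing displayed"
    if min(control.searches, treatment.searches) < min_searches:
        return "continue: sample-size rule not met"
    if treatment.return_rate - control.return_rate > max_return_rate_increase:
        return "hold: return-rate guardrail"
    if treatment.p95_latency_ms - control.p95_latency_ms > max_latency_increase_ms:
        return "hold: latency guardrail"
    if treatment.conversion > control.conversion:
        return "candidate_for_promotion_review"
    return "hold: no conversion lift"


control = ArmSummary(10_000, 400, 20, 180.0, 0)
more_returns = ArmSummary(10_000, 440, 44, 185.0, 0)  # lift, but returns double
clean_lift = ArmSummary(10_000, 440, 22, 185.0, 0)

print(decide(control, more_returns))
print(decide(control, clean_lift))
```

Both treatment arms show the same conversion lift, but only the second clears the return-rate guardrail.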

Submission Checklist

| Artifact | Evidence |
| --- | --- |
| eligibility policy | blocked items can't enter ranking |
| candidate evaluation | recall measured before reranking |
| ranker evaluation | NDCG slices and latency recorded |
| impression schema | positions and versions logged |
| experiment plan | metrics, guardrails, stop rules |
| rollback | stable ranker alias remains available |

Mastery Check

Evaluation rubric

| Artifact | Strong submission demonstrates |
| --- | --- |
| ranking stack | explicit eligibility, candidate generation, reranking inputs, and deterministic fallback |
| evaluation | NDCG-style offline result with policy guardrails and slice review |
| experiment plan | exposure logging, primary metric, guardrails, stopping rule, and rollback |

Common Failures

| Symptom | Cause | Fix |
| --- | --- | --- |
| Great NDCG while prohibited item appears | ranking runs before eligibility | filter first and test invariant |
| New model increasingly favors former winners | clicks used without exposure evidence | log impressions and experiment arms |
| Conversion increases while customer value falls | returns absent from guardrails | gate promotion on downstream outcomes |

Next Step
Continue to Capstone: Demand Forecasting

You can now ship a measured ranking surface whose outputs influence future data. Next you will ship time-ordered forecasts and alerts for inventory and warehouse capacity.

References

[1] Burges, C. J. C., et al. (2005). Learning to Rank using Gradient Descent. ICML 2005.

[2] Kohavi, R., Tang, D., Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing.