LeetLLM
Medium · Evaluation & Benchmarks

Ranking and Recommendation Systems

Rank products for a shopper using candidate retrieval, relevance metrics, and feedback-loop safeguards.


A delivery-risk classifier chooses one action for one parcel. A storefront model makes a different prediction: among thousands of products, which few items should appear first for a query or customer?

This is a ranking problem. Retrieval finds eligible candidates quickly; a ranking model orders them using query, product, inventory, price, and behavioral features. Recommendations use the same shape even without a search query: assemble candidates, score them, display a short list, then measure useful interaction rather than raw clicks alone.

[Figure: Marketplace ranking path showing permitted in-stock candidates entering a learned ranker, top results receiving exposure, and debiased outcomes feeding later evaluation.]
A ranking system first limits eligible products, then scores order quality while remembering that displayed position influences future labels.

Candidate Generation Comes Before Ranking

Suppose a shopper searches for "insulated delivery bag". The catalog contains one million products. A feature-rich model shouldn't score every listing on every request.

| Stage | Job | Example guard |
| --- | --- | --- |
| eligibility filter | remove forbidden or unavailable items | in stock, permitted region |
| candidate generation | recover perhaps 200 plausible products | text or embedding retrieval |
| ranking | order candidates precisely | relevance, price, fulfillment reliability |
| business policy | enforce final constraints | sponsored labels, diversity, safety |

A product absent from the candidate set can't be rescued by a perfect ranker. Candidate recall must therefore be measured separately from ranking quality. This resembles the retrieval and reranking pattern later used for document evidence, but the outcome here is a product exposure decision.
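Measuring candidate recall separately can be sketched as follows. This is a minimal illustration, assuming you already have per-query sets of judged-relevant product IDs; the queries, product IDs, and the `candidate_recall` helper are hypothetical and not part of any specific library.

```python
def candidate_recall(candidate_ids, relevant_ids):
    """Fraction of judged-relevant products that survived candidate generation."""
    relevant = set(relevant_ids)
    if not relevant:
        return None  # recall is undefined when a query has no judged-relevant products
    return len(relevant & set(candidate_ids)) / len(relevant)

# Hypothetical judged queries; IDs and judgments are illustrative only.
judged = {
    "insulated delivery bag": {
        "relevant": ["p1", "p2", "p3", "p4"],
        "candidates": ["p1", "p2", "p9", "p4", "p7"],
    },
    "thermal pizza carrier": {
        "relevant": ["p5", "p6"],
        "candidates": ["p6", "p8"],
    },
}

per_query = {
    query: candidate_recall(row["candidates"], row["relevant"])
    for query, row in judged.items()
}
scored = [value for value in per_query.values() if value is not None]
mean_recall = sum(scored) / len(scored)

print(per_query)
print("mean candidate recall:", round(mean_recall, 3))
```

Reporting this number next to NDCG makes it visible when a ranking regression is really a retrieval regression: the ranker never saw the missing products.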

A Metric That Values the Top Slots

In a ranked list, placing the best result first matters more than moving it from position 40 to position 39. Discounted Cumulative Gain (DCG) gives high relevance near the top more weight:

$$\text{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i+1)}$$

Here $rel_i$ is the graded relevance at position $i$. Normalized DCG (NDCG) divides by the score of the ideal ordering, producing a value between zero and one for a query.

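ndcg-for-product-results.py

    from math import log2

    def dcg(relevances):
        # enumerate() is 0-based, so position i = rank + 1 and the discount is log2(i + 1) = log2(rank + 2).
        return sum((2**rel - 1) / log2(rank + 2) for rank, rel in enumerate(relevances))

    def ndcg(relevances):
        # The ideal ordering sorts the same judgments from most to least relevant.
        ideal_dcg = dcg(sorted(relevances, reverse=True))
        # Guard: a slate with no relevant items has an ideal DCG of zero.
        return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

    baseline = [3, 0, 2, 1]
    candidate = [3, 2, 1, 0]
    print("baseline ndcg:", round(ndcg(baseline), 3))
    print("candidate ndcg:", round(ndcg(candidate), 3))

Output

    baseline ndcg: 0.951
    candidate ndcg: 1.0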

The baseline still puts the most relevant bag first, so its score is high. The candidate improves the remaining top positions. This is why ranking evaluation needs graded judgments or trustworthy interactions, not a single accuracy bit.
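To see where the baseline's 0.951 comes from, the arithmetic can be written out by hand. This worked example applies the DCG formula to the baseline judgments [3, 0, 2, 1] and to their ideal ordering [3, 2, 1, 0]; intermediate values are rounded to three decimals.

```latex
\begin{aligned}
\text{DCG}_{\text{baseline}} &= \frac{2^3-1}{\log_2 2} + \frac{2^0-1}{\log_2 3} + \frac{2^2-1}{\log_2 4} + \frac{2^1-1}{\log_2 5}
  \approx 7 + 0 + 1.5 + 0.431 = 8.931 \\
\text{DCG}_{\text{ideal}} &= \frac{2^3-1}{\log_2 2} + \frac{2^2-1}{\log_2 3} + \frac{2^1-1}{\log_2 4} + \frac{2^0-1}{\log_2 5}
  \approx 7 + 1.893 + 0.5 + 0 = 9.393 \\
\text{NDCG}_{\text{baseline}} &= \frac{\text{DCG}_{\text{baseline}}}{\text{DCG}_{\text{ideal}}} \approx \frac{8.931}{9.393} \approx 0.951
\end{aligned}
```

Most of the score comes from the first slot: the top item contributes 7 of the ideal 9.393, which is why a correct first result keeps NDCG high even when lower positions are shuffled.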

Learning-to-rank work such as RankNet trains on pairwise preferences between documents or products so that relevant items outrank less relevant ones.[1] In production, pairwise learning is only one modeling choice. Whichever model you choose, the evidence contract stays the same: candidate-set version, feature snapshot, label policy, NDCG slices, latency, and online experiment plan.
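The pairwise idea can be illustrated with a simplified, RankNet-style logistic loss. This is a sketch, not the paper's full training procedure: it assumes model scores are already computed, treats every pair with a strictly higher judgment as a preference with target probability one, and uses a hypothetical helper name `pairwise_logistic_loss`.

```python
from math import exp, log

def pairwise_logistic_loss(scores, relevances):
    """Mean loss over ordered pairs where item i is judged more relevant than item j."""
    losses = []
    for s_i, r_i in zip(scores, relevances):
        for s_j, r_j in zip(scores, relevances):
            if r_i > r_j:
                # -log(sigmoid(s_i - s_j)) written as log(1 + exp(-(s_i - s_j)))
                losses.append(log(1.0 + exp(-(s_i - s_j))))
    return sum(losses) / len(losses) if losses else 0.0

relevances = [3, 0, 2, 1]                  # graded judgments for four products
agreeing_scores = [2.0, -1.0, 1.0, 0.0]    # score order matches judgment order
reversed_scores = [-1.0, 2.0, 0.0, 1.0]    # score order contradicts judgments

loss_agreeing = pairwise_logistic_loss(agreeing_scores, relevances)
loss_reversed = pairwise_logistic_loss(reversed_scores, relevances)
print("agreeing:", round(loss_agreeing, 3))
print("reversed:", round(loss_reversed, 3))
```

The loss shrinks as the score gap between a more relevant and a less relevant item grows in the right direction, so minimizing it pushes relevant items above less relevant ones without needing calibrated absolute scores.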

[Diagram: Eligible catalog (stock + policy), Candidates (fast retrieval), Ranker (query + product features), and Top results (logged exposure).]

Clicks Are Biased Labels

When a model places an item first, it receives more attention and therefore more clicks. Training directly on clicks rewards past ranking position as if it were product relevance. This is a feedback loop.
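A small calculation makes the loop concrete. The sketch below assumes a simple position-based click model in which a click requires the shopper to examine the slot and then find the item appealing; the examination probabilities and appeal values are invented for illustration, not measured.

```python
# Assumed probability that a shopper even looks at each position (hypothetical values).
EXAMINE_PROB = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.2}

def expected_click_rate(position, click_prob_if_seen):
    """Expected clicks per impression = P(examined at position) * P(click | examined)."""
    return EXAMINE_PROB[position] * click_prob_if_seen

# A mediocre product shown first versus a stronger product buried in slot five.
mediocre_on_top = expected_click_rate(position=1, click_prob_if_seen=0.10)
strong_but_buried = expected_click_rate(position=5, click_prob_if_seen=0.30)

print("mediocre product at position 1:", round(mediocre_on_top, 3))
print("strong product at position 5:  ", round(strong_but_buried, 3))
```

Under these assumptions the weaker product collects more clicks per impression than the stronger one, so a model trained on raw click counts would learn to keep it on top.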

Log every impression before using interactions:

| Logged field | Reason |
| --- | --- |
| request and query ID | group displayed slate |
| candidate set version | know what could have ranked |
| final positions | measure exposure |
| ranker version | attribute outcomes |
| click, cart, purchase, return | distinguish curiosity from value |
| inventory and price snapshot | reproduce context |
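One way to turn the logged fields above into a concrete schema is a small record type with one row per displayed product. This is a sketch only: the class name, field names, and example values are assumptions, and a real pipeline would likely join outcome events (cart, purchase, return) to the impression after the fact rather than writing them at display time.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class ImpressionRecord:
    request_id: str             # groups all rows of one displayed slate
    query_id: str
    candidate_set_version: str  # what could have ranked
    ranker_version: str         # attributes outcomes to a model
    product_id: str
    position: int               # 1-based final display position
    price_snapshot: float       # price shown at impression time
    in_stock_snapshot: bool     # inventory state at impression time
    clicked: bool = False
    carted: bool = False
    purchased: bool = False
    returned: bool = False

# Hypothetical example row.
record = ImpressionRecord(
    request_id="req-001",
    query_id="q-insulated-delivery-bag",
    candidate_set_version="cands-v3",
    ranker_version="ranker-v12",
    product_id="p1",
    position=1,
    price_snapshot=24.99,
    in_stock_snapshot=True,
    clicked=True,
)

line = json.dumps(asdict(record), sort_keys=True)
print(line)
```

Writing the impression row for every displayed position, including items nobody clicked, is what makes exposure measurable later.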

For a first portfolio system, evaluate with human relevance labels on a frozen query set, then define an A/B plan for online outcomes. Controlled online experiments are necessary because engagement changes can reflect latency, layout, price, stock, or ranking behavior; experiment design isolates the candidate change.[2]

Protect Users and the Dataset

Recommendations can narrow what customers see. Add constraints for unavailable items, unsafe listings, seller policy, and overconcentration. Monitor slices such as new products, small sellers, languages, and low-inventory categories rather than accepting one marketplace-wide NDCG.
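Slice monitoring can be sketched by grouping per-query NDCG by a slice label before averaging. The slice names and judgments below are hypothetical, and the `dcg` and `ndcg` helpers repeat the definitions used earlier so the block runs on its own.

```python
from collections import defaultdict
from math import log2

def dcg(relevances):
    return sum((2**rel - 1) / log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# (slice label, graded relevances in displayed order); illustrative data only.
judged_slates = [
    ("established", [3, 2, 1, 0]),
    ("established", [3, 2, 0, 1]),
    ("new_product", [0, 1, 3, 2]),
    ("new_product", [1, 0, 2, 3]),
]

by_slice = defaultdict(list)
for slice_name, relevances in judged_slates:
    by_slice[slice_name].append(ndcg(relevances))

slice_ndcg = {name: sum(values) / len(values) for name, values in by_slice.items()}
overall_ndcg = sum(ndcg(rels) for _, rels in judged_slates) / len(judged_slates)

print("overall:", round(overall_ndcg, 3))
for name, value in sorted(slice_ndcg.items()):
    print(name, round(value, 3))
```

In this toy data the overall average sits between a healthy slice and a weak one, which is exactly the situation a single marketplace-wide number hides.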

A useful release packet includes offline query judgments, candidate recall, NDCG@10, p95 scoring latency, diversity checks, impression logging schema, and an A/B stopping rule. A list that looks good offline isn't allowed to silently write its own future training labels.
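The release packet can be checked mechanically before a ranker ships. The sketch below is one possible shape, not a standard: every threshold, key name, and the `release_blockers` helper are illustrative assumptions that a team would replace with its own agreed values.

```python
# Hypothetical gates; all numbers are placeholders for team-agreed thresholds.
MIN_METRICS = {"candidate_recall": 0.90, "ndcg_at_10": 0.70}
MAX_METRICS = {"p95_scoring_latency_ms": 80.0}
REQUIRED_ARTIFACTS = [
    "offline_query_judgments",
    "diversity_checks",
    "impression_logging_schema",
    "ab_stopping_rule",
]

def release_blockers(packet):
    """Return human-readable reasons the packet is not ready; an empty list means ready."""
    blockers = []
    for name, floor in MIN_METRICS.items():
        value = packet.get(name)
        if value is None or value < floor:
            blockers.append(f"{name} is missing or below {floor}")
    for name, ceiling in MAX_METRICS.items():
        value = packet.get(name)
        if value is None or value > ceiling:
            blockers.append(f"{name} is missing or above {ceiling}")
    for name in REQUIRED_ARTIFACTS:
        if not packet.get(name):
            blockers.append(f"{name} is missing")
    return blockers

# Illustrative packet that forgot to define an A/B stopping rule.
packet = {
    "candidate_recall": 0.93,
    "ndcg_at_10": 0.74,
    "p95_scoring_latency_ms": 61.0,
    "offline_query_judgments": "judgments-frozen-set.csv",
    "diversity_checks": "seller-share-report.html",
    "impression_logging_schema": "impression-record-v1",
}

for reason in release_blockers(packet):
    print("BLOCKED:", reason)
```

Treating missing artifacts as blockers, not warnings, keeps a strong offline score from shipping without the logging and experiment plan that make its online effect measurable.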

Mastery Check

Evaluation rubric

| Evidence | A production-ready answer demonstrates |
| --- | --- |
| candidate contract | separates eligibility and retrieval from learned reranking |
| evaluation | uses ranking metrics and product guardrails instead of classification accuracy alone |
| feedback safety | logs impressions and outcomes and addresses position-driven bias |

Common Pitfalls

| Symptom | Cause | Fix |
| --- | --- | --- |
| NDCG looks strong but desired item never appears | candidates lost recall | measure retrieval recall separately |
| Popular products dominate forever | click-position loop | log exposure and use controlled evaluation |
| Relevance improves while returns rise | wrong online objective | pair engagement with purchase and return guardrails |

Next Step

Ranking orders choices at one moment. Next you will predict values over future time periods and distinguish recurring demand patterns from operational surprises.

References

[1] Burges, C. J. C., et al. (2005). Learning to Rank using Gradient Descent. ICML 2005.

[2] Kohavi, R., Tang, D., Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing.