Validation and Leakage

Make model and policy claims honestly: define the decision moment, split return episodes by time and customer, expose feature and preprocessing leakage, and audit LLM evaluation contamination.


The reinforcement-learning lesson ended with a routing policy for damaged returns. On its tiny table, human_review earned more expected reward once abandonment risk was included. That is not enough to ship a policy. A defensible performance claim needs evidence from return claims that did not shape the decision rule.

Validation asks whether a fitted rule works on new examples. Leakage occurs when evaluation lets the rule see information it could not have at decision time, or lets test data influence choices. Both mistakes make weak systems look strong.[1]

Figure: Evaluation design showing January through June training, July validation, and a locked August test window, a claim-opening decision gate that allows ambiguity and prior-return features but blocks later photo and refund outcomes, and repeated-customer rows crossing versus respecting split boundaries.
Honest evaluation freezes three boundaries: future months cannot shape the model, post-decision events cannot become features, and entities promised as unseen cannot cross between training and test.

Start at the Decision Moment

Suppose you are fitting a guardrail model for the return router. When a claim opens, it predicts whether the claim must go to human review. The label comes later from an audit, but the prediction must be made before photos, reviewer actions, and refund outcomes arrive.

For claim A10234, the decision moment is 2026-07-01 09:00.

| Field | When it exists | Valid model feature at opening? | Why |
|---|---|---|---|
| ambiguity_score | Before opening decision | Yes | Extracted from submitted claim text |
| prior_returns_90d | Before opening decision | Yes | Customer history already recorded |
| photo_received_at | After evidence request | No | The router's action can create this event |
| refund_reversed | After resolution | No | It reveals a later outcome |
| needs_review | After audit | Label only | It is what the guardrail predicts |

Write this feature contract before training a model. If a field does not exist when the action is chosen, it cannot enter features, prompt context, retrieval results, or preprocessing statistics.

audit-feature-availability.py

    from datetime import datetime

    decision_at = datetime.fromisoformat("2026-07-01T09:00")
    fields = [
        ("ambiguity_score", "2026-07-01T08:59"),
        ("prior_returns_90d", "2026-07-01T08:59"),
        ("photo_received_at", "2026-07-01T10:22"),
        ("refund_reversed", "2026-07-15T12:00"),
    ]

    for field, observed_at in fields:
        known_in_time = datetime.fromisoformat(observed_at) <= decision_at
        decision = "ALLOW" if known_in_time else "BLOCK"
        print(f"{field:18} {decision}")

Output

    ambiguity_score    ALLOW
    prior_returns_90d  ALLOW
    photo_received_at  BLOCK
    refund_reversed    BLOCK

The two blocked fields may be useful for later outcome analysis. They are not legal inputs for the opening decision.

A timestamp audit is necessary, but it is not sufficient. Rebuild derived features from an as-of snapshot and inspect their source windows too. A bad offline join can stamp a feature before the decision while still aggregating records that arrived later.
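A minimal sketch of that rebuild follows. It assumes a hypothetical list of return events with a `returned_at` timestamp; the event records and the helper name `prior_returns_as_of` are illustrative and not part of the lesson's dataset. The as-of count keeps only events inside the 90-day window that ends strictly before the decision moment, while a careless join counts a return that arrived later.

```python
from datetime import datetime, timedelta

# Hypothetical return events for one customer; timestamps are illustrative.
return_events = [
    {"customer": "c01", "returned_at": datetime(2026, 3, 15, 12, 0)},   # older than 90 days
    {"customer": "c01", "returned_at": datetime(2026, 5, 20, 9, 30)},   # inside the window
    {"customer": "c01", "returned_at": datetime(2026, 6, 28, 16, 45)},  # inside the window
    {"customer": "c01", "returned_at": datetime(2026, 7, 3, 11, 0)},    # after the decision
]


def prior_returns_as_of(events, customer, decision_at, window_days=90):
    """Count a customer's returns in [decision_at - window, decision_at)."""
    window_start = decision_at - timedelta(days=window_days)
    return sum(
        1
        for event in events
        if event["customer"] == customer
        and window_start <= event["returned_at"] < decision_at
    )


decision_at = datetime(2026, 7, 1, 9, 0)
as_of_count = prior_returns_as_of(return_events, "c01", decision_at)

# A careless join applies the window start but forgets the decision-time cutoff.
leaky_count = sum(
    1
    for event in return_events
    if event["customer"] == "c01"
    and event["returned_at"] >= decision_at - timedelta(days=90)
)

print("as-of prior_returns_90d =", as_of_count)
print("leaky prior_returns_90d =", leaky_count)
```

With these illustrative rows the as-of count is 2 and the leaky count is 3. Both values could carry the same feature timestamp in an offline table, which is why the source window needs its own check.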

Train, Validation, and Test Have Different Jobs

Every dataset split controls a different kind of influence:

| Split | Allowed use | Must not do |
|---|---|---|
| Train | Fit coefficients, trees, encoders, scalers, or learned features | Claim this score estimates deployment quality |
| Validation | Choose features, thresholds, model families, prompts, or reward settings | Reuse it as an untouched final estimate |
| Test | Estimate final performance after choices are frozen | Inspect errors, revise system, then report same test score as final |

Claims arrive over time, so the first honest train, validation, and test split is chronological:

  • January through June claims: training data.
  • July claims: validation data for choosing the review threshold.
  • August claims: locked test data, opened once after choices are done.

make-a-locked-time-split.py

    episodes = [
        {"claim": f"{month}-{index}", "month": month}
        for month in range(1, 9)
        for index in range(2)
    ]

    train = [row for row in episodes if row["month"] <= 6]
    validation = [row for row in episodes if row["month"] == 7]
    test = [row for row in episodes if row["month"] == 8]

    print("train claims =", len(train), "months = 1..6")
    print("valid claims =", len(validation), "months = 7")
    print("test claims =", len(test), "months = 8 (locked)")

Output

    train claims = 12 months = 1..6
    valid claims = 2 months = 7
    test claims = 2 months = 8 (locked)

Figure: January through June claims fit the model and transforms, July claims choose the threshold, August claims provide one locked estimate, and September traffic is used to monitor outcomes.

If you inspect August failures and revise the pipeline, that work may be excellent engineering. It also spends the August test set. You then need a later untouched window for a final estimate.

Cross-Validation Rotates the Validation Job

One validation month can be unusual. Cross-validation (CV) rotates the held-out role through earlier data: fit on some folds, score on one unseen fold, repeat, and average the scores. It is a tool for model development, not permission to open the final test set repeatedly.[1]

Imagine five earlier-history folds give these F1 scores for the review guardrail: 0.67, 0.75, 0.58, 0.83, and 0.71. F1 is useful here because review-required claims may be less frequent than routine ones, and both missed reviews and excessive reviews matter.

summarize-cross-validation-folds.py

    import numpy as np

    fold_f1 = np.array([0.67, 0.75, 0.58, 0.83, 0.71])

    print("fold F1 =", fold_f1.tolist())
    print(f"mean F1 = {fold_f1.mean():.2f}")
    print(f"std F1 = {fold_f1.std(ddof=1):.2f}")

Output

    fold F1 = [0.67, 0.75, 0.58, 0.83, 0.71]
    mean F1 = 0.71
    std F1 = 0.09

The mean of 0.71 summarizes performance across folds. The standard deviation of 0.09 says the result is not equally reliable everywhere. Investigate the weak fold before automating high-cost actions.

Ordinary shuffled CV assumes rows are exchangeable: any row could plausibly have arrived in any fold. That assumption breaks when the future differs from the past or when several rows come from one customer, order, merchant, document, or conversation.

Leak 1: A Future Outcome Looks Like an Amazing Feature

Suppose refund_reversed is set after a claim closes. It almost reveals whether review was required. Adding it to an opening-time guardrail can produce a spectacular metric and an unusable model.

This experiment creates eight months of return claims. It fits on months 1 through 6 and scores months 7 and 8. The only difference between the two feature sets is a post-close audit field that copies the target.

future-field-leakage.py

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(9)
    month = np.repeat(np.arange(1, 9), 50)
    ambiguity = rng.normal(0, 1, len(month))
    policy_support = rng.integers(0, 2, len(month))
    needs_review = (
        ambiguity
        - 0.9 * policy_support
        + rng.normal(0, 1.1, len(month))
        > 0
    ).astype(int)

    # This field is recorded after resolution, so it is forbidden at claim opening.
    post_close_audit = needs_review.copy()
    train_mask = month <= 6
    future_mask = month >= 7

    decision_time_X = np.column_stack([ambiguity, policy_support])
    future_field_X = np.column_stack([ambiguity, policy_support, post_close_audit])

    for name, features in [
        ("decision-time", decision_time_X),
        ("future-field", future_field_X),
    ]:
        model = LogisticRegression().fit(features[train_mask], needs_review[train_mask])
        predictions = model.predict(features[future_mask])
        score = accuracy_score(needs_review[future_mask], predictions)
        print(f"{name:13} accuracy={score:.2f}")

Output

    decision-time accuracy=0.73
    future-field  accuracy=1.00

1.00 is not evidence of a brilliant guardrail. Here it proves that the evaluation admitted its answer key. The 0.73 result asks a real question: how well can opening-time information identify later review decisions?

Leak 2: Repeated Customers Change the Promise

A random row split can be correct when production will repeatedly serve known customers. It is wrong when you claim the model generalizes to new customers while each customer's earlier rows appear in training.

The next dataset is intentionally stark. A customer-specific pattern determines every label. A model with customer_id memorizes customers perfectly when their rows are scattered through folds, then fails on customer IDs it never encountered.

group-leakage-from-repeat-customers.py

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GroupKFold, StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    customers = np.repeat([f"c{i:02}" for i in range(60)], 4)
    needs_review = np.repeat([i % 2 for i in range(60)], 4)
    features = pd.DataFrame({"customer_id": customers})
    model = make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),
        LogisticRegression(max_iter=1000),
    )

    row_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)
    group_cv = GroupKFold(n_splits=5)

    row_score = cross_val_score(
        model, features, needs_review, cv=row_cv, scoring="accuracy"
    ).mean()
    group_score = cross_val_score(
        model,
        features,
        needs_review,
        groups=customers,
        cv=group_cv,
        scoring="accuracy",
    ).mean()

    print(f"random rows mean accuracy = {row_score:.2f}")
    print(f"new customers mean accuracy = {group_score:.2f}")

Output

    random rows mean accuracy = 1.00
    new customers mean accuracy = 0.50

Neither metric is universally correct. 1.00 answers, "Can I recognize customers already represented in history?" 0.50 answers, "Can customer identity alone help on unseen customers?" Write the intended deployment promise beside the score.

Leak 3: Preprocessing Can Learn From Validation Data

Leakage does not require a suspicious column. A scaler, imputer, vocabulary, principal component analysis (PCA) transform, or feature selector learns parameters during fit. If it sees all rows before a split, the validation data already affected the trained pipeline.

Figure: Random-label preprocessing leakage comparison with a full feature matrix selecting columns before row splitting and scoring 0.74, versus train and validation matrices separated before feature selection with a train-fitted mask applied to validation and scoring 0.52.
Fitting selection on the full matrix lets validation labels influence which columns survive, inflating random-label accuracy to `0.74`. Splitting first keeps the validation rows outside every fit and returns the expected near-chance `0.52`.

The experiment below uses scikit-learn's Pipeline with random labels.[2] If a feature selector sees all labels before splitting, it can select chance correlations that happen to fit the test rows. Fitting selection inside the pipeline removes that advantage.

preprocessing-leakage-random-labels.py

    import numpy as np
    from sklearn.feature_selection import SelectKBest
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(42)
    features = rng.standard_normal((200, 2_000))
    random_labels = rng.choice(2, 200)

    selected = SelectKBest(k=25).fit_transform(features, random_labels)
    leaky_train_X, leaky_test_X, leaky_train_y, leaky_test_y = train_test_split(
        selected, random_labels, random_state=42
    )
    leaky_model = DecisionTreeClassifier(max_depth=4, random_state=1).fit(
        leaky_train_X, leaky_train_y
    )

    train_X, test_X, train_y, test_y = train_test_split(
        features, random_labels, random_state=42
    )
    safe_model = make_pipeline(
        SelectKBest(k=25),
        DecisionTreeClassifier(max_depth=4, random_state=1),
    ).fit(train_X, train_y)

    print(
        "fit selector before split accuracy =",
        f"{accuracy_score(leaky_test_y, leaky_model.predict(leaky_test_X)):.2f}",
    )
    print(
        "pipeline after split accuracy =",
        f"{accuracy_score(test_y, safe_model.predict(test_X)):.2f}",
    )

Output

    fit selector before split accuracy = 0.74
    pipeline after split accuracy = 0.52

Because the labels are random, the true attainable accuracy is around chance. The inflated 0.74 score comes entirely from leakage. In a real claim model, you often will not know the true attainable score, which is why pipeline discipline matters.
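The same fit boundary applies to unsupervised steps such as scaling. The sketch below uses only the standard library and made-up ambiguity scores, with the validation month deliberately shifted; `fit_scaler` and `transform` are hypothetical helpers that mimic what a standardizing transform learns (a mean and a population standard deviation), not scikit-learn's implementation.

```python
from statistics import mean, pstdev

# Hypothetical ambiguity scores; the validation month is shifted on purpose.
train_scores = [0.2, 0.4, 0.1, 0.3, 0.5]
validation_scores = [1.4, 1.6]


def fit_scaler(values):
    """Learn centering and scaling parameters from the given rows only."""
    return mean(values), pstdev(values)


def transform(values, params):
    """Apply previously learned parameters without refitting."""
    center, scale = params
    return [(value - center) / scale for value in values]


safe_params = fit_scaler(train_scores)                       # training rows only
leaky_params = fit_scaler(train_scores + validation_scores)  # validation rows leak in

print(f"train-only mean = {safe_params[0]:.2f}")
print(f"all-rows mean   = {leaky_params[0]:.2f}")
print(
    "validation scaled with train-only parameters =",
    [round(value, 2) for value in transform(validation_scores, safe_params)],
)
```

The train-only mean is 0.30, while fitting on all rows pulls it to about 0.64. With train-only parameters the shifted validation month shows up as large scaled values, which is the signal a leaky fit would have partly hidden.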

Time-Ordered Validation Preserves Causality

Return routing faces seasonal traffic, policy changes, and changing customer behavior. A model fitted on August should not be used to predict June in an experiment that claims future performance. TimeSeriesSplit creates expanding history: each validation window comes after its training window.

walk-forward-validation-windows.py

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    months = np.repeat(np.arange(1, 9), 3)
    splitter = TimeSeriesSplit(n_splits=3)

    for fold, (train_index, validation_index) in enumerate(splitter.split(months), start=1):
        train_months = months[train_index]
        validation_months = months[validation_index]
        print(
            f"fold {fold}: train <= {train_months.max()} "
            f"validate = {validation_months.min()}..{validation_months.max()}"
        )

Output

    fold 1: train <= 2 validate = 3..4
    fold 2: train <= 4 validate = 5..6
    fold 3: train <= 6 validate = 7..8

If one customer can appear over many months and you need a promise about unseen customers, time ordering alone is insufficient. Hold out future months and keep the relevant customer boundary intact inside the evaluation design.
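One way to combine both boundaries is sketched below, assuming hypothetical claim rows of the form (claim_id, customer_id, month). For brevity it treats months 7 and 8 together as the future window; a full design would apply the same filter separately to the validation and test months.

```python
# Hypothetical claim rows: (claim_id, customer_id, month).
claims = [
    ("A1", "c01", 2), ("A2", "c02", 3), ("A3", "c01", 5),
    ("A4", "c03", 6), ("A5", "c02", 7), ("A6", "c04", 7),
    ("A7", "c01", 8), ("A8", "c05", 8),
]

# Time boundary: training stops at month 6, evaluation starts at month 7.
train = [row for row in claims if row[2] <= 6]
future = [row for row in claims if row[2] >= 7]

# Group boundary: customers already seen in training cannot support
# a claim about new-customer quality.
seen_customers = {customer for _, customer, _ in train}
new_customer_eval = [row for row in future if row[1] not in seen_customers]
returning_eval = [row for row in future if row[1] in seen_customers]

print("train claims        =", [claim for claim, _, _ in train])
print("new-customer eval   =", [claim for claim, _, _ in new_customer_eval])
print("returning-only eval =", [claim for claim, _, _ in returning_eval])
```

Reporting the two future slices separately keeps each score tied to its own promise: A6 and A8 speak to unseen customers, while A5 and A7 speak to returning ones.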

Validation Chooses; Test Reports

A probability model still needs an action threshold. Choosing that threshold from the test set is leakage, because the test labels then influence the deployed decision rule.

Below, July validation data chooses among three review thresholds. Only after selection is frozen does August report one test score.

choose-threshold-before-opening-test.py

    import numpy as np
    from sklearn.metrics import f1_score

    thresholds = [0.30, 0.50, 0.70]
    validation_probability = np.array([0.82, 0.62, 0.58, 0.42, 0.31, 0.12])
    validation_label = np.array([1, 1, 1, 0, 0, 0])

    test_probability = np.array([0.77, 0.49, 0.39, 0.52, 0.66, 0.14])
    test_label = np.array([1, 1, 0, 0, 1, 0])

    validation_f1 = {
        threshold: f1_score(validation_label, validation_probability >= threshold)
        for threshold in thresholds
    }
    chosen = max(validation_f1, key=validation_f1.get)

    for threshold, score in validation_f1.items():
        print(f"threshold={threshold:.2f} valid_f1={score:.2f}")
    print(
        f"chosen threshold={chosen:.2f} "
        f"test_f1={f1_score(test_label, test_probability >= chosen):.2f}"
    )

Output

    threshold=0.30 valid_f1=0.75
    threshold=0.50 valid_f1=1.00
    threshold=0.70 valid_f1=0.50
    chosen threshold=0.50 test_f1=0.67

The test score may be lower than validation. That is not a reason to retune on test; it is information about how uncertain your claimed performance is. Log the threshold, split definition, feature contract, and final metric together.
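A minimal sketch of such a log entry follows. The numbers are copied from the threshold lab above; the record layout and key names are assumptions for illustration, not a required schema.

```python
import json

# Values come from the threshold lab; the field names are illustrative.
evaluation_record = {
    "decision_moment": "claim opening",
    "allowed_features": ["ambiguity_score", "policy_support", "prior_returns_90d"],
    "split": {
        "train": "months 1-6",
        "validation": "month 7",
        "test": "month 8 (locked)",
    },
    "threshold": 0.50,
    "threshold_chosen_on": "validation",
    "metric": "f1",
    "validation_score": 1.00,
    "test_score": 0.67,
}

# Sorted keys keep the serialized record stable across runs, which makes diffs readable.
serialized = json.dumps(evaluation_record, indent=2, sort_keys=True)
print(serialized)
```

Keeping the threshold, its selection split, and the test score in one record makes it harder to quote the metric without the choices that produced it.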

LLM Evaluation Has the Same Boundaries

Language-model systems add three common ways to leak information:

| System | Leakage path | Honest boundary |
|---|---|---|
| Retrieval-augmented generation (RAG) | Chunks from one source document appear in both tuning and test corpora | Split documents before chunking |
| Fine-tuning or prompting | Eval questions, answers, or worked traces enter training examples or prompt demonstrations | Maintain provenance and exclude eval material |
| Public benchmark evaluation | A benchmark item may be present in model training data | Prefer fresh or contamination-limited test material and report uncertainty |

For a return-policy assistant, splitting after chunking is too late: chunks from the same policy page can land on both sides. The retriever then appears to generalize while it is retrieving near-copies of documents seen during development.

split-documents-before-chunking.py

    documents = {
        "policy_returns": [
            "damaged items need review",
            "photo evidence may support refund",
        ],
        "policy_shipping": [
            "late parcel gets tracking check",
            "carrier scan precedes claim",
        ],
        "policy_sellers": [
            "seller appeal needs audit",
            "audit stores decision reason",
        ],
    }
    chunks = [(document, chunk) for document, texts in documents.items() for chunk in texts]

    bad_train = chunks[::2]
    bad_test = chunks[1::2]
    bad_overlap = sorted(
        {document for document, _ in bad_train}
        & {document for document, _ in bad_test}
    )

    safe_train_documents = {"policy_returns", "policy_shipping"}
    safe_test_documents = {"policy_sellers"}
    safe_overlap = sorted(safe_train_documents & safe_test_documents)

    print("split chunks first shared documents =", bad_overlap)
    print("split documents first shared documents =", safe_overlap)

Output

    split chunks first shared documents = ['policy_returns', 'policy_sellers', 'policy_shipping']
    split documents first shared documents = []
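In a larger corpus the document-level assignment is usually computed rather than hand-written. The sketch below is one possible approach, assuming a stable hash of the document ID decides the side before any chunk is emitted; `assign_split` and the 34 percent test share are hypothetical choices for illustration.

```python
import hashlib

# Hypothetical document IDs and chunks, mirroring the lab above.
documents = {
    "policy_returns": ["damaged items need review", "photo evidence may support refund"],
    "policy_shipping": ["late parcel gets tracking check", "carrier scan precedes claim"],
    "policy_sellers": ["seller appeal needs audit", "audit stores decision reason"],
}


def assign_split(document_id, test_percent=34):
    """Map a document ID to a split using a stable hash, not row order."""
    digest = hashlib.sha256(document_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "test" if bucket < test_percent else "train"


splits = {"train": [], "test": []}
for document_id, chunks in documents.items():
    side = assign_split(document_id)  # decided once per document
    splits[side].extend((document_id, chunk) for chunk in chunks)

train_documents = {document_id for document_id, _ in splits["train"]}
test_documents = {document_id for document_id, _ in splits["test"]}
print("shared documents =", sorted(train_documents & test_documents))
```

Because the side depends only on the document ID, re-chunking or adding chunks later cannot move part of a page across the boundary. With only three documents one side may end up empty, so a tiny corpus still needs a manual check of the resulting split sizes.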

Public LLM benchmarks have a harder version of this problem: training corpora can include published test material. LiveBench was designed around frequently updated questions from recent information sources with objective scoring to reduce contamination risk, not to make contamination impossible for all future uses.[3]

A cheap local smoke test searches for exact phrase overlap between your own training and evaluation artifacts:

ngram-contamination-smoke-test.py

    training_prompts = [
        "route damaged return with required review to human queue",
        "approve refund after clear photo evidence arrives",
    ]
    evaluation_prompts = [
        "route damaged return with required review to human queue",
        "send an ambiguous damaged claim to a reviewer",
    ]

    def ngrams(text, size=5):
        words = text.lower().split()
        return {
            " ".join(words[start : start + size])
            for start in range(len(words) - size + 1)
        }

    training_grams = set().union(*(ngrams(prompt) for prompt in training_prompts))
    for prompt in evaluation_prompts:
        overlaps = ngrams(prompt) & training_grams
        status = "FLAG" if overlaps else "clear"
        print(f"{status}: {prompt}")

Output

    FLAG: route damaged return with required review to human queue
    clear: send an ambiguous damaged claim to a reviewer

This flags copied wording. It does not prove the second prompt is safe: paraphrases, translated items, and memorized solution patterns can evade exact matching. Treat overlap checks as a reason to investigate, not as a certificate of clean evaluation.
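A slightly looser lexical check can catch some lightly edited copies that exact five-word matching misses. The sketch below uses word-set Jaccard similarity; the padded evaluation prompt and the 0.6 flag threshold are illustrative assumptions, not calibrated values. It is still a lexical test, so a genuine paraphrase passes it too.

```python
training_prompts = [
    "route damaged return with required review to human queue",
    "approve refund after clear photo evidence arrives",
]
# Hypothetical evaluation prompts: a copy padded with articles, and a true paraphrase.
evaluation_prompts = [
    "route the damaged return with a required review to the human queue",
    "send an ambiguous damaged claim to a reviewer",
]


def jaccard(left, right):
    """Word-set overlap: size of intersection divided by size of union."""
    left_words = set(left.lower().split())
    right_words = set(right.lower().split())
    return len(left_words & right_words) / len(left_words | right_words)


FLAG_AT = 0.6  # illustrative threshold, not a calibrated value

results = []
for prompt in evaluation_prompts:
    # Compare against the closest training prompt.
    best = max(jaccard(prompt, train) for train in training_prompts)
    status = "FLAG" if best >= FLAG_AT else "clear"
    results.append((status, round(best, 2)))
    print(f"{status}: {best:.2f} {prompt}")
```

The padded copy shares no exact five-word sequence with its source, yet its word-set overlap is 0.82 and it is flagged. The paraphrase scores 0.13 and stays clear, which shows why provenance tracking and semantic or manual review are still needed.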

A Necessary Caveat for the RL Policy

These labs evaluate a supervised guardrail: given information available when a claim opens, predict whether human review is required. They do not by themselves estimate the reward of a brand-new sequential policy.

Historical trajectories reveal the outcome of the action that was taken. They usually do not reveal what would have happened if the router had selected a different action for the same customer. A time split prevents future leakage, but it does not fill in those missing counterfactual outcomes. For an RL policy, pair split discipline with a validated simulator, carefully logged exploration or propensities for off-policy evaluation, human review, or a staged online experiment before automation.[4]
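To make the role of logged propensities concrete, here is a minimal sketch of inverse propensity scoring (IPS), one standard off-policy estimator. The log rows, rewards, propensities, and the `new_policy` rule are all hypothetical. The estimate is only trustworthy when the logged propensities are correct and the old router gave nonzero probability to every action the new policy would take.

```python
# Hypothetical logged decisions: the old router's action, the probability it
# assigned to that action (propensity), and the reward observed afterward.
logs = [
    {"context": "ambiguous", "action": "human_review", "propensity": 0.8, "reward": 1.0},
    {"context": "ambiguous", "action": "auto_refund", "propensity": 0.2, "reward": -2.0},
    {"context": "clear", "action": "auto_refund", "propensity": 0.9, "reward": 1.0},
    {"context": "clear", "action": "human_review", "propensity": 0.1, "reward": 0.5},
]


def new_policy(context):
    """Deterministic candidate policy to evaluate offline."""
    return "human_review" if context == "ambiguous" else "auto_refund"


def ips_estimate(rows, policy):
    """Reweight rows where the logged action matches the candidate policy."""
    total = 0.0
    for row in rows:
        if policy(row["context"]) == row["action"]:
            total += row["reward"] / row["propensity"]
    # Rows with a different logged action contribute zero to the sum.
    return total / len(rows)


estimate = ips_estimate(logs, new_policy)
matched = sum(new_policy(row["context"]) == row["action"] for row in logs)
print(f"rows matching the new policy = {matched} of {len(logs)}")
print(f"IPS value estimate = {estimate:.3f}")
```

Only two of the four logged rows match the candidate policy, so the estimate rests on half the data and on the quality of the recorded propensities. That dependence is why the staged online experiment and human review remain part of the plan.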

Write the Evaluation Contract

Before publishing a score, create a short, reviewable artifact:

return-review-evaluation-contract.txt

    Decision moment: claim opening, before evidence request or reviewer action
    Allowed features: ambiguity_score, policy_support, prior_returns_90d
    Forbidden features: photo_received_at, reviewer_action, refund_reversed
    Train: January-June claims
    Validation: July claims; choose threshold and features here
    Test: August claims; open once after choices are frozen
    Entity promise: new-customer quality, so no customer crosses split boundary
    Preprocessing: every fit step is inside training fold Pipeline
    Feature snapshot: as of claim opening; aggregation windows stop before decision time
    Metrics: review-required F1 plus unsupported-refund count

The metric is only the last line of reasoning. The contract states what the metric means.

Practice Tasks

  1. Add reviewer_action to the feature-audit lab and assign an observation time. Explain why it is label-adjacent leakage.
  2. Change group-leakage-from-repeat-customers.py so prediction is for returning customers rather than new customers. State whether row-wise splitting now answers a relevant product question and what time boundary is still needed.
  3. Replace SelectKBest with StandardScaler or PCA inside a Pipeline. Identify which learned values must come only from training rows.
  4. Extend the n-gram smoke-test lab with a paraphrased evaluation prompt. Explain why the n-gram check misses it and what manual or semantic review you would add.
  5. Write an evaluation contract for the RL policy itself. Mark which outcomes are observed and which are counterfactual.

Practice guidance

  1. reviewer_action exists only after routing and review work begin. It belongs in outcome analysis, not opening-time features.
  2. A row-wise split can answer a returning-customer question if prior customer history is legitimately available at decision time. Keep the evaluation chronological so later activity does not leak backward.
  3. StandardScaler must learn means and variances from training rows. PCA must learn its centering values and components from training rows. A Pipeline keeps those fit calls inside the split.
  4. A paraphrase may share no exact five-word sequence with its source. Add document provenance, semantic near-duplicate search, and manual review of flagged pairs.
  5. An RL contract should log action, propensity, reward, and decision-time context. Historical logs observe the chosen action's outcome; alternative-action outcomes remain counterfactual and need stronger evaluation before automation.

Key Concepts

  • decision moment and feature availability
  • training, validation, and locked test roles
  • cross-validation and score variation
  • future-outcome leakage
  • group leakage
  • preprocessing fit boundaries
  • time-ordered evaluation
  • threshold selection
  • RAG corpus leakage
  • LLM benchmark contamination
  • policy counterfactuals

Evaluation rubric

  • Foundational: Identifies fields unavailable at decision time and keeps a locked test set separate from model choices.
  • Intermediate: Selects time, group, and pipeline boundaries that match a stated deployment promise and reproduces the leak demonstrations.
  • Advanced: Designs an evaluation contract for an LLM or sequential policy system, including contamination and counterfactual limitations.


Common pitfalls

  • Symptom: An opening-time model becomes nearly perfect after adding outcome fields. Cause: Later events reveal the label. Fix: Maintain and test a decision-time feature allowlist.
  • Symptom: Random-fold performance collapses on new customers or documents. Cause: Repeated entities crossed folds. Fix: Match group boundaries to the deployment claim.
  • Symptom: A null-data experiment beats chance after feature selection. Cause: Preprocessing fit on held-out labels. Fix: Split first and fit transforms inside a Pipeline.
  • Symptom: A team keeps improving after reading test failures but still reports the original test as final. Cause: The test set became development data. Fix: Lock a later untouched window.
  • Symptom: A RAG or LLM benchmark score looks excellent, but source provenance is unknown. Cause: Document overlap or benchmark contamination may be present. Fix: Preserve provenance, split before chunking, scan overlaps, and prefer fresh held-out evaluation material.

Mastery Check


  1. A return guardrail must score claim A10234 at 2026-07-01 09:00. The final record later contains ambiguity_score, prior_returns_90d, photo_received_at, refund_reversed, and the audit label needs_review. Which feature contract is valid for the opening-time model?
  2. Claims span months 1 through 8, and the score is meant to estimate future performance under seasonal traffic and policy drift. Which validation design preserves causal order?
  3. Five earlier-history folds for a review guardrail have F1 scores 0.67, 0.75, 0.58, 0.83, and 0.71. The team wants to pick a model before opening the locked August test. What is the right use of this result?
  4. A team promises that its review model works for customers not present in development history. The data has four claims per customer, and customer_id is included as a feature. Random row cross-validation gives 1.00 accuracy, while a customer-group split gives 0.50. Which statement is defensible?
  5. In a null experiment, labels are random. A team fits SelectKBest on all rows before splitting, then trains a tree and gets 0.74 accuracy on held-out rows. The same selector inside a Pipeline fit after the split gives 0.52. What conclusion follows?
  6. July validation labels give F1 scores of 0.75, 1.00, and 0.50 for review thresholds 0.30, 0.50, and 0.70. August is the locked test month. Which procedure preserves a valid final estimate?
  7. A return-policy RAG corpus has three source pages, each split into two chunks. The team alternates chunks into tuning and test sets, so every source page has one chunk in each set. To avoid document-level leakage and test whether the system generalizes beyond sources used during tuning, what should the team change?
  8. An LLM evaluation set has one prompt with exact five-word overlap with a training prompt and another paraphrase with no shared five-word sequence. How should the contamination check be used?
  9. Historical router logs contain each claim's opening context, the action the old router took, and the reward observed after resolution. A new RL policy would choose a different action for some August claims. What limitation remains even with a chronological train/test split?



You now know how to keep labels and evaluation boundaries honest; next you will inspect vector structure when clean labels do not yet exist.

References

[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning.
[2] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. JMLR.
[3] White, C., et al. (2024). LiveBench: A Challenging, Contamination-Limited LLM Benchmark.
[4] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.