Make model and policy claims honestly: define the decision moment, split return episodes by time and customer, expose feature and preprocessing leakage, and audit LLM evaluation contamination.
The reinforcement-learning lesson ended with a routing policy for damaged returns. On its tiny table, human_review earned more expected reward once abandonment risk was included. That is not enough to ship a policy. A scientific claim about the policy needs evidence from return claims that did not shape the decision rule.
Validation asks whether a fitted rule works on new examples. Leakage occurs when evaluation lets the rule see information it could not have at decision time, or lets test data influence choices. Both mistakes make weak systems look strong.[1]
Suppose you are fitting a guardrail model for the return router. When a claim opens, it predicts whether the claim must go to human review. The label comes later from an audit, but the prediction must be made before photos, reviewer actions, and refund outcomes arrive.
For claim A10234, the decision moment is 2026-07-01 09:00.
| Field | When it exists | Valid model feature at opening? | Why |
|---|---|---|---|
| ambiguity_score | Before opening decision | Yes | Extracted from submitted claim text |
| prior_returns_90d | Before opening decision | Yes | Customer history already recorded |
| photo_received_at | After evidence request | No | The router's action can create this event |
| refund_reversed | After resolution | No | It reveals a later outcome |
| needs_review | After audit | Label only | It is what the guardrail predicts |
Write this feature contract before training a model. If a field does not exist when the action is chosen, it cannot enter features, prompt context, retrieval results, or preprocessing statistics.
```python
from datetime import datetime

decision_at = datetime.fromisoformat("2026-07-01T09:00")
fields = [
    ("ambiguity_score", "2026-07-01T08:59"),
    ("prior_returns_90d", "2026-07-01T08:59"),
    ("photo_received_at", "2026-07-01T10:22"),
    ("refund_reversed", "2026-07-15T12:00"),
]

for field, observed_at in fields:
    known_in_time = datetime.fromisoformat(observed_at) <= decision_at
    decision = "ALLOW" if known_in_time else "BLOCK"
    print(f"{field:18} {decision}")
```

```text
ambiguity_score    ALLOW
prior_returns_90d  ALLOW
photo_received_at  BLOCK
refund_reversed    BLOCK
```

The two blocked fields may be useful for later outcome analysis. They are not legal inputs for the opening decision.
A timestamp audit is necessary, but it is not sufficient. Rebuild derived features from an as-of snapshot and inspect their source windows too. A bad offline join can stamp a feature before the decision while still aggregating records that arrived later.
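The as-of rebuild described above can be sketched as follows. This is a minimal illustration, not the lesson's pipeline: the event list, the split into an occurrence time and a recording time, and the helper name `prior_returns_as_of` are all assumptions made for the example. It recomputes prior_returns_90d for claim A10234 and compares it with a naive count that ignores when records arrived.

```python
from datetime import datetime, timedelta

decision_at = datetime.fromisoformat("2026-07-01T09:00")
window_start = decision_at - timedelta(days=90)

# Hypothetical return events for one customer: (occurred_at, recorded_at).
events = [
    ("2026-03-15T10:00", "2026-03-15T10:05"),  # older than the 90-day window
    ("2026-05-20T14:00", "2026-05-20T14:02"),  # in window, recorded in time
    ("2026-06-28T09:30", "2026-07-02T08:00"),  # in window, but recorded late
    ("2026-07-03T11:00", "2026-07-03T11:01"),  # happened after the decision
]


def prior_returns_as_of(events, window_start, decision_at):
    """Count returns that occurred in the window AND were recorded before the decision."""
    count = 0
    for occurred_at, recorded_at in events:
        occurred = datetime.fromisoformat(occurred_at)
        recorded = datetime.fromisoformat(recorded_at)
        if window_start <= occurred < decision_at and recorded < decision_at:
            count += 1
    return count


# A naive offline join that only applies the lower window bound.
naive_count = sum(
    1
    for occurred_at, _ in events
    if datetime.fromisoformat(occurred_at) >= window_start
)
as_of_count = prior_returns_as_of(events, window_start, decision_at)

print("naive join count =", naive_count)
print("as-of count      =", as_of_count)
```

The naive count includes a late-recorded return and a return that happened after the decision, so the feature would look valid by its name while carrying future information.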
Every dataset split controls a different kind of influence:
| Split | Allowed use | Must not do |
|---|---|---|
| Train | Fit coefficients, trees, encoders, scalers, or learned features | Claim this score estimates deployment quality |
| Validation | Choose features, thresholds, model families, prompts, or reward settings | Reuse it as an untouched final estimate |
| Test | Estimate final performance after choices are frozen | Inspect errors, revise system, then report same test score as final |
Claims arrive over time, so the first honest train, validation, and test split is chronological:
```python
episodes = [
    {"claim": f"{month}-{index}", "month": month}
    for month in range(1, 9)
    for index in range(2)
]

train = [row for row in episodes if row["month"] <= 6]
validation = [row for row in episodes if row["month"] == 7]
test = [row for row in episodes if row["month"] == 8]

print("train claims =", len(train), "months = 1..6")
print("valid claims =", len(validation), "months = 7")
print("test claims =", len(test), "months = 8 (locked)")
```

```text
train claims = 12 months = 1..6
valid claims = 2 months = 7
test claims = 2 months = 8 (locked)
```
If you inspect August failures and revise the pipeline, that work may be excellent engineering. It also spends the August test set. You then need a later untouched window for a final estimate.
One validation month can be unusual. Cross-validation (CV) rotates the held-out role through earlier data: fit on some folds, score on one unseen fold, repeat, and average the scores. It is a tool for model development, not permission to open the final test set repeatedly.[1]
Imagine five earlier-history folds give these F1 scores for the review guardrail: 0.67, 0.75, 0.58, 0.83, and 0.71. F1 is useful here because review-required claims may be less frequent than routine ones, and both missed reviews and excessive reviews matter.
```python
import numpy as np

fold_f1 = np.array([0.67, 0.75, 0.58, 0.83, 0.71])

print("fold F1 =", fold_f1.tolist())
print(f"mean F1 = {fold_f1.mean():.2f}")
print(f"std F1 = {fold_f1.std(ddof=1):.2f}")
```

```text
fold F1 = [0.67, 0.75, 0.58, 0.83, 0.71]
mean F1 = 0.71
std F1 = 0.09
```

The mean of 0.71 summarizes performance across folds. The standard deviation of 0.09 says the result is not equally reliable everywhere. Investigate the weak fold before automating high-cost actions.
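To make the earlier point about F1 and infrequent review-required claims concrete, here is a small worked calculation. The confusion counts are invented for illustration (100 claims, 10 of which need review) and the helper name is hypothetical. It compares a rule that never requests review with a rule that catches some of the review-required claims.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion counts, using 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical counts: name -> (tp, fp, fn, tn).
rules = {
    "never review": (0, 0, 10, 90),
    "guardrail": (6, 4, 4, 86),
}

results = {}
for name, (tp, fp, fn, tn) in rules.items():
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    _, _, f1 = precision_recall_f1(tp, fp, fn)
    results[name] = (accuracy, f1)
    print(f"{name:13} accuracy={accuracy:.2f} f1={f1:.2f}")
```

Under these assumed counts the two rules have similar accuracy, while F1 separates them sharply because it ignores the large pool of correctly skipped routine claims.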
Ordinary shuffled CV assumes rows are exchangeable: any row could plausibly have arrived in any fold. That assumption breaks when the future differs from the past or when several rows come from one customer, order, merchant, document, or conversation.
Suppose refund_reversed is set after a claim closes. It almost reveals whether review was required. Adding it to an opening-time guardrail can produce a spectacular metric and an unusable model.
This experiment creates eight months of return claims. It fits on months 1 through 6 and scores months 7 and 8. The only difference between the two feature sets is a post-close audit field that copies the target.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(9)
month = np.repeat(np.arange(1, 9), 50)
ambiguity = rng.normal(0, 1, len(month))
policy_support = rng.integers(0, 2, len(month))
needs_review = (
    ambiguity
    - 0.9 * policy_support
    + rng.normal(0, 1.1, len(month))
    > 0
).astype(int)

# This field is recorded after resolution, so it is forbidden at claim opening.
post_close_audit = needs_review.copy()
train_mask = month <= 6
future_mask = month >= 7

decision_time_X = np.column_stack([ambiguity, policy_support])
future_field_X = np.column_stack([ambiguity, policy_support, post_close_audit])

for name, features in [
    ("decision-time", decision_time_X),
    ("future-field", future_field_X),
]:
    model = LogisticRegression().fit(features[train_mask], needs_review[train_mask])
    predictions = model.predict(features[future_mask])
    score = accuracy_score(needs_review[future_mask], predictions)
    print(f"{name:13} accuracy={score:.2f}")
```

```text
decision-time accuracy=0.73
future-field  accuracy=1.00
```

1.00 is not evidence of a brilliant guardrail. Here it proves that the evaluation admitted its answer key. The 0.73 result asks a real question: how well can opening-time information identify later review decisions?
A random row split can be correct when production will repeatedly serve known customers. It is wrong when you claim the model generalizes to new customers while each customer's earlier rows appear in training.
The next dataset is intentionally stark. A customer-specific pattern determines every label. A model with customer_id memorizes customers perfectly when their rows are scattered through folds, then fails on customer IDs it never encountered.
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

customers = np.repeat([f"c{i:02}" for i in range(60)], 4)
needs_review = np.repeat([i % 2 for i in range(60)], 4)
features = pd.DataFrame({"customer_id": customers})
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)

row_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)
group_cv = GroupKFold(n_splits=5)

row_score = cross_val_score(
    model, features, needs_review, cv=row_cv, scoring="accuracy"
).mean()
group_score = cross_val_score(
    model,
    features,
    needs_review,
    groups=customers,
    cv=group_cv,
    scoring="accuracy",
).mean()

print(f"random rows mean accuracy = {row_score:.2f}")
print(f"new customers mean accuracy = {group_score:.2f}")
```

```text
random rows mean accuracy = 1.00
new customers mean accuracy = 0.50
```

Neither metric is universally correct. 1.00 answers, "Can I recognize customers already represented in history?" 0.50 answers, "Can customer identity alone help on unseen customers?" Write the intended deployment promise beside the score.
Leakage does not require a suspicious column. A scaler, imputer, vocabulary, principal component analysis (PCA) transform, or feature selector learns parameters during fit. If it sees all rows before a split, the validation data already affected the trained pipeline.
The experiment below uses scikit-learn's Pipeline with random labels.[2] If a feature selector sees all labels before splitting, it can select chance correlations that happen to fit the test rows. Fitting selection inside the pipeline removes that advantage.
```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
features = rng.standard_normal((200, 2_000))
random_labels = rng.choice(2, 200)

selected = SelectKBest(k=25).fit_transform(features, random_labels)
leaky_train_X, leaky_test_X, leaky_train_y, leaky_test_y = train_test_split(
    selected, random_labels, random_state=42
)
leaky_model = DecisionTreeClassifier(max_depth=4, random_state=1).fit(
    leaky_train_X, leaky_train_y
)

train_X, test_X, train_y, test_y = train_test_split(
    features, random_labels, random_state=42
)
safe_model = make_pipeline(
    SelectKBest(k=25),
    DecisionTreeClassifier(max_depth=4, random_state=1),
).fit(train_X, train_y)

print(
    "fit selector before split accuracy =",
    f"{accuracy_score(leaky_test_y, leaky_model.predict(leaky_test_X)):.2f}",
)
print(
    "pipeline after split accuracy =",
    f"{accuracy_score(test_y, safe_model.predict(test_X)):.2f}",
)
```

```text
fit selector before split accuracy = 0.74
pipeline after split accuracy = 0.52
```

Because the labels are random, the true attainable accuracy is around chance. The inflated 0.74 score comes entirely from leakage. In a real claim model, you often will not know the true attainable score, which is why pipeline discipline matters.
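The same discipline applies to plain scaling. The sketch below is a hand-rolled stand-in for a scaler, using only the standard library and invented ambiguity scores, to show which numbers count as "learned" and therefore must come from training rows. The function name `standardize` and all values are assumptions for illustration.

```python
from statistics import mean, pstdev

# Hypothetical ambiguity scores; validation has drifted upward.
train_scores = [0.2, 0.4, 0.6, 0.8]
validation_scores = [1.4, 1.6]

# Safe: the centering and scaling values are learned from training rows only.
safe_mean = mean(train_scores)
safe_std = pstdev(train_scores)

# Leaky: the statistic is computed before the split, so validation rows shape it.
leaky_mean = mean(train_scores + validation_scores)


def standardize(values, center, scale):
    """Apply already-learned statistics; nothing is re-estimated here."""
    return [(value - center) / scale for value in values]


safe_validation = standardize(validation_scores, safe_mean, safe_std)

print(f"train-only mean = {safe_mean:.2f}")
print(f"all-rows mean   = {leaky_mean:.2f}")
print("validation z-scores from train statistics =",
      [round(z, 2) for z in safe_validation])
```

With training-only statistics, the drifted validation rows show up as large z-scores, which is the honest signal. The all-rows mean has already moved toward the validation data and would partly hide that drift.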
Return routing faces seasonal traffic, policy changes, and changing customer behavior. A model fitted on August should not be used to predict June in an experiment that claims future performance. TimeSeriesSplit creates expanding history: each validation window comes after its training window.
```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

months = np.repeat(np.arange(1, 9), 3)
splitter = TimeSeriesSplit(n_splits=3)

for fold, (train_index, validation_index) in enumerate(splitter.split(months), start=1):
    train_months = months[train_index]
    validation_months = months[validation_index]
    print(
        f"fold {fold}: train <= {train_months.max()} "
        f"validate = {validation_months.min()}..{validation_months.max()}"
    )
```

```text
fold 1: train <= 2 validate = 3..4
fold 2: train <= 4 validate = 5..6
fold 3: train <= 6 validate = 7..8
```

If one customer can appear over many months and you need a promise about unseen customers, time ordering alone is insufficient. Hold out future months and keep the relevant customer boundary intact inside the evaluation design.
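One way to combine the two boundaries is to hold out future months first and then separate future rows by whether their customer already appeared in training. The rows below are invented for illustration; only the month cutoff follows the lesson's January to June training window.

```python
# Hypothetical claims: each row records its customer and arrival month.
rows = [
    {"claim": "A1", "customer": "c01", "month": 2},
    {"claim": "A2", "customer": "c02", "month": 4},
    {"claim": "A3", "customer": "c03", "month": 6},
    {"claim": "A4", "customer": "c01", "month": 7},  # returning customer
    {"claim": "A5", "customer": "c04", "month": 7},  # new customer
    {"claim": "A6", "customer": "c02", "month": 8},  # returning customer
    {"claim": "A7", "customer": "c05", "month": 8},  # new customer
]

# Time boundary: fit on months 1..6, evaluate on later months.
train = [row for row in rows if row["month"] <= 6]
future = [row for row in rows if row["month"] >= 7]

# Customer boundary: split the future rows by prior exposure.
seen_customers = {row["customer"] for row in train}
new_customer_eval = [row for row in future if row["customer"] not in seen_customers]
returning_eval = [row for row in future if row["customer"] in seen_customers]

print("new-customer eval claims =", [row["claim"] for row in new_customer_eval])
print("returning eval claims    =", [row["claim"] for row in returning_eval])
```

Reporting the two evaluation groups separately keeps each score attached to the promise it supports, instead of blending new-customer and returning-customer quality into one number.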
A probability model still needs an action threshold. Choosing that threshold from the test set is leakage, because the test labels then influence the deployed decision rule.
Below, July validation data chooses among three review thresholds. Only after selection is frozen does August report one test score.
```python
import numpy as np
from sklearn.metrics import f1_score

thresholds = [0.30, 0.50, 0.70]
validation_probability = np.array([0.82, 0.62, 0.58, 0.42, 0.31, 0.12])
validation_label = np.array([1, 1, 1, 0, 0, 0])

test_probability = np.array([0.77, 0.49, 0.39, 0.52, 0.66, 0.14])
test_label = np.array([1, 1, 0, 0, 1, 0])

validation_f1 = {
    threshold: f1_score(validation_label, validation_probability >= threshold)
    for threshold in thresholds
}
chosen = max(validation_f1, key=validation_f1.get)

for threshold, score in validation_f1.items():
    print(f"threshold={threshold:.2f} valid_f1={score:.2f}")
print(
    f"chosen threshold={chosen:.2f} "
    f"test_f1={f1_score(test_label, test_probability >= chosen):.2f}"
)
```

```text
threshold=0.30 valid_f1=0.75
threshold=0.50 valid_f1=1.00
threshold=0.70 valid_f1=0.50
chosen threshold=0.50 test_f1=0.67
```

The test score may be lower than validation. That is not a reason to retune on test; it is information about how uncertain your claimed performance is. Log the threshold, split definition, feature contract, and final metric together.
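A minimal sketch of such a log entry, written as one JSON line. The field names are hypothetical, and the feature list is borrowed from the lesson's contract purely for illustration; the threshold and F1 values are the ones from the threshold-selection run above.

```python
import json

# One record that keeps the decision rule and its evidence together.
evaluation_record = {
    "decision_moment": "claim opening",
    "features": ["ambiguity_score", "policy_support", "prior_returns_90d"],
    "split": {"train": "months 1-6", "validation": "month 7", "test": "month 8"},
    "threshold": 0.50,
    "threshold_chosen_on": "validation",
    "validation_f1": 1.00,
    "test_f1": 0.67,
}

# Sorted keys make successive records easy to diff in review.
line = json.dumps(evaluation_record, sort_keys=True)
print(line)

# Round-trip to confirm the record can be read back unchanged.
restored = json.loads(line)
```

Keeping "threshold_chosen_on" in the record makes it visible in review if a later run quietly selects the threshold on test data.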
Language-model systems add three common ways to leak information:
| System | Leakage path | Honest boundary |
|---|---|---|
| Retrieval-augmented generation (RAG) | Chunks from one source document appear in both tuning and test corpora | Split documents before chunking |
| Fine-tuning or prompting | Eval questions, answers, or worked traces enter training examples or prompt demonstrations | Maintain provenance and exclude eval material |
| Public benchmark evaluation | A benchmark item may be present in model training data | Prefer fresh or contamination-limited test material and report uncertainty |
For a return-policy assistant, splitting after chunking is too late: chunks from the same policy page can land on both sides. The retriever then appears to generalize while it is retrieving near-copies of text seen during development.
```python
documents = {
    "policy_returns": [
        "damaged items need review",
        "photo evidence may support refund",
    ],
    "policy_shipping": [
        "late parcel gets tracking check",
        "carrier scan precedes claim",
    ],
    "policy_sellers": [
        "seller appeal needs audit",
        "audit stores decision reason",
    ],
}
chunks = [(document, chunk) for document, texts in documents.items() for chunk in texts]

bad_train = chunks[::2]
bad_test = chunks[1::2]
bad_overlap = sorted(
    {document for document, _ in bad_train}
    & {document for document, _ in bad_test}
)

safe_train_documents = {"policy_returns", "policy_shipping"}
safe_test_documents = {"policy_sellers"}
safe_overlap = sorted(safe_train_documents & safe_test_documents)

print("split chunks first shared documents =", bad_overlap)
print("split documents first shared documents =", safe_overlap)
```

```text
split chunks first shared documents = ['policy_returns', 'policy_sellers', 'policy_shipping']
split documents first shared documents = []
```

Public LLM benchmarks have a harder version of this problem: training corpora can include published test material. LiveBench was designed around frequently updated questions from recent information sources with objective scoring to reduce contamination risk, not to make contamination impossible for all future uses.[3]
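For your own corpus, one way to keep the document-level boundary stable as pages are added is to assign each document to a side from a hash of its identifier, then chunk. This is a sketch under assumptions: the `assign_split` helper and the 34 percent test share are invented for the example, and the three documents are the ones from the lab above.

```python
import hashlib

documents = {
    "policy_returns": ["damaged items need review", "photo evidence may support refund"],
    "policy_shipping": ["late parcel gets tracking check", "carrier scan precedes claim"],
    "policy_sellers": ["seller appeal needs audit", "audit stores decision reason"],
}


def assign_split(document_id, test_percent=34):
    """Map a document ID to 'train' or 'test' deterministically from its hash."""
    digest = hashlib.sha256(document_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "test" if bucket < test_percent else "train"


# Decide the side per document first, then attach every chunk to that side.
splits = {"train": [], "test": []}
for document_id, texts in documents.items():
    side = assign_split(document_id)
    splits[side].extend((document_id, chunk) for chunk in texts)

train_documents = {document_id for document_id, _ in splits["train"]}
test_documents = {document_id for document_id, _ in splits["test"]}
print("shared documents =", sorted(train_documents & test_documents))
```

Because the side depends only on the document ID, re-chunking a page or adding new pages never moves an existing document across the boundary. With only three documents one side may come out empty, so a real corpus needs enough documents for the percentages to be meaningful.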
A cheap local smoke test searches for exact phrase overlap between your own training and evaluation artifacts:
```python
training_prompts = [
    "route damaged return with required review to human queue",
    "approve refund after clear photo evidence arrives",
]
evaluation_prompts = [
    "route damaged return with required review to human queue",
    "send an ambiguous damaged claim to a reviewer",
]

def ngrams(text, size=5):
    words = text.lower().split()
    return {
        " ".join(words[start : start + size])
        for start in range(len(words) - size + 1)
    }

training_grams = set().union(*(ngrams(prompt) for prompt in training_prompts))
for prompt in evaluation_prompts:
    overlaps = ngrams(prompt) & training_grams
    status = "FLAG" if overlaps else "clear"
    print(f"{status}: {prompt}")
```

```text
FLAG: route damaged return with required review to human queue
clear: send an ambiguous damaged claim to a reviewer
```

This flags copied wording. It does not prove the second prompt is safe: paraphrases, translated items, and memorized solution patterns can evade exact matching. Treat overlap checks as a reason to investigate, not as a certificate of clean evaluation.
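A slightly looser check can catch some reorderings that exact n-grams miss. The sketch below uses word-set Jaccard similarity as a crude stand-in; the 0.6 threshold and the reordered evaluation prompt are assumptions chosen for illustration. The first evaluation prompt shares no 5-gram with the training prompts, yet uses exactly the same words as one of them.

```python
training_prompts = [
    "route damaged return with required review to human queue",
    "approve refund after clear photo evidence arrives",
]
evaluation_prompts = [
    "route damaged return to human queue with required review",  # reordered copy
    "send an ambiguous damaged claim to a reviewer",
]

SIMILARITY_THRESHOLD = 0.6  # assumption: tune on known duplicates in your own data


def word_set(text):
    return set(text.lower().split())


def jaccard(left, right):
    """Share of distinct words the two texts have in common."""
    a, b = word_set(left), word_set(right)
    return len(a & b) / len(a | b) if a | b else 0.0


flagged = []
for prompt in evaluation_prompts:
    # Compare against the closest training prompt.
    best = max(jaccard(prompt, candidate) for candidate in training_prompts)
    status = "REVIEW" if best >= SIMILARITY_THRESHOLD else "clear"
    if status == "REVIEW":
        flagged.append(prompt)
    print(f"{status}: similarity={best:.2f} {prompt}")
```

This is still a lexical test. Translations and paraphrases with different vocabulary pass it untouched, so the same caution applies: a flag is a reason to look, and a clear result is not a clean bill of health.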
These labs evaluate a supervised guardrail: given information available when a claim opens, predict whether human review is required. They do not by themselves estimate the reward of a brand-new sequential policy.
Historical trajectories reveal the outcome of the action that was taken. They usually do not reveal what would have happened if the router had selected a different action for the same customer. A time split prevents future leakage, but it does not fill in those missing counterfactual outcomes. For an RL policy, pair split discipline with a validated simulator, carefully logged exploration or propensities for off-policy evaluation, human review, or a staged online experiment before automation.[4]
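To show what logged propensities buy you, here is a minimal inverse propensity scoring (IPS) sketch, one common off-policy estimator. Everything in the log is invented: the action name auto_refund, the propensities, and the rewards are assumptions, and the target policy is a deliberately simple "always human_review" rule.

```python
# Hypothetical logged decisions from an exploring router.
# propensity = probability the logging policy gave to the action it took.
logged = [
    {"action": "human_review", "propensity": 0.5, "reward": 1.0},
    {"action": "auto_refund", "propensity": 0.5, "reward": 0.0},
    {"action": "human_review", "propensity": 0.25, "reward": 0.0},
    {"action": "auto_refund", "propensity": 0.75, "reward": 1.0},
]


def target_policy(row):
    """Candidate policy to evaluate: always route to human review."""
    return "human_review"


def ips_estimate(logged, policy):
    """Average reward, reweighting rows where the candidate agrees with the log."""
    total = 0.0
    for row in logged:
        if policy(row) == row["action"]:
            total += row["reward"] / row["propensity"]
    return total / len(logged)


estimate = ips_estimate(logged, target_policy)
print(f"IPS estimate for always-human_review = {estimate:.2f}")
```

The estimate only uses rows where the logged action matches what the candidate policy would have done, which is exactly the missing-counterfactual problem described above. It relies on the propensities being recorded correctly and on the logging policy having given every candidate action a nonzero chance; small propensities make the estimate noisy.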
Before publishing a score, create a short, reviewable artifact:
```text
Decision moment: claim opening, before evidence request or reviewer action
Allowed features: ambiguity_score, policy_support, prior_returns_90d
Forbidden features: photo_received_at, reviewer_action, refund_reversed
Train: January-June claims
Validation: July claims; choose threshold and features here
Test: August claims; open once after choices are frozen
Entity promise: new-customer quality, so no customer crosses split boundary
Preprocessing: every fit step is inside training fold Pipeline
Feature snapshot: as of claim opening; aggregation windows stop before decision time
Metrics: review-required F1 plus unsupported-refund count
```

The metric is only the last line of reasoning. The contract states what the metric means.
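Parts of this contract can be checked mechanically before a score is published. The sketch below is one possible shape, with a hypothetical `check_contract` helper; the allowed and forbidden feature names come from the contract above, while the customer IDs are invented.

```python
ALLOWED_FEATURES = {"ambiguity_score", "policy_support", "prior_returns_90d"}
FORBIDDEN_FEATURES = {"photo_received_at", "reviewer_action", "refund_reversed"}


def check_contract(feature_names, train_customers, test_customers):
    """Return a list of contract violations; an empty list means no problem found."""
    problems = []
    for name in feature_names:
        if name in FORBIDDEN_FEATURES:
            problems.append(f"forbidden feature: {name}")
        elif name not in ALLOWED_FEATURES:
            # Unknown fields need a timestamp review before they are trusted.
            problems.append(f"unreviewed feature: {name}")
    shared = sorted(set(train_customers) & set(test_customers))
    if shared:
        problems.append(f"customers cross split boundary: {shared}")
    return problems


bad_run = check_contract(
    ["ambiguity_score", "refund_reversed"],
    train_customers=["c01", "c02"],
    test_customers=["c02", "c03"],
)
good_run = check_contract(
    ["ambiguity_score", "policy_support"],
    train_customers=["c01", "c02"],
    test_customers=["c03"],
)

print("bad run problems  =", bad_run)
print("good run problems =", good_run)
```

A check like this covers only the lines of the contract that are expressible as set membership. The decision moment, the snapshot windows, and the single opening of the test set still need human review.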
Exercises:

1. Add reviewer_action to the feature-audit lab and assign an observation time. Explain why it is label-adjacent leakage.
2. Change group-leakage-from-repeat-customers.py so prediction is for returning customers rather than new customers. State whether row-wise splitting now answers a relevant product question and what time boundary is still needed.
3. Replace SelectKBest with StandardScaler or PCA inside a Pipeline. Identify which learned values must come only from training rows.

Solution notes:

- reviewer_action exists only after routing and review work begin. It belongs in outcome analysis, not opening-time features.
- StandardScaler must learn means and variances from training rows. PCA must learn its centering values and components from training rows. A Pipeline keeps those fit calls inside the split.
[1] Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning.
[2] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. JMLR.
[3] White, C., et al. (2024). LiveBench: A Challenging, Contamination-Limited LLM Benchmark.
[4] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.