LeetLLM
© 2026 LeetLLM. All rights reserved.


Experiment Design and A/B Testing

Design a trustworthy online experiment for an AI support change: randomize customers, measure useful outcomes, quantify uncertainty, and reject false wins.


A support assistant can produce a fluent refund answer and still fail customers. Perhaps the answer arrives too slowly. Perhaps it cites the wrong policy. Perhaps customers reopen a ticket five minutes later. A model improvement matters only when a careful measurement says it helps.

In Decoding Algorithms, you logged the retrieval evidence and decoder settings that produced an answer. Now you will hold that generator fixed and test one proposed change: rewriting a customer's query before retrieval. An A/B test, also called a randomized controlled experiment, gives randomly assigned customers either the existing system (control) or the new system (treatment) and compares their outcomes.

The goal isn't a green dashboard tile. It's a defensible decision: ship the rewrite, keep investigating, or roll it back.

[Figure: Query-rewrite experiment decision surface comparing 38 percent control resolution with 42 percent treatment resolution, a 4-point lift with a 1-to-7-point interval, a 2-point ship threshold, balanced traffic, and latency, grounding, and escalation guardrails inside budget.]
Only query rewriting changes between arms. Treatment improves observed resolution from 38% to 42%; the planned interval stays above zero, the point estimate clears the +2.0-point ship threshold, traffic is balanced, and every declared guardrail remains inside budget.

Test One Change At A Time

Suppose a shopper asks:

I bought an annual plan 12 days ago. Can I get a refund?

Your existing pipeline retrieves policy passages from the raw question and generates a reply. A candidate treatment first rewrites the request into a retrieval-focused query such as "annual plan refund within 30 days", then sends the retrieved evidence through the same prompt, model, and decoder settings.

| Component | Control | Treatment | Why it matters |
| --- | --- | --- | --- |
| Query sent to retriever | Original customer message | Rewritten retrieval query | This is the one intentional change. |
| Policy index snapshot | refund-policies-2026-05-01 | refund-policies-2026-05-01 | A different index would mix two changes together. |
| Generator and prompt | Same pinned version | Same pinned version | Generation changes can't masquerade as retrieval gains. |
| Decoder receipt | selection=greedy, same output schema | selection=greedy, same output schema | Decoder policy doesn't differ by arm. |

Write a hypothesis that can fail:

Query rewriting will increase resolved refund conversations while keeping grounded-answer audit failures, p95 latency, and human escalation inside their budgets.

Here, p95 latency is the response time that 95 percent of enrolled conversations finish at or below. It protects customers in the slower tail, not only the average user.
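As a minimal sketch of that definition, the nearest-rank method below sorts the observed latencies and returns the smallest value that at least 95 percent of them do not exceed. The sample latencies and the `percentile_nearest_rank` helper are illustrative assumptions; production systems often use interpolating or streaming estimators instead.

```python
from math import ceil

def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Smallest observed value with at least pct percent of values at or below it."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    rank = ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds: 18 fast responses and 2 slow ones.
latencies_ms = [900 + 10 * i for i in range(18)] + [2400, 3100]

p95 = percentile_nearest_rank(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.1f} ms, p95={p95} ms")
```

In this made-up sample the mean stays near 1.2 seconds while p95 lands on one of the slow responses, which is why the brief budgets the tail and not only the average.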

An experiment brief records the decision before results tempt you to change it:

| Brief field | Locked choice |
| --- | --- |
| Enrollment moment | First submitted refund question |
| Randomization unit | Customer account ID |
| Primary metric | Resolved conversation rate |
| Resolved definition | Customer marks solved and opens no human ticket within 10 minutes |
| Guardrail: evidence | Grounded-answer failure rate must not increase by more than 0.5 percentage points |
| Guardrail: speed | p95 latency must not increase by more than 150 ms |
| Guardrail: operations | Escalation rate must not increase by more than 0.5 percentage points |
| MDE for power planning | +2.0 percentage points in resolution rate |
| Launch rule for resolution | Approximate 95% interval lower bound above zero and observed lift of at least +2.0 percentage points |
| Analysis rule | One fixed-horizon look after planned enrollment |

For this compact teaching gate, guardrails use observed deltas. A production brief should also predeclare how uncertainty is handled for each guardrail; a point estimate barely inside a budget isn't strong evidence that the actual regression is acceptable.
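One way to predeclare guardrail uncertainty is to compare the budget with an upper confidence bound on the regression instead of the observed delta. The sketch below is an assumption about how such a rule could look, not the rule this brief uses. It applies a one-sided 95 percent normal-approximation bound to the escalation counts from this chapter's results (310 of 2,000 control, 306 of 2,000 treatment); the helper name `upper_bound_on_increase` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def upper_bound_on_increase(bad_c: int, n_c: int, bad_t: int, n_t: int,
                            confidence: float = 0.95) -> tuple[float, float]:
    """Observed increase in a bad-event rate and its one-sided upper bound."""
    p_c, p_t = bad_c / n_c, bad_t / n_t
    delta = p_t - p_c
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(confidence)  # one-sided cutoff, about 1.645
    return delta, delta + z * se

budget = 0.005  # escalation may rise by at most 0.5 percentage points
delta, upper = upper_bound_on_increase(310, 2000, 306, 2000)

print(f"observed delta: {delta * 100:+.2f} pp")
print(f"one-sided 95% upper bound: {upper * 100:+.2f} pp")
print("point estimate inside budget:", delta <= budget)
print("upper bound inside budget:", upper <= budget)
```

With these counts the observed delta is inside the budget, while the upper bound (about +1.7 points) is not. A team adopting a rule like this would also need to plan enough enrollment for the guardrail to be able to pass.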

Random assignment supports causal conclusions only when assignment, instrumentation, and stopping rules stay trustworthy. Practical online experiment systems make this pre-launch contract explicit.[1]

Assign A Customer Once

Why assign by customer account rather than message? One shopper might ask a follow-up question after seeing the first answer. If the first message uses treatment and the follow-up uses control, the experiences interfere: the second outcome depends partly on what happened in the first.

A stable hash gives each enrolled customer the same arm on every request. The salt includes an experiment name and version so a later experiment can assign independently.

stable-assignment.py
```python
import hashlib

# The salt names the experiment and its version.
EXPERIMENT = "refund-query-rewrite:v1"

def arm_for(customer_id: str) -> str:
    payload = f"{EXPERIMENT}:{customer_id}".encode()
    # Map the first 8 bytes of the hash to a bucket from 0 to 99.
    bucket = int.from_bytes(hashlib.sha256(payload).digest()[:8], "big") % 100
    return "treatment" if bucket < 50 else "control"

customers = ["customer-014", "customer-014", "customer-231", "customer-844"]
for customer in customers:
    print(customer, arm_for(customer))

assert arm_for("customer-014") == arm_for("customer-014")
print("repeat assignment is stable")
```
Stable customer assignment
```text
customer-014 treatment
customer-014 treatment
customer-231 control
customer-844 control
repeat assignment is stable
```

If the same customer appears twice in the output, the assigned arm must match twice. In a service, persist the experiment version and arm on each event as well. Hashing is deterministic; logging makes it auditable.
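Two properties of a salted hash are worth checking before trusting it: the split should land near 50/50, and a new experiment version should reshuffle customers instead of reusing the old arms. The sketch below runs both checks on synthetic customer IDs. It passes the salt as an argument, a small change to the function signature made here for illustration, and the `v2` experiment name is hypothetical.

```python
import hashlib

def arm_for(experiment: str, customer_id: str) -> str:
    """Same hashing scheme as stable-assignment.py, with the salt passed in."""
    payload = f"{experiment}:{customer_id}".encode()
    bucket = int.from_bytes(hashlib.sha256(payload).digest()[:8], "big") % 100
    return "treatment" if bucket < 50 else "control"

# Synthetic IDs for illustration only.
customers = [f"customer-{i:05d}" for i in range(10_000)]
v1 = [arm_for("refund-query-rewrite:v1", c) for c in customers]
v2 = [arm_for("refund-query-rewrite:v2", c) for c in customers]

treatment_share = v1.count("treatment") / len(customers)
same_arm_share = sum(a == b for a, b in zip(v1, v2)) / len(customers)

print(f"treatment share under v1: {treatment_share:.1%}")
print(f"customers keeping their arm under v2: {same_arm_share:.1%}")
```

Both shares should sit close to 50 percent. A `same_arm_share` near 100 percent would suggest the salt is not actually changing the assignment.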

Define An Outcome From Events

The assistant shouldn't get credit merely because it produced an answer. Resolution requires an observed customer outcome. For this chapter, one conversation counts as resolved only if the customer marks it solved and doesn't open a support ticket within ten minutes.

This next lab turns raw product events into that metric. It also records two guardrails: whether an audit says the answer is grounded in retrieved policy and the response latency in milliseconds.

define-resolution.py
```python
conversations = [
    {"arm": "control", "solved": True, "ticket_after_min": None, "grounded": True, "latency_ms": 1080},
    {"arm": "control", "solved": True, "ticket_after_min": 4, "grounded": True, "latency_ms": 1200},
    {"arm": "control", "solved": False, "ticket_after_min": 2, "grounded": False, "latency_ms": 980},
    {"arm": "treatment", "solved": True, "ticket_after_min": None, "grounded": True, "latency_ms": 1120},
    {"arm": "treatment", "solved": True, "ticket_after_min": 18, "grounded": True, "latency_ms": 1240},
    {"arm": "treatment", "solved": True, "ticket_after_min": 6, "grounded": True, "latency_ms": 1280},
]

def resolved(row: dict) -> bool:
    # A ticket opened within the locked 10-minute window cancels the "solved" mark.
    ticket_too_soon = row["ticket_after_min"] is not None and row["ticket_after_min"] <= 10
    return row["solved"] and not ticket_too_soon

for arm in ("control", "treatment"):
    rows = [row for row in conversations if row["arm"] == arm]
    successes = sum(resolved(row) for row in rows)
    grounded = sum(row["grounded"] for row in rows)
    print(f"{arm}: resolved={successes}/{len(rows)} grounded={grounded}/{len(rows)}")
```
Outcome definition on event rows
```text
control: resolved=1/3 grounded=2/3
treatment: resolved=2/3 grounded=3/3
```

The small rows teach the definition, not the launch result. Notice that the treatment row with ticket_after_min=18 still counts as resolved because its ticket arrives outside the locked ten-minute window. That isn't automatically correct or wrong: it makes the metric contract inspectable. If later tickets matter, choose a longer window or track them separately before launch. A real analysis also needs enough customers to distinguish an actual lift from chance variation.
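To see how much the metric depends on the window, the sketch below recomputes resolution on the same six rows with the window as a parameter. The 30-minute alternative is a hypothetical choice for illustration, not a recommendation.

```python
# Same six illustrative rows as define-resolution.py, reduced to the fields used here.
conversations = [
    {"arm": "control", "solved": True, "ticket_after_min": None},
    {"arm": "control", "solved": True, "ticket_after_min": 4},
    {"arm": "control", "solved": False, "ticket_after_min": 2},
    {"arm": "treatment", "solved": True, "ticket_after_min": None},
    {"arm": "treatment", "solved": True, "ticket_after_min": 18},
    {"arm": "treatment", "solved": True, "ticket_after_min": 6},
]

def resolved(row: dict, window_min: int) -> bool:
    ticket = row["ticket_after_min"]
    ticket_in_window = ticket is not None and ticket <= window_min
    return row["solved"] and not ticket_in_window

def resolved_counts(window_min: int) -> dict[str, int]:
    counts = {"control": 0, "treatment": 0}
    for row in conversations:
        counts[row["arm"]] += resolved(row, window_min)
    return counts

for window_min in (10, 30):
    print(f"window={window_min} min:", resolved_counts(window_min))
```

Widening the window to 30 minutes drops the treatment count from 2 to 1, because the ticket at 18 minutes now falls inside it. That is why the window belongs in the brief before launch.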

[Figure: Diagram showing Eligible customer, Stable assignment, Control raw query, and Treatment rewrite query.]

Read Lift From Counts

After the planned window, suppose each arm contains 2,000 enrolled customers.

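| Variant | Customers | Resolved | Resolution rate | p95 latency | Escalated | Grounded audit failures |
| --- | --- | --- | --- | --- | --- | --- |
| Control: raw query | 2,000 | 760 | 38.0% | 1180 ms | 310 | 22 |
| Treatment: rewritten query | 2,000 | 840 | 42.0% | 1260 ms | 306 | 23 |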

The absolute lift is the treatment rate minus the control rate:

$$\widehat{\Delta} = \hat{p}_{treatment} - \hat{p}_{control} = \frac{840}{2000} - \frac{760}{2000} = 0.04$$

The treatment resolves 4.0 additional conversations per 100 enrolled customers. That is a 4.0 percentage point lift.

The relative lift compares the absolute change with the original baseline:

$$\frac{0.04}{0.38} \approx 0.105$$

That is a 10.5 percent relative lift. Always name which one you report; saying "up 10.5 points" would be wrong.

compute-lift.py
```python
control_resolved, control_n = 760, 2000
treatment_resolved, treatment_n = 840, 2000

p_control = control_resolved / control_n
p_treatment = treatment_resolved / treatment_n
absolute_lift = p_treatment - p_control      # difference in rates (percentage points)
relative_lift = absolute_lift / p_control    # change as a fraction of the baseline

print(f"control rate: {p_control:.1%}")
print(f"treatment rate: {p_treatment:.1%}")
print(f"absolute lift: {absolute_lift * 100:.1f} percentage points")
print(f"relative lift: {relative_lift:.1%}")
```
Absolute and relative lift
```text
control rate: 38.0%
treatment rate: 42.0%
absolute lift: 4.0 percentage points
relative lift: 10.5%
```

Put Uncertainty Around The Lift

If you reran the experiment with different customers, the counts wouldn't be identical. A confidence interval describes this sampling uncertainty: a 95 percent interval comes from a procedure that, over many repeated experiments, would contain the true difference about 95 percent of the time. For two reasonably large binary-outcome arms, a normal-approximation interval is a useful first calculation.

For each arm, $\hat{p}$ is its observed resolution rate and $n$ is its number of enrolled customers. The estimated standard error of the difference is:

$$SE(\widehat{\Delta}) = \sqrt{ \frac{\hat{p}_{control}(1 - \hat{p}_{control})}{n_{control}} + \frac{\hat{p}_{treatment}(1 - \hat{p}_{treatment})}{n_{treatment}} }$$

A rough 95 percent interval is $\widehat{\Delta} \pm 1.96 \times SE$. The number 1.96 is the standard-normal cutoff that leaves about 2.5 percent in each tail.

resolution-lift-interval.py
```python
from math import sqrt

control_resolved, control_n = 760, 2000
treatment_resolved, treatment_n = 840, 2000

p_control = control_resolved / control_n
p_treatment = treatment_resolved / treatment_n
diff = p_treatment - p_control
# Standard error of the difference between two independent proportions.
se = sqrt(
    p_control * (1 - p_control) / control_n
    + p_treatment * (1 - p_treatment) / treatment_n
)
low = diff - 1.96 * se
high = diff + 1.96 * se

print(f"estimated lift: {diff * 100:.1f} pp")
print(f"standard error: {se * 100:.2f} pp")
print(f"approximate 95% interval: [{low * 100:.1f}, {high * 100:.1f}] pp")
```
Approximate lift interval
```text
estimated lift: 4.0 pp
standard error: 1.55 pp
approximate 95% interval: [1.0, 7.0] pp
```

The estimate is +4.0 points and this approximation gives an interval of about +1.0 to +7.0 points. Because the lower bound is above zero, this planned analysis gives evidence for a positive treatment effect, assuming assignment and instrumentation are sound. It doesn't prove every future rollout will gain four points.
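The 1.96 multiplier is tied to the 95 percent level. As a small sketch, the block below derives the cutoff from the standard normal distribution with `statistics.NormalDist` and recomputes the same interval at 90, 95, and 99 percent. The extra confidence levels are illustrative; a brief should fix one level before results are visible.

```python
from math import sqrt
from statistics import NormalDist

control_resolved, control_n = 760, 2000
treatment_resolved, treatment_n = 840, 2000

p_c = control_resolved / control_n
p_t = treatment_resolved / treatment_n
diff = p_t - p_c
se = sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treatment_n)

def interval(confidence: float) -> tuple[float, float]:
    """Two-sided normal-approximation interval for the lift."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff - z * se, diff + z * se

for confidence in (0.90, 0.95, 0.99):
    low, high = interval(confidence)
    print(f"{confidence:.0%}: [{low * 100:.2f}, {high * 100:.2f}] pp")
```

Demanding more confidence widens the interval. With these counts, the lower bound of the 99 percent interval sits very close to zero, so the same data would support a weaker claim under a stricter predeclared level.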

For rare outcomes, heavily clustered customers, many simultaneous comparisons, or business-critical launches, choose the inference method with a statistician or a mature experimentation platform before running the test.

[Figure: Experiment timeline showing success rules locked before exposure, control and query-rewrite customers enrolled under pinned system versions, one planned analysis horizon, an integrity gate that routes mismatches to investigation, and only passing traffic continuing to the lift interval and launch decision.]
The estimate stays hidden until the planned horizon and integrity audit. Balanced assignment with pinned system versions unlocks the +1.0 to +7.0-point interval; an SRM or version mismatch exits to investigation before anyone reads lift.

Plan For A Meaningful Win

A test with too little traffic may return "uncertain" even when the treatment helps. Before launching, declare a minimum detectable effect (MDE): the smallest true lift the planned test is designed to detect with its target power. Our team chooses +2.0 percentage points because a smaller resolution improvement wouldn't justify maintaining the rewrite service. This brief also uses +2.0 points as its minimum observed lift worth shipping. Keep the two fields conceptually separate: MDE determines planned enrollment, while the launch threshold determines the decision rule.

See what sample size does while the observed rates stay at 38 and 42 percent:

interval-width-by-traffic.py
```python
from math import sqrt

p_control = 0.38
p_treatment = 0.42
lift = p_treatment - p_control

# Hold the observed rates fixed and vary only the number of customers per arm.
for n in (200, 500, 2000):
    se = sqrt(p_control * (1 - p_control) / n + p_treatment * (1 - p_treatment) / n)
    low = lift - 1.96 * se
    high = lift + 1.96 * se
    print(f"{n:>4} per arm: [{low * 100:>4.1f}, {high * 100:>4.1f}] pp")
```
Traffic narrows uncertainty
```text
 200 per arm: [-5.6, 13.6] pp
 500 per arm: [-2.1, 10.1] pp
2000 per arm: [ 1.0,  7.0] pp
```

With 200 or 500 customers per arm, a good-looking +4.0 point estimate is still compatible with no improvement. More enrollment narrows uncertainty; it doesn't magically make treatment better.

Power is the probability that a planned test will detect an effect of the chosen size when that effect truly exists. For equal-sized arms and a binary rate near the baseline, this approximation provides a planning estimate:

$$n \approx \frac{2p(1-p)(z_{1-\alpha/2} + z_{power})^2}{\delta^2}$$

Here, $n$ is the number of customers needed per arm, $p$ is the baseline resolution rate, $\delta$ is the MDE, $\alpha$ is the false-positive budget, and power is the desired detection probability.

plan-sample-size.py
```python
from math import ceil
from statistics import NormalDist

baseline_rate = 0.38
mde = 0.02
alpha = 0.05
target_power = 0.80

normal = NormalDist()
z_alpha = normal.inv_cdf(1 - alpha / 2)   # two-sided false-positive cutoff
z_power = normal.inv_cdf(target_power)    # cutoff for the desired power
n_per_arm = ceil(
    2 * baseline_rate * (1 - baseline_rate) * (z_alpha + z_power) ** 2 / mde**2
)

print(f"MDE: {mde * 100:.1f} pp")
print(f"target power: {target_power:.0%}")
print(f"planning estimate: {n_per_arm:,} customers per arm")
```
Planning enrollment from MDE
```text
MDE: 2.0 pp
target power: 80%
planning estimate: 9,246 customers per arm
```

This formula is for planning, not post-result storytelling. Actual platform planning may account for unequal allocation, repeated customers, variance reduction, multiple metrics, or sequential analysis.
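The planning formula can also be read in the other direction: given a per-arm sample size, roughly how likely is the test to detect a true lift equal to the MDE? The sketch below uses a normal approximation with separate variances for each arm, so it will not match the planning formula exactly. The `approximate_power` helper and the intermediate sample size of 5,000 are assumptions added for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approximate_power(baseline: float, true_lift: float, n_per_arm: int,
                      alpha: float = 0.05) -> float:
    """Normal-approximation power of a two-sided test with equal-sized arms."""
    p_c, p_t = baseline, baseline + true_lift
    se = sqrt(p_c * (1 - p_c) / n_per_arm + p_t * (1 - p_t) / n_per_arm)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Chance the estimate clears the upper cutoff. The tiny chance of a
    # "significant" result in the wrong direction is ignored.
    return NormalDist().cdf(true_lift / se - z_alpha)

for n in (2000, 5000, 9246):
    power = approximate_power(0.38, 0.02, n)
    print(f"{n:>5} per arm: power about {power:.0%}")
```

Under this approximation, 9,246 customers per arm gives close to the 80 percent target, while 2,000 per arm gives roughly one chance in four of detecting a true +2.0-point lift.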

Check Traffic Before Outcomes

A sample ratio mismatch (SRM) means observed assignment counts disagree suspiciously with the intended split. If the design says 50/50 but your data has 2,350 treatment customers and 1,650 control customers, investigate before reading resolution lift. Assignment code, event logging, eligibility filters, or bot removal may differ by arm.

This lab computes a chi-square diagnostic for a planned 50/50 allocation. A one-degree-of-freedom value above 10.83 is a deliberately strict alert threshold corresponding to a very small tail probability, about 0.1 percent.

check-assignment-split.py
```python
observed = {"control": 1650, "treatment": 2350}
expected_each = sum(observed.values()) / 2  # planned 50/50 split

# Chi-square goodness-of-fit statistic with one degree of freedom.
chi_square = sum(
    (count - expected_each) ** 2 / expected_each
    for count in observed.values()
)
alert_threshold = 10.83

print("planned split: 50% / 50%")
print(f"observed: control={observed['control']}, treatment={observed['treatment']}")
print(f"chi-square statistic: {chi_square:.1f}")
print("pause analysis for SRM investigation:", chi_square > alert_threshold)
```
Sample ratio integrity check
```text
planned split: 50% / 50%
observed: control=1650, treatment=2350
chi-square statistic: 122.5
pause analysis for SRM investigation: True
```

An SRM alert doesn't identify the bug. It says your comparison hasn't earned trust yet. Stop, diagnose the event path, and restart or reanalyze only when you understand what changed.
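The 10.83 threshold can be restated as a tail probability. For one degree of freedom, the chi-square tail probability equals `erfc(sqrt(x / 2))`, so the standard library is enough. The sketch below applies that identity to the mismatched split above and to a hypothetical 1,960/2,040 split that is only mildly unbalanced; the `srm_check` helper is an illustrative name.

```python
from math import erfc, sqrt

def srm_check(control: int, treatment: int) -> tuple[float, float]:
    """Chi-square statistic and p-value for a planned 50/50 split (1 degree of freedom)."""
    expected = (control + treatment) / 2
    chi_square = ((control - expected) ** 2 + (treatment - expected) ** 2) / expected
    # With one degree of freedom, P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi_square / 2))
    return chi_square, p_value

for control, treatment in [(1960, 2040), (1650, 2350)]:
    chi_square, p_value = srm_check(control, treatment)
    alert = p_value < 0.001
    print(f"{control}/{treatment}: chi-square={chi_square:.1f}, p-value={p_value:.3g}, alert={alert}")
```

An 80-customer gap in 4,000 is ordinary chance variation and does not alert. The 700-customer gap has a p-value far below 0.1 percent and does.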

Don't Stop The First Time Noise Looks Good

Suppose treatment truly has no effect: both arms resolve 40 percent of conversations. If you run one planned two-sided check at the end with a 5 percent false-positive threshold, false wins should occur about 5 percent of the time. If you inspect the ordinary interval after every batch and stop at the first apparent win, noise receives many chances to cross the threshold.

The following A/A simulation uses two identical arms, 5,000 repeat experiments, 20 looks, and a fixed seed. It demonstrates this exact stopping policy, not a universal percentage for every test design.

simulate-peeking.py
```python
import numpy as np

rng = np.random.default_rng(11)
trials = 5_000
looks = 20
batch_size = 100
true_rate = 0.40  # identical in both arms, so every "win" is a false win

planned_wins = 0
peek_wins = 0

for _ in range(trials):
    control = rng.binomial(1, true_rate, size=looks * batch_size)
    treatment = rng.binomial(1, true_rate, size=looks * batch_size)
    crossed_early = False

    # Look after every batch, using all customers enrolled so far.
    for n in range(batch_size, looks * batch_size + 1, batch_size):
        p_control = control[:n].mean()
        p_treatment = treatment[:n].mean()
        se = (p_control * (1 - p_control) / n + p_treatment * (1 - p_treatment) / n) ** 0.5
        significant = se > 0 and abs(p_treatment - p_control) / se > 1.96
        crossed_early = crossed_early or significant
        if n == looks * batch_size:
            planned_wins += significant  # only the final look counts here

    peek_wins += crossed_early  # any crossing at any look counts here

print(f"planned final look: {planned_wins / trials:.1%}")
print(f"stop at first crossing: {peek_wins / trials:.1%}")
```
Peeking simulation
```text
planned final look: 4.9%
stop at first crossing: 25.5%
```
[Figure: Seeded simulation of 5,000 zero-effect A/A tests: one planned look produces a 4.9 percent false-win rate, while checking after each of 20 batches and stopping on the first crossing produces a 25.5 percent false-win rate.]
Both arms have the same true resolution rate. Changing only the stopping policy turns roughly one false win in twenty into about one in four in this reproducible simulation.

Ordinary fixed-horizon intervals aren't valid for a decision triggered by continuous monitoring. Johari and colleagues define always-valid p-values and confidence intervals for sequential decisions, where repeated observation is part of the planned method.[2] For a first experiment, the straightforward rule is enough: choose enrollment in advance and analyze once.

Reduce Noise With Pre-Experiment Data

Power doesn't always require more customers. CUPED (Controlled-experiment Using Pre-Experiment Data) uses a measurement collected before assignment, such as whether a customer resolved a prior refund issue, to remove predictable customer-to-customer variation from the outcome.

Let $Y$ be the experiment outcome and $X$ be a pre-experiment covariate. CUPED forms:

$$Y_{adjusted} = Y - \theta(X - \bar{X}), \qquad \theta = \frac{\operatorname{cov}(Y, X)}{\operatorname{var}(X)}$$

Since $X$ was measured before treatment, it can't have been caused by treatment. When $X$ predicts $Y$, the adjusted outcome varies less while targeting the same treatment difference. Deng, Xu, Kohavi, and Walker introduced this approach for online experiments and reported substantial variance reductions in Bing experiments.[3]

This simulation gives returning customers different baseline resolution probabilities, then assigns treatment independently. Compare the standard error before and after CUPED.

cuped-for-resolution.py
```python
import numpy as np

rng = np.random.default_rng(27)
n = 20_000
arm = rng.integers(0, 2, size=n)                 # 0 = control, 1 = treatment
resolved_before = rng.binomial(1, 0.45, size=n)  # pre-experiment covariate X
probability = np.clip(0.18 + 0.46 * resolved_before + 0.04 * arm, 0, 1)
resolved_now = rng.binomial(1, probability)      # experiment outcome Y

# theta = cov(Y, X) / var(X), then subtract the predictable part of Y.
theta = np.cov(resolved_now, resolved_before, ddof=1)[0, 1] / np.var(resolved_before, ddof=1)
adjusted = resolved_now - theta * (resolved_before - resolved_before.mean())

def lift_and_se(outcome: np.ndarray) -> tuple[float, float]:
    control = outcome[arm == 0]
    treatment = outcome[arm == 1]
    lift = treatment.mean() - control.mean()
    se = (control.var(ddof=1) / len(control) + treatment.var(ddof=1) / len(treatment)) ** 0.5
    return lift, se

raw_lift, raw_se = lift_and_se(resolved_now)
cuped_lift, cuped_se = lift_and_se(adjusted)
variance_reduction = 1 - adjusted.var(ddof=1) / resolved_now.var(ddof=1)

print(f"raw estimate: lift={raw_lift * 100:.2f} pp, se={raw_se * 100:.2f} pp")
print(f"CUPED estimate: lift={cuped_lift * 100:.2f} pp, se={cuped_se * 100:.2f} pp")
print(f"outcome variance reduced: {variance_reduction:.1%}")
```
CUPED variance reduction
```text
raw estimate: lift=3.15 pp, se=0.70 pp
CUPED estimate: lift=3.17 pp, se=0.61 pp
outcome variance reduced: 21.9%
```

In a finite sample, the raw and adjusted estimates won't be numerically identical. What matters is that adjustment uses only a pre-treatment signal and reduces uncertainty. Never adjust for a quantity treatment could affect, such as latency measured after the rewrite is enabled; that can bias the comparison.
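How much variance CUPED removes depends on how strongly the covariate correlates with the outcome. When $\theta$ is estimated from the same sample, the fraction of outcome variance removed equals the squared sample correlation between $X$ and $Y$. The standard-library sketch below checks that identity on eight made-up customers; the data and helper names are illustrative assumptions.

```python
def mean(values: list[float]) -> float:
    return sum(values) / len(values)

def sample_cov(xs: list[float], ys: list[float]) -> float:
    """Sample covariance with the n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical customers: X = resolved a prior issue, Y = resolved in the experiment.
resolved_before = [0, 0, 1, 1, 0, 1, 1, 0]
resolved_now = [0, 1, 1, 1, 0, 1, 0, 0]

var_x = sample_cov(resolved_before, resolved_before)
var_y = sample_cov(resolved_now, resolved_now)
cov_xy = sample_cov(resolved_now, resolved_before)

theta = cov_xy / var_x
x_bar = mean(resolved_before)
adjusted = [y - theta * (x - x_bar) for x, y in zip(resolved_before, resolved_now)]

reduction = 1 - sample_cov(adjusted, adjusted) / var_y
rho_squared = cov_xy ** 2 / (var_x * var_y)

print(f"theta={theta:.2f}")
print(f"variance reduction={reduction:.1%}, squared correlation={rho_squared:.1%}")
```

A covariate that barely correlates with the outcome buys almost nothing, so it is worth measuring the correlation on historical data before relying on CUPED for power.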

Treat AI Evaluation As A Stack

An online experiment answers whether customers benefit under live use. It shouldn't be the first time you discover that a treatment emits unsupported refund claims. For an AI feature, use layers:

| Layer | Question | Example evidence |
| --- | --- | --- |
| Offline regression set | Does treatment still follow policy on known cases? | Hand-labeled refund prompts, groundedness and refusal checks |
| Online primary outcome | Does it help customers complete their task? | Resolution rate |
| Product guardrails | Did it hurt speed or operations? | p95 latency, escalation rate, cost |
| Integrity receipt | Did we compare the promised systems? | Assignment split, index snapshot, prompt/model/decoder versions |

If you use a large language model (LLM) as a judge for an offline rubric, treat it as a measured evaluator rather than truth: pin its model and prompt version and compare it with human labels on a retained calibration set. Zheng et al. study LLM judges as approximations to human preference and document position, verbosity, self-enhancement, and reasoning limitations.[4]
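Comparing a judge with human labels can start as simply as counting agreements on the calibration set. The sketch below reports raw agreement and Cohen's kappa, a common chance-corrected agreement statistic that this chapter does not itself prescribe. The ten labels are invented for illustration, with 1 meaning "grounded" and 0 meaning "not grounded".

```python
def agreement_and_kappa(human: list[int], judge: list[int]) -> tuple[float, float]:
    """Raw agreement and Cohen's kappa for two sets of binary labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_pos = sum(human) / n
    judge_pos = sum(judge) / n
    # Agreement expected if the two raters labeled independently at their own base rates.
    expected = human_pos * judge_pos + (1 - human_pos) * (1 - judge_pos)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use a single label")
    return observed, (observed - expected) / (1 - expected)

# Hypothetical calibration labels.
human_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

observed, kappa = agreement_and_kappa(human_labels, judge_labels)
print(f"raw agreement={observed:.0%}, kappa={kappa:.2f}")
```

Raw agreement can look high simply because one label dominates; kappa discounts the agreement expected by chance. Rerunning a check like this whenever the judge's model or prompt version changes keeps the measurement comparable over time.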

This lab makes the configuration receipt explicit. A comparison with two different decoder versions is rejected before anybody reads its lift.

validate-treatment-receipt.py
```python
control = {
    "index": "refund-policies-2026-05-01",
    "generator": "support-generator-v7",
    "decoder": {"selection": "greedy", "schema": "refund-answer-v2"},
}
treatment = {
    "index": "refund-policies-2026-05-01",
    "generator": "support-generator-v7",
    "decoder": {"selection": "greedy", "schema": "refund-answer-v2"},
    "new_component": "query-rewrite-v1",
}

# Every locked field must match; only new_component may differ between arms.
locked_fields = ("index", "generator", "decoder")
mismatches = [field for field in locked_fields if control[field] != treatment[field]]

print("intentional treatment change:", treatment["new_component"])
print("unplanned mismatches:", mismatches)
print("comparison is interpretable:", not mismatches)
```
Pinned experiment receipt
```text
intentional treatment change: query-rewrite-v1
unplanned mismatches: []
comparison is interpretable: True
```
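The receipt check earns its keep on the failing case. In the sketch below, a hypothetical treatment receipt has drifted to a different decoder selection (`"nucleus"` is an invented value for illustration), and the same comparison flags it. The `unplanned_mismatches` helper is an illustrative wrapper around the list comprehension used in the lab.

```python
def unplanned_mismatches(control: dict, treatment: dict,
                         locked_fields: tuple[str, ...] = ("index", "generator", "decoder")) -> list[str]:
    """Locked fields whose values differ between the two receipts."""
    return [field for field in locked_fields if control.get(field) != treatment.get(field)]

control = {
    "index": "refund-policies-2026-05-01",
    "generator": "support-generator-v7",
    "decoder": {"selection": "greedy", "schema": "refund-answer-v2"},
}
# Hypothetical drift: the decoder selection changed along with the rewrite.
drifted_treatment = {
    "index": "refund-policies-2026-05-01",
    "generator": "support-generator-v7",
    "decoder": {"selection": "nucleus", "schema": "refund-answer-v2"},
    "new_component": "query-rewrite-v1",
}

mismatches = unplanned_mismatches(control, drifted_treatment)
print("unplanned mismatches:", mismatches)
print("comparison is interpretable:", not mismatches)
```

With the decoder changed, any lift could come from the rewrite, the decoding policy, or both, so the comparison is rejected before its outcomes are read.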

Make The Launch Decision Auditable

Now combine the evidence. This brief requires a positive interval, an observed lift of at least +2.0 points, guardrails inside their budgets, and trustworthy traffic. This last lab is a small launch gate that prints each condition instead of burying the call in a slide deck.

launch-decision.py
```python
from math import sqrt

report = {
    "control": {"n": 2000, "resolved": 760, "p95_latency_ms": 1180, "escalated": 310, "grounding_failures": 22},
    "treatment": {"n": 2000, "resolved": 840, "p95_latency_ms": 1260, "escalated": 306, "grounding_failures": 23},
    "minimum_ship_lift": 0.02,
    "srm_alert": False,
}

c = report["control"]
t = report["treatment"]
p_c = c["resolved"] / c["n"]
p_t = t["resolved"] / t["n"]
lift = p_t - p_c
se = sqrt(p_c * (1 - p_c) / c["n"] + p_t * (1 - p_t) / t["n"])
low = lift - 1.96 * se  # lower end of the approximate 95% interval

# Guardrails are observed treatment-minus-control deltas.
latency_delta = t["p95_latency_ms"] - c["p95_latency_ms"]
escalation_delta = t["escalated"] / t["n"] - c["escalated"] / c["n"]
grounding_delta = t["grounding_failures"] / t["n"] - c["grounding_failures"] / c["n"]

checks = {
    "traffic_integrity": not report["srm_alert"],
    "positive_lift_interval": low > 0,
    "observed_lift_threshold": lift >= report["minimum_ship_lift"],
    "latency_budget": latency_delta <= 150,
    "escalation_budget": escalation_delta <= 0.005,
    "grounding_budget": grounding_delta <= 0.005,
}

for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
print(
    f"lift={lift * 100:.1f} pp, low_end={low * 100:.1f} pp, "
    f"ship_threshold={report['minimum_ship_lift'] * 100:.1f} pp, latency_delta={latency_delta} ms"
)
print("decision:", "SHIP" if all(checks.values()) else "ROLL BACK OR INVESTIGATE")
```
Launch decision report
```text
traffic_integrity: PASS
positive_lift_interval: PASS
observed_lift_threshold: PASS
latency_budget: PASS
escalation_budget: PASS
grounding_budget: PASS
lift=4.0 pp, low_end=1.0 pp, ship_threshold=2.0 pp, latency_delta=80 ms
decision: SHIP
```

This gate separates evidence from business value. positive_lift_interval asks whether the interval's lower bound is above zero. observed_lift_threshold asks whether the point estimate clears the predeclared +2.0 point bar. It doesn't prove the true lift exceeds +2.0 points: the interval begins at +1.0. A more conservative brief could require the lower bound to clear +2.0 points and would keep this result under investigation.
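The difference between the two rules is easy to see side by side. This sketch evaluates the brief's rule and the more conservative alternative on the same counts. The variable names `brief_rule` and `conservative_rule` are illustrative, and the conservative rule is the hypothetical variant described above, not part of the locked brief.

```python
from math import sqrt

n = 2000
p_c, p_t = 760 / n, 840 / n
lift = p_t - p_c
se = sqrt(p_c * (1 - p_c) / n + p_t * (1 - p_t) / n)
low = lift - 1.96 * se

minimum_ship_lift = 0.02

# The brief's rule: interval excludes zero AND the point estimate clears the bar.
brief_rule = low > 0 and lift >= minimum_ship_lift
# A stricter alternative: the interval's lower bound itself must clear the bar.
conservative_rule = low >= minimum_ship_lift

print(f"lift={lift * 100:.1f} pp, lower bound={low * 100:.1f} pp")
print("brief rule passes:", brief_rule)
print("conservative rule passes:", conservative_rule)
```

The same data passes one rule and fails the other, which is why the rule has to be chosen before the results are visible.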

For these planned numbers and this predeclared rule, the launch gate says SHIP. Change treatment latency to 1490 and rerun: resolution still improves, but the speed budget fails. A treatment doesn't earn launch by winning only its favorite metric.

Practice: Break The Launch Gate

  1. Change treatment latency from 1260 to 1490. Which check fails, and why doesn't the resolution lift override it?
  2. Restore latency, then change minimum_ship_lift from 0.02 to 0.05. Which check fails even though the interval still excludes zero?
  3. Restore the threshold, then set srm_alert to True. Why should analysis pause before anybody argues about the observed lift?

Expected observations

  1. latency_budget fails because treatment is now 310 ms slower than control, beyond the locked 150 ms budget.
  2. observed_lift_threshold fails because the observed +4.0 point lift is below the new +5.0 point business threshold. Statistical evidence and shipping value answer different questions.
  3. traffic_integrity fails. SRM can signal broken assignment, filtering, or logging, so the comparison hasn't earned a launch decision.
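One way to run all three exercises side by side is to wrap the gate in a function and pass it modified copies of the report. The sketch below assumes the same checks and budgets as the launch report; the names `BASE_REPORT` and `failed_checks` are introduced here for illustration only.

```python
from copy import deepcopy
from math import sqrt

BASE_REPORT = {
    "control": {"n": 2000, "resolved": 760, "p95_latency_ms": 1180, "escalated": 310, "grounding_failures": 22},
    "treatment": {"n": 2000, "resolved": 840, "p95_latency_ms": 1260, "escalated": 306, "grounding_failures": 23},
    "minimum_ship_lift": 0.02,
    "srm_alert": False,
}


def failed_checks(report):
    """Return the names of the launch-gate checks that fail for this report."""
    c, t = report["control"], report["treatment"]
    p_c, p_t = c["resolved"] / c["n"], t["resolved"] / t["n"]
    lift = p_t - p_c
    se = sqrt(p_c * (1 - p_c) / c["n"] + p_t * (1 - p_t) / t["n"])
    checks = {
        "traffic_integrity": not report["srm_alert"],
        "positive_lift_interval": lift - 1.96 * se > 0,
        "observed_lift_threshold": lift >= report["minimum_ship_lift"],
        "latency_budget": t["p95_latency_ms"] - c["p95_latency_ms"] <= 150,
        "escalation_budget": t["escalated"] / t["n"] - c["escalated"] / c["n"] <= 0.005,
        "grounding_budget": t["grounding_failures"] / t["n"] - c["grounding_failures"] / c["n"] <= 0.005,
    }
    return [name for name, passed in checks.items() if not passed]


# Each scenario changes exactly one input, as in the practice steps.
slow = deepcopy(BASE_REPORT)
slow["treatment"]["p95_latency_ms"] = 1490

higher_bar = deepcopy(BASE_REPORT)
higher_bar["minimum_ship_lift"] = 0.05

srm = deepcopy(BASE_REPORT)
srm["srm_alert"] = True

scenarios = {
    "baseline": BASE_REPORT,
    "latency 1490 ms": slow,
    "minimum_ship_lift 0.05": higher_bar,
    "srm_alert True": srm,
}
for label, report in scenarios.items():
    failures = failed_checks(report)
    decision = "SHIP" if not failures else "ROLL BACK OR INVESTIGATE"
    print(f"{label}: {decision} {failures}")
```

Copying the report with `deepcopy` keeps each scenario independent, so a change made for one exercise cannot leak into the next.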

Common Pitfalls

| Symptom | Likely cause | Fix before making a claim |
| --- | --- | --- |
| Customer receives both arms during one conversation thread. | Randomized by request instead of stable customer identity. | Assign on first eligible event and persist the arm. |
| Treatment count is far from planned split. | Assignment, filtering, or event delivery differs by arm. | Pause analysis and diagnose SRM. |
| Dashboard turns green only after daily checking. | Fixed-horizon statistics were used with optional stopping. | Analyze once at planned horizon or use a planned sequential method. |
| Resolution rises while policy mistakes rise too. | Primary product outcome ignored groundedness risk. | Make audited grounding a launch guardrail. |
| An automated judge improves after its prompt changed mid-run. | Measurement ruler changed during experiment. | Pin evaluator version and retain human calibration cases. |
| CUPED result changes direction after using live latency as a covariate. | Adjustment used a post-treatment variable. | Use covariates measured before assignment only. |
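The SRM row can be made concrete with a chi-square goodness-of-fit statistic on the arm counts. The sketch below is one common way to compute it, not necessarily how this lesson's `srm_alert` flag is produced; `srm_chi_square` is a hypothetical helper, and 10.83 is the alert threshold quoted in this lesson's mastery check (it is about the chi-square critical value for one degree of freedom at p = 0.001).

```python
def srm_chi_square(control_n, treatment_n, control_share=0.5):
    """Pearson chi-square statistic (1 degree of freedom) for a two-arm split."""
    total = control_n + treatment_n
    expected_c = total * control_share
    expected_t = total * (1 - control_share)
    return (
        (control_n - expected_c) ** 2 / expected_c
        + (treatment_n - expected_t) ** 2 / expected_t
    )


SRM_ALERT_THRESHOLD = 10.83  # threshold quoted in this lesson's mastery check

# Arm counts that appear in this lesson: the launch report and the mastery check.
observed_splits = {
    "launch report": (2000, 2000),
    "mastery check example": (1650, 2350),
}
for label, (n_c, n_t) in observed_splits.items():
    stat = srm_chi_square(n_c, n_t)
    alert = stat > SRM_ALERT_THRESHOLD
    print(f"{label}: chi_square={stat:.1f}, srm_alert={alert}")
```

A raised alert says nothing about which arm is better. It says the counts are unlikely under the planned split, so assignment, filtering, and logging need diagnosis first.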

Explain The Decision Without Looking Back

Without looking back, explain why this experiment changes only the query rewrite while holding index, generator, prompt, and decoder fixed. Then answer these checks.

Evaluation rubric

  • Foundational: Defines control, treatment, enrolled customer, primary outcome, and guardrails before viewing results.
  • Intermediate: Computes lift and an uncertainty interval, then identifies SRM and optional stopping as reasons to withhold a launch call.
  • Advanced: Produces an auditable AI launch report with pinned system receipts, justified MDE, grounding checks, latency budgets, and a valid analysis plan.

Reuse This Experiment Pattern

  • A stable customer assignment function with a versioned experiment key.
  • A metric definition based on observable resolution events rather than answer fluency.
  • A planned analysis that reports absolute lift, uncertainty, sample size assumptions, and SRM status.
  • A launch report that keeps groundedness and latency beside the primary outcome.
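The first item in this list can be sketched with a deterministic hash. This is an assumption about implementation, not the lesson's stated method: hashing the experiment key together with the customer ID is one common way to keep a customer in one arm, and a production system might also persist the assigned arm. The function name, the key `query-rewrite-v1`, and the customer IDs are hypothetical.

```python
import hashlib


def assign_arm(customer_id: str, experiment_key: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a customer to one arm for one experiment version."""
    digest = hashlib.sha256(f"{experiment_key}:{customer_id}".encode("utf-8")).hexdigest()
    # First 32 bits of the hash, scaled to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# Two events from the same customer resolve to the same arm.
first_event = assign_arm("cust-123", "query-rewrite-v1")
follow_up = assign_arm("cust-123", "query-rewrite-v1")
print(first_event, follow_up, first_event == follow_up)
```

Because the experiment key is part of the hashed string, bumping the version (for example to a hypothetical `query-rewrite-v2`) reshuffles customers independently of the previous experiment.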

Key Terms

  • Control: Existing system used as comparison baseline.
  • Treatment: Candidate system containing one intended change.
  • Randomization unit: Entity kept in one arm, such as a customer account.
  • Guardrail: Metric that must remain acceptable even when primary metric improves.
  • MDE: Minimum detectable effect, the smallest effect the experiment is designed to detect reliably.
  • SRM: Sample ratio mismatch, an assignment-integrity alarm.
  • CUPED: Pre-experiment covariate adjustment that can reduce outcome variance.
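To make the CUPED entry concrete, the adjustment subtracts the part of each outcome that a pre-experiment covariate predicts. The sketch below is a minimal single-sample illustration on synthetic data generated in the code, not on results from this lesson. `cuped_adjust` is a hypothetical helper, and a real analysis would typically estimate the coefficient on pooled data from both arms.

```python
import random
from statistics import mean, variance


def cuped_adjust(outcomes, pre_covariate):
    """Return CUPED-adjusted outcomes using a covariate measured before assignment."""
    x_bar, y_bar = mean(pre_covariate), mean(outcomes)
    cov_xy = sum(
        (x - x_bar) * (y - y_bar) for x, y in zip(pre_covariate, outcomes)
    ) / (len(outcomes) - 1)
    theta = cov_xy / variance(pre_covariate)
    # Subtract the covariate-explained part; the sample mean is unchanged.
    return [y - theta * (x - x_bar) for x, y in zip(pre_covariate, outcomes)]


# Synthetic illustration: the outcome is partly explained by a pre-period metric.
rng = random.Random(0)
pre = [rng.gauss(0, 1) for _ in range(5000)]
post = [0.6 * x + rng.gauss(0, 1) for x in pre]

adjusted = cuped_adjust(post, pre)
print(f"variance before: {variance(post):.3f}")
print(f"variance after:  {variance(adjusted):.3f}")
print(f"mean shift:      {abs(mean(adjusted) - mean(post)):.2e}")
```

The variance drops only as far as the covariate correlates with the outcome. The covariate must come from before assignment, so the treatment cannot have influenced it.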

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

  1. The intended treatment is query rewriting before retrieval. The receipt shows the same policy index and generator in both arms, but control uses decoder schema refund-answer-v2 and treatment uses refund-answer-v3. What should the team conclude before reading lift?
  2. A customer sends an eligible refund question and a follow-up five minutes later. Which assignment implementation prevents the customer from receiving both experiment arms?
  3. Resolution requires a solved mark and no human ticket within 10 minutes. Four conversations have these results: A is solved with no ticket, B is solved with a ticket after 18 minutes, C is solved with a ticket after 6 minutes, and D is not solved with no ticket. What is the resolution rate?
  4. Treatment resolves 840 of 2,000 customers and control resolves 760 of 2,000. What is the clearest user-facing interpretation of absolute lift?
  5. At the planned final look, an experiment reports an absolute lift of +4.0 percentage points with a standard error of 1.55 points. Using lift plus or minus 1.96 times the standard error, which interpretation is correct?
  6. A team keeps baseline rate, alpha, and target power the same but lowers the MDE for planning from 2.0 percentage points to 1.0 percentage point. What changes in the plan?
  7. A test planned a 50/50 split but records 1,650 control customers and 2,350 treatment customers. Its chi-square statistic is 122.5, far above the predeclared SRM alert threshold of 10.83. What should the team do next?
  8. A team planned one fixed-horizon analysis, but after launch it checks an ordinary 95% interval after every 100-customer batch and stops the first time the interval excludes zero. What is the problem?
  9. For CUPED on the query-rewrite experiment, which covariate is valid to use for variance reduction?
  10. A predeclared gate requires no SRM alert, a 95% lift interval lower bound above zero, observed lift of at least +2.0 points, and a p95 latency increase no greater than 150 ms. Traffic is trustworthy, lift is +4.0 points with a +1.0 point lower bound, but p95 latency rises from 1180 ms to 1490 ms. Other guardrails pass. What is the decision?


Next Step
Continue to PyTorch Training Loops

You can now design evidence that a model-system change helped customers. Next you will build the optimization loop that creates learned model changes in the first place.

References

Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing.

Kohavi, R., Tang, D., Xu, Y. · 2020

Peeking at A/B Tests: Why it matters, and what to do about it.

Johari, R., et al. · 2017 · KDD 2017

Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data (CUPED)

Deng, A., Xu, Y., Kohavi, R., & Walker, T. · 2013

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.

Zheng, L., et al. · 2023 · NeurIPS 2023