Design a trustworthy online experiment for an AI support change: randomize customers, measure useful outcomes, quantify uncertainty, and reject false wins.
A support assistant can produce a fluent refund answer and still fail customers. Perhaps the answer arrives too slowly. Perhaps it cites the wrong policy. Perhaps customers reopen a ticket five minutes later. A model improvement matters only when a careful measurement says it helps.
In Decoding Algorithms, you logged the retrieval evidence and decoder settings that produced an answer. Now you will hold that generator fixed and test one proposed change: rewriting a customer's query before retrieval. An A/B test, also called a randomized controlled experiment, gives randomly assigned customers either the existing system (control) or the new system (treatment) and compares their outcomes.
The goal isn't a green dashboard tile. It's a defensible decision: ship the rewrite, keep investigating, or roll it back.
[Figure: resolution rate, 38% to 42%; the planned interval stays above zero, the point estimate clears the +2.0-point ship threshold, traffic is balanced, and every declared guardrail remains inside budget.]

Suppose a shopper asks:
I bought an annual plan 12 days ago. Can I get a refund?
Your existing pipeline retrieves policy passages from the raw question and generates a reply. A candidate treatment first rewrites the request into a retrieval-focused query such as annual plan refund within 30 days, then sends the retrieved evidence through the same prompt, model, and decoder settings.
| Component | Control | Treatment | Why it matters |
|---|---|---|---|
| Query sent to retriever | Original customer message | Rewritten retrieval query | This is the one intentional change. |
| Policy index snapshot | refund-policies-2026-05-01 | refund-policies-2026-05-01 | A different index would mix two changes together. |
| Generator and prompt | Same pinned version | Same pinned version | Generation changes can't masquerade as retrieval gains. |
| Decoder receipt | selection=greedy, same output schema | selection=greedy, same output schema | Decoder policy doesn't differ by arm. |
Write a hypothesis that can fail:
Query rewriting will increase resolved refund conversations while keeping grounded-answer audit failures, p95 latency, and human escalation inside their budgets.
Here, p95 latency is the response time that 95 percent of enrolled conversations finish at or below. It protects customers in the slower tail, not only the average user.
An experiment brief records the decision before results tempt you to change it:
| Brief field | Locked choice |
|---|---|
| Enrollment moment | First submitted refund question |
| Randomization unit | Customer account ID |
| Primary metric | Resolved conversation rate |
| Resolved definition | Customer marks solved and opens no human ticket within 10 minutes |
| Guardrail: evidence | Grounded-answer failure rate must not increase by more than 0.5 percentage points |
| Guardrail: speed | p95 latency must not increase by more than 150 ms |
| Guardrail: operations | Escalation rate must not increase by more than 0.5 percentage points |
| MDE for power planning | +2.0 percentage points in resolution rate |
| Launch rule for resolution | Approximate 95% interval lower bound above zero and observed lift of at least +2.0 percentage points |
| Analysis rule | One fixed-horizon look after planned enrollment |
For this compact teaching gate, guardrails use observed deltas. A production brief should also predeclare how uncertainty is handled for each guardrail; a point estimate barely inside a budget isn't strong evidence that the actual regression is acceptable.
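One way to keep the brief from drifting after results arrive is to store it as an immutable record. The sketch below is illustrative only: the class name `ExperimentBrief` and its field names are assumptions, not part of any experimentation platform, and the values are copied from the brief table above.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ExperimentBrief:
    """Locked choices from the brief table; names are hypothetical."""
    enrollment_moment: str
    randomization_unit: str
    primary_metric: str
    resolved_window_min: int
    grounding_budget_pp: float
    latency_budget_ms: int
    escalation_budget_pp: float
    mde_pp: float
    min_ship_lift_pp: float
    analysis_rule: str


brief = ExperimentBrief(
    enrollment_moment="first submitted refund question",
    randomization_unit="customer_account_id",
    primary_metric="resolved_conversation_rate",
    resolved_window_min=10,
    grounding_budget_pp=0.5,
    latency_budget_ms=150,
    escalation_budget_pp=0.5,
    mde_pp=2.0,
    min_ship_lift_pp=2.0,
    analysis_rule="one fixed-horizon look after planned enrollment",
)

try:
    # Attempt to lower the ship threshold after the fact.
    brief.min_ship_lift_pp = 1.0
    edit_blocked = False
except FrozenInstanceError:
    edit_blocked = True

print("brief edit blocked:", edit_blocked)
```

A frozen record doesn't stop someone from writing a new brief, but it makes a changed threshold show up as a new object in version control rather than a silent edit.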
Random assignment supports causal conclusions only when assignment, instrumentation, and stopping rules stay trustworthy. Practical online experiment systems make this pre-launch contract explicit.[1]
Why assign by customer account rather than message? One shopper might ask a follow-up question after seeing the first answer. If the first message uses treatment and the follow-up uses control, the experiences interfere: the second outcome depends partly on what happened in the first.
A stable hash gives each enrolled customer the same arm on every request. The salt includes an experiment name and version so a later experiment can assign independently.
```python
import hashlib

EXPERIMENT = "refund-query-rewrite:v1"

def arm_for(customer_id: str) -> str:
    payload = f"{EXPERIMENT}:{customer_id}".encode()
    bucket = int.from_bytes(hashlib.sha256(payload).digest()[:8], "big") % 100
    return "treatment" if bucket < 50 else "control"

customers = ["customer-014", "customer-014", "customer-231", "customer-844"]
for customer in customers:
    print(customer, arm_for(customer))

assert arm_for("customer-014") == arm_for("customer-014")
print("repeat assignment is stable")
```

```text
customer-014 treatment
customer-014 treatment
customer-231 control
customer-844 control
repeat assignment is stable
```

If the same customer appears twice in the output, the assigned arm must match twice. In a service, persist the experiment version and arm on each event as well. Hashing is deterministic; logging makes it auditable.
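As a rough sketch of what persisting the arm on each event could look like, the snippet below repeats the hash assignment and attaches the experiment version and arm to a logged record. The event shape, the `exposure_event` helper, and the synthetic customer IDs are assumptions for illustration; a real service would use its own event schema.

```python
import hashlib
import json
from collections import Counter

EXPERIMENT = "refund-query-rewrite:v1"


def arm_for(customer_id: str) -> str:
    payload = f"{EXPERIMENT}:{customer_id}".encode()
    bucket = int.from_bytes(hashlib.sha256(payload).digest()[:8], "big") % 100
    return "treatment" if bucket < 50 else "control"


def exposure_event(customer_id: str, event_name: str) -> dict:
    # Hypothetical event shape: every logged event carries experiment and arm.
    return {
        "experiment": EXPERIMENT,
        "customer_id": customer_id,
        "arm": arm_for(customer_id),
        "event": event_name,
    }


print(json.dumps(exposure_event("customer-014", "refund_question_submitted")))

# Synthetic IDs, used only to check that the hash splits traffic roughly evenly.
counts = Counter(arm_for(f"customer-{i:05d}") for i in range(10_000))
print(dict(counts))
```

The second print is a quick sanity check that the hash produces something close to the intended 50/50 split before any real traffic is enrolled.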
The assistant shouldn't get credit merely because it produced an answer. Resolution requires an observed customer outcome. For this chapter, one conversation counts as resolved only if the customer marks it solved and doesn't open a support ticket within ten minutes.
This next lab turns raw product events into that metric. It also records two guardrails: whether an audit says the answer is grounded in retrieved policy and the response latency in milliseconds.
```python
conversations = [
    {"arm": "control", "solved": True, "ticket_after_min": None, "grounded": True, "latency_ms": 1080},
    {"arm": "control", "solved": True, "ticket_after_min": 4, "grounded": True, "latency_ms": 1200},
    {"arm": "control", "solved": False, "ticket_after_min": 2, "grounded": False, "latency_ms": 980},
    {"arm": "treatment", "solved": True, "ticket_after_min": None, "grounded": True, "latency_ms": 1120},
    {"arm": "treatment", "solved": True, "ticket_after_min": 18, "grounded": True, "latency_ms": 1240},
    {"arm": "treatment", "solved": True, "ticket_after_min": 6, "grounded": True, "latency_ms": 1280},
]

def resolved(row: dict) -> bool:
    ticket_too_soon = row["ticket_after_min"] is not None and row["ticket_after_min"] <= 10
    return row["solved"] and not ticket_too_soon

for arm in ("control", "treatment"):
    rows = [row for row in conversations if row["arm"] == arm]
    successes = sum(resolved(row) for row in rows)
    grounded = sum(row["grounded"] for row in rows)
    print(f"{arm}: resolved={successes}/{len(rows)} grounded={grounded}/{len(rows)}")
```

```text
control: resolved=1/3 grounded=2/3
treatment: resolved=2/3 grounded=3/3
```

The small rows teach the definition, not the launch result. Notice that the treatment row with ticket_after_min=18 still counts as resolved because its ticket arrives outside the locked ten-minute window. That isn't automatically correct or wrong: it makes the metric contract inspectable. If later tickets matter, choose a longer window or track them separately before launch. A real analysis also needs enough customers to distinguish an actual lift from chance variation.
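To see how much the window choice matters, a small sketch can make the window a parameter and recount the same rows. The four rows and the 30-minute alternative are made up for illustration; only the ten-minute window comes from the brief.

```python
# Hypothetical rows; only the fields the resolved definition needs.
conversations = [
    {"solved": True, "ticket_after_min": None},
    {"solved": True, "ticket_after_min": 4},
    {"solved": True, "ticket_after_min": 18},
    {"solved": False, "ticket_after_min": 2},
]


def resolved(row: dict, window_min: int) -> bool:
    ticket = row["ticket_after_min"]
    ticket_in_window = ticket is not None and ticket <= window_min
    return row["solved"] and not ticket_in_window


counts_by_window = {}
for window_min in (10, 30):
    counts_by_window[window_min] = sum(resolved(row, window_min) for row in conversations)
    print(f"window={window_min} min: resolved={counts_by_window[window_min]}/{len(conversations)}")
```

With the ten-minute window two of the four rows count as resolved; widening to 30 minutes drops the 18-minute row and leaves one. That sensitivity is why the window belongs in the brief before launch rather than in the analysis afterward.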
After the planned window, suppose each arm contains 2,000 enrolled customers.
| Variant | Customers | Resolved | Resolution rate | p95 latency | Escalated | Grounded audit failures |
|---|---|---|---|---|---|---|
| Control: raw query | 2,000 | 760 | 38.0% | 1180 ms | 310 | 22 |
| Treatment: rewritten query | 2,000 | 840 | 42.0% | 1260 ms | 306 | 23 |
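The absolute lift is the treatment rate minus the control rate, where $\hat{p}_T$ and $\hat{p}_C$ are the observed treatment and control resolution rates:

$$\text{absolute lift} = \hat{p}_T - \hat{p}_C = 0.42 - 0.38 = 0.04$$

The treatment resolves 4.0 additional conversations per 100 enrolled customers. That is a 4.0 percentage point lift.

The relative lift compares the absolute change with the original baseline:

$$\text{relative lift} = \frac{\hat{p}_T - \hat{p}_C}{\hat{p}_C} = \frac{0.04}{0.38} \approx 0.105$$

That is a 10.5 percent relative lift. Always name which one you report; saying "up 10.5 points" would be wrong.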
```python
control_resolved, control_n = 760, 2000
treatment_resolved, treatment_n = 840, 2000

p_control = control_resolved / control_n
p_treatment = treatment_resolved / treatment_n
absolute_lift = p_treatment - p_control
relative_lift = absolute_lift / p_control

print(f"control rate: {p_control:.1%}")
print(f"treatment rate: {p_treatment:.1%}")
print(f"absolute lift: {absolute_lift * 100:.1f} percentage points")
print(f"relative lift: {relative_lift:.1%}")
```

```text
control rate: 38.0%
treatment rate: 42.0%
absolute lift: 4.0 percentage points
relative lift: 10.5%
```

If you reran the experiment with different customers, the counts wouldn't be identical. A confidence interval is a procedure for describing this sampling uncertainty. For two reasonably large binary-outcome arms, a normal-approximation interval is a useful first calculation.
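For each arm $a \in \{C, T\}$ (control and treatment), $\hat{p}_a$ is its observed resolution rate and $n_a$ is its number of enrolled customers. The estimated standard error of the difference is:

$$\mathrm{SE} = \sqrt{\frac{\hat{p}_C(1 - \hat{p}_C)}{n_C} + \frac{\hat{p}_T(1 - \hat{p}_T)}{n_T}}$$

A rough 95 percent interval is $(\hat{p}_T - \hat{p}_C) \pm 1.96 \cdot \mathrm{SE}$. The number 1.96 is the standard-normal cutoff that leaves about 2.5 percent in each tail.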
```python
from math import sqrt

control_resolved, control_n = 760, 2000
treatment_resolved, treatment_n = 840, 2000

p_control = control_resolved / control_n
p_treatment = treatment_resolved / treatment_n
diff = p_treatment - p_control
se = sqrt(
    p_control * (1 - p_control) / control_n
    + p_treatment * (1 - p_treatment) / treatment_n
)
low = diff - 1.96 * se
high = diff + 1.96 * se

print(f"estimated lift: {diff * 100:.1f} pp")
print(f"standard error: {se * 100:.2f} pp")
print(f"approximate 95% interval: [{low * 100:.1f}, {high * 100:.1f}] pp")
```

```text
estimated lift: 4.0 pp
standard error: 1.55 pp
approximate 95% interval: [1.0, 7.0] pp
```

The estimate is +4.0 points and this approximation gives an interval of about +1.0 to +7.0 points. Because the lower bound is above zero, this planned analysis gives evidence for a positive treatment effect, assuming assignment and instrumentation are sound. It doesn't prove every future rollout will gain four points.
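The same normal approximation can be read as a two-sided test: an interval that excludes zero corresponds to a z statistic beyond 1.96. The sketch below reuses the unpooled standard error from the interval so the two views agree; it is one simple choice, not the only valid test for two proportions.

```python
from math import sqrt
from statistics import NormalDist

control_resolved, control_n = 760, 2000
treatment_resolved, treatment_n = 840, 2000

p_control = control_resolved / control_n
p_treatment = treatment_resolved / treatment_n
diff = p_treatment - p_control

# Same unpooled standard error as the interval calculation.
se = sqrt(
    p_control * (1 - p_control) / control_n
    + p_treatment * (1 - p_treatment) / treatment_n
)

z = diff / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z statistic: {z:.2f}")
print(f"two-sided p-value: {p_value:.4f}")
```

For these counts the z statistic is roughly 2.6 and the p-value is a little under 0.01, which tells the same story as the interval: the observed lift would be unusual if the arms truly performed the same.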
For rare outcomes, heavily clustered customers, many simultaneous comparisons, or business-critical launches, choose the inference method with a statistician or a mature experimentation platform before running the test.
[Figure: +1.0 to +7.0-point interval; an SRM or version mismatch exits to investigation before anyone reads lift.]

A test with too little traffic may return "uncertain" even when the treatment helps. Before launching, declare a minimum detectable effect (MDE): the smallest true lift the planned test is designed to detect with its target power. Our team chooses +2.0 percentage points because a smaller resolution improvement wouldn't justify maintaining the rewrite service. This brief also uses +2.0 points as its minimum observed lift worth shipping. Keep the two fields conceptually separate: MDE determines planned enrollment, while the launch threshold determines the decision rule.
See what sample size does while the observed rates stay at 38 and 42 percent:
```python
from math import sqrt

p_control = 0.38
p_treatment = 0.42
lift = p_treatment - p_control

for n in (200, 500, 2000):
    se = sqrt(p_control * (1 - p_control) / n + p_treatment * (1 - p_treatment) / n)
    low = lift - 1.96 * se
    high = lift + 1.96 * se
    print(f"{n:>4} per arm: [{low * 100:>4.1f}, {high * 100:>4.1f}] pp")
```

```text
 200 per arm: [-5.6, 13.6] pp
 500 per arm: [-2.1, 10.1] pp
2000 per arm: [ 1.0,  7.0] pp
```

With 200 or 500 customers per arm, a good-looking +4.0 point estimate is still compatible with no improvement. More enrollment narrows uncertainty; it doesn't magically make treatment better.
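Power is the probability that a planned test will detect an effect of the chosen size when that effect truly exists. For equal-sized arms and a binary rate near the baseline, this approximation provides a planning estimate:

$$n_{\text{per arm}} \approx \frac{2\,p(1 - p)\,\left(z_{1-\alpha/2} + z_{\text{power}}\right)^2}{\delta^2}$$

Here, $p$ is baseline resolution rate, $\delta$ is the MDE, $\alpha$ is the false-positive budget, and power is the desired detection probability. Each $z$ term is the standard-normal quantile at the subscripted probability.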
```python
from math import ceil
from statistics import NormalDist

baseline_rate = 0.38
mde = 0.02
alpha = 0.05
target_power = 0.80

normal = NormalDist()
z_alpha = normal.inv_cdf(1 - alpha / 2)
z_power = normal.inv_cdf(target_power)
n_per_arm = ceil(
    2 * baseline_rate * (1 - baseline_rate) * (z_alpha + z_power) ** 2 / mde**2
)

print(f"MDE: {mde * 100:.1f} pp")
print(f"target power: {target_power:.0%}")
print(f"planning estimate: {n_per_arm:,} customers per arm")
```

```text
MDE: 2.0 pp
target power: 80%
planning estimate: 9,246 customers per arm
```

This formula is for planning, not post-result storytelling. Actual platform planning may account for unequal allocation, repeated customers, variance reduction, multiple metrics, or sequential analysis.
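The planning formula can also be turned around: fix the enrollment and ask what power it buys for a true lift equal to the MDE. The sketch below uses the same simplification as the formula (both arms near the baseline rate) and ignores the tiny chance of a significant result in the wrong direction, so treat the numbers as rough planning figures. The helper name `approx_power` is an assumption for illustration.

```python
from math import sqrt
from statistics import NormalDist

baseline_rate = 0.38
mde = 0.02
alpha = 0.05

normal = NormalDist()
z_alpha = normal.inv_cdf(1 - alpha / 2)


def approx_power(n_per_arm: int) -> float:
    # Standard error of the difference when both arms sit near the baseline.
    se = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    # Probability the z statistic clears the cutoff when the true lift is the MDE.
    return normal.cdf(mde / se - z_alpha)


for n in (2_000, 9_246):
    print(f"{n:>5,} per arm: power for a +{mde * 100:.1f} pp true lift = {approx_power(n):.0%}")
```

Under this approximation, 9,246 per arm recovers the 80 percent target, while 2,000 per arm gives only about one chance in four of detecting a true +2.0 point lift. A smaller test can still detect a larger effect, but it isn't sized for the declared MDE.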
A sample ratio mismatch (SRM) means observed assignment counts disagree suspiciously with the intended split. If the design says 50/50 but your data has 2,350 treatment customers and 1,650 control customers, investigate before reading resolution lift. Assignment code, event logging, eligibility filters, or bot removal may differ by arm.
This lab computes a chi-square diagnostic for a planned 50/50 allocation. A one-degree-of-freedom value above 10.83 is a deliberately strict alert threshold corresponding to a very small tail probability, about 0.1 percent.
```python
observed = {"control": 1650, "treatment": 2350}
expected_each = sum(observed.values()) / 2

chi_square = sum(
    (count - expected_each) ** 2 / expected_each
    for count in observed.values()
)
alert_threshold = 10.83

print("planned split: 50% / 50%")
print(f"observed: control={observed['control']}, treatment={observed['treatment']}")
print(f"chi-square statistic: {chi_square:.1f}")
print("pause analysis for SRM investigation:", chi_square > alert_threshold)
```

```text
planned split: 50% / 50%
observed: control=1650, treatment=2350
chi-square statistic: 122.5
pause analysis for SRM investigation: True
```

An SRM alert doesn't identify the bug. It says your comparison hasn't earned trust yet. Stop, diagnose the event path, and restart or reanalyze only when you understand what changed.
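For one degree of freedom the chi-square tail probability has a closed form using the complementary error function, so the 10.83 threshold can be checked with the standard library alone. This is a small sketch; the helper names and the second, nearly balanced split (1,985 versus 2,015) are made up for contrast.

```python
from math import erfc, sqrt


def chi_square_1df_tail(statistic: float) -> float:
    # For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(statistic / 2))


def srm_statistic(control: int, treatment: int) -> float:
    # Chi-square statistic against a planned 50/50 split.
    expected_each = (control + treatment) / 2
    return sum((count - expected_each) ** 2 / expected_each for count in (control, treatment))


print(f"tail probability at 10.83: {chi_square_1df_tail(10.83):.4f}")
for control, treatment in ((1650, 2350), (1985, 2015)):
    stat = srm_statistic(control, treatment)
    print(
        f"control={control}, treatment={treatment}: "
        f"chi-square={stat:.2f}, p={chi_square_1df_tail(stat):.3g}"
    )
```

The tail probability at 10.83 comes out at about 0.001, matching the strict alert level. The nearly balanced split produces a small statistic and a large p-value, which is what ordinary sampling noise around 50/50 looks like.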
Suppose treatment truly has no effect: both arms resolve 40 percent of conversations. If you run one planned two-sided check at the end with a 5 percent false-positive threshold, false wins should occur about 5 percent of the time. If you inspect the ordinary interval after every batch and stop at the first apparent win, noise receives many chances to cross the threshold.
The following A/A simulation uses two identical arms, 5,000 repeat experiments, 20 looks, and a fixed seed. It demonstrates this exact stopping policy, not a universal percentage for every test design.
```python
import numpy as np

rng = np.random.default_rng(11)
trials = 5_000
looks = 20
batch_size = 100
true_rate = 0.40

planned_wins = 0
peek_wins = 0

for _ in range(trials):
    control = rng.binomial(1, true_rate, size=looks * batch_size)
    treatment = rng.binomial(1, true_rate, size=looks * batch_size)
    crossed_early = False

    for n in range(batch_size, looks * batch_size + 1, batch_size):
        p_control = control[:n].mean()
        p_treatment = treatment[:n].mean()
        se = (p_control * (1 - p_control) / n + p_treatment * (1 - p_treatment) / n) ** 0.5
        significant = se > 0 and abs(p_treatment - p_control) / se > 1.96
        crossed_early = crossed_early or significant
        if n == looks * batch_size:
            planned_wins += significant

    peek_wins += crossed_early

print(f"planned final look: {planned_wins / trials:.1%}")
print(f"stop at first crossing: {peek_wins / trials:.1%}")
```

```text
planned final look: 4.9%
stop at first crossing: 25.5%
```
Ordinary fixed-horizon intervals aren't valid for a decision triggered by continuous monitoring. Johari and colleagues define always-valid p-values and confidence intervals for sequential decisions, where repeated observation is part of the planned method.[2] For a first experiment, the straightforward rule is enough: choose enrollment in advance and analyze once.
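To make "repeated observation is part of the planned method" concrete, here is a deliberately crude stand-in: split the 5 percent budget evenly across a pre-planned number of looks (a Bonferroni correction). This is not the always-valid method from Johari and colleagues, and it is more conservative than purpose-built sequential designs. The sketch uses a smaller A/A simulation than the one above (300 trials, 10 looks) so it runs quickly with the standard library.

```python
import random
from statistics import NormalDist

random.seed(11)
trials, looks, batch_size, true_rate, alpha = 300, 10, 100, 0.40, 0.05

naive_z = NormalDist().inv_cdf(1 - alpha / 2)
# Bonferroni: split the false-positive budget evenly across the planned looks.
bonferroni_z = NormalDist().inv_cdf(1 - alpha / (2 * looks))

naive_wins = 0
bonferroni_wins = 0
for _ in range(trials):
    control_hits = treatment_hits = 0
    naive_crossed = bonferroni_crossed = False
    for look in range(1, looks + 1):
        # Both arms share the same true rate, so any "win" is a false positive.
        control_hits += sum(random.random() < true_rate for _ in range(batch_size))
        treatment_hits += sum(random.random() < true_rate for _ in range(batch_size))
        n = look * batch_size
        p_c, p_t = control_hits / n, treatment_hits / n
        se = (p_c * (1 - p_c) / n + p_t * (1 - p_t) / n) ** 0.5
        z = abs(p_t - p_c) / se if se > 0 else 0.0
        naive_crossed = naive_crossed or z > naive_z
        bonferroni_crossed = bonferroni_crossed or z > bonferroni_z
    naive_wins += naive_crossed
    bonferroni_wins += bonferroni_crossed

naive_rate = naive_wins / trials
bonferroni_rate = bonferroni_wins / trials
print(f"cutoffs: naive={naive_z:.2f}, bonferroni={bonferroni_z:.2f}")
print(f"false wins with ordinary cutoff at every look: {naive_rate:.1%}")
print(f"false wins with Bonferroni cutoff at every look: {bonferroni_rate:.1%}")
```

The stricter cutoff can only reduce the number of crossings, at the cost of needing a larger observed effect to stop early. The number of looks must still be fixed in advance; adding looks later breaks the correction.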
Power doesn't always require more customers. CUPED (Controlled-experiment Using Pre-Experiment Data) uses a measurement collected before assignment, such as whether a customer resolved a prior refund issue, to remove predictable customer-to-customer variation from the outcome.
Let $Y$ be the experiment outcome and $X$ be a pre-experiment covariate. CUPED forms:

$$Y_{\text{cuped}} = Y - \theta\,(X - \bar{X}), \qquad \theta = \frac{\operatorname{Cov}(Y, X)}{\operatorname{Var}(X)}$$

Since $X$ was measured before treatment, it can't have been caused by treatment. When $X$ predicts $Y$, the adjusted outcome varies less while targeting the same treatment difference. Deng, Xu, Kohavi, and Walker introduced this approach for online experiments and reported substantial variance reductions in Bing experiments.[3]
This simulation gives returning customers different baseline resolution probabilities, then assigns treatment independently. Compare the standard error before and after CUPED.
```python
import numpy as np

rng = np.random.default_rng(27)
n = 20_000
arm = rng.integers(0, 2, size=n)
resolved_before = rng.binomial(1, 0.45, size=n)
probability = np.clip(0.18 + 0.46 * resolved_before + 0.04 * arm, 0, 1)
resolved_now = rng.binomial(1, probability)

theta = np.cov(resolved_now, resolved_before, ddof=1)[0, 1] / np.var(resolved_before, ddof=1)
adjusted = resolved_now - theta * (resolved_before - resolved_before.mean())

def lift_and_se(outcome: np.ndarray) -> tuple[float, float]:
    control = outcome[arm == 0]
    treatment = outcome[arm == 1]
    lift = treatment.mean() - control.mean()
    se = (control.var(ddof=1) / len(control) + treatment.var(ddof=1) / len(treatment)) ** 0.5
    return lift, se

raw_lift, raw_se = lift_and_se(resolved_now)
cuped_lift, cuped_se = lift_and_se(adjusted)
variance_reduction = 1 - adjusted.var(ddof=1) / resolved_now.var(ddof=1)

print(f"raw estimate: lift={raw_lift * 100:.2f} pp, se={raw_se * 100:.2f} pp")
print(f"CUPED estimate: lift={cuped_lift * 100:.2f} pp, se={cuped_se * 100:.2f} pp")
print(f"outcome variance reduced: {variance_reduction:.1%}")
```

```text
raw estimate: lift=3.15 pp, se=0.70 pp
CUPED estimate: lift=3.17 pp, se=0.61 pp
outcome variance reduced: 21.9%
```

In a finite sample, the raw and adjusted estimates won't be numerically identical. What matters is that adjustment uses only a pre-treatment signal and reduces uncertainty. Never adjust for a quantity treatment could affect, such as latency measured after the rewrite is enabled; that can bias the comparison.
An online experiment answers whether customers benefit under live use. It shouldn't be the first time you discover that a treatment emits unsupported refund claims. For an AI feature, use layers:
| Layer | Question | Example evidence |
|---|---|---|
| Offline regression set | Does treatment still follow policy on known cases? | Hand-labeled refund prompts, groundedness and refusal checks |
| Online primary outcome | Does it help customers complete their task? | Resolution rate |
| Product guardrails | Did it hurt speed or operations? | p95 latency, escalation rate, cost |
| Integrity receipt | Did we compare the promised systems? | Assignment split, index snapshot, prompt/model/decoder versions |
If you use a large language model (LLM) as a judge for an offline rubric, treat it as a measured evaluator rather than truth: pin its model and prompt version and compare it with human labels on a retained calibration set. Zheng et al. study LLM judges as approximations to human preference and document position, verbosity, self-enhancement, and reasoning limitations.[4]
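Comparing a judge with human labels can start as simply as an agreement rate plus a chance-corrected score such as Cohen's kappa. The sketch below uses ten made-up binary groundedness labels; the data and variable names are illustrative assumptions, and a real calibration set would be much larger.

```python
# Hypothetical calibration labels: 1 = grounded, 0 = not grounded.
human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

n = len(human)
agreement = sum(h == j for h, j in zip(human, judge)) / n

# Agreement expected by chance if the two raters labeled independently.
human_pos = sum(human) / n
judge_pos = sum(judge) / n
chance = human_pos * judge_pos + (1 - human_pos) * (1 - judge_pos)

# Cohen's kappa: agreement beyond chance, scaled to the maximum possible.
kappa = (agreement - chance) / (1 - chance)

print(f"raw agreement: {agreement:.2f}")
print(f"chance agreement: {chance:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Here raw agreement is 0.80 but kappa is about 0.58, because two raters who both say "grounded" most of the time agree often by chance alone. Rerunning this check whenever the judge's model or prompt version changes shows whether the measuring instrument has shifted.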
This lab makes the configuration receipt explicit. A comparison with two different decoder versions is rejected before anybody reads its lift.
```python
control = {
    "index": "refund-policies-2026-05-01",
    "generator": "support-generator-v7",
    "decoder": {"selection": "greedy", "schema": "refund-answer-v2"},
}
treatment = {
    "index": "refund-policies-2026-05-01",
    "generator": "support-generator-v7",
    "decoder": {"selection": "greedy", "schema": "refund-answer-v2"},
    "new_component": "query-rewrite-v1",
}

locked_fields = ("index", "generator", "decoder")
mismatches = [field for field in locked_fields if control[field] != treatment[field]]

print("intentional treatment change:", treatment["new_component"])
print("unplanned mismatches:", mismatches)
print("comparison is interpretable:", not mismatches)
```

```text
intentional treatment change: query-rewrite-v1
unplanned mismatches: []
comparison is interpretable: True
```

Now combine the evidence. This brief requires a positive interval, an observed lift of at least +2.0 points, guardrails inside their budgets, and trustworthy traffic. This last lab is a small launch gate that prints each condition instead of burying the call in a slide deck.
```python
from math import sqrt

report = {
    "control": {"n": 2000, "resolved": 760, "p95_latency_ms": 1180, "escalated": 310, "grounding_failures": 22},
    "treatment": {"n": 2000, "resolved": 840, "p95_latency_ms": 1260, "escalated": 306, "grounding_failures": 23},
    "minimum_ship_lift": 0.02,
    "srm_alert": False,
}

c = report["control"]
t = report["treatment"]
p_c = c["resolved"] / c["n"]
p_t = t["resolved"] / t["n"]
lift = p_t - p_c
se = sqrt(p_c * (1 - p_c) / c["n"] + p_t * (1 - p_t) / t["n"])
low = lift - 1.96 * se

latency_delta = t["p95_latency_ms"] - c["p95_latency_ms"]
escalation_delta = t["escalated"] / t["n"] - c["escalated"] / c["n"]
grounding_delta = t["grounding_failures"] / t["n"] - c["grounding_failures"] / c["n"]

checks = {
    "traffic_integrity": not report["srm_alert"],
    "positive_lift_interval": low > 0,
    "observed_lift_threshold": lift >= report["minimum_ship_lift"],
    "latency_budget": latency_delta <= 150,
    "escalation_budget": escalation_delta <= 0.005,
    "grounding_budget": grounding_delta <= 0.005,
}

for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
print(
    f"lift={lift * 100:.1f} pp, low_end={low * 100:.1f} pp, "
    f"ship_threshold={report['minimum_ship_lift'] * 100:.1f} pp, latency_delta={latency_delta} ms"
)
print("decision:", "SHIP" if all(checks.values()) else "ROLL BACK OR INVESTIGATE")
```

```text
traffic_integrity: PASS
positive_lift_interval: PASS
observed_lift_threshold: PASS
latency_budget: PASS
escalation_budget: PASS
grounding_budget: PASS
lift=4.0 pp, low_end=1.0 pp, ship_threshold=2.0 pp, latency_delta=80 ms
decision: SHIP
```

This gate separates evidence from business value. `positive_lift_interval` asks whether the interval's lower bound is above zero. `observed_lift_threshold` asks whether the point estimate clears the predeclared +2.0 point bar. It doesn't prove the true lift exceeds +2.0 points: the interval begins at +1.0. A more conservative brief could require the lower bound to clear +2.0 points and would keep this result under investigation.
For these planned numbers and this predeclared rule, the launch gate says SHIP. Change treatment latency to 1490 and rerun: resolution still improves, but the speed budget fails. A treatment doesn't earn launch by winning only its favorite metric.
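The brief noted that this teaching gate checks guardrails on observed deltas only. As a sketch of what handling uncertainty could look like, the snippet below applies the same rough normal approximation to the two rate guardrails and compares the upper end of each interval with the 0.5-point budget. The helper name is an assumption, and for a rare outcome like grounding failures the normal approximation is especially rough.

```python
from math import sqrt

control = {"n": 2000, "escalated": 310, "grounding_failures": 22}
treatment = {"n": 2000, "escalated": 306, "grounding_failures": 23}
budget = 0.005  # 0.5 percentage points, from the brief


def delta_with_upper_bound(metric: str) -> tuple[float, float]:
    p_c = control[metric] / control["n"]
    p_t = treatment[metric] / treatment["n"]
    delta = p_t - p_c
    se = sqrt(p_c * (1 - p_c) / control["n"] + p_t * (1 - p_t) / treatment["n"])
    # Upper end of the approximate 95% interval for the regression.
    return delta, delta + 1.96 * se


results = {}
for metric in ("escalated", "grounding_failures"):
    delta, upper = delta_with_upper_bound(metric)
    results[metric] = (delta, upper)
    print(
        f"{metric}: delta={delta * 100:+.2f} pp, upper bound={upper * 100:+.2f} pp, "
        f"bound inside budget: {upper <= budget}"
    )
```

Both observed deltas sit inside the budget, yet both upper bounds land outside it at this sample size. That doesn't overturn the predeclared rule, which this result satisfies; it shows why a production brief should state in advance whether a guardrail passes on the point estimate or on the interval.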
Try these changes to the launch gate:

1. Change treatment `p95_latency_ms` from 1260 to 1490. Which check fails, and why doesn't the resolution lift override it?
2. Change `minimum_ship_lift` from 0.02 to 0.05. Which check fails even though the interval still excludes zero?
3. Set `srm_alert` to True. Why should analysis pause before anybody argues about the observed lift?

Answers:

1. `latency_budget` fails because treatment is now 310 ms slower than control, beyond the locked 150 ms budget.
2. `observed_lift_threshold` fails because the observed +4.0 point lift is below the new +5.0 point business threshold. Statistical evidence and shipping value answer different questions.
3. `traffic_integrity` fails. SRM can signal broken assignment, filtering, or logging, so the comparison hasn't earned a launch decision.

| Symptom | Likely cause | Fix before making a claim |
|---|---|---|
| Customer receives both arms during one conversation thread. | Randomized by request instead of stable customer identity. | Assign on first eligible event and persist the arm. |
| Treatment count is far from planned split. | Assignment, filtering, or event delivery differs by arm. | Pause analysis and diagnose SRM. |
| Dashboard turns green only after daily checking. | Fixed-horizon statistics were used with optional stopping. | Analyze once at planned horizon or use a planned sequential method. |
| Resolution rises while policy mistakes rise too. | Primary product outcome ignored groundedness risk. | Make audited grounding a launch guardrail. |
| An automated judge improves after its prompt changed mid-run. | Measurement ruler changed during experiment. | Pin evaluator version and retain human calibration cases. |
| CUPED result changes direction after using live latency as a covariate. | Adjustment used a post-treatment variable. | Use covariates measured before assignment only. |
Without looking back, explain why this experiment changes only the query rewrite while holding index, generator, prompt, and decoder fixed. Then answer these checks.
[1] Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing.
[2] Johari, R., et al. (2017). Peeking at A/B Tests: Why it matters, and what to do about it. KDD 2017.
[3] Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data (CUPED).
[4] Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.