Master the design of an A/B testing framework for LLM-powered features, including traffic routing, metric selection, sample sizing, and automated guardrails.
GPU Serving & Autoscaling gave you a fleet that can survive production traffic. A/B testing decides whether a large language model (LLM), prompt, routing, or retrieval change should receive that traffic at all.
A/B testing (also called split testing) is a randomized experiment that divides users into two groups: a control group that receives the existing version of a product, and a treatment group that gets the new version[1]. By comparing outcomes between groups, you can determine whether a change actually improves the product or if observed differences are just noise.
Imagine you run the customer support bot for an online retailer. Right now the bot uses a simple system prompt: "Be helpful." Your team proposes a new prompt: "Be helpful, cite the relevant return or shipping policy, and use the customer's name from the order context." The new prompt sounds better, but how do you prove it?
You can't just ask three engineers which replies they prefer. You need a rigorous experiment. A click or completed return is directly observable; LLM answer quality often needs a rubric or reviewer, and generated wording can vary across runs. A 1% improvement in a quality score isn't valuable if it comes with an unacceptable latency increase or a safety regression.
This article walks you through designing a real online experiment for an LLM-powered feature. You'll build a concrete test from hypothesis to verdict, using a customer support bot as the running example. Along the way you'll learn metric selection, statistical sizing, traffic routing, and the biases that can fool you into shipping a regression.
Before diving into experimental design, you need to understand what LLM generation adds to ordinary online experimentation.
Traditional A/B testing is like counting whether more users clicked a button. The event is objective. LLM A/B testing is like scoring support-ticket replies. You need a judge model or trained reviewer and a scorecard (a rubric) to turn subjective impressions into reproducible numbers.
Without that scorecard, you're just arguing about taste. With it, you can count how many times the judge prefers one response over another and whether the margin is large enough to matter.
A/B tests already handle variable user outcomes: two users assigned the same checkout button don't behave identically. LLM features add another variation source because the same prompt can yield different text as sampling, model serving, and context change. For offline comparisons, hold sampling and retrieval settings fixed and use deterministic decoding when the runtime supports it. If your stack exposes a seed, log it, but don't assume a seed alone makes live traffic reproducible. Otherwise, sampler or serving variation can obscure the treatment effect.
Unlike a button click (binary), LLM quality is subjective. Two engineers might disagree on whether a response is "helpful." Your experiments must move from "it looks good" to rigorous, quantifiable metrics. That's why modern LLM evaluation uses a layered approach rather than gut feel.
Small changes in a system prompt can lead to outsized regressions on edge cases that aren't immediately visible. A prompt tweak that improves the median query might catastrophically fail your hardest 5% of cases. This makes guardrail metrics and golden datasets essential, not optional.
Don't limit your experiments to "Model A vs Model B." LLM A/B tests can evaluate many variables:
| Variable | Example Test |
|---|---|
| Model Versions | Baseline hosted model vs. distilled variant vs. fine-tuned in-house model |
| Prompt Instructions | Concise baseline prompt vs. policy-grounded prompt with approved examples |
| Hyperparameters | Temperature (0.3 vs. 0.7), Top-P (0.9 vs. 0.95), frequency penalties |
| RAG Pipeline Changes | Different embedding models, k=3 vs. k=5 retrieved documents |
| Inference Infrastructure | Speculative decoding on/off, different quantization levels |
In our support bot example, the most common test is a prompt duel: the same model with two different system prompts. Understanding what you're testing determines your metrics, sample size, and success criteria.
Before you pick a metric, you need to know what family of evaluation fits your task. Think of these as three levels of rigor, from cheap and deterministic to flexible and expensive.
For structured outputs or code generation, deterministic rules are the fastest and most reliable evaluators.
These are cheap, fast, and objective. Use them whenever you can. They are also reference-based: you compare the output against a known correct answer or schema.
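As a minimal sketch of such a rule-based check, assume the support bot is asked to return JSON with `reply` and `policy_id` fields. Both field names and the sample outputs are hypothetical, chosen only for illustration:

```python
import json

# Hypothetical output schema: field name -> expected Python type.
REQUIRED_FIELDS = {"reply": str, "policy_id": str}


def passes_format_check(raw_output: str) -> bool:
    """Return True if the output is a JSON object with the expected typed fields."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(
        isinstance(parsed.get(field), expected_type)
        for field, expected_type in REQUIRED_FIELDS.items()
    )


# Illustrative model outputs: one valid, one missing a field, one not JSON at all.
outputs = [
    '{"reply": "Opened shoes can still be returned.", "policy_id": "RET-01"}',
    '{"reply": "Sure!"}',
    "Sorry, I cannot help with that.",
]
pass_rate = sum(passes_format_check(o) for o in outputs) / len(outputs)
print(f"format pass rate: {pass_rate:.0%}")
```

A check like this says nothing about whether the reply is correct, only that downstream code can parse it, which is why it works best as a first gate rather than the only one.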
When the right answer can be phrased many ways, word-overlap metrics like BLEU and ROUGE often miss the point. They reward verbatim copying, which is terrible for creative or conversational tasks.
A better approach is embedding-based similarity.
These tell you whether the model's answer means the same thing as a reference answer, even if the wording differs. Like Level 1, they are reference-based: you still need a golden answer to compare against.
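The comparison step can be sketched as follows, assuming the embeddings have already been produced by whichever embedding model you use. The three-dimensional vectors are made-up toy values, not real model output:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of a reference answer and two candidates.
reference = [0.9, 0.1, 0.3]
paraphrase = [0.8, 0.2, 0.35]
off_topic = [0.1, 0.9, 0.0]

print(f"paraphrase: {cosine_similarity(reference, paraphrase):.3f}")
print(f"off-topic:  {cosine_similarity(reference, off_topic):.3f}")
```

Any pass/fail cutoff on the similarity score is model-specific, so it is safer to calibrate it on a handful of labeled pairs than to pick a round number.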
For open-ended conversation, even embeddings fall short because there's no single correct phrasing. A common evaluation pattern is to use a separately calibrated judge model to grade candidate responses against a rubric and any required source context.
This can avoid a single golden wording, but factual or policy tasks still need source context in the rubric. It's scalable and flexible, but judges have their own biases. We'll return to those later.
Reference-based methods (Levels 1 and 2) work best when you know what the output should look like. Reference-free methods (Level 3) are necessary for creative, conversational, or open-ended tasks where the "right" answer isn't fixed.
Let's walk through a complete offline experiment for our support bot. The goal is to compare Prompt A ("Be helpful.") against Prompt B ("Be helpful, cite the relevant policy, and use the customer's name.") on a small golden dataset.
A golden dataset is a curated benchmark of your hardest real-world cases. Unlike generic benchmarks (like MMLU or HumanEval), a golden dataset reflects your specific use case: your customers' most complex queries, your product's most critical workflows.
For the support bot, start with five representative questions:
1. "My package hasn't arrived. It's been 8 days."
2. "I want to return a shoe I bought last month. The box is opened."
3. "Why was I charged twice for order #44129?"
4. "Can I change the delivery address for my pending order?"
5. "The item I received is damaged. What are my options?"

These are edge cases that matter. Passing them screens important failures, but it doesn't prove behavior on routine traffic or unseen edge cases. Before an online A/B test, require the variant to pass an offline gate appropriate to its risk. This prevents obvious failures from consuming user traffic.
Send each question through both system prompts. Record the outputs. For this offline comparison, request deterministic decoding where the selected runtime supports it and record all generation, retrieval, and model-version settings. Temperature zero reduces sampling variation in many runtimes; it doesn't by itself guarantee identical output across provider or backend changes.
In practice, you'd use an API client or a local model. The key is to keep every parameter identical except the prompt itself, so you're measuring the prompt's effect, not sampler noise.
A rubric turns subjective quality into countable scores. Here's a simple three-axis rubric for our support bot:
| Axis | 1 (Poor) | 3 (Okay) | 5 (Excellent) |
|---|---|---|---|
| Clarity | Confusing or vague | Readable but wordy | Direct and easy to follow |
| Accuracy | Wrong policy or fact | Mostly correct | Fully correct and complete |
| Policy Citation | No citation | Vague mention | Exact policy cited with link |
You feed the judge LLM the question, the response, and this rubric, then ask it to score each axis from 1 to 5. The rubric is what makes the evaluation reproducible: another engineer can run the same judge with the same rubric and get comparable numbers.
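One way to wire this up is sketched below. The prompt wording, the JSON reply format, and the helper names `build_judge_prompt` and `parse_judge_scores` are assumptions for illustration. The call to the judge model itself is left out because it depends on your provider:

```python
import json

RUBRIC_AXES = ("clarity", "accuracy", "policy_citation")


def build_judge_prompt(question: str, response: str, rubric: str) -> str:
    """Assemble the text sent to the judge model (wording is illustrative)."""
    return (
        "Score the support reply on each rubric axis from 1 to 5.\n"
        f"Rubric:\n{rubric}\n\n"
        f"Customer question:\n{question}\n\n"
        f"Reply to grade:\n{response}\n\n"
        'Answer with JSON only, e.g. {"clarity": 3, "accuracy": 4, "policy_citation": 1}.'
    )


def parse_judge_scores(raw_reply: str) -> dict[str, int]:
    """Validate the judge's JSON reply; reject missing axes or out-of-range scores."""
    scores = json.loads(raw_reply)
    if not isinstance(scores, dict):
        raise ValueError("judge reply must be a JSON object")
    for axis in RUBRIC_AXES:
        value = scores.get(axis)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {axis}: {value!r}")
    return {axis: scores[axis] for axis in RUBRIC_AXES}


prompt = build_judge_prompt(
    question="Why was I charged twice for order #44129?",
    response="I'm sorry about that. Let me check the order.",
    rubric="Clarity, Accuracy, Policy Citation: 1 (poor) to 5 (excellent)",
)
# Stand-in for the judge model's reply; a real run would send `prompt` to the judge.
print(parse_judge_scores('{"clarity": 3, "accuracy": 2, "policy_citation": 1}'))
```

Rejecting malformed replies instead of silently defaulting them keeps judge failures visible in the score logs.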
Suppose you run the five questions and get these average scores:
| Question | Prompt A (Clarity, Accuracy, Policy) | Prompt B (Clarity, Accuracy, Policy) |
|---|---|---|
| 1 | (3, 4, 1) | (4, 4, 5) |
| 2 | (4, 3, 1) | (4, 4, 4) |
| 3 | (3, 2, 1) | (4, 4, 5) |
| 4 | (4, 4, 1) | (5, 5, 5) |
| 5 | (3, 3, 1) | (4, 4, 4) |
| Average | (3.4, 3.2, 1.0) | (4.2, 4.2, 4.6) |
Prompt B wins on every axis. But notice that Prompt A never cited the policy, while Prompt B did so consistently. The rubric made that gap visible. Without the rubric, you might have glanced at the outputs and thought both were "fine."
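In pairwise mode, you can also compute a win rate. If the judge compares A and B head-to-head on each question and declares B the winner 4 times, A the winner 0 times, and a tie 1 time, then counting each tie as half a win gives B a win rate of:

$$
\text{win rate}_B = \frac{\text{wins}_B + 0.5 \times \text{ties}}{\text{total}} = \frac{4 + 0.5 \times 1}{5} = 0.90
$$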
A 90% win rate over five examples is an encouraging screening signal on this curated set, not strong launch evidence. You would still need broader offline checks and a live A/B test to determine user impact, while this small duel can filter out obvious losers before you risk real traffic.
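The tie-as-half convention behind that 90% figure can be sketched in a few lines. The function name is illustrative, and other tie conventions (for example, dropping ties from the denominator) would give a different number:

```python
def pairwise_win_rate(wins: int, losses: int, ties: int) -> float:
    """Win rate for one variant, counting each tie as half a win."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("no comparisons recorded")
    return (wins + 0.5 * ties) / total


# Prompt B's record on the five golden questions: 4 wins, 0 losses, 1 tie.
print(f"{pairwise_win_rate(wins=4, losses=0, ties=1):.0%}")  # prints 90%
```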
A golden dataset plus a clear rubric turns "looks good" into recorded screening evidence. Use an offline gate before live exposure whenever the treatment can change answer quality or safety.
Offline rubrics are useful, but production experiments need metrics that reflect real business value and system health. Successful tests use a taxonomy of metrics organized into three tiers based on how they're measured.
| Tier | Measurement Method | Examples | Use Case |
|---|---|---|---|
| Traditional (Code-Based) | Automated heuristics | Latency (Time to First Token, TTFT), Cost per 1k tokens, Pass@k, JSON validity, token count | Efficiency, reliability, format compliance |
| Model-Based (LLM-as-Judge) | Stronger LLM evaluation | G-Eval[3], RAGAS (Retrieval-Augmented Generation Assessment) faithfulness[4], toxicity scores, helpfulness ratings | Scalable quality assessment |
| Human-in-the-Loop (HITL) | Human judgment | Thumbs up/down, Elo ratings (relative skill ranking), Side-by-Side (SBS) ranking, time-to-task-completion | Ground truth validation |
Keep latency terms straight. TTFT (time to first token) is the delay until the first token arrives. TPOT (time per output token) is the average time per generated token after streaming starts. ITL (inter-token latency) is the gap between consecutive output tokens. Mixing those up makes latency regressions hard to diagnose because each metric points to a different bottleneck.
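To make the distinction concrete, here is a small sketch that derives all three from per-token arrival timestamps. The trace values are hypothetical, and some tools define TPOT slightly differently (for example, including the first token), so treat the exact formula as an assumption:

```python
def latency_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, TPOT, and the worst ITL from token arrival timestamps (seconds)."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure inter-token gaps")
    ttft = token_times[0] - request_sent
    # ITL: one gap per pair of consecutive tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    # TPOT: average time per token once streaming has started (the mean of the gaps).
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return {"ttft": ttft, "tpot": tpot, "max_itl": max(gaps)}


# Hypothetical trace: request at t=0.0, first token at 0.42 s, one stall before the last token.
metrics = latency_metrics(0.0, [0.42, 0.45, 0.48, 0.51, 0.90])
print({name: round(value, 3) for name, value in metrics.items()})
```

In this trace the average per-token time looks moderate while the largest single gap reveals the stall, which is the kind of difference that gets lost when the three terms are used interchangeably.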
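Within each experiment, further categorize by priority:

- Primary metrics: what you're optimizing.
- Guardrail metrics: what must not degrade.
- Secondary metrics: what would be nice to improve.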
For LLM products, cost belongs in the same dashboard as quality and latency. A variant can "win" by generating longer answers, invoking more tools, or consuming more retrieved context. Track cost per response, cost per successful session, and output-token growth as guardrails, not as an after-the-fact finance metric.
The following Python dictionary categorizes typical metrics for evaluating our support bot, then applies a simple promotion policy to observed deltas. A positive primary result isn't enough when a predefined guardrail fails.
```python
METRICS = {
    # Primary metrics (what you're optimizing)
    "primary": {
        "resolution_rate": "% of chats resolved without human agent",
        "csat_score": "Average customer satisfaction rating",
        "policy_compliance": "% of returns/exchanges handled per policy",
    },
    # Guardrail metrics (must not degrade)
    "guardrail": {
        "safety_rate": "% of responses passing safety filters (>= 99.5%)",
        "latency_p95": "95th percentile response time (<= 3s)",
        "error_rate": "% of failed generations (<= 0.1%)",
        "hallucination_rate": "% flagged by factuality checker",
        "cost_per_session_usd": "Total model spend per successful session",
    },
    # Secondary metrics (nice to improve)
    "secondary": {
        "response_length": "Average tokens per response",
        "regeneration_rate": "% of responses user regenerates",
        "copy_rate": "% of responses user copies",
    },
}

# Treatment minus control for each tracked metric.
observed_change = {
    "resolution_rate": 0.032,
    "safety_rate": -0.001,
    "latency_p95": 0.410,
    "cost_per_session_usd": 0.018,
}
# Worst acceptable change for each guardrail.
guardrail_limits = {
    "safety_rate": -0.0005,
    "latency_p95": 0.300,
    "cost_per_session_usd": 0.010,
}

failed_guardrails: list[str] = []
for metric, limit in guardrail_limits.items():
    change = observed_change[metric]
    # safety_rate is higher-is-better, so a drop below the limit fails;
    # latency and cost are lower-is-better, so a rise above the limit fails.
    failed = change < limit if metric == "safety_rate" else change > limit
    if failed:
        failed_guardrails.append(metric)

print(f"primary resolution lift: {observed_change['resolution_rate']:+.1%}")
print(f"failed guardrails: {failed_guardrails}")
print("decision:", "hold treatment" if failed_guardrails else "eligible for analysis")
```

```text
primary resolution lift: +3.2%
failed guardrails: ['safety_rate', 'latency_p95', 'cost_per_session_usd']
decision: hold treatment
```

Detecting a small improvement in LLM quality is like hearing a whisper in a noisy room. You need to listen longer (collect more data) to be sure it wasn't just random noise. If your metric has high variance (like token usage) or low signal (like thumbs up/down), you need a larger sample size to distinguish signal from noise.
You need to calculate the required sample size before starting to ensure statistical power (the probability of detecting a real effect).
Suppose your support bot currently resolves 60% of chats on its own (the baseline rate). A five-point improvement to 65% is the smallest lift worth launching. How many conversations do you need in each group to design a test with useful power for that target?
With a standard 5% false-positive rate and 80% power, the normal approximation gives roughly 1,471 conversations per arm, or about 2,942 total. That's a lot for a small support team, which is why offline screening with a golden dataset is so valuable. It lets you kill bad variants before they consume live traffic.
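Notice what happens if you aim smaller. Detecting a 2-point lift (from 60% to 62%) balloons the requirement to over 18,600 total conversations. The required sample size grows roughly with the inverse square of the effect: shrinking the detectable lift by 2.5x multiplies the sample by more than 6x.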
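For binary metrics (like "resolved" vs "not resolved"), a common two-sided normal approximation for a 50/50 test is:

$$
n_{\text{per arm}} = \left( \frac{z_{1-\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{1-\beta}\sqrt{p_A(1-p_A) + p_B(1-p_B)}}{p_B - p_A} \right)^2
$$

Here, $p_A$ is the baseline rate, $p_B$ is the treatment rate implied by your minimum detectable effect (MDE), and $\bar{p} = (p_A + p_B)/2$. The $z$ values come from the standard normal distribution: $z_{1-\alpha/2}$ controls false positives and $z_{1-\beta}$ controls false negatives.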
The following function computes the necessary sample size for a binary metric. It takes the baseline conversion rate, the minimum detectable effect, alpha, and power as inputs, and returns the total required sample size using the standard normal approximation.
```python
import math
from statistics import NormalDist


def required_sample_size(
    baseline_rate: float,          # Current metric value (e.g., 0.60)
    min_detectable_effect: float,  # Smallest meaningful change (e.g., 0.05)
    alpha: float = 0.05,           # Significance level (false positive rate)
    power: float = 0.80,           # Statistical power (1 - false negative rate)
) -> int:
    """
    Calculates the total sample size for a binary metric A/B test with two variants.
    Uses the standard normal approximation (Z-test).
    """
    p_a = baseline_rate
    p_b = baseline_rate + min_detectable_effect

    if not 0 < p_a < 1:
        raise ValueError("baseline_rate must be between 0 and 1")
    if not 0 < p_b < 1:
        raise ValueError("baseline_rate + min_detectable_effect must be between 0 and 1")

    pooled = (p_a + p_b) / 2
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    z_beta = normal.inv_cdf(power)

    numerator = (
        z_alpha * math.sqrt(2 * pooled * (1 - pooled))
        + z_beta * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    )
    n_per_arm = (numerator / (p_b - p_a)) ** 2
    # Round each arm up, then double for the two-arm total.
    return math.ceil(n_per_arm) * 2


# Example: detect a 5-point lift from 60% resolution
print(required_sample_size(0.60, 0.05))

# Smaller effect: detect a 2-point lift from 60% to 62%
print(required_sample_size(0.60, 0.02))
```

```text
2942
18672
```

Use this calculator before launch, not after results arrive. For the support-bot example, a 5-point lift needs thousands of conversations, while a 2-point lift needs far more.
For judge scores, latency, or token counts, estimate variance from historical logs or a pilot run and use the matching power analysis for continuous metrics. Don't reuse a binary-proportion shortcut for everything.
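As a rough sketch of what that looks like for a continuous metric, the function below uses the textbook two-sample normal approximation with equal allocation and equal variance in both arms. The pilot scores are made up, and ten samples is far too few for a real variance estimate; heavy-tailed metrics like latency or token counts may also need a transformation or a nonparametric method first:

```python
import math
from statistics import NormalDist, stdev


def required_n_per_arm_continuous(
    std_dev: float,
    min_detectable_diff: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Per-arm n for a two-sided comparison of means (normal approximation)."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    if min_detectable_diff == 0:
        raise ValueError("min_detectable_diff must be non-zero")
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    z_beta = normal.inv_cdf(power)
    # n = 2 * sigma^2 * (z_alpha + z_beta)^2 / delta^2
    return math.ceil(2 * (std_dev * (z_alpha + z_beta) / min_detectable_diff) ** 2)


# Hypothetical pilot: judge scores on a 1-5 scale.
pilot_scores = [4, 3, 5, 4, 2, 4, 3, 5, 4, 3]
sigma = stdev(pilot_scores)
print(f"pilot std dev: {sigma:.2f}")
print("per-arm n for a 0.2-point lift:", required_n_per_arm_continuous(sigma, 0.2))
```

Because the variance enters squared, halving the noise (for example, by averaging several judge runs per response) cuts the required sample by roughly four.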
The randomization unit determines how traffic and users are divided between the control and treatment groups. Choosing the right unit is critical for both the statistical validity of the experiment and the consistency of the user experience.
| Unit | Pros | Cons |
|---|---|---|
| Per-request | More assignment units when requests are independent | Invalid for stateful conversations and inconsistent UX |
| Per-session | Consistent within session | User may see both variants |
| Per-user | Most consistent | Requires more users, slower |
For chat products, randomizing per user or per conversation is usually the right default. Pin that assignment for the full thread so follow-up turns, tool calls, and regenerations all hit the same variant. This keeps context and tone consistent, and it reduces within-session contamination where one bad turn changes how the user behaves on later turns.
SUTVA (Stable Unit Treatment Value Assumption) can still fail in collaborative features or shared documents, because users can influence each other. In those cases, you often need cluster-based randomization rather than simple per-user hashing.
Stable assignment sounds simple until you ship a multi-turn product. In practice, you need a deterministic routing key (for example user_id or conversation_id), persistent storage of the assigned variant, and telemetry that logs that assignment on every turn. If a conversation starts on Variant B and the second turn accidentally lands on Variant A, you haven't just damaged UX. You've invalidated the experiment because prompt state, retrieved context, and prior model behavior now leak across variants.
This deterministic router assigns at the conversation level. Each turn of one return inquiry gets the same prompt variant, and the logged assignment can travel with every trace:
```python
import hashlib


def assigned_variant(conversation_id: str, treatment_fraction: float = 0.10) -> str:
    # Hash the routing key so the same conversation always maps to the same bucket.
    digest = hashlib.sha256(conversation_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "treatment" if bucket < treatment_fraction else "control"


conversation = "return-ticket-48291"
for turn in range(1, 4):
    print(f"turn={turn} conversation={conversation} variant={assigned_variant(conversation)}")

other_conversation = "delivery-ticket-48292"
print(f"new conversation variant={assigned_variant(other_conversation)}")
```

```text
turn=1 conversation=return-ticket-48291 variant=control
turn=2 conversation=return-ticket-48291 variant=control
turn=3 conversation=return-ticket-48291 variant=control
new conversation variant=control
```

Different rollout patterns answer different questions:
| Pattern | What users see | What you learn | Main blind spot |
|---|---|---|---|
| Shadow | Control output only | Latency, errors, output deltas, offline judge scores | No user preference or engagement signal |
| Canary | Small percentage see variant | Real-user guardrails and operational safety | Low traffic can exaggerate cold-start effects |
| Full A/B | Both groups see different variants | Product impact with statistical comparison | More user exposure and larger sample-size needs |
For high-risk changes, a common escalation sequence is offline evaluation, then shadow, then canary, then a user-facing A/B test. A low-risk copy adjustment may not need every step; record the reason for skipping one.
LLM serving is stateful in ways normal web experiments aren't. Prefix caches, KV cache pages, and warm GPU workers strongly influence TTFT and throughput[5]. A shadow deployment doubles inference work, and a low-traffic canary can look artificially slow simply because it misses warm-cache reuse or lands on cold replicas. Measure warm and cold latency separately, and make sure the treatment has enough steady traffic to stay warm before you call a latency regression "real."
For ranking products, standard A/B tests compare outcomes across separate traffic groups. Interleaving is a within-query design[6]: it mixes candidates from both rankers into one list and attributes clicks back to each source. In the large-scale search experiments reported by Chapelle et al., interleaving was substantially more sensitive than A/B comparison for detecting ranking differences. That advantage is task-specific: it applies when both systems can safely contribute items to one ranking, not to two different free-form chatbot responses.
In practice, the system mixes results from both models and measures user preference by which results get clicked. A concrete example for an e-commerce search:
```text
User query: "best wireless headphones under $100"

Interleaved results (A vs B):
1. [Model A] Sony XM5 <- User clicks (Vote for A)
2. [Model B] Bose QC45
3. [Model A] Sennheiser M4
4. [Model B] Apple AirPods <- User clicks (Vote for B)
5. [Model A] Audio-Technica M50x <- User clicks (Vote for A)

Result: Model A wins 2-1 for this query.
```

Because interleaving compares rankers on the same query for the same user, it controls much of the query and user-intent variation within that trial. That can make it more sensitive than a standard A/B test for ranking problems.
The "Team Draft" method ensures a fair and balanced mix of results. Imagine two team captains (Model A and Model B) taking turns picking their best remaining result to place in the final list.
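This implementation illustrates a simplified version of the Team Draft interleaving strategy: a single coin flip decides which model picks first, and the two then alternate, whereas the original formulation re-randomizes the pick order whenever both teams have contributed equally. It accepts two ranked lists of results as inputs and outputs a single interleaved list along with the items attributed to each model, ensuring a balanced representation.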
```python
import random


def next_unique(
    results: list[object],
    start_idx: int,
    seen: set[object],
) -> tuple[object | None, int]:
    """Returns next unseen result and updated index."""
    idx = start_idx
    while idx < len(results) and results[idx] in seen:
        idx += 1

    if idx >= len(results):
        return None, idx

    return results[idx], idx + 1


def team_draft_interleave(
    results_a: list[object],
    results_b: list[object],
    k: int = 10,
    rng: random.Random | None = None,
) -> tuple[list[object], list[object], list[object]]:
    """
    Interleaves two ranked lists using the Team Draft method.
    Returns: (interleaved_list, items_from_a, items_from_b)
    """
    interleaved: list[object] = []
    team_a: list[object] = []
    team_b: list[object] = []
    seen: set[object] = set()
    idx_a, idx_b = 0, 0
    rng = rng or random.Random()
    turn = rng.choice(["a", "b"])

    while len(interleaved) < k:
        picked = False

        # Try the current team first, then fall back to the other team if needed.
        for candidate in (turn, "b" if turn == "a" else "a"):
            if candidate == "a":
                item, idx_a = next_unique(results_a, idx_a, seen)
                if item is None:
                    continue
                interleaved.append(item)
                team_a.append(item)
                seen.add(item)
                turn = "b"
                picked = True
                break

            item, idx_b = next_unique(results_b, idx_b, seen)
            if item is None:
                continue
            interleaved.append(item)
            team_b.append(item)
            seen.add(item)
            turn = "a"
            picked = True
            break

        if not picked:
            break  # Both lists exhausted or only duplicates remain.

    return interleaved, team_a, team_b


results_a = ["sony", "sennheiser", "audio-technica", "jbl", "anker"]
results_b = ["bose", "sony", "apple", "anker", "jabra"]

mixed, from_a, from_b = team_draft_interleave(
    results_a,
    results_b,
    k=5,
    rng=random.Random(7),
)

print("mixed:", mixed)
print("from A:", from_a)
print("from B:", from_b)
```

```text
mixed: ['bose', 'sony', 'apple', 'sennheiser', 'anker']
from A: ['sony', 'sennheiser']
from B: ['bose', 'apple', 'anker']
```

If you searched through 20 independent secondary metrics at $\alpha = 0.05$, you would have about a 64% chance of finding at least one "significant" result purely by chance (Family-Wise Error Rate, or FWER). Real experiment metrics are often correlated, so the exact number changes, but the core problem doesn't. Predeclare the primary decision metric; when you interpret a family of exploratory metrics, segments, or rubric axes, apply an appropriate correction such as Benjamini-Hochberg[7] to control the False Discovery Rate (FDR).
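The 64% figure follows from the complement rule for independent tests, assuming a per-test significance level of 0.05 and that every null hypothesis is true. A quick check:

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across independent tests when every null is true."""
    return 1 - (1 - alpha) ** num_tests


for m in (1, 5, 20):
    print(f"{m:>2} tests: {family_wise_error_rate(m):.1%}")
```

With correlated metrics the true rate is usually lower than this independent-test bound, but it still climbs quickly as the number of comparisons grows.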
The code below applies multiple testing corrections to experimental results. It takes a dictionary of metric names and their corresponding test statistics and p-values, and returns the adjusted significance outcomes.
```python
from collections.abc import Mapping


def bonferroni_adjust(p_values: list[float]) -> list[float]:
    n = len(p_values)
    return [min(p * n, 1.0) for p in p_values]


def benjamini_hochberg_adjust(p_values: list[float]) -> list[float]:
    n = len(p_values)
    indexed = sorted(enumerate(p_values), key=lambda item: item[1])
    adjusted = [0.0] * n
    running_min = 1.0

    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank, (idx, p_value) in reversed(list(enumerate(indexed, start=1))):
        running_min = min(running_min, p_value * n / rank)
        adjusted[idx] = min(running_min, 1.0)

    return adjusted


def analyze_experiment(
    metrics_results: dict[str, tuple[float, float]],
    alpha: float = 0.05,
) -> Mapping[str, Mapping[str, float | bool]]:
    """
    Applies multiple testing correction to a dictionary of {metric_name: (statistic, p_value)}.
    """
    p_values = [result[1] for result in metrics_results.values()]
    p_adj_bonf = bonferroni_adjust(p_values)
    p_adj_bh = benjamini_hochberg_adjust(p_values)

    return {
        name: {
            "p_adj_bonferroni": p_adj_bonf[i],
            "significant_bonferroni": p_adj_bonf[i] <= alpha,
            "p_adj_bh": p_adj_bh[i],
            "significant_bh": p_adj_bh[i] <= alpha,
        }
        for i, name in enumerate(metrics_results.keys())
    }


# Example: four metrics from a support bot experiment
results = {
    "policy_compliance": (3.3, 0.001),
    "resolution_rate": (2.3, 0.020),
    "latency_p95": (1.4, 0.12),
    "cost_per_session": (0.8, 0.42),
}

for metric, values in analyze_experiment(results).items():
    print(
        f"{metric}: "
        f"bonf={values['p_adj_bonferroni']:.3f} "
        f"bh={values['p_adj_bh']:.3f} "
        f"significant_bh={values['significant_bh']}"
    )
```

```text
policy_compliance: bonf=0.004 bh=0.004 significant_bh=True
resolution_rate: bonf=0.080 bh=0.040 significant_bh=True
latency_p95: bonf=0.480 bh=0.160 significant_bh=False
cost_per_session: bonf=1.000 bh=0.420 significant_bh=False
```

Here `resolution_rate` survives Benjamini-Hochberg but not Bonferroni. Bonferroni multiplies every p-value by the number of tests (4), while Benjamini-Hochberg multiplies by the number of tests divided by the p-value's rank, which for the second-smallest p-value is 4/2 = 2. `latency_p95` and `cost_per_session` fail both corrections.
Fixed-horizon A/B tests require choosing the sample size in advance and avoiding early stopping based on ordinary p-values. Checking results daily and stopping whenever the p-value crosses your significance threshold (for example, $p < 0.05$) inflates the false-positive rate. A registered sequential design[8] permits planned interim looks by allocating error budget across those looks, allowing an early decision only when its boundary is crossed.
One approach uses a spending function such as O'Brien-Fleming to allocate total error budget across planned checkpoints. Early checks require stronger evidence. Exact critical values depend on the design, sidedness, information fraction, and analysis method; generate them with a statistical package and register them before traffic starts.
Repeatedly checking p-values without a sequential testing correction inflates your false positive rate. The exact inflation depends on the metric, test, and correlation between looks, which is why sequential methods matter.
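A small A/A simulation makes the inflation visible. Both arms below draw from the same 60% resolution rate, so every "significant" result is a false positive. The number of looks, the batch size, and the fixed 1.96 cutoff are arbitrary choices for this sketch, not a recommended design:

```python
import math
import random


def z_statistic(successes_a: int, successes_b: int, n: int) -> float:
    """Two-proportion z statistic for two arms of equal size n."""
    pooled = (successes_a + successes_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 0.0
    standard_error = math.sqrt(2 * pooled * (1 - pooled) / n)
    return (successes_b - successes_a) / n / standard_error


def simulate_false_positive_rates(
    num_sims: int = 500,
    looks: int = 10,
    per_look: int = 100,
    rate: float = 0.6,
    seed: int = 0,
) -> tuple[float, float]:
    """Return (FPR when stopping at any look, FPR when testing only at the end)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(num_sims):
        a = b = 0
        crossed_any = False
        significant = False
        for look in range(1, looks + 1):
            # Each look adds one batch of conversations to both arms.
            a += sum(rng.random() < rate for _ in range(per_look))
            b += sum(rng.random() < rate for _ in range(per_look))
            significant = abs(z_statistic(a, b, look * per_look)) > 1.96
            crossed_any = crossed_any or significant
        peeking_hits += crossed_any
        fixed_hits += significant  # Only the final look counts here.
    return peeking_hits / num_sims, fixed_hits / num_sims


peeking_fpr, fixed_fpr = simulate_false_positive_rates()
print(f"stop at any look: {peeking_fpr:.1%}")
print(f"final look only:  {fixed_fpr:.1%}")
```

The final-look rate should hover near the nominal 5%, while the stop-at-any-look rate should come out well above it, because each extra look is another chance for noise to cross the line.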
The monitoring code below deliberately consumes precomputed boundaries instead of pretending to calculate a valid sequential test. It records four registered looks and stops only when the observed statistic exceeds that look's critical value:
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredLook:
    fraction: float    # Share of the planned sample collected at this look
    critical_z: float  # Boundary the observed |z| must reach to stop


# Example boundary schedule supplied by the registered design.
# Do not derive a real experiment's boundaries by copying these values.
looks = [
    RegisteredLook(0.25, 3.47),
    RegisteredLook(0.50, 2.45),
    RegisteredLook(0.75, 2.14),
    RegisteredLook(1.00, 2.01),
]
observed_z = {0.25: 1.20, 0.50: 2.10, 0.75: 2.31, 1.00: 0.0}

for look in looks:
    z_value = observed_z[look.fraction]
    crossed = abs(z_value) >= look.critical_z
    print(f"look={look.fraction:.0%} z={z_value:.2f} boundary={look.critical_z:.2f} crossed={crossed}")
    if crossed:
        print(f"stop at planned {look.fraction:.0%} look")
        break
```

```text
look=25% z=1.20 boundary=3.47 crossed=False
look=50% z=2.10 boundary=2.45 crossed=False
look=75% z=2.31 boundary=2.14 crossed=True
stop at planned 75% look
```

Guardrails protect users and your production environment during exposure. Different signals require different actions. A synchronous personally identifiable information (PII) or unsafe-content detector can block one response immediately. A sampled judge safety rate, p95 latency, or average cost is an aggregate estimate with noise; it usually needs a defined window and persistence rule before you pause or roll back a treatment.
Setting appropriate thresholds requires balancing safety with experimental velocity. If thresholds are too tight, normal statistical noise will trigger false alarms and halt valid experiments. If they're too loose, users might be exposed to harmful model degradation before the system reacts.
Set guardrails from the baseline distribution you observed during stable periods, not from arbitrary fixed values. This reduces false rollbacks caused by normal day-to-day traffic swings.
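One simple way to do that, sketched under assumptions: take a high percentile of the metric across stable-period windows and add some headroom. The percentile, the 10% headroom factor, and the latency values below are illustrative choices, not recommendations:

```python
from statistics import quantiles


def guardrail_from_baseline(
    baseline_windows: list[float],
    percentile: int = 99,
    headroom: float = 1.10,
) -> float:
    """Upper threshold: a percentile of stable-period windows times a headroom factor."""
    if len(baseline_windows) < 2:
        raise ValueError("need at least two baseline windows")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # n=100 yields 99 cut points; index p-1 is the p-th percentile.
    cut_points = quantiles(baseline_windows, n=100, method="inclusive")
    return cut_points[percentile - 1] * headroom


# Hypothetical p95 latency (ms) measured over ten stable hourly windows.
baseline_latency_ms = [2100, 2250, 2180, 2400, 2320, 2290, 2150, 2500, 2230, 2360]
threshold = guardrail_from_baseline(baseline_latency_ms)
print(f"latency guardrail: {threshold:.0f} ms")
```

A threshold derived this way sits just above what normal traffic already produces, so a breach is more likely to reflect the treatment than an ordinary busy hour.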
This Python class monitors aggregate windows. A single breached window raises an alert; only a sustained breach reaches the predeclared action. In a real system, keep synchronous per-response blocking outside this aggregate loop.
```python
from typing import TypedDict

class Threshold(TypedDict):
    direction: str  # 'min' or 'max'
    value: float
    consecutive_windows: int
    action: str

class ExperimentGuardrails:
    def __init__(self, thresholds: dict[str, Threshold]):
        self.thresholds = thresholds
        self.streaks = {metric: 0 for metric in thresholds}

    def check(self, experiment_data: dict[str, float]) -> list[str]:
        actions: list[str] = []

        for metric, threshold in self.thresholds.items():
            current = experiment_data.get(metric)
            if current is None:
                continue

            breached = (
                current > threshold["value"]
                if threshold["direction"] == "max"
                else current < threshold["value"]
            )
            self.streaks[metric] = self.streaks[metric] + 1 if breached else 0
            if breached:
                print(f"alert {metric}: streak={self.streaks[metric]}")
            if self.streaks[metric] == threshold["consecutive_windows"]:
                actions.append(f"{threshold['action']}: {metric}")

        return actions

guardrails = ExperimentGuardrails({
    "latency_p95_ms": {"direction": "max", "value": 3000, "consecutive_windows": 2, "action": "pause exposure"},
    "error_rate": {"direction": "max", "value": 0.001, "consecutive_windows": 2, "action": "rollback treatment"},
    "cost_per_session_usd": {"direction": "max", "value": 0.08, "consecutive_windows": 2, "action": "review spend"},
})

windows = [
    {"latency_p95_ms": 3300, "error_rate": 0.002, "cost_per_session_usd": 0.09},
    {"latency_p95_ms": 3400, "error_rate": 0.003, "cost_per_session_usd": 0.091},
]
for index, window in enumerate(windows, start=1):
    print(f"window {index} actions:", guardrails.check(window))
```

```text
alert latency_p95_ms: streak=1
alert error_rate: streak=1
alert cost_per_session_usd: streak=1
window 1 actions: []
alert latency_p95_ms: streak=2
alert error_rate: streak=2
alert cost_per_session_usd: streak=2
window 2 actions: ['pause exposure: latency_p95_ms', 'rollback treatment: error_rate', 'review spend: cost_per_session_usd']
```

The first window alerts without making an aggregate decision. The second sustained window reaches each registered action; an objective sustained error breach triggers rollback while latency and cost cause pause or review.
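The fixed numbers passed to `ExperimentGuardrails` stand in for values you would derive from baseline data. Here is a minimal sketch of one such derivation, assuming a "mean plus k standard deviations" rule. The helper name, the `k=3.0` default, and the sample baseline windows are illustrative assumptions, not values from a real system:

```python
import statistics


def max_threshold_from_baseline(baseline_windows: list[float], k: float = 3.0) -> float:
    """Upper guardrail = baseline mean + k sample standard deviations.

    Hypothetical helper: both k and the rule itself are policy choices.
    """
    if len(baseline_windows) < 2:
        raise ValueError("need at least two baseline windows to estimate spread")
    mean = statistics.fmean(baseline_windows)
    spread = statistics.stdev(baseline_windows)
    return mean + k * spread


# Made-up p95 latency values (ms) observed in stable, control-only windows.
baseline_latency_p95_ms = [2400.0, 2500.0, 2600.0]
latency_guardrail = max_threshold_from_baseline(baseline_latency_p95_ms)
print(f"latency_p95_ms max guardrail: {latency_guardrail:.0f}")
```

For heavy-tailed metrics such as cost or latency, a high percentile of the baseline windows may be a better fit than a standard-deviation rule. Either way, fix the rule before launch.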
Standard A/B tests explore first (to find a winner) and exploit later (when you ship the winner to 100% of users). But what if the exploration phase is too costly? If a bad model variant causes users to churn, you want to stop sending traffic to it as quickly as possible.
Multi-armed bandits (MAB) dynamically adjust traffic allocation during the experiment based on observed performance. Using algorithms such as Thompson Sampling, the system routes more traffic to variants that appear to be winning and less to variants that appear weak. The goal is to reduce "regret," the cumulative reward lost by exposing users to suboptimal variants. Bandits suit continuous optimization tasks (such as prompt tuning) where you have many variants and want to phase out underperformers automatically without waiting for a fixed-horizon A/B test to finish.
But MABs introduce architectural complexity. They require a tightly coupled feedback loop where user actions (such as a thumbs up) immediately update the routing logic. In distributed systems with high latency or delayed rewards (for example, measuring 7-day retention), bandits can be challenging to implement correctly compared to static A/B test splits.
The following Python example illustrates a basic Beta-Bernoulli Thompson Sampling bandit. It assumes a binary reward signal such as click/no-click or accept/reject. If your reward is continuous or heavily delayed, the posterior update changes.
```python
import random

class ThompsonSamplingBandit:
    def __init__(self, n_arms: int, rng: random.Random | None = None):
        self.successes = [1] * n_arms  # Alpha parameter (prior)
        self.failures = [1] * n_arms  # Beta parameter (prior)
        self.pulls = [0] * n_arms
        self.rng = rng or random.Random()

    def select_arm(self) -> int:
        """Samples from the Beta distribution for each arm and selects the highest."""
        sampled_theta = [
            self.rng.betavariate(successes, failures)
            for successes, failures in zip(self.successes, self.failures)
        ]
        return max(range(len(sampled_theta)), key=sampled_theta.__getitem__)

    def update(self, arm: int, reward: int):
        """Updates the posterior with a Bernoulli outcome (0 or 1)."""
        if reward not in (0, 1):
            raise ValueError("reward must be 0 or 1 for Beta-Bernoulli Thompson Sampling")

        self.successes[arm] += reward
        self.failures[arm] += (1 - reward)
        self.pulls[arm] += 1

rng = random.Random(7)
true_resolution_rates = [0.55, 0.62]
bandit = ThompsonSamplingBandit(n_arms=2, rng=rng)

for _ in range(500):
    arm = bandit.select_arm()  # 0 = control, 1 = treatment
    reward = int(rng.random() < true_resolution_rates[arm])
    bandit.update(arm, reward)

print("traffic:", bandit.pulls)
print("successes:", bandit.successes)
print("failures:", bandit.failures)
```

```text
traffic: [151, 349]
successes: [89, 227]
failures: [64, 124]
```

Evaluating generative AI introduces unique psychological and statistical biases that typically don't appear in standard software experiments.
Users often prefer the first option shown, simply because it requires less effort to read. In pairwise comparisons (such as Side-by-Side evaluation or Reinforcement Learning from Human Feedback (RLHF) labeling), this bias can skew results significantly.
Always randomize the order of presentation. If showing two model outputs A and B, ensure A appears first for 50% of users and B appears first for the other 50%. Log the display order to control for this bias during analysis.
Running a head-to-head judge comparison with A always first can make order effects look like model quality. Swap positions and analyze order effects so presentation bias is measured and reduced rather than silently attributed to the model.
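The randomize-and-log advice can be sketched as follows. This is a minimal illustration that assumes a hash-based coin flip per rater and comparison; the function names, ID formats, and log fields are hypothetical:

```python
import hashlib


def a_shown_first(rater_id: str, comparison_id: str) -> bool:
    """Deterministically show model A first for roughly half of all pairs."""
    digest = hashlib.sha256(f"{rater_id}:{comparison_id}".encode()).digest()
    return digest[0] % 2 == 0


def build_display(rater_id: str, comparison_id: str, output_a: str, output_b: str) -> dict:
    """Return the outputs in display order plus the fields to log for analysis."""
    first_model = "A" if a_shown_first(rater_id, comparison_id) else "B"
    ordered = [output_a, output_b] if first_model == "A" else [output_b, output_a]
    return {
        "rater_id": rater_id,
        "comparison_id": comparison_id,
        "first_model": first_model,  # logged so order effects can be measured later
        "ordered_outputs": ordered,
    }


def preferred_model(chosen_slot: int, first_model: str) -> str:
    """Map the slot a rater picked (0 = first shown, 1 = second) back to a model."""
    if chosen_slot not in (0, 1):
        raise ValueError("chosen_slot must be 0 or 1")
    second_model = "B" if first_model == "A" else "A"
    return first_model if chosen_slot == 0 else second_model


display = build_display("user-42", "cmp-001", "reply from A", "reply from B")
print(display["first_model"], preferred_model(0, display["first_model"]))
```

Hashing instead of calling a random number generator keeps the order stable if the same rater reloads the same comparison, and the logged `first_model` field is what lets you separate position effects from model effects during analysis.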
Users initially engage more with new models or features just because they're different, not necessarily better. A new return-label feature might see a spike in usage on Day 1 that evaporates by Day 7.
Plan duration around known demand cycles and novelty risk. A full weekly cycle is a reasonable starting point when weekday and weekend traffic differ, but it isn't a universal stopping rule. If you want a burn-in window, define it before launch and exclude it consistently. Don't decide to drop early days after seeing a spike.
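A predeclared burn-in can be applied mechanically, as in this minimal sketch. The daily counts are made up for illustration, and `BURN_IN_DAYS` is a hypothetical value that would be fixed in the experiment plan before launch:

```python
# Fixed before launch and never adjusted after looking at results.
BURN_IN_DAYS = 2  # hypothetical value

# Made-up daily counts:
# (day, control_resolved, control_n, treatment_resolved, treatment_n)
daily = [
    (1, 550, 1000, 700, 1000),
    (2, 560, 1000, 660, 1000),
    (3, 550, 1000, 600, 1000),
    (4, 545, 1000, 590, 1000),
    (5, 555, 1000, 595, 1000),
    (6, 550, 1000, 590, 1000),
    (7, 540, 1000, 585, 1000),
]


def lift(rows) -> float:
    """Absolute difference in pooled resolution rate (treatment minus control)."""
    control_rate = sum(r[1] for r in rows) / sum(r[2] for r in rows)
    treatment_rate = sum(r[3] for r in rows) / sum(r[4] for r in rows)
    return treatment_rate - control_rate


lift_all = lift(daily)
lift_steady = lift([row for row in daily if row[0] > BURN_IN_DAYS])
print(f"lift, all days:      {lift_all:+.1%}")
print(f"lift, after burn-in: {lift_steady:+.1%}")
```

In this invented data the early spike inflates the all-days lift. Reporting both numbers, with the burn-in rule written down in advance, makes a novelty effect visible without inviting post hoc trimming.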
A model can look better overall even while losing inside every major segment. This usually happens when the analyzed sample has different segment mixes across variants.
| Segment | Model A (Resolution %, analyzed n) | Model B (Resolution %, analyzed n) | Winner |
|---|---|---|---|
| Mobile Users | 85% (200) | 80% (800) | Model A |
| Desktop Users | 70% (800) | 65% (200) | Model A |
| Combined | 73% | 77% | Model B |
Both segment-level comparisons favor Model A, but the aggregate flips because Model B's analyzed sample contains far more mobile users (who have higher resolution rates overall). That's Simpson's paradox.
Always segment results by pre-treatment user groups (for example, tenure, subscription tier, device type) and inspect the sample mix in each arm. Be especially careful with triggered analyses, missing-label subsets, or any filter that treatment can influence. Before declaring a winner, verify that the treatment doesn't severely degrade the experience for critical cohorts, even if the aggregate top-line metric improves.
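One way to inspect the sample mix is to print each arm's segment shares and then compare the arms on a common mix (direct standardization). The sketch below reuses the counts from the table. The choice of the pooled sample as the reference mix is an assumption; any mix fixed in advance would work:

```python
# (resolved, analyzed n) per arm and segment, taken from the table above.
analyzed = {
    "A": {"mobile": (170, 200), "desktop": (560, 800)},
    "B": {"mobile": (640, 800), "desktop": (130, 200)},
}
segments = ["mobile", "desktop"]


def segment_shares(arm: str) -> dict[str, float]:
    """Fraction of the arm's analyzed sample that falls in each segment."""
    total = sum(analyzed[arm][s][1] for s in segments)
    return {s: analyzed[arm][s][1] / total for s in segments}


def standardized_rate(arm: str, weights: dict[str, float]) -> float:
    """Resolution rate the arm would show if its segment mix matched `weights`."""
    return sum(weights[s] * analyzed[arm][s][0] / analyzed[arm][s][1] for s in segments)


# Reference mix: segment shares in the pooled sample across both arms (an assumption).
pooled_total = sum(analyzed[arm][s][1] for arm in analyzed for s in segments)
reference = {s: sum(analyzed[arm][s][1] for arm in analyzed) / pooled_total for s in segments}

for arm in analyzed:
    shares = segment_shares(arm)
    print(f"{arm}: mobile share={shares['mobile']:.0%}, "
          f"standardized rate={standardized_rate(arm, reference):.1%}")
```

A large gap in segment shares between arms is a warning sign in its own right: under proper randomization the mixes should be similar, so a gap often points to a filter or trigger that treatment influenced.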
The calculation below reproduces the aggregate reversal. Both pre-treatment cohorts favor Model A even though Model B's mobile-heavy analyzed sample wins after pooling:
```python
results = {
    "A": {"mobile": (170, 200), "desktop": (560, 800)},
    "B": {"mobile": (640, 800), "desktop": (130, 200)},
}

def rate(successes: int, total: int) -> float:
    return successes / total

for segment in ["mobile", "desktop"]:
    a = rate(*results["A"][segment])
    b = rate(*results["B"][segment])
    print(f"{segment:7} A={a:.0%} B={b:.0%} winner={'A' if a > b else 'B'}")

for arm in ["A", "B"]:
    successes = sum(pair[0] for pair in results[arm].values())
    total = sum(pair[1] for pair in results[arm].values())
    print(f"aggregate {arm}={rate(successes, total):.0%}")
```

```text
mobile  A=85% B=80% winner=A
desktop A=70% B=65% winner=A
aggregate A=73%
aggregate B=77%
```

LLM-as-Judge[9] offers automated quality comparison to screen candidates before committing to a live A/B test. Full A/B tests with human evaluation are expensive and slow.
LLM judges are useful triage systems, not ground truth. They can cheaply filter obvious regressions in offline evaluation or shadow mode, but they also inherit position bias, verbosity bias, and judge-model bias[9]. Use them to narrow the field before you consume valuable live traffic.
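The asynchronous function below asks a strong LLM to compare two competing model outputs. It takes a judge client, the original query, and both responses, and returns a verdict ("A", "B", or "TIE") together with the judge's full reasoning. The `JudgeModel` interface and the `parse_winner` helper are illustrative assumptions; adapt them to your own client and output format.

```python
import re
from collections.abc import Mapping
from typing import Protocol


class JudgeModel(Protocol):
    """Assumed interface: any async client exposing generate(prompt) -> str."""

    async def generate(self, prompt: str) -> str: ...


def parse_winner(result: str) -> str:
    """Extract the last 'WINNER: A|B|TIE' verdict from the judge output."""
    matches = re.findall(r"WINNER:\s*(A|B|TIE)\b", result.upper())
    if not matches:
        raise ValueError("judge output did not contain a WINNER line")
    return matches[-1]


async def llm_judge_compare(
    judge_model: JudgeModel,
    query: str,
    response_a: str,
    response_b: str,
    criteria: list[str] | None = None,
) -> Mapping[str, str]:
    """Use a strong LLM to judge which response is better."""
    criteria = criteria or ["accuracy", "helpfulness", "clarity", "completeness"]
    judge_prompt = f"""You are an expert evaluator. Compare these two responses to the query.

Query: {query}

Response A: {response_a}
Response B: {response_b}

Rate each response on a scale of 1-5 for: {', '.join(criteria)}.
Then pick the overall better response: A, B, or TIE.
Explain your reasoning in one sentence.
End with a final line in the form "WINNER: A", "WINNER: B", or "WINNER: TIE"."""

    result = await judge_model.generate(judge_prompt)
    return {"winner": parse_winner(result), "reasoning": result}
```

Published judge evaluations report position and verbosity biases.[9] Check whether your judge's wins correlate with display order or response length before trusting its verdict.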
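A quick diagnostic for both biases can be computed from logged judgments. This is a minimal sketch: the log schema (`first_model`, `winner`, `len_a`, `len_b`) and the sample rows are hypothetical, and a real check would need far more than six judgments:

```python
# Made-up judgment log: which model was shown first, who won, and response lengths.
judgments = [
    {"first_model": "A", "winner": "A", "len_a": 420, "len_b": 180},
    {"first_model": "B", "winner": "B", "len_a": 200, "len_b": 510},
    {"first_model": "A", "winner": "A", "len_a": 300, "len_b": 310},
    {"first_model": "B", "winner": "A", "len_a": 640, "len_b": 220},
    {"first_model": "A", "winner": "TIE", "len_a": 250, "len_b": 250},
    {"first_model": "B", "winner": "B", "len_a": 150, "len_b": 400},
]


def first_position_win_rate(rows: list[dict]) -> float:
    """Share of decided comparisons won by whichever response was shown first."""
    decided = [r for r in rows if r["winner"] != "TIE"]
    wins = sum(r["winner"] == r["first_model"] for r in decided)
    return wins / len(decided)


def longer_response_win_rate(rows: list[dict]) -> float:
    """Share of decided comparisons won by the longer response (equal lengths skipped)."""
    decided = [r for r in rows if r["winner"] != "TIE" and r["len_a"] != r["len_b"]]
    wins = sum(r["winner"] == ("A" if r["len_a"] > r["len_b"] else "B") for r in decided)
    return wins / len(decided)


print(f"first-position win rate:  {first_position_win_rate(judgments):.0%}")
print(f"longer-response win rate: {longer_response_win_rate(judgments):.0%}")
```

With display order balanced across models, a first-position win rate far from 50% suggests position bias. The length rate is harder to read because better answers are sometimes legitimately longer, so treat a high value as a prompt for manual review rather than proof of bias.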
Everything that precedes the live A/B test (golden datasets, rubrics, offline judges) measures what your variant does on a frozen set you chose. Production measures what it does on traffic you did not choose. That difference is the offline-online gap, and closing it is the whole point of online evaluation. A variant can win 90% of offline comparisons and still regress live because the real query mix is different, latency and cost shift under load, and rare safety failures may only surface on the production distribution.
A common production pattern is to keep the judge running after launch instead of only before it. You sample a predeclared slice of live traffic sized for budget and detection needs, attach the request, retrieved context, tool calls, and output to a trace, then score that trace asynchronously with the same rubric you used offline. Because the eval is linked to a trace, a quality drop points you straight to the failing request rather than to an aggregate dashboard with no drill-down. This is the trace-linked online judge.
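The sampling and trace-linking step might look like the sketch below. It assumes hash-based sampling on a trace ID so the decision is reproducible; the `SAMPLE_RATE` value, the function names, and the record fields are hypothetical and not tied to any particular tracing product:

```python
import hashlib

SAMPLE_RATE = 0.05  # predeclared share of live traffic to judge (hypothetical value)


def sampled_for_judging(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: the same trace always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def build_eval_record(trace_id: str, request: str, retrieved_context: list[str],
                      tool_calls: list[dict], output: str, rubric_version: str) -> dict:
    """Bundle everything the asynchronous judge needs, keyed by the trace ID."""
    return {
        "trace_id": trace_id,  # lets a low score link back to the exact request
        "request": request,
        "retrieved_context": retrieved_context,
        "tool_calls": tool_calls,
        "output": output,
        "rubric_version": rubric_version,  # same rubric as offline evaluation
    }


queue = []  # stand-in for a real asynchronous job queue
for i in range(1000):
    trace_id = f"trace-{i}"
    if sampled_for_judging(trace_id):
        queue.append(build_eval_record(trace_id, "Where is my order?", [], [], "...", "v1"))
print(f"queued {len(queue)} of 1000 traces for judging")
```

Sampling on a hash of the trace ID, rather than a fresh random draw, means a replayed or re-processed trace lands in the same bucket, which keeps the judged slice stable across retries.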
Keep two ideas separate, because production systems implement them differently:
| Mechanism | Timing | Job | Latency budget |
|---|---|---|---|
| Guardrail (inline) | Synchronous, in the request path | Block a specific failure (toxicity, PII, schema break) before the user sees it | Milliseconds |
| Online evaluator (judge) | Asynchronous, after the response | Score quality on sampled traffic and watch for drift over time | Seconds, off the hot path |
An inline guardrail is a safety gate that must run before the response ships. An online judge is a measurement system that runs after, on a sample; when it is kept off the request path, it doesn't add user-facing latency. Confusing the two leads people to either slow every request with a judge call or to treat a slow async score as a real-time block.
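The split between the two mechanisms can be sketched with `asyncio`. This is a toy illustration: the keyword check stands in for a real safety classifier, the sleep stands in for a slow judge call, and all names are hypothetical:

```python
import asyncio

BLOCKED_TERMS = {"ssn"}  # stand-in for a real inline safety classifier


def inline_guardrail(response: str) -> bool:
    """Synchronous check on the request path; True means safe to send."""
    return not any(term in response.lower() for term in BLOCKED_TERMS)


async def judge_offline(trace_id: str, response: str, scores: dict[str, float]) -> None:
    """Stand-in for a slow judge call that runs after the user has been answered."""
    await asyncio.sleep(0.01)
    scores[trace_id] = float(len(response) > 0)


async def handle_request(trace_id: str, draft: str, scores: dict[str, float],
                         background: set) -> str:
    if not inline_guardrail(draft):
        return "Sorry, I can't share that."  # blocked before the user sees it
    # Schedule scoring without awaiting it, so it adds no user-facing latency.
    task = asyncio.create_task(judge_offline(trace_id, draft, scores))
    background.add(task)  # keep a reference so the task is not garbage collected
    task.add_done_callback(background.discard)
    return draft


async def main():
    scores: dict[str, float] = {}
    background: set = set()
    reply = await handle_request("t1", "Your refund ships Monday.", scores, background)
    scored_at_reply_time = "t1" in scores  # judge has not run yet
    blocked = await handle_request("t2", "The customer's SSN is on file.", scores, background)
    await asyncio.gather(*background)  # drain background work before exit
    return reply, scored_at_reply_time, blocked, scores


reply, scored_at_reply_time, blocked, scores = asyncio.run(main())
print(reply, scored_at_reply_time, blocked, scores)
```

The reply for `t1` is returned before its score exists, which is the point: the guardrail decides what ships, and the judge only measures what shipped.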
Online judges also drift when the judge version, rubric, prompt, or traffic mix changes. The fix is the same calibration discipline you use offline, run continuously. Track judge-vs-human agreement on a rotating audit set, choose an acceptance threshold for your decision risk, log the full judge configuration on every score, and re-calibrate whenever that configuration changes.
This audit check uses an explicit local policy rather than a universal agreement cutoff:
```python
human_labels = ["A", "B", "A", "TIE", "B", "A", "B", "B"]
judge_labels = ["A", "B", "B", "TIE", "B", "A", "A", "B"]
minimum_agreement = 0.80

matches = sum(human == judge for human, judge in zip(human_labels, judge_labels))
agreement = matches / len(human_labels)
decision = "enable trend monitoring" if agreement >= minimum_agreement else "recalibrate judge"

print(f"audit examples: {len(human_labels)}")
print(f"judge-human agreement: {agreement:.1%}")
print(f"policy minimum: {minimum_agreement:.1%}")
print(f"decision: {decision}")
```

```text
audit examples: 8
judge-human agreement: 75.0%
policy minimum: 80.0%
decision: recalibrate judge
```

Managing LLM experiments at scale usually requires more than spreadsheets and ad hoc notebooks. Common choices include:
| Category | Examples | Purpose |
|---|---|---|
| Experiment tracking | W&B Weave, MLflow | Track runs, compare variants, store metrics and artifacts |
| Testing and regression | DeepEval, Promptfoo, Giskard | Automate prompt tests, red-team checks, and eval regressions |
| Observability | Langfuse, Phoenix | Trace request flows, inspect latency and cost, debug failures |
When selecting tools, prioritize integration with your existing ML infrastructure, immutable experiment definitions, trace linkage, and guardrail alerting. A useful toolchain connects offline evaluation to online experiments with shared metric definitions.
Your turn. Build a small "Prompt Arena" CLI tool that automates the workflow we walked through.
This is the last article in Inference & Production Scale. You now have the full production loop: serve a model fleet, autoscale it, route and fall back across providers, watch cost and tokens, observe it in production, and decide what gets to ship through online evaluation. That last decision is the one this article owns, and you should be able to defend the full online-evaluation plan, not just quote a p-value.
You should also be able to answer these questions:
How do you handle network effects in LLM A/B tests? Network effects break simple A/B tests when one user's treatment changes another user's experience. Shared AI-edited documents are a typical example: User A's improved edits can make User B happier even if User B is in the control group. Use cluster randomization at the team, organization, or document-cluster level. For short-lived effects, switchback experiments can also work.
Which common guardrails should an LLM product consider? Typical choices include p95/p99 latency, error rate, empty-response rate, safety violations, personally identifiable information (PII) leakage, and cost per successful session. Block objective per-response safety failures inline; pause or roll back exposure according to predefined persistence rules for aggregate signals.
How do you test multi-turn conversational effects? Randomize by session or conversation, not by request. Track conversation-level outcomes such as resolution rate, human handoff, sentiment trajectory, and retention after early turns. A bad first answer can poison every later turn.
When would you use a bandit instead of an A/B test? Use bandits when the goal is to optimize cumulative reward during exploration, such as continuous prompt tuning with fast feedback. Use classical A/B tests when you need clean inference, fixed exposure, long-term outcomes, or high-confidence launch evidence.
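The randomization-unit advice in the answers above (by organization for network effects, by conversation for multi-turn effects) comes down to choosing which ID you hash. Here is a minimal sketch, assuming hash-based assignment; the experiment name, ID fields, and sample requests are hypothetical:

```python
import hashlib


def assign_arm(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Assign a randomization unit to an arm; every request in the unit shares it."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# Made-up requests: several requests can share a conversation and an organization.
requests = [
    {"request_id": "r1", "conversation_id": "c1", "org_id": "acme"},
    {"request_id": "r2", "conversation_id": "c1", "org_id": "acme"},
    {"request_id": "r3", "conversation_id": "c2", "org_id": "acme"},
    {"request_id": "r4", "conversation_id": "c3", "org_id": "globex"},
]

for req in requests:
    # Cluster randomization: hash the organization, not the request.
    by_org = assign_arm(req["org_id"], "prompt-v2")
    # Multi-turn randomization: hash the conversation, not the request.
    by_conversation = assign_arm(req["conversation_id"], "prompt-v2")
    print(req["request_id"], "org arm:", by_org, "| conversation arm:", by_conversation)
```

Including the experiment name in the hash keeps assignments independent across experiments. Remember that the analysis unit should match the randomization unit: with organization-level assignment, the effective sample size is the number of organizations, not the number of requests.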
If you can defend those decisions, you have closed out production scale. Next comes System Design Capstones, where you assemble serving, observability, guardrails, and online evaluation into one end-to-end design under interview conditions. The first capstone, a real-time content moderation system, leans directly on the guardrail and online-evaluation thinking from this article: it has to keep user-generated content safe at scale using LLMs and specialized classifiers.
1. Kohavi, R., Tang, D., Xu, Y. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. 2020.
2. Zhang, T., et al. BERTScore: Evaluating Text Generation with BERT. ICLR 2020.
3. Liu, Y., et al. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. 2023.
4. Es, S., et al. RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint, 2023.
5. Kwon, W., et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
6. Chapelle, O., et al. Large-scale Validation and Analysis of Interleaved Search Evaluation. ACM TOIS, 2012.
7. Benjamini, Y., & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 1995.
8. Johari, R., et al. Peeking at A/B Tests: Why it matters, and what to do about it. KDD 2017.
9. Zheng, L., et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.