LeetLLM
© 2026 LeetLLM. All rights reserved.


A/B Testing for LLMs

Master the design of an A/B testing framework for LLM-powered features, including traffic routing, metric selection, sample sizing, and automated guardrails.


GPU Serving & Autoscaling gave you a fleet that can survive production traffic. A/B testing decides whether a large language model (LLM), prompt, routing, or retrieval change should receive that traffic at all.

A/B testing (also called split testing) is a randomized experiment that divides users into two groups: a control group that receives the existing version of a product, and a treatment group that gets the new version[1]. By comparing outcomes between groups, you can determine whether a change actually improves the product or if observed differences are just noise.

Imagine you run the customer support bot for an online retailer. Right now the bot uses a simple system prompt: "Be helpful." Your team proposes a new prompt: "Be helpful, cite the relevant return or shipping policy, and use the customer's name from the order context." The new prompt sounds better, but how do you prove it?

You can't just ask three engineers which replies they prefer. You need a rigorous experiment. A click or completed return is directly observable; LLM answer quality often needs a rubric or reviewer, and generated wording can vary across runs. A 1% improvement in a quality score isn't valuable if it comes with an unacceptable latency increase or a safety regression.

This article walks you through designing a real online experiment for an LLM-powered feature. You'll build a concrete test from hypothesis to verdict, using a customer support bot as the running example. Along the way you'll learn metric selection, statistical sizing, traffic routing, and the biases that can fool you into shipping a regression.

[Figure: A/B testing for LLMs. Traffic is split between control and treatment; variant B improves the primary quality metric, but latency and cost guardrails fail, so the treatment should not ship yet.]
Notice how Variant B wins on quality but still fails to ship because two guardrails (latency and cost) are breached.

[Figure: LLM experiment workflow moving through setup, telemetry, and analysis gates.]
Experiment design starts with hypothesis and metric definitions, then moves through routing, telemetry, guardrails, analysis, and rollout.

Why LLM A/B testing is different

Before diving into experimental design, you need to understand what LLM generation adds to ordinary online experimentation.

The scored-ticket analogy

Traditional A/B testing is like counting whether more users clicked a button. The event is objective. LLM A/B testing is like scoring support-ticket replies. You need a judge model or trained reviewer and a scorecard (a rubric) to turn subjective impressions into reproducible numbers.

Without that scorecard, you're just arguing about taste. With it, you can count how many times the judge prefers one response over another and whether the margin is large enough to matter.

Non-determinism

A/B tests already handle variable user outcomes: two users assigned the same checkout button don't behave identically. LLM features add another variation source because the same prompt can yield different text as sampling, model serving, and context change. For offline comparisons, hold sampling and retrieval settings fixed and use deterministic decoding when the runtime supports it. If your stack exposes a seed, log it, but don't assume a seed alone makes live traffic reproducible. Otherwise, sampler or serving variation can obscure the treatment effect.

The subjectivity problem

Unlike a button click (binary), LLM quality is subjective. Two engineers might disagree on whether a response is "helpful." Your experiments must move from "it looks good" to rigorous, quantifiable metrics. That's why modern LLM evaluation uses a layered approach rather than gut feel.

The cascade effect

Small changes in a system prompt can lead to outsized regressions on edge cases that aren't immediately visible. A prompt tweak that improves the median query might catastrophically fail your hardest 5% of cases. This makes guardrail metrics and golden datasets essential, not optional.

What exactly are you testing?

Don't limit your experiments to "Model A vs Model B." LLM A/B tests can evaluate many variables:

| Variable | Example Test |
| --- | --- |
| Model Versions | Baseline hosted model vs. distilled variant vs. fine-tuned in-house model |
| Prompt Instructions | Concise baseline prompt vs. policy-grounded prompt with approved examples |
| Hyperparameters | Temperature (0.3 vs. 0.7), Top-P (0.9 vs. 0.95), frequency penalties |
| RAG Pipeline Changes | Different embedding models, k=3 vs. k=5 retrieved documents |
| Inference Infrastructure | Speculative decoding on/off, different quantization levels |

In our support bot example, the most common test is a prompt duel: the same model with two different system prompts. Understanding what you're testing determines your metrics, sample size, and success criteria.

Three levels of LLM evaluation

Before you pick a metric, you need to know what family of evaluation fits your task. Think of these as three levels of rigor, from cheap and deterministic to flexible and expensive.

Level 1: Rule-based checks

For structured outputs or code generation, deterministic rules are the fastest and most reliable evaluators.

  • Exact match and regex: Does the output contain a valid JSON object with the expected keys?
  • Code execution: Does the generated Python script run and pass a set of unit tests?
  • Format compliance: Does the response follow the requested template?

These are cheap, fast, and objective. Use them whenever you can. They are also reference-based: you compare the output against a known correct answer or schema.
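
As a minimal sketch of a Level 1 check, the function below validates that a reply is a JSON object with the expected keys and that one field matches a regex. The key names (`order_id`, `action`) and the order-id pattern are hypothetical, chosen only for illustration; your own schema will differ.

```python
import json
import re

# Hypothetical schema for a structured support-bot reply (illustrative only).
REQUIRED_KEYS = {"order_id", "action"}
ORDER_ID_PATTERN = re.compile(r"^#\d{5}$")


def passes_rule_checks(raw_output: str) -> bool:
    """Return True when the output is a JSON object with the expected keys
    and a well-formed order id."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict) or not REQUIRED_KEYS.issubset(payload):
        return False
    return bool(ORDER_ID_PATTERN.match(str(payload["order_id"])))


print(passes_rule_checks('{"order_id": "#44129", "action": "refund"}'))  # True
print(passes_rule_checks("Sure! I refunded order #44129."))  # False
```

Because the check is deterministic, it can run on every response in both arms at negligible cost.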

Level 2: Semantic similarity

When the right answer can be phrased many ways, word-overlap metrics like BLEU and ROUGE often miss the point. They reward verbatim copying, which is terrible for creative or conversational tasks.

A better approach is embedding-based similarity.

  • BERTScore[2]: Uses contextual embeddings to compare token-level similarity, capturing paraphrases.
  • Cosine similarity: Measures how close two sentence embeddings are in vector space.

These tell you whether the model's answer means the same thing as a reference answer, even if the wording differs. Like Level 1, they are reference-based: you still need a golden answer to compare against.
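
To make the cosine-similarity idea concrete, here is a small pure-Python sketch. The three-dimensional vectors are made-up stand-ins for real sentence embeddings, which would come from whatever embedding model you use and have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (invented for illustration, not model output).
reference = [0.2, 0.7, 0.1]
paraphrase = [0.25, 0.65, 0.05]
unrelated = [0.9, 0.0, 0.4]

print(round(cosine_similarity(reference, paraphrase), 3))
print(round(cosine_similarity(reference, unrelated), 3))
```

The paraphrase scores much closer to 1 than the unrelated vector, which is the behavior you rely on when comparing a model answer to a golden answer.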

Level 3: LLM-as-judge

For open-ended conversation, even embeddings fall short because there's no single correct phrasing. A common evaluation pattern is to use a separately calibrated judge model to grade candidate responses against a rubric and any required source context.

  • G-Eval[3]: A judge LLM scores outputs on a rubric (for example, Clarity 1-5, Accuracy 1-5).
  • Pairwise comparison: The judge sees two responses and picks the better one based on explicit criteria.

This can avoid a single golden wording, but factual or policy tasks still need source context in the rubric. It's scalable and flexible, but judges have their own biases. We'll return to those later.

Reference-based methods (Levels 1 and 2) work best when you know what the output should look like. Reference-free methods (Level 3) are necessary for creative, conversational, or open-ended tasks where the "right" answer isn't fixed.

Worked example: the system prompt duel

Let's walk through a complete offline experiment for our support bot. The goal is to compare Prompt A ("Be helpful.") against Prompt B ("Be helpful, cite the relevant policy, and use the customer's name.") on a small golden dataset.

Step 1: Build a golden dataset

A golden dataset is a curated benchmark of your hardest real-world cases. Unlike generic benchmarks (like MMLU or HumanEval), a golden dataset reflects your specific use case: your customers' most complex queries, your product's most critical workflows.

For the support bot, start with five representative questions:

```text
1. "My package hasn't arrived. It's been 8 days."
2. "I want to return a shoe I bought last month. The box is opened."
3. "Why was I charged twice for order #44129?"
4. "Can I change the delivery address for my pending order?"
5. "The item I received is damaged. What are my options?"
```

These are edge cases that matter. Passing them screens important failures, but it doesn't prove behavior on routine traffic or unseen edge cases. Before an online A/B test, require the variant to pass an offline gate appropriate to its risk. This prevents obvious failures from consuming user traffic.

Step 2: Run the two prompts

Send each question through both system prompts. Record the outputs. For this offline comparison, request deterministic decoding where the selected runtime supports it and record all generation, retrieval, and model-version settings. Temperature zero reduces sampling variation in many runtimes; it doesn't by itself guarantee identical output across provider or backend changes.

In practice, you'd use an API client or a local model. The key is to keep every parameter identical except the prompt itself, so you're measuring the prompt's effect, not sampler noise.

Step 3: Write the judge rubric

A rubric turns subjective quality into countable scores. Here's a simple three-axis rubric for our support bot:

| Axis | 1 (Poor) | 3 (Okay) | 5 (Excellent) |
| --- | --- | --- | --- |
| Clarity | Confusing or vague | Readable but wordy | Direct and easy to follow |
| Accuracy | Wrong policy or fact | Mostly correct | Fully correct and complete |
| Policy Citation | No citation | Vague mention | Exact policy cited with link |

You feed the judge LLM the question, the response, and this rubric, then ask it to score each axis from 1 to 5. The rubric is what makes the evaluation reproducible: another engineer can run the same judge with the same rubric and get comparable numbers.
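
One way to wire the rubric into a judge call is sketched below. It only builds the prompt and validates the reply; no model is invoked, and the prompt wording, the JSON reply format, and the function names are assumptions for illustration rather than a prescribed interface.

```python
import json

# Rubric anchors taken from the table above.
RUBRIC = {
    "clarity": "1 = confusing or vague, 3 = readable but wordy, 5 = direct and easy to follow",
    "accuracy": "1 = wrong policy or fact, 3 = mostly correct, 5 = fully correct and complete",
    "policy_citation": "1 = no citation, 3 = vague mention, 5 = exact policy cited with link",
}


def build_judge_prompt(question: str, response: str) -> str:
    """Assemble the text sent to the judge model (format is an assumption)."""
    rubric_lines = "\n".join(f"- {axis}: {scale}" for axis, scale in RUBRIC.items())
    return (
        "Score the support reply on each axis from 1 to 5.\n"
        f"Rubric:\n{rubric_lines}\n\n"
        f"Customer question: {question}\n"
        f"Reply: {response}\n\n"
        'Answer with JSON only, for example {"clarity": 3, "accuracy": 3, "policy_citation": 3}.'
    )


def parse_judge_scores(raw_reply: str) -> dict[str, int]:
    """Validate the judge's JSON reply; reject missing axes or out-of-range scores."""
    scores = json.loads(raw_reply)
    if not isinstance(scores, dict) or set(scores) != set(RUBRIC):
        raise ValueError(f"expected axes {sorted(RUBRIC)}")
    for axis, value in scores.items():
        if type(value) is not int or not 1 <= value <= 5:
            raise ValueError(f"{axis} score must be an integer from 1 to 5")
    return scores


prompt = build_judge_prompt("Why was I charged twice for order #44129?", "...")
# Stand-in for a real judge call; this string is hard-coded, not model output.
fake_judge_reply = '{"clarity": 4, "accuracy": 4, "policy_citation": 5}'
print(parse_judge_scores(fake_judge_reply))
```

Strict parsing matters here: a judge that silently drops an axis or returns a 7 would otherwise skew the averages.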

Step 4: Score and calculate a winner

Suppose you run the five questions and get these per-question scores, with the averages in the last row:

| Question | Prompt A (Clarity, Accuracy, Policy) | Prompt B (Clarity, Accuracy, Policy) |
| --- | --- | --- |
| 1 | (3, 4, 1) | (4, 4, 5) |
| 2 | (4, 3, 1) | (4, 4, 4) |
| 3 | (3, 2, 1) | (4, 4, 5) |
| 4 | (4, 4, 1) | (5, 5, 5) |
| 5 | (3, 3, 1) | (4, 4, 4) |
| Average | (3.4, 3.2, 1.0) | (4.2, 4.2, 4.6) |

Prompt B wins on every axis. But notice that Prompt A never cited the policy, while Prompt B did so consistently. The rubric made that gap visible. Without the rubric, you might have glanced at the outputs and thought both were "fine."
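
The average row can be recomputed directly from the per-question tuples. A minimal sketch, with the scores transcribed from the table above; the helper name `axis_averages` is illustrative:

```python
AXES = ("clarity", "accuracy", "policy")

# Per-question (clarity, accuracy, policy) scores for each prompt.
scores = {
    "A": [(3, 4, 1), (4, 3, 1), (3, 2, 1), (4, 4, 1), (3, 3, 1)],
    "B": [(4, 4, 5), (4, 4, 4), (4, 4, 5), (5, 5, 5), (4, 4, 4)],
}


def axis_averages(rows: list[tuple[int, int, int]]) -> dict[str, float]:
    """Average each rubric axis across questions."""
    # zip(*rows) transposes the rows so each column holds one axis.
    return {
        axis: round(sum(column) / len(column), 2)
        for axis, column in zip(AXES, zip(*rows))
    }


for prompt_name, rows in scores.items():
    print(prompt_name, axis_averages(rows))
```

Recomputing aggregates in code rather than by hand is a cheap guard against transcription slips when the golden set grows.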

In pairwise mode, you can also compute a win rate. If the judge compares A and B head-to-head on each question and declares B the winner 4 times, A the winner 0 times, and a tie 1 time, the win rate is:

$$\text{Win Rate}_B = \frac{\text{Wins}_B + 0.5 \times \text{Ties}}{\text{Total Trials}} = \frac{4 + 0.5(1)}{5} = 0.90$$

A 90% win rate over five examples is an encouraging screening signal on this curated set, not strong launch evidence. You would still need broader offline checks and a live A/B test to determine user impact, while this small duel can filter out obvious losers before you risk real traffic.
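
The same win-rate formula as a small function, counting each tie as half a win. The function name and argument names are illustrative:

```python
def pairwise_win_rate(wins: int, losses: int, ties: int) -> float:
    """Win rate for one variant, with ties counted as half a win."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("at least one comparison is required")
    return (wins + 0.5 * ties) / total


# Prompt B in the duel above: 4 wins, 0 losses, 1 tie.
print(pairwise_win_rate(wins=4, losses=0, ties=1))  # 0.9
```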

A golden dataset plus a clear rubric turns "looks good" into recorded screening evidence. Use an offline gate before live exposure whenever the treatment can change answer quality or safety.

Metrics that matter in production

Offline rubrics are useful, but production experiments need metrics that reflect real business value and system health. Successful tests use a taxonomy of metrics organized into three tiers based on how they're measured.

| Tier | Measurement Method | Examples | Use Case |
| --- | --- | --- | --- |
| Traditional (Code-Based) | Automated heuristics | Latency (Time to First Token, TTFT), cost per 1k tokens, Pass@k, JSON validity, token count | Efficiency, reliability, format compliance |
| Model-Based (LLM-as-Judge) | Stronger LLM evaluation | G-Eval[3], RAGAS (Retrieval-Augmented Generation Assessment) faithfulness[4], toxicity scores, helpfulness ratings | Scalable quality assessment |
| Human-in-the-Loop (HITL) | Human judgment | Thumbs up/down, Elo (relative skill) ratings, Side-by-Side (SBS) ranking, time-to-task-completion | Ground truth validation |

Keep latency terms straight. TTFT (time to first token) is the delay until the first token arrives. TPOT (time per output token) is the average time per generated token after streaming starts. ITL (inter-token latency) is the gap between consecutive output tokens. Mixing those up makes latency regressions hard to diagnose because each metric points to a different bottleneck.

Within each experiment, further categorize by priority:

  1. Primary metrics: Direct business value. For a support bot, this could be "conversation resolved without agent handoff" or customer satisfaction score.
  2. Guardrail metrics: Predefined constraints. For example, a sustained latency breach or statistically credible safety degradation can block promotion regardless of quality gains; exact policy thresholds depend on your baseline and SLO.
  3. Secondary metrics: Debugging signals. Why did users accept the answer? Was it shorter? More detailed? These help you understand why the primary metric moved.

For LLM products, cost belongs in the same dashboard as quality and latency. A variant can "win" by generating longer answers, invoking more tools, or consuming more retrieved context. Track cost per response, cost per successful session, and output-token growth as guardrails, not as an after-the-fact finance metric.
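
As a rough sketch of the cost guardrail, the snippet below computes spend per session and spend per successful session from token counts. The per-token prices and the `Session` fields are hypothetical placeholders; substitute your provider's real rates and your own telemetry schema.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens (not real provider rates).
INPUT_PRICE_PER_1K = 0.0005
OUTPUT_PRICE_PER_1K = 0.0015


@dataclass
class Session:
    input_tokens: int
    output_tokens: int
    resolved: bool


def session_cost(session: Session) -> float:
    """Model spend for one session under the assumed prices."""
    return (
        session.input_tokens / 1000 * INPUT_PRICE_PER_1K
        + session.output_tokens / 1000 * OUTPUT_PRICE_PER_1K
    )


def cost_per_successful_session(sessions: list[Session]) -> float:
    """Total spend across all sessions divided by the number that resolved."""
    resolved = sum(s.resolved for s in sessions)
    if resolved == 0:
        raise ValueError("no successful sessions to divide by")
    return sum(session_cost(s) for s in sessions) / resolved


control = [Session(1200, 150, True), Session(1100, 140, False), Session(1300, 160, True)]
treatment = [Session(1900, 420, True), Session(1800, 390, True), Session(2000, 450, False)]

print(f"control:   ${cost_per_successful_session(control):.5f}")
print(f"treatment: ${cost_per_successful_session(treatment):.5f}")
```

Dividing by successful sessions rather than all sessions charges each variant for its failures, so a variant that resolves more chats can justify somewhat higher per-response spend.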

The following Python dictionary categorizes typical metrics for evaluating our support bot, then applies a simple promotion policy to observed deltas. A positive primary result isn't enough when a predefined guardrail fails.

metrics-that-matter-in-production.py
```python
METRICS = {
    # Primary metrics (what you're optimizing)
    "primary": {
        "resolution_rate": "% of chats resolved without human agent",
        "csat_score": "Average customer satisfaction rating",
        "policy_compliance": "% of returns/exchanges handled per policy",
    },
    # Guardrail metrics (must not degrade)
    "guardrail": {
        "safety_rate": "% of responses passing safety filters (>= 99.5%)",
        "latency_p95": "95th percentile response time (<= 3s)",
        "error_rate": "% of failed generations (<= 0.1%)",
        "hallucination_rate": "% flagged by factuality checker",
        "cost_per_session_usd": "Total model spend per successful session",
    },
    # Secondary metrics (nice to improve)
    "secondary": {
        "response_length": "Average tokens per response",
        "regeneration_rate": "% of responses user regenerates",
        "copy_rate": "% of responses user copies",
    }
}

observed_change = {
    "resolution_rate": 0.032,
    "safety_rate": -0.001,
    "latency_p95": 0.410,
    "cost_per_session_usd": 0.018,
}
guardrail_limits = {
    "safety_rate": -0.0005,
    "latency_p95": 0.300,
    "cost_per_session_usd": 0.010,
}

failed_guardrails: list[str] = []
for metric, limit in guardrail_limits.items():
    # safety_rate fails when it drops below its limit; the others fail when they rise above it.
    failed = observed_change[metric] < limit if metric == "safety_rate" else observed_change[metric] > limit
    if failed:
        failed_guardrails.append(metric)

print(f"primary resolution lift: {observed_change['resolution_rate']:+.1%}")
print(f"failed guardrails: {failed_guardrails}")
print("decision:", "hold treatment" if failed_guardrails else "eligible for analysis")
```
Output
```text
primary resolution lift: +3.2%
failed guardrails: ['safety_rate', 'latency_p95', 'cost_per_session_usd']
decision: hold treatment
```

How many conversations do you need?

Detecting a small improvement in LLM quality is like hearing a whisper in a noisy room. You need to listen longer (collect more data) to be sure it wasn't just random noise. If your metric has high variance (like token usage) or low signal (like thumbs up/down), you need a larger sample size to distinguish signal from noise.

You need to calculate the required sample size before starting to ensure statistical power (the probability of detecting a real effect).

A numeric warm-up

Suppose your support bot currently resolves 60% of chats on its own (the baseline rate). A five-point improvement to 65% is the smallest lift worth launching. How many conversations do you need in each group to design a test with useful power for that target?

With a standard 5% false-positive rate and 80% power, the normal approximation gives roughly 1,471 conversations per arm, or about 2,942 total. That's a lot for a small support team, which is why offline screening with a golden dataset is so valuable. It lets you kill bad variants before they consume live traffic.

[Figure: Sample size chart showing how smaller minimum detectable effects require far more conversations.]
Minimum detectable effect drives traffic needs. Smaller lifts are harder to separate from noise, so required sample size rises quickly.

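Notice what happens if you aim smaller. Detecting a 2-point lift (from 60% to 62%) balloons the requirement to over 18,600 total conversations. Required sample size scales roughly with the inverse square of the effect, so shrinking the target lift by a factor of 2.5 multiplies the traffic by a bit more than 6.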

The formula

For binary metrics (like "resolved" vs "not resolved"), a common two-sided normal approximation for a 50/50 test is:

$$n_{\text{per arm}} \approx \frac{\left(z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{\beta}\sqrt{p_A(1-p_A) + p_B(1-p_B)}\right)^2}{(p_B - p_A)^2}$$

Here, $p_A$ is the baseline rate, $p_B = p_A + \delta$ is the treatment rate implied by your minimum detectable effect (MDE), and $\bar{p} = (p_A + p_B)/2$. The $z$ values come from the standard normal distribution: $z_{\alpha/2}$ controls false positives and $z_\beta$ controls false negatives.

Python calculator

The following function computes the necessary sample size for a binary metric. It takes the baseline conversion rate, the minimum detectable effect, alpha, and power as inputs, and returns the total required sample size using the standard normal approximation.

python-calculator.py
```python
import math
from statistics import NormalDist

def required_sample_size(
    baseline_rate: float,  # Current metric value (e.g., 0.60)
    min_detectable_effect: float,  # Smallest meaningful change (e.g., 0.05)
    alpha: float = 0.05,  # Significance level (false positive rate)
    power: float = 0.80  # Statistical power (1 - false negative rate)
) -> int:
    """
    Calculates the total sample size for a binary metric A/B test with two variants.
    Uses the standard normal approximation (Z-test).
    """
    p_a = baseline_rate
    p_b = baseline_rate + min_detectable_effect

    if not 0 < p_a < 1:
        raise ValueError("baseline_rate must be between 0 and 1")
    if not 0 < p_b < 1:
        raise ValueError("baseline_rate + min_detectable_effect must be between 0 and 1")

    pooled = (p_a + p_b) / 2
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    z_beta = normal.inv_cdf(power)

    numerator = (
        z_alpha * math.sqrt(2 * pooled * (1 - pooled)) +
        z_beta * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    )
    n_per_arm = (numerator / (p_b - p_a)) ** 2
    return math.ceil(n_per_arm) * 2

# Example: detect a 5-point lift from 60% resolution
print(required_sample_size(0.60, 0.05))

# Smaller effect: detect a 2-point lift from 60% to 62%
print(required_sample_size(0.60, 0.02))
```
Output
```text
2942
18672
```

Use this calculator before launch, not after results arrive. For the support-bot example, a 5-point lift needs thousands of conversations, while a 2-point lift needs far more.

For judge scores, latency, or token counts, estimate variance from historical logs or a pilot run and use the matching power analysis for continuous metrics. Don't reuse a binary-proportion shortcut for everything.
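
For a continuous metric, a common textbook approximation for a two-sided, equal-allocation comparison of means is $n_{\text{per arm}} \approx 2(z_{\alpha/2} + z_\beta)^2 \sigma^2 / \delta^2$. The sketch below implements that formula; it assumes roughly equal variance in both arms and a sample large enough for the normal approximation, and the pilot numbers are invented for illustration. Heavy-tailed metrics such as latency or token counts may need a transformation or a resampling-based analysis instead.

```python
import math
from statistics import NormalDist


def required_per_arm_continuous(
    std_dev: float,              # Standard deviation estimated from logs or a pilot
    min_detectable_diff: float,  # Smallest difference in means worth detecting
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Per-arm sample size for comparing two means (normal approximation)."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    if min_detectable_diff == 0:
        raise ValueError("min_detectable_diff must be non-zero")
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    z_beta = normal.inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * std_dev / min_detectable_diff) ** 2
    return math.ceil(n)


# Hypothetical pilot: judge scores with a standard deviation of 1.1 points,
# and a 0.2-point shift in the mean as the smallest change worth shipping.
print(required_per_arm_continuous(std_dev=1.1, min_detectable_diff=0.2))
```

Note that this function returns a per-arm count, so double it for a two-arm test.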

Routing users to the right variant

The randomization unit determines how traffic and users are divided between the control and treatment groups. Choosing the right unit is critical for both the statistical validity of the experiment and the consistency of the user experience.

| Unit | Pros | Cons |
| --- | --- | --- |
| Per-request | More assignment units when requests are independent | Invalid for stateful conversations and inconsistent UX |
| Per-session | Consistent within session | User may see both variants |
| Per-user | Most consistent | Requires more users, slower |

For chat products, randomizing per user or per conversation is usually the right default. Pin that assignment for the full thread so follow-up turns, tool calls, and regenerations all hit the same variant. This keeps context and tone consistent, and it reduces within-session contamination where one bad turn changes how the user behaves on later turns.

SUTVA (Stable Unit Treatment Value Assumption) can still fail in collaborative features or shared documents, because users can influence each other. In those cases, you often need cluster-based randomization rather than simple per-user hashing.
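
A minimal sketch of cluster-based assignment: hash a shared cluster key (here a hypothetical workspace id) instead of the user id, so everyone who collaborates in the same workspace lands in the same arm. The experiment name used as a salt, the user-to-workspace mapping, and the function name are all assumptions for illustration.

```python
import hashlib


def cluster_variant(cluster_id: str, experiment: str, treatment_fraction: float = 0.5) -> str:
    """Assign a whole cluster to one arm by hashing a salted cluster key."""
    key = f"{experiment}:{cluster_id}".encode()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    return "treatment" if bucket < treatment_fraction else "control"


# Hypothetical mapping from users to the workspace they collaborate in.
user_workspace = {"alice": "ws-1", "bob": "ws-1", "carol": "ws-2"}

for user, workspace in user_workspace.items():
    variant = cluster_variant(workspace, experiment="policy-prompt-v2")
    print(f"user={user} workspace={workspace} variant={variant}")
```

Salting with the experiment name keeps assignments independent across experiments. Remember that the analysis unit is then the cluster, so the effective sample size is the number of clusters, not the number of users.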

Session pinning

Stable assignment sounds simple until you ship a multi-turn product. In practice, you need a deterministic routing key (for example user_id or conversation_id), persistent storage of the assigned variant, and telemetry that logs that assignment on every turn. If a conversation starts on Variant B and the second turn accidentally lands on Variant A, you haven't just damaged UX. You've invalidated the experiment because prompt state, retrieved context, and prior model behavior now leak across variants.

This deterministic router assigns at the conversation level. Each turn of one return inquiry gets the same prompt variant, and the logged assignment can travel with every trace:

sticky-conversation-routing.py
```python
import hashlib

def assigned_variant(conversation_id: str, treatment_fraction: float = 0.10) -> str:
    digest = hashlib.sha256(conversation_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "treatment" if bucket < treatment_fraction else "control"

conversation = "return-ticket-48291"
for turn in range(1, 4):
    print(f"turn={turn} conversation={conversation} variant={assigned_variant(conversation)}")

other_conversation = "delivery-ticket-48292"
print(f"new conversation variant={assigned_variant(other_conversation)}")
```
Output
```text
turn=1 conversation=return-ticket-48291 variant=control
turn=2 conversation=return-ticket-48291 variant=control
turn=3 conversation=return-ticket-48291 variant=control
new conversation variant=control
```

Shadow, canary, and live A/B tests

Different rollout patterns answer different questions:

| Pattern | What users see | What you learn | Main blind spot |
| --- | --- | --- | --- |
| Shadow | Control output only | Latency, errors, output deltas, offline judge scores | No user preference or engagement signal |
| Canary | Small percentage see variant | Real-user guardrails and operational safety | Low traffic can exaggerate cold-start effects |
| Full A/B | Both groups see different variants | Product impact with statistical comparison | More user exposure and larger sample-size needs |

For high-risk changes, a common escalation sequence is offline evaluation, then shadow, then canary, then a user-facing A/B test. A low-risk copy adjustment may not need every step; record the reason for skipping one.

Cache locality and cold-start bias

LLM serving is stateful in ways normal web experiments aren't. Prefix caches, KV cache pages, and warm GPU workers strongly influence TTFT and throughput[5]. A shadow deployment doubles inference work, and a low-traffic canary can look artificially slow simply because it misses warm-cache reuse or lands on cold replicas. Measure warm and cold latency separately, and make sure the treatment has enough steady traffic to stay warm before you call a latency regression "real."
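
One way to separate warm and cold latency is to tag each trace with its cache state and compute percentiles per group. The sketch below assumes your telemetry already records a warm/cold flag; the trace tuples and TTFT values are invented for illustration.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical trace records: (variant, served_from_warm_cache, ttft_seconds).
traces = [
    ("control", True, 0.21), ("control", True, 0.24), ("control", True, 0.22),
    ("control", False, 0.95), ("control", False, 1.02),
    ("treatment", True, 0.23), ("treatment", True, 0.25),
    ("treatment", False, 0.98), ("treatment", False, 1.05), ("treatment", False, 1.01),
]


def p95(values: list[float]) -> float:
    """95th percentile: the last of the 19 cut points that split data into 20 groups."""
    return quantiles(values, n=20, method="inclusive")[-1]


groups: dict[tuple[str, str], list[float]] = defaultdict(list)
for variant, warm, ttft in traces:
    groups[(variant, "warm" if warm else "cold")].append(ttft)

for (variant, cache_state), values in sorted(groups.items()):
    print(f"{variant:9s} {cache_state:4s} n={len(values)} p95_ttft={p95(values):.2f}s")
```

If the treatment looks slower overall but matches control within each cache state, the gap is a traffic-mix artifact (more cold requests) rather than a property of the variant.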

Interleaving: a blind taste test for ranking

For ranking products, standard A/B tests compare outcomes across separate traffic groups. Interleaving is a within-query design[6]: it mixes candidates from both rankers into one list and attributes clicks back to each source. In the large-scale search experiments reported by Chapelle et al., interleaving was substantially more sensitive than A/B comparison for detecting ranking differences. That advantage is task-specific: it applies when both systems can safely contribute items to one ranking, not to two different free-form chatbot responses.

[Figure: Interleaving evaluation for ranking. Results from two systems are mixed into one list, clicks are attributed back to each system, and the winner is determined from those clicks.]
Trace how results from Model A and Model B are mixed into a single list, then notice which items the user clicks. Clicks are attributed back to the source model.

In practice, the system mixes results from both models and measures user preference by which results get clicked. A concrete example for an e-commerce search:

```text
User query: "best wireless headphones under $100"

Interleaved results (A vs B):
1. [Model A] Sony XM5             <- User clicks (Vote for A)
2. [Model B] Bose QC45
3. [Model A] Sennheiser M4
4. [Model B] Apple AirPods        <- User clicks (Vote for B)
5. [Model A] Audio-Technica M50x  <- User clicks (Vote for A)

Result: Model A wins 2-1 for this query.
```

Because interleaving compares rankers on the same query for the same user, it controls much of the query and user-intent variation within that trial. That can make it more sensitive than a standard A/B test for ranking problems.
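
The click-attribution step can be sketched as a small function that counts clicks per source and declares a per-query winner. The function name and return shape are illustrative; the item lists mirror the headphone example above.

```python
def interleaving_winner(
    team_a: list[str],
    team_b: list[str],
    clicked: list[str],
) -> tuple[str, int, int]:
    """Credit each click to the ranker that contributed the item."""
    set_a, set_b = set(team_a), set(team_b)
    clicks_a = sum(item in set_a for item in clicked)
    clicks_b = sum(item in set_b for item in clicked)
    if clicks_a > clicks_b:
        winner = "A"
    elif clicks_b > clicks_a:
        winner = "B"
    else:
        winner = "tie"
    return winner, clicks_a, clicks_b


team_a = ["Sony XM5", "Sennheiser M4", "Audio-Technica M50x"]
team_b = ["Bose QC45", "Apple AirPods"]
clicked = ["Sony XM5", "Apple AirPods", "Audio-Technica M50x"]

print(interleaving_winner(team_a, team_b, clicked))  # ('A', 2, 1)
```

A single query proves little; the experiment-level result comes from aggregating these per-query outcomes over many queries.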

Team draft interleaving

The "Team Draft" method ensures a fair and balanced mix of results. Imagine two team captains (Model A and Model B) taking turns picking their best remaining result to place in the final list.

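This implementation illustrates a simplified Team Draft strategy: it flips one coin to decide who picks first and then alternates strictly, whereas the original Team Draft method re-randomizes the pick order whenever both teams hold the same number of items. It accepts two ranked lists of results and returns a single interleaved list along with the items attributed to each model.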

team-draft-interleaving.py
```python
import random

def next_unique(
    results: list[object],
    start_idx: int,
    seen: set[object],
) -> tuple[object | None, int]:
    """Returns next unseen result and updated index."""
    idx = start_idx
    while idx < len(results) and results[idx] in seen:
        idx += 1

    if idx >= len(results):
        return None, idx

    return results[idx], idx + 1

def team_draft_interleave(
    results_a: list[object],
    results_b: list[object],
    k: int = 10,
    rng: random.Random | None = None,
) -> tuple[list[object], list[object], list[object]]:
    """
    Interleaves two ranked lists using the Team Draft method.
    Returns: (interleaved_list, items_from_a, items_from_b)
    """
    interleaved = []
    team_a: list[object] = []
    team_b: list[object] = []
    seen: set[object] = set()
    idx_a, idx_b = 0, 0
    rng = rng or random.Random()
    turn = rng.choice(["a", "b"])

    while len(interleaved) < k:
        picked = False

        # Try the current team first, then fall back to the other team if needed.
        for candidate in (turn, "b" if turn == "a" else "a"):
            if candidate == "a":
                item, idx_a = next_unique(results_a, idx_a, seen)
                if item is None:
                    continue
                interleaved.append(item)
                team_a.append(item)
                seen.add(item)
                turn = "b"
                picked = True
                break

            item, idx_b = next_unique(results_b, idx_b, seen)
            if item is None:
                continue
            interleaved.append(item)
            team_b.append(item)
            seen.add(item)
            turn = "a"
            picked = True
            break

        if not picked:
            break  # Both lists exhausted or only duplicates remain.

    return interleaved, team_a, team_b

results_a = ["sony", "sennheiser", "audio-technica", "jbl", "anker"]
results_b = ["bose", "sony", "apple", "anker", "jabra"]

mixed, from_a, from_b = team_draft_interleave(
    results_a,
    results_b,
    k=5,
    rng=random.Random(7),
)

print("mixed:", mixed)
print("from A:", from_a)
print("from B:", from_b)
```
Output
```
mixed: ['bose', 'sony', 'apple', 'sennheiser', 'anker']
from A: ['sony', 'sennheiser']
from B: ['bose', 'apple', 'anker']
```
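To turn an interleaved list into a preference signal, clicks are usually credited to the team that contributed the clicked item. The sketch below assumes a hypothetical `credit_clicks` helper and an invented click log; it reuses the `from_a` and `from_b` lists printed above and is not part of the original implementation.

```python
def credit_clicks(
    clicks: list[str],
    team_a: list[str],
    team_b: list[str],
) -> str:
    """Credits each click to the team that contributed the item (hypothetical helper)."""
    a_items, b_items = set(team_a), set(team_b)
    credit_a = sum(item in a_items for item in clicks)
    credit_b = sum(item in b_items for item in clicks)
    if credit_a == credit_b:
        return "tie"
    return "A" if credit_a > credit_b else "B"

# Team assignments from the example output above; the clicks are invented.
from_a = ["sony", "sennheiser"]
from_b = ["bose", "apple", "anker"]
clicks = ["sony", "sennheiser", "anker"]

print("session winner:", credit_clicks(clicks, from_a, from_b))
```

Counting per-session winners across many sessions gives the preference estimate that the comparison is then tested on.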

Statistical rigor

Multiple testing correction

If you searched through 20 independent secondary metrics at $\alpha = 0.05$, you would have about a 64% chance ($1 - 0.95^{20} \approx 0.64$) of finding at least one "significant" result purely by chance. That probability is the Family-Wise Error Rate (FWER). Real experiment metrics are often correlated, so the exact number changes, but the core problem doesn't. Predeclare the primary decision metric; when you interpret a family of exploratory metrics, segments, or rubric axes, apply an appropriate correction such as Benjamini-Hochberg[7] to control the False Discovery Rate (FDR).
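The 64% figure follows from the independence assumption stated above. This minimal sketch, using a hypothetical `family_wise_error_rate` helper, shows how quickly the chance of at least one false positive grows with the number of tests:

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests

for num_tests in (1, 5, 20):
    print(f"{num_tests:>2} tests: FWER = {family_wise_error_rate(num_tests):.1%}")
```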

The code below applies multiple testing corrections to experimental results. It takes a dictionary of metric names and their corresponding test statistics and p-values, and returns the adjusted significance outcomes.

multiple-testing-correction.py
```python
from collections.abc import Mapping

def bonferroni_adjust(p_values: list[float]) -> list[float]:
    n = len(p_values)
    return [min(p * n, 1.0) for p in p_values]

def benjamini_hochberg_adjust(p_values: list[float]) -> list[float]:
    n = len(p_values)
    indexed = sorted(enumerate(p_values), key=lambda item: item[1])
    adjusted = [0.0] * n
    running_min = 1.0

    for rank, (idx, p_value) in reversed(list(enumerate(indexed, start=1))):
        running_min = min(running_min, p_value * n / rank)
        adjusted[idx] = min(running_min, 1.0)

    return adjusted

def analyze_experiment(
    metrics_results: dict[str, tuple[float, float]],
    alpha: float = 0.05,
) -> Mapping[str, Mapping[str, float | bool]]:
    """
    Applies multiple testing correction to a dictionary of {metric_name: (statistic, p_value)}.
    """
    p_values = [result[1] for result in metrics_results.values()]
    p_adj_bonf = bonferroni_adjust(p_values)
    p_adj_bh = benjamini_hochberg_adjust(p_values)

    return {
        name: {
            "p_adj_bonferroni": p_adj_bonf[i],
            "significant_bonferroni": p_adj_bonf[i] <= alpha,
            "p_adj_bh": p_adj_bh[i],
            "significant_bh": p_adj_bh[i] <= alpha,
        }
        for i, name in enumerate(metrics_results.keys())
    }

# Example: four metrics from a support bot experiment
results = {
    "policy_compliance": (3.3, 0.001),
    "resolution_rate": (2.3, 0.020),
    "latency_p95": (1.4, 0.12),
    "cost_per_session": (0.8, 0.42),
}

for metric, values in analyze_experiment(results).items():
    print(
        f"{metric}: "
        f"bonf={values['p_adj_bonferroni']:.3f} "
        f"bh={values['p_adj_bh']:.3f} "
        f"significant_bh={values['significant_bh']}"
    )
```
Output
```
policy_compliance: bonf=0.004 bh=0.004 significant_bh=True
resolution_rate: bonf=0.080 bh=0.040 significant_bh=True
latency_p95: bonf=0.480 bh=0.160 significant_bh=False
cost_per_session: bonf=1.000 bh=0.420 significant_bh=False
```

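Here resolution_rate survives Benjamini-Hochberg but not Bonferroni. Bonferroni multiplies every p-value by the number of tests (0.020 × 4 = 0.080), while Benjamini-Hochberg scales by the number of tests divided by the p-value's rank (0.020 × 4 / 2 = 0.040 for the second-smallest p-value). latency_p95 and cost_per_session fail both corrections.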

Sequential testing

Fixed-horizon A/B tests require choosing the sample size $N$ in advance and avoiding early stopping based on ordinary p-values. Checking results daily and stopping whenever they cross $p < 0.05$ inflates the false-positive rate. A registered sequential design[8] permits planned interim looks by allocating error budget across those looks, allowing an early decision only when its boundary is crossed.

One approach uses a spending function such as O'Brien-Fleming to allocate total error budget across planned checkpoints. Early checks require stronger evidence. Exact critical values depend on the design, sidedness, information fraction, and analysis method; generate them with a statistical package and register them before traffic starts.
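To make "early checks require stronger evidence" concrete, the sketch below evaluates the Lan-DeMets O'Brien-Fleming-type spending function, $\alpha(t) = 2 - 2\Phi(z_{\alpha/2} / \sqrt{t})$, at four planned information fractions. The function name `obf_alpha_spent` and the look schedule are illustrative assumptions. The code only shows how the error budget is allocated; converting spent alpha into critical z-values needs the joint distribution across looks, which is what a group-sequential package computes.

```python
from statistics import NormalDist

def obf_alpha_spent(information_fraction: float, alpha: float = 0.05) -> float:
    """Cumulative two-sided alpha spent under the O'Brien-Fleming-type spending function."""
    if not 0 < information_fraction <= 1:
        raise ValueError("information_fraction must be in (0, 1]")
    std_normal = NormalDist()
    z_final = std_normal.inv_cdf(1 - alpha / 2)
    return 2 * (1 - std_normal.cdf(z_final / information_fraction ** 0.5))

previous = 0.0
for fraction in (0.25, 0.50, 0.75, 1.00):
    spent = obf_alpha_spent(fraction)
    print(f"look={fraction:.0%} cumulative={spent:.5f} incremental={spent - previous:.5f}")
    previous = spent
```

Almost none of the budget is spent at the first look, which is why an early stop under this design demands a very large observed effect.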

Repeatedly checking p-values without a sequential testing correction inflates your false positive rate. The exact inflation depends on the metric, test, and correlation between looks, which is why sequential methods matter.
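A small Monte Carlo run illustrates the inflation. This is a sketch under simplifying assumptions: there is no true effect, every look adds an equally sized batch whose standardized contribution is a standard normal draw, and the 20 looks, 5,000 simulated experiments, and seed are arbitrary choices.

```python
import random

def simulate_false_positive_rates(
    n_experiments: int = 5000,
    n_looks: int = 20,
    critical_z: float = 1.96,
    seed: int = 0,
) -> tuple[float, float]:
    """Returns (stop_at_any_look_rate, final_look_only_rate) under a true null effect."""
    rng = random.Random(seed)
    any_look_hits = 0
    final_look_hits = 0

    for _ in range(n_experiments):
        cumulative = 0.0
        crossed_any = False
        z = 0.0
        for look in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)  # One more batch of null data.
            z = cumulative / look ** 0.5       # z-statistic on all data so far.
            if abs(z) >= critical_z:
                crossed_any = True
        any_look_hits += crossed_any
        final_look_hits += abs(z) >= critical_z  # Only the final look counts.

    return any_look_hits / n_experiments, final_look_hits / n_experiments

peeking, fixed = simulate_false_positive_rates()
print(f"stop at any look: {peeking:.1%}")
print(f"final look only:  {fixed:.1%}")
```

Under these assumptions the final-look rate stays near the nominal 5%, while stopping at the first crossing produces a several-fold higher rate.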

[Figure: Sequential testing chart comparing naive repeated fixed boundaries with a stricter registered boundary schedule.]
Naive daily peeking keeps reusing the ordinary threshold. A planned sequential design uses precomputed stricter early boundaries; do not copy illustrative numbers without fitting your design.

The monitoring code below deliberately consumes precomputed boundaries instead of pretending to calculate a valid sequential test. It records four registered looks and stops only when the observed statistic exceeds that look's critical value:

sequential-testing.py
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RegisteredLook:
    fraction: float
    critical_z: float

# Example boundary schedule supplied by the registered design.
# Do not derive a real experiment's boundaries by copying these values.
looks = [
    RegisteredLook(0.25, 3.47),
    RegisteredLook(0.50, 2.45),
    RegisteredLook(0.75, 2.14),
    RegisteredLook(1.00, 2.01),
]
observed_z = {0.25: 1.20, 0.50: 2.10, 0.75: 2.31, 1.00: 0.0}

for look in looks:
    z_value = observed_z[look.fraction]
    crossed = abs(z_value) >= look.critical_z
    print(f"look={look.fraction:.0%} z={z_value:.2f} boundary={look.critical_z:.2f} crossed={crossed}")
    if crossed:
        print(f"stop at planned {look.fraction:.0%} look")
        break
```
Output
```
look=25% z=1.20 boundary=3.47 crossed=False
look=50% z=2.10 boundary=2.45 crossed=False
look=75% z=2.31 boundary=2.14 crossed=True
stop at planned 75% look
```

Guardrail monitoring

Guardrails protect users and your production environment during exposure. Different signals require different actions. A synchronous personally identifiable information (PII) or unsafe-content detector can block one response immediately. A sampled judge safety rate, p95 latency, or average cost is an aggregate estimate with noise; it usually needs a defined window and persistence rule before you pause or roll back a treatment.

Setting appropriate thresholds requires balancing safety with experimental velocity. If thresholds are too tight, normal statistical noise will trigger false alarms and halt valid experiments. If they're too loose, users might be exposed to harmful model degradation before the system reacts.

Set guardrails from the baseline distribution you observed during stable periods, not from arbitrary fixed values. This reduces false rollbacks caused by normal day-to-day traffic swings.

This Python class monitors aggregate windows. A single breached window raises an alert; only a sustained breach reaches the predeclared action. In a real system, keep synchronous per-response blocking outside this aggregate loop.

guardrail-monitoring.py
```python
from typing import TypedDict

class Threshold(TypedDict):
    direction: str  # 'min' or 'max'
    value: float
    consecutive_windows: int
    action: str

class ExperimentGuardrails:
    def __init__(self, thresholds: dict[str, Threshold]):
        self.thresholds = thresholds
        self.streaks = {metric: 0 for metric in thresholds}

    def check(self, experiment_data: dict[str, float]) -> list[str]:
        actions: list[str] = []

        for metric, threshold in self.thresholds.items():
            current = experiment_data.get(metric)
            if current is None:
                continue

            breached = (
                current > threshold["value"]
                if threshold["direction"] == "max"
                else current < threshold["value"]
            )
            self.streaks[metric] = self.streaks[metric] + 1 if breached else 0
            if breached:
                print(f"alert {metric}: streak={self.streaks[metric]}")
            if self.streaks[metric] == threshold["consecutive_windows"]:
                actions.append(f"{threshold['action']}: {metric}")

        return actions

guardrails = ExperimentGuardrails({
    "latency_p95_ms": {"direction": "max", "value": 3000, "consecutive_windows": 2, "action": "pause exposure"},
    "error_rate": {"direction": "max", "value": 0.001, "consecutive_windows": 2, "action": "rollback treatment"},
    "cost_per_session_usd": {"direction": "max", "value": 0.08, "consecutive_windows": 2, "action": "review spend"},
})

windows = [
    {"latency_p95_ms": 3300, "error_rate": 0.002, "cost_per_session_usd": 0.09},
    {"latency_p95_ms": 3400, "error_rate": 0.003, "cost_per_session_usd": 0.091},
]
for index, window in enumerate(windows, start=1):
    print(f"window {index} actions:", guardrails.check(window))
```
Output
```
alert latency_p95_ms: streak=1
alert error_rate: streak=1
alert cost_per_session_usd: streak=1
window 1 actions: []
alert latency_p95_ms: streak=2
alert error_rate: streak=2
alert cost_per_session_usd: streak=2
window 2 actions: ['pause exposure: latency_p95_ms', 'rollback treatment: error_rate', 'review spend: cost_per_session_usd']
```

The first window alerts without making an aggregate decision. The second sustained window reaches each registered action; an objective sustained error breach triggers rollback while latency and cost cause pause or review.
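The thresholds in the monitor above are hard-coded for illustration. Deriving one from the baseline distribution, as recommended earlier, might look like the sketch below. The mean-plus-three-standard-deviations rule, the helper name, and the latency readings are all assumptions; a high percentile of stable windows is an equally reasonable rule.

```python
from statistics import mean, stdev

def max_threshold_from_baseline(baseline: list[float], k: float = 3.0) -> float:
    """Upper guardrail threshold as baseline mean plus k standard deviations."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline windows")
    return mean(baseline) + k * stdev(baseline)

# Invented p95 latency readings (ms) from stable control windows.
baseline_latency_p95_ms = [2410, 2480, 2390, 2520, 2450, 2470, 2430, 2500]
threshold = max_threshold_from_baseline(baseline_latency_p95_ms)
print(f"latency_p95_ms max threshold: {threshold:.0f}")
```

A wider `k` trades slower detection for fewer false rollbacks, which is the same balance described for thresholds in general.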

Multi-armed bandits

Standard A/B tests explore first (to find a winner) and exploit later (when you ship the winner to 100% of users). But what if the exploration phase is too costly? If a bad model variant causes users to churn, you want to stop sending traffic to it as quickly as possible.

Multi-armed bandits (MAB) dynamically adjust traffic allocation during the experiment based on real-time performance. Using algorithms such as Thompson Sampling, the system routes more traffic to variants that appear to be winning and less traffic to variants that appear weak. This minimizes "regret," which is the total penalty incurred by exposing users to suboptimal models. Bandits fit continuous optimization tasks (such as prompt tuning) where you have many variants and aim to automatically deprecate underperforming ones without waiting for a fixed-horizon A/B test.

But MABs introduce architectural complexity. They require a tightly coupled feedback loop where user actions (such as a thumbs up) immediately update the routing logic. In distributed systems with high latency or delayed rewards (for example, measuring 7-day retention), bandits can be challenging to implement correctly compared to static A/B test splits.

The following Python example illustrates a basic Beta-Bernoulli Thompson Sampling bandit. It assumes a binary reward signal such as click/no-click or accept/reject. If your reward is continuous or heavily delayed, the posterior update changes.

multi-armed-bandits.py
```python
import random

class ThompsonSamplingBandit:
    def __init__(self, n_arms: int, rng: random.Random | None = None):
        self.successes = [1] * n_arms  # Alpha parameter (prior)
        self.failures = [1] * n_arms  # Beta parameter (prior)
        self.pulls = [0] * n_arms
        self.rng = rng or random.Random()

    def select_arm(self) -> int:
        """Samples from the Beta distribution for each arm and selects the highest."""
        sampled_theta = [
            self.rng.betavariate(successes, failures)
            for successes, failures in zip(self.successes, self.failures)
        ]
        return max(range(len(sampled_theta)), key=sampled_theta.__getitem__)

    def update(self, arm: int, reward: int):
        """Updates the posterior with a Bernoulli outcome (0 or 1)."""
        if reward not in (0, 1):
            raise ValueError("reward must be 0 or 1 for Beta-Bernoulli Thompson Sampling")

        self.successes[arm] += reward
        self.failures[arm] += (1 - reward)
        self.pulls[arm] += 1

rng = random.Random(7)
true_resolution_rates = [0.55, 0.62]
bandit = ThompsonSamplingBandit(n_arms=2, rng=rng)

for _ in range(500):
    arm = bandit.select_arm()  # 0 = control, 1 = treatment
    reward = int(rng.random() < true_resolution_rates[arm])
    bandit.update(arm, reward)

print("traffic:", bandit.pulls)
print("successes:", bandit.successes)
print("failures:", bandit.failures)
```
Output
```
traffic: [151, 349]
successes: [89, 227]
failures: [64, 124]
```
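Regret can be read directly off that traffic split. The sketch below uses a hypothetical `expected_regret` helper and compares the bandit's allocation with a fixed 50/50 split over the same 500 sessions. It relies on the simulation's true rates, which are never known in a real experiment.

```python
def expected_regret(pulls: list[int], true_rates: list[float]) -> float:
    """Expected reward lost versus always playing the best arm."""
    best = max(true_rates)
    return sum(n * (best - rate) for n, rate in zip(pulls, true_rates))

true_resolution_rates = [0.55, 0.62]
bandit_pulls = [151, 349]  # Traffic split from the run above.
even_split = [250, 250]    # A fixed 50/50 A/B test over the same 500 sessions.

print(f"bandit regret: {expected_regret(bandit_pulls, true_resolution_rates):.2f}")
print(f"50/50 regret:  {expected_regret(even_split, true_resolution_rates):.2f}")
```

The bandit gives up roughly 10.6 expected resolutions against 17.5 for the even split, at the cost of less balanced data for inference.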

Biases that distort results

Evaluating generative AI introduces unique psychological and statistical biases that typically don't appear in standard software experiments.

Position bias

Users often prefer the first option shown, simply because it requires less effort to read. In pairwise comparisons (such as Side-by-Side evaluation or Reinforcement Learning from Human Feedback (RLHF) labeling), this bias can skew results significantly.

Solution

Always randomize the order of presentation. If showing two model outputs A and B, ensure A appears first for 50% of users and B appears first for the other 50%. Log the display order to control for this bias during analysis.

Running a head-to-head judge comparison with A always first can make order effects look like model quality. Swap positions and analyze order effects so presentation bias is measured and reduced rather than silently attributed to the model.
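Once the display order is logged, the order effect can be measured instead of guessed. This sketch uses invented comparison logs and a hypothetical `win_rate_by_order` helper to split one model's win rate by whether it was shown first or second:

```python
from collections import defaultdict

# Invented pairwise logs: which model was displayed first and which one won.
comparisons = [
    {"first_shown": "A", "winner": "A"},
    {"first_shown": "A", "winner": "A"},
    {"first_shown": "A", "winner": "B"},
    {"first_shown": "A", "winner": "A"},
    {"first_shown": "B", "winner": "B"},
    {"first_shown": "B", "winner": "A"},
    {"first_shown": "B", "winner": "B"},
    {"first_shown": "B", "winner": "B"},
]

def win_rate_by_order(rows: list[dict[str, str]], model: str) -> dict[str, float]:
    """Win rate of `model` split by whether it was displayed first or second."""
    wins: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for row in rows:
        slot = "first" if row["first_shown"] == model else "second"
        totals[slot] += 1
        wins[slot] += row["winner"] == model
    return {slot: wins[slot] / totals[slot] for slot in totals}

rates = win_rate_by_order(comparisons, "A")
print(f"A shown first:  {rates['first']:.0%}")
print(f"A shown second: {rates['second']:.0%}")
print(f"order gap: {rates['first'] - rates['second']:+.0%}")
```

In this toy log each model wins 75% of the time when shown first, so the apparent preference is entirely an order effect. With real data the gap needs a confidence interval before it is interpreted.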

Novelty effect

Users initially engage more with new models or features just because they're different, not necessarily better. A new return-label feature might see a spike in usage on Day 1 that evaporates by Day 7.

Solution

Plan duration around known demand cycles and novelty risk. A full weekly cycle is a reasonable starting point when weekday and weekend traffic differ, but it isn't a universal stopping rule. If you want a burn-in window, define it before launch and exclude it consistently. Don't decide to drop early days after seeing a spike.
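A predeclared burn-in can be applied mechanically at analysis time. The sketch below uses invented daily usage rates and an assumed two-day window. For simplicity it averages daily rates without weighting by each day's sample size, which a real analysis would need to handle.

```python
BURN_IN_DAYS = 2  # Assumed value; fix it in the experiment plan before launch.

# Invented daily usage rates for a new feature (day 1 is launch day).
daily_usage_rate = {1: 0.31, 2: 0.24, 3: 0.15, 4: 0.14, 5: 0.15, 6: 0.13, 7: 0.14}

def mean_after_burn_in(daily: dict[int, float], burn_in_days: int) -> float:
    """Average of the daily metric, skipping the predeclared burn-in window."""
    kept = [value for day, value in daily.items() if day > burn_in_days]
    if not kept:
        raise ValueError("burn-in window covers the whole experiment")
    return sum(kept) / len(kept)

all_days = sum(daily_usage_rate.values()) / len(daily_usage_rate)
steady_state = mean_after_burn_in(daily_usage_rate, BURN_IN_DAYS)
print(f"all days: {all_days:.3f}")
print(f"after {BURN_IN_DAYS}-day burn-in: {steady_state:.3f}")
```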

Simpson's paradox

A model can look better overall even while losing inside every major segment. This usually happens when the analyzed sample has different segment mixes across variants.

| Segment | Model A (Resolution %, analyzed n) | Model B (Resolution %, analyzed n) | Winner |
| --- | --- | --- | --- |
| Mobile Users | 85% (200) | 80% (800) | Model A |
| Desktop Users | 70% (800) | 65% (200) | Model A |
| Combined | 73% | 77% | Model B |

Both segment-level comparisons favor Model A, but the aggregate flips because Model B's analyzed sample contains far more mobile users (who have higher resolution rates overall). That's Simpson's paradox.

[Figure: Simpson's paradox example where Model A wins mobile and desktop segments, but Model B appears better in the aggregate because analyzed traffic mix changes.]
Segment checks reveal the real story: Model A wins both cohorts, while the aggregate flips only because Model B's analyzed sample skews mobile.

How to avoid it

Always segment results by pre-treatment user groups (for example, tenure, subscription tier, device type) and inspect the sample mix in each arm. Be especially careful with triggered analyses, missing-label subsets, or any filter that treatment can influence. Before declaring a winner, verify that the treatment doesn't severely degrade the experience for critical cohorts, even if the aggregate top-line metric improves.

The calculation below reproduces the aggregate reversal. Both pre-treatment cohorts favor Model A even though Model B's mobile-heavy analyzed sample wins after pooling:

simpsons-paradox-segment-check.py
```python
results = {
    "A": {"mobile": (170, 200), "desktop": (560, 800)},
    "B": {"mobile": (640, 800), "desktop": (130, 200)},
}

def rate(successes: int, total: int) -> float:
    return successes / total

for segment in ["mobile", "desktop"]:
    a = rate(*results["A"][segment])
    b = rate(*results["B"][segment])
    print(f"{segment:7} A={a:.0%} B={b:.0%} winner={'A' if a > b else 'B'}")

for arm in ["A", "B"]:
    successes = sum(pair[0] for pair in results[arm].values())
    total = sum(pair[1] for pair in results[arm].values())
    print(f"aggregate {arm}={rate(successes, total):.0%}")
```
Output
```
mobile  A=85% B=80% winner=A
desktop A=70% B=65% winner=A
aggregate A=73%
aggregate B=77%
```
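One common way to remove the mix effect is direct standardization: weight each arm's segment rates by the same segment mix. The sketch below uses the pooled sample's mix as the shared weights, which is an assumption; any predeclared reference mix would work. The helper name `standardized_rate` is hypothetical.

```python
results = {
    "A": {"mobile": (170, 200), "desktop": (560, 800)},
    "B": {"mobile": (640, 800), "desktop": (130, 200)},
}

def standardized_rate(arm: dict[str, tuple[int, int]], weights: dict[str, float]) -> float:
    """Weights each segment's rate by a shared segment mix instead of the arm's own mix."""
    return sum(weights[seg] * (succ / total) for seg, (succ, total) in arm.items())

# Shared weights: each segment's share of the pooled sample across both arms.
segment_totals = {
    seg: sum(results[arm][seg][1] for arm in results) for seg in ("mobile", "desktop")
}
grand_total = sum(segment_totals.values())
weights = {seg: n / grand_total for seg, n in segment_totals.items()}

for arm in ("A", "B"):
    print(f"standardized {arm}={standardized_rate(results[arm], weights):.1%}")
```

With a shared 50/50 mix, Model A comes out at 77.5% against 72.5% for Model B, matching the segment-level story.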

LLM-as-judge for offline screening

Full A/B tests with human evaluation are expensive and slow. LLM-as-Judge[9] offers an automated quality comparison that screens candidates before you commit to a live A/B test.

LLM judges are useful triage systems, not ground truth. They can cheaply filter obvious regressions in offline evaluation or shadow mode, but they also inherit position bias, verbosity bias, and judge-model bias[9]. Use them to narrow the field before you consume valuable live traffic.

The asynchronous function below uses a strong LLM to evaluate two competing model outputs. It takes the original prompt and both responses as inputs, and returns a judgment ("A", "B", or "TIE") with a brief justification.

llm-as-judge-for-offline-screening.py
```python
from collections.abc import Mapping

# `judge_model` (an async client exposing `generate`) and `parse_winner`
# (which extracts "A", "B", or "TIE" from the raw judge text) are not defined
# in this snippet; supply your own before running it.

async def llm_judge_compare(
    query: str,
    response_a: str,
    response_b: str,
    criteria: list[str] | None = None,
) -> Mapping[str, str]:
    """Use a strong LLM to judge which response is better."""
    criteria = criteria or ["accuracy", "helpfulness", "clarity", "completeness"]
    judge_prompt = f"""You are an expert evaluator. Compare these two responses to the query.

Query: {query}

Response A: {response_a}
Response B: {response_b}

Rate each response on a scale of 1-5 for: {', '.join(criteria)}.
Then pick the overall better response: A, B, or TIE.
Explain your reasoning in one sentence."""

    result = await judge_model.generate(judge_prompt)
    # "reasoning" holds the judge's full raw output, including the scores.
    return {"winner": parse_winner(result), "reasoning": result}
```

Key practices for LLM-as-judge

  • Swap positions: To measure position bias, run comparisons in both A/B and B/A order.
  • Use a capable judge candidate: Greater capability may help, but calibration against human labels decides whether the judge is usable.
  • Calibrate against humans: Measure judge agreement on a held-out human-labeled set before you trust it for go/no-go decisions.
  • Audit length bias: Slice wins by response length and refusal style to catch judges that reward verbosity instead of correctness.
  • Don't replace A/B tests entirely: Use LLM-as-Judge to filter candidates, then validate the winning candidate with a real A/B test on production traffic.

Published judge evaluations report position and verbosity biases.[9] Check whether your judge's wins correlate with display order or response length before trusting its verdict.
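The position-swap practice can be sketched with stub judges in place of a real model. Everything here is illustrative: the `Judge` callable signature, the `position_swapped_verdict` helper, and the two stubs are assumptions, not an existing API.

```python
from collections.abc import Callable

# (query, first_displayed, second_displayed) -> "FIRST", "SECOND", or "TIE"
Judge = Callable[[str, str, str], str]

def position_swapped_verdict(judge: Judge, query: str, response_a: str, response_b: str) -> str:
    """Runs both display orders and keeps a winner only when the two runs agree."""
    forward = judge(query, response_a, response_b)   # A displayed first
    backward = judge(query, response_b, response_a)  # B displayed first

    forward_winner = {"FIRST": "A", "SECOND": "B"}.get(forward, "TIE")
    backward_winner = {"FIRST": "B", "SECOND": "A"}.get(backward, "TIE")
    return forward_winner if forward_winner == backward_winner else "INCONSISTENT"

def always_first(query: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "FIRST"

def longer_wins(query: str, first: str, second: str) -> str:
    """Stub judge that prefers the longer response regardless of position."""
    if len(first) == len(second):
        return "TIE"
    return "FIRST" if len(first) > len(second) else "SECOND"

query = "How do I return headphones?"
a, b = "Use the returns portal within 30 days.", "Contact support."
print(position_swapped_verdict(always_first, query, a, b))
print(position_swapped_verdict(longer_wins, query, a, b))
```

The position-biased stub is flagged as inconsistent, but the length-biased stub passes the swap check cleanly, which is why the length audit is a separate practice.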

Online evaluation and the offline-online gap

Everything upstream of the live A/B test (golden datasets, rubrics, offline judges) measures what your variant does on a frozen set you chose. Production measures what it does on traffic you did not choose. That difference is the offline-online gap, and closing it is the whole point of online evaluation. A variant can win 90% of offline comparisons and still regress live because the real query mix is different, latency and cost shift under load, and rare safety failures surface only under the production distribution.

A common production pattern is to keep the judge running after launch instead of only before it. You sample a predeclared slice of live traffic sized for budget and detection needs, attach the request, retrieved context, tool calls, and output to a trace, then score that trace asynchronously with the same rubric you used offline. Because the eval is linked to a trace, a quality drop points you straight to the failing request rather than to an aggregate dashboard with no drill-down. This is the trace-linked online judge.

Keep two ideas separate, because production systems implement them differently:

| Mechanism | Timing | Job | Latency budget |
| --- | --- | --- | --- |
| Guardrail (inline) | Synchronous, in the request path | Block a specific failure (toxicity, PII, schema break) before the user sees it | Milliseconds |
| Online evaluator (judge) | Asynchronous, after the response | Score quality on sampled traffic and watch for drift over time | Seconds, off the hot path |

An inline guardrail is a safety gate that must run before the response ships. An online judge is a measurement system that runs after, on a sample; when it is kept off the request path, it doesn't add user-facing latency. Confusing the two leads people to either slow every request with a judge call or to treat a slow async score as a real-time block.

Online judges also drift when the judge version, rubric, prompt, or traffic mix changes. The fix is the same calibration discipline you use offline, run continuously. Track judge-vs-human agreement on a rotating audit set, choose an acceptance threshold for your decision risk, log the full judge configuration on every score, and re-calibrate whenever that configuration changes.

This audit check uses an explicit local policy rather than a universal agreement cutoff:

judge-calibration-gate.py
```python
human_labels = ["A", "B", "A", "TIE", "B", "A", "B", "B"]
judge_labels = ["A", "B", "B", "TIE", "B", "A", "A", "B"]
minimum_agreement = 0.80

matches = sum(human == judge for human, judge in zip(human_labels, judge_labels))
agreement = matches / len(human_labels)
decision = "enable trend monitoring" if agreement >= minimum_agreement else "recalibrate judge"

print(f"audit examples: {len(human_labels)}")
print(f"judge-human agreement: {agreement:.1%}")
print(f"policy minimum: {minimum_agreement:.1%}")
print(f"decision: {decision}")
```
Output
```
audit examples: 8
judge-human agreement: 75.0%
policy minimum: 80.0%
decision: recalibrate judge
```
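Logging the full judge configuration on every score makes configuration changes detectable. One way to do that is to store a fingerprint of the configuration next to each trace. In this sketch the `JudgeConfig` fields, the `score_record` helper, and all values are hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class JudgeConfig:
    # All fields are illustrative; record whatever defines your judge.
    model: str
    rubric_version: str
    prompt_template: str
    temperature: float

    def fingerprint(self) -> str:
        """Stable short hash of the full configuration."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

def score_record(trace_id: str, score: float, config: JudgeConfig) -> dict[str, object]:
    """Builds the row stored next to the trace for one judged response."""
    return {"trace_id": trace_id, "score": score, "judge_fingerprint": config.fingerprint()}

calibrated = JudgeConfig("judge-model-v1", "support-rubric-3", "Rate the answer...", 0.0)
current = JudgeConfig("judge-model-v1", "support-rubric-4", "Rate the answer...", 0.0)

record = score_record("trace-001", 4.0, current)
needs_recalibration = record["judge_fingerprint"] != calibrated.fingerprint()
print(record)
print("needs recalibration:", needs_recalibration)
```

Here only the rubric version changed, yet the fingerprint differs from the one that was calibrated, so scores from the new configuration should not be trended against the old ones until agreement is re-measured.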

Tooling

Managing LLM experiments at scale usually requires more than spreadsheets and ad hoc notebooks. Common choices include:

| Category | Examples | Purpose |
| --- | --- | --- |
| Experiment tracking | W&B Weave, MLflow | Track runs, compare variants, store metrics and artifacts |
| Testing and regression | DeepEval, Promptfoo, Giskard | Automate prompt tests, red-team checks, and eval regressions |
| Observability | Langfuse, Phoenix | Trace request flows, inspect latency and cost, debug failures |

When selecting tools, prioritize integration with your existing ML infrastructure, immutable experiment definitions, trace linkage, and guardrail alerting. A useful toolchain connects offline evaluation to online experiments with shared metric definitions.

Practice: build a prompt arena

Your turn. Build a small "Prompt Arena" CLI tool that automates the workflow we walked through.

Requirements

  1. Create a CSV file with 10 support questions (use the five golden questions from this article plus five more of your own).
  2. Write two system prompts: a baseline and a treatment.
  3. Send each question through both prompts using an API (OpenAI, Groq, or a local model via Ollama).
  4. Use a judge candidate or human reviewers to score each response on Clarity, Accuracy, and Policy Citation, then record how that evaluator was calibrated.
  5. Print a summary report: average scores per prompt, win rate, and which questions had the biggest gaps.

Check your understanding

  • If Prompt A scores higher on Accuracy but lower on Policy Citation, which metric is more important for a support bot? (Answer: You must define that policy before seeing results. Factual policy correctness is a hard requirement; citation completeness may be primary or a guardrail depending on product policy.)
  • You run 50 conversations and the treatment wins 60% of the time. Is that enough to ship? (Answer: Probably not. 50 conversations is far below the sample size needed for a 5-point lift at standard power. Run a power analysis.)
  • Why is per-request randomization dangerous in a multi-turn support chat? (Answer: A user might see different tones, policies, or context states on each turn, which breaks the experiment and creates a terrible experience.)
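The second answer can be checked numerically. The sketch below computes an exact two-sided binomial p-value for 30 wins in 50 comparisons against a 50% null, then a normal-approximation sample size for a 5-point lift. The 55% baseline, 5% significance level, and 80% power are assumptions chosen for illustration, and ties are ignored.

```python
from math import ceil, comb
from statistics import NormalDist

def two_sided_binomial_p(wins: int, n: int) -> float:
    """Exact two-sided p-value against a 50% win rate (symmetric null)."""
    tail = max(wins, n - wins)
    upper = sum(comb(n, k) for k in range(tail, n + 1)) / 2 ** n
    return min(1.0, 2 * upper)

def sample_size_per_arm(
    p_control: float,
    p_treatment: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Normal-approximation sample size for comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_control - p_treatment) ** 2)

print(f"30 wins out of 50: p = {two_sided_binomial_p(30, 50):.2f}")
print("per-arm n for 55% -> 60%:", sample_size_per_arm(0.55, 0.60))
```

A 60% win rate on 50 conversations is not distinguishable from a coin flip at the 5% level, and detecting a 5-point lift under these assumptions takes on the order of 1,500 sessions per arm.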

Mastery check

This is the last article in Inference & Production Scale. You now have the full production loop: serve a model fleet, autoscale it, route and fall back across providers, watch cost and tokens, observe it in production, and decide what gets to ship through online evaluation. That last decision is the one this article owns, and you should be able to defend the full online-evaluation plan, not just quote a p-value.

Evaluation rubric

  1. Design a rubric for a specific business use case and explain the difference between reference-based and reference-free evaluation.
  2. Identify why a per-request randomization breaks a multi-turn conversation experiment.
  3. Calculate a required sample size for a binary metric and explain why small samples lead to false positives.
  4. Name three deployment patterns (shadow, canary, live A/B) and describe when each is appropriate.
  5. Spot position bias, novelty effects, and Simpson's paradox in experimental results.

You should also be able to:

  • Pick a randomization unit that avoids SUTVA violations. For collaborative products, that often means team, organization, document cluster, or switchback randomization instead of per-user randomization.
  • Separate primary, secondary, and guardrail metrics. Resolution rate can be primary; safety rate, p95 latency, error rate, and cost per successful session are guardrails.
  • Size the test before launch. A tiny golden dataset is useful screening evidence, not launch evidence.
  • Choose shadow, canary, or live A/B testing based on risk and measurement goals.
  • Add predefined circuit breakers for objective critical failures and persistence rules for noisy aggregate guardrails.
  • Account for cache, cold-start, output-length, and routing bias before blaming the model.
  • Choose interleaving for ranking comparisons where both systems can be shown in one mixed result list.
  • Apply multiplicity corrections when many metrics, segments, or rubrics are tested at once.

Follow-up questions

How do you handle network effects in LLM A/B tests? Network effects break simple A/B tests when one user's treatment changes another user's experience. Shared AI-edited documents are a typical example: User A's improved edits can make User B happier even if User B is in the control group. Use cluster randomization at the team, organization, or document-cluster level. For short-lived effects, switchback experiments can also work.
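Cluster randomization can be as simple as hashing the cluster identifier instead of the user identifier. In this sketch the `assign_arm` helper, the experiment name, and the user-to-team mapping are hypothetical; the point is that everyone in a team lands in the same arm.

```python
import hashlib

def assign_arm(cluster_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically maps a whole cluster (team, org, document group) to one arm."""
    digest = hashlib.sha256(f"{experiment}:{cluster_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # Uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical users mapped to the team they collaborate in.
user_team = {"ana": "team-1", "bo": "team-1", "cy": "team-2", "di": "team-2"}

for user, team in user_team.items():
    print(user, team, assign_arm(team, experiment="doc-editor-v2"))
```

Because the unit of randomization is now the team, the analysis should also treat the team as the unit, which reduces the effective sample size compared with per-user assignment.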

Which common guardrails should an LLM product consider? Typical choices include p95/p99 latency, error rate, empty-response rate, safety violations, personally identifiable information (PII) leakage, and cost per successful session. Block objective per-response safety failures inline; pause or roll back exposure according to predefined persistence rules for aggregate signals.

How do you test multi-turn conversational effects? Randomize by session or conversation, not by request. Track conversation-level outcomes such as resolution rate, human handoff, sentiment trajectory, and retention after early turns. A bad first answer can poison every later turn.

When would you use a bandit instead of an A/B test? Use bandits when the goal is to optimize cumulative reward during exploration, such as continuous prompt tuning with fast feedback. Use classical A/B tests when you need clean inference, fixed exposure, long-term outcomes, or high-confidence launch evidence.

Common pitfalls

  • Using vanity metrics instead of action metrics tied to product value.
  • Randomizing every request in a multi-turn product.
  • Skipping sample-size and minimum-detectable-effect planning.
  • Ignoring cost per session and output-length growth.
  • Treating every metric and segment as independent without multiplicity correction.
  • Calling cold-start or cache-locality effects model regressions.
  • Running without safety, reliability, and cost guardrails.
  • Comparing judge outputs with A always first instead of swapping positions.
  • Declaring victory from a tiny golden dataset without live validation.

If you can defend those decisions, you have closed out production scale. Next comes System Design Capstones, where you assemble serving, observability, guardrails, and online evaluation into one end-to-end design under interview conditions. The first capstone, a real-time content moderation system, leans directly on the guardrail and online-evaluation thinking from this article: it has to keep user-generated content safe at scale using LLMs and specialized classifiers.


References

[1] Kohavi, R., Tang, D., Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing.

[2] Zhang, T., et al. (2020). BERTScore: Evaluating Text Generation with BERT. ICLR 2020.

[3] Liu, Y., et al. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.

[4] Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint.

[5] Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.

[6] Chapelle, O., et al. (2012). Large-scale Validation and Analysis of Interleaved Search Evaluation. ACM TOIS.

[7] Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B.

[8] Johari, R., et al. (2017). Peeking at A/B Tests: Why it matters, and what to do about it. KDD 2017.

[9] Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.