Hands-on chapter for experiment design, with first-principles mechanics, runnable code, failure modes, and production checks.
Experiment design is how you learn from users without fooling yourself. A/B testing compares choices under controlled conditions. This chapter starts from zero and builds toward the concrete job skill: Design an A/B test for a retrieval change with primary metric, guardrails, power assumptions, and rollback criteria. [1][2][3]
| Stage | Beginner action | Checkpoint |
|---|---|---|
| Concept | State primary metric before seeing results. | Reader can say input, operation, and output without naming a library. |
| Build | Write treatment, control, guardrails, and rollback rule. | Code prints or asserts one result the reader predicted first. |
| Failure | Multiple metrics do not become multiple chances to declare a win. | The common beginner mistake has a visible symptom and guard. |
| Ship | Experiment brief includes power, duration, and launch decision. | Artifact is small enough for another engineer to rerun. |
Start with one decision, one metric, and one split rule. If you change many things at once or inspect results until one looks good, you are not learning cleanly.
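One concrete split rule is deterministic hash bucketing, sketched below. This is a minimal illustration, not part of the chapter's demo: the experiment name, user ids, and 50/50 split are assumptions.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "retrieval-v2") -> str:
    """Deterministic 50/50 split: the same user always lands in the same bucket."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < 50 else "control"

# Same input, same answer: the split rule is reproducible and auditable.
assert assign_variant("user-42") == assign_variant("user-42")
print(assign_variant("user-42"))
```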
Read this chapter once for the idea, then run the demo and change one value. For Experiment Design, progress means you can name the input, explain the operation, and say what result would prove the idea worked.
By the end, you should be able to explain Experiment Design with a worked example, not a library name. Keep one runnable file and one short note with the result you expected before you ran it.
Experiment Design matters because later LLM work assumes this habit already exists. You will use it when you inspect data, debug model behavior, compare evaluations, or explain why a result should be trusted.
The job skill here is: Design an A/B test for a retrieval change with primary metric, guardrails, power assumptions, and rollback criteria. Treat the snippet as lab equipment: run it, change one input, and write down what changed before you move on.
Imagine a control prompt with 12 percent click-through and a treatment prompt with 12.6 percent. Relative lift says the treatment is 5 percent higher, but you still need sample size and uncertainty before shipping.
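To see what "sample size" means here, the sketch below estimates users per arm with the standard normal approximation for a two-proportion test. The alpha and power defaults are illustrative assumptions, not values fixed by the chapter.

```python
import math
from statistics import NormalDist

def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power=0.80
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

print(n_per_arm(0.12, 0.126))  # roughly 47,000 users per arm for this 5% relative lift
```

A 0.6-point absolute difference needs tens of thousands of users per arm, which is why eyeballing a 5 percent lift is not enough.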
A useful beginner checklist for Experiment Design:

- Name the primary metric before seeing any results.
- Write down treatment, control, and guardrail metrics.
- State power assumptions and planned duration.
- Define the rollback rule before launch.
Keep the answer concrete. If you can't point to the value, shape, row, metric, or test that proves the point, the Experiment Design concept is still fuzzy.
Use these definitions while reading the demo. Each term should map to a variable, an assertion, or a decision you could explain in review.

- Treatment: the variant carrying the change you want to test.
- Control: the current behavior the treatment is compared against.
- Primary metric: the single number, chosen in advance, that decides the experiment.
- Guardrail: a metric that must not regress even when the primary metric improves.
- Power: the probability the test detects a real effect of the size you care about.
- Rollback rule: the pre-agreed condition under which you revert the change.
Start with the smallest version that can run from a terminal. The goal for this Experiment Design demo is visibility: one file, one output, and no hidden notebook state.
```python
def relative_lift(control, treatment):
    return (treatment - control) / control

baseline_ctr = 0.12
new_ctr = 0.126
print(f"lift={relative_lift(baseline_ctr, new_ctr):.1%}")
```
Read the code in this order:
- `relative_lift` subtracts control from treatment to get the absolute change.
- Dividing by `control` turns that change into a percentage relative to the baseline.
- `baseline_ctr` and `new_ctr` are rates, not raw counts.

After it runs, make three small edits: add a normal-case test, add an edge-case test, then log the intermediate value a beginner would most likely misunderstand. That turns Experiment Design from a reading exercise into an engineering exercise; a sketch of those edits follows.
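One way those edits could look is below. This is a minimal sketch: the tolerance, the zero-control edge case, and the printed intermediate value are illustrative choices, not requirements from the chapter.

```python
def relative_lift(control, treatment):
    if control == 0:
        raise ValueError("control rate must be nonzero")
    change = treatment - control          # absolute change
    print(f"absolute change = {change}")  # the value beginners most often misread as the lift
    return change / control

# Normal case: 0.12 -> 0.126 is a 5% relative lift.
assert abs(relative_lift(0.12, 0.126) - 0.05) < 1e-9

# Edge case: a zero baseline has no defined relative lift.
try:
    relative_lift(0.0, 0.05)
except ValueError:
    print("zero-control edge case caught")
```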
For Experiment Design, a strong submission includes a runnable command, one test file, and notes for any assumptions. If data, randomness, training, or evaluation appears, save the split rule, seed, config, and metric definition.
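One lightweight way to save those decisions is a config file written before launch. The field names and values below are a hypothetical schema, not a standard.

```python
import json

# Hypothetical pre-registration record -- every field name here is illustrative.
experiment = {
    "name": "retrieval-v2-ab",
    "primary_metric": "click_through_rate",
    "guardrails": ["p95_latency_ms", "error_rate"],
    "split_rule": "sha256(experiment:user_id) % 100 < 50",
    "seed": 20240501,
    "min_n_per_arm": 47000,
    "max_duration_days": 14,
    "rollback_rule": "any guardrail regresses past its preset threshold",
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)
```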
A beginner may ship a 5 percent lift without noticing the test had too few users or harmed a guardrail metric.
For Experiment Design, make the failure visible before adding the fix. Write the symptom in plain English, then add the smallest guard that would catch it next time.
Good guards for Experiment Design are concrete: assertions, fixture rows, duplicate checks, seed control, metric intervals, or release checks. Pick the guard that makes the hidden assumption executable.
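For instance, two guards that make the "too few users" failure executable are sketched below; the minimum-n threshold is an assumption you would replace with your own power calculation.

```python
def check_experiment(control_users, treatment_users, min_n=47_000):
    """Fail loudly instead of reporting a lift from a broken experiment."""
    # Duplicate check: the same user in both arms breaks the comparison.
    overlap = set(control_users) & set(treatment_users)
    assert not overlap, f"{len(overlap)} users appear in both arms"
    # Sample-size check: a lift from an underpowered test is noise.
    assert len(control_users) >= min_n, "control arm is underpowered"
    assert len(treatment_users) >= min_n, "treatment arm is underpowered"
```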
Extend the demo with `control_n` and `treatment_n`, then print both the lift and the number of observations beside the result, as in the sketch below. Keep this ladder small: Experiment Design should feel runnable before it feels impressive. The capstones later reuse the same habit at product scale.
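A minimal sketch of that extension; the observation counts are invented for illustration.

```python
def relative_lift(control, treatment):
    return (treatment - control) / control

control_n, treatment_n = 50_000, 50_200      # invented observation counts
control_ctr, treatment_ctr = 0.12, 0.126

lift = relative_lift(control_ctr, treatment_ctr)
print(f"lift={lift:.1%}  control_n={control_n}  treatment_n={treatment_n}")
```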
Pre-register the metric, duration, guardrails, and decision rule. Monitor sample-ratio mismatch before trusting the result.
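A sample-ratio mismatch check compares observed arm counts against the planned split with a chi-square test. The sketch below assumes a 50/50 design and uses the one-degree-of-freedom identity P(X > x) = erfc(sqrt(x/2)) to avoid an extra dependency.

```python
import math

def srm_pvalue(control_n, treatment_n):
    """Chi-square test of observed arm sizes against a planned 50/50 split."""
    total = control_n + treatment_n
    expected = total / 2
    stat = ((control_n - expected) ** 2 + (treatment_n - expected) ** 2) / expected
    # Chi-square with 1 degree of freedom: P(X > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))

print(f"SRM p-value = {srm_pvalue(50_000, 50_200):.3f}")
# A tiny p-value (for example < 0.001) means the split itself is broken:
# stop and debug assignment before reading any metric.
```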
A production check for Experiment Design is proof another engineer can trust the result. At foundation level that means a reproducible command and tests. At capstone level it also means a design note, eval evidence, cost or latency notes, and rollback criteria.
Before moving on, answer four Experiment Design questions: What input does this accept? What output or metric proves it worked? What failure would fool you? What test catches that failure?
Ship a small Experiment Design folder with code, tests, and notes. Make it boring to run: install dependencies, run tests, run the demo. That boring path is what makes the artifact useful in a portfolio.
Experiment Design feeds later LLM engineering work directly. Retrieval, fine-tuning, agents, evals, and serving all depend on small foundations like this being clear before systems get large.