Compare an order-operations coding assistant with paired evidence, uncertainty for lift, and pass@k under a fixed sampling budget.
In the previous lesson, you simulated resolution outcomes and saw why random samples wobble. Now imagine an e-commerce engineering team evaluating a coding assistant. It must write small functions for refunds, return labels, delivery promises, and carrier scans. Hidden tests mark each generated function as pass or fail.
Model B passes one more task than Model A on a six-task evaluation. Is that enough evidence to replace the old model? This lesson teaches how to answer without confusing a promising result with a proven improvement.
The numbers below are deliberately small so you can calculate them by hand. They teach the method, not a production launch threshold.
Each prompt asks for one Python helper used in an order workflow. Both models receive the same prompt and are checked by the same hidden tests.
| Task | Hidden-test requirement | Model A | Model B |
|---|---|---|---|
| 1 | refund deadline is inclusive | pass | pass |
| 2 | damaged-item return label routing | fail | pass |
| 3 | late-delivery credit cap | pass | pass |
| 4 | carrier scan deduplication | fail | fail |
| 5 | partial-refund rounding | pass | pass |
| 6 | backorder cancellation window | fail | fail |
Convert pass to 1 and fail to 0. Then Model A has a pass rate of 3 / 6 = 0.500, Model B has 4 / 6 = 0.667, and the observed lift is 1 / 6 = 0.167, or 16.7 percentage points.
Run the calculation rather than trusting a headline.
```python
model_a = [1, 0, 1, 0, 1, 0]
model_b = [1, 1, 1, 0, 1, 0]

rate_a = sum(model_a) / len(model_a)
rate_b = sum(model_b) / len(model_b)
lift = rate_b - rate_a

print(f"Model A pass@1: {rate_a:.3f}")
print(f"Model B pass@1: {rate_b:.3f}")
print(f"observed lift: {lift:+.3f} ({lift * 100:+.1f} percentage points)")
```

```text
Model A pass@1: 0.500
Model B pass@1: 0.667
observed lift: +0.167 (+16.7 percentage points)
```

A paired evaluation preserves more information than two totals. Subtract A from B for each task:
| Result on one task | Difference B - A | Count |
|---|---|---|
| both pass | 0 | 3 |
| both fail | 0 | 2 |
| B passes, A fails | +1 | 1 |
| A passes, B fails | -1 | 0 |
Only disagreement tasks tell you which model won. Five ties make the benchmark look larger without giving any directional evidence.
```python
model_a = [1, 0, 1, 0, 1, 0]
model_b = [1, 1, 1, 0, 1, 0]

differences = [b - a for a, b in zip(model_a, model_b)]
b_wins = differences.count(1)
a_wins = differences.count(-1)
ties = differences.count(0)

print("paired differences:", differences)
print(f"B wins={b_wins}, A wins={a_wins}, ties={ties}")
print("directional evidence comes from disagreements:", b_wins + a_wins)
```

```text
paired differences: [0, 1, 0, 0, 0, 0]
B wins=1, A wins=0, ties=5
directional evidence comes from disagreements: 1
```

The null hypothesis says that, on disagreement tasks, Model A and Model B are equally likely to win. The directional alternative says Model B wins more often.
Under that null hypothesis, each disagreement is like a fair coin: B wins or A wins. Our six-task benchmark has one disagreement, and it went to B. A one-sided p-value for the planned claim "B is better" asks:
If both models were equally likely to win a disagreement, how often would B win at least this many of the disagreements?
With one disagreement, B wins it with probability 0.5. That result isn't rare. The sample moved upward, but the evidence is weak.
This exact coin calculation is easy to implement with the binomial coefficients you already know.
```python
from math import comb

def b_wins_one_sided_p_value(b_wins: int, a_wins: int) -> float:
    if b_wins < 0 or a_wins < 0:
        raise ValueError("win counts must be nonnegative")
    disagreements = b_wins + a_wins
    if disagreements == 0:
        return 1.0
    tail_count = sum(comb(disagreements, wins) for wins in range(b_wins, disagreements + 1))
    return tail_count / (2 ** disagreements)

print(f"one B win, zero A wins: p={b_wins_one_sided_p_value(1, 0):.3f}")
print(f"thirteen B wins, three A wins: p={b_wins_one_sided_p_value(13, 3):.4f}")
```

```text
one B win, zero A wins: p=0.500
thirteen B wins, three A wins: p=0.0106
```

The second line shows why more disagreement evidence matters. A result with 13 B wins and 3 A wins under a predeclared directional question is much harder to explain with an equal-win coin.
If you had planned to detect a difference in either direction, you would use a two-sided version instead. Pick the question before seeing the winning direction.
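A minimal sketch of that two-sided version, assuming the same fair-coin null on disagreement tasks. The name `two_sided_sign_p_value` is hypothetical. Doubling the more extreme tail and capping the result at 1 is one common convention for a symmetric exact test, not something prescribed by this lesson.

```python
from math import comb


def two_sided_sign_p_value(b_wins: int, a_wins: int) -> float:
    """Exact two-sided sign test on disagreement tasks (fair-coin null)."""
    if b_wins < 0 or a_wins < 0:
        raise ValueError("win counts must be nonnegative")
    disagreements = b_wins + a_wins
    if disagreements == 0:
        return 1.0  # no disagreements, so no directional evidence either way
    # The fair-coin null is symmetric, so count outcomes at least as extreme
    # as the larger win count, double that tail, and cap the result at 1.
    extreme = max(b_wins, a_wins)
    tail_count = sum(comb(disagreements, wins) for wins in range(extreme, disagreements + 1))
    return min(1.0, 2 * tail_count / 2 ** disagreements)


print(f"one B win, zero A wins: p={two_sided_sign_p_value(1, 0):.3f}")
print(f"thirteen B wins, three A wins: p={two_sided_sign_p_value(13, 3):.4f}")
```

Under this convention, the two examples print p=1.000 and p=0.0213, twice their one-sided values. The function gives the same answer whichever model is ahead, which is the point of committing to the two-sided question in advance.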
A confidence interval should target the decision. If your question is "How much better is B than A on the same tasks?", calculate an interval for the paired lift B - A. Comparing two separate pass-rate intervals can hide the pairing and isn't the right decision rule.
A paired bootstrap interval repeatedly resamples complete task rows with replacement. Each resample keeps Model A's outcome beside Model B's outcome, then recomputes the mean difference. The bootstrap is a general resampling method for estimating sampling uncertainty from observed data.[1]
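To make row resampling concrete, here is a small standard-library sketch on the six-task table. The seed, the resample count, and the simple sorted-index percentile rule are illustrative assumptions, not recommended settings.

```python
import random

# Six-task results from the table above; each row keeps (Model A, Model B) together.
rows = [(1, 1), (0, 1), (1, 1), (0, 0), (1, 1), (0, 0)]

rng = random.Random(7)  # arbitrary seed, used only so the run is repeatable
bootstrap_lifts = []
for _ in range(5_000):
    # Resample whole rows with replacement so the pairing is never broken.
    sample = rng.choices(rows, k=len(rows))
    bootstrap_lifts.append(sum(b - a for a, b in sample) / len(sample))

# Crude percentile interval: read the 2.5% and 97.5% positions of the sorted lifts.
bootstrap_lifts.sort()
low = bootstrap_lifts[int(0.025 * len(bootstrap_lifts))]
high = bootstrap_lifts[int(0.975 * len(bootstrap_lifts)) - 1]

print(f"observed lift: {sum(b - a for a, b in rows) / len(rows):+.3f}")
print(f"approximate 95% percentile interval: {low:+.3f} to {high:+.3f}")
print("distinct resampled lifts:", sorted({round(lift, 3) for lift in bootstrap_lifts}))
```

Because this sample contains no task where A beats B, no resample can produce a negative lift, and every resampled lift is a multiple of 1/6. The interval reflects how coarse the data is as much as it reflects the models.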
Six tasks are useful for intuition but painfully sparse. Use a slightly larger illustrative evaluation with 40 paired tasks:
| Paired outcome | Tasks |
|---|---|
| both pass | 17 |
| both fail | 10 |
| B passes, A fails | 8 |
| A passes, B fails | 5 |
The observed lift is (8 - 5) / 40 = 0.075, or +7.5 percentage points. Bootstrap the paired differences to see how unstable that lift remains.
```python
import numpy as np

differences = np.array([0] * 27 + [1] * 8 + [-1] * 5)
rng = np.random.default_rng(7)

resamples = rng.choice(differences, size=(20_000, differences.size), replace=True)
bootstrap_lifts = resamples.mean(axis=1)
low, high = np.quantile(bootstrap_lifts, [0.025, 0.975])

print(f"observed paired lift: {differences.mean() * 100:+.1f} percentage points")
print(f"approximate 95% bootstrap interval: {low * 100:+.1f} to {high * 100:+.1f} points")
print("interval includes zero:", low <= 0 <= high)
```

```text
observed paired lift: +7.5 percentage points
approximate 95% bootstrap interval: -10.0 to +25.0 points
interval includes zero: True
```
Bootstrap intervals are approximate, especially with tiny or highly discrete samples. They help you see uncertainty; they don't turn weak evidence into certainty. Here the correct statement is: "B gained 7.5 points in this paired sample, and the interval still includes zero."
Write that conclusion as a rule your evaluation report can enforce.
```python
def comparison_claim(observed_lift: float, interval: tuple[float, float]) -> str:
    low, high = interval
    if low > high:
        raise ValueError("interval low must not exceed high")
    if low > 0:
        return f"evidence of improvement: estimated lift {observed_lift:+.3f}"
    if high < 0:
        return f"evidence of regression: estimated lift {observed_lift:+.3f}"
    return f"inconclusive: estimated lift {observed_lift:+.3f}, interval crosses zero"

print(comparison_claim(0.075, (-0.100, 0.250)))
print(comparison_claim(0.075, (0.010, 0.140)))
```

```text
inconclusive: estimated lift +0.075, interval crosses zero
evidence of improvement: estimated lift +0.075
```

pass@k asks a different question

So far, each task used one candidate completion from each model. A coding assistant can also generate several candidate functions and let an evaluator check whether at least one passes hidden tests. That is the question measured by pass@k in functional code-generation evaluations such as HumanEval.[2]
Consider three order-operations tasks with three sampled completions apiece:
| Task | Attempt 1 | Attempt 2 | Attempt 3 | pass@1 | pass@3 |
|---|---|---|---|---|---|
| return eligibility | pass | fail | fail | 1 | 1 |
| carrier ETA fallback | fail | fail | pass | 0 | 1 |
| refund split rounding | fail | fail | fail | 0 | 0 |
In this ordered toy table, pass@1 records whether Attempt 1 passed. pass@3 asks whether any of three completions passed. In a larger benchmark pool, pass@1 estimates success for one randomly selected candidate under the fixed sampling protocol. The model isn't being awarded the same budget in the two columns.
```python
attempts = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]

pass_at_1 = sum(row[0] for row in attempts) / len(attempts)
pass_at_3 = sum(any(row[:3]) for row in attempts) / len(attempts)

print(f"pass@1: {pass_at_1:.3f}")
print(f"pass@3: {pass_at_3:.3f}")
# round() rather than int(): int() truncates, so a float error just below 1.0 would print 0.
print("extra solved tasks from extra attempts:", round((pass_at_3 - pass_at_1) * len(attempts)))
```

```text
pass@1: 0.333
pass@3: 0.667
extra solved tasks from extra attempts: 1
```

This is why a report must publish k, the number of generated samples, the decoding policy, and the tests used to judge correctness. A higher score under a larger attempt budget isn't evidence of stronger single-candidate behavior.
Suppose the assistant generates n = 10 candidate implementations for one function and hidden tests accept c = 2 of them. You want the expected pass@5 result if you select five candidates from that pool.
Counting success cases is tedious. Count the failure case instead:
1. 10 - 2 = 8 failing candidates.
2. C(10, 5) = 252 ways to choose five candidates.
3. C(8, 5) = 56 all-failing choices.
4. 1 - 56 / 252 = 0.778.

In notation:

$$
\text{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}
$$
The formula estimates the chance that a random group of k samples includes at least one correct completion. Chen and colleagues use this unbiased estimator for HumanEval evaluation after generating more samples per task than the reported k.[2]
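One way to convince yourself that the counting argument is right is to enumerate every group directly. The sketch below assumes a hypothetical pool in which the first c candidates pass; which positions pass does not change the count.

```python
from itertools import combinations
from math import comb

n, c, k = 10, 2, 5
# Hypothetical pool: the first c candidates pass (1), the rest fail (0).
pool = [1] * c + [0] * (n - c)

# Enumerate every k-sized group of candidate indices and count the all-failing ones.
groups = list(combinations(range(n), k))
all_failing = sum(1 for group in groups if not any(pool[i] for i in group))

brute_force = 1 - all_failing / len(groups)
closed_form = 1 - comb(n - c, k) / comb(n, k)

print("groups enumerated:", len(groups))
print("all-failing groups:", all_failing)
print(f"brute force: {brute_force:.3f}, closed form: {closed_form:.3f}")
```

The enumeration finds 252 groups, 56 of them all-failing, and both routes give 0.778. Enumeration is only practical for small pools, which is why the closed form is the one you use in practice.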
```python
from math import comb

n = 10
c = 2
k = 5
all_groups = comb(n, k)
all_failing_groups = comb(n - c, k)
score = 1 - all_failing_groups / all_groups

print("all groups:", all_groups)
print("all-failing groups:", all_failing_groups)
print(f"pass@5: {score:.3f}")
```

```text
all groups: 252
all-failing groups: 56
pass@5: 0.778
```

Two boundary checks should feel right:

- If c = 0, no selected group can pass.
- If fewer than k failures exist, every k-sized group contains at least one passing candidate.

For larger values of n, avoid computing enormous binomial coefficients. The HumanEval paper gives an equivalent product implementation that stays numerically well behaved.[2]
```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled candidates, c of which passed."""
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n, and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every k-sized group contains a passing candidate.
        return 1.0
    # Product form of C(n - c, k) / C(n, k); avoids large binomial coefficients.
    failure_probability = np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
    return float(1.0 - failure_probability)

print(f"n=10, c=2, k=1: {pass_at_k(10, 2, 1):.3f}")
print(f"n=10, c=2, k=5: {pass_at_k(10, 2, 5):.3f}")
print(f"no passing candidates: {pass_at_k(10, 0, 5):.3f}")
print(f"not enough failures: {pass_at_k(10, 8, 5):.3f}")

try:
    pass_at_k(10, 11, 5)
except ValueError as error:
    print(error)
```

```text
n=10, c=2, k=1: 0.200
n=10, c=2, k=5: 0.778
no passing candidates: 0.000
not enough failures: 1.000
require n > 0, 0 <= c <= n, and 1 <= k <= n
```

A benchmark score averages task-level results. One difficult refund function counts as one task; it shouldn't disappear beneath hundreds of samples from an easier address formatter.
Assume four tasks each produce n = 10 candidates, with the following correct-count vector:
```text
[0, 1, 2, 4]
```

Compute each task's estimate first, then average those four estimates.
```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n, and 1 <= k <= n")
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

correct_counts = [0, 1, 2, 4]
for k in (1, 3, 5):
    task_scores = [pass_at_k(10, correct, k) for correct in correct_counts]
    print(f"pass@{k}: {np.mean(task_scores):.3f} per-task={[round(score, 3) for score in task_scores]}")
```

```text
pass@1: 0.175 per-task=[0.0, 0.1, 0.2, 0.4]
pass@3: 0.417 per-task=[0.0, 0.3, 0.533, 0.833]
pass@5: 0.563 per-task=[0.0, 0.5, 0.778, 0.976]
```

Notice that pass@5 can rise dramatically while pass@1 remains modest. That is useful information if your product can test several generated candidates, but it isn't a substitute for measuring single-candidate behavior under the same protocol.
pass@k only supports a comparison when the protocol matches. Keep the same task set, hidden tests, number of generated samples n, retained attempt budget k, and decoding rule.
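A report pipeline can check this before it prints any comparison. The sketch below is one possible guard. The field names, the record values, and the `protocol_mismatches` helper are all hypothetical.

```python
# Hypothetical field names; adapt them to whatever your evaluation harness records.
PROTOCOL_FIELDS = ("task_set", "hidden_tests", "samples_per_task", "reported_k", "decoding")


def protocol_mismatches(protocol_a: dict, protocol_b: dict) -> list[str]:
    """Return the protocol fields on which two evaluation runs differ."""
    # Direct indexing raises KeyError if a run failed to record a required field.
    return [field for field in PROTOCOL_FIELDS if protocol_a[field] != protocol_b[field]]


# Illustrative records only; none of these values come from a real evaluation.
run_a = {
    "task_set": "order-ops-v1",
    "hidden_tests": "suite-v1",
    "samples_per_task": 10,
    "reported_k": 1,
    "decoding": "stochastic sampling, fixed settings",
}
run_b = {**run_a, "reported_k": 5}

mismatches = protocol_mismatches(run_a, run_b)
if mismatches:
    print("not comparable; differing fields:", mismatches)
else:
    print("protocols match; scores can be compared")
```

Here the two runs differ only in `reported_k`, so the guard refuses the comparison and names the field. Raising on a missing field is a deliberate choice in this sketch: an unrecorded setting is treated as a protocol problem, not as a match.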
A deterministic decoder illustrates the trap. If every generated candidate is identical for a task, extra attempt slots can't discover a different correct solution. In a controlled deterministic fixture, pass@5 collapses to pass@1.
```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n, and 1 <= k <= n")
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

identical_failed_candidates = [0] * 10
identical_passing_candidates = [1] * 10

for name, outcomes in [
    ("same failed candidate", identical_failed_candidates),
    ("same passing candidate", identical_passing_candidates),
]:
    correct = sum(outcomes)
    print(name, f"pass@1={pass_at_k(10, correct, 1):.1f}", f"pass@5={pass_at_k(10, correct, 5):.1f}")
```

```text
same failed candidate pass@1=0.0 pass@5=0.0
same passing candidate pass@1=1.0 pass@5=1.0
```

Real serving stacks can introduce nondeterminism from outside the sampling policy, so record the actual decoder and run settings. The lesson is operational: multiple attempts only buy search when they produce meaningfully different candidates.
A second failure is subtler: generated code that passes weak hidden tests can still be wrong on missing cases or unsafe to execute. Unit-test passing measures functional correctness under that test suite. It doesn't authorize running untrusted code in a production environment. HumanEval's authors evaluated generated code in a sandbox for that reason.[2]
Put the pieces together into one report for a model-review meeting. The first comparison measures a paired pass@1 lift. The second metric measures multi-attempt capability under an explicitly recorded stochastic sampling setup.
```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n, and 1 <= k <= n")
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

model_a = np.array([1] * 17 + [0] * 10 + [0] * 8 + [1] * 5)
model_b = np.array([1] * 17 + [0] * 10 + [1] * 8 + [0] * 5)
paired_lift = float((model_b - model_a).mean())

correct_counts_b = [0, 1, 2, 4]
pass5 = float(np.mean([pass_at_k(10, correct, 5) for correct in correct_counts_b]))
protocol = {
    "paired_tasks": 40,
    "metric": "hidden-test functional correctness",
    "samples_per_task": 10,
    "reported_k": 5,
    "decoding": "stochastic sampling, fixed settings for both models",
}

print(f"paired pass@1 lift: {paired_lift * 100:+.1f} percentage points")
print(f"Model B pass@5 on candidate pool: {pass5:.3f}")
for key, value in protocol.items():
    print(f"{key}: {value}")
```

```text
paired pass@1 lift: +7.5 percentage points
Model B pass@5 on candidate pool: 0.563
paired_tasks: 40
metric: hidden-test functional correctness
samples_per_task: 10
reported_k: 5
decoding: stochastic sampling, fixed settings for both models
```

Before anyone calls a winner, your report still needs:
Statistical significance and product value answer different questions. Even convincing statistical evidence can't tell you whether a gain justifies additional candidate generation, test execution, latency, or risk.
A teammate writes: "Model B is better because its pass@5 is 64%, while Model A's pass@1 is 58%."
Write a review comment with three corrections:
k, n, task set, tests, and decoding policy for both models.

A good rewritten claim would sound like this:
Under the same 200 paired order-operations tasks and fixed pass@1 protocol, Model B improved hidden-test pass rate by 2.5 percentage points; the paired interval and cost guardrails are reported below. pass@5 is listed separately because it measures a larger candidate-search budget.
pass@k estimator

n, and k, then separates statistical evidence from deployment value.

pass@5 is compared with pass@1. Cause: candidate-search budgets differ. Fix: compare matched protocols and label multi-attempt capability separately.