
Hypothesis Tests, Intervals, and pass@k

Compare an order-operations coding assistant with paired evidence, uncertainty for lift, and pass@k under a fixed sampling budget.


In the previous lesson, you simulated resolution outcomes and saw why random samples wobble. Now imagine an e-commerce engineering team evaluating a coding assistant. It must write small functions for refunds, return labels, delivery promises, and carrier scans. Hidden tests mark each generated function as pass or fail.

Model B passes one more task than Model A on a six-task evaluation. Is that enough evidence to replace the old model? This chapter teaches how to answer without confusing a promising result with a proven improvement.

[Figure: Evaluation flow for an order-operations coding assistant: paired tasks, an interval for Model B minus Model A, and pass@k under a fixed sampling budget.]
A trustworthy comparison holds tasks and decoding rules fixed, measures the paired lift, and reports each multi-attempt score with its sampling budget.

The numbers below are deliberately small so you can calculate them by hand. They teach the method, not a production launch threshold.

Start with paired pass or fail outcomes

Each prompt asks for one Python helper used in an order workflow. Both models receive the same prompt and are checked by the same hidden tests.

| Task | Hidden-test requirement | Model A | Model B |
|---|---|---|---|
| 1 | refund deadline is inclusive | pass | pass |
| 2 | damaged-item return label routing | fail | pass |
| 3 | late-delivery credit cap | pass | pass |
| 4 | carrier scan deduplication | fail | fail |
| 5 | partial-refund rounding | pass | pass |
| 6 | backorder cancellation window | fail | fail |

Convert pass to 1 and fail to 0. Then Model A has a pass rate of 3 / 6 = 0.500, Model B has 4 / 6 = 0.667, and the observed lift is 1 / 6 = 0.167, or 16.7 percentage points.

Run the calculation rather than trusting a headline.

paired-pass-rates.py
model_a = [1, 0, 1, 0, 1, 0]
model_b = [1, 1, 1, 0, 1, 0]

rate_a = sum(model_a) / len(model_a)
rate_b = sum(model_b) / len(model_b)
lift = rate_b - rate_a

print(f"Model A pass@1: {rate_a:.3f}")
print(f"Model B pass@1: {rate_b:.3f}")
print(f"observed lift: {lift:+.3f} ({lift * 100:+.1f} percentage points)")

Output

Model A pass@1: 0.500
Model B pass@1: 0.667
observed lift: +0.167 (+16.7 percentage points)

A paired evaluation preserves more information than two totals. Subtract A from B for each task:

| Result on one task | Difference B - A | Count |
|---|---|---|
| both pass | 0 | 3 |
| both fail | 0 | 2 |
| B passes, A fails | +1 | 1 |
| A passes, B fails | -1 | 0 |

Only disagreement tasks tell you which model won. Five ties make the benchmark look larger without giving any directional evidence.

count-disagreements.py
model_a = [1, 0, 1, 0, 1, 0]
model_b = [1, 1, 1, 0, 1, 0]

differences = [b - a for a, b in zip(model_a, model_b)]
b_wins = differences.count(1)
a_wins = differences.count(-1)
ties = differences.count(0)

print("paired differences:", differences)
print(f"B wins={b_wins}, A wins={a_wins}, ties={ties}")
print("directional evidence comes from disagreements:", b_wins + a_wins)

Output

paired differences: [0, 1, 0, 0, 0, 0]
B wins=1, A wins=0, ties=5
directional evidence comes from disagreements: 1

A hypothesis test asks how surprising the win is

The null hypothesis says that, on disagreement tasks, Model A and Model B are equally likely to win. The directional alternative says Model B wins more often.

Under that null hypothesis, each disagreement is like a fair coin: B wins or A wins. Our six-task benchmark has one disagreement, and it went to B. A one-sided p-value for the planned claim "B is better" asks:

If both models were equally likely to win a disagreement, how often would B win at least this many of the disagreements?

With one disagreement, B wins it with probability 0.5. That result isn't rare. The sample moved upward, but the evidence is weak.

[Figure: Paired hypothesis-test view with five ties, one win for Model B, zero wins for Model A, and a directional p-value of 0.50.]
For paired binary outcomes, ties don't separate the models. One win out of one disagreement leaves a directional p-value of 0.50, far too weak for a model-replacement claim.

This exact coin calculation is easy to implement with the binomial coefficients you already know.

exact-directional-p-value.py
from math import comb

def b_wins_one_sided_p_value(b_wins: int, a_wins: int) -> float:
    if b_wins < 0 or a_wins < 0:
        raise ValueError("win counts must be nonnegative")
    disagreements = b_wins + a_wins
    if disagreements == 0:
        return 1.0
    tail_count = sum(comb(disagreements, wins) for wins in range(b_wins, disagreements + 1))
    return tail_count / (2 ** disagreements)

print(f"one B win, zero A wins: p={b_wins_one_sided_p_value(1, 0):.3f}")
print(f"thirteen B wins, three A wins: p={b_wins_one_sided_p_value(13, 3):.4f}")

Output

one B win, zero A wins: p=0.500
thirteen B wins, three A wins: p=0.0106

The second line shows why more disagreement evidence matters. A result with 13 B wins and 3 A wins under a predeclared directional question is much harder to explain with an equal-win coin.
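To get a feel for how much disagreement evidence a directional claim needs, the sketch below tabulates the one-sided p-value when B wins every disagreement. The helper name and the 0.05 threshold are assumptions for illustration (a conventional cutoff, not a value this lesson prescribes).

```python
# Hypothetical helper: one-sided p-value when B wins all m disagreements.
# Under the equal-win null each disagreement is a fair coin, so p = 0.5 ** m.
def unanimous_p_value(disagreements: int) -> float:
    return 0.5 ** disagreements

ALPHA = 0.05  # assumed conventional threshold, not taken from the lesson

unanimous_p_values = {m: unanimous_p_value(m) for m in range(1, 8)}
smallest_m = min(m for m, p in unanimous_p_values.items() if p < ALPHA)

for m, p in unanimous_p_values.items():
    print(f"{m} disagreements, all won by B: p={p:.4f}")
print("fewest unanimous disagreements below the threshold:", smallest_m)
```

Even a clean sweep needs five disagreements before the p-value drops below that assumed threshold, which is why a single disagreement can't settle the question.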

If you had planned to detect a difference in either direction, you would use a two-sided version instead. Pick the question before seeing the winning direction.
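A minimal sketch of that two-sided version, assuming the same equal-win null: because the fair-coin distribution is symmetric, it doubles the tail on the more lopsided side and caps the result at 1. This is the exact sign test on disagreements (often called the exact McNemar test for paired binary outcomes); the function name is illustrative.

```python
from math import comb

def two_sided_p_value(b_wins: int, a_wins: int) -> float:
    if b_wins < 0 or a_wins < 0:
        raise ValueError("win counts must be nonnegative")
    disagreements = b_wins + a_wins
    if disagreements == 0:
        return 1.0
    # Tail for whichever model won more often; the null is symmetric.
    extreme = max(b_wins, a_wins)
    tail_count = sum(comb(disagreements, wins) for wins in range(extreme, disagreements + 1))
    # Doubling can exceed 1 when the split is close to even, so cap it.
    return min(1.0, 2 * tail_count / (2 ** disagreements))

p_one_win = two_sided_p_value(1, 0)
p_thirteen_three = two_sided_p_value(13, 3)

print(f"one B win, zero A wins: two-sided p={p_one_win:.3f}")
print(f"thirteen B wins, three A wins: two-sided p={p_thirteen_three:.4f}")
```

The two-sided value is twice the one-sided value for a lopsided split, so choosing the direction after seeing the data quietly halves the reported p-value.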

Put an interval around the quantity you care about

A confidence interval should target the decision. If your question is "How much better is B than A on the same tasks?", calculate an interval for the paired lift B - A. Comparing two separate pass-rate intervals can hide the pairing and isn't the right decision rule.
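A rough sketch of why pairing matters, reusing the six-task outcomes from earlier. It compares a plug-in standard error computed from the paired differences with one that treats the two pass rates as independent samples. The helper names are illustrative, and with only six tasks the numbers are for intuition, not inference.

```python
from math import sqrt

model_a = [1, 0, 1, 0, 1, 0]
model_b = [1, 1, 1, 0, 1, 0]
n = len(model_a)

def plug_in_variance(values):
    # Population-style variance (divide by n); fine for a rough comparison.
    mean = sum(values) / len(values)
    return sum((value - mean) ** 2 for value in values) / len(values)

# Paired view: one difference per task, so shared task difficulty cancels.
differences = [b - a for a, b in zip(model_a, model_b)]
paired_se = sqrt(plug_in_variance(differences) / n)

# Unpaired view: pretends A and B were run on unrelated task samples.
unpaired_se = sqrt((plug_in_variance(model_a) + plug_in_variance(model_b)) / n)

print(f"paired standard error:   {paired_se:.3f}")
print(f"unpaired standard error: {unpaired_se:.3f}")
```

When both models tend to pass and fail the same tasks, the unpaired calculation overstates the uncertainty of the lift because it ignores that agreement.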

A paired bootstrap interval repeatedly resamples complete task rows with replacement. Each resample keeps Model A's outcome beside Model B's outcome, then recomputes the mean difference. The bootstrap is a general resampling method for estimating sampling uncertainty from observed data.[1]

Six tasks are useful for intuition but painfully sparse. Use a slightly larger illustrative evaluation with 40 paired tasks:

| Paired outcome | Tasks |
|---|---|
| both pass | 17 |
| both fail | 10 |
| B passes, A fails | 8 |
| A passes, B fails | 5 |

The observed lift is (8 - 5) / 40 = 0.075, or +7.5 percentage points. Bootstrap the paired differences to see how unstable that lift remains.

paired-bootstrap-interval.py
import numpy as np

differences = np.array([0] * 27 + [1] * 8 + [-1] * 5)
rng = np.random.default_rng(7)

resamples = rng.choice(differences, size=(20_000, differences.size), replace=True)
bootstrap_lifts = resamples.mean(axis=1)
low, high = np.quantile(bootstrap_lifts, [0.025, 0.975])

print(f"observed paired lift: {differences.mean() * 100:+.1f} percentage points")
print(f"approximate 95% bootstrap interval: {low * 100:+.1f} to {high * 100:+.1f} points")
print("interval includes zero:", low <= 0 <= high)

Output

observed paired lift: +7.5 percentage points
approximate 95% bootstrap interval: -10.0 to +25.0 points
interval includes zero: True

[Figure: Paired lift interval for Model B minus Model A, centered at plus 7.5 percentage points and crossing zero.]
Measure the lift directly. An interval that crosses zero says this paired evaluation is still compatible with anything from a small loss to a meaningful gain, so deployment language must stay cautious.

Bootstrap intervals are approximate, especially with tiny or highly discrete samples. They help you see uncertainty; they don't turn weak evidence into certainty. Here the correct statement is: "B gained 7.5 points in this paired sample, and the interval still includes zero."

Write that conclusion as a rule your evaluation report can enforce.

language-from-lift-interval.py
def comparison_claim(observed_lift: float, interval: tuple[float, float]) -> str:
    low, high = interval
    if low > high:
        raise ValueError("interval low must not exceed high")
    if low > 0:
        return f"evidence of improvement: estimated lift {observed_lift:+.3f}"
    if high < 0:
        return f"evidence of regression: estimated lift {observed_lift:+.3f}"
    return f"inconclusive: estimated lift {observed_lift:+.3f}, interval crosses zero"

print(comparison_claim(0.075, (-0.100, 0.250)))
print(comparison_claim(0.075, (0.010, 0.140)))

Output

inconclusive: estimated lift +0.075, interval crosses zero
evidence of improvement: estimated lift +0.075

pass@k asks a different question

So far, each task used one candidate completion from each model. A coding assistant can also generate several candidate functions and let an evaluator check whether at least one passes hidden tests. That is the question measured by pass@k in functional code-generation evaluations such as HumanEval.[2]

Consider three order-operations tasks with three sampled completions apiece:

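| Task | Attempt 1 | Attempt 2 | Attempt 3 | pass@1 | pass@3 |
|---|---|---|---|---|---|
| return eligibility | pass | fail | fail | 1 | 1 |
| carrier ETA fallback | fail | fail | pass | 0 | 1 |
| refund split rounding | fail | fail | fail | 0 | 0 |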

In this ordered toy table, pass@1 records whether Attempt 1 passed. pass@3 asks whether any of three completions passed. In a larger benchmark pool, pass@1 estimates success for one randomly selected candidate under the fixed sampling protocol. The model isn't being awarded the same budget in the two columns.

pass-at-k-from-attempt-table.py
attempts = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]

# Count solved tasks as integers so the final difference needs no float truncation.
solved_at_1 = sum(row[0] for row in attempts)
solved_at_3 = sum(any(row[:3]) for row in attempts)
pass_at_1 = solved_at_1 / len(attempts)
pass_at_3 = solved_at_3 / len(attempts)

print(f"pass@1: {pass_at_1:.3f}")
print(f"pass@3: {pass_at_3:.3f}")
print("extra solved tasks from extra attempts:", solved_at_3 - solved_at_1)

Output

pass@1: 0.333
pass@3: 0.667
extra solved tasks from extra attempts: 1

This is why a report must publish k, the number of generated samples, the decoding policy, and the tests used to judge correctness. A higher score under a larger attempt budget isn't evidence of stronger single-candidate behavior.
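The difference between "Attempt 1 passed" and "one randomly selected candidate passes" can be sketched with the same toy table. The variable names are illustrative. The random-candidate version averages each task's fraction of passing attempts, which is what the pass@k formula in the next section reduces to when k = 1.

```python
attempts = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]

# Ordered view: only the first attempt counts.
first_attempt_pass_at_1 = sum(row[0] for row in attempts) / len(attempts)

# Random-candidate view: per task, the chance that one attempt drawn
# uniformly from the pool passes, then averaged over tasks.
random_candidate_pass_at_1 = sum(sum(row) / len(row) for row in attempts) / len(attempts)

print(f"first-attempt pass@1:    {first_attempt_pass_at_1:.3f}")
print(f"random-candidate pass@1: {random_candidate_pass_at_1:.3f}")
```

The two numbers differ here because the ordering of attempts is arbitrary in a tiny table; a report should say which definition it used.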

Derive the HumanEval estimator by hand

Suppose the assistant generates n = 10 candidate implementations for one function and hidden tests accept c = 2 of them. You want the expected pass@5 result if you select five candidates uniformly at random, without replacement, from that pool.

Counting success cases is tedious. Count the failure case instead:

  1. There are 10 - 2 = 8 failing candidates.
  2. There are C(10, 5) = 252 ways to choose five candidates.
  3. There are C(8, 5) = 56 all-failing choices.
  4. The chance of at least one pass is 1 - 56 / 252 = 0.778.

In notation:

$$\text{pass@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

The formula estimates the chance that a random group of k samples includes at least one correct completion. Chen and colleagues use this unbiased estimator for HumanEval evaluation after generating more samples per task than the reported k.[2]

pass-at-k-hand-calculation.py
from math import comb

n = 10
c = 2
k = 5
all_groups = comb(n, k)
all_failing_groups = comb(n - c, k)
score = 1 - all_failing_groups / all_groups

print("all groups:", all_groups)
print("all-failing groups:", all_failing_groups)
print(f"pass@5: {score:.3f}")

Output

all groups: 252
all-failing groups: 56
pass@5: 0.778
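As a cross-check on the counting argument, a brute-force sketch can enumerate every five-candidate group and count the ones that contain a pass. The candidate list is a stand-in for the n = 10, c = 2 pool; which two positions hold the passing candidates is an arbitrary assumption and doesn't affect the count.

```python
from itertools import combinations

# 1 = passes hidden tests, 0 = fails; two passing candidates out of ten.
candidates = [1, 1] + [0] * 8
k = 5

groups = list(combinations(range(len(candidates)), k))
groups_with_a_pass = sum(any(candidates[i] for i in group) for group in groups)
enumerated_score = groups_with_a_pass / len(groups)

print("groups enumerated:", len(groups))
print("groups with at least one pass:", groups_with_a_pass)
print(f"pass@5 by enumeration: {enumerated_score:.3f}")
```

Enumeration is only practical for small pools, which is the reason to prefer the closed form once n grows.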

Two boundary checks should feel right:

  • When c = 0, no selected group can pass, so pass@k = 0.
  • When fewer than k failures exist, every k-sized group contains at least one passing candidate, so pass@k = 1.

For larger values of n, avoid assembling enormous combinations. The HumanEval paper gives an equivalent product implementation that stays numerically well behaved.[2]

stable-pass-at-k.py
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n, and 1 <= k <= n")
    if n - c < k:
        return 1.0
    failure_probability = np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
    return float(1.0 - failure_probability)

print(f"n=10, c=2, k=1: {pass_at_k(10, 2, 1):.3f}")
print(f"n=10, c=2, k=5: {pass_at_k(10, 2, 5):.3f}")
print(f"no passing candidates: {pass_at_k(10, 0, 5):.3f}")
print(f"not enough failures: {pass_at_k(10, 8, 5):.3f}")

try:
    pass_at_k(10, 11, 5)
except ValueError as error:
    print(error)

Output

n=10, c=2, k=1: 0.200
n=10, c=2, k=5: 0.778
no passing candidates: 0.000
not enough failures: 1.000
require n > 0, 0 <= c <= n, and 1 <= k <= n
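To see what "unbiased" buys you, the sketch below compares this estimator with a plug-in shortcut, 1 - (1 - c/n)^k, by computing each one's exact expected value. It assumes each of the n samples passes independently with a true per-sample success rate p = 0.2; that rate, the independence assumption, and the function names are illustrative, not part of the lesson's data.

```python
from math import comb

def unbiased_pass_at_k(n: int, c: int, k: int) -> float:
    # math.comb returns 0 when k > n - c, which gives pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

def plug_in_pass_at_k(n: int, c: int, k: int) -> float:
    # Shortcut that substitutes the observed rate c / n for the true rate.
    return 1.0 - (1.0 - c / n) ** k

def expected_value(estimator, n: int, k: int, p: float) -> float:
    # Exact expectation over c ~ Binomial(n, p), no simulation needed.
    return sum(
        comb(n, c) * p**c * (1 - p) ** (n - c) * estimator(n, c, k)
        for c in range(n + 1)
    )

n, k, p = 10, 5, 0.2  # p is an assumed true per-sample success rate
true_pass_at_k = 1.0 - (1.0 - p) ** k
expected_unbiased = expected_value(unbiased_pass_at_k, n, k, p)
expected_plug_in = expected_value(plug_in_pass_at_k, n, k, p)

print(f"true pass@{k}:         {true_pass_at_k:.4f}")
print(f"E[unbiased estimate]: {expected_unbiased:.4f}")
print(f"E[plug-in estimate]:  {expected_plug_in:.4f}")
```

Under these assumptions the combinatorial estimator matches the true value on average, while the plug-in shortcut lands below it.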

Average per-task estimates, not raw completions

A benchmark score averages task-level results. One difficult refund function counts as one task; it shouldn't disappear beneath hundreds of samples from an easier address formatter.
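A small sketch of how pooling raw completions can bury a hard task. The task names and sample counts are hypothetical, chosen so the easy task contributes far more completions than the hard one. For k = 1 the per-task estimate is c / n.

```python
# Hypothetical pools: task -> (samples generated n, samples passing c).
pools = {
    "refund_split_rounding": (10, 0),   # hard task, few samples
    "address_formatter": (200, 180),    # easy task, many samples
}

# Pooled view: every completion counts equally, so the easy task dominates.
total_samples = sum(n for n, _ in pools.values())
total_correct = sum(c for _, c in pools.values())
pooled_rate = total_correct / total_samples

# Per-task view: estimate pass@1 for each task, then average the tasks.
per_task_pass_at_1 = {task: c / n for task, (n, c) in pools.items()}
macro_pass_at_1 = sum(per_task_pass_at_1.values()) / len(per_task_pass_at_1)

print(f"pooled completion pass rate: {pooled_rate:.3f}")
print(f"per-task averaged pass@1:    {macro_pass_at_1:.3f}")
```

The pooled figure hides that one of the two tasks was never solved; the per-task average keeps that failure visible.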

Assume four tasks each produce n = 10 candidates, with the following correct-count vector:

[0, 1, 2, 4]

Compute each task's estimate first, then average those four estimates.

macro-average-pass-at-k.py
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n, and 1 <= k <= n")
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

correct_counts = [0, 1, 2, 4]
for k in (1, 3, 5):
    task_scores = [pass_at_k(10, correct, k) for correct in correct_counts]
    print(f"pass@{k}: {np.mean(task_scores):.3f} per-task={[round(score, 3) for score in task_scores]}")

Output

pass@1: 0.175 per-task=[0.0, 0.1, 0.2, 0.4]
pass@3: 0.417 per-task=[0.0, 0.3, 0.533, 0.833]
pass@5: 0.563 per-task=[0.0, 0.5, 0.778, 0.976]

Notice that pass@5 can rise dramatically while pass@1 remains modest. That is useful information if your product can test several generated candidates, but it isn't a substitute for measuring single-candidate behavior under the same protocol.

Failure case: changed protocols create fake wins

pass@k only supports a comparison when the protocol matches. Keep the same task set, hidden tests, number of generated samples n, retained attempt budget k, and decoding rule.

A deterministic decoder illustrates the trap. If every generated candidate is identical for a task, extra attempt slots can't discover a different correct solution. In a controlled deterministic fixture, pass@5 collapses to pass@1.

deterministic-candidates-add-no-search.py
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n, and 1 <= k <= n")
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

identical_failed_candidates = [0] * 10
identical_passing_candidates = [1] * 10

for name, outcomes in [
    ("same failed candidate", identical_failed_candidates),
    ("same passing candidate", identical_passing_candidates),
]:
    correct = sum(outcomes)
    print(name, f"pass@1={pass_at_k(10, correct, 1):.1f}", f"pass@5={pass_at_k(10, correct, 5):.1f}")

Output

same failed candidate pass@1=0.0 pass@5=0.0
same passing candidate pass@1=1.0 pass@5=1.0

Real serving stacks can introduce nondeterminism from outside the sampling policy, so record the actual decoder and run settings. The lesson is operational: multiple attempts only buy search when they produce meaningfully different candidates.
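One way to check whether extra attempts produced different candidates is to count distinct completions per task before reading a pass@k score. This is a minimal sketch: the task names and source strings are hypothetical, and comparing whitespace-normalized text is a crude proxy that treats renamed variables as different and can't tell whether two texts behave the same.

```python
# Hypothetical candidate pool: task name -> generated source strings.
candidates_by_task = {
    "return_eligibility": [
        "return days <= 30",
        "return  days <= 30",
        "return days <= 30",
    ],
    "carrier_eta_fallback": [
        "return eta or default",
        "return default if eta is None else eta",
        "return eta or default",
    ],
}

def distinct_candidates(candidates: list[str]) -> int:
    # Collapse whitespace so formatting-only differences don't count.
    return len({" ".join(candidate.split()) for candidate in candidates})

diversity = {task: distinct_candidates(pool) for task, pool in candidates_by_task.items()}

for task, count in diversity.items():
    print(f"{task}: {count} distinct of {len(candidates_by_task[task])} candidates")
```

A task whose pool collapses to a single distinct candidate gains nothing from a larger k, whatever the headline score says.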

A second failure is subtler: generated code that passes weak hidden tests can still be wrong on missing cases or unsafe to execute. Unit-test passing measures functional correctness under that test suite. It doesn't authorize running untrusted code in a production environment. HumanEval's authors evaluated generated code in a sandbox for that reason.[2]

Build an evaluation report

Put the pieces together into one report for a model-review meeting. The first comparison measures a paired pass@1 lift. The second metric measures multi-attempt capability under an explicitly recorded stochastic sampling setup.

evaluation-report.py
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n, and 1 <= k <= n")
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

model_a = np.array([1] * 17 + [0] * 10 + [0] * 8 + [1] * 5)
model_b = np.array([1] * 17 + [0] * 10 + [1] * 8 + [0] * 5)
paired_lift = float((model_b - model_a).mean())

correct_counts_b = [0, 1, 2, 4]
pass5 = float(np.mean([pass_at_k(10, correct, 5) for correct in correct_counts_b]))
protocol = {
    "paired_tasks": 40,
    "metric": "hidden-test functional correctness",
    "samples_per_task": 10,
    "reported_k": 5,
    "decoding": "stochastic sampling, fixed settings for both models",
}

print(f"paired pass@1 lift: {paired_lift * 100:+.1f} percentage points")
print(f"Model B pass@5 on candidate pool: {pass5:.3f}")
for key, value in protocol.items():
    print(f"{key}: {value}")

Output

paired pass@1 lift: +7.5 percentage points
Model B pass@5 on candidate pool: 0.563
paired_tasks: 40
metric: hidden-test functional correctness
samples_per_task: 10
reported_k: 5
decoding: stochastic sampling, fixed settings for both models

Before anyone calls a winner, your report still needs:

  • an interval for the paired lift, not two unrelated headline intervals
  • the directional or two-sided hypothesis question chosen before inspecting results
  • task and hidden-test quality checks
  • latency, token cost, and sandboxing constraints for any deployment decision

Statistical significance and product value answer different questions. Even convincing statistical evidence can't tell you whether a gain justifies additional candidate generation, test execution, latency, or risk.

Practice: review a benchmark claim

A teammate writes: "Model B is better because its pass@5 is 64%, while Model A's pass@1 is 58%."

Write a review comment with three corrections:

  1. Request the same k, n, task set, tests, and decoding policy for both models.
  2. Ask for the paired lift and its uncertainty interval under that matched protocol.
  3. Ask whether added generation and test-execution budget is acceptable for the product.

A good rewritten claim would sound like this:

Under the same 200 paired order-operations tasks and fixed pass@1 protocol, Model B improved hidden-test pass rate by 2.5 percentage points; the paired interval and cost guardrails are reported below. pass@5 is listed separately because it measures a larger candidate-search budget.
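The first review correction can be turned into a mechanical check. The sketch below assumes each model's run is described by a small protocol dictionary; the field names and the example values are hypothetical and only mirror the items the review asks to hold fixed.

```python
# Hypothetical protocol fields that must match before scores are compared.
PROTOCOL_FIELDS = ("task_set", "hidden_tests", "samples_per_task", "reported_k", "decoding")

def protocol_mismatches(protocol_a: dict, protocol_b: dict) -> list[str]:
    # A missing field counts as a mismatch unless both sides omit it.
    return [
        field
        for field in PROTOCOL_FIELDS
        if protocol_a.get(field) != protocol_b.get(field)
    ]

# Illustrative values for the teammate's claim: pass@1 versus pass@5.
model_a_protocol = {
    "task_set": "order-ops-v1",
    "hidden_tests": "suite-1",
    "samples_per_task": 1,
    "reported_k": 1,
    "decoding": "stochastic sampling, fixed settings",
}
model_b_protocol = dict(model_a_protocol, samples_per_task=10, reported_k=5)

mismatches = protocol_mismatches(model_a_protocol, model_b_protocol)
if mismatches:
    print("not comparable, mismatched fields:", ", ".join(mismatches))
else:
    print("protocols match; compare the paired lift")
```

Running the check before computing any lift keeps a mismatched comparison from reaching the report at all.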

Mastery check

Key concepts

  • null hypothesis and planned alternative
  • paired task differences
  • directional p-value intuition
  • paired bootstrap interval
  • functional correctness
  • unbiased pass@k estimator
  • sampling protocol and practical significance

Evaluation rubric

  • Foundational: Computes paired pass rates and explains why disagreement tasks carry directional evidence.
  • Intermediate: Implements a paired bootstrap interval and the numerically stable pass@k estimator.
  • Advanced: Designs a comparison report that fixes task set, hidden tests, decoding policy, n, and k, then separates statistical evidence from deployment value.

Common pitfalls

  • Symptom: A one-task win becomes a model-replacement announcement. Cause: ties and sample size were ignored. Fix: count paired disagreements and report inference conservatively.
  • Symptom: Two model intervals are compared by eye. Cause: paired task structure was discarded. Fix: bootstrap or test the paired lift directly.
  • Symptom: pass@5 is compared with pass@1. Cause: candidate-search budgets differ. Fix: compare matched protocols and label multi-attempt capability separately.
  • Symptom: Extra attempts never improve results. Cause: the candidates are duplicates under a deterministic setup. Fix: record and inspect sampling diversity before interpreting higher k.
  • Symptom: Passing generated code is treated as production-safe code. Cause: unit-test correctness was confused with execution safety. Fix: keep generated code sandboxed and review test coverage.

Mastery Check


1. Two models are tested on the same six tasks. Model A outcomes are [1, 0, 1, 0, 1, 0] and Model B outcomes are [1, 1, 1, 0, 1, 0], where 1 means pass. Which conclusion is supported?
2. A paired binary comparison has 16 disagreement tasks: B wins 13 and A wins 3. The claim 'B is better' was chosen before seeing the data. Under the null that either model is equally likely to win each disagreement, what one-sided p-value should be used?
3. A 40-task paired evaluation has 17 both pass, 10 both fail, 8 B-only passes, and 5 A-only passes. A paired bootstrap interval for B - A is -10.0 to +25.0 percentage points. Which procedure and conclusion match the paired design?
4. For three tasks, a model's ordered attempts are [pass, fail, fail], [fail, fail, pass], and [fail, fail, fail]. What are pass@1 and pass@3 for this fixed ordered sample, and what does the difference mean?
5. For one HumanEval-style task, a model generated n = 10 candidate implementations and c = 2 passed hidden tests. What is the pass@5 estimate for selecting five candidates from that pool?
6. Four tasks each have n = 10 candidates, with correct counts [0, 1, 2, 4]. How should the benchmark pass@5 score be computed?
7. A teammate writes, 'Model B is better because its pass@5 is 64%, while Model A's pass@1 is 58%.' Which review comment identifies the flaw and requests a fair comparison?
8. A deterministic decoder returns ten identical completions for a task. The shared completion either passes every time or fails every time. How should pass@5 compare with pass@1?
9. A generated function passes every hidden test and is being considered for execution on production data. What conclusion is justified?


You can now reason about gradients, vectors, and uncertainty, then judge whether evaluation results support a model claim. Next you will turn that math into a trainable neural network, so tensors, loss, and prediction errors start feeling like one connected system.

References

[1] Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics.

[2] Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code (HumanEval). arXiv preprint.