Statistics and Uncertainty

Estimate fraud risk in a flagged review queue from finite labels, using bootstrap intuition, score intervals, sampling bias checks, and calibrated reporting.


The last chapter counted every order in an imagined detector population and found that a flagged order had about 16 percent fraud risk. A live e-commerce system doesn't hand you that answer. You see a queue of flagged orders, pay reviewers to label a sample, and estimate the risk for the queue you haven't reviewed yet.

That difference is statistics: probability reasons from known rates; statistics reasons from measured samples back toward unknown rates.

[Figure: Four-stage review-queue statistics flow. 16 fraud cases among 100 reviewed flags form a 0.16 point estimate, bootstrap resamples wobble around that center, a Wilson score interval spans 0.101 to 0.244, and the report pairs rate, interval, and slice coverage.]
A reviewed sample produces an estimate, repeated-sample reasoning supplies uncertainty, and representative slices determine whether the estimate supports a production decision.

Estimate the quantity you actually need

Keep the same event from the probability chapter:

| Name | Meaning here |
| --- | --- |
| Population | every flagged order entering the manual-review queue this week |
| Sample | 100 flagged orders selected at random and reviewed by humans |
| Success | reviewer confirms the order is fraudulent |
| Unknown rate | $P(\text{fraudulent} \mid \text{flagged})$, also called review-queue precision |

Suppose reviewers confirm fraud in 16 of the 100 sampled flagged orders.

$$\hat{p} = \frac{16}{100} = 0.16$$

The hat on $\hat{p}$ says "estimate." It doesn't say the full queue is exactly 16 percent fraud. It says 16 percent is the point estimate computed from this sample.

This is deliberately not overall accuracy. If only 1 percent of all orders are fraudulent, a useless model that labels every order legitimate is 99 percent accurate. For rare events, report the metric attached to the product decision: here, how much of the costly review queue is worthwhile.

estimate-flagged-queue-precision.py

    reviewed_flags = 100
    confirmed_fraud = 16

    precision_estimate = confirmed_fraud / reviewed_flags
    always_legitimate_accuracy = 0.99

    print(f"review queue: {confirmed_fraud}/{reviewed_flags} confirmed fraud")
    print(f"estimated P(fraud | flagged): {precision_estimate:.1%}")
    print(f"misleading rare-event baseline accuracy: {always_legitimate_accuracy:.1%}")

    assert precision_estimate == 0.16

Queue point estimate

    review queue: 16/100 confirmed fraud
    estimated P(fraud | flagged): 16.0%
    misleading rare-event baseline accuracy: 99.0%
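To make the accuracy trap concrete, here is a minimal sketch with hypothetical counts (10,000 orders, 100 of them fraudulent; both numbers are assumptions chosen only to match the 1 percent base rate above). It scores a detector that labels every order legitimate:

```python
# Hypothetical counts chosen to match the 1 percent base rate in the text.
total_orders = 10_000
fraud_orders = 100
legit_orders = total_orders - fraud_orders

# An always-legitimate detector never flags anything.
true_positives = 0
false_positives = 0
false_negatives = fraud_orders
true_negatives = legit_orders

accuracy = (true_positives + true_negatives) / total_orders
recall = true_positives / (true_positives + false_negatives)
flagged = true_positives + false_positives

print(f"accuracy: {accuracy:.1%}")
print(f"fraud recall: {recall:.1%}")
print(f"orders sent to review: {flagged}")
```

The accuracy looks strong while recall is zero and the review queue is empty, which is why the lesson reports the metric tied to the review decision instead.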

A new sample gives a new estimate

If the true queue risk were 16 percent, random batches of 100 reviewed flags wouldn't all contain exactly 16 fraudulent orders. One might contain 12, another 19, another 17. That movement isn't model drift. It is sampling variation.

Use a seeded simulation only to make the idea visible. The rate is fixed at 0.16; only the sampled orders change.

watch-sample-estimates-move.py

    import numpy as np

    rng = np.random.default_rng(17)
    confirmed_counts = rng.binomial(n=100, p=0.16, size=8)
    estimates = confirmed_counts / 100

    print("confirmed fraud counts:", confirmed_counts.tolist())
    print("precision estimates: ", [float(round(value, 2)) for value in estimates])
    print(f"range across batches: {estimates.min():.2f} to {estimates.max():.2f}")

Repeated reviewed batches

    confirmed fraud counts: [20, 12, 16, 15, 13, 15, 15, 17]
    precision estimates:  [0.2, 0.12, 0.16, 0.15, 0.13, 0.15, 0.15, 0.17]
    range across batches: 0.12 to 0.20

This is the central problem: one measured rate is an answer about one sample. A production claim needs both the center and how much that center can move.

Bootstrap makes the movement visible

In real work you don't know the true queue risk used by the simulation above. You only have the reviewed labels: sixteen 1 values and eighty-four 0 values.

The bootstrap repeatedly samples 100 rows with replacement from those observed labels. It asks how the estimate moves when your measured sample is treated as the best available stand-in for the population. It is not new evidence; it is a way to inspect sampling variability from evidence you already collected.[1][2]

bootstrap-review-queue-precision.py

    import numpy as np

    reviewed_labels = np.array([1] * 16 + [0] * 84, dtype=float)
    rng = np.random.default_rng(7)

    resampled_estimates = np.array([
        rng.choice(reviewed_labels, size=len(reviewed_labels), replace=True).mean()
        for _ in range(5_000)
    ])

    low, high = np.quantile(resampled_estimates, [0.025, 0.975])

    print(f"sample estimate: {reviewed_labels.mean():.3f}")
    print(f"bootstrap 95% range: [{low:.3f}, {high:.3f}]")
    print(f"resampled min and max: {resampled_estimates.min():.3f}, {resampled_estimates.max():.3f}")

Bootstrap precision interval

    sample estimate: 0.160
    bootstrap 95% range: [0.090, 0.230]
    resampled min and max: 0.050, 0.300

A single point estimate hides this spread: the bootstrap 95 percent range runs from about 9 to 23 percent. A queue that measured 16 percent fraud in a 100-order sample could plausibly have a materially lower or higher precision.
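For 0/1 labels there is a shortcut worth knowing. Resampling n binary labels with replacement gives a count of ones that follows a Binomial(n, observed rate) distribution, so the resampled counts can be drawn directly. The sketch below assumes NumPy and uses an arbitrary seed, so its printed numbers will differ slightly from the run above:

```python
import numpy as np

n = 100
confirmed_fraud = 16
p_hat = confirmed_fraud / n

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Resampling n binary labels with replacement is equivalent to drawing the
# number of ones from Binomial(n, p_hat), so no explicit resampling loop is needed.
resampled_counts = rng.binomial(n=n, p=p_hat, size=5_000)
resampled_estimates = resampled_counts / n

low, high = np.quantile(resampled_estimates, [0.025, 0.975])
print(f"bootstrap 95% range: [{low:.3f}, {high:.3f}]")
print(f"spread (std) of resampled estimates: {resampled_estimates.std():.4f}")
```

The shortcut only applies to binary outcomes; for other statistics the explicit row resampling is still the general tool.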

Standard error gives a quick first calculation

For a binary rate, the plug-in standard error estimate is:

$$\widehat{SE} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Here, $\hat{p}$ is the observed rate and $n$ is the number of reviewed flagged orders. With 16 / 100:

$$\widehat{SE} = \sqrt{\frac{0.16 \times 0.84}{100}} \approx 0.0367$$

A quick interval sometimes taught first is the Wald interval:

$$\hat{p} \pm 1.96\,\widehat{SE} = 0.16 \pm 0.0719 \approx [0.088,\ 0.232]$$
standard-error-first-pass.py

    from math import sqrt

    successes = 16
    n = 100
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    low = p_hat - 1.96 * se
    high = p_hat + 1.96 * se

    print(f"estimate: {p_hat:.3f}")
    print(f"standard error: {se:.4f}")
    print(f"quick interval: [{low:.3f}, {high:.3f}]")

Quick interval calculation

    estimate: 0.160
    standard error: 0.0367
    quick interval: [0.088, 0.232]

The formula teaches why sample size matters: divide by a larger n, and the standard error shrinks. But this quick interval isn't robust enough to be your default.
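A short sketch of that shrinkage, holding the observed rate at 0.16 and varying only n (the sample sizes are illustrative, and `plug_in_se` is a helper name introduced here):

```python
from math import sqrt

p_hat = 0.16  # observed rate held fixed, as in the text

def plug_in_se(p: float, n: int) -> float:
    """Plug-in standard error of a binary rate."""
    return sqrt(p * (1 - p) / n)

# Each step quadruples n, which halves the standard error.
for n in [100, 400, 1_600, 6_400]:
    print(f"n={n:<5} standard error={plug_in_se(p_hat, n):.4f}")
```

Because n sits under a square root, halving the standard error costs four times as many labels.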

Use an interval that behaves at the edges

The quick Wald interval breaks badly when the rate is near 0 or 1. If reviewers inspect ten flagged orders and find zero fraud, the standard error formula returns zero and the interval becomes [0, 0]. That would claim certainty from ten observations. It is plainly the wrong engineering conclusion.

A Wilson score confidence interval handles that case better. Its formula is longer, but the code is still small:

$$\text{center} = \frac{\hat{p} + z^2/(2n)}{1 + z^2/n}$$

$$\text{margin} = \frac{z}{1 + z^2/n} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}$$

For a rough 95 percent interval, use $z = 1.96$, then return center - margin and center + margin. This score interval stays within the probability range and doesn't collapse to certainty when a small sample contains zero positives.
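As a worked check of the center and margin formulas for the 16 / 100 sample, with $z = 1.96$ (so $z^2 = 3.8416$) and rounding at each step:

$$\text{center} = \frac{0.16 + 3.8416/200}{1 + 3.8416/100} = \frac{0.179208}{1.038416} \approx 0.1726$$

$$\text{margin} = \frac{1.96}{1.038416} \times \sqrt{\frac{0.16 \times 0.84}{100} + \frac{3.8416}{40000}} \approx 1.8875 \times 0.03795 \approx 0.0716$$

$$[\,0.1726 - 0.0716,\ 0.1726 + 0.0716\,] \approx [0.101,\ 0.244]$$

The center sits slightly above the raw 0.16 because the score interval pulls the estimate toward 0.5, and the pull weakens as n grows.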

wilson-score-interval.py

    from math import sqrt

    def wald_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
        if n <= 0 or not 0 <= successes <= n:
            raise ValueError("require 0 <= successes <= n and n > 0")
        p_hat = successes / n
        se = sqrt(p_hat * (1 - p_hat) / n)
        return p_hat - z * se, p_hat + z * se

    def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
        if n <= 0 or not 0 <= successes <= n:
            raise ValueError("require 0 <= successes <= n and n > 0")
        p_hat = successes / n
        denominator = 1 + z * z / n
        center = (p_hat + z * z / (2 * n)) / denominator
        margin = z / denominator * sqrt(
            p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
        )
        return max(0.0, center - margin), min(1.0, center + margin)

    for successes, n in [(16, 100), (0, 10), (10, 10)]:
        wald = wald_interval(successes, n)
        wilson = wilson_interval(successes, n)
        print(
            f"{successes:>2}/{n:<3} wald=[{wald[0]:.3f}, {wald[1]:.3f}] "
            f"wilson=[{wilson[0]:.3f}, {wilson[1]:.3f}]"
        )

Wald failure and score interval

    16/100 wald=[0.088, 0.232] wilson=[0.101, 0.244]
     0/10  wald=[0.000, 0.000] wilson=[0.000, 0.278]
    10/10  wald=[1.000, 1.000] wilson=[0.722, 1.000]

For this chapter's binary reports, we will use the Wilson interval. The next interval-and-testing chapter will examine how interval choice and decision rules connect to formal comparison.

Larger representative samples sharpen the claim

Suppose later labeling keeps the same center: 16 percent of reviewed flagged orders are fraud. Increasing only the representative sample size narrows the interval.

sample-size-tightens-precision.py

    from math import sqrt

    def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
        if n <= 0 or not 0 <= successes <= n:
            raise ValueError("require 0 <= successes <= n and n > 0")
        p_hat = successes / n
        denominator = 1 + z * z / n
        center = (p_hat + z * z / (2 * n)) / denominator
        margin = z / denominator * sqrt(
            p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
        )
        return max(0.0, center - margin), min(1.0, center + margin)

    for successes, n in [(16, 100), (160, 1_000), (1_600, 10_000)]:
        low, high = wilson_interval(successes, n)
        print(f"{successes:>4}/{n:<6} estimate={successes / n:.3f} interval=[{low:.3f}, {high:.3f}]")

Sample size tightens interval

      16/100    estimate=0.160 interval=[0.101, 0.244]
     160/1000   estimate=0.160 interval=[0.139, 0.184]
    1600/10000  estimate=0.160 interval=[0.153, 0.167]
[Figure: Wilson interval plot showing the same 0.16 queue-precision estimate tightening from 0.101 to 0.244 at n = 100 to 0.153 to 0.167 at n = 10,000, beside a warning that a large biased domestic-only sample still misses higher-risk slices.]
More representative reviews reduce random uncertainty around the same queue precision estimate; they do not repair a sample drawn from the wrong traffic slice.

Same center, different strength. 16 / 100 is an early read. 1,600 / 10,000 is far tighter evidence about the population, assuming both samples represent the queue you intend to serve.

More labels cannot fix the wrong slice

Precision about the wrong population is still wrong. Imagine your queue contains several slices with different fraud rates:

| Flagged-order slice | Share of queue | Confirmed-fraud rate |
| --- | --- | --- |
| routine domestic orders | 70 percent | 12 percent |
| international orders | 20 percent | 25 percent |
| high-value orders | 10 percent | 42 percent |

A review project that samples only routine domestic flags can return a very tight estimate near 12 percent while missing the higher-risk parts of the queue. To estimate overall queue precision, sample each important slice or sample randomly from the actual queue, then account for the slice mix.

representative-slices-matter.py

    slices = [
        ("routine domestic", 0.70, 0.12),
        ("international", 0.20, 0.25),
        ("high value", 0.10, 0.42),
    ]

    overall_precision = sum(share * rate for _, share, rate in slices)
    domestic_only_precision = slices[0][2]

    print(f"domestic-only estimate: {domestic_only_precision:.1%}")
    print(f"queue-weighted estimate: {overall_precision:.1%}")
    print(f"bias from wrong slice: {domestic_only_precision - overall_precision:+.1%}")

    assert round(overall_precision, 3) == 0.176

Slice weighting exposes bias

    domestic-only estimate: 12.0%
    queue-weighted estimate: 17.6%
    bias from wrong slice: -5.6%
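In practice the slice rates are themselves estimated from labels. The following is a sketch of a stratified estimate under two stated assumptions: the queue shares are known exactly (for example from queue logs), and each slice was sampled at random. The per-slice counts are hypothetical, picked to match the rates in the table above.

```python
from math import sqrt

# Hypothetical review results: (slice, share of queue, confirmed fraud, labels).
# Shares are assumed known from queue logs; counts are illustrative only.
slice_reviews = [
    ("routine domestic", 0.70, 12, 100),
    ("international", 0.20, 25, 100),
    ("high value", 0.10, 42, 100),
]

weighted_estimate = 0.0
variance = 0.0
for name, share, fraud, n in slice_reviews:
    rate = fraud / n
    weighted_estimate += share * rate
    # Each slice contributes share^2 times its own binary-rate variance.
    variance += share**2 * rate * (1 - rate) / n
    print(f"{name:<17} share={share:.0%} rate={rate:.1%} labels={n}")

standard_error = sqrt(variance)
print(f"queue-weighted estimate: {weighted_estimate:.1%}")
print(f"stratified standard error: {standard_error:.4f}")
```

The largest slice dominates the uncertainty here, so extra labels buy the most precision when they go where share times variability is largest.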

Random uncertainty and sampling bias require different fixes:

| Failure | Symptom | Correct response |
| --- | --- | --- |
| Too few representative reviews | interval is wide | label more randomly selected queue items |
| Wrong slices reviewed | interval may be narrow but misses deployment risk | repair sampling plan and report slice coverage |
| Labels change after model output is seen | metric drifts toward model guesses | lock rubric and review ambiguous labels independently |

Build it: emit an evaluation report

A small reporting function makes the contract concrete. It prints numerator, denominator, estimate, interval, and a low-data warning. It doesn't decide whether to launch; that decision requires costs, thresholds, and slice coverage.

build-review-queue-report.py

    from math import sqrt

    def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
        if n <= 0 or not 0 <= successes <= n:
            raise ValueError("require 0 <= successes <= n and n > 0")
        p_hat = successes / n
        denominator = 1 + z * z / n
        center = (p_hat + z * z / (2 * n)) / denominator
        margin = z / denominator * sqrt(
            p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
        )
        return max(0.0, center - margin), min(1.0, center + margin)

    def report_precision(successes: int, n: int, slice_name: str) -> str:
        low, high = wilson_interval(successes, n)
        caution = " collect more labels" if n < 200 else ""
        return (
            f"{slice_name}: {successes}/{n} = {successes / n:.1%}; "
            f"95% score interval [{low:.1%}, {high:.1%}].{caution}"
        )

    print(report_precision(16, 100, "all sampled flags"))
    print(report_precision(25, 100, "international flags"))

    try:
        report_precision(2, 0, "empty slice")
    except ValueError as error:
        print(error)

Review queue report

    all sampled flags: 16/100 = 16.0%; 95% score interval [10.1%, 24.4%]. collect more labels
    international flags: 25/100 = 25.0%; 95% score interval [17.5%, 34.3%]. collect more labels
    require 0 <= successes <= n and n > 0

An evaluation report should pair that line with how the sample was chosen and which important slices have enough labels. A number without its collection process invites overconfidence.
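One possible way to carry that context alongside the numbers is sketched below. The `SliceReport` dataclass, its `sampling_method` field, and the `MIN_LABELS` name are hypothetical choices for illustration; the 200-label cutoff mirrors the low-data warning used in this lesson's report function.

```python
from dataclasses import dataclass
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")
    p_hat = successes / n
    denominator = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denominator
    margin = z / denominator * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return max(0.0, center - margin), min(1.0, center + margin)

MIN_LABELS = 200  # assumed cutoff, same as the low-data warning above

@dataclass
class SliceReport:
    slice_name: str
    successes: int
    n: int
    sampling_method: str  # free-text description of how rows were chosen

def format_report(report: SliceReport) -> str:
    low, high = wilson_interval(report.successes, report.n)
    coverage = "enough labels" if report.n >= MIN_LABELS else "needs more labels"
    return (
        f"{report.slice_name}: {report.successes}/{report.n} = "
        f"{report.successes / report.n:.1%}; 95% score interval "
        f"[{low:.1%}, {high:.1%}]; sampled by: {report.sampling_method}; "
        f"coverage: {coverage}"
    )

line = format_report(
    SliceReport("all sampled flags", 16, 100, "uniform random draw from weekly queue")
)
print(line)
```

Keeping the sampling description in the same record as the counts makes it harder for the rate to travel without its collection process.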

MLE and MAP name two estimation choices

The sample rate did not come from habit alone. Treat each reviewed flag as a Bernoulli outcome with unknown fraud probability $p$. If $k$ of $n$ reviewed flags are fraud, the likelihood of that observation is proportional to:

$$L(p) = p^k(1-p)^{n-k}$$

Maximum likelihood estimation (MLE) chooses the value of $p$ that makes the observed labels most likely:

$$\hat{p}_{\text{MLE}} = \frac{k}{n}$$

For 16 / 100, MLE is 0.16.
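The step from the likelihood to $k/n$ can be sketched in two lines, assuming $0 < k < n$ so the maximum is interior. Take the logarithm, differentiate with respect to $p$, and set the derivative to zero:

$$\log L(p) = k \log p + (n-k)\log(1-p)$$

$$\frac{d}{dp}\log L(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Longrightarrow\; k(1-p) = (n-k)\,p \;\Longrightarrow\; p = \frac{k}{n}$$

The log-likelihood is concave in $p$, so this stationary point is the maximum.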

Maximum a posteriori estimation (MAP) makes the prior explicit. With a mild $\text{Beta}(2,2)$ prior, centered at 0.5 but weak relative to a large labeled sample, the posterior mode is:

$$\hat{p}_{\text{MAP}} = \frac{k + 1}{n + 2}$$
mle-and-map-for-queue-risk.py

    def validate_counts(successes: int, n: int) -> None:
        if n <= 0 or not 0 <= successes <= n:
            raise ValueError("require 0 <= successes <= n and n > 0")

    def mle_rate(successes: int, n: int) -> float:
        validate_counts(successes, n)
        return successes / n

    def map_rate_beta_2_2(successes: int, n: int) -> float:
        validate_counts(successes, n)
        return (successes + 1) / (n + 2)

    for successes, n in [(16, 100), (1, 5), (160, 1_000)]:
        mle = mle_rate(successes, n)
        map_estimate = map_rate_beta_2_2(successes, n)
        print(f"{successes}/{n:<4} MLE={mle:.3f} MAP_Beta(2,2)={map_estimate:.3f}")

MLE and explicit prior

    16/100  MLE=0.160 MAP_Beta(2,2)=0.167
    1/5    MLE=0.200 MAP_Beta(2,2)=0.286
    160/1000 MLE=0.160 MAP_Beta(2,2)=0.161

The prior moves a five-label estimate more than a thousand-label estimate. This isn't permission to hide weak evidence behind a convenient prior. It is a reminder to state the prior and show how much it changes the result.[3][4]
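One way to show how much the prior changes the result is to run the same counts through several priors. The sketch below generalizes the posterior mode to any Beta(a, b) prior; the flat and "skeptical" priors are hypothetical choices for illustration, not recommendations.

```python
def map_rate(successes: int, n: int, a: float, b: float) -> float:
    """Posterior mode of a Bernoulli rate under a Beta(a, b) prior.

    The posterior is Beta(successes + a, n - successes + b). The mode formula
    below applies only when both posterior parameters exceed 1.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")
    post_a = successes + a
    post_b = n - successes + b
    if post_a <= 1 or post_b <= 1:
        raise ValueError("mode formula needs both posterior parameters > 1")
    return (post_a - 1) / (post_a + post_b - 2)

# Hypothetical priors: flat, the lesson's mild prior, and one leaning toward low fraud.
priors = [("Beta(1,1) flat", 1, 1), ("Beta(2,2) mild", 2, 2), ("Beta(2,10) skeptical", 2, 10)]

for successes, n in [(1, 5), (16, 100)]:
    for label, a, b in priors:
        estimate = map_rate(successes, n, a, b)
        print(f"{successes}/{n:<4} {label:<21} MAP={estimate:.3f}")
```

With the flat Beta(1,1) prior the MAP estimate equals the MLE, and the spread across priors is much larger for 1 / 5 than for 16 / 100.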

Calibration needs its own evidence

The previous chapter introduced calibration: a model score near 0.20 should correspond to fraud roughly 20 percent of the time across similarly scored orders. Now statistics changes what you can claim from a finite bucket.

Suppose 100 flagged orders had predicted risk near 20 percent and reviewers confirm fraud in 16. The observed gap is four percentage points, but the Wilson interval for the observed rate still includes 20 percent. With this sample, you have not established a calibration failure.

finite-calibration-bucket.py

    from math import sqrt

    def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
        if n <= 0 or not 0 <= successes <= n:
            raise ValueError("require 0 <= successes <= n and n > 0")
        p_hat = successes / n
        denominator = 1 + z * z / n
        center = (p_hat + z * z / (2 * n)) / denominator
        margin = z / denominator * sqrt(
            p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
        )
        return max(0.0, center - margin), min(1.0, center + margin)

    predicted_risk = 0.20
    observed_fraud = 16
    n = 100
    low, high = wilson_interval(observed_fraud, n)

    print(f"predicted bucket risk: {predicted_risk:.1%}")
    print(f"observed fraud rate: {observed_fraud / n:.1%}")
    print(f"observed interval: [{low:.1%}, {high:.1%}]")
    print(f"20% is plausible here: {low <= predicted_risk <= high}")

Calibration needs labels

    predicted bucket risk: 20.0%
    observed fraud rate: 16.0%
    observed interval: [10.1%, 24.4%]
    20% is plausible here: True

Calibration matters for classifiers, and modern neural networks can be miscalibrated, so it must be evaluated rather than assumed.[5] This example adds the statistical discipline: a bucket gap without enough labels isn't strong evidence either.
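The same interval check extends to several score buckets at once. The sketch below uses hypothetical bucket counts and treats "predicted risk falls inside the observed Wilson interval" as a screening rule, not a formal test:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")
    p_hat = successes / n
    denominator = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denominator
    margin = z / denominator * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return max(0.0, center - margin), min(1.0, center + margin)

# Hypothetical buckets: (average predicted risk, confirmed fraud, labels).
buckets = [
    (0.05, 3, 80),
    (0.20, 16, 100),
    (0.40, 60, 300),
]

results = []
for predicted, fraud, n in buckets:
    low, high = wilson_interval(fraud, n)
    plausible = low <= predicted <= high
    results.append(plausible)
    print(
        f"predicted={predicted:.0%} observed={fraud / n:.1%} "
        f"interval=[{low:.1%}, {high:.1%}] plausible={plausible}"
    )
```

Only the last bucket, where 300 labels put the observed rate far from the predicted 40 percent, gives evidence of miscalibration; the other two remain consistent with their predicted risk.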

Name uncertainty before proposing a fix

An interval around queue precision measures estimation uncertainty: how much a finite reviewed sample can move. It does not identify every reason a detector can fail.

| Problem | What it looks like | Does a tighter interval fix it? |
| --- | --- | --- |
| finite representative sample | measured precision moves across random label batches | yes, more representative labels narrow it |
| sampling bias | only one routine slice was reviewed | no, change the sample design |
| ambiguous labels | reviewers disagree on whether an order is fraud | no, improve rubric and measure agreement |
| unfamiliar deployment traffic | new payment path or geography appears | no, add coverage and retrain or route safely |
| miscalibrated score | predicted-risk buckets disagree with observed rates | no, measure and calibrate on held-out labels |

Some literature groups irreducible input or label ambiguity under aleatoric uncertainty, and lack of knowledge about unfamiliar cases under epistemic uncertainty. For this lesson, the practical move matters more than the label: don't use an interval for sampling noise as if it solved biased sampling, unclear labels, or unfamiliar traffic.
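For the ambiguous-labels row, "measure agreement" can start very simply. The sketch below computes raw agreement and Cohen's kappa, one common chance-corrected agreement statistic chosen here for illustration (the lesson does not prescribe a specific measure). The two reviewers' labels are hypothetical.

```python
# Hypothetical independent labels from two reviewers on the same ten orders.
reviewer_a = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
reviewer_b = [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]

def cohen_kappa(a: list[int], b: list[int]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two binary label lists."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    rate_a = sum(a) / n
    rate_b = sum(b) / n
    # Agreement expected if each reviewer labeled independently at their own rate.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return observed, (observed - expected) / (1 - expected)

observed, kappa = cohen_kappa(reviewer_a, reviewer_b)
print(f"raw agreement: {observed:.1%}")
print(f"Cohen's kappa: {kappa:.3f}")
```

Raw agreement of 80 percent sounds high, but with fraud this rare much of it is expected by chance, which is why the chance-corrected value is noticeably lower.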

Practice: write the report before the conclusion

You review 120 flagged high-value orders and confirm fraud in 24 of them (24 / 120).

  1. Compute the point estimate.
  2. Run the wilson_interval function for 24, 120.
  3. Explain whether this sample alone proves the high-value slice meets a 25 percent minimum precision requirement.
  4. Your routine-domestic slice has 120 / 1000 fraud. Explain why you cannot replace one slice with the other.
  5. A predicted-risk bucket averages 25 percent risk and contains 24 / 120 fraud. Is this enough to declare it miscalibrated?

Solution checks:

| Item | Check |
| --- | --- |
| Point estimate | 24 / 120 = 0.20 |
| Wilson interval | approximately [0.138, 0.280] |
| 25 percent requirement | the interval crosses 25 percent, so this sample doesn't establish that precision is below or above the requirement |
| Slice swap | high-value and routine-domestic flags represent different populations; combine them only with a valid sampling/weighting plan |
| Calibration call | no; 25 percent remains plausible within this finite bucket's interval |

The habit is intentionally conservative. Report what your labels support; do not convert weak evidence into a confident product claim.
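The first three solution checks can be reproduced with a short script. The three-way verdict wording is an illustrative convention introduced here, not a policy from the lesson:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")
    p_hat = successes / n
    denominator = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denominator
    margin = z / denominator * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return max(0.0, center - margin), min(1.0, center + margin)

successes, n = 24, 120
requirement = 0.25  # minimum precision from the practice prompt

low, high = wilson_interval(successes, n)

# Only claim the requirement is met or missed when the whole interval agrees.
if low >= requirement:
    verdict = "meets requirement"
elif high < requirement:
    verdict = "falls below requirement"
else:
    verdict = "inconclusive"

print(f"point estimate: {successes / n:.3f}")
print(f"wilson interval: [{low:.3f}, {high:.3f}]")
print(f"verdict against {requirement:.0%}: {verdict}")
```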

Mastery check

Key concepts

  • A sample rate estimates a population rate; it is not the rate itself.
  • For rare fraud, queue precision is more informative than overall accuracy for manual-review usefulness.
  • Bootstrap resampling visualizes finite-sample movement without creating new evidence.
  • Standard error describes movement in an estimator, while a score interval makes the report operational.
  • A Wilson score interval behaves sensibly when a binary sample has zero or all successes.
  • More representative labels reduce random uncertainty; they don't correct sampling bias.
  • MLE and MAP differ because MAP states an explicit prior.
  • Calibration gaps also need enough labeled evidence before they justify action.

Evaluation rubric

  • Foundational: Names population, sample, event, and point estimate for a flagged review queue.
  • Intermediate: Implements bootstrap resampling and a Wilson interval, then interprets a wide interval without overclaiming.
  • Applied: Produces a report with numerator, denominator, interval, sampling method, and slice coverage.
  • Advanced: Distinguishes sample noise, biased sampling, label ambiguity, unfamiliar traffic, and calibration error before proposing a fix.


Common pitfalls

  • Symptom: A rare-event detector is celebrated for 99 percent accuracy. Cause: accuracy ignored the review decision and class imbalance. Fix: report review-queue precision and fraud recall for the action being taken.
  • Symptom: 0 / 10 is reported as zero risk. Cause: the quick normal interval collapsed at an edge. Fix: use a binary-rate interval that preserves uncertainty, such as the Wilson score interval.
  • Symptom: A narrow interval fails in production. Cause: only a routine traffic slice was labeled. Fix: sample from deployment traffic and report slice coverage.
  • Symptom: A four-point calibration gap triggers a model rollback from 100 labels. Cause: point estimates were compared without uncertainty. Fix: interval-check the observed bucket rate and gather more labels if the decision remains unclear.
  • Symptom: MAP is reported without naming a prior. Cause: shrinkage was hidden in a formula. Fix: state the prior distribution and show both MLE and MAP estimates.

Mastery Check


1. A weekly queue contains many flagged orders. Reviewers randomly sample 100 and confirm 16 fraud. Overall fraud is rare, and an always-legitimate detector could be 99% accurate. Which statement correctly interprets 16/100 for the review decision?
2. Two random, representative review samples both find 16% fraud: 16/100 and 160/1,000. Each team bootstraps by resampling its observed binary labels at the original sample size with replacement. What should they expect?
3. Reviewers inspect 10 randomly selected flagged orders and confirm 0 fraud. The plug-in Wald interval uses p_hat = 0 and returns [0, 0]. Which interpretation is correct?
4. An audit of a flagged-order queue finds these slice shares and confirmed-fraud rates: routine domestic 70% at 12%, international 20% at 25%, and high-value 10% at 42%. A separate review sample from only routine domestic flags has a very tight interval around 12%. Which queue-wide conclusion is supported?
5. Reviewers confirm fraud in 24 of 120 randomly sampled high-value flags. The point estimate is 20%, and the 95% Wilson interval is [13.8%, 28.0%]. A policy requires at least 25% precision. Which report is justified?
6. A calibration bucket has average predicted risk 20%. Reviewers label 100 orders in the bucket and find 16 fraud; the observed Wilson interval is [10.1%, 24.4%]. Which conclusion is supported?
7. For reviewed flagged orders, treat fraud as a Bernoulli outcome with k successes in n labels. MLE is k / n; with the Beta(2,2) prior used here, MAP is (k + 1) / (n + 2). Which calculation is correct?
8. A narrow Wilson interval is reported, but reviewers saw the model's prediction before deciding ambiguous cases, and their labels increasingly match the model's guesses. What should the team do?


Next Step
Continue to Distributions and Sampling

You can now estimate a binary rate and state how much finite labels support the claim. Next you will model different kinds of random outcomes and simulate the distributions that produce those samples.

References

[1] Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics.
[2] Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning.
[3] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective.
[4] Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
[5] Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks.