Estimate fraud risk in a flagged review queue from finite labels, using bootstrap intuition, score intervals, sampling bias checks, and calibrated reporting.
The last chapter counted every order in an imagined detector population and found that a flagged order had about 16 percent fraud risk. A live e-commerce system doesn't hand you that answer. You see a queue of flagged orders, pay reviewers to label a sample, and estimate the risk for the queue you haven't reviewed yet.
That difference is statistics: probability reasons from known rates; statistics reasons from measured samples back toward unknown rates.
Keep the same event from the probability chapter:
| Name | Meaning here |
|---|---|
| Population | every flagged order entering the manual-review queue this week |
| Sample | 100 flagged orders selected at random and reviewed by humans |
| Success | reviewer confirms the order is fraudulent |
| Unknown rate | $p = P(\text{fraud} \mid \text{flagged})$, also called review-queue precision |
Suppose reviewers confirm fraud in 16 of the 100 sampled flagged orders.

$$\hat{p} = \frac{16}{100} = 0.16$$

The hat on $\hat{p}$ says "estimate." It doesn't say the full queue is exactly 16 percent fraud. It says 16 percent is the point estimate computed from this sample.
This is deliberately not overall accuracy. If only 1 percent of all orders are fraudulent, a useless model that labels every order legitimate is 99 percent accurate. For rare events, report the metric attached to the product decision: here, how much of the costly review queue is worthwhile.
```python
reviewed_flags = 100
confirmed_fraud = 16

precision_estimate = confirmed_fraud / reviewed_flags
always_legitimate_accuracy = 0.99

print(f"review queue: {confirmed_fraud}/{reviewed_flags} confirmed fraud")
print(f"estimated P(fraud | flagged): {precision_estimate:.1%}")
print(f"misleading rare-event baseline accuracy: {always_legitimate_accuracy:.1%}")

assert precision_estimate == 0.16
```

```text
review queue: 16/100 confirmed fraud
estimated P(fraud | flagged): 16.0%
misleading rare-event baseline accuracy: 99.0%
```

If the true queue risk were 16 percent, random batches of 100 reviewed flags wouldn't all contain exactly 16 fraudulent orders. One might contain 12, another 19, another 17. That movement isn't model drift. It is sampling variation.
Use a seeded simulation only to make the idea visible. The rate is fixed at 0.16; only the sampled orders change.
```python
import numpy as np

rng = np.random.default_rng(17)
confirmed_counts = rng.binomial(n=100, p=0.16, size=8)
estimates = confirmed_counts / 100

print("confirmed fraud counts:", confirmed_counts.tolist())
print("precision estimates: ", [float(round(value, 2)) for value in estimates])
print(f"range across batches: {estimates.min():.2f} to {estimates.max():.2f}")
```

```text
confirmed fraud counts: [20, 12, 16, 15, 13, 15, 15, 17]
precision estimates:  [0.2, 0.12, 0.16, 0.15, 0.13, 0.15, 0.15, 0.17]
range across batches: 0.12 to 0.20
```

This is the central problem: one measured rate is an answer about one sample. A production claim needs both the center and how much that center can move.
In real work you don't know the true queue risk used by the simulation above. You only have the reviewed labels: sixteen 1 values and eighty-four 0 values.
The bootstrap repeatedly samples 100 rows with replacement from those observed labels. It asks how the estimate moves when your measured sample is treated as the best available stand-in for the population. It is not new evidence; it is a way to inspect sampling variability from evidence you already collected.[1][2]
```python
import numpy as np

reviewed_labels = np.array([1] * 16 + [0] * 84, dtype=float)
rng = np.random.default_rng(7)

resampled_estimates = np.array([
    rng.choice(reviewed_labels, size=len(reviewed_labels), replace=True).mean()
    for _ in range(5_000)
])

low, high = np.quantile(resampled_estimates, [0.025, 0.975])

print(f"sample estimate: {reviewed_labels.mean():.3f}")
print(f"bootstrap 95% range: [{low:.3f}, {high:.3f}]")
print(f"resampled min and max: {resampled_estimates.min():.3f}, {resampled_estimates.max():.3f}")
```

```text
sample estimate: 0.160
bootstrap 95% range: [0.090, 0.230]
resampled min and max: 0.050, 0.300
```

The bootstrap range is much wider than one point estimate. A queue that measured 16 percent fraud in a 100-order sample could plausibly have a materially lower or higher precision.
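For a binary rate, the plug-in standard error estimate is:

$$\widehat{\mathrm{SE}}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}}$$

Here, $\hat{p}$ is the observed rate and $n$ is the number of reviewed flagged orders. With 16 / 100:

$$\widehat{\mathrm{SE}}(\hat{p}) = \sqrt{\frac{0.16 \times 0.84}{100}} \approx 0.0367$$

A quick interval sometimes taught first is the Wald interval:

$$\hat{p} \pm 1.96 \cdot \widehat{\mathrm{SE}}(\hat{p})$$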
```python
from math import sqrt

successes = 16
n = 100
p_hat = successes / n
se = sqrt(p_hat * (1 - p_hat) / n)
low = p_hat - 1.96 * se
high = p_hat + 1.96 * se

print(f"estimate: {p_hat:.3f}")
print(f"standard error: {se:.4f}")
print(f"quick interval: [{low:.3f}, {high:.3f}]")
```

```text
estimate: 0.160
standard error: 0.0367
quick interval: [0.088, 0.232]
```

The formula teaches why sample size matters: divide by a larger n, and the standard error shrinks. But this quick interval isn't robust enough to make your default.
The quick Wald interval breaks badly when the rate is near 0 or 1. If reviewers inspect ten flagged orders and find zero fraud, the standard error formula returns zero and the interval becomes [0, 0]. That would claim certainty from ten observations. It is plainly the wrong engineering conclusion.
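A Wilson score confidence interval handles that case better. Its formula is longer, but the code is still small:

$$\text{center} = \frac{\hat{p} + \frac{z^2}{2n}}{1 + \frac{z^2}{n}}, \qquad \text{margin} = \frac{z}{1 + \frac{z^2}{n}} \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n} + \frac{z^2}{4n^2}}$$

For a rough 95 percent interval, use $z = 1.96$, then return center - margin and center + margin. This score interval stays within the probability range and doesn't collapse to certainty when a small sample contains zero positives.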
```python
from math import sqrt

def wald_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")
    p_hat = successes / n
    denominator = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denominator
    margin = z / denominator * sqrt(
        p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
    )
    return max(0.0, center - margin), min(1.0, center + margin)

for successes, n in [(16, 100), (0, 10), (10, 10)]:
    wald = wald_interval(successes, n)
    wilson = wilson_interval(successes, n)
    print(
        f"{successes:>2}/{n:<3} wald=[{wald[0]:.3f}, {wald[1]:.3f}] "
        f"wilson=[{wilson[0]:.3f}, {wilson[1]:.3f}]"
    )
```

```text
16/100 wald=[0.088, 0.232] wilson=[0.101, 0.244]
 0/10  wald=[0.000, 0.000] wilson=[0.000, 0.278]
10/10  wald=[1.000, 1.000] wilson=[0.722, 1.000]
```

For this chapter's binary reports, we will use the Wilson interval. The next interval-and-testing chapter will examine how interval choice and decision rules connect to formal comparison.
Suppose later labeling keeps the same center: 16 percent of reviewed flagged orders are fraud. Increasing only the representative sample size narrows the interval.
```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")
    p_hat = successes / n
    denominator = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denominator
    margin = z / denominator * sqrt(
        p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
    )
    return max(0.0, center - margin), min(1.0, center + margin)

for successes, n in [(16, 100), (160, 1_000), (1_600, 10_000)]:
    low, high = wilson_interval(successes, n)
    print(f"{successes:>4}/{n:<6} estimate={successes / n:.3f} interval=[{low:.3f}, {high:.3f}]")
```

```text
  16/100    estimate=0.160 interval=[0.101, 0.244]
 160/1000   estimate=0.160 interval=[0.139, 0.184]
1600/10000  estimate=0.160 interval=[0.153, 0.167]
```
Same center, different strength. 16 / 100 is an early read. 1,600 / 10,000 is far tighter evidence about the population, assuming both samples represent the queue you intend to serve.
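To turn "same center, different strength" into a labeling budget, it can help to look at interval width directly. The following is a minimal sketch, assuming the observed rate stays near 16 percent as more labels arrive. The helper names `wilson_width` and `smallest_n_for_width`, the candidate sample sizes, and the 5-point target width are illustrative assumptions, not part of the lesson's code.

```python
from math import sqrt
from typing import Optional

def wilson_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Width of the Wilson score interval for an observed rate on n labels.

    Ignores clipping at 0 and 1, which does not occur for the rates used here.
    """
    if n <= 0 or not 0.0 <= p_hat <= 1.0:
        raise ValueError("require n > 0 and 0 <= p_hat <= 1")
    denominator = 1 + z * z / n
    margin = z / denominator * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return 2 * margin

def smallest_n_for_width(p_hat: float, target_width: float, candidates: list[int]) -> Optional[int]:
    """Return the smallest candidate n whose interval is no wider than the target."""
    for n in sorted(candidates):
        if wilson_width(p_hat, n) <= target_width:
            return n
    return None

assumed_rate = 0.16          # assumption: the center does not move as labels are added
target_width = 0.05          # hypothetical requirement: interval at most 5 points wide
candidate_sizes = [100, 200, 500, 1_000, 2_000, 5_000, 10_000]

for n in candidate_sizes:
    print(f"n={n:<6} interval width={wilson_width(assumed_rate, n):.3f}")

needed = smallest_n_for_width(assumed_rate, target_width, candidate_sizes)
print(f"smallest candidate n meeting width {target_width:.2f}: {needed}")
```

The width shrinks roughly with the square root of n, so halving the interval costs about four times as many representative labels.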
Precision about the wrong population is still wrong. Imagine your queue contains several slices with different fraud rates:
| Flagged-order slice | Share of queue | Confirmed-fraud rate |
|---|---|---|
| routine domestic orders | 70 percent | 12 percent |
| international orders | 20 percent | 25 percent |
| high-value orders | 10 percent | 42 percent |
A review project that samples only routine domestic flags can return a very tight estimate near 12 percent while missing the higher-risk parts of the queue. To estimate overall queue precision, sample each important slice or sample randomly from the actual queue, then account for the slice mix.
```python
slices = [
    ("routine domestic", 0.70, 0.12),
    ("international", 0.20, 0.25),
    ("high value", 0.10, 0.42),
]

overall_precision = sum(share * rate for _, share, rate in slices)
domestic_only_precision = slices[0][2]

print(f"domestic-only estimate: {domestic_only_precision:.1%}")
print(f"queue-weighted estimate: {overall_precision:.1%}")
print(f"bias from wrong slice: {domestic_only_precision - overall_precision:+.1%}")

assert round(overall_precision, 3) == 0.176
```

```text
domestic-only estimate: 12.0%
queue-weighted estimate: 17.6%
bias from wrong slice: -5.6%
```

Random uncertainty and sampling bias require different fixes:
| Failure | Symptom | Correct response |
|---|---|---|
| Too few representative reviews | interval is wide | label more randomly selected queue items |
| Wrong slices reviewed | interval may be narrow but misses deployment risk | repair sampling plan and report slice coverage |
| Labels change after model output is seen | metric drifts toward model guesses | lock rubric and review ambiguous labels independently |
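One way to "repair the sampling plan" is to review a fixed number of flags per slice and then reweight by each slice's share of the queue. The sketch below assumes 100 reviewed flags per slice with counts chosen to match the slice table; the counts, the function name `post_stratified_estimate`, and the standard-error formula (independent simple random samples within each slice, no finite-population correction) are assumptions for illustration.

```python
from math import sqrt

# (slice name, share of queue, confirmed fraud, reviewed); counts are hypothetical
slice_reviews = [
    ("routine domestic", 0.70, 12, 100),
    ("international", 0.20, 25, 100),
    ("high value", 0.10, 42, 100),
]

def post_stratified_estimate(reviews: list[tuple[str, float, int, int]]) -> tuple[float, float]:
    """Weight each slice's observed rate by its queue share; return (estimate, standard error)."""
    total_share = sum(share for _, share, _, _ in reviews)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("slice shares must sum to 1")
    estimate = 0.0
    variance = 0.0
    for _, share, successes, n in reviews:
        if n <= 0 or not 0 <= successes <= n:
            raise ValueError("require 0 <= successes <= n and n > 0")
        rate = successes / n
        estimate += share * rate
        variance += share ** 2 * rate * (1 - rate) / n
    return estimate, sqrt(variance)

# Pooling the raw counts treats every slice as one third of the queue, which it is not.
pooled = sum(s for _, _, s, _ in slice_reviews) / sum(n for _, _, _, n in slice_reviews)
weighted, standard_error = post_stratified_estimate(slice_reviews)

print(f"naive pooled estimate:   {pooled:.1%}")
print(f"queue-weighted estimate: {weighted:.1%} (standard error {standard_error:.3f})")
```

Equal-sized slice samples over-represent the small high-risk slices, so the pooled rate overstates queue precision here; the weighting restores the actual slice mix.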
A small reporting function makes the contract concrete. It prints numerator, denominator, estimate, interval, and a low-data warning. It doesn't decide whether to launch; that decision requires costs, thresholds, and slice coverage.
```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")
    p_hat = successes / n
    denominator = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denominator
    margin = z / denominator * sqrt(
        p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
    )
    return max(0.0, center - margin), min(1.0, center + margin)

def report_precision(successes: int, n: int, slice_name: str) -> str:
    low, high = wilson_interval(successes, n)
    caution = " collect more labels" if n < 200 else ""
    return (
        f"{slice_name}: {successes}/{n} = {successes / n:.1%}; "
        f"95% score interval [{low:.1%}, {high:.1%}].{caution}"
    )

print(report_precision(16, 100, "all sampled flags"))
print(report_precision(25, 100, "international flags"))

try:
    report_precision(2, 0, "empty slice")
except ValueError as error:
    print(error)
```

```text
all sampled flags: 16/100 = 16.0%; 95% score interval [10.1%, 24.4%]. collect more labels
international flags: 25/100 = 25.0%; 95% score interval [17.5%, 34.3%]. collect more labels
require 0 <= successes <= n and n > 0
```

An evaluation report should pair that line with how the sample was chosen and which important slices have enough labels. A number without its collection process invites overconfidence.
The sample rate did not come from habit alone. Treat each reviewed flag as a Bernoulli outcome with unknown fraud probability $p$. If $k$ of $n$ reviewed flags are fraud, the likelihood of that observation is proportional to:

$$L(p) \propto p^{k}\,(1 - p)^{n - k}$$

Maximum likelihood estimation (MLE) chooses the value of $p$ that makes the observed labels most likely:

$$\hat{p}_{\text{MLE}} = \frac{k}{n}$$

For 16 / 100, MLE is 0.16.

Maximum a posteriori estimation (MAP) makes the prior explicit. With a mild $\text{Beta}(2, 2)$ prior, centered at 0.5 but weak relative to a large labeled sample, the posterior mode is:

$$\hat{p}_{\text{MAP}} = \frac{k + 1}{n + 2}$$
```python
def validate_counts(successes: int, n: int) -> None:
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")

def mle_rate(successes: int, n: int) -> float:
    validate_counts(successes, n)
    return successes / n

def map_rate_beta_2_2(successes: int, n: int) -> float:
    validate_counts(successes, n)
    return (successes + 1) / (n + 2)

for successes, n in [(16, 100), (1, 5), (160, 1_000)]:
    mle = mle_rate(successes, n)
    map_estimate = map_rate_beta_2_2(successes, n)
    print(f"{successes}/{n:<4} MLE={mle:.3f} MAP_Beta(2,2)={map_estimate:.3f}")
```

```text
16/100  MLE=0.160 MAP_Beta(2,2)=0.167
1/5    MLE=0.200 MAP_Beta(2,2)=0.286
160/1000 MLE=0.160 MAP_Beta(2,2)=0.161
```

The prior moves a five-label estimate more than a thousand-label estimate. This isn't permission to hide weak evidence behind a convenient prior. It is a reminder to state the prior and show how much it changes the result.[3][4]
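Showing how much the prior changes the result can be done by reporting the estimate under more than one prior. The sketch below generalizes the posterior mode to a Beta(alpha, beta) prior; the function name `map_rate_beta` and the three priors compared (including the low-fraud-leaning Beta(2, 8)) are illustrative assumptions, and the sketch only handles cases where the posterior mode lies strictly inside (0, 1).

```python
def map_rate_beta(successes: int, n: int, alpha: float, beta: float) -> float:
    """Posterior mode for a Bernoulli rate under a Beta(alpha, beta) prior."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")
    post_a = successes + alpha        # posterior is Beta(post_a, post_b)
    post_b = n - successes + beta
    if post_a <= 1 or post_b <= 1:
        raise ValueError("interior posterior mode requires both posterior parameters > 1")
    return (post_a - 1) / (post_a + post_b - 2)

# Hypothetical priors: uniform, the mild prior used above, and one leaning toward low fraud rates.
priors = [("Beta(1,1)", 1.0, 1.0), ("Beta(2,2)", 2.0, 2.0), ("Beta(2,8)", 2.0, 8.0)]

for successes, n in [(1, 5), (160, 1_000)]:
    estimates = ", ".join(
        f"{name}={map_rate_beta(successes, n, a, b):.3f}" for name, a, b in priors
    )
    print(f"{successes}/{n}: {estimates}")
```

With a uniform Beta(1, 1) prior the posterior mode equals the MLE, which makes it a convenient reference row when reporting prior sensitivity.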
The previous chapter introduced calibration: a model score near 0.20 should correspond to fraud roughly 20 percent of the time across similarly scored orders. Now statistics changes what you can claim from a finite bucket.
Suppose 100 flagged orders had predicted risk near 20 percent and reviewers confirm fraud in 16. The observed gap is four percentage points, but the Wilson interval for the observed rate still includes 20 percent. With this sample, you have not established a calibration failure.
```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")
    p_hat = successes / n
    denominator = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denominator
    margin = z / denominator * sqrt(
        p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
    )
    return max(0.0, center - margin), min(1.0, center + margin)

predicted_risk = 0.20
observed_fraud = 16
n = 100
low, high = wilson_interval(observed_fraud, n)

print(f"predicted bucket risk: {predicted_risk:.1%}")
print(f"observed fraud rate: {observed_fraud / n:.1%}")
print(f"observed interval: [{low:.1%}, {high:.1%}]")
print(f"20% is plausible here: {low <= predicted_risk <= high}")
```

```text
predicted bucket risk: 20.0%
observed fraud rate: 16.0%
observed interval: [10.1%, 24.4%]
20% is plausible here: True
```

Calibration matters for classifiers, and modern neural networks can be miscalibrated, so it must be evaluated rather than assumed.[5] This example adds the statistical discipline: a bucket gap without enough labels isn't strong evidence either.
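The same check extends to several score buckets at once. The sketch below is a rough illustration with hypothetical bucket counts; `bucket_is_consistent` is an assumed helper name, and "consistent" only means the predicted risk falls inside the bucket's Wilson interval, not that the bucket is proven calibrated.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")
    p_hat = successes / n
    denominator = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denominator
    margin = z / denominator * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return max(0.0, center - margin), min(1.0, center + margin)

def bucket_is_consistent(predicted: float, successes: int, n: int) -> bool:
    """True when the predicted risk lies inside the observed rate's Wilson interval."""
    low, high = wilson_interval(successes, n)
    return low <= predicted <= high

# (predicted risk, confirmed fraud, reviewed) per score bucket; all counts are hypothetical
buckets = [(0.05, 9, 200), (0.20, 16, 100), (0.40, 90, 400)]
checks = [bucket_is_consistent(*bucket) for bucket in buckets]

for (predicted, successes, n), consistent in zip(buckets, checks):
    low, high = wilson_interval(successes, n)
    verdict = "within interval" if consistent else "outside interval"
    print(
        f"predicted {predicted:.0%}: observed {successes}/{n} = {successes / n:.1%}, "
        f"interval [{low:.1%}, {high:.1%}] -> {verdict}"
    )
```

A bucket flagged "outside interval" is a candidate for recalibration on held-out labels; a bucket "within interval" may simply need more labels before any claim is made.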
An interval around queue precision measures estimation uncertainty: how much a finite reviewed sample can move. It does not identify every reason a detector can fail.
| Problem | What it looks like | Does a tighter interval fix it? |
|---|---|---|
| finite representative sample | measured precision moves across random label batches | yes, more representative labels narrow it |
| sampling bias | only one routine slice was reviewed | no, change the sample design |
| ambiguous labels | reviewers disagree on whether an order is fraud | no, improve rubric and measure agreement |
| unfamiliar deployment traffic | new payment path or geography appears | no, add coverage and retrain or route safely |
| miscalibrated score | predicted-risk buckets disagree with observed rates | no, measure and calibrate on held-out labels |
Some literature groups irreducible input or label ambiguity under aleatoric uncertainty, and lack of knowledge about unfamiliar cases under epistemic uncertainty. For this lesson, the practical move matters more than the label: don't use an interval for sampling noise as if it solved biased sampling, unclear labels, or unfamiliar traffic.
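For the ambiguous-labels row, "measure agreement" can be made concrete with a chance-corrected agreement score. The sketch below uses Cohen's kappa, which is one common choice and not something the lesson prescribes; the two reviewers' labels are hypothetical and `cohen_kappa` is an assumed helper name.

```python
def cohen_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's kappa for two reviewers assigning binary labels (1 = fraud) to the same orders."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    rate_a = sum(labels_a) / n
    rate_b = sum(labels_b) / n
    # Agreement expected if both reviewers labeled independently at their own rates.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both reviewers use a single label")
    return (observed - expected) / (1 - expected)

# Hypothetical double-reviewed sample of ten flagged orders.
reviewer_a = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
reviewer_b = [1, 0, 0, 0, 0, 0, 1, 0, 1, 0]

raw_agreement = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / len(reviewer_a)
kappa = cohen_kappa(reviewer_a, reviewer_b)
print(f"raw agreement: {raw_agreement:.0%}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Raw agreement looks high when fraud is rare because two reviewers agree on most legitimate orders by default; kappa discounts that baseline, which is why it is often reported alongside the raw rate.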
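You review 120 flagged high-value orders and reviewers confirm fraud in 24 (24 / 120).

1. Compute the point estimate.
2. Compute the Wilson interval with the wilson_interval function for 24, 120.
3. Decide whether this sample establishes that precision is above or below a 25 percent requirement.
4. A routine-domestic slice shows 120 / 1000 fraud. Explain why you cannot replace one slice with the other.
5. A score bucket with predicted risk near 25 percent shows 24 / 120 fraud. Is this enough to declare it miscalibrated?

Solution checks: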
| Item | Check |
|---|---|
| Point estimate | 24 / 120 = 0.20 |
| Wilson interval | approximately [0.138, 0.280] |
| 25 percent requirement | the interval crosses 25 percent, so this sample doesn't establish that precision is below or above the requirement |
| Slice swap | high-value and routine-domestic flags represent different populations; combine them only with a valid sampling/weighting plan |
| Calibration call | no; 25 percent remains plausible within this finite bucket's interval |
The habit is intentionally conservative. Report what your labels support; do not convert weak evidence into a confident product claim.
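The solution checks can be reproduced with the chapter's Wilson interval. This is a small verification sketch; the variable names and the 25 percent comparison value are taken from the exercise, and nothing here decides whether the slice meets a product requirement.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")
    p_hat = successes / n
    denominator = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denominator
    margin = z / denominator * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return max(0.0, center - margin), min(1.0, center + margin)

confirmed, reviewed = 24, 120
comparison_rate = 0.25  # used for both the requirement check and the calibration check

point_estimate = confirmed / reviewed
low, high = wilson_interval(confirmed, reviewed)
crosses = low <= comparison_rate <= high

print(f"point estimate: {point_estimate:.3f}")
print(f"Wilson interval: [{low:.3f}, {high:.3f}]")
print(f"interval contains {comparison_rate:.0%}: {crosses}")
```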
0 / 10 is reported as zero risk. Cause: the quick normal interval collapsed at an edge. Fix: use a binary-rate interval that preserves uncertainty, such as the Wilson score interval.
[1] Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics.
[2] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.
[3] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective.
[4] Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
[5] Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks.