A beginner-first probability article that teaches events, priors, conditional probability, independence, Bayes rule, and base-rate mistakes through one e-commerce order-risk detector story.
Machine learning often gives engineers a number before it gives them a decision. A classifier returns 0.82, a retriever assigns a high similarity score, or a fraud detector flags an order for review. Probability is how you turn those numbers into named claims about an event, a population, and the evidence you observed.
The preceding optimization chapter showed how gradients change model parameters. Once those parameters assign an order a risk score, the next question isn't an optimizer question. It's a probability question: what event became more likely, among which orders, after what evidence?
Machine learning usually gives you a sentence like this:
Given what I observed, this event seems more or less likely.
Probability is the language for making that sentence precise. It doesn't remove uncertainty. It gives you a disciplined way to count under uncertainty.[1][2][3]
We'll first compute a posterior from a pile of orders by hand, then encode the exact accounting in small programs. Near the end, we'll connect the same idea to score calibration and token probabilities without stealing the next chapter's job: statistics will ask how trustworthy estimated rates are when your sample is finite.
We'll use one example for the whole article: a fraud detector that flags e-commerce orders for manual review.
The point isn't fraud detection itself. The point is learning how to read any ML score without fooling yourself.
Before the detector runs, you need a world to count in.
| Name | In this article | Why it matters |
|---|---|---|
| sample space | 10,000 orders | the full world you are counting over |
| event | order is fraudulent | the thing you care about |
| evidence | detector flagged the order | what you observed |
| prior | fraud rate before the flag | belief before new evidence |
| posterior | fraud rate after the flag | belief after new evidence |
If any of those pieces are missing, the number is floating. Floating numbers are how teams turn model scores into bad product decisions.
Imagine 10,000 recent orders from the same product surface.
| Order type | Count | Probability |
|---|---|---|
| Fraudulent | 100 | 0.01 |
| Legitimate | 9,900 | 0.99 |
| Total | 10,000 | 1.00 |
The probability of a fraudulent order is a fraction:

$$P(\text{fraud}) = \frac{100}{10{,}000} = 0.01$$
Read that as: before the detector says anything, 1 percent of orders are fraudulent.
That starting probability is the prior. A prior isn't a guess pulled from nowhere. In an engineering system, it should usually come from a measured population: production logs, labeled eval data, reviewed tickets, or another concrete sample.
In probability language, a random variable is a number that depends on which case you happened to pick. If we let F = 1 when an order is fraudulent and F = 0 when it's legitimate, the expectation E[F] is the long-run average value, each outcome weighted by its probability:

$$E[F] = 1 \cdot P(F = 1) + 0 \cdot P(F = 0) = 1 \cdot 0.01 + 0 \cdot 0.99 = 0.01$$

That 0.01 is the base rate written as an average instead of a fraction. A 0/1 variable like this is called a Bernoulli variable, and its expectation is always just the probability of the 1 outcome.
Expectation alone hides how much outcomes move around. Variance measures the average squared distance from the expectation, Var[F] = E[(F - E[F])^2]. For a Bernoulli variable with success probability p, it simplifies to p(1 - p):

$$\mathrm{Var}[F] = p(1 - p) = 0.01 \times 0.99 = 0.0099$$
Variance names outcome-level spread around the average. Later statistics chapters separate that spread from uncertainty in an estimated rate and from variation across product slices or repeated training runs.
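The phrase "long-run average" can be made concrete with a small simulation. The sketch below assumes independent draws that are each fraudulent with probability 0.01; the seed, the draw count, and the variable names are illustrative choices, not part of the worked example. The sample mean and sample variance should land near 0.01 and 0.0099 without matching them exactly.

```python
import random

# Assumption: draws are independent, each fraudulent with probability 0.01.
rng = random.Random(0)  # fixed seed so the run is repeatable
p_fraud = 0.01
draw_count = 100_000

# Each draw is a Bernoulli outcome: 1 for fraud, 0 for legitimate.
draws = [1 if rng.random() < p_fraud else 0 for _ in range(draw_count)]

sample_mean = sum(draws) / draw_count
sample_variance = sum((d - sample_mean) ** 2 for d in draws) / draw_count

print(f"sample mean:     {sample_mean:.4f} (expectation is {p_fraud:.4f})")
print(f"sample variance: {sample_variance:.4f} (p(1 - p) is {p_fraud * (1 - p_fraud):.4f})")
```

The small gap between the sample values and the exact ones is the finite-sample uncertainty that the statistics chapter takes up.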
Here is the count table written as a Bernoulli indicator. 1 means a fraudulent order and 0 means a legitimate order. The mean of that indicator is the prior.
```python
# 1 marks a fraudulent order, 0 marks a legitimate order.
fraud_indicator = [1] * 100 + [0] * 9_900

# The mean of a 0/1 indicator is the probability of the 1 outcome.
prior = sum(fraud_indicator) / len(fraud_indicator)
variance = sum((value - prior) ** 2 for value in fraud_indicator) / len(fraud_indicator)

print(f"orders: {len(fraud_indicator):,}")
print(f"prior = E[F]: {prior:.4f}")
print(f"Var[F]: {variance:.4f}")

assert prior == 0.01
assert abs(variance - prior * (1 - prior)) < 1e-12
```

```text
orders: 10,000
prior = E[F]: 0.0100
Var[F]: 0.0099
```

Now add the detector.
| If the order is... | Detector behavior | Probability |
|---|---|---|
| Fraudulent | flags it | 0.95 |
| Legitimate | falsely flags it | 0.05 |
That top row is strong. If an order is fraudulent, the detector catches it 95 percent of the time.
But product teams ask a different question after seeing a flag:
Given that this order was flagged, how likely is it fraudulent?
That's a different question. Probability notation makes the difference visible:
| Notation | Plain English |
|---|---|
| $P(\text{flagged} \mid \text{fraud})$ | If an order is fraudulent, how often does the detector flag it? |
| $P(\text{fraud} \mid \text{flagged})$ | If an order is flagged, how often is it actually fraudulent? |
Those two lines aren't interchangeable. Reversing them is one of the most common probability mistakes in ML systems.
Work from the 10,000 orders. Don't start with Bayes rule yet.
Fraudulent orders:

$$100 \times 0.95 = 95 \text{ flagged}$$

Legitimate orders:

$$9{,}900 \times 0.05 = 495 \text{ flagged}$$
Now put the flagged orders into one pile.
| Source of flagged order | Count |
|---|---|
| fraudulent and flagged | 95 |
| legitimate and flagged | 495 |
| all flagged orders | 590 |
The false-positive rate is small, but it acts on the huge legitimate pile. Five percent of 9,900 is larger than 95 percent of 100.
That's why the posterior is surprising:

$$P(\text{fraud} \mid \text{flagged}) = \frac{95}{590} \approx 0.161$$
A flagged order is about 16 percent likely to be fraudulent, not 95 percent likely.
The detector didn't become bad. The question changed. We stopped asking how often fraudulent orders are caught and started asking what lives inside the flagged pile.
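To see the two questions side by side, the following sketch computes both conditionals from the same four counts. The counts of unflagged orders (5 and 9,405) are derived by subtraction from the piles above, and the variable names are illustrative.

```python
# Counts from the worked example: 10,000 orders, 100 of them fraudulent.
fraud_flagged = 95
fraud_not_flagged = 5        # 100 - 95
legit_flagged = 495
legit_not_flagged = 9_405    # 9,900 - 495

# P(flagged | fraud): the world is the fraudulent pile.
flag_rate_given_fraud = fraud_flagged / (fraud_flagged + fraud_not_flagged)

# P(fraud | flagged): the world is the flagged pile.
fraud_rate_given_flag = fraud_flagged / (fraud_flagged + legit_flagged)

print(f"P(flagged | fraud): {flag_rate_given_fraud:.3f}")
print(f"P(fraud | flagged): {fraud_rate_given_flag:.3f}")
```

The numerator is the same 95 orders in both lines. Only the denominator, the pile you count inside, changes.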
Turn the arithmetic into a confusion-table calculation. Notice that the program prints counts before it prints the posterior.
```python
total_orders = 10_000
fraud_orders = 100
legitimate_orders = total_orders - fraud_orders
true_positive_rate = 0.95
false_positive_rate = 0.05

# Counts first: how many orders land in each part of the flagged pile.
true_flags = round(fraud_orders * true_positive_rate)
false_flags = round(legitimate_orders * false_positive_rate)
flagged_orders = true_flags + false_flags

# The posterior is the fraudulent share of the flagged pile.
posterior = true_flags / flagged_orders

print(f"true flags: {true_flags}")
print(f"false flags: {false_flags}")
print(f"all flags: {flagged_orders}")
print(f"P(fraud | flagged): {posterior:.3f}")

assert (true_flags, false_flags, flagged_orders) == (95, 495, 590)
```

```text
true flags: 95
false flags: 495
all flags: 590
P(fraud | flagged): 0.161
```

Conditional probability always narrows the world first.
For $P(\text{fraud} \mid \text{flagged})$, the denominator isn't all 10,000 orders. The denominator is only the 590 flagged orders.
| Probability | World you count inside | Numerator |
|---|---|---|
| $P(\text{fraud})$ | all 10,000 orders | 100 fraudulent orders |
| $P(\text{fraud} \mid \text{flagged})$ | 590 flagged orders | 95 fraudulent flagged orders |
That's the whole mental move. Ask "among which cases?" before you divide.
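The flow runs in one direction: start with all 10,000 orders, narrow to the 590 flagged orders, then count the 95 fraudulent orders inside that pile.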
When a probability problem feels abstract, draw that flow. The formula should summarize the drawing, not replace it.
The following tiny dataset makes conditioning literal: filter to the evidence pile first, then count fraud only inside the filtered records.
```python
# One record per order, built from the four piles in the count table.
orders = (
    [{"fraudulent": True, "flagged": True}] * 95
    + [{"fraudulent": True, "flagged": False}] * 5
    + [{"fraudulent": False, "flagged": True}] * 495
    + [{"fraudulent": False, "flagged": False}] * 9_405
)

# Conditioning is filtering: keep the evidence pile, then count inside it.
flagged = [order for order in orders if order["flagged"]]
fraudulent_flagged = [order for order in flagged if order["fraudulent"]]

print(f"all orders denominator: {len(orders):,}")
print(f"flagged denominator: {len(flagged)}")
print(f"fraud within flagged: {len(fraudulent_flagged) / len(flagged):.3f}")

assert len(flagged) == 590
assert len(fraudulent_flagged) == 95
```

```text
all orders denominator: 10,000
flagged denominator: 590
fraud within flagged: 0.161
```

Two more words tie the counts together.
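A joint probability is the chance that two things happen together. In the population, 95 orders are both fraudulent and flagged, so:

$$P(\text{fraud and flagged}) = \frac{95}{10{,}000} = 0.0095$$

A marginal probability is the chance of one event by itself, ignoring the other. The marginal probability of a flag adds up every way a flag can happen:

$$P(\text{flagged}) = P(\text{fraud and flagged}) + P(\text{legitimate and flagged}) = 0.0095 + 0.0495 = 0.0590$$

Joint, conditional, and marginal probabilities are linked by one identity. When $P(\text{flagged}) > 0$, conditional probability is the joint divided by the world you conditioned on:

$$P(\text{fraud} \mid \text{flagged}) = \frac{P(\text{fraud and flagged})}{P(\text{flagged})}$$

Rearranged, that gives the multiplication rule: a joint probability is a conditional times a marginal.

$$P(\text{fraud and flagged}) = P(\text{fraud} \mid \text{flagged}) \, P(\text{flagged})$$

Check it against the counts: $0.0095 / 0.0590 \approx 0.161$, the same posterior as before. This identity is also where Bayes rule comes from. The next section just reads it from the other direction.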
The same counts can be computed as shares of the full population. The two routes are mathematically identical, but the program below also shows why you should not compare floating-point results with exact equality: the last line is False even though the values match to every printed digit.
```python
total = 10_000
fraudulent_and_flagged = 95
flagged = 590

# Route 1: shares of the full population, then divide.
joint = fraudulent_and_flagged / total
marginal_flagged = flagged / total
conditional = joint / marginal_flagged

# Route 2: count directly inside the flagged pile.
pile_fraction = fraudulent_and_flagged / flagged

print(f"P(fraud and flagged): {joint:.4f}")
print(f"P(flagged): {marginal_flagged:.4f}")
print(f"P(fraud | flagged): {conditional:.3f}")
print(f"same as pile count: {conditional == pile_fraction}")
```

```text
P(fraud and flagged): 0.0095
P(flagged): 0.0590
P(fraud | flagged): 0.161
same as pile count: False
```

Bayes rule is the formula version of the flagged-pile count.
For events $A$ and $B$ with $P(B) > 0$:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$

In this article:
| Symbol | Meaning | Value |
|---|---|---|
| $A$ | order is fraudulent | |
| $B$ | detector flagged the order | |
| $P(A)$ | fraud base rate | 0.01 |
| $P(B \mid A)$ | true-positive rate | 0.95 |
| $P(B \mid \neg A)$ | false-positive rate | 0.05 |

When $B$ is the observed evidence, $P(B \mid A)$ is its likelihood under hypothesis $A$. Here it asks how likely a flag would be if the order really were fraudulent. Bayes rule combines that likelihood with the prior to obtain the posterior.
The denominator $P(B)$ means "how often does a flag happen at all?"
It includes true flags and false alarms:

$$P(B) = P(B \mid A) \, P(A) + P(B \mid \neg A) \, P(\neg A)$$

The denominator is the marginal probability of a flag. It averages over both kinds of orders, weighted by how common each kind is.
Plug in the numbers:

$$P(B) = 0.95 \times 0.01 + 0.05 \times 0.99 = 0.0095 + 0.0495 = 0.059$$

Now compute the posterior:

$$P(A \mid B) = \frac{0.0095}{0.059} \approx 0.161$$
Same answer as the count table. Bayes rule is the compact form of the same accounting: track where the flagged orders came from before you divide.
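One way to confirm that the formula and the count table agree exactly, rather than to three decimals, is to run Bayes rule in rational arithmetic. This is a minimal sketch using Python's standard `fractions` module; the function name `bayes_posterior` and its parameter names are illustrative.

```python
from fractions import Fraction

def bayes_posterior(prior, likelihood, false_alarm_rate):
    """Bayes rule for one binary hypothesis and one binary observation."""
    # Marginal probability of the evidence: true flags plus false alarms.
    evidence = likelihood * prior + false_alarm_rate * (1 - prior)
    if evidence == 0:
        raise ValueError("evidence probability must be greater than 0")
    return likelihood * prior / evidence

posterior = bayes_posterior(
    prior=Fraction(1, 100),
    likelihood=Fraction(95, 100),
    false_alarm_rate=Fraction(5, 100),
)

print(f"exact posterior: {posterior}")
print(f"as a decimal:    {float(posterior):.3f}")

# 95/590 reduces to 19/118, so the formula matches the pile count exactly.
assert posterior == Fraction(95, 590)
```

Because `Fraction` keeps exact numerators and denominators, the equality check is safe here in a way it would not be with floats.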
Evidence helps when it changes the event rate.
When $P(B) > 0$, seeing $B$ leaves the probability of $A$ unchanged if the events are independent:

$$P(A \mid B) = P(A)$$
Imagine a broken fraud detector:
| If the order is... | Broken detector flags it |
|---|---|
| Fraudulent | 20 percent |
| Legitimate | 20 percent |
Out of 10,000 orders, this detector produces:
| Source of flagged order | Count |
|---|---|
| fraudulent and flagged | 20 |
| legitimate and flagged | 1,980 |
| all flagged orders | 2,000 |
The flagged pile is still 1 percent fraudulent:

$$P(\text{fraud} \mid \text{flagged}) = \frac{20}{2{,}000} = 0.01$$
The flag created work, but it didn't create information. Useful ML signals are useful because they change the rate of the event you care about.
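Independence can also be checked straight from counts using the product form: two events are independent when the joint probability equals the product of the two marginals. The sketch below applies that test to the broken detector's counts and, for contrast, to the working detector's counts from earlier. The helper name `is_independent` is illustrative, and exact fractions are used so the equality test is reliable.

```python
from fractions import Fraction

def is_independent(joint_count, event_count, evidence_count, total):
    """Check P(event and evidence) == P(event) * P(evidence) with exact fractions."""
    p_joint = Fraction(joint_count, total)
    p_event = Fraction(event_count, total)
    p_evidence = Fraction(evidence_count, total)
    return p_joint == p_event * p_evidence

# Broken detector: 20 fraudulent flagged orders out of 2,000 flags.
broken_detector = is_independent(joint_count=20, event_count=100, evidence_count=2_000, total=10_000)

# Working detector: 95 fraudulent flagged orders out of 590 flags.
working_detector = is_independent(joint_count=95, event_count=100, evidence_count=590, total=10_000)

print(f"broken detector flag independent of fraud:  {broken_detector}")
print(f"working detector flag independent of fraud: {working_detector}")
```

A flag that passes this independence test carries no information about fraud, which is exactly what makes the broken detector useless.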
Code makes the "no update" claim testable. If both groups are flagged at the same rate, that rate cancels from Bayes rule.
```python
def posterior_if_flagged(prior, true_positive_rate, false_positive_rate):
    # Shares of the whole population that end up flagged, by source.
    true_flags = true_positive_rate * prior
    false_flags = false_positive_rate * (1 - prior)
    return true_flags / (true_flags + false_flags)

prior = 0.01
# Both groups are flagged at the same 20 percent rate.
posterior = posterior_if_flagged(prior, 0.20, 0.20)

print(f"prior: {prior:.3f}")
print(f"posterior: {posterior:.3f}")
print(f"update: {posterior - prior:+.3f}")

assert abs(posterior - prior) < 1e-12
```

```text
prior: 0.010
posterior: 0.010
update: +0.000
```

Keep the detector fixed:
| Detector property | Value |
|---|---|
| true-positive rate | 0.95 |
| false-positive rate | 0.05 |
Now change only the population.
| Fraud base rate | Posterior after flag | What changed? |
|---|---|---|
| 1 percent | about 16 percent | legitimate orders dominate the flagged pile |
| 10 percent | about 68 percent | true flags become a much larger share |
| 50 percent | about 95 percent | both classes are equally common before evidence |
The model didn't change. The world around the model changed.
This is why the same classifier can behave differently across products, countries, languages, traffic sources, or time periods. A score without a population is like a map without a scale. It may look precise, but it isn't enough to act carefully.
The code should read like the table:
- True flags are `true_positive * prior`.
- False flags are `false_positive * (1 - prior)`.

Put this in `probability_demo.py`. It is small enough to audit line by line, but already exposes the input validation and evidence guard that production code needs.
```python
def check_probability(x, name):
    if not 0 <= x <= 1:
        raise ValueError(f"{name} must be between 0 and 1")

def flagged_posterior(prior, true_positive, false_positive):
    check_probability(prior, "prior")
    check_probability(true_positive, "true_positive")
    check_probability(false_positive, "false_positive")

    true_flags = true_positive * prior
    false_flags = false_positive * (1 - prior)
    all_flags = true_flags + false_flags

    # If a flag can never happen, the conditional probability is undefined.
    if all_flags == 0:
        raise ValueError("evidence probability must be greater than 0")

    return true_flags / all_flags

def main():
    priors = [0.01, 0.10, 0.50]

    for prior in priors:
        posterior = flagged_posterior(prior, 0.95, 0.05)
        print(prior, round(posterior, 3))

    # An out-of-range prior must fail loudly.
    try:
        flagged_posterior(1.4, 0.95, 0.05)
    except ValueError as error:
        print(error)

if __name__ == "__main__":
    main()
```

```text
0.01 0.161
0.1 0.679
0.5 0.95
prior must be between 0 and 1
```

The detector stayed fixed. The prior changed, so the posterior changed.
A deployed detector rarely emits only flag or not flag. It emits a score, and a threshold creates the flag. Raising a threshold normally sends fewer orders to reviewers and may make the flagged pile cleaner, but it can also miss more fraudulent orders.
Use an illustrative measurement table from the same 1 percent fraud population. These rates would need to be measured on labeled data in a real system.
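| Threshold policy | Recall, $P(\text{flagged} \mid \text{fraud})$ | False-positive rate, $P(\text{flagged} \mid \text{legitimate})$ |
|---|---|---|
| broad review | 0.95 | 0.05 |
| stricter review | 0.80 | 0.01 |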
```python
def review_metrics(prior, recall, false_positive_rate):
    true_flags = recall * prior
    false_flags = false_positive_rate * (1 - prior)
    # Share of all orders sent to manual review.
    review_rate = true_flags + false_flags
    if review_rate == 0:
        raise ValueError("review rate must be greater than 0")
    # Share of reviewed orders that are actually fraudulent.
    precision = true_flags / review_rate
    return review_rate, precision

prior = 0.01
policies = [
    ("broad review", 0.95, 0.05),
    ("stricter review", 0.80, 0.01),
]

for name, recall, false_positive_rate in policies:
    review_rate, precision = review_metrics(prior, recall, false_positive_rate)
    print(
        f"{name:16} review={review_rate:6.2%} "
        f"fraud_in_queue={precision:6.2%} recall={recall:6.2%}"
    )

# A policy that never flags anything has no queue to measure.
try:
    review_metrics(prior, recall=0.0, false_positive_rate=0.0)
except ValueError as error:
    print("empty queue:", error)
```

```text
broad review     review= 5.90% fraud_in_queue=16.10% recall=95.00%
stricter review  review= 1.79% fraud_in_queue=44.69% recall=80.00%
empty queue: review rate must be greater than 0
```

The stricter policy reduces review load and improves the fraction of reviewed orders that are fraud, which is precision, but it catches fewer fraud cases, which lowers recall. This is a precision-recall tradeoff. Probability exposes the tradeoff; product cost and safety policy choose among the options. A threshold can also send no orders to review. In that case, queue precision is undefined because there is no flagged pile, so code should reject or explicitly represent the empty queue instead of dividing by zero.
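Rejecting the empty queue with an exception is one option. The other option, representing it explicitly, might look like the sketch below, where `None` stands for "precision is undefined". The function name `queue_precision` and the "never flag" policy are illustrative assumptions.

```python
from typing import Optional

def queue_precision(prior: float, recall: float, false_positive_rate: float) -> Optional[float]:
    """Return the fraud share of the review queue, or None when the queue is empty."""
    true_flags = recall * prior
    false_flags = false_positive_rate * (1 - prior)
    review_rate = true_flags + false_flags
    if review_rate == 0:
        return None  # no flagged pile, so precision is undefined
    return true_flags / review_rate

for name, recall, false_positive_rate in [
    ("stricter review", 0.80, 0.01),
    ("never flag", 0.0, 0.0),  # hypothetical policy that flags nothing
]:
    precision = queue_precision(0.01, recall, false_positive_rate)
    label = "undefined (empty queue)" if precision is None else f"{precision:.2%}"
    print(f"{name}: {label}")
```

Returning `None` forces every caller to handle the empty case before formatting or comparing the value, whereas a silent `0.0` would read as "the queue is all false alarms", which is a different claim.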
Every line in flagged_posterior has a probability meaning.
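| Code | Probability meaning |
|---|---|
| `prior` | $P(\text{fraud})$ |
| `true_positive` | $P(\text{flagged} \mid \text{fraud})$ |
| `false_positive` | $P(\text{flagged} \mid \text{legitimate})$ |
| `true_flags` | true flagged share of the population |
| `false_flags` | false flagged share of the population |
| `all_flags` | $P(\text{flagged})$ |
| `true_flags / all_flags` | $P(\text{fraud} \mid \text{flagged})$ |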
The guard for all_flags == 0 matters. If evidence never appears, the conditional probability is undefined. Production code should reject that case instead of inventing a number.
The function needs a contract: the known worked example must remain correct, invalid inputs must fail, and impossible evidence must fail instead of returning an invented posterior.
```python
import math

def flagged_posterior(prior, true_positive, false_positive):
    values = {"prior": prior, "true_positive": true_positive, "false_positive": false_positive}
    for name, value in values.items():
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    true_flags = true_positive * prior
    false_flags = false_positive * (1 - prior)
    if true_flags + false_flags == 0:
        raise ValueError("evidence probability must be greater than 0")
    return true_flags / (true_flags + false_flags)

# The worked example must stay correct.
assert math.isclose(flagged_posterior(0.01, 0.95, 0.05), 95 / 590)

# An invalid prior and impossible evidence must both raise.
for args in [(1.4, 0.95, 0.05), (0.01, 0.0, 0.0)]:
    try:
        flagged_posterior(*args)
    except ValueError as error:
        print(error)
    else:
        raise AssertionError("invalid probability case did not fail")

print("posterior calculation checks passed")
```

```text
prior must be between 0 and 1
evidence probability must be greater than 0
posterior calculation checks passed
```

The first test protects the numeric story. The second protects the failure path. Both are part of learning probability as an engineering skill, not just as a formula.
The fraud example is one surface.
| ML system | Event | Evidence | Base-rate question |
|---|---|---|---|
| fraud detector | order is fraudulent | risk score above threshold | how common is fraud for this product category? |
| product retrieval | product is relevant | embedding similarity above threshold | how many returned products match the query intent? |
| label judge | predicted label is correct | automated judge marks it correct | how often does the judge agree with human reviewers? |
| delivery anomaly detector | delivery is delayed | anomaly score is high | how common are delays for this route or season? |
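The same habit works everywhere: name the event, name the population, name the evidence, and ask the base-rate question before you act on the score.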
Skip those steps and a score starts pretending to be a conclusion.
So far, the evidence was a thresholded flag and all rates were given. A model may instead emit a score such as 0.80. That score earns the interpretation "80 percent probability of fraud" only if similarly scored orders are fraudulent about 80 percent of the time. That property is calibration. Modern neural classifiers can be accurate while still producing poorly calibrated confidence scores, which is why score calibration is measured rather than assumed.[4]
This toy bucket demonstrates the check. Every order was assigned a score near 0.80, but only half of these labeled orders were actually fraudulent.
```python
predicted_risk = [0.80, 0.82, 0.78, 0.81, 0.79, 0.80]
observed_fraud = [1, 0, 1, 0, 1, 0]

# What the scores claim versus what the labels show.
advertised_risk = sum(predicted_risk) / len(predicted_risk)
observed_rate = sum(observed_fraud) / len(observed_fraud)
gap = advertised_risk - observed_rate

print(f"average predicted risk: {advertised_risk:.0%}")
print(f"observed fraud rate: {observed_rate:.0%}")
print(f"calibration gap: {gap:.0%}")
```

```text
average predicted risk: 80%
observed fraud rate: 50%
calibration gap: 30%
```

Six orders can't prove how a deployed model is calibrated. The calculation names the question. Statistics will teach how much evidence you need before trusting the measured gap.
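The same check extends from one bucket to several by grouping labeled orders by score and comparing each group's average score with its observed fraud rate. The sketch below uses made-up records and four equal-width buckets; both are assumptions for illustration, and a handful of records per bucket is far too few to judge a real model.

```python
from collections import defaultdict

# Hypothetical labeled records: (predicted risk, observed label).
records = [
    (0.12, 0), (0.08, 0), (0.11, 0), (0.09, 1),
    (0.40, 0), (0.42, 1), (0.38, 0), (0.40, 1),
    (0.80, 1), (0.82, 0), (0.78, 1), (0.81, 0),
]

bucket_count = 4  # assumption: equal-width buckets over [0, 1]
buckets = defaultdict(list)
for score, label in records:
    # A score of exactly 1.0 is clamped into the top bucket.
    index = min(int(score * bucket_count), bucket_count - 1)
    buckets[index].append((score, label))

gaps = {}
for index in sorted(buckets):
    rows = buckets[index]
    mean_score = sum(score for score, _ in rows) / len(rows)
    observed_rate = sum(label for _, label in rows) / len(rows)
    gaps[index] = mean_score - observed_rate
    print(
        f"bucket {index}: n={len(rows)} "
        f"mean score={mean_score:.2f} observed={observed_rate:.2f} gap={gaps[index]:+.2f}"
    )
```

A positive gap means the scores in that bucket overstate the observed rate, and a negative gap means they understate it.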
This chapter used one binary event, but an LLM produces a categorical distribution over the next token. For a known target token, training cares about the probability assigned to that target. With a one-hot target, that token's cross-entropy contribution is its negative log probability, -log(p).[3]
```python
import math

# Cross-entropy contribution of one target token is -log(p).
for target_probability in [0.90, 0.50, 0.01]:
    loss = -math.log(target_probability)
    print(f"target probability={target_probability:>4.2f} loss={loss:>5.3f}")
```

```text
target probability=0.90 loss=0.105
target probability=0.50 loss=0.693
target probability=0.01 loss=4.605
```

Low probability for the observed token costs more because the observation was more surprising under the model.
For a sequence, the model's joint probability follows the chain rule: multiply the conditional probability of each next token given the previous tokens.

$$P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$$

Here, $x_t$ is the token at position $t$, and the expression to the right of the bar is the preceding context. Multiplying many small probabilities eventually underflows in floating-point arithmetic. Logs convert the product into a stable sum.
```python
import math

token_probability = 0.01
token_count = 200

# Direct multiplication underflows: 0.01 ** 200 is below the smallest float.
raw_product = token_probability ** token_count
# The same quantity in log space is an ordinary finite number.
log_probability = token_count * math.log(token_probability)

print(f"raw product in float: {raw_product}")
print(f"log probability: {log_probability:.1f}")
print(f"finite in log space: {math.isfinite(log_probability)}")
```

```text
raw product in float: 0.0
log probability: -921.0
finite in log space: True
```

The 0.0 doesn't mean the sequence was impossible. It means ordinary floating-point multiplication lost a representable nonzero number. Later language-modeling chapters use this same log-space habit for cross-entropy and perplexity.
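Real sequences have a different probability at every position, so the log-space version is a sum over per-token terms, not a single multiplication. The sketch below uses five made-up token probabilities; the values are assumptions chosen so the direct product is still representable and the two routes can be compared.

```python
import math

# Hypothetical probabilities a model assigned to each observed token, in order.
token_probabilities = [0.90, 0.50, 0.25, 0.80, 0.01]

# Chain rule in log space: the log of a product is the sum of the logs.
log_probability = sum(math.log(p) for p in token_probabilities)

# Direct product, safe here only because the sequence is short.
direct_product = math.prod(token_probabilities)

# Average per-token loss: the mean of the -log(p) terms.
mean_loss = -log_probability / len(token_probabilities)

print(f"sequence log probability: {log_probability:.3f}")
print(f"exp(log probability):     {math.exp(log_probability):.6f}")
print(f"direct product:           {direct_product:.6f}")
print(f"mean per-token loss:      {mean_loss:.3f}")
```

The single 0.01 token contributes more to the total loss than the other four tokens combined, which is the sequence-level version of "surprising observations cost more".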
Most probability bugs are question bugs.
| Symptom | Mistake | Better move |
|---|---|---|
| "The flag means 95 percent fraudulent." | reversed the conditional probabilities | write both questions in plain English |
| Posterior feels too high | ignored rare base rate | start from counts before formulas |
| Denominator is all orders | forgot conditioning | denominator should be the evidence pile |
| One threshold used everywhere | ignored population shift | recompute base rates per product slice |
| Model confidence treated as truth | event never named | define the event and compare to labels |
| Code returns a number for impossible evidence | divided by zero-probability evidence | reject undefined cases loudly |
| A score of 0.80 is used as 80 percent risk without checking labels | calibration was assumed | bucket predictions and compare score with observed rate |
| A long sequence gets probability 0.0 in code | small probabilities were multiplied directly | sum log probabilities instead |
The debugging question is short:
Among which cases am I counting?
If you can answer that, the formula usually becomes much easier.
Use the same 10,000-order population.
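- Change the false-positive rate from `0.05` to `0.01`, then recount the false flags.
- Keep the true-positive rate at `0.95` and recompute the posterior.
- Compute `-log(0.80)` and `-log(0.10)`. Which observed token is more surprising to a model?

Then translate the lesson to product search: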
| Product search version | Your answer |
|---|---|
| event | product is truly relevant |
| population | candidate products returned for a query class |
| evidence | similarity score above threshold |
| prior | relevance rate before thresholding |
| posterior | relevance rate after thresholding |
The goal isn't to memorize the fraud numbers. The goal is to carry the counting habit into any model score.
Check your work after you try the practice.
| Practice item | Answer |
|---|---|
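| Prior | $P(\text{fraud}) = 100 / 10{,}000 = 0.01$ |
| Original false flags | $9{,}900 \times 0.05 = 495$ |
| New false flags | $9{,}900 \times 0.01 = 99$ |
| New posterior | $95 / (95 + 99) = 95 / 194 \approx 0.490$ |
| Why it changed | the false-alarm pile got smaller, so true flags became a larger share of all flags |
| Threshold tradeoff | broad review catches more fraud; stricter review produces a cleaner, smaller queue |
| Token loss comparison | $-\log(0.80) \approx 0.223$ and $-\log(0.10) \approx 2.303$; the 0.10 target is more surprising |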
If your explanation starts with Bayes rule, translate it back into piles. A good answer can move between counts, words, code, and notation.
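If you want to check your practice answers mechanically, a short script along these lines recomputes them from the same 10,000-order population. The variable names are illustrative, and the logs are natural logs, matching the earlier token-loss example.

```python
import math

total_orders = 10_000
fraud_orders = 100
legitimate_orders = total_orders - fraud_orders

# True flags stay fixed because the true-positive rate stays at 0.95.
true_flags = round(fraud_orders * 0.95)

# False flags before and after the false-positive rate drops from 0.05 to 0.01.
original_false_flags = round(legitimate_orders * 0.05)
new_false_flags = round(legitimate_orders * 0.01)

new_posterior = true_flags / (true_flags + new_false_flags)

# Token losses for a likely and an unlikely observed token.
loss_common = -math.log(0.80)
loss_rare = -math.log(0.10)

print(f"original false flags: {original_false_flags}")
print(f"new false flags:      {new_false_flags}")
print(f"new posterior:        {new_posterior:.3f}")
print(f"-log(0.80) = {loss_common:.3f}, -log(0.10) = {loss_rare:.3f}")
```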
Probability turns model scores into named claims.
For an e-commerce product, a useful probability statement sounds like this:
Event: order is fraudulent. Population: electronics orders in the last 30 days. Evidence: fraud detector score above threshold. Decision: send to manual review when posterior risk is above 20 percent.
That sentence is longer than "score is high," but it's much safer.
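A named claim like that can be carried in code so the event, population, evidence, and decision rule travel together with the number. The sketch below is one possible shape, not a prescribed design: the `RiskClaim` class is hypothetical, the detector rates are the ones from this article, and the 10 percent slice is an assumed second population added for contrast.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RiskClaim:
    """A probability statement with its event, population, and evidence named."""
    event: str
    population: str
    evidence: str
    prior: float                # base rate measured on this population
    true_positive_rate: float
    false_positive_rate: float
    review_threshold: float     # send to review when the posterior exceeds this

    def posterior(self) -> float:
        true_flags = self.true_positive_rate * self.prior
        false_flags = self.false_positive_rate * (1 - self.prior)
        all_flags = true_flags + false_flags
        if all_flags == 0:
            raise ValueError("evidence probability must be greater than 0")
        return true_flags / all_flags

    def send_to_review(self) -> bool:
        return self.posterior() > self.review_threshold

low_base = RiskClaim(
    event="order is fraudulent",
    population="electronics orders in the last 30 days",
    evidence="fraud detector score above threshold",
    prior=0.01,
    true_positive_rate=0.95,
    false_positive_rate=0.05,
    review_threshold=0.20,
)
# Assumption: a second slice with a 10 percent base rate and the same detector.
high_base = RiskClaim(
    event=low_base.event,
    population="hypothetical high-risk slice",
    evidence=low_base.evidence,
    prior=0.10,
    true_positive_rate=0.95,
    false_positive_rate=0.05,
    review_threshold=0.20,
)

for claim in (low_base, high_base):
    print(f"{claim.population}: posterior={claim.posterior():.3f} review={claim.send_to_review()}")
```

With the same detector and the same 20 percent rule, the 1 percent population falls below the threshold and the 10 percent slice clears it, so the decision depends on the population as much as on the score.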