A beginner-first probability chapter that starts with counts, then teaches conditional probability, Bayes rule, and base-rate mistakes through a moderation-filter example.
Probability is what you use when you don't have perfect information.
In ordinary programming, a branch is often clean:
```python
if user_is_admin:
    print("show admin panel")
    show_admin_panel()
```
Machine learning is messier. A model says "this looks 82 percent likely." A safety filter says "this message looks risky." A retrieval system says "this document seems relevant." Those aren't facts. They are uncertain claims about events.
This chapter teaches the first habit of probability: name the event, count the cases, then update when new evidence arrives.[1][2][3]
| Step | Question | What you should be able to do |
|---|---|---|
| 1 | What can happen? | Define the event in plain English. |
| 2 | How often does it happen? | Turn counts into a probability. |
| 3 | What evidence did we see? | Use conditional probability. |
| 4 | How should belief change? | Apply Bayes rule. |
| 5 | What can go wrong? | Spot base-rate mistakes before shipping. |
You already practiced Python and arrays. Now the same habit becomes statistical: make the invisible assumption visible.
Imagine an LLM moderation filter.
Out of 1,000 messages:
| Message type | Count |
|---|---|
| Unsafe | 10 |
| Safe | 990 |
| Total | 1,000 |
Before the filter says anything, unsafe messages are rare:

P(unsafe) = 10 / 1,000 = 0.01

That number is called the prior. It is the probability before new evidence.
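As a quick check, the prior falls straight out of the counts:

```python
# Prior from raw counts: unsafe messages out of all messages.
unsafe_count = 10
total_count = 1000
prior = unsafe_count / total_count
print(prior)  # 0.01
```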
Now suppose the filter is pretty good:
| If message is... | Filter flags it how often? |
|---|---|
| Unsafe | 95 percent |
| Safe | 5 percent |
Beginner guess: "If the filter flags a message, it is probably unsafe."
Careful answer: not necessarily. Safe messages are so common that false alarms can dominate the flagged pile.
Let's count the flagged messages out of the same 1,000 messages.
Unsafe messages: 10 × 0.95 = 9.5 expected flags.

Safe messages: 990 × 0.05 = 49.5 expected flags.

So the filter flags about 59 messages: 9.5 + 49.5 = 59.

Only 9.5 of those flagged messages are actually unsafe: 9.5 / 59 ≈ 0.16.

The result is about 16 percent, not 95 percent.
That is the base-rate lesson. A strong signal can still be less convincing than it feels when the event is rare.
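The whole count fits in a few lines. A minimal sketch using the numbers from the tables above:

```python
# Expected flags out of 1,000 messages.
unsafe_flagged = 10 * 0.95   # unsafe messages that get flagged: 9.5
safe_flagged = 990 * 0.05    # safe messages that get flagged: 49.5
all_flagged = unsafe_flagged + safe_flagged  # about 59 flagged messages

print(round(unsafe_flagged / all_flagged, 3))  # 0.161, not 0.95
```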
Give each symbol a concrete referent:

- A: message is unsafe.
- B: filter flagged the message.

Keep every word tied to the moderation story. If a term can't point to a row, count, or event, it is still floating.
Conditional probability asks:

P(A | B)
Read it as:
Among the cases where B happened, what fraction also had A?
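In code, a conditional probability is just a ratio of counts. A minimal sketch (the helper name is illustrative, not from the chapter):

```python
def conditional_probability(count_a_and_b, count_b):
    # P(A | B): among the cases where B happened, the fraction that also had A.
    return count_a_and_b / count_b

# From the flagged pile: 9.5 flagged-and-unsafe messages out of 59 flagged.
print(round(conditional_probability(9.5, 59.0), 3))  # 0.161
```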
Bayes rule turns the question around:

P(A | B) = P(B | A) × P(A) / P(B)
For the moderation filter:

P(unsafe | flagged) = P(flagged | unsafe) × P(unsafe) / P(flagged)
= (0.95 × 0.01) / (0.95 × 0.01 + 0.05 × 0.99)
≈ 0.16
This isn't a trick formula. It is just careful counting.
Put this in probability_demo.py:
```python
def flagged_posterior(prior, true_positive, false_positive):
    real_flags = true_positive * prior
    false_flags = false_positive * (1 - prior)
    all_flags = real_flags + false_flags
    return real_flags / all_flags

posterior = flagged_posterior(
    prior=0.01,
    true_positive=0.95,
    false_positive=0.05,
)

print(round(posterior, 3))
```
Expected output:
```text
0.161
```
Read the code like a word problem:
- `prior=0.01` means 1 percent of messages are unsafe before filtering.
- `true_positive=0.95` means unsafe messages are usually flagged.
- `false_positive=0.05` means safe messages are sometimes flagged.
- `real_flags / all_flags` means "of flagged messages, how many are truly unsafe?"

Now change only `prior`:
```python
for prior in [0.01, 0.10, 0.50]:
    print(prior, round(flagged_posterior(prior, 0.95, 0.05), 3))
```
The detector didn't change. Only the base rate changed. The answer should move a lot.
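If your function matches the one above, the loop prints approximately:

```text
0.01 0.161
0.1 0.679
0.5 0.95
```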
Keep the filter quality fixed:
| Setting | Value |
|---|---|
| True-positive rate | 0.95 |
| False-positive rate | 0.05 |
Now change only the world the filter lives in:
| Unsafe base rate | Posterior after flag |
|---|---|
| 1 percent | about 16 percent |
| 10 percent | about 68 percent |
| 50 percent | about 95 percent |
Same detector. Different population. Different answer.
This is why product context matters. A fraud model, safety filter, and spam classifier can share the same math but make different decisions because their base rates differ.
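A sketch makes the point concrete. The base rates below are made up for illustration, not measured; the code reuses `flagged_posterior` from above:

```python
# Illustrative base rates only; a real product would measure its own.
base_rates = {"fraud": 0.02, "safety": 0.01, "spam": 0.30}
for product, base_rate in base_rates.items():
    print(product, round(flagged_posterior(base_rate, 0.95, 0.05), 3))
```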
The common trap is mixing up these two questions:
| Question | Meaning |
|---|---|
| If the message is unsafe, how often does the filter catch it? | P(flagged \| unsafe) |
| If the filter flagged it, how likely is it unsafe? | P(unsafe \| flagged) |
They look similar, but they answer opposite questions.
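Printing both numbers side by side makes the gap hard to miss; this reuses `flagged_posterior` from above:

```python
p_flag_given_unsafe = 0.95  # filter quality: how often unsafe messages get flagged
p_unsafe_given_flag = flagged_posterior(0.01, 0.95, 0.05)  # about 0.161

print(p_flag_given_unsafe, round(p_unsafe_given_flag, 3))  # 0.95 vs 0.161
```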
This mistake appears constantly in LLM products: a 95 percent catch rate gets quoted as if it were the probability that a flagged message is truly unsafe.
Add a guard before you trust this function:
```python
def test_flagged_posterior():
    result = flagged_posterior(0.01, 0.95, 0.05)
    assert 0.15 < result < 0.17
```
Then add input checks:
```python
def check_probability(x, name):
    assert 0 <= x <= 1, f"{name} must be between 0 and 1"
```
Probability code should fail loudly when a value like 1.4 enters the system. Silent invalid probabilities create fake certainty.
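One way to wire the guard in is to validate every argument before any arithmetic. A sketch, assuming the `check_probability` helper above:

```python
def flagged_posterior(prior, true_positive, false_positive):
    # Fail loudly on invalid inputs before computing anything.
    check_probability(prior, "prior")
    check_probability(true_positive, "true_positive")
    check_probability(false_positive, "false_positive")
    real_flags = true_positive * prior
    false_flags = false_positive * (1 - prior)
    return real_flags / (real_flags + false_flags)
```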
Two quick exercises:

- Compute P(unsafe) from the table without code.
- Change `false_positive` from 0.05 to 0.01 and rerun the script.

Before using probability in a product decision, write down:

- The event you care about.
- The population it lives in.
- The evidence you observed.
- The decision the probability will drive.
For an LLM system, this might be:
- Event: answer contains a policy violation.
- Population: support-chat answers in English.
- Evidence: safety classifier score above threshold.
- Decision: send answer to human review if posterior risk is above 20 percent.

That list is boring, but it is what makes probability useful.
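Translated into code, the decision line becomes a one-line rule. A sketch; the names are illustrative and the 20 percent threshold comes from the example above:

```python
REVIEW_THRESHOLD = 0.20  # from the decision rule above; tune per product

def should_send_to_review(posterior_risk):
    # Human review when posterior risk clears the threshold.
    return posterior_risk > REVIEW_THRESHOLD

# The 1 percent world from earlier gives about 0.161, below the bar.
print(should_send_to_review(flagged_posterior(0.01, 0.95, 0.05)))  # False
```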
Next, continue to Statistics and Uncertainty. Probability gave you a way to reason about uncertain events. Statistics asks how much you should trust a probability estimated from data.