LeetLLM
© 2026 LeetLLM. All rights reserved.


Probability for Machine Learning

A beginner-first probability article that teaches events, priors, conditional probability, independence, Bayes rule, and base-rate mistakes through one e-commerce order-risk detector story.

19 min read
Learning path
Step 11 of 155 in the full curriculum

Machine learning often gives engineers a number before it gives them a decision. A classifier returns 0.82, a retriever assigns a high similarity score, or a fraud detector flags an order for review. Probability is how you turn those numbers into named claims about an event, a population, and the evidence you observed.

Why probability shows up in ML

The preceding optimization chapter showed how gradients change model parameters. Once those parameters assign an order a risk score, the next question isn't an optimizer question. It's a probability question: what event became more likely, among which orders, after what evidence?

Machine learning usually gives you a sentence like this:

Given what I observed, this event seems more or less likely.

Probability is the language for making that sentence precise. It doesn't remove uncertainty. It gives you a disciplined way to count under uncertainty.[1][2][3]

We'll first compute a posterior from a pile of orders by hand, then encode the exact accounting in small programs. Near the end, we'll connect the same idea to score calibration and token probabilities without stealing the next chapter's job: statistics will ask how trustworthy estimated rates are when your sample is finite.

One running example

We'll use one example for the whole article: a fraud detector that flags e-commerce orders for manual review.

The point isn't fraud detection itself. The point is learning how to read any ML score without fooling yourself.

Figure: Flagged orders narrow to 95 fraud cases out of 590 flags: 16 percent. The posterior uses fraud among flagged orders, not all orders.

Before the detector runs, you need a world to count in.

| Name | In this article | Why it matters |
|---|---|---|
| sample space | 10,000 orders | the full world you are counting over |
| event | order is fraudulent | the thing you care about |
| evidence | detector flagged the order | what you observed |
| prior | fraud rate before the flag | belief before new evidence |
| posterior | fraud rate after the flag | belief after new evidence |

If any of those pieces are missing, the number is floating. Floating numbers are how teams turn model scores into bad product decisions.

Start with the population

Imagine 10,000 recent orders from the same product surface.

| Order type | Count | Probability |
|---|---|---|
| Fraudulent | 100 | 0.01 |
| Legitimate | 9,900 | 0.99 |
| Total | 10,000 | 1.00 |

The probability of a fraudulent order is a fraction:

$$P(\text{fraudulent}) = \frac{100}{10000} = 0.01$$

Read that as: before the detector says anything, 1 percent of orders are fraudulent.

That starting probability is the prior. A prior isn't a guess pulled from nowhere. In an engineering system, it should usually come from a measured population: production logs, labeled eval data, reviewed tickets, or another concrete sample.

In probability language, a random variable is a number that depends on which case you happened to pick. If we let F = 1 when an order is fraudulent and F = 0 when it's legitimate, the expectation E[F] is the long-run average value, each outcome weighted by its probability:

$$E[F] = 1 \times 0.01 + 0 \times 0.99 = 0.01$$

That 0.01 is the base rate written as an average instead of a fraction. A 0/1 variable like this is called a Bernoulli variable, and its expectation is always just the probability of the 1 outcome.

Expectation alone hides how much outcomes move around. Variance measures the average squared distance from the expectation, Var[F] = E[(F - E[F])^2]. For a Bernoulli variable with success probability p, it simplifies to p(1 - p):

$$\text{Var}[F] = 0.01 \times (1 - 0.01) = 0.0099$$

Variance names outcome-level spread around the average. Later statistics chapters separate that spread from uncertainty in an estimated rate and from variation across product slices or repeated training runs.

Here is the count table written as a Bernoulli indicator. 1 means a fraudulent order and 0 means a legitimate order. The mean of that indicator is the prior.

indicator-mean-is-probability.py

    fraud_indicator = [1] * 100 + [0] * 9_900

    prior = sum(fraud_indicator) / len(fraud_indicator)
    variance = sum((value - prior) ** 2 for value in fraud_indicator) / len(fraud_indicator)

    print(f"orders: {len(fraud_indicator):,}")
    print(f"prior = E[F]: {prior:.4f}")
    print(f"Var[F]: {variance:.4f}")

    assert prior == 0.01
    assert abs(variance - prior * (1 - prior)) < 1e-12

Indicator summary

    orders: 10,000
    prior = E[F]: 0.0100
    Var[F]: 0.0099

Evidence changes the question

Now add the detector.

| If the order is... | Detector behavior | Probability |
|---|---|---|
| Fraudulent | flags it | 0.95 |
| Legitimate | falsely flags it | 0.05 |

That top row is strong. If an order is fraudulent, the detector catches it 95 percent of the time.

But product teams ask a different question after seeing a flag:

Given that this order was flagged, how likely is it fraudulent?

That's a different question. Probability notation makes the difference visible:

| Notation | Plain English |
|---|---|
| $P(\text{flagged} \mid \text{fraudulent})$ | If an order is fraudulent, how often does the detector flag it? |
| $P(\text{fraudulent} \mid \text{flagged})$ | If an order is flagged, how often is it actually fraudulent? |

Those two lines aren't interchangeable. Reversing them is one of the most common probability mistakes in ML systems.

Count the flagged pile

Work from the 10,000 orders. Don't start with Bayes rule yet.

Fraudulent orders:

$$100 \times 0.95 = 95$$

Legitimate orders:

$$9900 \times 0.05 = 495$$

Now put the flagged orders into one pile.

| Source of flagged order | Count |
|---|---|
| fraudulent and flagged | 95 |
| legitimate and flagged | 495 |
| all flagged orders | 590 |

The false-positive rate is small, but it acts on the huge legitimate pile. Five percent of 9,900 is larger than 95 percent of 100.

That's why the posterior is surprising:

$$P(\text{fraudulent} \mid \text{flagged}) = \frac{95}{590} \approx 0.161$$

A flagged order is about 16 percent likely to be fraudulent, not 95 percent likely.

The detector didn't become bad. The question changed. We stopped asking how often fraudulent orders are caught and started asking what lives inside the flagged pile.

Turn the arithmetic into a confusion-table calculation. Notice that the program prints counts before it prints the posterior.

count-the-flagged-pile.py

    total_orders = 10_000
    fraud_orders = 100
    legitimate_orders = total_orders - fraud_orders
    true_positive_rate = 0.95
    false_positive_rate = 0.05

    true_flags = round(fraud_orders * true_positive_rate)
    false_flags = round(legitimate_orders * false_positive_rate)
    flagged_orders = true_flags + false_flags
    posterior = true_flags / flagged_orders

    print(f"true flags: {true_flags}")
    print(f"false flags: {false_flags}")
    print(f"all flags: {flagged_orders}")
    print(f"P(fraud | flagged): {posterior:.3f}")

    assert (true_flags, false_flags, flagged_orders) == (95, 495, 590)

Flagged-pile counts

    true flags: 95
    false flags: 495
    all flags: 590
    P(fraud | flagged): 0.161

Conditioning means narrowing the world

Conditional probability always narrows the world first.

For $P(\text{fraudulent} \mid \text{flagged})$, the denominator isn't all 10,000 orders. The denominator is only the 590 flagged orders.

| Probability | World you count inside | Numerator |
|---|---|---|
| $P(\text{fraudulent})$ | all 10,000 orders | 100 fraudulent orders |
| $P(\text{fraudulent} \mid \text{flagged})$ | 590 flagged orders | 95 fraudulent flagged orders |

That's the whole mental move. Ask "among which cases?" before you divide.

This is the same flow as the diagram:

Figure: Conditional probability diagram showing that a fraud posterior is computed inside the 590 flagged orders, not inside the full 10,000-order population. Conditioning changes the denominator. Once you condition on a flag, the relevant world is the 590 flagged orders.

When a probability problem feels abstract, draw that flow. The formula should summarize the drawing, not replace it.

The following tiny dataset makes conditioning literal: filter to the evidence pile first, then count fraud only inside the filtered records.

filter-to-the-evidence-pile.py

    orders = (
        [{"fraudulent": True, "flagged": True}] * 95
        + [{"fraudulent": True, "flagged": False}] * 5
        + [{"fraudulent": False, "flagged": True}] * 495
        + [{"fraudulent": False, "flagged": False}] * 9_405
    )

    flagged = [order for order in orders if order["flagged"]]
    fraudulent_flagged = [order for order in flagged if order["fraudulent"]]

    print(f"all orders denominator: {len(orders):,}")
    print(f"flagged denominator: {len(flagged)}")
    print(f"fraud within flagged: {len(fraudulent_flagged) / len(flagged):.3f}")

    assert len(flagged) == 590
    assert len(fraudulent_flagged) == 95

Conditioned denominator

    all orders denominator: 10,000
    flagged denominator: 590
    fraud within flagged: 0.161

Joint and marginal probability

Two more words tie the counts together.

A joint probability is the chance that two things happen together. In the population, 95 orders are both fraudulent and flagged, so:

$$P(\text{fraudulent and flagged}) = \frac{95}{10000} = 0.0095$$

A marginal probability is the chance of one event by itself, ignoring the other. The marginal probability of a flag adds up every way a flag can happen:

$$P(\text{flagged}) = \frac{95 + 495}{10000} = \frac{590}{10000} = 0.059$$

Joint, conditional, and marginal probabilities are linked by one identity. When $P(B) > 0$, conditional probability is the joint divided by the world you conditioned on:

$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$$

Rearranged, that gives the multiplication rule: a joint probability is a conditional times a marginal.

$$P(A \text{ and } B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$

Check it against the counts: $P(\text{fraudulent} \mid \text{flagged}) = 0.0095 / 0.059 \approx 0.161$, the same posterior as before. This identity is also where Bayes rule comes from. The next section just reads it from the other direction.

The same counts can be computed as shares of the full population. The two routes are mathematically identical, but the program below also shows why you should not compare floating-point results with exact equality: the last line is False even though the values match to every printed digit.

joint-marginal-conditional.py

    total = 10_000
    fraudulent_and_flagged = 95
    flagged = 590

    joint = fraudulent_and_flagged / total
    marginal_flagged = flagged / total
    conditional = joint / marginal_flagged
    pile_fraction = fraudulent_and_flagged / flagged

    print(f"P(fraud and flagged): {joint:.4f}")
    print(f"P(flagged): {marginal_flagged:.4f}")
    print(f"P(fraud | flagged): {conditional:.3f}")
    print(f"same as pile count: {conditional == pile_fraction}")

Joint to conditional

    P(fraud and flagged): 0.0095
    P(flagged): 0.0590
    P(fraud | flagged): 0.161
    same as pile count: False

Bayes rule after the counts

Bayes rule is the formula version of the flagged-pile count.

For events $A$ and $B$ with $P(B) > 0$:

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$$

In this article:

| Symbol | Meaning | Value |
|---|---|---|
| $A$ | order is fraudulent | |
| $B$ | detector flagged the order | |
| $P(A)$ | fraud base rate | 0.01 |
| $P(B \mid A)$ | true-positive rate | 0.95 |
| $P(B \mid \text{not }A)$ | false-positive rate | 0.05 |

When $B$ is the observed evidence, $P(B \mid A)$ is its likelihood under hypothesis $A$. Here it asks how likely a flag would be if the order really were fraudulent. Bayes rule combines that likelihood with the prior to obtain the posterior.

The denominator $P(B)$ means "how often does a flag happen at all?"

It includes true flags and false alarms:

$$P(B) = P(B \mid A)P(A) + P(B \mid \text{not }A)P(\text{not }A)$$

The denominator $P(B)$ is the marginal probability of a flag. It averages over both kinds of orders, weighted by how common each kind is.

Plug in the numbers:

$$P(B) = 0.95 \times 0.01 + 0.05 \times 0.99 = 0.059$$

Now compute the posterior:

$$P(A \mid B) = \frac{0.95 \times 0.01}{0.059} \approx 0.161$$

Same answer as the count table. Bayes rule is the compact form of the same accounting: track where the flagged orders came from before you divide.
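As a cross-check, the same two formulas can be evaluated with exact fractions so that no rounding gets in the way. This is a minimal sketch using Python's standard `fractions` module; the variable names are illustrative and not part of the article's other programs.

```python
from fractions import Fraction

# Rates from the article's tables, written as exact fractions.
prior = Fraction(1, 100)                 # P(A): fraud base rate
true_positive_rate = Fraction(95, 100)   # P(B | A)
false_positive_rate = Fraction(5, 100)   # P(B | not A)

# Total probability: add up every way a flag can happen.
evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)

# Bayes rule: likelihood times prior, divided by the evidence.
posterior = true_positive_rate * prior / evidence

print(f"P(B)     = {evidence} = {float(evidence):.3f}")
print(f"P(A | B) = {posterior} = {float(posterior):.3f}")
```

The exact posterior reduces to 19/118, which is the same fraction as 95/590 from the flagged pile.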

Independence means no update

Evidence helps when it changes the event rate.

When $P(B) > 0$, seeing $B$ leaves the probability of $A$ unchanged if the events are independent:

$$P(A \mid B) = P(A)$$

Imagine a broken fraud detector:

| If the order is... | Broken detector flags it |
|---|---|
| Fraudulent | 20 percent |
| Legitimate | 20 percent |

Out of 10,000 orders, this detector produces:

| Source of flagged order | Count |
|---|---|
| fraudulent and flagged | 20 |
| legitimate and flagged | 1,980 |
| all flagged orders | 2,000 |

The flagged pile is still 1 percent fraudulent:

$$\frac{20}{2000} = 0.01$$

The flag created work, but it didn't create information. Useful ML signals are useful because they change the rate of the event you care about.

Code makes the "no update" claim testable. If both groups are flagged at the same rate, that rate cancels from Bayes rule.

independence-means-no-update.py

    def posterior_if_flagged(prior, true_positive_rate, false_positive_rate):
        true_flags = true_positive_rate * prior
        false_flags = false_positive_rate * (1 - prior)
        return true_flags / (true_flags + false_flags)

    prior = 0.01
    posterior = posterior_if_flagged(prior, 0.20, 0.20)

    print(f"prior: {prior:.3f}")
    print(f"posterior: {posterior:.3f}")
    print(f"update: {posterior - prior:+.3f}")

    assert abs(posterior - prior) < 1e-12

Independent evidence

    prior: 0.010
    posterior: 0.010
    update: +0.000
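Independence can also be checked straight from counts. Combining the multiplication rule with "no update" gives P(A and B) = P(A) P(B) for independent events, so the joint share should equal the product of the two marginal shares. The sketch below compares the broken detector with the working one from earlier; the helper name `is_independent` is hypothetical, and exact fractions are used so the equality test is safe.

```python
from fractions import Fraction

def is_independent(total, event_count, evidence_count, both_count):
    """Check whether P(A and B) equals P(A) * P(B) using exact fractions."""
    p_event = Fraction(event_count, total)
    p_evidence = Fraction(evidence_count, total)
    p_joint = Fraction(both_count, total)
    return p_joint == p_event * p_evidence

# Broken detector: 2,000 flags, 20 of them fraudulent.
broken = is_independent(10_000, 100, 2_000, 20)

# Working detector: 590 flags, 95 of them fraudulent.
working = is_independent(10_000, 100, 590, 95)

print(f"broken detector flag independent of fraud: {broken}")
print(f"working detector flag independent of fraud: {working}")
```

Measured counts from real logs will rarely match the product exactly, so on real data this comparison would need a tolerance rather than strict equality.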

Same detector, different world

Keep the detector fixed:

| Detector property | Value |
|---|---|
| true-positive rate | 0.95 |
| false-positive rate | 0.05 |

Now change only the population.

| Fraud base rate | Posterior after flag | What changed? |
|---|---|---|
| 1 percent | about 16 percent | legitimate orders dominate the flagged pile |
| 10 percent | about 68 percent | true flags become a much larger share |
| 50 percent | about 95 percent | both classes are equally common before evidence |

The model didn't change. The world around the model changed.

This is why the same classifier can behave differently across products, countries, languages, traffic sources, or time periods. A score without a population is like a map without a scale. It may look precise, but it isn't enough to act carefully.

Build it: compute posterior risk

The code should read like the table:

  1. Check that each input is a valid probability.
  2. Count true flags as true_positive * prior.
  3. Count false flags as false_positive * (1 - prior).
  4. Divide true flags by all flags.

Put this in probability_demo.py. It is small enough to audit line by line, but already exposes the input validation and evidence guard that production code needs.

code-the-calculation.py

    def check_probability(x, name):
        if not 0 <= x <= 1:
            raise ValueError(f"{name} must be between 0 and 1")

    def flagged_posterior(prior, true_positive, false_positive):
        check_probability(prior, "prior")
        check_probability(true_positive, "true_positive")
        check_probability(false_positive, "false_positive")

        true_flags = true_positive * prior
        false_flags = false_positive * (1 - prior)
        all_flags = true_flags + false_flags

        if all_flags == 0:
            raise ValueError("evidence probability must be greater than 0")

        return true_flags / all_flags

    def main():
        priors = [0.01, 0.10, 0.50]

        for prior in priors:
            posterior = flagged_posterior(prior, 0.95, 0.05)
            print(prior, round(posterior, 3))

        try:
            flagged_posterior(1.4, 0.95, 0.05)
        except ValueError as error:
            print(error)

    if __name__ == "__main__":
        main()

Posterior sweep

    0.01 0.161
    0.1 0.679
    0.5 0.95
    prior must be between 0 and 1

The detector stayed fixed. The prior changed, so the posterior changed.

A threshold changes two probabilities at once

A deployed detector rarely emits only flag or not flag. It emits a score, and a threshold creates the flag. Raising a threshold normally sends fewer orders to reviewers and may make the flagged pile cleaner, but it can also miss more fraudulent orders.
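To see how one threshold produces both conditional rates, here is a minimal sketch that measures them on a tiny labeled sample. The scores and labels are invented for illustration, and `flag_rates` is a hypothetical helper, not part of any library.

```python
# Hypothetical labeled sample: (risk score, is_fraud). Invented for illustration.
labeled_scores = [
    (0.95, True), (0.85, True), (0.60, True), (0.30, True),
    (0.90, False), (0.55, False), (0.40, False), (0.20, False),
    (0.10, False), (0.05, False),
]

def flag_rates(labeled_scores, threshold):
    """Return (P(flagged | fraud), P(flagged | legitimate)) at a threshold."""
    fraud = [score for score, is_fraud in labeled_scores if is_fraud]
    legitimate = [score for score, is_fraud in labeled_scores if not is_fraud]
    if not fraud or not legitimate:
        raise ValueError("need at least one labeled example of each class")
    # An order is flagged when its score meets or exceeds the threshold.
    true_positive_rate = sum(score >= threshold for score in fraud) / len(fraud)
    false_positive_rate = sum(score >= threshold for score in legitimate) / len(legitimate)
    return true_positive_rate, false_positive_rate

for threshold in [0.5, 0.8]:
    tpr, fpr = flag_rates(labeled_scores, threshold)
    print(f"threshold={threshold:.1f} P(flag|fraud)={tpr:.2f} P(flag|legit)={fpr:.2f}")
```

Raising the threshold from 0.5 to 0.8 lowers both rates in this sample: fewer legitimate orders are flagged, and fewer fraudulent ones are caught.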

Use an illustrative measurement table from the same 1 percent fraud population. These rates would need to be measured on labeled data in a real system.

| Threshold policy | $P(\text{flagged} \mid \text{fraud})$ | $P(\text{flagged} \mid \text{legitimate})$ |
|---|---|---|
| broad review | 0.95 | 0.05 |
| stricter review | 0.80 | 0.01 |

threshold-tradeoff.py

    def review_metrics(prior, recall, false_positive_rate):
        true_flags = recall * prior
        false_flags = false_positive_rate * (1 - prior)
        review_rate = true_flags + false_flags
        if review_rate == 0:
            raise ValueError("review rate must be greater than 0")
        precision = true_flags / review_rate
        return review_rate, precision

    prior = 0.01
    policies = [
        ("broad review", 0.95, 0.05),
        ("stricter review", 0.80, 0.01),
    ]

    for name, recall, false_positive_rate in policies:
        review_rate, precision = review_metrics(prior, recall, false_positive_rate)
        print(
            f"{name:16} review={review_rate:6.2%} "
            f"fraud_in_queue={precision:6.2%} recall={recall:6.2%}"
        )

    try:
        review_metrics(prior, recall=0.0, false_positive_rate=0.0)
    except ValueError as error:
        print("empty queue:", error)

Threshold tradeoff

    broad review     review= 5.90% fraud_in_queue=16.10% recall=95.00%
    stricter review  review= 1.79% fraud_in_queue=44.69% recall=80.00%
    empty queue: review rate must be greater than 0

The stricter policy reduces review load and improves the fraction of reviewed orders that are fraud, which is precision, but it catches fewer fraud cases, which lowers recall. This is a precision-recall tradeoff. Probability exposes the tradeoff; product cost and safety policy choose among the options. A threshold can also send no orders to review. In that case, queue precision is undefined because there is no flagged pile, so code should reject or explicitly represent the empty queue instead of dividing by zero.
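The tradeoff is often easier to discuss in counts than in rates. The sketch below converts both policies back into expected piles for the same 10,000 orders; `queue_counts` is a hypothetical helper, and the rates are the illustrative ones from the table above, not measurements.

```python
def queue_counts(fraud_orders, legitimate_orders, recall, false_positive_rate):
    """Return expected (caught, missed, false_flags) counts for one policy."""
    caught = round(fraud_orders * recall)
    missed = fraud_orders - caught
    false_flags = round(legitimate_orders * false_positive_rate)
    return caught, missed, false_flags

policies = [
    ("broad review", 0.95, 0.05),
    ("stricter review", 0.80, 0.01),
]

for name, recall, false_positive_rate in policies:
    caught, missed, false_flags = queue_counts(100, 9_900, recall, false_positive_rate)
    queue_size = caught + false_flags
    print(
        f"{name}: queue={queue_size} caught={caught} "
        f"missed={missed} false_flags={false_flags}"
    )
```

Under these assumed rates, the stricter policy shrinks the queue from 590 orders to 179, at the cost of missing 20 fraudulent orders instead of 5.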

Read the code as probability

Every line in flagged_posterior has a probability meaning.

| Code | Probability meaning |
|---|---|
| `prior` | $P(\text{fraudulent})$ |
| `true_positive` | $P(\text{flagged} \mid \text{fraudulent})$ |
| `false_positive` | $P(\text{flagged} \mid \text{legitimate})$ |
| `true_flags` | true flagged share of the population |
| `false_flags` | false flagged share of the population |
| `all_flags` | $P(\text{flagged})$ |
| `true_flags / all_flags` | $P(\text{fraudulent} \mid \text{flagged})$ |

The guard for all_flags == 0 matters. If evidence never appears, the conditional probability is undefined. Production code should reject that case instead of inventing a number.

Protect the calculation

The function needs a contract: the known worked example must remain correct, invalid inputs must fail, and impossible evidence must fail instead of returning an invented posterior.

test-probability-contract.py

    import math

    def flagged_posterior(prior, true_positive, false_positive):
        values = {"prior": prior, "true_positive": true_positive, "false_positive": false_positive}
        for name, value in values.items():
            if not 0 <= value <= 1:
                raise ValueError(f"{name} must be between 0 and 1")
        true_flags = true_positive * prior
        false_flags = false_positive * (1 - prior)
        if true_flags + false_flags == 0:
            raise ValueError("evidence probability must be greater than 0")
        return true_flags / (true_flags + false_flags)

    assert math.isclose(flagged_posterior(0.01, 0.95, 0.05), 95 / 590)

    for args in [(1.4, 0.95, 0.05), (0.01, 0.0, 0.0)]:
        try:
            flagged_posterior(*args)
        except ValueError as error:
            print(error)
        else:
            raise AssertionError("invalid probability case did not fail")

    print("posterior calculation checks passed")

Probability contract

    prior must be between 0 and 1
    evidence probability must be greater than 0
    posterior calculation checks passed

The first test protects the numeric story. The second protects the failure path. Both are part of learning probability as an engineering skill, not just as a formula.

Where this shows up in ML work

The fraud example is one surface.

| ML system | Event | Evidence | Base-rate question |
|---|---|---|---|
| fraud detector | order is fraudulent | risk score above threshold | how common is fraud for this product category? |
| product retrieval | product is relevant | embedding similarity above threshold | how many returned products match the query intent? |
| label judge | predicted label is correct | automated judge marks it correct | how often does the judge agree with human reviewers? |
| delivery anomaly detector | delivery is delayed | anomaly score is high | how common are delays for this route or season? |

The same habit works everywhere:

  1. Name the event.
  2. Name the population.
  3. Measure the base rate.
  4. Name the evidence.
  5. Ask how the evidence changes the event rate.

Skip those steps and a score starts pretending to be a conclusion.
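The five steps can be sketched as one small function that takes the population, the event, and the evidence as explicit arguments, so none of them can be left implicit. The name `rate_update` and the record layout are assumptions made for this illustration.

```python
def rate_update(records, event, evidence):
    """Return (prior, posterior) for `event` before and after `evidence`."""
    if not records:
        raise ValueError("population is empty")
    # Conditioning: keep only the records where the evidence appeared.
    with_evidence = [record for record in records if evidence(record)]
    if not with_evidence:
        raise ValueError("evidence never appears in this population")
    prior = sum(event(record) for record in records) / len(records)
    posterior = sum(event(record) for record in with_evidence) / len(with_evidence)
    return prior, posterior

# Population: the article's 10,000 orders.
orders = (
    [{"fraudulent": True, "flagged": True}] * 95
    + [{"fraudulent": True, "flagged": False}] * 5
    + [{"fraudulent": False, "flagged": True}] * 495
    + [{"fraudulent": False, "flagged": False}] * 9_405
)

prior, posterior = rate_update(
    orders,
    event=lambda order: order["fraudulent"],
    evidence=lambda order: order["flagged"],
)
print(f"base rate: {prior:.3f}")
print(f"rate after evidence: {posterior:.3f}")
```

Swapping in a different record list and different `event` and `evidence` predicates applies the same accounting to retrieval, judging, or anomaly detection.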

A score isn't automatically a probability

So far, the evidence was a thresholded flag and all rates were given. A model may instead emit a score such as 0.80. That score earns the interpretation "80 percent probability of fraud" only if similarly scored orders are fraudulent about 80 percent of the time. That property is calibration. Modern neural classifiers can be accurate while still producing poorly calibrated confidence scores, which is why score calibration is measured rather than assumed.[4]

This toy bucket demonstrates the check. Every order was assigned a score near 0.80, but only half of these labeled orders were actually fraudulent.

check-one-score-bucket.py

    predicted_risk = [0.80, 0.82, 0.78, 0.81, 0.79, 0.80]
    observed_fraud = [1, 0, 1, 0, 1, 0]

    advertised_risk = sum(predicted_risk) / len(predicted_risk)
    observed_rate = sum(observed_fraud) / len(observed_fraud)
    gap = advertised_risk - observed_rate

    print(f"average predicted risk: {advertised_risk:.0%}")
    print(f"observed fraud rate: {observed_rate:.0%}")
    print(f"calibration gap: {gap:.0%}")

Calibration bucket

    average predicted risk: 80%
    observed fraud rate: 50%
    calibration gap: 30%

Six orders can't prove how a deployed model is calibrated. The calculation names the question. Statistics will teach how much evidence you need before trusting the measured gap.
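The single-bucket check extends naturally to several score ranges. The following is a rough sketch with invented scores and labels and a hypothetical `calibration_table` helper; it groups orders into equal-width score buckets and reports the mean score next to the observed fraud rate in each one.

```python
from collections import defaultdict

# Hypothetical (score, label) pairs, invented for illustration only.
scored_orders = [
    (0.12, 0), (0.08, 0), (0.15, 0), (0.11, 1),
    (0.52, 1), (0.48, 0), (0.55, 0), (0.47, 1),
    (0.91, 1), (0.88, 1), (0.93, 1), (0.86, 0),
]

def calibration_table(scored_orders, bucket_count=5):
    """Group by score bucket; return (bucket, size, mean score, observed rate) rows."""
    buckets = defaultdict(list)
    for score, label in scored_orders:
        # Clamp so a score of exactly 1.0 lands in the top bucket.
        index = min(int(score * bucket_count), bucket_count - 1)
        buckets[index].append((score, label))
    rows = []
    for index in sorted(buckets):
        members = buckets[index]
        mean_score = sum(score for score, _ in members) / len(members)
        observed_rate = sum(label for _, label in members) / len(members)
        rows.append((index, len(members), mean_score, observed_rate))
    return rows

for index, size, mean_score, observed_rate in calibration_table(scored_orders):
    print(f"bucket {index}: n={size} mean score={mean_score:.2f} observed={observed_rate:.2f}")
```

With only four orders per bucket, the gaps here are illustrative. They show where to look, not how large the miscalibration really is.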

A token model multiplies probabilities

This chapter used one binary event, but an LLM produces a categorical distribution over the next token. For a known target token, training cares about the probability assigned to that target. With a one-hot target, that token's cross-entropy contribution is its negative log probability, -log(p).[3]

negative-log-probability.py

    import math

    for target_probability in [0.90, 0.50, 0.01]:
        loss = -math.log(target_probability)
        print(f"target probability={target_probability:>4.2f} loss={loss:>5.3f}")

Token surprise

    target probability=0.90 loss=0.105
    target probability=0.50 loss=0.693
    target probability=0.01 loss=4.605

Low probability for the observed token costs more because the observation was more surprising under the model.
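To connect the single number `-log(p)` to a full categorical distribution, the sketch below builds a three-token distribution and confirms that cross-entropy against a one-hot target keeps only the target's term. The vocabulary and logits are invented, and softmax is assumed here as the way scores become probabilities; this article doesn't specify that step.

```python
import math

def softmax(logits):
    """Turn raw scores into a categorical distribution that sums to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def cross_entropy(probabilities, one_hot_target):
    """Sum -target * log(p); zero-weight terms contribute nothing."""
    return -sum(
        target * math.log(p)
        for p, target in zip(probabilities, one_hot_target)
        if target > 0
    )

vocabulary = ["ship", "refund", "cancel"]  # hypothetical tokens
logits = [2.0, 0.5, -1.0]                  # hypothetical model scores
probabilities = softmax(logits)

target_index = 0
one_hot = [1.0 if i == target_index else 0.0 for i in range(len(vocabulary))]

loss = cross_entropy(probabilities, one_hot)
print(f"P(target) = {probabilities[target_index]:.3f}")
print(f"cross-entropy = {loss:.3f}")
print(f"-log P(target) = {-math.log(probabilities[target_index]):.3f}")
```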

For a sequence, the model's joint probability follows the chain rule: multiply the conditional probability of each next token given the previous tokens.

$$P(t_1, t_2, \ldots, t_n) = P(t_1) \prod_{i=2}^{n} P(t_i \mid t_1, \ldots, t_{i-1})$$

Here, $t_i$ is the token at position $i$, and the expression to the right of the bar is the preceding context. Multiplying many small probabilities eventually underflows in floating-point arithmetic. Logs convert the product into a stable sum.

keep-sequence-probabilities-in-log-space.py

    import math

    token_probability = 0.01
    token_count = 200

    raw_product = token_probability ** token_count
    log_probability = token_count * math.log(token_probability)

    print(f"raw product in float: {raw_product}")
    print(f"log probability: {log_probability:.1f}")
    print(f"finite in log space: {math.isfinite(log_probability)}")

Log-space sequence probability

    raw product in float: 0.0
    log probability: -921.0
    finite in log space: True

The 0.0 doesn't mean the sequence was impossible. It means ordinary floating-point multiplication lost a representable nonzero number. Later language-modeling chapters use this same log-space habit for cross-entropy and perplexity.
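Real sequences have a different conditional probability at each position, so the log-space version is a sum instead of a single multiplication. Here is a small sketch with invented per-token probabilities; on a short sequence the product and the exponentiated log sum still agree, which makes the equivalence easy to check.

```python
import math

# Hypothetical conditional probabilities P(t_i | previous tokens), one per position.
token_probabilities = [0.60, 0.25, 0.90, 0.05]

# Chain rule in log space: add log probabilities instead of multiplying.
log_probability = sum(math.log(p) for p in token_probabilities)

# Direct product, safe here only because the sequence is short.
direct_product = math.prod(token_probabilities)

# Average surprise per token.
mean_negative_log_probability = -log_probability / len(token_probabilities)

print(f"direct product: {direct_product:.5f}")
print(f"exp(log sum): {math.exp(log_probability):.5f}")
print(f"mean -log p per token: {mean_negative_log_probability:.3f}")
```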

Common mistakes

Most probability bugs are question bugs.

| Symptom | Mistake | Better move |
|---|---|---|
| "The flag means 95 percent fraudulent." | reversed the conditional probabilities | write both questions in plain English |
| Posterior feels too high | ignored rare base rate | start from counts before formulas |
| Denominator is all orders | forgot conditioning | denominator should be the evidence pile |
| One threshold used everywhere | ignored population shift | recompute base rates per product slice |
| Model confidence treated as truth | event never named | define the event and compare to labels |
| Code returns a number for impossible evidence | divided by zero-probability evidence | reject undefined cases loudly |
| A score of 0.80 is used as 80 percent risk without checking labels | calibration was assumed | bucket predictions and compare score with observed rate |
| A long sequence gets probability 0.0 in code | small probabilities were multiplied directly | sum log probabilities instead |

The debugging question is short:

Among which cases am I counting?

If you can answer that, the formula usually becomes much easier.
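One way to make "reject undefined cases loudly" concrete is a posterior helper that validates its inputs. This is a minimal sketch, not the lesson's own function: the name `flagged_posterior` and its parameter names are assumptions chosen for illustration.

```python
def flagged_posterior(prior: float, true_positive: float, false_positive: float) -> float:
    """Return P(event | flagged) from a prior and two flag rates (hypothetical helper)."""
    rates = {"prior": prior, "true_positive": true_positive, "false_positive": false_positive}
    for name, value in rates.items():
        # Probabilities outside [0, 1] (or NaN) are rejected instead of propagated.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    true_flags = prior * true_positive             # share of cases that are flagged and positive
    false_flags = (1.0 - prior) * false_positive   # share of cases that are flagged and negative
    evidence = true_flags + false_flags            # P(flagged): the evidence pile

    if evidence == 0.0:
        # Nothing is ever flagged, so "among flagged cases" has no cases to count.
        raise ValueError("P(flagged) is zero, so the posterior is undefined")
    return true_flags / evidence


print(f"{flagged_posterior(0.01, 0.95, 0.05):.3f}")  # 0.161
```

The denominator is the evidence pile from the debugging question: the function counts among flagged cases only, and refuses to answer when that pile is empty.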

Try it yourself

Use the same 10,000-order population.

  1. Recompute $P(\text{fraudulent})$ from the first table.
  2. Recompute the 495 false flags by hand.
  3. Change the false-positive rate from 0.05 to 0.01.
  4. Compute the new posterior.
  5. Explain why the posterior changed even though the true-positive rate stayed 0.95.
  6. Compare the broad and stricter review policies: which catches more fraud, and which sends a cleaner queue to reviewers?
  7. Compute -log(0.80) and -log(0.10). Which observed token is more surprising to a model?

Then translate the lesson to product search:

| Product search version | Your answer |
| --- | --- |
| event | product is truly relevant |
| population | candidate products returned for a query class |
| evidence | similarity score above threshold |
| prior | relevance rate before thresholding |
| posterior | relevance rate after thresholding |

The goal isn't to memorize the fraud numbers. The goal is to carry the counting habit into any model score.
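The product-search mapping can be sketched with counts. Every number below is hypothetical and chosen only to make the two piles visible; none of them come from the fraud example.

```python
# Hypothetical counts for one query class (illustrative only).
candidates = 500                  # population: candidate products returned
relevant = 50                     # event: product is truly relevant
relevant_above_threshold = 40     # relevant candidates with similarity above threshold
irrelevant_above_threshold = 45   # irrelevant candidates with similarity above threshold

# Prior: relevance rate before thresholding, counted among all candidates.
prior = relevant / candidates

# Posterior: relevance rate after thresholding, counted among candidates above threshold.
above_threshold = relevant_above_threshold + irrelevant_above_threshold
if above_threshold == 0:
    raise ValueError("no candidate passed the threshold, so the posterior is undefined")
posterior = relevant_above_threshold / above_threshold

print(f"prior relevance rate: {prior:.2f}")
print(f"posterior relevance rate: {posterior:.2f}")
```

The only thing that changes between the two rates is the denominator: all candidates for the prior, candidates above threshold for the posterior.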

Solution checks

Check your work after you try the practice.

| Practice item | Answer |
| --- | --- |
| Prior | $100 / 10000 = 0.01$ |
| Original false flags | $9900 \times 0.05 = 495$ |
| New false flags | $9900 \times 0.01 = 99$ |
| New posterior | $95 / (95 + 99) \approx 0.49$ |
| Why it changed | the false-alarm pile got smaller, so true flags became a larger share of all flags |
| Threshold tradeoff | broad review catches more fraud; stricter review produces a cleaner, smaller queue |
| Token loss comparison | $-\log(0.80) \approx 0.223$ and $-\log(0.10) \approx 2.303$; the 0.10 target is more surprising |

If your explanation starts with Bayes rule, translate it back into piles. A good answer can move between counts, words, code, and notation.
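The solution checks can also be reproduced in code by building the piles first. This sketch uses the 10,000-order population and the rates from the practice items; the variable names are illustrative choices, not taken from earlier code in the lesson.

```python
import math

orders = 10_000
fraudulent = 100
legitimate = orders - fraudulent

# True-flag pile: 95 percent of fraudulent orders are flagged.
true_flags = round(fraudulent * 0.95)

posteriors = {}
for false_positive_rate in (0.05, 0.01):
    # False-flag pile: legitimate orders that are flagged anyway.
    false_flags = round(legitimate * false_positive_rate)
    flagged = true_flags + false_flags
    posteriors[false_positive_rate] = true_flags / flagged
    print(f"FPR {false_positive_rate}: {false_flags} false flags, "
          f"posterior {posteriors[false_positive_rate]:.2f}")

# Negative log probability of the observed token: lower probability means more surprise.
token_losses = {p: -math.log(p) for p in (0.80, 0.10)}
for p, loss in token_losses.items():
    print(f"-log({p}) = {loss:.3f}")
```

The true-flag pile stays at 95 in both runs. Only the false-flag pile shrinks, which is why the posterior rises.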

What to carry forward

Probability turns model scores into named claims.

For an e-commerce product, a useful probability statement sounds like this:

Event: order is fraudulent. Population: electronics orders in the last 30 days. Evidence: fraud detector score above threshold. Decision: send to manual review when posterior risk is above 20 percent.

That sentence is longer than "score is high," but it's much safer.
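A named claim like this can travel with the decision rule in code. The sketch below is one possible shape, assuming a hypothetical `ReviewRule` class; the posterior values passed in reuse the practice numbers purely as illustration and are not measurements of the electronics population.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRule:
    """Hypothetical container that keeps the claim next to its threshold."""
    event: str
    population: str
    evidence: str
    review_threshold: float

    def should_review(self, posterior_risk: float) -> bool:
        # "Above 20 percent" is read as a strict inequality.
        return posterior_risk > self.review_threshold


rule = ReviewRule(
    event="order is fraudulent",
    population="electronics orders in the last 30 days",
    evidence="fraud detector score above threshold",
    review_threshold=0.20,
)

# Illustrative posteriors from the practice numbers: 95/590 and 95/194.
print(rule.should_review(95 / 590))  # False
print(rule.should_review(95 / 194))  # True
```

Keeping the event, population, and evidence as fields means the threshold can't be reused on a different slice without someone changing the description too.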

Mastery check

Key concepts

  • events and sample spaces
  • priors and posteriors
  • conditional probability
  • independence
  • Bayes rule
  • base rates
  • calibration as an empirical check on score meaning
  • negative log probability for observed tokens
  • log-space arithmetic for long probability products

Evaluation rubric

  • Foundational: Names the event, population, evidence, prior, and posterior in one concrete ML example.
  • Foundational: Computes a posterior from counts before using Bayes rule.
  • Intermediate: Explains why $P(\text{flagged} \mid \text{fraudulent})$ and $P(\text{fraudulent} \mid \text{flagged})$ answer different questions.
  • Intermediate: Uses Bayes rule and the runnable Python function to reproduce the same posterior.
  • Advanced: Explains how base-rate shift changes a production threshold decision, checks whether a score is calibrated, and uses log probabilities to avoid sequence underflow.

Follow-up questions

Common pitfalls

  • Symptom: A team treats a score like “0.82” as if it were automatically the probability of truth. Cause: The event, population, and evidence were never named. Fix: Rewrite the score as a full sentence: event, population, evidence, and action threshold.
  • Symptom: Someone says “the model catches 95 percent of fraud, so a flagged order is 95 percent fraud.” Cause: They reversed the conditional. Fix: Write both questions in plain English before using notation.
  • Symptom: The same threshold behaves very differently after a product launch in a new market. Cause: The base rate changed even if the detector stayed fixed. Fix: Recompute priors and posteriors for the new slice instead of carrying old percentages forward.
  • Symptom: A detector produces many alerts but almost none are useful. Cause: The signal fires at similar rates on positive and negative cases, so it behaves like independent noise. Fix: Check whether the evidence changes the event rate at all.
  • Symptom: Code quietly returns nonsense when the evidence pile is empty. Cause: The conditional probability is undefined when evidence has zero probability. Fix: Guard against zero-probability evidence and fail loudly. If a measured pile is tiny rather than empty, report uncertainty about the estimate.
  • Symptom: A confidence score is shipped as a probability without a labeled bucket check. Cause: Calibration was assumed. Fix: Compare predicted risk with observed rates and quantify uncertainty in that estimate.
  • Symptom: Sequence probabilities become zero during evaluation. Cause: Many small probabilities were multiplied in ordinary floating-point arithmetic. Fix: Add log probabilities and exponentiate only when a representable probability is required.
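The labeled bucket check mentioned in the pitfalls can be sketched in a few lines. The scores and labels below are hypothetical, and the buckets are far too small to trust: a real check needs many labeled cases per bucket plus an uncertainty estimate. The helper name `bucket_check` is an assumption for illustration.

```python
from collections import defaultdict

# Hypothetical (score, label) pairs; label 1 means the event actually happened.
scored_orders = [
    (0.81, 1), (0.83, 1), (0.85, 0), (0.87, 0),
    (0.11, 0), (0.13, 0), (0.15, 0), (0.17, 1),
]


def bucket_check(pairs, bucket_count=10):
    """Group scores into equal-width buckets and compare mean score with observed rate."""
    buckets = defaultdict(list)
    for score, label in pairs:
        # A score of exactly 1.0 is placed in the top bucket.
        index = min(int(score * bucket_count), bucket_count - 1)
        buckets[index].append((score, label))

    rows = []
    for index in sorted(buckets):
        items = buckets[index]
        mean_score = sum(score for score, _ in items) / len(items)
        observed_rate = sum(label for _, label in items) / len(items)
        rows.append((index, len(items), mean_score, observed_rate))
    return rows


for index, count, mean_score, observed_rate in bucket_check(scored_orders):
    print(f"bucket {index}: n={count}, mean score={mean_score:.2f}, observed rate={observed_rate:.2f}")
```

In this toy data the high bucket has a mean score near 0.84 but an observed rate of 0.50, which is the gap a calibration check is meant to expose.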

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1. In 10,000 orders, 100 are fraudulent. A detector flags 95% of fraudulent orders and 5% of legitimate orders. What is P(fraudulent | flagged)?
2. A detector flags 20% of fraudulent orders and 20% of legitimate orders. If the fraud prior is 1%, what is P(fraudulent | flagged)?
3. At a 1% fraud prior, a broad threshold has 95% recall and a 5% false-positive rate. A strict threshold has 80% recall and a 1% false-positive rate. What changes?
4. A model multiplies 200 token probabilities of 0.01, and the floating-point product is 0.0 while the summed log probability is finite. What should the evaluator do?
5. Six orders have scores averaging 80%, but only 3 of the 6 are fraudulent. What does this bucket check show?
6. A retrieval system defines the event as product relevance and the evidence as a similarity score above threshold. Which counting worlds define its prior and posterior?
7. In retrieval, 20% of candidates are relevant. A threshold selects 80% of relevant candidates and 10% of irrelevant candidates. Which Bayes calculation is correct?
8. A detector keeps 95% recall and a 5% false-positive rate, but the fraud prior rises from 1% to 10%. Why does P(fraudulent | flagged) rise?
9. A flagged-posterior function is called with (prior=1.4, true_positive=0.95, false_positive=0.05) and with (prior=0.01, true_positive=0, false_positive=0). What should its contract require?
10. An observed target token has probability 0.10 in one prediction and 0.80 in another. Which prediction has the larger negative log probability?


Next Step
Continue to Statistics and Uncertainty

Probability taught you how to reason inside a known population. Statistics asks how much you should trust probabilities estimated from finite data.

References

  • Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.
  • Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks.