
Probability for ML

A beginner-first probability chapter that starts with counts, then teaches conditional probability, Bayes rule, and base-rate mistakes through a moderation-filter example.


Probability is what you use when you don't have perfect information.

In ordinary programming, a branch is often clean:

python
if user_is_admin:
    print("show admin panel")
    show_admin_panel()

Machine learning is messier. A model says "this looks 82 percent likely." A safety filter says "this message looks risky." A retrieval system says "this document seems relevant." Those aren't facts. They are uncertain claims about events.

What You Learn

This chapter teaches the first habit of probability: name the event, count the cases, then update when new evidence arrives.[1][2][3]

[Figure: probability table showing outcomes, conditional evidence, normalized posterior probabilities, and one decision threshold]
Visual anchor: compare the prior column with the posterior column. New evidence moves probability mass before a decision is made.

Step Map

Step | Question | What you should be able to do
1 | What can happen? | Define the event in plain English.
2 | How often does it happen? | Turn counts into a probability.
3 | What evidence did we see? | Use conditional probability.
4 | How should belief change? | Apply Bayes rule.
5 | What can go wrong? | Spot base-rate mistakes before shipping.

You already practiced Python and arrays. Now the same habit becomes statistical: make the invisible assumption visible.


Tiny Story

Imagine an LLM moderation filter.

Out of 1,000 messages:

Message type | Count
Unsafe | 10
Safe | 990
Total | 1,000

Before the filter says anything, unsafe messages are rare:

P(\text{unsafe}) = \frac{10}{1000} = 0.01

That number is called the prior. It is the probability before new evidence.
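
To see that as code, here is a minimal sketch that recovers the prior from the raw counts in the table above (the variable names are just for this example):

python
# Counts from the moderation table above.
unsafe_count = 10
total_count = 1_000

prior = unsafe_count / total_count
print(prior)  # 0.01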

Now suppose the filter is pretty good:

If the message is... | Filter flags it how often?
Unsafe | 95 percent
Safe | 5 percent

Beginner guess: "If the filter flags a message, it is probably unsafe."

Careful answer: not necessarily. Safe messages are so common that false alarms can dominate the flagged pile.

Worked Example

Let's count the flagged messages out of the same 1,000 messages.

Unsafe messages:

10 \times 0.95 = 9.5

Safe messages:

990 \times 0.05 = 49.5

So the filter flags about 59 messages:

9.5 + 49.5 = 59

Only 9.5 of those flagged messages are actually unsafe (these are expected counts, so fractions are fine):

P(\text{unsafe} \mid \text{flagged}) = \frac{9.5}{59} \approx 0.161

The result is about 16 percent, not 95 percent.

That is the base-rate lesson. A strong signal can still be less convincing than it feels when the event is rare.
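
The Code Lab below redoes this with rates; as a bridge, here is the same arithmetic written directly over expected counts (illustrative names, nothing beyond the numbers above):

python
# Expected counts in the flagged pile, out of 1,000 messages.
unsafe_flagged = 10 * 0.95   # true positives: 9.5
safe_flagged = 990 * 0.05    # false alarms: 49.5
all_flagged = unsafe_flagged + safe_flagged  # about 59

print(round(unsafe_flagged / all_flagged, 3))  # 0.161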

Vocabulary

  • event: something that can happen, such as message is unsafe.
  • sample space: the full set of possible outcomes you are considering.
  • prior: probability before new evidence.
  • evidence: something you observe, such as filter flagged the message.
  • likelihood: probability of seeing the evidence if a hypothesis were true.
  • posterior: updated probability after evidence.
  • conditional probability: probability of one event after you know another event happened.

Keep every word tied to the moderation story. If a term can't point to a row, count, or event, it is still floating.

Formula

Conditional probability asks:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

Read it as:

Among the cases where B happened, what fraction also had A?

Bayes rule turns the question around:

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}

For the moderation filter:

  • A means "message is unsafe"
  • B means "filter flagged it"
  • P(A) is the base rate
  • P(B \mid A) is the true-positive rate
  • P(B) includes both true flags and false alarms

This isn't a trick formula. It is just careful counting.
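
Plugging the moderation numbers in makes the rule concrete; the denominator expands into true flags plus false alarms:

P(\text{unsafe} \mid \text{flagged}) = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} = \frac{0.0095}{0.059} \approx 0.161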

Code Lab

Put this in probability_demo.py:

python
def flagged_posterior(prior, true_positive, false_positive):
    real_flags = true_positive * prior
    false_flags = false_positive * (1 - prior)
    all_flags = real_flags + false_flags
    return real_flags / all_flags

posterior = flagged_posterior(
    prior=0.01,
    true_positive=0.95,
    false_positive=0.05,
)

print(round(posterior, 3))

Expected output:

text
0.161

Read the code like a word problem:

  • prior=0.01 means 1 percent of messages are unsafe before filtering.
  • true_positive=0.95 means unsafe messages are usually flagged.
  • false_positive=0.05 means safe messages are sometimes flagged.
  • real_flags / all_flags means "of flagged messages, how many are truly unsafe?"

Now change only prior:

python
for prior in [0.01, 0.10, 0.50]:
    print(prior, round(flagged_posterior(prior, 0.95, 0.05), 3))

The detector didn't change. Only the base rate changed. The answer should move a lot.
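
With those three priors, the output should look roughly like this:

text
0.01 0.161
0.1 0.679
0.5 0.95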

Second Pass: Change One Number

Keep the filter quality fixed:

Setting | Value
True-positive rate | 0.95
False-positive rate | 0.05

Now change only the world the filter lives in:

Unsafe base rate | Posterior after flag
1 percent | about 16 percent
10 percent | about 68 percent
50 percent | about 95 percent

Same detector. Different population. Different answer.

This is why product context matters. A fraud model, safety filter, and spam classifier can share the same math but make different decisions because their base rates differ.

Common Trap

The common trap is mixing up these two questions:

Question | Meaning
P(\text{flagged} \mid \text{unsafe}) | If the message is unsafe, how often does the filter catch it?
P(\text{unsafe} \mid \text{flagged}) | If the filter flagged it, how likely is it unsafe?

They look similar, but they answer opposite questions.
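
One way to internalize the difference is to compute both directions from the same 1,000-message world; this sketch only re-counts the story above:

python
# The same 1,000-message world as the worked example.
unsafe, safe = 10, 990
flag_rate_unsafe = 0.95  # P(flagged | unsafe), given
flag_rate_safe = 0.05    # P(flagged | safe), given

# Reversed direction: P(unsafe | flagged) needs the whole flagged pile.
flagged_unsafe = unsafe * flag_rate_unsafe  # 9.5
flagged_safe = safe * flag_rate_safe        # 49.5
p_unsafe_given_flagged = flagged_unsafe / (flagged_unsafe + flagged_safe)

print(flag_rate_unsafe)                  # 0.95
print(round(p_unsafe_given_flagged, 3))  # 0.161

Same filter, two very different numbers.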

This mistake appears constantly in LLM products:

  • treating model confidence as probability of truth
  • treating a retrieval score as probability of relevance
  • treating one judge score as proof an answer is correct
  • treating a rare safety event as common after seeing one alert

Mini Test

Add a guard before you trust this function:

python
def test_flagged_posterior():
    result = flagged_posterior(0.01, 0.95, 0.05)
    assert 0.15 < result < 0.17

Then add input checks:

python
def check_probability(x, name):
    assert 0 <= x <= 1, f"{name} must be between 0 and 1"

Probability code should fail loudly when a value like 1.4 enters the system. Silent invalid probabilities create fake certainty.
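
One way to wire the guard in, assuming the flagged_posterior from the Code Lab, is to validate every input before any arithmetic runs. The wrapper name below is hypothetical; this is a sketch, not the only layout:

python
def flagged_posterior_checked(prior, true_positive, false_positive):
    # Fail loudly on invalid probabilities instead of returning nonsense.
    check_probability(prior, "prior")
    check_probability(true_positive, "true_positive")
    check_probability(false_positive, "false_positive")
    return flagged_posterior(prior, true_positive, false_positive)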

Practice

  1. Compute P(unsafe) from the table without code.
  2. Compute how many safe messages get falsely flagged.
  3. Change the false-positive rate from 0.05 to 0.01.
  4. Explain why the posterior changes.
  5. Write one sentence that starts: "A flagged message isn't automatically unsafe because..."

Production Check

Before using probability in a product decision, write down:

  • event: what exactly is being predicted
  • population: which users, requests, documents, or tasks count
  • base rate: how common the event is before evidence
  • evidence: what signal was observed
  • decision: what threshold turns probability into action

For an LLM system, this might be:

Event: answer contains a policy violation.
Population: support-chat answers in English.
Evidence: safety classifier score above threshold.
Decision: send answer to human review if posterior risk is above 20 percent.

That sentence is boring, but it is what makes probability useful.
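
As a sketch of the final decision step, turning a posterior into an action can be a one-line threshold rule. The 0.20 cutoff is the one from the example above; the function name is hypothetical:

python
REVIEW_THRESHOLD = 0.20  # posterior risk above this goes to human review

def route_answer(posterior_risk):
    # Decision rule from the production checklist above.
    if posterior_risk > REVIEW_THRESHOLD:
        return "human_review"
    return "auto_send"

print(route_answer(0.161))  # auto_send
print(route_answer(0.35))   # human_review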

Next, continue to Statistics and Uncertainty. Probability gave you a way to reason about uncertain events. Statistics asks how much you should trust a probability estimated from data.

Evaluation Rubric

  1. Explains events, priors, evidence, likelihoods, and posteriors using counts before formulas
  2. Computes Bayes rule by hand and in runnable Python
  3. Spots base-rate and reversed-conditional mistakes in ML product decisions
Common Pitfalls
  • People often treat model confidence as probability of truth. A calibrated probability needs data, a target event, and a population.
  • People often reverse P(unsafe | flagged) and P(flagged | unsafe). Those are different questions.
  • Ignoring the population makes the same percentage mean different things in different products.
Key Concepts Tested
events, conditional probability, independence, Bayes rule, expectation, variance
References

  1. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective.
  2. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
  3. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning.
