
Statistics and Uncertainty

A math-book style chapter on estimates, sample size, standard error, and confidence intervals using model accuracy as the running example.


Statistics starts with a humbling idea: your dataset is only a sample.

If an evaluation says a model got 84 out of 100 examples right, the model's observed accuracy is 84 percent. But that doesn't mean the model's true accuracy is exactly 84 percent on future traffic.

The sample is a spoonful, not the whole pot.

What You Learn

This chapter teaches how to separate a score from the uncertainty around that score. That habit is the difference between "the metric went up" and "the evidence is strong enough to launch."[1][2][3]

[Figure: sample observations, bootstrap resamples, a mean estimate, and a confidence interval around an evaluation metric]
Visual anchor: the center dot is only the estimate. The interval around it tells you how much uncertainty the sample still carries.

Step Map

| Step | Question | What you should be able to do |
|------|----------|-------------------------------|
| 1 | What did we observe? | Compute a sample metric. |
| 2 | What are we trying to know? | Name the unknown true value. |
| 3 | How noisy is the estimate? | Compute standard error or bootstrap spread. |
| 4 | What range is plausible? | Report an interval, not just one score. |
| 5 | What decision follows? | Compare the interval to the launch threshold. |

Probability taught you how to reason about uncertain events. Statistics teaches you how much to trust estimates from finite data.


Tiny Story

You evaluate an answer classifier on 100 held-out examples.

It gets 84 correct.

| Quantity | Value |
|----------|-------|
| Correct answers | 84 |
| Total examples | 100 |
| Observed accuracy | 0.84 |

The point estimate is:

$$\hat{p} = \frac{84}{100} = 0.84$$

The hat in $\hat{p}$ means "estimate." It isn't the exact truth. It is your best guess from this sample.

Worked Example

A useful first approximation for accuracy uncertainty is the standard error:

$$SE = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

For 84 correct out of 100:

$$SE = \sqrt{\frac{0.84 \times 0.16}{100}} \approx 0.0367$$

A rough 95 percent interval uses about two standard errors on each side:

$$0.84 \pm 1.96 \times 0.0367$$

That gives approximately:

$$[0.768,\ 0.912]$$

Plain English:

The observed score is 84 percent, but with only 100 examples, a plausible range is roughly 77 percent to 91 percent.

That range is wide. A one-point improvement isn't convincing when the uncertainty band is this large.

Same Score, Different Evidence

Now compare two evaluations:

| Result | Observed accuracy | What changes? |
|--------|-------------------|---------------|
| 84 / 100 | 0.84 | Small sample, wide interval |
| 8400 / 10000 | 0.84 | Large sample, narrow interval |

Both have the same center. They don't have the same evidence strength.

That is one of the most important lessons in applied ML evaluation:

A metric without sample size isn't a conclusion.

What Sample Size Buys You

Use the same formula again, but slow down the story.

The estimate stays at 0.84:

$$\hat{p} = 0.84$$

The part that changes is the denominator:

$$SE = \sqrt{\frac{0.84 \times 0.16}{n}}$$

When $n$ grows, the fraction gets smaller. The square root gets smaller too. That means the interval gets tighter.

| Evaluation | Standard error | Rough 95 percent interval |
|------------|----------------|---------------------------|
| 84 / 100 | about 0.0367 | 0.768 to 0.912 |
| 840 / 1000 | about 0.0116 | 0.817 to 0.863 |
| 8400 / 10000 | about 0.0037 | 0.833 to 0.847 |

This table is the bridge from math to judgment.

With 100 examples, 84 percent could still mean the true performance is in the high 70s. With 10,000 examples, the score is pinned much closer to 84 percent.

That doesn't make big evals automatically honest. Bad sampling can still fool you. It only means random wobble got smaller.
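
To check the shrinkage yourself, here is a short sketch that recomputes the table rows. It reuses only the standard-error formula above; the loop over sample sizes is an illustrative addition.

```python
import math

# Same observed proportion, growing n: watch the interval tighten.
p_hat = 0.84
for n in [100, 1000, 10000]:
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
    print(f"n={n:>6}  SE={se:.4f}  interval=[{low:.3f}, {high:.3f}]")
```

The printed values should match the table up to rounding.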

A Report That Reads Like Evidence

A useful evaluation note has four parts:

| Part | Example |
|------|---------|
| Metric | accuracy on held-out answer labels |
| Estimate | 84 correct out of 100, so 0.84 |
| Uncertainty | rough 95 percent interval from 0.768 to 0.912 |
| Decision | not enough evidence for a 1-point launch threshold |

That form is plain, but it prevents metric theater.

When someone asks "is the new model better?", you can answer with evidence:

We saw a one-point lift, but the interval is wider than that. The honest answer is that we need a larger or better-targeted eval before launch.

Vocabulary

  • sample: the data you actually observed.
  • population: the larger world you want to make a claim about.
  • estimator: a rule that calculates a value from data, such as sample accuracy.
  • estimate: the number produced by the estimator.
  • bias: systematic error that pushes estimates in one direction.
  • variance: how much an estimate moves from sample to sample.
  • standard error: rough size of that sample-to-sample wobble.
  • confidence interval: a range produced by a repeatable procedure. Under its assumptions, the procedure captures the true value at the chosen long-run rate.

Don't memorize these as vocabulary cards. Tie them to the table above.

Code Lab

Put this in statistics_demo.py:

```python
import math

def accuracy_interval(successes: int, n: int):
    """Point estimate plus a rough 95 percent normal-approximation interval."""
    p = successes / n
    standard_error = math.sqrt(p * (1 - p) / n)
    half_width = 1.96 * standard_error
    return p, p - half_width, p + half_width

for successes, n in [(84, 100), (8400, 10000)]:
    estimate, low, high = accuracy_interval(successes, n)
    print(successes, n, round(estimate, 3), round(low, 3), round(high, 3))
```

Read the output as:

```text
successes total estimate low high
```

The two rows should have the same estimate and different interval widths.
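
The formula behind `accuracy_interval` leans on a normal approximation. The Step Map also mentioned bootstrap spread; the sketch below is one minimal way to get it, assuming per-example 0/1 outcomes and a percentile interval. The `bootstrap_interval` helper, its resample count, and the fixed seed are illustrative choices, not part of the lab above.

```python
import random

def bootstrap_interval(successes: int, n: int, resamples: int = 10_000, seed: int = 0):
    """Percentile-bootstrap 95 percent interval for accuracy on 0/1 outcomes."""
    rng = random.Random(seed)
    outcomes = [1] * successes + [0] * (n - successes)
    stats = sorted(
        sum(rng.choices(outcomes, k=n)) / n  # accuracy of one resample
        for _ in range(resamples)
    )
    return stats[int(0.025 * resamples)], stats[int(0.975 * resamples)]

low, high = bootstrap_interval(84, 100)
print(f"bootstrap 95% interval: [{low:.3f}, {high:.3f}]")  # close to [0.768, 0.912]
```

For a plain proportion the two approaches agree closely. The bootstrap earns its keep on metrics with no clean standard-error formula, such as medians or F1.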

Why Intervals Matter

Suppose your current model scores 84 percent and a new model scores 85 percent.

With 100 examples, that difference is probably not enough evidence. With 100,000 examples, it might be.
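
One way to make "probably not enough" concrete is to put an interval on the difference itself. The sketch below assumes the two models were scored on independent samples of the same size; paired evals on shared examples would call for a paired analysis instead. The helper name is illustrative.

```python
import math

def difference_interval(correct_a: int, correct_b: int, n: int):
    """Rough 95 percent interval for accuracy_b - accuracy_a, independent evals of size n."""
    p_a, p_b = correct_a / n, correct_b / n
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    diff = p_b - p_a
    return diff - 1.96 * se, diff + 1.96 * se

print(difference_interval(84, 85, 100))           # roughly [-0.09, 0.11]: spans zero
print(difference_interval(84000, 85000, 100000))  # roughly [0.007, 0.013]: excludes zero
```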

Statistics doesn't remove judgment. It makes judgment harder to fake.

MLE And MAP

You will see MLE and MAP later.

At foundation level, keep the mental model simple:

  • MLE chooses the parameter that makes the observed data most likely.
  • MAP does the same thing, but also includes prior belief.

For the accuracy example, the MLE estimate for the success probability is just:

$$\hat{p} = \frac{\text{successes}}{n}$$

Later, when models have millions or billions of parameters, the idea is still familiar: choose parameters that explain the data well.
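
At this level a two-line comparison is enough to see the contrast. The sketch below uses a Beta(a, b) prior for the MAP version; the prior values and the helper names are illustrative choices, not something this chapter fixes.

```python
def mle_accuracy(successes: int, n: int) -> float:
    # MLE: the proportion that makes the observed data most likely.
    return successes / n

def map_accuracy(successes: int, n: int, a: float = 2.0, b: float = 2.0) -> float:
    # MAP with a Beta(a, b) prior: same idea, nudged toward prior belief.
    return (successes + a - 1) / (n + a + b - 2)

print(mle_accuracy(84, 100))  # 0.84
print(map_accuracy(84, 100))  # about 0.833, pulled slightly toward 0.5
```

With a = b = 2 the prior prefers values near 0.5, so MAP sits just below MLE here. As n grows, the prior's pull fades and the two estimates converge.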

Common Trap

The common trap is treating the metric as the truth.

Bad report:

Model B is better. It got 85 percent and Model A got 84 percent.

Better report:

Model B scored 85 percent on 100 examples. Model A scored 84 percent. The interval is wider than the difference, so this isn't enough evidence to launch.

The second report is less flashy. It is also much more useful.

Mini Test

Add this test:

```python
def test_more_examples_narrows_interval():
    _, low_small, high_small = accuracy_interval(84, 100)
    _, low_big, high_big = accuracy_interval(8400, 10000)

    small_width = high_small - low_small
    big_width = high_big - low_big

    assert big_width < small_width
```

This test protects the core idea: more data should reduce uncertainty when the observed proportion stays the same.
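
If you use pytest (an assumption; any runner that discovers `test_`-prefixed functions will do), `pytest statistics_demo.py` collects and runs it directly.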

Practice

  1. Compute the standard error and rough 95 percent interval for 84 / 100 by hand.
  2. Explain why 84 / 100 and 8400 / 10000 aren't equally strong evidence.
  3. Change the code to evaluate 52 / 60.
  4. Write one sentence that starts: "I wouldn't launch yet because..."
  5. Add a guard that rejects n = 0.

Production Check

Every evaluation report should include:

  • sample size
  • metric definition
  • data split or sampling rule
  • estimate
  • uncertainty interval
  • decision threshold
  • slices where the model might fail differently

For LLM systems, slices matter. A model can look good overall and still fail on long documents, non-English prompts, or code-heavy answers.
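
A minimal way to act on that is to report the same interval per slice, with slice sizes attached. The slice names and counts below are hypothetical, and the normal approximation is rough at small n, but the shape of the report is the point.

```python
import math

def interval(successes: int, n: int):
    """Point estimate plus a rough 95 percent interval (normal approximation)."""
    p = successes / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, p - half, p + half

# Hypothetical slice results; real slices come from your eval metadata.
slices = {"overall": (840, 1000), "long_docs": (41, 60), "non_english": (22, 30)}
for name, (successes, n) in slices.items():
    p, low, high = interval(successes, n)
    print(f"{name:<12} n={n:>5}  {p:.3f}  [{low:.3f}, {high:.3f}]")
```

The small slices earn wide intervals automatically, which is exactly the warning a launch review needs to see.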

Next, continue to Distributions and Sampling. You now know a sample can wobble. The next chapter shows how to simulate that wobble before production traffic exists.

Evaluation Rubric

  1. Explains why a sample score is an estimate, not the exact future score
  2. Computes standard error and a confidence interval by hand and in Python
  3. Reports sample size, interval width, and slice uncertainty before making launch claims

Common Pitfalls
  • A single score without an interval hides risk. 84 percent on 100 examples and 84 percent on 10,000 examples shouldn't drive the same launch decision.
  • A small slice can look great or terrible by luck. Report slice sample size before trusting the slice.
  • A confidence interval doesn't prove the true value is inside. It describes the behavior of the estimation process.

Key Concepts Tested

estimators, bias, variance, MLE, MAP, confidence intervals, calibration
References

  1. Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning.
  2. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective.
  3. Bishop, C. M. (2006). Pattern Recognition and Machine Learning.
