Distributions and Sampling

Model an e-commerce support agent with binary outcomes, request intents, tool-call counts, and tail latency, then challenge each simulation before trusting it.


The previous chapter estimated one unknown rate: the precision of a fraud-review queue. Real AI products emit more than one kind of measurement. An e-commerce support agent may resolve a request or escalate it, route a request into an intent, call tools several times, and take a variable number of seconds to answer.

A distribution describes the values a random quantity can produce and how likely those values are. A wrong distribution does more than make a chart look odd. It can make a simulator produce impossible latency, hide retry bursts, or claim a launch is safe when it isn't.[1][2]

[Figure: E-commerce support-agent simulation showing Bernoulli resolution, categorical intent routing, Poisson baseline tool calls, and lognormal response time.]
One agent produces several kinds of data. Match each quantity to a possible first distribution, then inspect the summary and assumption that matter for a product decision.

One agent, four random quantities

Suppose you are planning a load test for an agent that answers order-support requests. For now, the parameter values below are illustrative assumptions, not measured production results. A real launch would estimate them from logged and reviewed traffic.

| Quantity from one request | Example value | Value shape | First simulation model | Summary to report |
|---|---|---|---|---|
| Resolved without a human | 1 or 0 | binary | Bernoulli | resolution rate |
| Customer intent | tracking, refund, return_label, damaged_item | one label | categorical | label mix |
| Tool calls made | 0, 1, 2, ... | count | Poisson baseline | mean and share above a cost threshold |
| End-to-end response time | 7.4 seconds | positive continuous | lognormal baseline | median and p95 |

The phrase first simulation model matters. A distribution is a claim about data shape. If traces disagree with the claim, you revise the model instead of defending the convenient formula.

[Figure: Simulation workflow diagram showing observed values mapped to distribution shapes, seeded samples, a fit challenge, and a risk summary before trust.]
A simulator is useful only when it includes a check that can prove its chosen shape is inadequate.

Exact outcomes and measured ranges

The first split is between discrete and continuous quantities.

| Kind | Meaning | In the support agent |
|---|---|---|
| Discrete | values are separate outcomes you can enumerate | resolved or escalated, one intent label, number of tool calls |
| Continuous | values live on a measured range | response time in seconds |

For a discrete quantity, you can ask for the probability of one exact value, such as P(tool_calls = 2). For a continuous response time, one exact infinitely precise value has probability zero under the model. Ask for a range or a percentile instead: P(time > 15 seconds) or the p95 response time.
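The difference between an exact value and a range can be checked by simulation. The sketch below is only an illustration: it borrows the planning values used later in this lesson (a Poisson mean of 2.2 tool calls and a lognormal response time with a 6-second median and sigma 0.55), and the seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for this illustration
n_draws = 100_000

# Illustrative planning values, not measured production parameters.
tool_calls = rng.poisson(lam=2.2, size=n_draws)
response_seconds = rng.lognormal(mean=np.log(6.0), sigma=0.55, size=n_draws)

# Discrete: one exact value carries real probability mass.
share_exactly_two_calls = float((tool_calls == 2).mean())

# Continuous: an exact value is essentially never hit, but a range is.
share_exactly_7_4_seconds = float((response_seconds == 7.4).mean())
share_over_15_seconds = float((response_seconds > 15.0).mean())

print(f"share with exactly 2 calls:   {share_exactly_two_calls:.3f}")
print(f"share with exactly 7.4 s:     {share_exactly_7_4_seconds:.3f}")
print(f"share slower than 15 seconds: {share_over_15_seconds:.3f}")
```

The exact-time share should come out as zero while the range share does not, which is why latency questions are phrased as thresholds or percentiles.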

Before sampling, ask:

  1. What values are allowed?
  2. Which parameter sets the rate, center, or spread?
  3. Which tail or threshold would hurt users or cost?
  4. What observation would prove this model is a poor fit?

That fourth question separates an engineering simulation from random-number decoration.

Bernoulli: one request either resolves or escalates

Begin with the smallest distribution. Let R = 1 if the agent resolves an order-support request without a human, and R = 0 if it escalates. Assume a planning value of p = 0.72.

| Outcome | R | Probability |
|---|---|---|
| escalation | 0 | 0.28 |
| resolved | 1 | 0.72 |

The expected value is:

$$E[R] = 0 \times 0.28 + 1 \times 0.72 = 0.72$$

Because R is encoded as 0 or 1, its long-run average is the resolution rate.

bernoulli-resolution-rate.py

```python
from math import isclose

p_resolved = 0.72
outcomes = {"escalated": 0, "resolved": 1}
probabilities = {"escalated": 1 - p_resolved, "resolved": p_resolved}

expected_value = sum(
    outcomes[name] * probabilities[name] for name in outcomes
)

print(f"P(resolved): {probabilities['resolved']:.2f}")
print(f"P(escalated): {probabilities['escalated']:.2f}")
print(f"E[R]: {expected_value:.2f}")

# Compare floats with a tolerance rather than exact equality.
assert isclose(expected_value, p_resolved)
```

Bernoulli resolution setup

```text
P(resolved): 0.72
P(escalated): 0.28
E[R]: 0.72
```

A batch of requests still moves around that long-run rate. Here, the distribution is held fixed at 0.72, while six independent 100-request batches vary.

sample-resolution-batches.py

```python
import numpy as np

rng = np.random.default_rng(12)
resolved_per_batch = rng.binomial(n=100, p=0.72, size=6)
rates = resolved_per_batch / 100

print("resolved counts:", resolved_per_batch.tolist())
print("batch rates: ", [float(round(rate, 2)) for rate in rates])
print(f"range: {rates.min():.2f} to {rates.max():.2f}")
```

Resolution batches wobble

```text
resolved counts: [75, 65, 76, 76, 74, 75]
batch rates:  [0.75, 0.65, 0.76, 0.76, 0.74, 0.75]
range: 0.65 to 0.76
```

The previous chapter taught you how to put an interval around a rate measured from real labels. Here, you are running the distribution forward to see the outcomes it could generate.

Categorical: one request takes one route

A customer request isn't always binary. It may be exactly one of several intent labels. For the route used by the support agent, suppose the planning mix is:

| Intent | Probability | Expected count among 500 requests |
|---|---|---|
| tracking | 0.45 | 225 |
| refund | 0.20 | 100 |
| return label | 0.20 | 100 |
| damaged item | 0.15 | 75 |

The four probabilities sum to 1.00 because every request receives one route in this simplified taxonomy. A categorical distribution draws one of those labels according to the supplied weights.

sample-intent-routing.py

```python
import numpy as np

intents = np.array(["tracking", "refund", "return_label", "damaged_item"])
probabilities = np.array([0.45, 0.20, 0.20, 0.15])
requests = 500

assert np.isclose(probabilities.sum(), 1.0)

rng = np.random.default_rng(21)
sampled = rng.choice(intents, size=requests, p=probabilities)
# Round before converting: int() alone truncates, so a product such as
# 74.999... would silently become 74.
expected = {
    str(intent): int(round(count))
    for intent, count in zip(intents, requests * probabilities)
}
observed = {str(intent): int((sampled == intent).sum()) for intent in intents}

print("expected counts:", expected)
print("sampled counts: ", observed)
print("all labels known:", set(sampled).issubset(set(intents)))
```

Categorical route sample

```text
expected counts: {'tracking': 225, 'refund': 100, 'return_label': 100, 'damaged_item': 75}
sampled counts:  {'tracking': 230, 'refund': 97, 'return_label': 102, 'damaged_item': 71}
all labels known: True
```

The sampled counts won't equal the expected counts exactly. That is ordinary variation. An unknown label such as chargeback_legal would be a different problem: the taxonomy or the sampler contract is wrong.
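The two cases above, ordinary variation and a broken contract, can be told apart mechanically. The following is a minimal sketch, not part of the lesson's simulator: `review_route_counts` is a hypothetical helper that rejects labels outside the taxonomy and scores each label's count against its binomial standard deviation. It reuses the sampled counts printed above as input.

```python
import math

# Planning mix from the lesson's table (illustrative, not measured).
PLANNED_MIX = {
    "tracking": 0.45,
    "refund": 0.20,
    "return_label": 0.20,
    "damaged_item": 0.15,
}

def review_route_counts(
    observed: dict[str, int], probabilities: dict[str, float]
) -> dict[str, float]:
    """Return a z-score per label; raise if a label is outside the taxonomy."""
    unknown = set(observed) - set(probabilities)
    if unknown:
        raise ValueError(f"labels outside the taxonomy: {sorted(unknown)}")
    total = sum(observed.values())
    z_scores = {}
    for label, p in probabilities.items():
        expected = total * p
        sd = math.sqrt(total * p * (1 - p))  # binomial sd of one label's count
        z_scores[label] = (observed.get(label, 0) - expected) / sd
    return z_scores

observed = {"tracking": 230, "refund": 97, "return_label": 102, "damaged_item": 71}
z_scores = review_route_counts(observed, PLANNED_MIX)
for label, z in z_scores.items():
    print(f"{label:>13}: z = {z:+.2f}")

try:
    review_route_counts({**observed, "chargeback_legal": 1}, PLANNED_MIX)
except ValueError as error:
    print(error)
```

Every z-score for the sampled counts sits well inside two standard deviations, which is what ordinary variation looks like. The unknown label never reaches the scoring step because it fails the contract check first.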

Poisson: a first baseline for tool-call counts

Tool calls are counts. A simple baseline is a Poisson distribution with parameter lambda, written $\lambda$. If lambda = 2.2, the simulated agent makes 2.2 tool calls per request on average.

For a Poisson quantity C, the probability of exactly k calls is:

$$P(C = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

With $\lambda = 2.2$, compute a few values before sampling:

poisson-tool-call-probabilities.py

```python
from math import exp, factorial

lam = 2.2

def poisson_probability(k: int) -> float:
    if k < 0:
        raise ValueError("k must be nonnegative")
    return lam**k * exp(-lam) / factorial(k)

for k in range(5):
    print(f"P(calls = {k}): {poisson_probability(k):.3f}")

share_above_5 = 1 - sum(poisson_probability(k) for k in range(6))
print(f"P(calls > 5): {share_above_5:.3f}")

try:
    poisson_probability(-1)
except ValueError as error:
    print(error)
```

Poisson count probabilities

```text
P(calls = 0): 0.111
P(calls = 1): 0.244
P(calls = 2): 0.268
P(calls = 3): 0.197
P(calls = 4): 0.108
P(calls > 5): 0.025
k must be nonnegative
```

The Poisson model makes a stronger claim than "counts can't be negative." Its mean and variance are both $\lambda$. That assumption can be reasonable for independent events arriving at a steady rate. It's often questionable for an agent: a difficult refund request may trigger a search, a policy lookup, a retry, and a handoff together. Bursts create overdispersion, where variance is much larger than the mean.[1][2]

Use a stress-test fixture to make that failure visible. The steady stream comes from the Poisson baseline. The bursty stream mixes ordinary requests with a small fraction of hard requests requiring many tools.

challenge-poisson-tool-calls.py

```python
import numpy as np

rng = np.random.default_rng(31)
steady = rng.poisson(lam=2.2, size=5_000)
hard_request = rng.random(5_000) < 0.12
bursty = np.where(
    hard_request,
    rng.poisson(lam=10.0, size=5_000),
    rng.poisson(lam=1.2, size=5_000),
)

def report(name: str, counts: np.ndarray) -> None:
    mean = counts.mean()
    variance = counts.var()
    above_five = (counts > 5).mean()
    print(
        f"{name:>6}: mean={mean:.2f} variance={variance:.2f} "
        f"variance/mean={variance / mean:.2f} share>5={above_five:.3f}"
    )

report("steady", steady)
report("bursty", bursty)
```

Poisson fit diagnostic

```text
steady: mean=2.24 variance=2.19 variance/mean=0.98 share>5=0.025
bursty: mean=2.25 variance=10.09 variance/mean=4.48 share>5=0.111
```

When the variance-to-mean ratio is far above 1, the Poisson baseline understates burst behavior. You don't need a more advanced count distribution yet. You need the habit: inspect the fit before trusting cost or capacity conclusions.
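One way to see why the bursty fixture is overdispersed is the law of total variance. The sketch below assumes the fixture's own settings (a 0.12 share of hard requests with Poisson mean 10.0, and ordinary requests with Poisson mean 1.2). The indicator symbol $H$ for "hard request" is introduced here only for the derivation.

$$
\begin{aligned}
E[C] &= 0.12 \times 10.0 + 0.88 \times 1.2 = 2.256 \\
\operatorname{Var}(C) &= E[\operatorname{Var}(C \mid H)] + \operatorname{Var}(E[C \mid H]) \\
&= 2.256 + 0.12 \times 0.88 \times (10.0 - 1.2)^2 \\
&\approx 2.256 + 8.178 = 10.434
\end{aligned}
$$

The first term equals the mean because each component is Poisson. The second term comes from the gap between the two request types, and it pushes the theoretical variance-to-mean ratio to roughly 4.6, the same regime as the sampled 4.48.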

Lognormal: positive time with a slow tail

End-to-end agent response time can't be negative, and tool-assisted requests often form a slower right tail. A lognormal baseline encodes those two facts: if T is lognormal, then log(T) is normally distributed.

Choose parameters with care. NumPy's mean and sigma arguments for lognormal describe log seconds, not seconds. Setting mu = log(6) gives a median response time of 6 seconds. With sigma = 0.55, the mean is higher than the median because slow requests pull the right tail outward.

$$\text{median}(T) = e^{\mu} \qquad E[T] = e^{\mu + \sigma^2/2}$$

sample-lognormal-response-time.py

```python
import numpy as np

median_seconds = 6.0
sigma = 0.55
mu = np.log(median_seconds)

rng = np.random.default_rng(44)
response_seconds = rng.lognormal(mean=mu, sigma=sigma, size=5_000)

print(f"target median seconds: {np.exp(mu):.2f}")
print(f"sample median seconds: {np.median(response_seconds):.2f}")
print(f"sample mean seconds: {response_seconds.mean():.2f}")
print(f"sample p95 seconds: {np.percentile(response_seconds, 95):.2f}")
print(f"nonpositive values: {(response_seconds <= 0).sum()}")
```

Positive response-time sample

```text
target median seconds: 6.00
sample median seconds: 6.00
sample mean seconds: 7.05
sample p95 seconds: 14.89
nonpositive values: 0
```

The median describes a typical request; p95 is the response time that only the slowest 5 percent of requests exceed. Reporting only the mean would conceal the operational question customers feel: how long do the slow responses take?
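The sampled summaries can be cross-checked against closed-form values. This is a minimal sketch under the same illustrative parameters (median 6 seconds, sigma 0.55); the 15-second threshold is an assumed service target used only for the example, and the standard library's `statistics.NormalDist` supplies the normal CDF and its inverse.

```python
from math import exp, log
from statistics import NormalDist

median_seconds = 6.0
sigma = 0.55
mu = log(median_seconds)  # parameters live on the log-seconds scale

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Lognormal mean, from E[T] = exp(mu + sigma^2 / 2).
mean_seconds = exp(mu + sigma**2 / 2)
# A lognormal percentile is the exponentiated normal percentile of log(T).
p95_seconds = exp(mu + sigma * standard_normal.inv_cdf(0.95))
# Tail share above an assumed 15-second threshold.
share_over_15 = 1 - standard_normal.cdf((log(15.0) - mu) / sigma)

print(f"theoretical mean seconds: {mean_seconds:.2f}")
print(f"theoretical p95 seconds:  {p95_seconds:.2f}")
print(f"P(T > 15 seconds):        {share_over_15:.3f}")
```

If a sampled mean or p95 lands far from these values, suspect the parameterization first, for example passing seconds where log seconds are expected.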

Failure case: a normal latency model invents time

A normal distribution may look convenient because it takes a familiar mean and standard deviation. For positive response time, it can generate impossible negative seconds.

reject-negative-response-time.py

```python
import numpy as np

rng = np.random.default_rng(44)
bad_response_seconds = rng.normal(loc=8.0, scale=6.0, size=5_000)

print(f"minimum seconds: {bad_response_seconds.min():.2f}")
print(f"share below zero: {(bad_response_seconds < 0).mean():.3f}")
print(f"p95 seconds: {np.percentile(bad_response_seconds, 95):.2f}")

assert (bad_response_seconds < 0).any()
```

Invalid latency model

```text
minimum seconds: -13.09
share below zero: 0.088
p95 seconds: 17.91
```

This isn't a harmless inconvenience. If a simulator violates the physical shape of the metric, it can't be trusted to estimate service-level risk without repair.
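The size of the problem can be read off before drawing a single sample. As a small sketch using the same illustrative normal model (mean 8 seconds, standard deviation 6 seconds), the standard library's `statistics.NormalDist` gives the share of probability mass the model places below zero.

```python
from statistics import NormalDist

# Same illustrative parameters as the failing simulator above.
bad_model = NormalDist(mu=8.0, sigma=6.0)

share_below_zero = bad_model.cdf(0.0)      # probability mass on impossible times
p95_seconds = bad_model.inv_cdf(0.95)      # the p95 this model would report

print(f"model share below zero: {share_below_zero:.3f}")
print(f"model p95 seconds:      {p95_seconds:.2f}")
```

A check like this could run as a guard in a simulator's setup: if the chosen model assigns meaningful probability to values the metric can never take, reject the model before reporting any percentile from it.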

Monte Carlo: repeat draws to estimate a question

Monte Carlo simulation means using repeated random samples to approximate a quantity you care about. For the Bernoulli resolution rate, larger simulated batches sit more tightly around the assumed p. The familiar 1 / sqrt(N) behavior returns: multiplying the sample size by four roughly halves the standard error. The run below multiplies N by ten at each step, so the spread of batch rates should shrink by about sqrt(10), or roughly 3.2, each time.

measure-monte-carlo-wobble.py

```python
import numpy as np

rng = np.random.default_rng(55)
p_resolved = 0.72

for batch_size in [100, 1_000, 10_000]:
    repeated_rates = rng.binomial(
        n=batch_size, p=p_resolved, size=2_000
    ) / batch_size
    print(
        f"N={batch_size:>5}: mean={repeated_rates.mean():.3f} "
        f"sd_of_rates={repeated_rates.std():.4f}"
    )
```

Monte Carlo convergence

```text
N=  100: mean=0.719 sd_of_rates=0.0454
N= 1000: mean=0.720 sd_of_rates=0.0140
N=10000: mean=0.720 sd_of_rates=0.0047
```

Simulation answers a conditional question: "What happens if these assumptions are the generating process?" It doesn't validate p = 0.72, lambda = 2.2, or the latency parameters. Those must come from data, review design, and production measurement.
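The simulated spread in the convergence run has a closed-form counterpart: the standard error of a rate, sqrt(p(1 - p) / N). The sketch below assumes the same planning value p = 0.72; `rate_standard_error` is a hypothetical helper name, not something from a library.

```python
from math import sqrt

p_resolved = 0.72  # assumed planning value, not a measured rate

def rate_standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion under independent draws."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return sqrt(p * (1 - p) / n)

standard_errors = {
    batch_size: rate_standard_error(p_resolved, batch_size)
    for batch_size in [100, 1_000, 10_000]
}
for batch_size, se in standard_errors.items():
    print(f"N={batch_size:>5}: theoretical sd of rates={se:.4f}")

# Quadrupling the batch size halves the standard error.
halving_ratio = rate_standard_error(p_resolved, 100) / rate_standard_error(
    p_resolved, 400
)
print(f"se(N=100) / se(N=400) = {halving_ratio:.2f}")
```

The theoretical values should sit close to the sampled `sd_of_rates` column. A large gap would point to a bug in the simulation rather than to a property of the agent.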

Build it: a launch-simulation report

Now put the shapes together. The report below uses one seeded generator and prints the quantities an engineer could review before planning capacity.

build-agent-launch-report.py

```python
import numpy as np

def simulate_support_agent(seed: int, n_requests: int) -> dict[str, float]:
    if n_requests <= 0:
        raise ValueError("n_requests must be positive")

    rng = np.random.default_rng(seed)
    resolved = rng.binomial(n=1, p=0.72, size=n_requests)

    intents = np.array(["tracking", "refund", "return_label", "damaged_item"])
    intent_probs = np.array([0.45, 0.20, 0.20, 0.15])
    routed = rng.choice(intents, size=n_requests, p=intent_probs)

    tool_calls = rng.poisson(lam=2.2, size=n_requests)
    response_seconds = rng.lognormal(
        mean=np.log(6.0), sigma=0.55, size=n_requests
    )

    return {
        "resolution_rate": float(resolved.mean()),
        "refund_share": float((routed == "refund").mean()),
        "mean_tool_calls": float(tool_calls.mean()),
        "share_over_5_tools": float((tool_calls > 5).mean()),
        "median_seconds": float(np.median(response_seconds)),
        "p95_seconds": float(np.percentile(response_seconds, 95)),
        "nonpositive_times": float((response_seconds <= 0).sum()),
    }

report = simulate_support_agent(seed=7, n_requests=5_000)
for name, value in report.items():
    print(f"{name:>20}: {value:.3f}")

assert report["nonpositive_times"] == 0
assert report["p95_seconds"] > report["median_seconds"]
```

Agent launch simulation report

```text
     resolution_rate: 0.722
        refund_share: 0.195
     mean_tool_calls: 2.228
  share_over_5_tools: 0.023
      median_seconds: 5.973
         p95_seconds: 15.064
   nonpositive_times: 0.000
```

Use that output as a simulator report, not a production claim. The rate, label mix, count tail, and latency tail are only as credible as their fitted parameters and assumption checks.

The LLM connection: tokens are categorical samples

The support-agent labels above form a categorical distribution because one label is selected from a finite list. An LLM uses the same mathematical family at a much larger scale: after producing a logit for each candidate token, softmax converts the logits into probabilities, and a decoder may sample one token from the weighted vocabulary.[3]

Take a tiny continuation for the prompt Your refund is. Four tokens are enough to see the mechanics:

sample-one-token-from-logits.py

```python
import numpy as np

tokens = np.array(["approved", "pending", "delayed", "unicorn"])
logits = np.array([2.2, 1.5, 0.2, -1.5])

def stable_softmax(values: np.ndarray) -> np.ndarray:
    shifted = values - values.max()
    weights = np.exp(shifted)
    return weights / weights.sum()

probabilities = stable_softmax(logits)
rng = np.random.default_rng(9)
chosen = rng.choice(tokens, p=probabilities)

for token, probability in zip(tokens, probabilities):
    print(f"{token:>8}: {probability:.3f}")
print("sampled token:", chosen)
print("probabilities sum to one:", np.isclose(probabilities.sum(), 1.0))
```

Tiny token distribution

```text
approved: 0.604
 pending: 0.300
 delayed: 0.082
 unicorn: 0.015
sampled token: pending
probabilities sum to one: True
```

Later chapters teach decoding controls in depth. For now, retain the bridge: a sampling decoder doesn't choose from raw logits directly. It draws from a probability distribution derived from those logits. A greedy decoder is different: it selects the highest-probability token instead of drawing a sample. For open-ended text generation, nucleus sampling chooses a dynamic high-probability token set and renormalizes it before sampling, rather than sampling from an unreliable long tail.[4]
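The three decoding behaviors in that paragraph can be compared on the same four-token example. This is a minimal sketch, not a production decoder: `nucleus_filter` is a hypothetical helper, and the `top_p=0.9` threshold is an assumed value chosen only so that the filter visibly drops the two least likely tokens.

```python
import numpy as np

tokens = np.array(["approved", "pending", "delayed", "unicorn"])
logits = np.array([2.2, 1.5, 0.2, -1.5])

def stable_softmax(values: np.ndarray) -> np.ndarray:
    weights = np.exp(values - values.max())
    return weights / weights.sum()

def nucleus_filter(probabilities: np.ndarray, top_p: float) -> np.ndarray:
    """Keep the smallest high-probability set reaching top_p, then renormalize."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    order = np.argsort(probabilities)[::-1]          # most likely first
    cumulative = np.cumsum(probabilities[order])
    keep_count = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:keep_count]
    filtered = np.zeros_like(probabilities)
    filtered[kept] = probabilities[kept]
    return filtered / filtered.sum()

probabilities = stable_softmax(logits)

# Greedy decoding: no randomness, always the most likely token.
greedy_token = str(tokens[int(np.argmax(probabilities))])

# Nucleus sampling: draw from the renormalized high-probability set.
nucleus = nucleus_filter(probabilities, top_p=0.9)  # assumed threshold
rng = np.random.default_rng(9)
sampled_token = str(rng.choice(tokens, p=nucleus))

print("greedy token:", greedy_token)
for token, probability in zip(tokens, nucleus):
    print(f"{token:>8}: {probability:.3f}")
print("nucleus-sampled token:", sampled_token)
```

With this threshold only approved and pending remain candidates, so the filtered draw can never return unicorn, while plain sampling from the full softmax occasionally would.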

A review checklist for simulated metrics

Before believing output from a random simulator, write down its contract:

| Quantity | Model used | Report | Challenge before trusting it |
|---|---|---|---|
| resolved without human | Bernoulli | rate | is one request one binary decision? |
| request route | categorical | proportions | are all real intent labels represented? |
| tool calls | Poisson baseline | mean and share above 5 | is variance near mean, or are calls bursty? |
| response time | lognormal baseline | median and p95 | is any time nonpositive, and does the tail resemble traces? |
| next token | categorical from softmax | sampled token and probabilities | does decoding alter the candidate set or distribution? |

The seed also belongs in the report. A seed makes one simulated run reproducible. It doesn't make an incorrect model correct.
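What a seed does and doesn't guarantee can be shown in a few lines. The sketch below uses a hypothetical helper, `draw_resolution_counts`, and the lesson's illustrative p = 0.72; the seed values are arbitrary.

```python
import numpy as np

def draw_resolution_counts(seed: int) -> np.ndarray:
    """Six 100-request batches from a fresh generator built from the seed."""
    rng = np.random.default_rng(seed)
    return rng.binomial(n=100, p=0.72, size=6)

first_run = draw_resolution_counts(seed=12)
second_run = draw_resolution_counts(seed=12)
other_seed_run = draw_resolution_counts(seed=13)

print("same seed reproduces the run:", np.array_equal(first_run, second_run))
print("seed 12 counts:", first_run.tolist())
print("seed 13 counts:", other_seed_run.tolist())
```

Building the generator inside the function from an explicit seed keeps each run independent of hidden global state. Reproducibility is all this buys: a seeded run of a wrong model is still a wrong model, just a repeatable one.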

Practice: challenge a candidate launch model

A teammate proposes this e-commerce support-agent simulator:

| Quantity | Proposed model | Claim |
|---|---|---|
| resolution | Bernoulli with p = 0.78 | 78 percent resolves automatically |
| request type | categorical with no damaged_item label | covers incoming requests |
| tool calls | Poisson with mean 2.0 | estimates tool cost |
| response time | normal with mean 7 seconds and standard deviation 8 seconds | estimates p95 latency |

Write a review with four parts:

  1. Name one parameter that needs evidence from logs or reviewed traffic.
  2. Identify the impossible or missing state.
  3. State one count diagnostic for the tool-call assumption.
  4. State which latency summary should appear in the report.

Mastery check

Key concepts

  • Bernoulli and categorical distributions
  • discrete versus continuous values
  • Poisson assumptions and overdispersion
  • lognormal positive measurements
  • Monte Carlo simulation
  • seeded reproducibility
  • token sampling as a categorical draw

Evaluation rubric

  • Foundational: Identifies whether a support-agent measurement is binary, categorical, count-valued, or positive continuous.
  • Developing: Explains the roles of p, categorical probabilities, lambda, mu, and sigma in plain English.
  • Intermediate: Computes a Bernoulli expectation and several Poisson probabilities, then reads simulated output without confusing a sample with a parameter.
  • Applied: Builds a seeded simulation report with rate, label mix, tool-call threshold, and p95 response time.
  • Advanced: Rejects a distribution whose generated values or dispersion contradict real system behavior, and connects categorical sampling to LLM tokens.

Common pitfalls

  • Symptom: Simulated response time becomes negative. Cause: a symmetric normal model was used for positive time with large spread. Fix: use a positive baseline and validate it against response-time traces.
  • Symptom: Mean tool cost looks fine while costly requests spike. Cause: a Poisson baseline hid bursty tool loops. Fix: compare variance with mean and report threshold exceedance.
  • Symptom: Simulation never routes damaged-item tickets. Cause: categorical support omitted a real label. Fix: define and validate the taxonomy before sampling.
  • Symptom: Two reviewers can't reproduce output. Cause: sampler state or seed is missing. Fix: pass one named seeded generator through the simulation.
  • Symptom: A team treats simulated rates as measured performance. Cause: parameters were assumed rather than estimated. Fix: fit and check parameters from representative evaluation or production data.

Mastery Check

Answer every question, then check your score. Score above 75% to mark this lesson complete.

1. A Bernoulli simulator encodes R = 1 for resolved and R = 0 for escalated with p = 0.72. One 100-request run resolves 68 requests. Which statement is correct?
2. A route sampler is configured with intents tracking, refund, return_label, and damaged_item with probabilities 0.45, 0.20, 0.20, and 0.15. In a 500-request sample, one route is chargeback_legal. What should you conclude?
3. Tool calls C are modeled as Poisson with lambda = 2.2, and response time T is modeled as a continuous positive variable. Which statement uses these models correctly?
4. A tool-call sample has mean 2.1 and variance 14.0. What should you conclude about a Poisson baseline?
5. A simulator should give response time a six-second median using rng.lognormal with sigma = 0.55. Which parameter setting matches that goal?
6. A teammate models response time with rng.normal(loc=7.0, scale=8.0) and reports only p95 latency. Which review comment is technically justified?
7. A Bernoulli simulator with assumed p = 0.72 uses a recorded seed and 2,000 repeated batches at N = 100 and N = 10,000. Which conclusion is justified?
8. A tiny decoder has softmax probabilities approved: 0.604, pending: 0.300, delayed: 0.082, and unicorn: 0.015. With a recorded seed, one sampling run returns pending. What does this show?


Next Step
Continue to Hypothesis Tests, Intervals, and pass@k

You can now generate binary evaluation outcomes and inspect their assumptions. Next you will decide when score differences are large enough to support a model claim.

References

[1] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective.

[2] Bishop, C. M. (2006). Pattern Recognition and Machine Learning.

[3] Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning.

[4] Holtzman, A., et al. (2020). The Curious Case of Neural Text Degeneration. ICLR 2020.