Model an e-commerce support agent with binary outcomes, request intents, tool-call counts, and tail latency, then challenge each simulation before trusting it.
The previous chapter estimated one unknown rate: the precision of a fraud-review queue. Real AI products emit more than one kind of measurement. An e-commerce support agent may resolve a request or escalate it, route a request into an intent, call tools several times, and take a variable number of seconds to answer.
A distribution describes the values a random quantity can produce and how likely those values are. A wrong distribution does more than make a chart look odd. It can make a simulator produce impossible latency, hide retry bursts, or claim a launch is safe when it isn't.[1][2]
Suppose you are planning a load test for an agent that answers order-support requests. For now, the parameter values below are illustrative assumptions, not measured production results. A real launch would estimate them from logged and reviewed traffic.
| Quantity from one request | Example value | Value shape | First simulation model | Summary to report |
|---|---|---|---|---|
| Resolved without a human | 1 or 0 | binary | Bernoulli | resolution rate |
| Customer intent | tracking, refund, return_label, damaged_item | one label | categorical | label mix |
| Tool calls made | 0, 1, 2, ... | count | Poisson baseline | mean and share above a cost threshold |
| End-to-end response time | 7.4 seconds | positive continuous | lognormal baseline | median and p95 |
The phrase first simulation model matters. A distribution is a claim about data shape. If traces disagree with the claim, you revise the model instead of defending the convenient formula.
The first split is between discrete and continuous quantities.
| Kind | Meaning | In the support agent |
|---|---|---|
| Discrete | values are separate outcomes you can enumerate | resolved or escalated, one intent label, number of tool calls |
| Continuous | values live on a measured range | response time in seconds |
For a discrete quantity, you can ask for the probability of one exact value, such as P(tool_calls = 2). For a continuous response time, one exact infinitely precise value has probability zero under the model. Ask for a range or a percentile instead: P(time > 15 seconds) or the p95 response time.
Before sampling, ask:
That fourth question separates an engineering simulation from random-number decoration.
Begin with the smallest distribution. Let R = 1 if the agent resolves an order-support request without a human, and R = 0 if it escalates. Assume a planning value of p = 0.72.
| Outcome | R | Probability |
|---|---|---|
| escalation | 0 | 0.28 |
| resolved | 1 | 0.72 |
The expected value is:
Because R is encoded as 0 or 1, its long-run average is the resolution rate.
1p_resolved = 0.72
2outcomes = {"escalated": 0, "resolved": 1}
3probabilities = {"escalated": 1 - p_resolved, "resolved": p_resolved}
4
5expected_value = sum(
6 outcomes[name] * probabilities[name] for name in outcomes
7)
8
9print(f"P(resolved): {probabilities['resolved']:.2f}")
10print(f"P(escalated): {probabilities['escalated']:.2f}")
11print(f"E[R]: {expected_value:.2f}")
12
13assert expected_value == p_resolved1P(resolved): 0.72
2P(escalated): 0.28
3E[R]: 0.72A batch of requests still moves around that long-run rate. Here, the distribution is held fixed at 0.72, while six independent 100-request batches vary.
1import numpy as np
2
3rng = np.random.default_rng(12)
4resolved_per_batch = rng.binomial(n=100, p=0.72, size=6)
5rates = resolved_per_batch / 100
6
7print("resolved counts:", resolved_per_batch.tolist())
8print("batch rates: ", [float(round(rate, 2)) for rate in rates])
9print(f"range: {rates.min():.2f} to {rates.max():.2f}")1resolved counts: [75, 65, 76, 76, 74, 75]
2batch rates: [0.75, 0.65, 0.76, 0.76, 0.74, 0.75]
3range: 0.65 to 0.76The previous chapter taught you how to put an interval around a rate measured from real labels. Here, you are running the distribution forward to see the outcomes it could generate.
A customer request isn't always binary. It may be exactly one of several intent labels. For the route used by the support agent, suppose the planning mix is:
| Intent | Probability | Expected count among 500 requests |
|---|---|---|
| tracking | 0.45 | 225 |
| refund | 0.20 | 100 |
| return label | 0.20 | 100 |
| damaged item | 0.15 | 75 |
The four probabilities sum to 1.00 because every request receives one route in this simplified taxonomy. A categorical distribution draws one of those labels according to the supplied weights.
1import numpy as np
2
3intents = np.array(["tracking", "refund", "return_label", "damaged_item"])
4probabilities = np.array([0.45, 0.20, 0.20, 0.15])
5requests = 500
6
7assert np.isclose(probabilities.sum(), 1.0)
8
9rng = np.random.default_rng(21)
10sampled = rng.choice(intents, size=requests, p=probabilities)
11expected = {
12 str(intent): int(count)
13 for intent, count in zip(intents, requests * probabilities)
14}
15observed = {str(intent): int((sampled == intent).sum()) for intent in intents}
16
17print("expected counts:", expected)
18print("sampled counts: ", observed)
19print("all labels known:", set(sampled).issubset(set(intents)))1expected counts: {'tracking': 225, 'refund': 100, 'return_label': 100, 'damaged_item': 75}
2sampled counts: {'tracking': 230, 'refund': 97, 'return_label': 102, 'damaged_item': 71}
3all labels known: TrueThe sampled counts won't equal the expected counts exactly. That is ordinary variation. An unknown label such as chargeback_legal would be a different problem: the taxonomy or the sampler contract is wrong.
Tool calls are counts. A simple baseline is a Poisson distribution with parameter lambda, written . If lambda = 2.2, the simulated agent makes 2.2 tool calls per request on average.
For a Poisson quantity C, the probability of exactly k calls is:
With , compute a few values before sampling:
1from math import exp, factorial
2
3lam = 2.2
4
5def poisson_probability(k: int) -> float:
6 if k < 0:
7 raise ValueError("k must be nonnegative")
8 return lam**k * exp(-lam) / factorial(k)
9
10for k in range(5):
11 print(f"P(calls = {k}): {poisson_probability(k):.3f}")
12
13share_above_5 = 1 - sum(poisson_probability(k) for k in range(6))
14print(f"P(calls > 5): {share_above_5:.3f}")
15
16try:
17 poisson_probability(-1)
18except ValueError as error:
19 print(error)1P(calls = 0): 0.111
2P(calls = 1): 0.244
3P(calls = 2): 0.268
4P(calls = 3): 0.197
5P(calls = 4): 0.108
6P(calls > 5): 0.025
7k must be nonnegativeThe Poisson model makes a stronger claim than "counts can't be negative." Its mean and variance are both . That assumption can be reasonable for independent events arriving at a steady rate. It's often questionable for an agent: a difficult refund request may trigger a search, a policy lookup, a retry, and a handoff together. Bursts create overdispersion, where variance is much larger than the mean.[1][2]
Use a stress-test fixture to make that failure visible. The steady stream comes from the Poisson baseline. The bursty stream mixes ordinary requests with a small fraction of hard requests requiring many tools.
1import numpy as np
2
3rng = np.random.default_rng(31)
4steady = rng.poisson(lam=2.2, size=5_000)
5hard_request = rng.random(5_000) < 0.12
6bursty = np.where(
7 hard_request,
8 rng.poisson(lam=10.0, size=5_000),
9 rng.poisson(lam=1.2, size=5_000),
10)
11
12def report(name: str, counts: np.ndarray) -> None:
13 mean = counts.mean()
14 variance = counts.var()
15 above_five = (counts > 5).mean()
16 print(
17 f"{name:>6}: mean={mean:.2f} variance={variance:.2f} "
18 f"variance/mean={variance / mean:.2f} share>5={above_five:.3f}"
19 )
20
21report("steady", steady)
22report("bursty", bursty)1steady: mean=2.24 variance=2.19 variance/mean=0.98 share>5=0.025
2bursty: mean=2.25 variance=10.09 variance/mean=4.48 share>5=0.111When the variance-to-mean ratio is far above 1, the Poisson baseline understates burst behavior. You don't need a more advanced count distribution yet. You need the habit: inspect the fit before trusting cost or capacity conclusions.
End-to-end agent response time can't be negative, and tool-assisted requests often form a slower right tail. A lognormal baseline encodes those two facts: if T is lognormal, then log(T) is normally distributed.
Choose parameters with care. NumPy's mean and sigma arguments for lognormal describe log seconds, not seconds. Setting mu = log(6) gives a median response time of 6 seconds. With sigma = 0.55, the mean is higher than the median because slow requests pull the right tail outward.
1import numpy as np
2
3median_seconds = 6.0
4sigma = 0.55
5mu = np.log(median_seconds)
6
7rng = np.random.default_rng(44)
8response_seconds = rng.lognormal(mean=mu, sigma=sigma, size=5_000)
9
10print(f"target median seconds: {np.exp(mu):.2f}")
11print(f"sample median seconds: {np.median(response_seconds):.2f}")
12print(f"sample mean seconds: {response_seconds.mean():.2f}")
13print(f"sample p95 seconds: {np.percentile(response_seconds, 95):.2f}")
14print(f"nonpositive values: {(response_seconds <= 0).sum()}")1target median seconds: 6.00
2sample median seconds: 6.00
3sample mean seconds: 7.05
4sample p95 seconds: 14.89
5nonpositive values: 0The median describes a typical request; p95 tells you about a noticeably slow slice. Reporting only the mean would conceal the operational question customers feel: how long do the slow responses take?
A normal distribution may look convenient because it takes a familiar mean and standard deviation. For positive response time, it can generate impossible negative seconds.
1import numpy as np
2
3rng = np.random.default_rng(44)
4bad_response_seconds = rng.normal(loc=8.0, scale=6.0, size=5_000)
5
6print(f"minimum seconds: {bad_response_seconds.min():.2f}")
7print(f"share below zero: {(bad_response_seconds < 0).mean():.3f}")
8print(f"p95 seconds: {np.percentile(bad_response_seconds, 95):.2f}")
9
10assert (bad_response_seconds < 0).any()1minimum seconds: -13.09
2share below zero: 0.088
3p95 seconds: 17.91This isn't a harmless inconvenience. If a simulator violates the physical shape of the metric, it can't be trusted to estimate service-level risk without repair.
Monte Carlo simulation means using repeated random samples to approximate a quantity you care about. For the Bernoulli resolution rate, larger simulated batches sit more tightly around the assumed p. The familiar 1 / sqrt(N) behavior returns: multiplying sample size by four roughly halves standard-error scale.
1import numpy as np
2
3rng = np.random.default_rng(55)
4p_resolved = 0.72
5
6for batch_size in [100, 1_000, 10_000]:
7 repeated_rates = rng.binomial(
8 n=batch_size, p=p_resolved, size=2_000
9 ) / batch_size
10 print(
11 f"N={batch_size:>5}: mean={repeated_rates.mean():.3f} "
12 f"sd_of_rates={repeated_rates.std():.4f}"
13 )1N= 100: mean=0.719 sd_of_rates=0.0454
2N= 1000: mean=0.720 sd_of_rates=0.0140
3N=10000: mean=0.720 sd_of_rates=0.0047Simulation answers a conditional question: "What happens if these assumptions are the generating process?" It doesn't validate p = 0.72, lambda = 2.2, or the latency parameters. Those must come from data, review design, and production measurement.
Now put the shapes together. The report below uses one seeded generator and prints the quantities an engineer could review before planning capacity.
1import numpy as np
2
3def simulate_support_agent(seed: int, n_requests: int) -> dict[str, float]:
4 if n_requests <= 0:
5 raise ValueError("n_requests must be positive")
6
7 rng = np.random.default_rng(seed)
8 resolved = rng.binomial(n=1, p=0.72, size=n_requests)
9
10 intents = np.array(["tracking", "refund", "return_label", "damaged_item"])
11 intent_probs = np.array([0.45, 0.20, 0.20, 0.15])
12 routed = rng.choice(intents, size=n_requests, p=intent_probs)
13
14 tool_calls = rng.poisson(lam=2.2, size=n_requests)
15 response_seconds = rng.lognormal(
16 mean=np.log(6.0), sigma=0.55, size=n_requests
17 )
18
19 return {
20 "resolution_rate": float(resolved.mean()),
21 "refund_share": float((routed == "refund").mean()),
22 "mean_tool_calls": float(tool_calls.mean()),
23 "share_over_5_tools": float((tool_calls > 5).mean()),
24 "median_seconds": float(np.median(response_seconds)),
25 "p95_seconds": float(np.percentile(response_seconds, 95)),
26 "nonpositive_times": float((response_seconds <= 0).sum()),
27 }
28
29report = simulate_support_agent(seed=7, n_requests=5_000)
30for name, value in report.items():
31 print(f"{name:>20}: {value:.3f}")
32
33assert report["nonpositive_times"] == 0
34assert report["p95_seconds"] > report["median_seconds"]1resolution_rate: 0.722
2 refund_share: 0.195
3 mean_tool_calls: 2.228
4 share_over_5_tools: 0.023
5 median_seconds: 5.973
6 p95_seconds: 15.064
7 nonpositive_times: 0.000Use that output as a simulator report, not a production claim. The rate, label mix, count tail, and latency tail are only as credible as their fitted parameters and assumption checks.
The support-agent labels above form a categorical distribution because one label is selected from a finite list. An LLM uses the same mathematical family at a much larger scale: after producing a logit for each candidate token, softmax converts the logits into probabilities, and a decoder may sample one token from the weighted vocabulary.[3]
Take a tiny continuation for the prompt Your refund is. Four tokens are enough to see the mechanics:
1import numpy as np
2
3tokens = np.array(["approved", "pending", "delayed", "unicorn"])
4logits = np.array([2.2, 1.5, 0.2, -1.5])
5
6def stable_softmax(values: np.ndarray) -> np.ndarray:
7 shifted = values - values.max()
8 weights = np.exp(shifted)
9 return weights / weights.sum()
10
11probabilities = stable_softmax(logits)
12rng = np.random.default_rng(9)
13chosen = rng.choice(tokens, p=probabilities)
14
15for token, probability in zip(tokens, probabilities):
16 print(f"{token:>8}: {probability:.3f}")
17print("sampled token:", chosen)
18print("probabilities sum to one:", np.isclose(probabilities.sum(), 1.0))1approved: 0.604
2 pending: 0.300
3 delayed: 0.082
4 unicorn: 0.015
5sampled token: pending
6probabilities sum to one: TrueLater chapters teach decoding controls in depth. For now, retain the bridge: a sampling decoder doesn't choose from raw logits directly. It draws from a probability distribution derived from those logits. A greedy decoder is different: it selects the highest-probability token instead of drawing a sample. For open-ended text generation, nucleus sampling chooses a dynamic high-probability token set and renormalizes it before sampling, rather than sampling from an unreliable long tail.[4]
Before believing output from a random simulator, write down its contract:
| Quantity | Model used | Report | Challenge before trusting it |
|---|---|---|---|
| resolved without human | Bernoulli | rate | is one request one binary decision? |
| request route | categorical | proportions | are all real intent labels represented? |
| tool calls | Poisson baseline | mean and share above 5 | is variance near mean, or are calls bursty? |
| response time | lognormal baseline | median and p95 | is any time nonpositive, and does the tail resemble traces? |
| next token | categorical from softmax | sampled token and probabilities | does decoding alter the candidate set or distribution? |
The seed also belongs in the report. A seed makes one simulated run reproducible. It doesn't make an incorrect model correct.
A teammate proposes this e-commerce support-agent simulator:
| Quantity | Proposed model | Claim |
|---|---|---|
| resolution | Bernoulli with p = 0.78 | 78 percent resolves automatically |
| request type | categorical with no damaged_item label | covers incoming requests |
| tool calls | Poisson with mean 2.0 | estimates tool cost |
| response time | normal with mean 7 seconds and standard deviation 8 seconds | estimates p95 latency |
Write a review with four parts:
p, categorical probabilities, lambda, mu, and sigma in plain English.Answer every question, then check your score. Score above 75% to mark this lesson complete.
8 questions remaining.