Turn claim-level answer traces into production metrics, actionable alerts, privacy-safe debugging records, and reproducible incident evidence.
In the previous lesson, a large language model (LLM) answer path called tracking-answerer-v1 refused to promise a delivery date that wasn't present in a FastShip record. It also emitted a trace: evidence version, route, and first failed stage. One safe trace proves one request behaved correctly. It doesn't tell you whether the service is still safe across live traffic.
Suppose a prompt rollback silently bypasses that gate for two customers. The endpoint still returns 200. Answers remain fluent. Latency stays fast. Only the trace events reveal that unsupported ETAs escaped into served messages.
Monitoring reduces many trace events into rates, percentiles, and alerts. Observability lets you start from an alert and reconstruct why a particular request failed. An LLM application needs both: aggregates to detect a change and request evidence to diagnose it.
Avoid beginning with a dashboard wish list. Begin with the decision boundary from the serving path:
| Serving question | Event field | Why it matters |
|---|---|---|
| What answer route was chosen? | route | Shows serve, shorten, or abstain behavior |
| Did an unsupported claim reach the customer? | unsafe_claim_served | Direct safety invariant |
| Where did a request first go wrong? | first_failed_stage | Separates evidence, generation, and serving faults |
| Which record and prompt were used? | evidence_version, prompt_template_id | Makes a failure reproducible |
| Was the request usable and affordable? | token and timing fields | Measures experience and cost beside quality |
This lesson uses synthetic events from the same delivery-update path as the prior lab. The last two events model a regression; they aren't claims about real production traffic.
The first cell creates six requests. Four are safe outcomes from the strict gate: one supported answer, two blocked factual claims, and one abstention on a missing evidence feed. Two come from a regressed template that served an invented ETA.
```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TraceEvent:
    request_id: str
    case_id: str
    evidence_version: str | None
    prompt_template_id: str
    route: str
    first_failed_stage: str
    unsafe_claim_served: bool
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int
    ttft_ms: int
    total_ms: int

current_window = [
    TraceEvent("req_201", "clean_scan", "fastship@v17", "tracking-answerer-v1",
               "serve", "passed", False, 188, 80, 27, 172, 410),
    TraceEvent("req_202", "invented_eta", "fastship@v17", "tracking-answerer-v1",
               "abstain", "claim_generation", False, 196, 80, 20, 181, 395),
    TraceEvent("req_203", "wrong_status", "fastship@v17", "tracking-answerer-v1",
               "abstain", "claim_generation", False, 192, 80, 21, 188, 402),
    TraceEvent("req_204", "missing_feed", None, "tracking-answerer-v1",
               "abstain", "evidence_admission", False, 160, 64, 15, 205, 382),
    TraceEvent("req_205", "invented_eta", "fastship@v17", "tracking-answerer-v1.1-regression",
               "serve", "serving_gate", True, 198, 80, 29, 218, 430),
    TraceEvent("req_206", "invented_eta", "fastship@v17", "tracking-answerer-v1.1-regression",
               "serve", "serving_gate", True, 201, 80, 30, 240, 455),
]

print(f"events={len(current_window)}")
for event in current_window:
    verdict = "UNSAFE SERVE" if event.unsafe_claim_served else "safe route"
    print(f"{event.request_id}: {event.route:7} {verdict}")
```

```text
events=6
req_201: serve   safe route
req_202: abstain safe route
req_203: abstain safe route
req_204: abstain safe route
req_205: serve   UNSAFE SERVE
req_206: serve   UNSAFE SERVE
```

Notice the distinction: an abstention isn't automatically a failure. When evidence doesn't establish an ETA, abstention is the safe product behavior. The event you need to page on is an unsupported factual claim that was actually served.
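That distinction can be made explicit before any aggregation. The sketch below is one possible labeling, assuming only the route and unsafe_claim_served fields from the events above. The outcome names and the classify_outcome helper are illustrative and not part of the lab's code.

```python
from collections import Counter

def classify_outcome(route: str, unsafe_claim_served: bool) -> str:
    """Label one request; the label names are illustrative assumptions."""
    if unsafe_claim_served:
        return "unsafe_serve"      # the only outcome that should page
    if route == "abstain":
        return "safe_abstention"   # safe, but worth tracking for usefulness
    return "supported_serve"

# (route, unsafe_claim_served) pairs mirroring req_201 through req_206.
window = [
    ("serve", False),
    ("abstain", False),
    ("abstain", False),
    ("abstain", False),
    ("serve", True),
    ("serve", True),
]

outcomes = Counter(classify_outcome(route, unsafe) for route, unsafe in window)
for name in ("supported_serve", "safe_abstention", "unsafe_serve"):
    print(f"{name}={outcomes[name]}")
```

Checking the unsafe flag first means a served response with an unsupported claim can never be counted as a supported serve.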
An aggregate must preserve the invariant it claims to monitor. A generic "answer quality score" could hide two unsafe delivery promises inside many pleasant responses. For this answer path, start with metrics directly derived from claim-gate outcomes:
| Metric | Numerator / denominator | Interpretation |
|---|---|---|
| Unsafe-serve rate | Requests with any unsupported served claim / requests | Customer-facing factual safety failure |
| Safe-route rate | Requests without unsafe served claims / requests | Gate effectiveness, including justified abstention |
| Abstention rate | Abstained requests / requests | Product usefulness signal, not a safety failure by itself |
| Failure-stage count | Requests by first failed stage | Where engineers should investigate first |
The next cell aggregates the synthetic production window.
```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityWindow:
    requests: int
    unsafe_serves: int
    abstentions: int
    failure_stages: Counter

    @property
    def unsafe_serve_rate(self) -> float:
        return self.unsafe_serves / self.requests

    @property
    def safe_route_rate(self) -> float:
        return (self.requests - self.unsafe_serves) / self.requests

def quality_window(events: list[TraceEvent]) -> QualityWindow:
    return QualityWindow(
        requests=len(events),
        unsafe_serves=sum(event.unsafe_claim_served for event in events),
        abstentions=sum(event.route == "abstain" for event in events),
        failure_stages=Counter(event.first_failed_stage for event in events),
    )

quality = quality_window(current_window)
print(f"unsafe_serve_rate={quality.unsafe_serve_rate:.1%}")
print(f"safe_route_rate={quality.safe_route_rate:.1%}")
print(f"abstentions={quality.abstentions}")
print(f"serving_gate_failures={quality.failure_stages['serving_gate']}")

assert quality.unsafe_serves == 2
assert quality.failure_stages["serving_gate"] == 2
```

```text
unsafe_serve_rate=33.3%
safe_route_rate=66.7%
abstentions=3
serving_gate_failures=2
```

The count does more than say "quality is down." Both escaped claims first failed at serving_gate, while the earlier strict template safely abstained on the same invented-ETA case. That evidence points at a gate or template rollback before anyone tunes retrieval.
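One way to test that reading is to split the same counts by prompt template. The sketch below assumes only the request IDs, template IDs, and unsafe flags from the synthetic window. unsafe_by_template is a hypothetical helper, not part of the lab's code.

```python
from collections import defaultdict

# (request_id, prompt_template_id, unsafe_claim_served) from the synthetic window.
rows = [
    ("req_201", "tracking-answerer-v1", False),
    ("req_202", "tracking-answerer-v1", False),
    ("req_203", "tracking-answerer-v1", False),
    ("req_204", "tracking-answerer-v1", False),
    ("req_205", "tracking-answerer-v1.1-regression", True),
    ("req_206", "tracking-answerer-v1.1-regression", True),
]

def unsafe_by_template(rows):
    """Return {template_id: (unsafe_serves, requests)}."""
    totals = defaultdict(lambda: [0, 0])
    for _request_id, template_id, unsafe in rows:
        totals[template_id][0] += int(unsafe)
        totals[template_id][1] += 1
    return {template: tuple(counts) for template, counts in totals.items()}

breakdown = unsafe_by_template(rows)
for template, (unsafe, total) in sorted(breakdown.items()):
    print(f"{template}: unsafe={unsafe}/{total}")
```

With six requests the split is only suggestive, but it places every unsafe serve under the regressed template and none under the strict one.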
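Quality is the first invariant here, but customers also notice sluggish answers. For a streamed response, each trace event records two timings: time to first token (TTFT, the ttft_ms field) and total response time (total_ms).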
Be precise with token arithmetic. If TTFT marks arrival of the first token, only output_tokens - 1 tokens arrive in the time after TTFT. Counting all output tokens in that interval slightly overstates throughput.
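A small worked comparison shows the size of that overstatement. The numbers are those of req_206 in the synthetic window (30 output tokens, TTFT 240 ms, total 455 ms); the variable names are illustrative.

```python
output_tokens = 30
ttft_ms = 240
total_ms = 455

post_ttft_seconds = (total_ms - ttft_ms) / 1000  # 0.215 s after the first token

# Naive: credits the first token to an interval that starts after it arrived.
naive_tps = output_tokens / post_ttft_seconds
# Corrected: only the remaining tokens arrive in the post-TTFT interval.
corrected_tps = (output_tokens - 1) / post_ttft_seconds

print(f"naive_tps={naive_tps:.1f}")
print(f"corrected_tps={corrected_tps:.1f}")
print(f"overstatement={naive_tps / corrected_tps - 1:.1%}")
```

The relative gap is 1 / (output_tokens - 1), so it shrinks for long answers and grows for short ones.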
```python
import math

def nearest_rank_percentile(values: list[int], percentile: float) -> int:
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1]

def post_first_token_tps(event: TraceEvent) -> float:
    remaining_tokens = max(event.output_tokens - 1, 0)
    if remaining_tokens == 0:
        return 0.0
    remaining_ms = event.total_ms - event.ttft_ms
    if remaining_ms <= 0:
        raise ValueError("total_ms must exceed ttft_ms after the first token")
    return remaining_tokens / (remaining_ms / 1000)

p95_ttft_ms = nearest_rank_percentile([event.ttft_ms for event in current_window], 0.95)
example_tps = post_first_token_tps(current_window[-1])
malformed_timing = replace(current_window[-1], total_ms=240)

print(f"p95_ttft_ms={p95_ttft_ms}")
print(f"req_206_post_first_tps={example_tps:.1f}")
print(f"quality_safe={quality.unsafe_serves == 0}")
try:
    post_first_token_tps(malformed_timing)
except ValueError as error:
    print(f"invalid_timing={error}")
else:
    raise AssertionError("malformed timing event should fail")

assert p95_ttft_ms == 240
assert quality.unsafe_serves > 0
```

```text
p95_ttft_ms=240
req_206_post_first_tps=134.9
quality_safe=False
invalid_timing=total_ms must exceed ttft_ms after the first token
```

This window is fast and unsafe. A latency-only dashboard would celebrate the exact release that should be rolled back.
Don't hide malformed timing events behind a tiny fallback denominator. If a multi-token response reports total_ms <= ttft_ms, fix the instrumentation before trusting its throughput.
OpenTelemetry (OTel) publishes generative-AI semantic conventions for spans and metrics. Its current metrics distinguish client first-chunk latency (gen_ai.client.operation.time_to_first_chunk) from model-server first-token latency (gen_ai.server.time_to_first_token). This lab's ttft_ms is an application-owned customer-visible field, so document exactly where your timer starts and stops. The convention is marked Development; pin the convention emitted by your instrumentation instead of treating its field set as permanent.[1][2]
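To make "where your timer starts and stops" concrete, here is a minimal sketch of an application-owned timer wrapped around a token stream. It assumes the stream is any Python iterable of text chunks. timed_stream, the injected clock, and the fake stream are hypothetical and don't correspond to any OTel or provider API.

```python
import time
from typing import Callable, Iterable, Iterator

def timed_stream(
    chunks: Iterable[str],
    timings: dict,
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[str]:
    """Yield chunks unchanged while recording ttft_ms and total_ms.

    The timer starts when iteration begins (just before the first chunk is
    requested) and stops after the last chunk has been consumed. Both choices
    are assumptions to document for your own service.
    """
    start = clock()
    first_seen = False
    for chunk in chunks:
        if not first_seen:
            timings["ttft_ms"] = round((clock() - start) * 1000)
            first_seen = True
        yield chunk
    timings["total_ms"] = round((clock() - start) * 1000)

# Deterministic fake clock so the example gives the same values on every run.
ticks = iter([0.000, 0.240, 0.455])
timings: dict = {}
answer = "".join(
    timed_stream(["Your ", "parcel ", "was scanned."], timings, clock=lambda: next(ticks))
)

print(answer)
print(timings)
```

A timer started earlier, for example when the customer request arrives, would also include evidence lookup and queueing, which is why the chosen start point belongs in the field's documentation.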
OTel's generative-AI span attributes describe model operations and usage. The current inference-span convention requires both gen_ai.operation.name and gen_ai.provider.name; gen_ai.request.model is conditionally required when available. The synthetic service below uses a custom provider name because no real provider is involved. Use the convention's well-known provider value when one applies. Your application still needs custom attributes for its own invariant: which evidence version supported the answer, whether the answer route blocked a claim, and where the first failure occurred.[3]
Prompt, output-message, and system-instruction attributes are opt-in in the OTel convention because they may contain sensitive content. Store stable identifiers and safe outcome attributes by default; make raw text capture a deliberate, governed exception.[3]
```python
def trace_attributes(event: TraceEvent) -> dict[str, str | int | bool]:
    attributes: dict[str, str | int | bool] = {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": "shopflow.internal",
        "gen_ai.request.model": "support-model-prod",
        "gen_ai.usage.input_tokens": event.input_tokens,
        "gen_ai.usage.output_tokens": event.output_tokens,
        "shopflow.answer.route": event.route,
        "shopflow.answer.unsafe_claim_served": event.unsafe_claim_served,
        "shopflow.failure.stage": event.first_failed_stage,
        "shopflow.prompt.template_id": event.prompt_template_id,
    }
    if event.evidence_version is not None:
        attributes["shopflow.evidence.version"] = event.evidence_version
    return attributes

unsafe_attributes = trace_attributes(current_window[-1])
print(f"operation={unsafe_attributes['gen_ai.operation.name']}")
print(f"template={unsafe_attributes['shopflow.prompt.template_id']}")
print(f"failed_stage={unsafe_attributes['shopflow.failure.stage']}")
print(f"raw_messages_logged={'gen_ai.input.messages' in unsafe_attributes}")

assert unsafe_attributes["shopflow.answer.unsafe_claim_served"] is True
assert unsafe_attributes["gen_ai.provider.name"] == "shopflow.internal"
assert "gen_ai.input.messages" not in unsafe_attributes
assert "gen_ai.output.messages" not in unsafe_attributes
assert "gen_ai.system_instructions" not in unsafe_attributes
```

```text
operation=chat
template=tracking-answerer-v1.1-regression
failed_stage=serving_gate
raw_messages_logged=False
```

gen_ai.* attributes let tracing backends recognize the model call. shopflow.* attributes explain the business decision. Neither requires putting a customer's full message into a durable trace.
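If raw content capture is ever enabled, one option is to make it an explicit, off-by-default switch. The sketch below assumes a hypothetical capture_content flag, approval_ticket argument, and shopflow.capture.approval attribute. Only the gen_ai.input.messages attribute name comes from the convention discussed above, and its value is simplified to a plain string here.

```python
def with_optional_content(
    attributes: dict,
    input_messages: str,
    capture_content: bool = False,
    approval_ticket: str = "",
) -> dict:
    """Return a copy of attributes, adding raw input only under an approved opt-in."""
    enriched = dict(attributes)
    if not capture_content:
        return enriched
    if not approval_ticket:
        raise ValueError("raw content capture requires an approval ticket")
    enriched["gen_ai.input.messages"] = input_messages
    enriched["shopflow.capture.approval"] = approval_ticket  # hypothetical attribute
    return enriched

base = {"gen_ai.operation.name": "chat", "shopflow.answer.route": "serve"}
default_attrs = with_optional_content(base, "Where is my order?")
approved_attrs = with_optional_content(
    base, "Where is my order?", capture_content=True, approval_ticket="INC-EXAMPLE"
)

print(f"default_has_raw={'gen_ai.input.messages' in default_attrs}")
print(f"approved_has_raw={'gen_ai.input.messages' in approved_attrs}")
```

Requiring a ticket alongside the flag leaves an audit trail on the trace itself, so a reviewer can tell why raw text was stored.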
Cost is part of monitoring, but embedding current provider price claims in application logic or a tutorial makes both go stale quickly. Production systems should store usage counts on the trace and evaluate them with a versioned rate card. The rates below are synthetic fixture values, chosen only to exercise the calculation.
```python
@dataclass(frozen=True)
class RateCard:
    version: str
    input_per_million: float
    cached_input_per_million: float
    output_per_million: float

def estimate_cost(event: TraceEvent, card: RateCard) -> float:
    uncached_input = max(event.input_tokens - event.cached_input_tokens, 0)
    return (
        uncached_input * card.input_per_million
        + event.cached_input_tokens * card.cached_input_per_million
        + event.output_tokens * card.output_per_million
    ) / 1_000_000

card = RateCard(
    version="internal-fixture-2026-05-27",
    input_per_million=1.00,
    cached_input_per_million=0.10,
    output_per_million=5.00,
)
cost_by_request = {
    event.request_id: estimate_cost(event, card)
    for event in current_window
}
total_cost = sum(cost_by_request.values())

print(f"rate_card={card.version}")
print(f"window_cost_usd={total_cost:.6f}")
print(f"req_206_cost_usd={cost_by_request['req_206']:.6f}")

assert card.version.startswith("internal-fixture")
assert total_cost > 0
```

```text
rate_card=internal-fixture-2026-05-27
window_cost_usd=0.001427
req_206_cost_usd=0.000279
```

If a provider later changes cached-token pricing or adds a service-tier surcharge, old traces remain interpretable: usage was recorded once, while the rate-card version states how the estimate was computed.
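As a rough illustration of that property, the sketch below prices the same recorded usage under two rate-card versions. The second version, its cached-token rate, and the Usage and priced names are invented for this example. None of the rates are real provider prices.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Usage:
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int

# Synthetic per-million-token rates keyed by rate-card version (not real prices).
RATE_CARDS = {
    "internal-fixture-2026-05-27": {"input": 1.00, "cached": 0.10, "output": 5.00},
    "internal-fixture-hypothetical-v2": {"input": 1.00, "cached": 0.25, "output": 5.00},
}

def priced(usage: Usage, version: str) -> float:
    """Estimate cost in USD for one request under the named rate card."""
    rates = RATE_CARDS[version]
    uncached = max(usage.input_tokens - usage.cached_input_tokens, 0)
    return (
        uncached * rates["input"]
        + usage.cached_input_tokens * rates["cached"]
        + usage.output_tokens * rates["output"]
    ) / 1_000_000

# Token counts recorded once on the req_206 trace.
req_206 = Usage(input_tokens=201, cached_input_tokens=80, output_tokens=30)
for version in RATE_CARDS:
    print(f"{version}: {priced(req_206, version):.6f}")
```

The trace never changes. Only the lookup key does, which is what lets a historical estimate be reproduced or deliberately recomputed.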
A trace may contain an order identifier, email address, or full customer request. That text can help an incident investigation, but it shouldn't be your default long-lived record.
For this workflow, durable logs can retain:
| Retain by default | Avoid by default |
|---|---|
| Request ID, evidence version, prompt template ID | Full customer message |
| Route, failure stage, claim-verdict counts | Full generated answer |
| Token counts, latency, rate-card version | Unredacted order or contact details |
| Redacted preview when needed | Raw retrieved documents |
The record below keeps a redacted preview for an incident exemplar while leaving raw payload storage unset.
```python
import re

@dataclass(frozen=True)
class DurableRecord:
    request_id: str
    prompt_template_id: str
    evidence_version: str | None
    route: str
    failed_stage: str
    redacted_preview: str
    raw_payload_ref: str | None
    scrub_policy: str

def redact_customer_text(text: str) -> str:
    text = re.sub(r"#[A-Z0-9]+", "[ORDER_ID]", text)
    return re.sub(r"\b[\w.+-]+@[\w.-]+\.\w+\b", "[EMAIL]", text)

def durable_record(event: TraceEvent, customer_text: str) -> DurableRecord:
    return DurableRecord(
        request_id=event.request_id,
        prompt_template_id=event.prompt_template_id,
        evidence_version=event.evidence_version,
        route=event.route,
        failed_stage=event.first_failed_stage,
        redacted_preview=redact_customer_text(customer_text),
        raw_payload_ref=None,
        scrub_policy="delivery-pii-v1",
    )

record = durable_record(
    current_window[-1],
    "Email me ETA for order #A10234 at customer@example.com",
)
print(record.redacted_preview)
print(f"raw_payload_stored={record.raw_payload_ref is not None}")
print(f"scrub_policy={record.scrub_policy}")

assert "#A10234" not in record.redacted_preview
assert "customer@example.com" not in record.redacted_preview
```

```text
Email me ETA for order [ORDER_ID] at [EMAIL]
raw_payload_stored=False
scrub_policy=delivery-pii-v1
```

The regex is intentionally narrow: it makes the fixture readable, but it isn't a production scrubber. In production, minimize collected attributes, review what instrumentation libraries emit, and apply centrally managed Collector processors or an equivalent scrub pipeline before durable storage.[4]
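Attribute minimization can be sketched as an allowlist applied before anything is written. The example below is an in-process stand-in for that idea, not Collector configuration. The allowlist contents, the minimize_attributes helper, and the sample attribute values are assumptions for illustration.

```python
# Keys and prefixes that may reach durable storage; everything else is dropped.
ALLOWED_PREFIXES = ("gen_ai.usage.", "shopflow.")
ALLOWED_KEYS = {"gen_ai.operation.name", "gen_ai.provider.name", "gen_ai.request.model"}

def minimize_attributes(attributes: dict) -> dict:
    """Keep only allowlisted keys; unrecognized keys are dropped, not stored."""
    return {
        key: value
        for key, value in attributes.items()
        if key in ALLOWED_KEYS or key.startswith(ALLOWED_PREFIXES)
    }

# What an instrumentation library might emit, including fields nobody asked for.
emitted = {
    "gen_ai.operation.name": "chat",
    "gen_ai.usage.output_tokens": 30,
    "shopflow.answer.route": "serve",
    "gen_ai.input.messages": "Email me ETA for order #A10234",
    "http.request.header.cookie": "session=example",
}
stored = minimize_attributes(emitted)
print(sorted(stored))
```

An allowlist fails closed: an attribute added by a library upgrade stays out of durable storage until someone reviews it, whereas a denylist would let it through.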
Not every metric deserves a page. A long answer or higher cost might warrant investigation. An unsupported delivery promise that escaped a hard gate violates the product invariant and should page immediately.
Thresholds must come from a service contract, not a generic article. This lab's synthetic policy says:
```python
@dataclass(frozen=True)
class MonitorPolicy:
    policy_id: str
    max_unsafe_serves: int
    max_p95_ttft_ms: int
    max_window_cost_usd: float

def evaluate_alerts(
    events: list[TraceEvent],
    quality: QualityWindow,
    cost_usd: float,
    policy: MonitorPolicy,
) -> list[str]:
    findings: list[str] = []
    # Compute latency from the events passed in rather than relying on a global.
    p95_ttft_ms = nearest_rank_percentile([event.ttft_ms for event in events], 0.95)
    if quality.unsafe_serves > policy.max_unsafe_serves:
        findings.append(
            f"PAGE unsafe_claim_served={quality.unsafe_serves}/{quality.requests}"
        )
    if p95_ttft_ms > policy.max_p95_ttft_ms:
        findings.append(f"WARN p95_ttft_ms={p95_ttft_ms}")
    if cost_usd > policy.max_window_cost_usd:
        findings.append(f"WARN window_cost_usd={cost_usd:.6f}")
    return findings

policy = MonitorPolicy(
    policy_id="delivery-grounding-slo-v1",
    max_unsafe_serves=0,
    max_p95_ttft_ms=500,
    max_window_cost_usd=0.002,
)
findings = evaluate_alerts(current_window, quality, total_cost, policy)

for finding in findings:
    print(finding)
print(f"latency_warning={any('ttft' in item for item in findings)}")
print(f"cost_warning={any('cost' in item for item in findings)}")

assert findings == ["PAGE unsafe_claim_served=2/6"]
```

```text
PAGE unsafe_claim_served=2/6
latency_warning=False
cost_warning=False
```

The page is actionable because it says which invariant broke. It doesn't page merely because a judge score drifted or because latency was a little noisy.
Six synthetic requests are enough to demonstrate alert logic, not to set a production latency or cost budget. In a real launch, define those warning thresholds from representative traffic and revisit them as workload shape changes. The zero-tolerance safety invariant is different: a known unsupported delivery promise must not be served.
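One possible way to derive a latency warning threshold from representative traffic is to take a baseline percentile and add headroom. The sketch below assumes a nearest-rank percentile, a 25% headroom factor, and a minimum of 100 samples. All three are illustrative choices, and the baseline values are synthetic.

```python
import math

def nearest_rank(values, percentile):
    """Nearest-rank percentile over a list of integer samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1]

def warning_threshold(baseline_ttft_ms, percentile=0.95, headroom=1.25):
    """Baseline percentile times a headroom factor; both knobs are assumptions."""
    if len(baseline_ttft_ms) < 100:
        raise ValueError("baseline too small to set a latency budget")
    return math.ceil(nearest_rank(baseline_ttft_ms, percentile) * headroom)

# Synthetic stand-in for representative traffic: 200 samples from 150 to 349 ms.
baseline = list(range(150, 350))
print(f"baseline_p95_ms={nearest_rank(baseline, 0.95)}")
print(f"warn_above_ms={warning_threshold(baseline)}")
```

Rejecting a small baseline keeps a six-request window like the one in this lab from silently becoming a production budget.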
An on-call engineer shouldn't begin by searching arbitrary logs. For an invariant page, attach one violating trace with its prompt template and source version, then state the immediate check.
```python
@dataclass(frozen=True)
class IncidentCard:
    policy_id: str
    summary: str
    exemplar_request_id: str
    template_id: str
    evidence_version: str | None
    first_failed_stage: str
    first_check: str

def incident_card(events: list[TraceEvent], policy: MonitorPolicy) -> IncidentCard:
    exemplar = next(event for event in events if event.unsafe_claim_served)
    return IncidentCard(
        policy_id=policy.policy_id,
        summary="Unsupported delivery claim reached customer response.",
        exemplar_request_id=exemplar.request_id,
        template_id=exemplar.prompt_template_id,
        evidence_version=exemplar.evidence_version,
        first_failed_stage=exemplar.first_failed_stage,
        first_check="Compare serving-gate template with tracking-answerer-v1.",
    )

card_view = incident_card(current_window, policy)
print(f"exemplar={card_view.exemplar_request_id}")
print(f"template={card_view.template_id}")
print(f"stage={card_view.first_failed_stage}")
print(f"first_check={card_view.first_check}")

assert card_view.first_failed_stage == "serving_gate"
```

```text
exemplar=req_205
template=tracking-answerer-v1.1-regression
stage=serving_gate
first_check=Compare serving-gate template with tracking-answerer-v1.
```

The request still needs deeper review, but investigation is now narrow: compare the template or gate configuration that served the claim with the strict version that abstained on the same failure case.
The central example is a factual answer gate. If the product later allows an agent to create return labels or message a carrier, the same discipline applies to tool actions. Log state transitions you control: tool name, redacted parameter hash, result status, progress marker, retry count, and stop reason. Don't depend on hidden reasoning text.
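A redacted parameter hash can be produced in several ways. The sketch below shows one, assuming a keyed HMAC-SHA256 over canonical JSON. The params_hash helper, the fixture key, the sample parameters, and the 16-character truncation are all illustrative assumptions.

```python
import hashlib
import hmac
import json

def params_hash(params: dict, key: bytes) -> str:
    """Keyed hash of canonicalized tool parameters.

    Sorting keys makes equal parameters hash equally. The HMAC key makes
    low-entropy values such as order IDs harder to recover by brute force
    for anyone who doesn't hold the key.
    """
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readable logs; an illustrative choice

key = b"fixture-key-not-a-secret"
first = params_hash({"order_id": "A10234", "carrier": "FastShip"}, key)
second = params_hash({"carrier": "FastShip", "order_id": "A10234"}, key)
other = params_hash({"order_id": "A10235", "carrier": "FastShip"}, key)

print(f"same_params_same_hash={first == second}")
print(f"different_params_differ={first != other}")
```

Because equal parameters produce equal hashes, repeated calls can be detected from the log without the log ever containing the order ID.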
```python
@dataclass(frozen=True)
class ToolStep:
    tool_name: str
    params_hash: str
    progress_marker: str

def stalled_repeat(steps: list[ToolStep]) -> bool:
    if len(steps) < 2:
        return False
    last, previous = steps[-1], steps[-2]
    return (
        last.tool_name == previous.tool_name
        and last.params_hash == previous.params_hash
        and last.progress_marker == previous.progress_marker
    )

steps = [
    ToolStep("lookup_tracking", "order-redacted:v1", "scan:v17"),
    ToolStep("lookup_tracking", "order-redacted:v1", "scan:v17"),
]

print(f"tool={steps[-1].tool_name}")
print(f"stalled_repeat={stalled_repeat(steps)}")
print("route=stop_and_review" if stalled_repeat(steps) else "route=continue")

assert stalled_repeat(steps) is True
```

```text
tool=lookup_tracking
stalled_repeat=True
route=stop_and_review
```

An alert tells you something is wrong now. The next engineering question is whether a proposed fix is better. That requires an experiment record linking the failing window, proposed template, policy, and rerun results.
```python
@dataclass(frozen=True)
class ExperimentHandoff:
    incident_policy_id: str
    failing_template_id: str
    exemplar_request_id: str
    evidence_version: str | None
    candidate_template_id: str
    required_metric: str
    promotion_status: str

handoff = ExperimentHandoff(
    incident_policy_id=card_view.policy_id,
    failing_template_id=card_view.template_id,
    exemplar_request_id=card_view.exemplar_request_id,
    evidence_version=card_view.evidence_version,
    candidate_template_id="tracking-answerer-v1.2-fix",
    required_metric="unsafe_claim_served == 0 on regression and holdout windows",
    promotion_status="BLOCKED_PENDING_EVALUATION",
)

print(f"candidate={handoff.candidate_template_id}")
print(f"required_metric={handoff.required_metric}")
print(f"promotion={handoff.promotion_status}")

assert handoff.promotion_status == "BLOCKED_PENDING_EVALUATION"
```

```text
candidate=tracking-answerer-v1.2-fix
required_metric=unsafe_claim_served == 0 on regression and holdout windows
promotion=BLOCKED_PENDING_EVALUATION
```

The monitoring system doesn't approve the proposed fix. It creates a precise starting point for the next controlled run: what broke, which candidate claims to fix it, and which metric must pass.
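How the required metric might later be checked can be sketched as a small decision function. It assumes the evaluation run reports unsafe-serve counts per named window. The promotion_decision helper and the REJECTED and ELIGIBLE_FOR_REVIEW statuses are hypothetical; only BLOCKED_PENDING_EVALUATION appears in the handoff record above.

```python
def promotion_decision(unsafe_serves_by_window: dict) -> str:
    """Allow review only if both required windows ran with zero unsafe serves."""
    required = ("regression", "holdout")
    missing = [name for name in required if name not in unsafe_serves_by_window]
    if missing:
        # No result yet for at least one required window: stay blocked.
        return "BLOCKED_PENDING_EVALUATION"
    if any(unsafe_serves_by_window[name] > 0 for name in required):
        return "REJECTED"
    return "ELIGIBLE_FOR_REVIEW"

print(promotion_decision({}))
print(promotion_decision({"regression": 0}))
print(promotion_decision({"regression": 0, "holdout": 1}))
print(promotion_decision({"regression": 0, "holdout": 0}))
```

Treating a missing window as blocked, not as passing, keeps a partially run evaluation from promoting the candidate by omission.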
Symptom: Dashboards stay green while customers receive invented ETAs. Cause: Monitoring captures HTTP status and latency but not claim-gate outcomes. Fix: Emit and aggregate the unsafe-served-claim invariant beside performance metrics.
Symptom: A strict gate produces noisy incidents whenever evidence is missing. Cause: Safe withholding and unsafe serving were collapsed into one "bad answer" metric. Fix: Separate unsafe-serve, abstention, and supported-serve rates.
Symptom: Generation throughput looks slightly better than it was. Cause: The first token is counted in an interval that begins after first-token arrival. Fix: Divide remaining output tokens by post-TTFT duration, or define a different measured interval explicitly. Reject malformed timing events instead of masking them with a fallback denominator.
Symptom: Debugging records retain order identifiers and email addresses unnecessarily. Cause: Raw prompts were treated as ordinary metadata. Fix: Store redacted previews and stable identifiers by default; govern any raw-payload access separately.
Symptom: A page says "quality decreased" and on-call searches broad logs manually. Cause: The alert has no exemplar trace, template version, evidence version, or first failed stage. Fix: Attach a violating trace and the first runbook check to every invariant alert.