LLM Observability & Monitoring

Turn claim-level answer traces into production metrics, actionable alerts, privacy-safe debugging records, and reproducible incident evidence.

In the prior lab, deploy-answerer-v1, a large language model (LLM) answer path, refused to promise a deploy approval that wasn't present in a ReleaseOps record. It also emitted a trace: evidence version, route, and first failed stage. One safe trace proves one request behaved correctly. It doesn't tell you whether the service is still safe across live traffic.

Suppose a prompt rollback silently bypasses that gate for two operators. The endpoint still returns 200. Answers remain fluent. Latency stays fast. Only the trace events reveal that unsupported deploy approvals escaped into served messages.

Monitoring reduces many trace events into rates, percentiles, and alerts. Observability lets you start from an alert and reconstruct why a particular request failed. An LLM application needs both: aggregates to detect a change and request evidence to diagnose it.

Six-request production window where four requests route safely, requests 205 and 206 serve unsupported deploy approvals, TTFT stays under the 500 millisecond warning line, and the alert panel links exemplar request 205 to a claim-generation regression that escaped the serving gate. — Latency stays healthy while claim safety regresses, so the alert has to point straight to request 205 and its trace.

Start with the event your gate already produced

Avoid beginning with a dashboard wish list. Begin with the decision boundary from the serving path:

Serving question	Event field	Why it matters
What answer route was chosen?	`route`	Shows `serve`, `shorten`, `review`, or `abstain` behavior
Did an unsupported claim reach the operator?	`unsafe_claim_served`	Direct safety invariant
Where did a request first go wrong?	`first_failed_stage`	Separates evidence, generation, and serving faults
Which protection boundary leaked a known failure?	`escape_stage`	Distinguishes root cause from failed containment
Which record and prompt were used?	`evidence_version`, `prompt_template_id`	Makes a failure reproducible
Was the request usable and affordable?	token and timing fields	Measures experience and cost beside quality

The examples use synthetic events from the same deploy-policy answer path as the prior lab. The last two events model a regression; they aren't claims about real production traffic.

The first cell creates six requests. Four are safe outcomes from a strict gate: a supported answer or a blocked factual claim. Two come from a regressed template that served an invented deploy approval.

trace-events.py

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TraceEvent:
    request_id: str
    case_id: str
    evidence_version: str | None
    prompt_template_id: str
    route: str
    first_failed_stage: str
    escape_stage: str | None
    unsafe_claim_served: bool
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int
    ttft_ms: int
    total_ms: int

current_window = [
    TraceEvent("req_201", "clean_policy", "deploy-policy@v17", "deploy-answerer-v1",
               "serve", "passed", None, False, 188, 80, 27, 172, 410),
    TraceEvent("req_202", "invented_approval", "deploy-policy@v17", "deploy-answerer-v1",
               "abstain", "claim_generation", None, False, 196, 80, 20, 181, 395),
    TraceEvent("req_203", "wrong_freeze_status", "deploy-policy@v17", "deploy-answerer-v1",
               "abstain", "claim_generation", None, False, 192, 80, 21, 188, 402),
    TraceEvent("req_204", "missing_policy", None, "deploy-answerer-v1",
               "abstain", "evidence_admission", None, False, 160, 64, 15, 205, 382),
    TraceEvent("req_205", "invented_approval", "deploy-policy@v17", "deploy-answerer-v1.1-regression",
               "serve", "claim_generation", "serving_gate", True, 198, 80, 29, 218, 430),
    TraceEvent("req_206", "invented_approval", "deploy-policy@v17", "deploy-answerer-v1.1-regression",
               "serve", "claim_generation", "serving_gate", True, 201, 80, 30, 240, 455),
]

print(f"events={len(current_window)}")
for event in current_window:
    verdict = "UNSAFE SERVE" if event.unsafe_claim_served else "safe route"
    print(f"{event.request_id}: {event.route:7} {verdict}")

Output

events=6
req_201: serve   safe route
req_202: abstain safe route
req_203: abstain safe route
req_204: abstain safe route
req_205: serve   UNSAFE SERVE
req_206: serve   UNSAFE SERVE

Notice the distinction: an abstention isn't automatically a failure. When evidence doesn't establish a deploy approval, abstention is the safe product behavior. The event you need to page on is an unsupported factual claim that was served.

Turn request evidence into quality metrics

An aggregate must preserve the invariant it claims to monitor. A generic "answer quality score" could hide two unsafe deploy claims inside many pleasant responses. For this answer path, start with metrics directly derived from claim-gate outcomes:

Metric	Numerator / denominator	Interpretation
Unsafe-serve rate	Requests with any unsupported served claim / requests	Operator-facing factual safety failure
Safe-route rate	Requests without unsafe served claims / requests	Gate effectiveness, including justified abstention
Abstention rate	Abstained requests / requests	Product usefulness signal, not a safety failure by itself
Failure-stage count	Requests by first failed stage	Where engineers should investigate first
Escape-stage count	Unsafe requests by leaked protection boundary	Where containment failed after the first defect

The next cell aggregates the synthetic production window.

quality-window.py

from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityWindow:
    requests: int
    unsafe_serves: int
    abstentions: int
    failure_stages: Counter
    escape_stages: Counter

    @property
    def unsafe_serve_rate(self) -> float:
        return self.unsafe_serves / self.requests

    @property
    def safe_route_rate(self) -> float:
        return (self.requests - self.unsafe_serves) / self.requests

def quality_window(events: list[TraceEvent]) -> QualityWindow:
    return QualityWindow(
        requests=len(events),
        unsafe_serves=sum(event.unsafe_claim_served for event in events),
        abstentions=sum(event.route == "abstain" for event in events),
        failure_stages=Counter(event.first_failed_stage for event in events),
        escape_stages=Counter(
            event.escape_stage for event in events if event.escape_stage is not None
        ),
    )

quality = quality_window(current_window)
print(f"unsafe_serve_rate={quality.unsafe_serve_rate:.1%}")
print(f"safe_route_rate={quality.safe_route_rate:.1%}")
print(f"abstentions={quality.abstentions}")
print(f"claim_generation_failures={quality.failure_stages['claim_generation']}")
print(f"serving_gate_escapes={quality.escape_stages['serving_gate']}")

assert quality.unsafe_serves == 2
assert quality.failure_stages["claim_generation"] == 4
assert quality.escape_stages["serving_gate"] == 2

Output

unsafe_serve_rate=33.3%
safe_route_rate=66.7%
abstentions=3
claim_generation_failures=4
serving_gate_escapes=2

The counts do more than say "quality is down." All four unsupported drafts first failed at claim_generation. The strict template safely withheld two; the regressed template let two escape through serving_gate. That evidence points at a gate or template rollback before anyone tunes retrieval.
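One way to make that split visible is to break the unsafe-serve count down by prompt template. The sketch below is illustrative only: it uses trimmed-down tuples instead of the full TraceEvent, and the helper name unsafe_rate_by_template is an assumption, not part of the lab code.

```python
from collections import defaultdict

# Hypothetical, trimmed-down rows: (request_id, prompt_template_id, unsafe_claim_served).
# The values mirror the synthetic six-request window used in this lesson.
rows = [
    ("req_201", "deploy-answerer-v1", False),
    ("req_202", "deploy-answerer-v1", False),
    ("req_203", "deploy-answerer-v1", False),
    ("req_204", "deploy-answerer-v1", False),
    ("req_205", "deploy-answerer-v1.1-regression", True),
    ("req_206", "deploy-answerer-v1.1-regression", True),
]

def unsafe_rate_by_template(rows):
    """Return {template_id: (unsafe_serves, requests, unsafe_rate)}."""
    totals = defaultdict(lambda: [0, 0])  # template -> [unsafe, requests]
    for _request_id, template_id, unsafe in rows:
        totals[template_id][0] += int(unsafe)
        totals[template_id][1] += 1
    return {
        template: (unsafe, requests, unsafe / requests)
        for template, (unsafe, requests) in totals.items()
    }

by_template = unsafe_rate_by_template(rows)
for template, (unsafe, requests, rate) in sorted(by_template.items()):
    print(f"{template}: {unsafe}/{requests} unsafe ({rate:.0%})")
```

In this fixture every unsafe serve belongs to the regressed template, which is the pattern that would justify checking the template or gate before anything else. Real traffic is rarely this clean, so treat a per-template split as a triage hint rather than proof.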

Observability triage view where safety fails with 2 of 6 unsafe serves while latency and cost stay within limits, and the same claim-generation defect splits into two contained requests under the strict gate and two escaped requests under version 1.1. — Safety, latency, and cost answer different policy questions. The containment split shows why the same generation defect stayed safe under the strict gate but escaped under version 1.1.

Measure responsiveness without hiding correctness

Quality is the first invariant here, but operators also notice sluggish answers. For a streamed response:

Time to first token (TTFT) is the delay before the operator sees output.
Post-first-token throughput measures how quickly the remaining output arrives.
End-to-end latency includes the entire response.

Be precise with token arithmetic. If TTFT marks arrival of the first token, only output_tokens - 1 tokens arrive in the time after TTFT. Counting all output tokens in that interval slightly overstates throughput.

The fixture also reports p95 TTFT: 95% of measured requests begin output at or below that first-token delay. With only six requests, nearest-rank p95 is the slowest observation. This teaches the calculation, not a production budget.

streaming-latency.py

import math

def nearest_rank_percentile(values: list[int], percentile: float) -> int:
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1]

def post_first_token_tps(event: TraceEvent) -> float:
    remaining_tokens = max(event.output_tokens - 1, 0)
    if remaining_tokens == 0:
        return 0.0
    remaining_ms = event.total_ms - event.ttft_ms
    if remaining_ms <= 0:
        raise ValueError("total_ms must exceed ttft_ms after the first token")
    return remaining_tokens / (remaining_ms / 1000)

p95_ttft_ms = nearest_rank_percentile([event.ttft_ms for event in current_window], 0.95)
example_tps = post_first_token_tps(current_window[-1])
malformed_timing = replace(current_window[-1], total_ms=240)

print(f"p95_ttft_ms={p95_ttft_ms}")
print(f"req_206_post_first_tps={example_tps:.1f}")
print(f"quality_safe={quality.unsafe_serves == 0}")
try:
    post_first_token_tps(malformed_timing)
except ValueError as error:
    print(f"invalid_timing={error}")
else:
    raise AssertionError("malformed timing event should fail")

assert p95_ttft_ms == 240
assert quality.unsafe_serves > 0

Output

p95_ttft_ms=240
req_206_post_first_tps=134.9
quality_safe=False
invalid_timing=total_ms must exceed ttft_ms after the first token

This window is fast and unsafe. A latency-only dashboard would celebrate the exact release that should be rolled back.

Don't hide malformed timing events behind a tiny fallback denominator. If a multi-token response reports total_ms <= ttft_ms, fix the instrumentation before trusting its throughput.
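To see how large the off-by-one effect is, the short sketch below recomputes req_206 from the synthetic fixture both ways. The variable names are illustrative, and the numbers describe this fixture only.

```python
# req_206 from the synthetic window: 30 output tokens, TTFT 240 ms, total 455 ms.
output_tokens, ttft_ms, total_ms = 30, 240, 455

# Time that elapses after the first token has already arrived.
post_ttft_seconds = (total_ms - ttft_ms) / 1000

# Only the remaining tokens arrive inside that interval.
correct_tps = (output_tokens - 1) / post_ttft_seconds

# Counting the first token again inflates the rate.
overstated_tps = output_tokens / post_ttft_seconds

print(f"correct_tps={correct_tps:.1f}")
print(f"overstated_tps={overstated_tps:.1f}")
```

For a 30-token answer the gap is a few percent. For very short answers it grows quickly, because one token is a larger share of the output.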

OpenTelemetry (OTel) semantic conventions 1.41.0 distinguish client first-chunk latency (gen_ai.client.operation.time_to_first_chunk) from model-server first-token latency (gen_ai.server.time_to_first_token). This lab's ttft_ms is an application-owned operator-visible field, so document exactly where your timer starts and stops. The GenAI convention is still marked Development; pin the convention emitted by your instrumentation instead of treating its field set as permanent. [1: Semantic conventions for generative AI systems, https://opentelemetry.io/docs/specs/semconv/gen-ai/] [2: OpenTelemetry GenAI metrics, v1.41.0, https://github.com/open-telemetry/semantic-conventions/blob/v1.41.0/docs/gen-ai/gen-ai-metrics.md]

Parent-child span hierarchy for agent tracing

A simple request-response model is a flat timeline. But a tool-using agent, RAG pipeline, or multi-step assistant behaves like a tree: a root user operation spawns retrieval queries, multiple LLM thinking loops, and external tool execution spans.

To reconstruct what happened, OpenTelemetry uses parent-child span relationships. The root span represents the user-facing request, while child spans capture nested operations such as retrieval, model generation, and tool execution. [3: OpenTelemetry Specification, https://opentelemetry.io/docs/specs/otel]

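The example below creates these spans with the OpenTelemetry Python API and SDK. Nesting start_as_current_span calls makes each inner span a child of the span that is current when it starts, so no explicit linking code is needed: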

otel-spans.py

import time
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

provider = TracerProvider()
exporter = InMemorySpanExporter()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-service")

def execute_agent_loop(query: str) -> None:
    with tracer.start_as_current_span("agent.request") as root_span:
        root_span.set_attribute("app.query", query)

        with tracer.start_as_current_span("retrieval") as retrieval_span:
            retrieval_span.set_attribute("db.system", "chroma")
            time.sleep(0.01)
            docs_retrieved = 3
            retrieval_span.set_attribute("retrieval.docs_count", docs_retrieved)

        with tracer.start_as_current_span("gen_ai.call") as llm_span:
            llm_span.set_attribute("gen_ai.operation.name", "chat")
            llm_span.set_attribute("gen_ai.provider.name", "platform.internal")
            llm_span.set_attribute("gen_ai.request.model", "deploy-assistant-prod")
            llm_span.set_attribute("gen_ai.usage.input_tokens", 150)
            llm_span.set_attribute("gen_ai.usage.output_tokens", 45)
            time.sleep(0.02)

execute_agent_loop("Check deploy safety for retry change")

spans = exporter.get_finished_spans()
span_names = {span.context.span_id: span.name for span in spans}
parent_names = {
    span.name: span_names.get(span.parent.span_id) if span.parent else None
    for span in spans
}
print(f"Captured spans: {len(spans)}")
for span in sorted(spans, key=lambda s: s.start_time):
    print(f"Span: {span.name} | Parent: {parent_names[span.name]}")

assert len(spans) == 3
assert parent_names == {
    "agent.request": None,
    "retrieval": "agent.request",
    "gen_ai.call": "agent.request",
}

Output

Captured spans: 3
Span: agent.request | Parent: None
Span: retrieval | Parent: agent.request
Span: gen_ai.call | Parent: agent.request
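Once spans are exported, the same parent IDs can be used to rebuild the tree for display or debugging. The sketch below is a minimal illustration that works on plain tuples rather than OpenTelemetry objects; the row layout, the span IDs, and the extra tool span are hypothetical.

```python
from collections import defaultdict

# Hypothetical exported span rows: (span_id, parent_span_id, name).
rows = [
    ("a1", None, "agent.request"),
    ("b2", "a1", "retrieval"),
    ("c3", "a1", "gen_ai.call"),
    ("d4", "c3", "tool.lookup_deploy_status"),
]

def render_tree(rows):
    """Return indented span names, depth-first from the root span(s)."""
    children = defaultdict(list)
    for span_id, parent_id, name in rows:
        children[parent_id].append((span_id, name))

    lines = []

    def walk(parent_id, depth):
        for span_id, name in children[parent_id]:
            lines.append("  " * depth + name)
            walk(span_id, depth + 1)

    walk(None, 0)  # Root spans have no parent.
    return lines

tree = render_tree(rows)
print("\n".join(tree))
```

This assumes every parent appears in the same batch. In practice, spans can arrive late or be sampled out, so a production reconstruction would also need to handle orphaned children.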

Keep standard telemetry and product decisions together

In semantic conventions 1.41.0, the inference-span attributes require both gen_ai.operation.name and gen_ai.provider.name; gen_ai.request.model is conditionally required when available. The synthetic service below uses a custom provider name because no real provider is involved. Use the convention's well-known provider value when one applies. Your application still needs custom attributes for its own invariant: which evidence version supported the answer, whether the answer route blocked a claim, where the first failure occurred, and which protection boundary leaked it. [4: OpenTelemetry GenAI span definitions, v1.41.0, https://github.com/open-telemetry/semantic-conventions/blob/v1.41.0/model/gen-ai/spans.yaml]

Prompt, output-message, and system-instruction attributes are opt-in in the OTel convention because they may contain sensitive content. Store stable identifiers and safe outcome attributes by default; make raw text capture a deliberate, governed exception. [4: OpenTelemetry GenAI span definitions, v1.41.0, https://github.com/open-telemetry/semantic-conventions/blob/v1.41.0/model/gen-ai/spans.yaml]

trace-attributes.py

def trace_attributes(event: TraceEvent) -> dict[str, str | int | bool]:
    attributes: dict[str, str | int | bool] = {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": "platform.internal",
        "gen_ai.request.model": "deploy-assistant-prod",
        "gen_ai.usage.input_tokens": event.input_tokens,
        "gen_ai.usage.output_tokens": event.output_tokens,
        "platform.answer.route": event.route,
        "platform.answer.unsafe_claim_served": event.unsafe_claim_served,
        "platform.failure.stage": event.first_failed_stage,
        "platform.prompt.template_id": event.prompt_template_id,
    }
    if event.evidence_version is not None:
        attributes["platform.evidence.version"] = event.evidence_version
    if event.escape_stage is not None:
        attributes["platform.failure.escape_stage"] = event.escape_stage
    return attributes

unsafe_attributes = trace_attributes(current_window[-1])
print(f"operation={unsafe_attributes['gen_ai.operation.name']}")
print(f"template={unsafe_attributes['platform.prompt.template_id']}")
print(f"failed_stage={unsafe_attributes['platform.failure.stage']}")
print(f"escape_stage={unsafe_attributes['platform.failure.escape_stage']}")
print(f"raw_messages_logged={'gen_ai.input.messages' in unsafe_attributes}")

assert unsafe_attributes["platform.answer.unsafe_claim_served"] is True
assert unsafe_attributes["platform.failure.escape_stage"] == "serving_gate"
assert unsafe_attributes["gen_ai.provider.name"] == "platform.internal"
assert "gen_ai.input.messages" not in unsafe_attributes
assert "gen_ai.output.messages" not in unsafe_attributes
assert "gen_ai.system_instructions" not in unsafe_attributes

Output

operation=chat
template=deploy-answerer-v1.1-regression
failed_stage=claim_generation
escape_stage=serving_gate
raw_messages_logged=False

gen_ai.* attributes let tracing backends recognize the model call. platform.* attributes explain the business decision. Neither requires putting an operator's full message into a durable trace.

Attribute cost with a versioned rate card

Cost is part of monitoring, but embedding current provider price claims in application logic or a tutorial makes both go stale quickly. Production systems should store usage counts on the trace and evaluate them with a versioned rate card. The rates below are synthetic fixture values, chosen only to exercise the calculation.

versioned-cost.py

@dataclass(frozen=True)
class RateCard:
    version: str
    input_per_million: float
    cached_input_per_million: float
    output_per_million: float

def estimate_cost(event: TraceEvent, card: RateCard) -> float:
    if event.input_tokens < 0 or event.output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if not 0 <= event.cached_input_tokens <= event.input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    uncached_input = event.input_tokens - event.cached_input_tokens
    return (
        uncached_input * card.input_per_million
        + event.cached_input_tokens * card.cached_input_per_million
        + event.output_tokens * card.output_per_million
    ) / 1_000_000

card = RateCard(
    version="internal-fixture-2026-05-27",
    input_per_million=1.00,
    cached_input_per_million=0.10,
    output_per_million=5.00,
)
cost_by_request = {
    event.request_id: estimate_cost(event, card)
    for event in current_window
}
total_cost = sum(cost_by_request.values())
malformed_usage = replace(current_window[-1], cached_input_tokens=202)

print(f"rate_card={card.version}")
print(f"window_cost_usd={total_cost:.6f}")
print(f"req_206_cost_usd={cost_by_request['req_206']:.6f}")
try:
    estimate_cost(malformed_usage, card)
except ValueError as error:
    print(f"invalid_usage={error}")
else:
    raise AssertionError("malformed usage event should fail")

assert card.version.startswith("internal-fixture")
assert total_cost > 0

Output

rate_card=internal-fixture-2026-05-27
window_cost_usd=0.001427
req_206_cost_usd=0.000279
invalid_usage=cached_input_tokens must be between 0 and input_tokens

If a provider later changes cached-token pricing or adds a service-tier surcharge, old traces remain interpretable: usage was recorded once, while the rate-card version states how the estimate was computed. Reject impossible counters rather than clamping them into a plausible-looking cost.
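The sketch below illustrates that property by pricing one stored usage record under two rate-card versions. It is a self-contained illustration: the Usage type and the second card ("internal-fixture-hypothetical-v2") are assumptions invented for this example, and all rates remain synthetic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Usage:
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int

@dataclass(frozen=True)
class RateCard:
    version: str
    input_per_million: float
    cached_input_per_million: float
    output_per_million: float

def estimate_cost(usage: Usage, card: RateCard) -> float:
    """Price recorded usage under one rate-card version."""
    uncached_input = usage.input_tokens - usage.cached_input_tokens
    return (
        uncached_input * card.input_per_million
        + usage.cached_input_tokens * card.cached_input_per_million
        + usage.output_tokens * card.output_per_million
    ) / 1_000_000

# Both cards are synthetic; the second one is a hypothetical later revision
# that only changes the cached-input rate.
cards = [
    RateCard("internal-fixture-2026-05-27", 1.00, 0.10, 5.00),
    RateCard("internal-fixture-hypothetical-v2", 1.00, 0.25, 5.00),
]

usage_206 = Usage(input_tokens=201, cached_input_tokens=80, output_tokens=30)
costs = {card.version: estimate_cost(usage_206, card) for card in cards}
for version, cost in costs.items():
    print(f"{version}: {cost:.6f}")
```

The usage record never changes; only the card applied to it does. Storing the card version beside each estimate is what lets a later reader tell a pricing change apart from a traffic change.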

Store durable debugging evidence without operator text

A trace may contain an incident identifier, email address, or full operator request. That text can help an incident investigation, but it shouldn't be your default long-lived record.

For this workflow, durable logs can retain:

Retain by default	Avoid by default
Request ID, evidence version, prompt template ID	Full operator message
Route, first failure stage, escape stage, claim-verdict counts	Full generated answer
Token counts, latency, rate-card version	Unredacted incident or contact details
Redacted preview when needed	Raw retrieved documents

The record below keeps a redacted preview for an incident exemplar while leaving raw payload storage unset.

durable-log-record.py

import re

@dataclass(frozen=True)
class DurableRecord:
    request_id: str
    prompt_template_id: str
    evidence_version: str | None
    route: str
    failed_stage: str
    escape_stage: str | None
    redacted_preview: str
    raw_payload_ref: str | None
    scrub_policy: str

def redact_operator_text(text: str) -> str:
    text = re.sub(r"#[A-Z0-9]+", "[INCIDENT_ID]", text)
    return re.sub(r"\b[\w.+-]+@[\w.-]+\.\w+\b", "[EMAIL]", text)

def durable_record(event: TraceEvent, operator_text: str) -> DurableRecord:
    return DurableRecord(
        request_id=event.request_id,
        prompt_template_id=event.prompt_template_id,
        evidence_version=event.evidence_version,
        route=event.route,
        failed_stage=event.first_failed_stage,
        escape_stage=event.escape_stage,
        redacted_preview=redact_operator_text(operator_text),
        raw_payload_ref=None,
        scrub_policy="platform-pii-v1",
    )

record = durable_record(
    current_window[-1],
    "Email me deploy approval for incident #INC10234 at bruno@example.com",
)
print(record.redacted_preview)
print(f"raw_payload_stored={record.raw_payload_ref is not None}")
print(f"scrub_policy={record.scrub_policy}")
print(f"escape_stage={record.escape_stage}")

assert "#INC10234" not in record.redacted_preview
assert "bruno@example.com" not in record.redacted_preview

Output

Email me deploy approval for incident [INCIDENT_ID] at [EMAIL]
raw_payload_stored=False
scrub_policy=platform-pii-v1
escape_stage=serving_gate

The regex is intentionally narrow: it makes the fixture readable, but it isn't a production scrubber. In production, minimize collected attributes, review what instrumentation libraries emit, and apply centrally managed Collector processors or an equivalent scrub pipeline before durable storage. [5: Handling sensitive data, https://opentelemetry.io/docs/security/handling-sensitive-data/]
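"Minimize collected attributes" can be approximated in application code with an allowlist applied before a record is written. The sketch below is a hedged illustration, not a replacement for a central scrub pipeline: the allowlist contents and the minimize_attributes helper are assumptions chosen to match the attribute names used in this lesson.

```python
# Hypothetical allowlist: exact keys plus prefixes considered safe by default.
ALLOWED_KEYS = {
    "gen_ai.operation.name",
    "gen_ai.provider.name",
    "gen_ai.request.model",
}
ALLOWED_PREFIXES = ("gen_ai.usage.", "platform.")

def minimize_attributes(attributes: dict[str, object]) -> dict[str, object]:
    """Keep only allowlisted attributes; drop everything else by default."""
    return {
        key: value
        for key, value in attributes.items()
        if key in ALLOWED_KEYS or key.startswith(ALLOWED_PREFIXES)
    }

emitted = {
    "gen_ai.operation.name": "chat",
    "gen_ai.usage.output_tokens": 30,
    "platform.answer.route": "serve",
    # Content-bearing attributes that some instrumentation might attach.
    "gen_ai.input.messages": "Email me deploy approval ...",
    "user.email": "bruno@example.com",
}

stored = minimize_attributes(emitted)
print(sorted(stored))
```

An allowlist fails closed: a new attribute added by a library upgrade is dropped until someone reviews it. A denylist has the opposite default, which is why it tends to leak first.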

Privacy-preserving trace transformation where the raw request for incident INC-10234 and bruno@example.com is not stored, the scrub policy platform-pii-v1 replaces those values with INCIDENT_ID and EMAIL markers, and the durable request 206 record keeps template, evidence, route, claim-generation failure, and serving-gate escape fields while raw_payload_ref remains None. — The durable request 206 record keeps enough structured evidence to reproduce the gate escape, while the raw incident ID, email address, and payload reference remain absent.

Alert on decisions engineers can act on

Not every metric deserves a page. A long answer or higher cost might warrant investigation. An unsupported deploy claim that escaped a hard gate violates the product invariant and should page immediately.

Thresholds must come from a service-level objective (SLO) or another owned service contract, not a generic article. This lab's synthetic policy says:

Any unsafe served deploy claim is a page.
p95 TTFT above 500 ms is a warning for this interactive answer path.
Estimated cost above the fixture budget is a warning, using the rate card named in the event window.

actionable-alerts.py

@dataclass(frozen=True)
class MonitorPolicy:
    policy_id: str
    max_unsafe_serves: int
    max_p95_ttft_ms: int
    max_window_cost_usd: float

def evaluate_alerts(
    events: list[TraceEvent],
    quality: QualityWindow,
    cost_usd: float,
    policy: MonitorPolicy,
) -> list[str]:
    findings: list[str] = []
    if quality.unsafe_serves > policy.max_unsafe_serves:
        findings.append(
            f"PAGE unsafe_claim_served={quality.unsafe_serves}/{quality.requests}"
        )
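    # Derive p95 TTFT from the events passed in, not from a module-level global.
    window_p95_ttft_ms = nearest_rank_percentile(
        [event.ttft_ms for event in events], 0.95
    )
    if window_p95_ttft_ms > policy.max_p95_ttft_ms:
        findings.append(f"WARN p95_ttft_ms={window_p95_ttft_ms}")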
    if cost_usd > policy.max_window_cost_usd:
        findings.append(f"WARN window_cost_usd={cost_usd:.6f}")
    return findings

policy = MonitorPolicy(
    policy_id="deploy-grounding-slo-v1",
    max_unsafe_serves=0,
    max_p95_ttft_ms=500,
    max_window_cost_usd=0.002,
)
findings = evaluate_alerts(current_window, quality, total_cost, policy)

for finding in findings:
    print(finding)
print(f"latency_warning={any('ttft' in item for item in findings)}")
print(f"cost_warning={any('cost' in item for item in findings)}")

assert findings == ["PAGE unsafe_claim_served=2/6"]

Output

PAGE unsafe_claim_served=2/6
latency_warning=False
cost_warning=False

The page is actionable because it says which invariant broke. It doesn't page merely because a judge score drifted or because latency was a little noisy.

Six synthetic requests are enough to demonstrate alert logic, not to set a production latency or cost budget. In a real launch, define those warning thresholds from representative traffic and revisit them as workload shape changes. The zero-tolerance safety invariant is different: a known unsupported deploy claim must not be served.

Attach an exemplar and a first investigation step

An on-call engineer shouldn't begin by searching arbitrary logs. For an invariant page, attach one violating trace with its prompt template and source version, then state the immediate check.

runbook-card.py

@dataclass(frozen=True)
class IncidentCard:
    policy_id: str
    summary: str
    exemplar_request_id: str
    template_id: str
    evidence_version: str | None
    first_failed_stage: str
    escape_stage: str | None
    first_check: str

def incident_card(events: list[TraceEvent], policy: MonitorPolicy) -> IncidentCard:
    exemplar = next(event for event in events if event.unsafe_claim_served)
    return IncidentCard(
        policy_id=policy.policy_id,
        summary="Unsupported deploy claim reached operator response.",
        exemplar_request_id=exemplar.request_id,
        template_id=exemplar.prompt_template_id,
        evidence_version=exemplar.evidence_version,
        first_failed_stage=exemplar.first_failed_stage,
        escape_stage=exemplar.escape_stage,
        first_check="Compare serving-gate template with deploy-answerer-v1.",
    )

card_view = incident_card(current_window, policy)
print(f"exemplar={card_view.exemplar_request_id}")
print(f"template={card_view.template_id}")
print(f"first_stage={card_view.first_failed_stage}")
print(f"escape_stage={card_view.escape_stage}")
print(f"first_check={card_view.first_check}")

assert card_view.first_failed_stage == "claim_generation"
assert card_view.escape_stage == "serving_gate"

Output

exemplar=req_205
template=deploy-answerer-v1.1-regression
first_stage=claim_generation
escape_stage=serving_gate
first_check=Compare serving-gate template with deploy-answerer-v1.

The request still needs deeper review, but investigation is now narrow: generation produced an unsupported deploy approval, and the serving gate let it escape. Compare the template or gate configuration with the strict version that abstained on the same failure case.
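That comparison can start from the traces themselves: find failure cases where the strict template and the regressed template chose different routes. The sketch below is illustrative only; the row layout and the route_divergence helper are assumptions, and the rows are a subset of the synthetic window.

```python
from collections import defaultdict

# Hypothetical trimmed rows: (request_id, case_id, prompt_template_id, route).
rows = [
    ("req_201", "clean_policy", "deploy-answerer-v1", "serve"),
    ("req_202", "invented_approval", "deploy-answerer-v1", "abstain"),
    ("req_205", "invented_approval", "deploy-answerer-v1.1-regression", "serve"),
    ("req_206", "invented_approval", "deploy-answerer-v1.1-regression", "serve"),
]

def route_divergence(rows, baseline: str, candidate: str):
    """Return {case_id: (baseline_routes, candidate_routes)} where they differ."""
    routes = defaultdict(lambda: defaultdict(set))
    for _request_id, case_id, template_id, route in rows:
        routes[case_id][template_id].add(route)

    diverged = {}
    for case_id, by_template in routes.items():
        # Only compare cases that both templates actually handled.
        if baseline in by_template and candidate in by_template:
            if by_template[baseline] != by_template[candidate]:
                diverged[case_id] = (
                    sorted(by_template[baseline]),
                    sorted(by_template[candidate]),
                )
    return diverged

diverged = route_divergence(
    rows, "deploy-answerer-v1", "deploy-answerer-v1.1-regression"
)
print(diverged)
```

A divergence on the same case ID narrows the search to what changed between the two templates or their gate configuration. It doesn't by itself say which of the two routes was correct; that still comes from the unsafe_claim_served field.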

Extend the contract when answers call tools

The central example is a factual answer gate. If the product later allows an agent to open rollback tickets or trigger CI jobs, the same discipline applies to tool actions. Log state transitions you control: tool name, redacted parameter hash, result status, progress marker, retry count, and stop reason. Don't depend on hidden reasoning text.

Agent state-delta comparison where two consecutive lookup_deploy_status steps have the same redacted parameter hash deploy-redacted:v1 and the same progress marker scan:v17, so all three equality checks produce stalled_repeat true and route to stop_and_review, while a counterfactual scan:v18 progress marker would continue. — An identical action becomes a stalled loop only when the controlled progress marker is also unchanged. A new deploy state takes the continue branch without inspecting hidden reasoning text.

tool-loop-monitor.py

@dataclass(frozen=True)
class ToolStep:
    tool_name: str
    params_hash: str
    progress_marker: str

def stalled_repeat(steps: list[ToolStep]) -> bool:
    if len(steps) < 2:
        return False
    last, previous = steps[-1], steps[-2]
    return (
        last.tool_name == previous.tool_name
        and last.params_hash == previous.params_hash
        and last.progress_marker == previous.progress_marker
    )

steps = [
    ToolStep("lookup_deploy_status", "deploy-redacted:v1", "scan:v17"),
    ToolStep("lookup_deploy_status", "deploy-redacted:v1", "scan:v17"),
]

print(f"tool={steps[-1].tool_name}")
print(f"stalled_repeat={stalled_repeat(steps)}")
print("route=stop_and_review" if stalled_repeat(steps) else "route=continue")

assert stalled_repeat(steps) is True

Output

tool=lookup_deploy_status
stalled_repeat=True
route=stop_and_review
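The params_hash values above are placeholders. One possible way to produce a redacted parameter hash is to drop sensitive keys, serialize the rest canonically, and hash the result. The sketch below is an assumption-laden illustration: the SENSITIVE_KEYS set, the parameter names, and the truncated digest format are all hypothetical.

```python
import hashlib
import json

# Hypothetical keys that must never influence or appear in telemetry.
SENSITIVE_KEYS = {"operator_email", "incident_id"}

def redacted_params_hash(params: dict) -> str:
    """Hash tool parameters after removing sensitive keys.

    Sorting keys and fixing separators makes the serialization canonical,
    so logically equal parameters always produce the same digest.
    """
    kept = {key: value for key, value in params.items() if key not in SENSITIVE_KEYS}
    canonical = json.dumps(kept, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "sha256:" + digest[:16]

first = redacted_params_hash(
    {"service": "checkout", "env": "prod", "operator_email": "a@example.com"}
)
repeat = redacted_params_hash(
    {"env": "prod", "service": "checkout", "operator_email": "b@example.com"}
)
other = redacted_params_hash({"service": "billing", "env": "prod"})

print(f"same_call_same_hash={first == repeat}")
print(f"different_call_different_hash={first != other}")
```

Dropping sensitive keys before hashing means two calls that differ only in those fields look identical to the loop monitor. That is usually the intended trade-off for stall detection, but it should be a documented choice.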

Preserve the evidence for the next fix

An alert tells you something is wrong now. The next engineering question is whether a proposed fix is better. That requires an experiment record linking the failing window, proposed template, policy, and rerun results.

experiment-handoff.py

@dataclass(frozen=True)
class ExperimentHandoff:
    incident_policy_id: str
    failing_template_id: str
    exemplar_request_id: str
    evidence_version: str | None
    candidate_template_id: str
    required_metric: str
    promotion_status: str

handoff = ExperimentHandoff(
    incident_policy_id=card_view.policy_id,
    failing_template_id=card_view.template_id,
    exemplar_request_id=card_view.exemplar_request_id,
    evidence_version=card_view.evidence_version,
    candidate_template_id="deploy-answerer-v1.2-fix",
    required_metric="unsafe_claim_served == 0 on regression and holdout windows",
    promotion_status="BLOCKED_PENDING_EVALUATION",
)

print(f"candidate={handoff.candidate_template_id}")
print(f"required_metric={handoff.required_metric}")
print(f"promotion={handoff.promotion_status}")

assert handoff.promotion_status == "BLOCKED_PENDING_EVALUATION"

Output

candidate=deploy-answerer-v1.2-fix
required_metric=unsafe_claim_served == 0 on regression and holdout windows
promotion=BLOCKED_PENDING_EVALUATION

The monitoring system doesn't approve the proposed fix. It creates a precise starting point for the next controlled run: what broke, which candidate claims to fix it, and which metric must pass.
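When rerun results arrive, the required metric can be checked mechanically before a human reviews the candidate. The sketch below is one possible shape for that check, under stated assumptions: the WindowResult type, the window names, and every status other than BLOCKED_PENDING_EVALUATION are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WindowResult:
    window_name: str
    requests: int
    unsafe_serves: int

REQUIRED_WINDOWS = ("regression", "holdout")

def promotion_status(results: list[WindowResult]) -> str:
    """Apply 'unsafe_claim_served == 0 on regression and holdout windows'."""
    by_name = {result.window_name: result for result in results}

    # A missing or empty window is not evidence of safety.
    for name in REQUIRED_WINDOWS:
        if name not in by_name or by_name[name].requests == 0:
            return "BLOCKED_PENDING_EVALUATION"

    if any(by_name[name].unsafe_serves > 0 for name in REQUIRED_WINDOWS):
        return "REJECTED_UNSAFE_SERVE"
    return "ELIGIBLE_FOR_REVIEW"

only_regression = promotion_status([WindowResult("regression", 40, 0)])
leaky = promotion_status(
    [WindowResult("regression", 40, 0), WindowResult("holdout", 60, 1)]
)
clean = promotion_status(
    [WindowResult("regression", 40, 0), WindowResult("holdout", 60, 0)]
)

print(f"only_regression={only_regression}")
print(f"leaky={leaky}")
print(f"clean={clean}")
```

Even the passing case ends at "eligible for review," not "promoted." The check removes candidates that fail the invariant; it doesn't replace the controlled experiment or its sign-off.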

Mastery check

Mastery outcomes

Capability	Working proof
Separate safety from responsiveness	A six-event window pages on unsupported serves despite healthy p95 TTFT
Diagnose root cause and failed containment	Traces record `claim_generation` separately from the `serving_gate` escape
Keep durable telemetry useful and private	The incident record retains redacted evidence, stable identifiers, and no raw payload
Hand incident evidence to experiment tracking	The monitor creates a blocked candidate handoff with a required safety metric

Evaluation rubric

Starts from a factual-serving invariant instead of generic dashboard metrics
Treats justified abstention as safe behavior rather than a blanket failure
Detects unsafe served claims even when latency is healthy
Keeps the earliest defect separate from the protection boundary that leaked it
Computes post-first-token throughput without counting the first token twice
Keeps sensitive prompt and answer text out of default durable attributes
Uses a versioned cost card rather than hard-coded current-provider assertions
Pages with an exemplar trace and a concrete first check
Records tool progress without relying on chain-of-thought logs
Blocks a candidate fix until a measured experiment validates it

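Follow-up questions

Why isn't the 50% abstention rate in this synthetic window automatically an incident?

Two unsupported deploy approvals were served while p95 TTFT was 240 ms. Which metric determines paging?

An unsupported deploy approval was generated and then served. Why record both claim_generation and serving_gate?

Why store prompt_template_id and evidence_version in the exemplar?

Why doesn't the monitor immediately approve deploy-answerer-v1.2-fix?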

Common pitfalls

Response success is mistaken for answer safety

Symptom: Dashboards stay green while operators receive invented deploy approvals.
Cause: Monitoring captures HTTP status and latency but not claim-gate outcomes.
Fix: Emit and aggregate the unsafe-served-claim invariant beside performance metrics.

Abstention pages the on-call engineer

Symptom: A strict gate produces noisy incidents whenever evidence is missing.
Cause: Safe withholding and unsafe serving were collapsed into one "bad answer" metric.
Fix: Separate unsafe-serve, abstention, and supported-serve rates.
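As a rough illustration of that separation, the sketch below buckets each event once and reports the rates side by side. The event shape, the request ids, and the `other` bucket are assumptions made for this example; only `route` and `unsafe_claim_served` come from the serving contract described earlier.

```python
from collections import Counter

# Illustrative six-event window; request ids and routes are assumed values.
events = [
    {"request_id": 201, "route": "serve", "unsafe_claim_served": False},
    {"request_id": 202, "route": "abstain", "unsafe_claim_served": False},
    {"request_id": 203, "route": "abstain", "unsafe_claim_served": False},
    {"request_id": 204, "route": "abstain", "unsafe_claim_served": False},
    {"request_id": 205, "route": "serve", "unsafe_claim_served": True},
    {"request_id": 206, "route": "serve", "unsafe_claim_served": True},
]

BUCKETS = ("unsafe_serve", "abstention", "supported_serve", "other")


def classify(event):
    """Map one trace event to exactly one outcome bucket."""
    if event["unsafe_claim_served"]:
        return "unsafe_serve"
    if event["route"] == "abstain":
        return "abstention"
    if event["route"] in ("serve", "shorten"):
        return "supported_serve"
    return "other"  # for example, the review route


def outcome_rates(window):
    """Return the share of the window that falls in each bucket."""
    if not window:
        raise ValueError("cannot compute rates for an empty window")
    counts = Counter(classify(event) for event in window)
    return {name: counts[name] / len(window) for name in BUCKETS}


rates = outcome_rates(events)
# Only the unsafe-serve rate pages; abstention is tracked but does not alert.
should_page = rates["unsafe_serve"] > 0
print(rates, should_page)
```

With the buckets kept apart, a 50% abstention rate stays visible as a trend to review without triggering the page that the two unsafe serves do.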

Streaming speed is calculated incorrectly

Symptom: Generation throughput looks slightly better than it was.
Cause: The first token is counted in an interval that begins after first-token arrival.
Fix: Divide remaining output tokens by post-TTFT duration, or define a different measured interval explicitly. Reject malformed timing events instead of masking them with a fallback denominator.
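The arithmetic is small enough to sketch directly. The function name and the millisecond parameters below are hypothetical; the point is that the numerator drops the first token and malformed timings raise instead of falling back to a substitute denominator.

```python
def post_ttft_tokens_per_second(output_tokens, ttft_ms, total_ms):
    """Throughput over the interval that starts when the first token arrives."""
    if output_tokens < 1:
        raise ValueError("a streamed answer needs at least one output token")
    if ttft_ms < 0 or total_ms < ttft_ms:
        raise ValueError("malformed timing: total_ms must be >= ttft_ms >= 0")

    remaining_tokens = output_tokens - 1  # the first token arrived at TTFT
    if remaining_tokens == 0:
        return None  # nothing was generated after the first token

    duration_s = (total_ms - ttft_ms) / 1000
    if duration_s <= 0:
        raise ValueError("malformed timing: tokens after TTFT with zero duration")
    return remaining_tokens / duration_s


# 101 tokens, first token at 200 ms, stream finished at 2200 ms.
correct = post_ttft_tokens_per_second(101, ttft_ms=200, total_ms=2200)
naive = 101 / ((2200 - 200) / 1000)  # counts the first token in the interval
print(correct, naive)
```

In this example the naive figure is 50.5 tokens per second against a correct 50.0, which is the "slightly better than it was" symptom in miniature.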

Root cause and containment escape are collapsed

Symptom: The incident record points at the serving gate but no longer shows where the unsupported draft originated.
Cause: One field is used for both the earliest defect and the boundary that leaked it.
Fix: Record `first_failed_stage` and `escape_stage` separately, then attach both to the exemplar.
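One way to keep the two facts from collapsing is to give each its own typed field. The `StageFinding` class below is a hypothetical shape for this example, and the request ids are assumed; the stage names are the ones used in this article.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageFinding:
    request_id: int
    first_failed_stage: str      # earliest stage that produced the defect
    escape_stage: Optional[str]  # boundary that leaked it; None if contained

    @property
    def contained(self) -> bool:
        return self.escape_stage is None


findings = [
    # Defect caught before serving: a root cause exists, but nothing escaped.
    StageFinding(204, "claim_generation", None),
    # Same root cause, but the serving gate let the claim through.
    StageFinding(205, "claim_generation", "serving_gate"),
    StageFinding(206, "claim_generation", "serving_gate"),
]

# Group by both fields so root cause and failed containment stay distinct.
by_path = Counter((f.first_failed_stage, f.escape_stage) for f in findings)
escaped = [f.request_id for f in findings if not f.contained]
print(by_path, escaped)
```

A contained defect and an escaped one can then share a root cause while producing different alerts.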

Durable logs collect raw operator text

Symptom: Debugging records retain incident identifiers and email addresses unnecessarily.
Cause: Raw prompts were treated as ordinary metadata.
Fix: Store redacted previews and stable identifiers by default; govern any raw-payload access separately.
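A minimal sketch of that default follows, assuming a regex-based redactor and a keyed hash for the stable identifier. The `INC-` incident format, the field names, and the key handling are hypothetical, and pattern matching alone is not a complete detector for sensitive data.

```python
import hashlib
import hmac
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
INCIDENT_ID = re.compile(r"\bINC-\d+\b")  # hypothetical incident-id format


def redacted_preview(text, limit=80):
    """Replace known sensitive patterns, then truncate for durable storage."""
    text = EMAIL.sub("[EMAIL]", text)
    text = INCIDENT_ID.sub("[INCIDENT_ID]", text)
    return text[:limit]


def stable_id(text, key):
    """Keyed digest: equal inputs correlate without storing the text itself."""
    digest = hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


raw_prompt = "Approve deploy for INC-4821, contact ana@example.com"
key = b"example-key-load-from-a-secret-store"  # placeholder, not a real secret

durable_record = {
    "prompt_id": stable_id(raw_prompt, key),
    "prompt_preview": redacted_preview(raw_prompt),
}
print(durable_record)
```

The raw prompt never enters `durable_record`; anything that needs the original text would go through a separately governed path.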

The alert can't start an investigation

Symptom: A page says "quality decreased" and the on-call engineer has to search broad logs manually.
Cause: The alert has no exemplar trace, template version, evidence version, first failed stage, or escape stage.
Fix: Attach a violating trace and the first runbook check to every invariant alert.
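The sketch below shows one possible alert payload that carries its own starting point. The dictionary layout, the sample values for `prompt_template_id` and `evidence_version`, and the wording of the first check are all assumptions for illustration, not this article's production schema.

```python
EXEMPLAR_FIELDS = (
    "request_id",
    "prompt_template_id",
    "evidence_version",
    "first_failed_stage",
    "escape_stage",
)


def build_invariant_alert(window):
    """Return a page payload with an exemplar, or None if the invariant holds."""
    violations = [e for e in window if e.get("unsafe_claim_served")]
    if not violations:
        return None

    exemplar = violations[0]
    missing = [name for name in EXEMPLAR_FIELDS if name not in exemplar]
    if missing:
        # An alert without these fields cannot start an investigation.
        raise ValueError(f"exemplar is missing fields: {missing}")

    return {
        "severity": "page",
        "invariant": "unsafe_claim_served == 0",
        "violation_count": len(violations),
        "exemplar": {name: exemplar[name] for name in EXEMPLAR_FIELDS},
        # Hypothetical runbook step; replace with the team's real first check.
        "first_check": "Compare the exemplar's prompt_template_id with the "
                       "last template version that passed the serving gate.",
    }


def make_event(request_id, unsafe):
    """Build an illustrative event; every value here is assumed sample data."""
    return {
        "request_id": request_id,
        "unsafe_claim_served": unsafe,
        "prompt_template_id": "deploy-answerer-v1",
        "evidence_version": "ev-001",
        "first_failed_stage": "claim_generation" if unsafe else None,
        "escape_stage": "serving_gate" if unsafe else None,
    }


window = [make_event(204, False), make_event(205, True), make_event(206, True)]
alert = build_invariant_alert(window)
print(alert)
```

Raising on a missing exemplar field is a deliberate choice in this sketch: an incomplete page surfaces as a telemetry bug rather than as a vague alert.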

Next Step

Continue to Experiment Tracking with MLflow and W&B

You can now detect a live failure and capture the exact evidence behind it. Next you'll record controlled candidate runs so a proposed fix earns promotion through reproducible results.

References

Semantic conventions for generative AI systems

OpenTelemetry Authors · 2026

OpenTelemetry GenAI metrics, v1.41.0

OpenTelemetry Authors · 2026

OpenTelemetry Specification

OpenTelemetry Authors · 2023

OpenTelemetry GenAI span definitions, v1.41.0

OpenTelemetry Authors · 2026

Handling sensitive data

OpenTelemetry Authors · 2026
