LLM Observability & Monitoring

Turn claim-level answer traces into production metrics, actionable alerts, privacy-safe debugging records, and reproducible incident evidence.


In the previous lesson, a large language model (LLM) answer path called tracking-answerer-v1 refused to promise a delivery date that wasn't present in a FastShip record. It also emitted a trace: evidence version, route, and first failed stage. One safe trace proves one request behaved correctly. It doesn't tell you whether the service is still safe across live traffic.

Suppose a prompt rollback silently bypasses that gate for two customers. The endpoint still returns 200. Answers remain fluent. Latency stays fast. Only the trace events reveal that unsupported ETAs escaped into served messages.

Monitoring reduces many trace events into rates, percentiles, and alerts. Observability lets you start from an alert and reconstruct why a particular request failed. An LLM application needs both: aggregates to detect a change and request evidence to diagnose it.

Figure: Delivery answer traces roll up into a window where two unsupported ETAs are served despite healthy latency, triggering an alert with an exemplar request.
Healthy response time does not make an unsupported delivery promise acceptable. The safety invariant pages first, and the exemplar trace tells the engineer where to look.

Start with the event your gate already produced

Avoid beginning with a dashboard wish list. Begin with the decision boundary from the serving path:

| Serving question | Event field | Why it matters |
| --- | --- | --- |
| What answer route was chosen? | route | Shows serve, shorten, or abstain behavior |
| Did an unsupported claim reach the customer? | unsafe_claim_served | Direct safety invariant |
| Where did a request first go wrong? | first_failed_stage | Separates evidence, generation, and serving faults |
| Which record and prompt were used? | evidence_version, prompt_template_id | Makes a failure reproducible |
| Was the request usable and affordable? | token and timing fields | Measures experience and cost beside quality |

This chapter uses synthetic events from the same delivery-update path as the prior lab. The last two events model a regression; they aren't claims about real production traffic.

Figure: Claim gate trace (route + verdicts), window rollup (rates + percentiles), safety invariant check, and page with exemplar trace and runbook.

The first cell creates six requests. Four are safe outcomes from a strict gate: a supported answer or a blocked factual claim. Two come from a regressed template that served an invented ETA.

trace-events.py

    from dataclasses import dataclass, replace

    @dataclass(frozen=True)
    class TraceEvent:
        request_id: str
        case_id: str
        evidence_version: str | None
        prompt_template_id: str
        route: str
        first_failed_stage: str
        unsafe_claim_served: bool
        input_tokens: int
        cached_input_tokens: int
        output_tokens: int
        ttft_ms: int
        total_ms: int

    current_window = [
        TraceEvent("req_201", "clean_scan", "fastship@v17", "tracking-answerer-v1",
                   "serve", "passed", False, 188, 80, 27, 172, 410),
        TraceEvent("req_202", "invented_eta", "fastship@v17", "tracking-answerer-v1",
                   "abstain", "claim_generation", False, 196, 80, 20, 181, 395),
        TraceEvent("req_203", "wrong_status", "fastship@v17", "tracking-answerer-v1",
                   "abstain", "claim_generation", False, 192, 80, 21, 188, 402),
        TraceEvent("req_204", "missing_feed", None, "tracking-answerer-v1",
                   "abstain", "evidence_admission", False, 160, 64, 15, 205, 382),
        TraceEvent("req_205", "invented_eta", "fastship@v17", "tracking-answerer-v1.1-regression",
                   "serve", "serving_gate", True, 198, 80, 29, 218, 430),
        TraceEvent("req_206", "invented_eta", "fastship@v17", "tracking-answerer-v1.1-regression",
                   "serve", "serving_gate", True, 201, 80, 30, 240, 455),
    ]

    print(f"events={len(current_window)}")
    for event in current_window:
        verdict = "UNSAFE SERVE" if event.unsafe_claim_served else "safe route"
        print(f"{event.request_id}: {event.route:7} {verdict}")

Output

    events=6
    req_201: serve   safe route
    req_202: abstain safe route
    req_203: abstain safe route
    req_204: abstain safe route
    req_205: serve   UNSAFE SERVE
    req_206: serve   UNSAFE SERVE

Notice the distinction: an abstention isn't automatically a failure. When evidence doesn't establish an ETA, abstention is the safe product behavior. The event you need to page on is an unsupported factual claim that was actually served.
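The distinction can be sketched as a three-way classification. This is a minimal illustration, not part of the lab: the bucket names `unsafe_serve`, `safe_abstain`, and `supported_serve` are assumptions chosen here, and the input tuples only mirror the `route` and `unsafe_claim_served` fields of the six synthetic requests.

```python
from collections import Counter

def classify_outcome(route: str, unsafe_claim_served: bool) -> str:
    """Map one trace event to a monitoring bucket (bucket names are illustrative)."""
    if unsafe_claim_served:
        return "unsafe_serve"   # unsupported claim reached the customer: page-worthy
    if route == "abstain":
        return "safe_abstain"   # safe behavior, tracked as a usefulness signal
    return "supported_serve"    # answer served with supporting evidence

# (route, unsafe_claim_served) pairs mirroring the six synthetic requests.
window = [
    ("serve", False), ("abstain", False), ("abstain", False),
    ("abstain", False), ("serve", True), ("serve", True),
]

buckets = Counter(classify_outcome(route, unsafe) for route, unsafe in window)
print(dict(buckets))
```

Keeping the unsafe check first means a request that was both routed to `serve` and flagged unsafe can never be counted as a supported answer.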

Turn request evidence into quality metrics

An aggregate must preserve the invariant it claims to monitor. A generic "answer quality score" could hide two unsafe delivery promises inside many pleasant responses. For this answer path, start with metrics directly derived from claim-gate outcomes:

| Metric | Numerator / denominator | Interpretation |
| --- | --- | --- |
| Unsafe-serve rate | Requests with any unsupported served claim / requests | Customer-facing factual safety failure |
| Safe-route rate | Requests without unsafe served claims / requests | Gate effectiveness, including justified abstention |
| Abstention rate | Abstained requests / requests | Product usefulness signal, not a safety failure by itself |
| Failure-stage count | Requests by first failed stage | Where engineers should investigate first |

The next cell aggregates the synthetic production window.

quality-window.py

    from collections import Counter
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class QualityWindow:
        requests: int
        unsafe_serves: int
        abstentions: int
        failure_stages: Counter

        @property
        def unsafe_serve_rate(self) -> float:
            return self.unsafe_serves / self.requests

        @property
        def safe_route_rate(self) -> float:
            return (self.requests - self.unsafe_serves) / self.requests

    def quality_window(events: list[TraceEvent]) -> QualityWindow:
        return QualityWindow(
            requests=len(events),
            unsafe_serves=sum(event.unsafe_claim_served for event in events),
            abstentions=sum(event.route == "abstain" for event in events),
            failure_stages=Counter(event.first_failed_stage for event in events),
        )

    quality = quality_window(current_window)
    print(f"unsafe_serve_rate={quality.unsafe_serve_rate:.1%}")
    print(f"safe_route_rate={quality.safe_route_rate:.1%}")
    print(f"abstentions={quality.abstentions}")
    print(f"serving_gate_failures={quality.failure_stages['serving_gate']}")

    assert quality.unsafe_serves == 2
    assert quality.failure_stages["serving_gate"] == 2

Output

    unsafe_serve_rate=33.3%
    safe_route_rate=66.7%
    abstentions=3
    serving_gate_failures=2

The count does more than say "quality is down." Both escaped claims first failed at serving_gate, while the earlier strict template safely abstained on the same invented-ETA case. That evidence points at a gate or template rollback before anyone tunes retrieval.
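One way to make that template attribution explicit is to break the unsafe-serve count down by prompt template. The sketch below is an assumption about how such a breakdown might look, not the lab's own code: `unsafe_by_template` is a hypothetical helper, and the rows copy only the template, case, and safety fields from the synthetic window.

```python
from collections import defaultdict

# (prompt_template_id, case_id, unsafe_claim_served) from the synthetic window.
rows = [
    ("tracking-answerer-v1", "clean_scan", False),
    ("tracking-answerer-v1", "invented_eta", False),
    ("tracking-answerer-v1", "wrong_status", False),
    ("tracking-answerer-v1", "missing_feed", False),
    ("tracking-answerer-v1.1-regression", "invented_eta", True),
    ("tracking-answerer-v1.1-regression", "invented_eta", True),
]

def unsafe_by_template(rows):
    """Return {template_id: (unsafe_serves, requests)} for one window."""
    totals = defaultdict(lambda: [0, 0])
    for template_id, _case_id, unsafe in rows:
        totals[template_id][0] += int(unsafe)
        totals[template_id][1] += 1
    return {template: (unsafe, total) for template, (unsafe, total) in totals.items()}

breakdown = unsafe_by_template(rows)
for template_id, (unsafe, total) in breakdown.items():
    print(f"{template_id}: unsafe={unsafe}/{total}")
```

In this fixture every unsafe serve belongs to the regressed template, which is the pattern that would justify checking the rollback before touching retrieval.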

Figure: The same delivery-answer window is examined across safety, latency, and cost, with the safety invariant identifying a serving-gate regression.
Every metric window keeps the same trace identifiers and prompt-template version, so an alert can move from symptom to likely regression without guessing.

Measure responsiveness without hiding correctness

Quality is the first invariant here, but customers also notice sluggish answers. For a streamed response:

  • Time to first token (TTFT) is the delay before the customer sees output.
  • Post-first-token throughput measures how quickly the remaining output arrives.
  • End-to-end latency includes the entire response.

Be precise with token arithmetic. If TTFT marks arrival of the first token, only output_tokens - 1 tokens arrive in the time after TTFT. Counting all output tokens in that interval slightly overstates throughput.

streaming-latency.py

    import math

    def nearest_rank_percentile(values: list[int], percentile: float) -> int:
        ordered = sorted(values)
        rank = max(1, math.ceil(percentile * len(ordered)))
        return ordered[rank - 1]

    def post_first_token_tps(event: TraceEvent) -> float:
        remaining_tokens = max(event.output_tokens - 1, 0)
        if remaining_tokens == 0:
            return 0.0
        remaining_ms = event.total_ms - event.ttft_ms
        if remaining_ms <= 0:
            raise ValueError("total_ms must exceed ttft_ms after the first token")
        return remaining_tokens / (remaining_ms / 1000)

    p95_ttft_ms = nearest_rank_percentile([event.ttft_ms for event in current_window], 0.95)
    example_tps = post_first_token_tps(current_window[-1])
    malformed_timing = replace(current_window[-1], total_ms=240)

    print(f"p95_ttft_ms={p95_ttft_ms}")
    print(f"req_206_post_first_tps={example_tps:.1f}")
    print(f"quality_safe={quality.unsafe_serves == 0}")
    try:
        post_first_token_tps(malformed_timing)
    except ValueError as error:
        print(f"invalid_timing={error}")
    else:
        raise AssertionError("malformed timing event should fail")

    assert p95_ttft_ms == 240
    assert quality.unsafe_serves > 0

Output

    p95_ttft_ms=240
    req_206_post_first_tps=134.9
    quality_safe=False
    invalid_timing=total_ms must exceed ttft_ms after the first token

This window is fast and unsafe. A latency-only dashboard would celebrate the exact release that should be rolled back.

Don't hide malformed timing events behind a tiny fallback denominator. If a multi-token response reports total_ms <= ttft_ms, fix the instrumentation before trusting its throughput.
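The size of the overstatement from counting the first token is easy to check by hand. The arithmetic below uses the token count and timings of req_206 from the synthetic window; the variable names are chosen here for illustration.

```python
# req_206 from the synthetic window: 30 output tokens, TTFT 240 ms, total 455 ms.
output_tokens, ttft_ms, total_ms = 30, 240, 455
post_ttft_s = (total_ms - ttft_ms) / 1000  # time after the first token arrived

naive_tps = output_tokens / post_ttft_s          # wrongly includes the first token
correct_tps = (output_tokens - 1) / post_ttft_s  # only tokens that arrived after TTFT

print(f"naive={naive_tps:.1f} correct={correct_tps:.1f}")
```

The gap is small for a 30-token answer and grows as answers get shorter, because one token becomes a larger share of the total.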

OpenTelemetry (OTel) publishes generative-AI semantic conventions for spans and metrics. Its current metrics distinguish client first-chunk latency (gen_ai.client.operation.time_to_first_chunk) from model-server first-token latency (gen_ai.server.time_to_first_token). This lab's ttft_ms is an application-owned, customer-visible field, so document exactly where your timer starts and stops. The convention is marked Development; pin the convention version emitted by your instrumentation instead of treating its field set as permanent.[1][2]

Keep standard telemetry and product decisions together

OTel's generative-AI span attributes describe model operations and usage. The current inference-span convention requires both gen_ai.operation.name and gen_ai.provider.name; gen_ai.request.model is conditionally required when available. The synthetic service below uses a custom provider name because no real provider is involved. Use the convention's well-known provider value when one applies. Your application still needs custom attributes for its own invariant: which evidence version supported the answer, whether the answer route blocked a claim, and where the first failure occurred.[3]

Prompt, output-message, and system-instruction attributes are opt-in in the OTel convention because they may contain sensitive content. Store stable identifiers and safe outcome attributes by default; make raw text capture a deliberate, governed exception.[3]

trace-attributes.py

    def trace_attributes(event: TraceEvent) -> dict[str, str | int | bool]:
        attributes: dict[str, str | int | bool] = {
            "gen_ai.operation.name": "chat",
            "gen_ai.provider.name": "shopflow.internal",
            "gen_ai.request.model": "support-model-prod",
            "gen_ai.usage.input_tokens": event.input_tokens,
            "gen_ai.usage.output_tokens": event.output_tokens,
            "shopflow.answer.route": event.route,
            "shopflow.answer.unsafe_claim_served": event.unsafe_claim_served,
            "shopflow.failure.stage": event.first_failed_stage,
            "shopflow.prompt.template_id": event.prompt_template_id,
        }
        if event.evidence_version is not None:
            attributes["shopflow.evidence.version"] = event.evidence_version
        return attributes

    unsafe_attributes = trace_attributes(current_window[-1])
    print(f"operation={unsafe_attributes['gen_ai.operation.name']}")
    print(f"template={unsafe_attributes['shopflow.prompt.template_id']}")
    print(f"failed_stage={unsafe_attributes['shopflow.failure.stage']}")
    print(f"raw_messages_logged={'gen_ai.input.messages' in unsafe_attributes}")

    assert unsafe_attributes["shopflow.answer.unsafe_claim_served"] is True
    assert unsafe_attributes["gen_ai.provider.name"] == "shopflow.internal"
    assert "gen_ai.input.messages" not in unsafe_attributes
    assert "gen_ai.output.messages" not in unsafe_attributes
    assert "gen_ai.system_instructions" not in unsafe_attributes

Output

    operation=chat
    template=tracking-answerer-v1.1-regression
    failed_stage=serving_gate
    raw_messages_logged=False

gen_ai.* attributes let tracing backends recognize the model call. shopflow.* attributes explain the business decision. Neither requires putting a customer's full message into a durable trace.
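If an instrumentation library might attach content attributes on its own, a defensive filter before export is one option. The sketch below assumes a plain dictionary of attributes; `drop_content_attributes` and its `allow_content` flag are hypothetical names, and the three keys are the content attributes the lab asserts are absent.

```python
# Content-bearing attributes the lab keeps out of durable traces by default.
CONTENT_KEYS = frozenset({
    "gen_ai.input.messages",
    "gen_ai.output.messages",
    "gen_ai.system_instructions",
})

def drop_content_attributes(attributes: dict, allow_content: bool = False) -> dict:
    """Return a copy without raw-content attributes unless capture is explicitly enabled."""
    if allow_content:
        return dict(attributes)
    return {key: value for key, value in attributes.items() if key not in CONTENT_KEYS}

emitted = {
    "gen_ai.operation.name": "chat",
    "gen_ai.input.messages": "synthetic customer text",  # fixture value only
    "shopflow.answer.route": "serve",
}
safe = drop_content_attributes(emitted)
print(sorted(safe))
```

Making capture an explicit argument keeps the governed exception visible at the call site instead of hiding it in configuration.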

Attribute cost with a versioned rate card

Cost is part of monitoring, but embedding current provider price claims in application logic or a tutorial makes both go stale quickly. Production systems should store usage counts on the trace and evaluate them with a versioned rate card. The rates below are synthetic fixture values, chosen only to exercise the calculation.

versioned-cost.py

    @dataclass(frozen=True)
    class RateCard:
        version: str
        input_per_million: float
        cached_input_per_million: float
        output_per_million: float

    def estimate_cost(event: TraceEvent, card: RateCard) -> float:
        uncached_input = max(event.input_tokens - event.cached_input_tokens, 0)
        return (
            uncached_input * card.input_per_million
            + event.cached_input_tokens * card.cached_input_per_million
            + event.output_tokens * card.output_per_million
        ) / 1_000_000

    card = RateCard(
        version="internal-fixture-2026-05-27",
        input_per_million=1.00,
        cached_input_per_million=0.10,
        output_per_million=5.00,
    )
    cost_by_request = {
        event.request_id: estimate_cost(event, card)
        for event in current_window
    }
    total_cost = sum(cost_by_request.values())

    print(f"rate_card={card.version}")
    print(f"window_cost_usd={total_cost:.6f}")
    print(f"req_206_cost_usd={cost_by_request['req_206']:.6f}")

    assert card.version.startswith("internal-fixture")
    assert total_cost > 0

Output

    rate_card=internal-fixture-2026-05-27
    window_cost_usd=0.001427
    req_206_cost_usd=0.000279

If a provider later changes cached-token pricing or adds a service-tier surcharge, old traces remain interpretable: usage was recorded once, while the rate-card version states how the estimate was computed.
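A short sketch of that re-evaluation: the same recorded usage is priced under two rate-card versions. The `Usage` class and both card versions (`fixture-a`, `fixture-b`) are synthetic assumptions made for this example, with the second card changing only the cached-input rate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Usage:
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int

@dataclass(frozen=True)
class RateCard:
    version: str
    input_per_million: float
    cached_input_per_million: float
    output_per_million: float

def estimate_cost(usage: Usage, card: RateCard) -> float:
    uncached = max(usage.input_tokens - usage.cached_input_tokens, 0)
    return (
        uncached * card.input_per_million
        + usage.cached_input_tokens * card.cached_input_per_million
        + usage.output_tokens * card.output_per_million
    ) / 1_000_000

usage = Usage(input_tokens=201, cached_input_tokens=80, output_tokens=30)  # req_206
cards = [
    RateCard("fixture-a", 1.00, 0.10, 5.00),  # same synthetic rates as the lab
    RateCard("fixture-b", 1.00, 0.25, 5.00),  # hypothetical cached-token price change
]
estimates = {card.version: estimate_cost(usage, card) for card in cards}
for version, cost in estimates.items():
    print(f"{version}: {cost:.6f}")
```

Because the trace stores token counts rather than dollars, neither estimate requires re-running or re-logging the request.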

Store durable debugging evidence without customer text

A trace may contain an order identifier, email address, or full customer request. That text can help an incident investigation, but it shouldn't be your default long-lived record.

For this workflow, durable logs can retain:

| Retain by default | Avoid by default |
| --- | --- |
| Request ID, evidence version, prompt template ID | Full customer message |
| Route, failure stage, claim-verdict counts | Full generated answer |
| Token counts, latency, rate-card version | Unredacted order or contact details |
| Redacted preview when needed | Raw retrieved documents |

The record below keeps a redacted preview for an incident exemplar while leaving raw payload storage unset.

durable-log-record.py

    import re

    @dataclass(frozen=True)
    class DurableRecord:
        request_id: str
        prompt_template_id: str
        evidence_version: str | None
        route: str
        failed_stage: str
        redacted_preview: str
        raw_payload_ref: str | None
        scrub_policy: str

    def redact_customer_text(text: str) -> str:
        # Replace order identifiers such as "#A10234", then email addresses.
        text = re.sub(r"#[A-Z0-9]+", "[ORDER_ID]", text)
        return re.sub(r"\b[\w.+-]+@[\w.-]+\.\w+\b", "[EMAIL]", text)

    def durable_record(event: TraceEvent, customer_text: str) -> DurableRecord:
        return DurableRecord(
            request_id=event.request_id,
            prompt_template_id=event.prompt_template_id,
            evidence_version=event.evidence_version,
            route=event.route,
            failed_stage=event.first_failed_stage,
            redacted_preview=redact_customer_text(customer_text),
            raw_payload_ref=None,
            scrub_policy="delivery-pii-v1",
        )

    # Synthetic fixture address; example.com is reserved for documentation.
    record = durable_record(
        current_window[-1],
        "Email me ETA for order #A10234 at customer@example.com",
    )
    print(record.redacted_preview)
    print(f"raw_payload_stored={record.raw_payload_ref is not None}")
    print(f"scrub_policy={record.scrub_policy}")

    assert "#A10234" not in record.redacted_preview
    assert "customer@example.com" not in record.redacted_preview

Output

    Email me ETA for order [ORDER_ID] at [EMAIL]
    raw_payload_stored=False
    scrub_policy=delivery-pii-v1

The regex is intentionally narrow: it makes the fixture readable, but it isn't a production scrubber. In production, minimize collected attributes, review what instrumentation libraries emit, and apply centrally managed Collector processors or an equivalent scrub pipeline before durable storage.[4]

Figure: A delivery message with an order ID and email is scrubbed before durable traces retain safe decision metadata, while raw payload storage remains disabled.
Start incident debugging with redacted evidence and exact trace identifiers. Escalate to raw payload access only under a separately governed process, if your product permits retaining it at all.

Alert on decisions engineers can act on

Not every metric deserves a page. A long answer or higher cost might warrant investigation. An unsupported delivery promise that escaped a hard gate violates the product invariant and should page immediately.

Thresholds must come from a service contract, not a generic article. This lab's synthetic policy says:

  • Any unsafe served delivery claim is a page.
  • p95 TTFT above 500 ms is a warning for this interactive answer path.
  • Estimated cost above the fixture budget is a warning, using the rate card named in the event window.

actionable-alerts.py

    @dataclass(frozen=True)
    class MonitorPolicy:
        policy_id: str
        max_unsafe_serves: int
        max_p95_ttft_ms: int
        max_window_cost_usd: float

    def evaluate_alerts(
        events: list[TraceEvent],
        quality: QualityWindow,
        cost_usd: float,
        policy: MonitorPolicy,
    ) -> list[str]:
        findings: list[str] = []
        # Derive p95 TTFT from the events passed in, not from a notebook global.
        p95_ttft_ms = nearest_rank_percentile([event.ttft_ms for event in events], 0.95)
        if quality.unsafe_serves > policy.max_unsafe_serves:
            findings.append(
                f"PAGE unsafe_claim_served={quality.unsafe_serves}/{quality.requests}"
            )
        if p95_ttft_ms > policy.max_p95_ttft_ms:
            findings.append(f"WARN p95_ttft_ms={p95_ttft_ms}")
        if cost_usd > policy.max_window_cost_usd:
            findings.append(f"WARN window_cost_usd={cost_usd:.6f}")
        return findings

    policy = MonitorPolicy(
        policy_id="delivery-grounding-slo-v1",
        max_unsafe_serves=0,
        max_p95_ttft_ms=500,
        max_window_cost_usd=0.002,
    )
    findings = evaluate_alerts(current_window, quality, total_cost, policy)

    for finding in findings:
        print(finding)
    print(f"latency_warning={any('ttft' in item for item in findings)}")
    print(f"cost_warning={any('cost' in item for item in findings)}")

    assert findings == ["PAGE unsafe_claim_served=2/6"]

Output

    PAGE unsafe_claim_served=2/6
    latency_warning=False
    cost_warning=False

The page is actionable because it says which invariant broke. It doesn't page merely because a judge score drifted or because latency was a little noisy.

Six synthetic requests are enough to demonstrate alert logic, not to set a production latency or cost budget. In a real launch, define those warning thresholds from representative traffic and revisit them as workload shape changes. The zero-tolerance safety invariant is different: a known unsupported delivery promise must not be served.
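One possible way to derive a latency warning threshold from representative traffic is to take a baseline percentile and add headroom. This is only a sketch under stated assumptions: the baseline values are synthetic, the 1.25 headroom multiplier is an arbitrary illustration rather than a recommendation, and `propose_ttft_warning_ms` is a hypothetical helper.

```python
import math

def nearest_rank_percentile(values: list[int], percentile: float) -> int:
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1]

def propose_ttft_warning_ms(baseline_ttft_ms: list[int], headroom: float = 1.25) -> int:
    """Baseline p95 TTFT scaled by a headroom factor (the factor is an assumption)."""
    p95 = nearest_rank_percentile(baseline_ttft_ms, 0.95)
    return math.ceil(p95 * headroom)

# Synthetic baseline sample standing in for representative traffic.
baseline = [180, 190, 200, 210, 220, 230, 250, 270, 290, 320]
proposed = propose_ttft_warning_ms(baseline)
print(f"proposed_warning_ms={proposed}")
```

A real baseline would need far more than ten requests and should be recomputed when prompt length, model, or traffic mix changes.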

Attach an exemplar and a first investigation step

An on-call engineer shouldn't begin by searching arbitrary logs. For an invariant page, attach one violating trace with its prompt template and source version, then state the immediate check.

runbook-card.py

    @dataclass(frozen=True)
    class IncidentCard:
        policy_id: str
        summary: str
        exemplar_request_id: str
        template_id: str
        evidence_version: str | None
        first_failed_stage: str
        first_check: str

    def incident_card(events: list[TraceEvent], policy: MonitorPolicy) -> IncidentCard:
        exemplar = next(event for event in events if event.unsafe_claim_served)
        return IncidentCard(
            policy_id=policy.policy_id,
            summary="Unsupported delivery claim reached customer response.",
            exemplar_request_id=exemplar.request_id,
            template_id=exemplar.prompt_template_id,
            evidence_version=exemplar.evidence_version,
            first_failed_stage=exemplar.first_failed_stage,
            first_check="Compare serving-gate template with tracking-answerer-v1.",
        )

    card_view = incident_card(current_window, policy)
    print(f"exemplar={card_view.exemplar_request_id}")
    print(f"template={card_view.template_id}")
    print(f"stage={card_view.first_failed_stage}")
    print(f"first_check={card_view.first_check}")

    assert card_view.first_failed_stage == "serving_gate"

Output

    exemplar=req_205
    template=tracking-answerer-v1.1-regression
    stage=serving_gate
    first_check=Compare serving-gate template with tracking-answerer-v1.

The request still needs deeper review, but investigation is now narrow: compare the template or gate configuration that served the claim with the strict version that abstained on the same failure case.
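Finding the strict-version counterpart can itself be automated. The sketch below assumes a trimmed-down event type with only the fields needed for the comparison; `Event` and `safe_counterparts` are hypothetical names, and the two rows copy req_202 and req_205 from the synthetic window.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    request_id: str
    case_id: str
    prompt_template_id: str
    route: str
    unsafe_claim_served: bool

def safe_counterparts(events: list[Event], exemplar: Event) -> list[Event]:
    """Events on the same case, under a different template, that stayed safe."""
    return [
        event for event in events
        if event.case_id == exemplar.case_id
        and event.prompt_template_id != exemplar.prompt_template_id
        and not event.unsafe_claim_served
    ]

events = [
    Event("req_202", "invented_eta", "tracking-answerer-v1", "abstain", False),
    Event("req_205", "invented_eta", "tracking-answerer-v1.1-regression", "serve", True),
]
exemplar = events[1]
matches = safe_counterparts(events, exemplar)
print([(m.request_id, m.prompt_template_id, m.route) for m in matches])
```

Pairing the violating trace with a safe one on the same case gives the on-call engineer a concrete diff target: two templates, one input class, opposite outcomes.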

Extend the contract when answers call tools

The central example is a factual answer gate. If the product later allows an agent to create return labels or message a carrier, the same discipline applies to tool actions. Log state transitions you control: tool name, redacted parameter hash, result status, progress marker, retry count, and stop reason. Don't depend on hidden reasoning text.

Figure: A delivery agent traces tool actions and progress markers, then blocks a repeated identical lookup that produces no new information.
Agent observability extends the same rule: trace decisions and outcomes that affect the customer, and stop repeated actions that don't advance the workflow.

tool-loop-monitor.py

    @dataclass(frozen=True)
    class ToolStep:
        tool_name: str
        params_hash: str
        progress_marker: str

    def stalled_repeat(steps: list[ToolStep]) -> bool:
        if len(steps) < 2:
            return False
        last, previous = steps[-1], steps[-2]
        return (
            last.tool_name == previous.tool_name
            and last.params_hash == previous.params_hash
            and last.progress_marker == previous.progress_marker
        )

    steps = [
        ToolStep("lookup_tracking", "order-redacted:v1", "scan:v17"),
        ToolStep("lookup_tracking", "order-redacted:v1", "scan:v17"),
    ]

    print(f"tool={steps[-1].tool_name}")
    print(f"stalled_repeat={stalled_repeat(steps)}")
    print("route=stop_and_review" if stalled_repeat(steps) else "route=continue")

    assert stalled_repeat(steps) is True

Output

    tool=lookup_tracking
    stalled_repeat=True
    route=stop_and_review
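The lab uses a readable placeholder for `params_hash`. One way a real parameter hash might be produced, so identical calls compare equal without storing the parameters, is a keyed hash over a canonical JSON encoding. This is a sketch under assumptions: the helper name, the secret, and the truncation to 16 hex characters are all illustrative choices.

```python
import hashlib
import hmac
import json

def params_hash(params: dict, secret: bytes) -> str:
    """Keyed hash of tool parameters; equal parameters give equal hashes."""
    # Canonical encoding: sorted keys so dict ordering doesn't change the hash.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for log readability (an illustrative choice)

secret = b"fixture-secret"  # hypothetical; a real key would come from a secret store
first = params_hash({"order_id": "A10234", "carrier": "fastship"}, secret)
repeat = params_hash({"carrier": "fastship", "order_id": "A10234"}, secret)
other = params_hash({"order_id": "B20001", "carrier": "fastship"}, secret)

print(first == repeat, first == other)
```

Using a keyed hash rather than a bare digest makes it harder to recover short identifiers such as order numbers by hashing guesses.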

Preserve the evidence for the next fix

An alert tells you something is wrong now. The next engineering question is whether a proposed fix is better. That requires an experiment record linking the failing window, proposed template, policy, and rerun results.

experiment-handoff.py

    @dataclass(frozen=True)
    class ExperimentHandoff:
        incident_policy_id: str
        failing_template_id: str
        exemplar_request_id: str
        evidence_version: str | None
        candidate_template_id: str
        required_metric: str
        promotion_status: str

    handoff = ExperimentHandoff(
        incident_policy_id=card_view.policy_id,
        failing_template_id=card_view.template_id,
        exemplar_request_id=card_view.exemplar_request_id,
        evidence_version=card_view.evidence_version,
        candidate_template_id="tracking-answerer-v1.2-fix",
        required_metric="unsafe_claim_served == 0 on regression and holdout windows",
        promotion_status="BLOCKED_PENDING_EVALUATION",
    )

    print(f"candidate={handoff.candidate_template_id}")
    print(f"required_metric={handoff.required_metric}")
    print(f"promotion={handoff.promotion_status}")

    assert handoff.promotion_status == "BLOCKED_PENDING_EVALUATION"

Output

    candidate=tracking-answerer-v1.2-fix
    required_metric=unsafe_claim_served == 0 on regression and holdout windows
    promotion=BLOCKED_PENDING_EVALUATION

The monitoring system doesn't approve the proposed fix. It creates a precise starting point for the next controlled run: what broke, which candidate claims to fix it, and which metric must pass.
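How the required metric might later be checked can be sketched as a small gate over rerun results. This is an assumption about a downstream step, not part of the lab: `promotion_status`, the window names, and the `REJECTED` and `ELIGIBLE_FOR_REVIEW` statuses are hypothetical; only `BLOCKED_PENDING_EVALUATION` comes from the handoff record.

```python
def promotion_status(
    unsafe_serves_by_window: dict[str, int],
    required_windows: tuple[str, ...] = ("regression", "holdout"),
) -> str:
    """Decide a candidate's status from unsafe-serve counts per evaluated window."""
    # A window that was never evaluated cannot count as a pass.
    if any(window not in unsafe_serves_by_window for window in required_windows):
        return "BLOCKED_PENDING_EVALUATION"
    if any(unsafe_serves_by_window[window] > 0 for window in required_windows):
        return "REJECTED"  # hypothetical status name
    return "ELIGIBLE_FOR_REVIEW"  # hypothetical status name

print(promotion_status({}))
print(promotion_status({"regression": 0}))
print(promotion_status({"regression": 0, "holdout": 1}))
print(promotion_status({"regression": 0, "holdout": 0}))
```

Treating a missing window as blocked rather than passed keeps an incomplete rerun from promoting a candidate by omission.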

Mastery check

Key concepts

  • Monitoring aggregates and request-level observability
  • Safety metrics derived from claim-gate decisions
  • First-failure-stage attribution
  • TTFT and correctly computed post-first-token throughput
  • Standard model-call attributes plus product-specific decision attributes
  • Versioned cost estimates
  • Durable redacted debugging records
  • Actionable pages with exemplar traces
  • Tool-loop evidence for agent paths
  • Experiment evidence handoff

Evaluation rubric

  • Starts from a factual-serving invariant instead of generic dashboard metrics
  • Treats justified abstention as safe behavior rather than a blanket failure
  • Detects unsafe served claims even when latency is healthy
  • Computes post-first-token throughput without counting the first token twice
  • Keeps sensitive prompt and answer text out of default durable attributes
  • Uses a versioned cost card rather than hard-coded current-provider assertions
  • Pages with an exemplar trace and a concrete first check
  • Records tool progress without relying on chain-of-thought logs
  • Blocks a candidate fix until a measured experiment validates it

Common pitfalls

Response success is mistaken for answer safety

  • Symptom: Dashboards stay green while customers receive invented ETAs.
  • Cause: Monitoring captures HTTP status and latency but not claim-gate outcomes.
  • Fix: Emit and aggregate the unsafe-served-claim invariant beside performance metrics.

Abstention pages the on-call engineer

  • Symptom: A strict gate produces noisy incidents whenever evidence is missing.
  • Cause: Safe withholding and unsafe serving were collapsed into one "bad answer" metric.
  • Fix: Separate unsafe-serve, abstention, and supported-serve rates.

Streaming speed is calculated incorrectly

  • Symptom: Generation throughput looks slightly better than it was.
  • Cause: The first token is counted in an interval that begins after first-token arrival.
  • Fix: Divide remaining output tokens by post-TTFT duration, or define a different measured interval explicitly. Reject malformed timing events instead of masking them with a fallback denominator.

Durable logs collect raw customer text

  • Symptom: Debugging records retain order identifiers and email addresses unnecessarily.
  • Cause: Raw prompts were treated as ordinary metadata.
  • Fix: Store redacted previews and stable identifiers by default; govern any raw-payload access separately.

The alert cannot start an investigation

  • Symptom: A page says "quality decreased" and on-call searches broad logs manually.
  • Cause: The alert has no exemplar trace, template version, evidence version, or first failed stage.
  • Fix: Attach a violating trace and the first runbook check to every invariant alert.

Next Step
Continue to Experiment Tracking with MLflow and W&B

You can now detect a live failure and capture the exact evidence behind it. Next you will record controlled candidate runs so a proposed fix earns promotion through reproducible results.

References

[1] OpenTelemetry Authors. Semantic conventions for generative AI systems. 2026.
[2] OpenTelemetry Authors. Semantic conventions for generative AI metrics. 2026.
[3] OpenTelemetry Authors. Semantic conventions for generative client AI spans. 2026.
[4] OpenTelemetry Authors. Handling sensitive data. 2026.