
Agent Failure & Recovery

Learn how to implement validation gates, retries, checkpointed recovery, state reconciliation, loop breakers, and graceful degradation when LLM agents hallucinate, stall, or drift from their tools.


Traditional software and LLM agents can both fail loudly or quietly. Agent failures are especially easy to miss when the process returns a plausible answer while selecting a nonexistent tool, looping on a task, or claiming a side effect occurred without verified evidence. Without runtime controls, one stuck run can waste budget or send a customer an unsupported answer.

Imagine you have built an order-lookup agent for an e-commerce site. A customer asks, "Where is my order A102?" The agent plans to call the query_db tool with the order ID, fetch the shipping status, and return a friendly summary. On a good day, this works perfectly. On a bad day, the agent might invent a tool called search_database, send the wrong parameter name, or loop forever rephrasing the same query while the customer waits. Papers like ReAct (Reason + Act)[1] and Toolformer[2] made tool-using LLMs practical, but they didn't make them reliable by default. Production agents need defensive architecture that assumes failure is normal, not exceptional.

This article walks through a concrete order-lookup agent, watches it fail in realistic ways, and then adds validation gates, retry policy, and bounded fallback paths. By the end, you'll know how to make failure explicit before generated claims or uncertain writes reach a customer.

Why agent failures require runtime checks

Agent systems inherit ordinary software faults and add generated decisions that may be syntactically valid but unsupported. Compare the recovery controls needed at execution time:

| Failure Mode | Traditional Software | LLM Agent | Recovery Strategy |
|---|---|---|---|
| Invalid input | Rejected with error message | Agent hallucinates a "valid" response | Validation Gates: Schema checks before tool execution |
| Infinite loop | Stack overflow or out-of-memory (OOM) | Agent keeps calling the same tool with slight variations | Semantic Loop Detection: Hash canonicalized actions and compare intent similarity |
| Service unavailable | Connection timeout | Agent fabricates the API response | Circuit Breaker: Fail fast and fall back to static logic |
| Logic error | Wrong result, sometimes unnoticed | Plausible but unsupported answer or tool choice | Verifier + source check: Validate against deterministic evidence where available |

This additional risk stems from generated actions and answers. An LLM acting inside an agent can vary its tool selection or interpretation after a small change in input or sampling. Agent architecture still prevents known errors, but it also has to detect degraded trajectories while they are running.

Because you can't exhaustively unit-test every possible conversational path or generated output, the center of gravity shifts toward runtime safeguards and evaluation harnesses. In a deterministic system, you can enumerate many edge cases before deployment. In an agentic system, you still need offline evals, but you also need guardrails that constrain the model during execution. When the model inevitably drifts off course, the system surrounding it must detect the drift and either steer it back or abort the operation safely before it causes compounding harm.

Why small per-step errors compound

Here is the single number that explains why agent reliability is hard. A task that needs n sequential steps succeeds only if every step succeeds. If each step succeeds with probability p, independently of the others, the whole task succeeds with probability:

$$P(\text{success}) \approx p^{n}$$

This product decays fast. Even a per-step success rate that sounds excellent collapses over a long trajectory:

| Per-step success p | 5 steps | 20 steps | 50 steps |
|---|---|---|---|
| 99% | 95% | 82% | 61% |
| 95% | 77% | 36% | 8% |
| 90% | 59% | 12% | 0.5% |

A 95%-reliable step is fine for a one-shot tool call and nearly useless for a 50-step research agent. This is why the same model can feel magical in a demo and fall apart in production: the demo was short. The model didn't get worse, the trajectory got longer.

This compounding is why long-horizon agents still break in production. In METR's 2025 task-horizon analysis, frontier models at the time improved their 50%-reliable task length quickly, with Claude 3.7 Sonnet reaching roughly an hour of human task time on that benchmark.[3] METR's interpretation is that those gains came largely from reliability and the ability to recover from mistakes, not only raw reasoning. Reliability is the bottleneck, and part of that bottleneck is engineering, not only model quality.

The benchmark numbers make the same point. On τ-bench, a tool-agent benchmark over retail and airline tasks, GPT-4o solves a meaningful share of tasks once but its pass^8 (succeeding on all eight independent attempts of the same task) falls below 25% in retail.[4] An agent that is right most of the time is still wrong often enough that, over a long run, failure is the expected outcome unless you engineer around it.
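To make the pass^k idea concrete, here is a minimal sketch of how such a metric could be estimated from repeated trials. It assumes the combinatorial estimator C(c, k) / C(n, k) for a task with c successes in n trials, averaged over tasks; the function names and the trial counts are illustrative and are not taken from the benchmark's code.

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Chance that k trials drawn without replacement from the observed trials all succeeded."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(successes, k) / comb(trials, k)

def mean_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in results) / len(results)

# Hypothetical results: three tasks, eight trials each.
results = [(8, 8), (6, 8), (4, 8)]
print("k=1:", round(mean_pass_hat_k(results, 1), 3))  # ordinary success rate
print("k=8:", round(mean_pass_hat_k(results, 8), 3))  # every attempt must succeed
```

With these made-up counts, the k=1 rate is 0.75 while the k=8 rate drops to about 0.333, because only the task that never failed still counts.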

The defenses in this article all attack the same equation. They either raise per-step reliability p (validation gates, action-contract checks, structured feedback) or shrink the number of unrecovered steps that count against you (retries, fallbacks, checkpoints, loop breakers, escalation). You cannot make p = 1, so you build a system that survives the steps where p < 1.
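The arithmetic behind that claim can be sketched in a few lines. This is an illustration under two simplifying assumptions that real agents only approximate: every failure is detected, and each retry attempt succeeds independently with the same probability p. The function names are hypothetical.

```python
def task_success(p: float, n: int) -> float:
    """Probability that n independent steps all succeed."""
    return p ** n

def step_success_with_retries(p: float, attempts: int) -> float:
    """Per-step success when a detected failure may be retried, up to `attempts` tries in total."""
    return 1 - (1 - p) ** attempts

p, n = 0.95, 20
print(f"no recovery:        {task_success(p, n):.2f}")
print(f"one retry per step: {task_success(step_success_with_retries(p, 2), n):.2f}")
```

Under those assumptions, a single detected-and-retried failure per step lifts a 20-step task from roughly 0.36 to roughly 0.95. The gain depends entirely on detection: a failure nobody notices is never retried.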

The order-lookup agent: our running example

Throughout this article, we'll follow a single ReAct-style agent tasked with looking up e-commerce orders. The agent receives a customer question, plans a tool call, executes it, and observes the result. Here's the loop in plain English:

  1. Customer asks: "Where is my order A102?"
  2. Agent reasons: "I need to query the database for order A102."
  3. Agent acts: Calls query_db with {"id": "A102"}.
  4. Agent observes: "Status: delayed, ETA Friday."
  5. Agent answers: "Your order A102 is delayed and should arrive Friday."

This loop assumes the agent picks the right tool, formats the arguments correctly, and stops after getting an answer. In practice, any of those steps can fail. The rest of this article introduces failures one at a time, shows the symptom, and then adds a defense.
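The five steps above can be sketched as an undefended loop. This is a toy under stated assumptions: `scripted_policy` is a hypothetical stand-in for the LLM, `ORDERS` is a stand-in for the database, and nothing here validates the tool name or arguments yet.

```python
from typing import Callable

ORDERS = {"A102": "delayed, ETA Friday"}  # stand-in for the real database

def query_db(args: dict[str, str]) -> str:
    return ORDERS[args["id"]]

TOOLS: dict[str, Callable[[dict[str, str]], str]] = {"query_db": query_db}

def scripted_policy(question: str, observations: list[str]) -> dict:
    """Hypothetical stand-in for the model: act once, then answer."""
    if not observations:
        return {"type": "act", "tool": "query_db", "args": {"id": "A102"}}
    return {"type": "answer", "text": f"Your order A102 is {observations[-1]}."}

def run_agent(question: str, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        decision = scripted_policy(question, observations)
        if decision["type"] == "answer":
            return decision["text"]
        tool = TOOLS[decision["tool"]]               # raises KeyError on an invented tool
        observations.append(tool(decision["args"]))  # raises KeyError on a wrong argument name
    return "Stopped: step limit reached."

print(run_agent("Where is my order A102?"))
```

The two commented lines are where the failures in the following sections enter: an unknown tool name and a misnamed argument both surface as bare exceptions that the model never sees in a usable form.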

Before we add defenses, you should already understand two ideas from earlier in the curriculum:

  • The agentic loop (Planning -> Action -> Observation) from the ReAct article.
  • Tool execution flow (how an LLM turns a goal into a specific API schema) from the function-calling article.

If those concepts are fuzzy, revisit them first. This article builds directly on top of them.

A taxonomy of failure

When building and operating autonomous agents in production, failures usually collapse into four categories. Each category has a typical symptom and a root cause you'll see again and again.

| Category | Typical Symptom | Root Cause Example |
|---|---|---|
| Planning | Infinite loops | Hallucinating a tool that doesn't exist |
| Action | Syntax errors or wrong parameters | Missing a required JSON field in an API call |
| Reflection | "False success" | Agent thinks it finished but the output is empty or wrong |
| Memory | Context poisoning | Early errors cascading into later steps |

These categories map cleanly onto the six concrete failure states addressed below. Planning failures show up as tool hallucinations and infinite loops. Action failures show up as wrong parameters and task-contract drift. Reflection failures show up as the agent declaring success when it actually failed. Memory failures show up as cascading errors in multi-agent systems and context-window overflow.

The illustration below grounds those categories in one order-lookup agent, so the taxonomy stays tied to symptoms you can recognize in production.

[Figure: Four agent failure categories mapped to concrete order-lookup symptoms and recovery strategies]
Read each row as a production symptom, then the validation or recovery mechanism that catches it before it reaches the customer.

Worked example: the wrong parameter

Let's start with the simplest failure and the simplest fix. Our order-lookup agent has a tool called query_db that expects {"id": "A102"}. Suppose the agent sends {"order_id": "A102"} instead. The API returns a 400 Bad Request with the message: Missing required parameter 'id'.

A naive agent sees the error and tries the exact same call again. Nothing has changed, so it gets the exact same 400. That's the "just try again" anti-pattern. Unless the system tells the agent why the call failed, the agent is likely to keep repeating the error.

The fix is to feed the error back into the agent's context as structured feedback, not just a raw exception string. The agent can then compare its last action with the error and the tool schema, correct the key name, and retry. That's the foundation of reflexion-style recovery: the agent reflects on its failure, diagnoses the cause, and refines its next attempt.[5]

Here's how the corrected loop looks in practice:

text
Step 1: query_db({"order_id": "A102"}) -> 400 Bad Request: Missing required parameter 'id'.
Step 2: Agent reflects: "I used 'order_id' but the schema requires 'id'."
Step 3: query_db({"id": "A102"}) -> 200 OK: Status delayed, ETA Friday.

The critical insight is that the retry must carry new information. The same prompt with the same context will likely produce the same wrong answer. You need to tell the model exactly what went wrong in a format it can act on.
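One way to produce that kind of actionable feedback is to compare the rejected arguments with the tool's required fields and return a small structured object. The sketch below is a minimal illustration: `build_tool_feedback` and its output keys are hypothetical names, and it assumes every schema field is required.

```python
def build_tool_feedback(
    tool: str,
    required: list[str],
    sent_args: dict[str, object],
) -> dict[str, object]:
    """Turn a rejected call into a compact, model-readable correction."""
    missing = [name for name in required if name not in sent_args]
    unexpected = [name for name in sent_args if name not in required]
    feedback: dict[str, object] = {
        "tool": tool,
        "status": "rejected",
        "missing_fields": missing,
        "unexpected_fields": unexpected,
    }
    # One missing plus one unexpected field usually means a misnamed key.
    if len(missing) == 1 and len(unexpected) == 1:
        feedback["hint"] = f"Rename '{unexpected[0]}' to '{missing[0]}'."
    return feedback

print(build_tool_feedback("query_db", ["id"], {"order_id": "A102"}))
```

The resulting object names the tool, the missing field, the unexpected field, and a one-line hint, which is far easier for a model to act on than a raw HTTP error body.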

Defense strategy: schema validation

The primary defense is strict JSON schema validation. Modern model APIs often support tool calling or structured outputs, but you must still validate arguments before execution and apply authorization separately. If the arguments are invalid, the system returns a structured error message to the repair loop rather than executing the call.

We can use Pydantic (in Python) or Zod (in TypeScript) to enforce these schemas. The following validation function models the actual query_db({"id": "A102"}) tool contract and returns concise correction feedback:

validate-order-tool-arguments.py
from pydantic import BaseModel, ConfigDict, Field, ValidationError

class ToolCallError(Exception):
    pass

class QueryOrderArgs(BaseModel):
    model_config = ConfigDict(extra="forbid")
    id: str = Field(..., pattern=r"^A[0-9]{3,}$")

def validate_tool_call(raw_args: dict[str, object]) -> QueryOrderArgs:
    try:
        return QueryOrderArgs(**raw_args)
    except ValidationError as exc:
        field = exc.errors()[0]["loc"][0]
        raise ToolCallError(f"query_db requires valid field: {field}") from exc

try:
    validate_tool_call({"order_id": "A102"})
except ToolCallError as exc:
    print("rejected:", exc)
print("accepted:", validate_tool_call({"id": "A102"}).id)
Output
rejected: query_db requires valid field: id
accepted: A102

Common mistake: Returning a raw Python traceback to the LLM. Standard error messages are often too noisy for LLMs. Agentic debugging involves isolating the minimal set of root-cause failures rather than dumping every surface-level exception. Feed the model a one-sentence description of what's wrong, not a 50-line stack trace.

Tool call hallucination

One of the most frequent failures is tool misuse. In practice, that usually means one of two concrete problems: the model invents a tool that doesn't exist, or it emits malformed arguments for a real tool. Because LLMs are predictive text engines, if they lack sufficient context to execute a task, they often hallucinate a plausible-sounding tool instead of asking for clarification.

For example, our order-lookup agent might output:

text
// Model generates this:
{"tool": "search_database", "args": {"query": "order A102", "format": "json"}}
// But the actual API expects:
{"tool": "query_db", "params": {"id": "A102"}}

The model invented search_database because the name sounds reasonable. Without a validation gate, this call would be executed against a non-existent endpoint, producing a confusing error that the agent might misinterpret.

Defense strategy

The fix is the same schema validation we saw above, plus a tool allowlist. Before executing any tool call, check that the requested tool name exists in the registered tool set. If it doesn't, return a clear error: Tool 'search_database' is not available. Available tools: query_db, get_eta, suggest_alternative.

This transforms an ambiguous failure into structured feedback the agent can use to self-correct.

Infinite loops (the "stuck agent")

The loop bucket contains two common states: exact repeats and semantic repeats. In both cases, the agent repeatedly calls the same tool or oscillates between two states. Here's what it looks like with our order-lookup agent:

text
Step 1: query_db({"id": "A102"}) -> "Delayed, ETA Friday"
Step 2: query_db({"id": "A102"}) -> "Delayed, ETA Friday"
Step 3: query_db({"id": "A102"}) -> "Delayed, ETA Friday"
... (continues forever)

Or with semantic rephrasing:

text
Step 1: query_db({"id": "A102"}) -> "Delayed, ETA Friday"
Step 2: search("status for order A102") -> "Delayed, ETA Friday"
Step 3: search("A102 shipment ETA") -> "Delayed, ETA Friday"
Step 4: search("where is order A102") -> "Delayed, ETA Friday"
... (continues forever)

These loops typically occur when the prompt and accumulated context fail to guide the model out of a dead end. If an API returns an unhelpful response, the model can behave as if it has forgotten that it just tried that exact approach, for example because the overarching goal dominates its next-step prediction more than the recent failure does. The result is "zombie" execution: the agent keeps trying to solve the problem by doing the same thing, slightly rephrased, until something outside the model stops it.

Defense strategy

Multi-layered loop detection is the key. The LoopBreaker class provides a practical implementation of this defense by maintaining a stateful history of actions. Its check method takes a proposed tool name and its arguments as inputs to evaluate against past behavior. It combines exact deduplication, repeated-tool heuristics, and approximate intent similarity, then returns True if the current call would push the agent over a loop threshold:

defense-strategy.py
import difflib
import hashlib
import json

class LoopBreaker:
    def __init__(
        self,
        max_steps: int = 15,
        max_identical: int = 3,
        same_tool_streak: int = 5,
        semantic_threshold: float = 0.9,
    ):
        self.max_steps = max_steps
        self.max_identical = max_identical
        self.same_tool_streak = same_tool_streak
        self.semantic_threshold = semantic_threshold
        self.tool_call_history: list[dict[str, str]] = []

    def _stable_call_hash(self, tool_name: str, args: dict[str, object]) -> str:
        canonical_args = json.dumps(
            args,
            sort_keys=True,
            default=str,
            separators=(",", ":"),
        )
        return hashlib.sha256(f"{tool_name}:{canonical_args}".encode("utf-8")).hexdigest()

    def _intent_text(self, tool_name: str, args: dict[str, object]) -> str:
        # Only free-text fields carry intent. Return "" when none is present so
        # structured lookups with different IDs are not matched on tool name alone.
        texts = [
            str(args[key]).strip().lower()
            for key in ("query", "prompt", "text", "task")
            if args.get(key)
        ]
        return " ".join([tool_name, *texts]) if texts else ""

    def check(self, tool_name: str, args: dict[str, object]) -> bool:
        """Returns True if a loop is detected."""
        call = {
            "tool": tool_name,
            "args_hash": self._stable_call_hash(tool_name, args),
            "intent_text": self._intent_text(tool_name, args),
        }

        # Check for loops BEFORE appending to history
        # Hard step limit
        if len(self.tool_call_history) >= self.max_steps:
            return True

        # Detect repeated identical calls
        identical_attempts = 1 + sum(
            1 for h in self.tool_call_history if h["args_hash"] == call["args_hash"]
        )
        if identical_attempts >= self.max_identical:
            return True

        # Detect same-tool repetition streaks
        same_tool_attempts = 1
        for h in reversed(self.tool_call_history):
            if h["tool"] != tool_name:
                break
            same_tool_attempts += 1
        if same_tool_attempts >= self.same_tool_streak:
            return True

        # Detect approximate intent repetition for rephrased queries/prompts
        if call["intent_text"]:
            recent_window = self.tool_call_history[-(self.same_tool_streak - 1):]
            similar_attempts = 1 + sum(
                1
                for h in recent_window
                if h["intent_text"]
                and difflib.SequenceMatcher(
                    None, h["intent_text"], call["intent_text"]
                ).ratio() >= self.semantic_threshold
            )
            if similar_attempts >= self.max_identical:
                return True

        # Only append after checks to avoid polluting history with the failing call
        self.tool_call_history.append(call)
        return False

breaker = LoopBreaker(max_identical=3)
print("first:", breaker.check("query_db", {"id": "A102"}))
print("second:", breaker.check("query_db", {"id": "A102"}))
print("third:", breaker.check("query_db", {"id": "A102"}))
Output
first: False
second: False
third: True

Notice that these thresholds include the current attempt. That detail matters because an off-by-one bug here burns a real extra tool call, token round-trip, and latency spike before the breaker trips.

In production, this approximate string check is usually replaced with embeddings over the tool name plus a compact state summary. The goal is the same: catch query_db({"id": "A102"}) versus search("where is order A102") before the agent burns another five steps.
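The shape of an embedding-based check can be sketched without committing to a particular model. In the example below, `toy_embed` is a deliberately crude bag-of-words stand-in for a real embedding function, and the 0.5 threshold is an arbitrary illustrative value; only the cosine comparison against recent history is the point.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_semantic_repeat(new_call: str, history: list[str], threshold: float = 0.5) -> bool:
    """Flag a call whose intent is close to any recent call in history."""
    new_vec = toy_embed(new_call)
    return any(cosine(new_vec, toy_embed(old)) >= threshold for old in history)

history = ["status for order A102"]
print(is_semantic_repeat("where is order A102", history))
print(is_semantic_repeat("refund policy on shoes", history))
```

Swapping `toy_embed` for a real embedding call changes the vectors but not the control flow, and the threshold then needs tuning against labeled loop and non-loop traces.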

Common mistake: Assuming that max_steps alone is enough to prevent loops. A step limit of 15 doesn't stop an agent from wasting 15 steps (and thousands of tokens) on a futile loop. Semantic loop detection stops the bleeding early, often after 2-3 repeated attempts.

Token budget exhaustion

This bucket has two related states: hard context-window overflow and slow budget bleed. Because most agent architectures append every tool call, reasoning step, and observation to a continuous conversational history, prompt size grows roughly linearly with each turn. Across the full run, total token spend can become roughly quadratic because each new step re-sends most of the prior history. If a task requires 20 steps, the final step includes the previous 19 steps in context, leading to massive context window bloat.
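The growth pattern is easy to check with arithmetic. The sketch below assumes a fixed base prompt and a constant number of tokens appended per step, which real runs only approximate; the token figures are made up for illustration.

```python
def prompt_tokens_per_step(base: int, per_step: int, steps: int) -> list[int]:
    """Prompt size at each step when every prior step stays in context."""
    return [base + per_step * k for k in range(steps)]

sizes = prompt_tokens_per_step(base=500, per_step=300, steps=20)
print("first prompt:", sizes[0])
print("final prompt:", sizes[-1])          # linear growth per turn
print("total input tokens:", sum(sizes))   # roughly quadratic across the run
```

With these numbers the final prompt is 6,200 tokens, but the run as a whole sends 67,000 input tokens, because each step pays again for the history before it.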

If this growth is left unchecked, it inevitably leads to context window overflows (commonly rejected with a provider error, or handled via truncation you didn't intend) or unexpectedly high API bills. A runaway agent stuck in a subtle loop can consume a large number of tokens before anyone notices.

Defense strategy

Effective budget management involves implementing two distinct layers: hard limits and soft management. Hard limits act as a circuit breaker, abruptly stopping the agent if it exceeds a predefined cost or token threshold. Soft management involves proactively monitoring the context window and taking action (like summarizing the history) before limits are breached.

The TokenBudget class below provides a programmatic safeguard to track token consumption against hard constraints. Its consume method takes the input and output token counts from each generation step, plus the model name used to look up pricing. It updates the running totals and, if the configured token or cost threshold is exceeded, immediately interrupts execution by raising a BudgetExhausted exception:

defense-strategy-2.py
class BudgetExhausted(Exception):
    pass

class TokenBudget:
    def __init__(
        self,
        max_tokens: int = 100_000,
        max_cost_usd: float = 1.0,
        pricing: dict[str, dict[str, float]] | None = None,
    ):
        self.max_tokens = max_tokens
        self.max_cost = max_cost_usd
        self.used_tokens = 0
        self.estimated_cost = 0.0
        self.pricing = pricing or {}

    def consume(self, input_tokens: int, output_tokens: int, model: str):
        """Updates token counts and raises an error if budget is exceeded."""
        self.used_tokens += input_tokens + output_tokens
        self.estimated_cost += self._calculate_cost(input_tokens, output_tokens, model)

        # Hard stop to prevent runaway billing
        if self.used_tokens > self.max_tokens:
            raise BudgetExhausted(f"Token limit exceeded: {self.used_tokens}/{self.max_tokens}")
        if self.estimated_cost > self.max_cost:
            raise BudgetExhausted(f"Cost limit exceeded: ${self.estimated_cost:.2f}/${self.max_cost:.2f}")

    def _calculate_cost(self, input_tokens: int, output_tokens: int, model: str) -> float:
        rates = self.pricing.get(model)
        if rates is None:
            raise BudgetExhausted(f"Pricing missing for model: {model}")

        input_cost = (input_tokens / 1_000_000) * rates["input"]
        output_cost = (output_tokens / 1_000_000) * rates["output"]
        return input_cost + output_cost

budget = TokenBudget(
    max_tokens=1_000,
    max_cost_usd=0.01,
    pricing={"order-model": {"input": 1.0, "output": 2.0}},
)
budget.consume(200, 50, "order-model")
print("used tokens:", budget.used_tokens)
try:
    budget.consume(900, 20, "order-model")
except BudgetExhausted as exc:
    print("blocked:", str(exc).split(":")[0])
Output
used tokens: 250
blocked: Token limit exceeded

Production tip: When the budget approaches a configured soft threshold, stop optional exploration and compact only validated facts and pending work. A generated summary is lossy and cannot replace the durable record of writes, approvals, or idempotency keys.
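The soft-threshold behavior in that tip could look like the sketch below. The 80% ratio, the function names, and the `validated` and `pending` flags on history events are all assumptions made for illustration; the durable record of writes and idempotency keys is assumed to live elsewhere and is not touched here.

```python
def budget_action(used_tokens: int, max_tokens: int, soft_ratio: float = 0.8) -> str:
    """Decide what the orchestrator should do at the current spend level."""
    if used_tokens >= max_tokens:
        return "stop"
    if used_tokens >= soft_ratio * max_tokens:
        return "compact"
    return "continue"

def compact_history(events: list[dict[str, object]]) -> list[dict[str, object]]:
    """Keep validated facts and pending work; drop exploratory chatter."""
    return [e for e in events if e.get("validated") or e.get("pending")]

events = [
    {"text": "order A102 is delayed, ETA Friday", "validated": True},
    {"text": "maybe try a different search phrasing"},
    {"text": "draft customer reply", "pending": True},
]
print(budget_action(850, 1_000))
print([e["text"] for e in compact_history(events)])
```

Filtering by explicit flags keeps the compaction deterministic, so the agent never has to trust a generated summary for facts that were already verified.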

Cascading failures in multi-agent systems

In complex architectures, such as Directed Acyclic Graphs (DAGs) or hierarchical agent swarms (where a supervisor agent delegates to worker agents), failures usually spread in two ways: a bad handoff poisons downstream reasoning, or the agent's internal state drifts away from the source of truth after a partial side effect. Both create a snowball effect where a minor mistake at the start of a pipeline compounds into a much larger failure by the end.

Consider an example failure chain in a fulfillment-support pipeline:

  1. Tracking Agent: "Carrier scan says delivered" (Hallucinates a delivery event that never happened).
  2. Policy Agent: "Delivery is confirmed, so deny refund review" (Treats the hallucinated scan as source-of-truth state).
  3. Reply Agent: "Your package was delivered yesterday" (Sends a confident customer-facing answer grounded in false state).

A second version is even more dangerous because it looks operationally healthy. Suppose an "issue refund" tool times out after the payment processor already committed the refund. If your orchestrator records that step as failed and blindly retries, you now have duplicate side effects plus an agent state that no longer matches the external system.

Defense strategy

The solution is strict, per-agent output validation gates plus explicit state reconciliation around side effects. You should never assume an agent's output is safe to pass directly to another agent without verification, and you should never assume a timed-out write means "nothing happened."

Before passing context to the next node in the graph, run a validator. Format and authorization checks should be deterministic; factual business state should be checked against its source of truth. A model-based critic can flag candidates for review, but it cannot prove that an order was delivered or a refund committed. If validation fails, the pipeline halts and either requests a corrected proposal or escalates. For side-effecting tool calls, add read-after-write reconciliation and idempotency keys before telling downstream agents the action succeeded.

The validated_handoff function implements this pattern to secure the boundaries between components. It accepts the upstream agent's raw output and a list of validation functions as inputs. Each validator returns a small ValidationResult, so the handoff code can return the intact output if all checks pass, or a structured error dictionary meant for retry or human escalation if any validation fails:

defense-strategy-3.py
from dataclasses import dataclass
from typing import Callable

@dataclass
class ValidationResult:
    valid: bool
    reason: str = ""

Validator = Callable[[dict[str, object]], ValidationResult]

def validated_handoff(
    output: dict[str, object],
    validators: list[Validator],
) -> dict[str, object]:
    """Validate agent output before passing to next agent."""
    for validator in validators:
        result = validator(output)
        if not result.valid:
            # Option 1: Fail fast
            # raise AgentError(result.reason)

            # Option 2: Return structured error for retry
            return {
                "status": "validation_failed",
                "error": result.reason,
                "fallback": "Request human review of this output"
            }
    return output

External service failures

This bucket has two main states: transient faults (429, timeouts, brief 503s) and persistent dependency outages where the upstream still isn't healthy after the retry window. APIs go down, rate limits are hit, and databases occasionally time out. While traditional software also faces these issues, LLM agents are particularly vulnerable because they make a high volume of sequential API calls. A single user request might trigger multiple model generations, drastically increasing the surface area for a network failure.

Many LLM APIs enforce separate request-rate and token-rate quotas (often expressed as RPM and TPM). Because an agent's context window grows with every step, late-stage agent turns consume significantly more tokens than early ones. This means an agent can suddenly hit a token-rate limit in the middle of a complex reasoning chain, even if the raw request count is still within bounds.

Defense strategy

To handle these transient network and rate-limit errors, use exponential backoff with jitter. A simple immediate retry will likely hit the same rate limit again, whereas exponential backoff gives the API time to replenish its token buckets. Jitter (adding randomness to the retry delay) prevents the "thundering herd" problem where multiple stalled agents retry simultaneously. Only retry failures that are plausibly transient, such as 429, 502/503, or timeouts. Don't put deterministic failures like 400 schema errors or 401 auth problems on the same retry path.

Here's an example using the tenacity library in Python to wrap model calls with reliable retry logic. The call_llm_with_retry async function takes a list of conversational messages and a model string as inputs. It normalizes provider-specific transient failures into retryable exception classes, then retries with randomized exponential backoff. The call_model_api function stands in for your actual provider client:

defense-strategy-4.py
import logging
import tenacity

logger = logging.getLogger(__name__)

class RetryableModelError(Exception):
    pass

class RateLimitExceeded(RetryableModelError):
    pass

class UpstreamTimeout(RetryableModelError):
    pass

class UpstreamUnavailable(RetryableModelError):
    pass

def normalize_provider_error(exc: Exception) -> Exception:
    message = str(exc).lower()
    if "rate limit" in message:
        return RateLimitExceeded(str(exc))
    if "timeout" in message:
        return UpstreamTimeout(str(exc))
    if "temporarily unavailable" in message or "503" in message:
        return UpstreamUnavailable(str(exc))
    return exc

@tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_random_exponential(multiplier=1, min=1, max=30),
    retry=tenacity.retry_if_exception_type(RetryableModelError),
    before_sleep=lambda retry_state: logger.warning(
        f"Retry {retry_state.attempt_number}: {retry_state.outcome.exception()}"
    ),
)
async def call_llm_with_retry(messages: list[dict[str, str]], model: str = "primary-model"):
    """Wrap a provider call with exponential backoff plus jitter."""
    try:
        return await call_model_api(
            messages=messages,
            model=model,
            timeout_seconds=30,
        )
    except Exception as exc:
        raise normalize_provider_error(exc) from exc

Blind retries only help with transient infrastructure faults. If the model keeps making the same schema mistake, the next attempt needs better information, not just more time. Reflexion-style recovery uses feedback from the previous trial to improve the next one.[5] In production, that feedback can be much simpler than a full reflection loop: a concise validation error or tool rejection reason is often enough to change the next attempt.

Retry, resume, or roll back?

External faults are where retry logic usually starts, but the same decision tree applies everywhere. Retries are only the right tool when the failure is transient and nothing important has committed yet. A 429, 503, or dropped connection usually fits that pattern. A poisoned context or partially completed side effect doesn't. If the agent already believes "invoice created" even though the tool call actually timed out, replaying the next step can duplicate charges or reason from false state.

Durable runtimes handle this with checkpoints rather than blind replay. LangGraph persists graph state as checkpoints keyed by thread_id and resumes from a saved super-step boundary.[6] Temporal persists workflow execution state in event history and replays deterministic workflow code after failures.[7] Both patterns let you resume from the last confirmed-good boundary instead of restarting a long run from scratch.

In practice, the rule is simple:

  • Retry when the failure is transient and the step is side-effect free.
  • Resume from a checkpoint when prior state is still valid but expensive to recompute.
  • Reconcile or roll back when the failure happened around a non-idempotent side effect such as sending an email, writing a ticket, or charging a card.

This classifier makes the distinction executable. Notice that an uncertain write never enters the ordinary retry path, even when its transport error looks transient.

choose-recovery-path.py
```python
def recovery_path(error: str, side_effect_status: str, checkpoint_valid: bool) -> str:
    if side_effect_status == "uncertain":
        return "reconcile before replay"
    if checkpoint_valid and error == "worker_restarted":
        return "resume checkpoint"
    if side_effect_status == "none" and error in {"429", "503", "timeout"}:
        return "retry with backoff"
    return "stop or escalate"


print("lookup timeout:", recovery_path("timeout", "none", False))
print("refund timeout:", recovery_path("timeout", "uncertain", True))
```
Output
```text
lookup timeout: retry with backoff
refund timeout: reconcile before replay
```

Production tip: Pair checkpoints with idempotency keys for outbound side effects. Resume must be safe even if the previous attempt failed after the external system committed but before your agent recorded success.
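One way to picture this tip: derive the idempotency key from the logical operation rather than from the attempt, and have the side-effecting tool return the recorded result when it sees a key it has already committed. The sketch below uses an in-memory dictionary as a stand-in for the external system's idempotency store. `RefundService`, `idempotency_key`, and the field names are hypothetical.

```python
import hashlib
import json


def idempotency_key(run_id: str, step: str, payload: dict) -> str:
    """Derive a stable key from the logical operation, not the attempt number."""
    canonical = json.dumps({"run": run_id, "step": step, "payload": payload}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RefundService:
    """Hypothetical external system that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self._committed = {}
        self.refunds_issued = 0

    def refund(self, key: str, order_id: str, amount_cents: int) -> dict:
        if key in self._committed:
            # Replay of a committed request: return the original result, do nothing new.
            return self._committed[key]
        self.refunds_issued += 1
        result = {
            "refund_id": f"R-{self.refunds_issued}",
            "order_id": order_id,
            "amount_cents": amount_cents,
        }
        self._committed[key] = result
        return result


service = RefundService()
payload = {"order_id": "A102", "amount_cents": 1999}
key = idempotency_key("run-42", "issue_refund", payload)

first = service.refund(key, **payload)   # Commits, but imagine the response is lost.
replay = service.refund(key, **payload)  # Resume after the timeout reuses the same key.
print(first == replay, service.refunds_issued)  # prints: True 1
```

The key must be computed before the first attempt and stored with the checkpoint, so a resumed run sends the same key rather than minting a new one.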

Recovery path after agent failure showing retry for transient side-effect-free faults, resume for valid checkpoints, and reconcile for uncertain writes.
Choose recovery from failure shape: retry transient faults, resume valid checkpoints, and reconcile uncertain writes before any replay.

Action-contract disconnect

A particularly insidious failure occurs when the agent's proposed operation diverges from its accepted task contract. The two common states here are wrong tool choice despite a valid request and wrong argument serialization despite choosing the right tool. A model may describe a reasonable plan yet call the wrong tool or use unsupported parameters when it emits the executable payload.

Consider our order-lookup agent. Its structured intent says lookup_order for A102. However, the emitted tool call might target a calculate_shipping tool or provide a different order ID, producing irrelevant results while the response text still sounds plausible.

This disconnect happens because generated explanations do not constrain generated tool payloads. An application should validate the payload against its task contract instead of asking for hidden reasoning or trusting a confident explanation.

Defense strategy

The primary defense is execution verification: compare the proposed tool call to a structured task contract before execution. A critic model can help triage ambiguous output, but deterministic tool, identifier, permission, and schema checks should block clear violations.

validate-action-contract.py
```python
class ActionMismatch(Exception):
    pass


def validate_action_contract(
    task_contract: dict[str, object],
    tool_name: str,
    tool_args: dict[str, object],
    allowed_tools: set[str],
) -> str:
    if tool_name not in allowed_tools:
        raise ActionMismatch("tool not allowed")
    if task_contract["intent"] == "lookup_order" and tool_name != "query_db":
        raise ActionMismatch("tool does not match intent")
    if tool_args.get("id") != task_contract["order_id"]:
        raise ActionMismatch("order ID changed")
    return "admitted"


contract = {"intent": "lookup_order", "order_id": "A102"}
print("lookup:", validate_action_contract(contract, "query_db", {"id": "A102"}, {"query_db"}))
try:
    validate_action_contract(contract, "calculate_shipping", {}, {"query_db"})
except ActionMismatch as exc:
    print("drift:", exc)
```
Output
```text
lookup: admitted
drift: tool not allowed
```

Self-correction with the Reflexion pattern

So far we've treated each failure as a separate problem with a separate fix. For failures that permit another attempt, structured feedback can help the model propose a better one. The Reflexion pattern, introduced by Shinn et al. (2023), gives an agent a way to reflect on feedback and refine its strategy before trying again.[5]

The loop is simple: Generate -> Critique -> Refine.

Think of it like a writer and an editor working on the same draft. The writer (the generator) produces the first attempt. The editor (the critique) reads it, marks what's wrong, and suggests improvements. The writer then produces a second draft that addresses the feedback. In an agent, the same LLM can play both roles by switching prompts: first a "doer" prompt, then a "critic" prompt, then a "fixer" prompt.

Here's how this looks for our order-lookup agent after a 400 Bad Request:

text
```text
[Generate] Agent: query_db({"order_id": "A102"})
[Observe]  System: 400 Bad Request: Missing required parameter 'id'.
[Critique] Critic prompt: "The agent used 'order_id' but the schema requires 'id'.
           The agent should check the tool schema before retrying."
[Refine]   Agent: query_db({"id": "A102"})
[Observe]  System: 200 OK: Status delayed, ETA Friday.
```

The critical difference from simple retry is that the critique step produces new reasoning, not just a repeated attempt. The agent explicitly names what went wrong and how to fix it, which changes the probability distribution of the next generation.

Key insight: A model critique is another proposal, not proof. Use it to suggest a repair after an allowed failure, then rerun deterministic validators and source-of-truth checks before execution.

In production, you don't always need a full three-step Reflexion loop. For many failures, a single structured error message is enough. Use critique/refine for repeated or ambiguous proposals that remain safe to retry; high-stakes or potentially committed effects should stop for approval or reconciliation instead.
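A minimal sketch of that lighter-weight path, assuming a hypothetical `propose` callable that stands in for the model and a deterministic `validate_args` check against an illustrative tool schema. The validator's error is the only new information fed back, the loop is bounded, and nothing is returned for execution until the validator passes.

```python
from typing import Callable, Optional

REQUIRED_ARGS = {"id"}  # Illustrative schema for the query_db tool.


def validate_args(args: dict) -> Optional[str]:
    """Return None when args match the schema, else a concise error message."""
    missing = REQUIRED_ARGS - set(args)
    unexpected = set(args) - REQUIRED_ARGS
    if missing or unexpected:
        return f"missing={sorted(missing)} unexpected={sorted(unexpected)}"
    return None


def repair_loop(propose: Callable[[Optional[str]], dict], max_attempts: int = 3) -> dict:
    """Ask for a proposal, feed validator errors back, and stop after max_attempts."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        args = propose(feedback)
        feedback = validate_args(args)  # The deterministic check reruns on every attempt.
        if feedback is None:
            return args
    raise RuntimeError(f"no valid proposal after {max_attempts} attempts: {feedback}")


def fake_model(feedback: Optional[str]) -> dict:
    """Stand-in for an LLM call: repeats its mistake unless it receives feedback."""
    if feedback is None:
        return {"order_id": "A102"}
    return {"id": "A102"}


print(repair_loop(fake_model))  # prints: {'id': 'A102'}
```

The loop raises instead of returning a best guess, so the caller can route an exhausted repair to fallback or escalation.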

The circuit breaker pattern

Borrowed from microservices architecture, the circuit breaker prevents a failing component from taking down the entire system. For agents, that usually means failing fast once a tool or model endpoint crosses a known error threshold instead of letting every request discover the outage independently.

Circuit breaker state machine showing closed, open, and half-open states with transition conditions
Trace the three states: Closed allows traffic, Open blocks it, and Half-Open lets one trial request through before the circuit closes again.

The circuit breaker pattern operates as a state machine with three distinct states to manage failure routing:

  1. Closed (Normal): Requests flow through to the agent/LLM. Failures are counted.
  2. Open (Failing): Requests fail fast without reaching the dependency. This avoids spending tokens and latency on a service already known to be down or returning bad output.
  3. Half-Open (Recovery): After a timeout, allow one trial request. If it succeeds, reset to Closed. If it fails, return to Open.

The "Half-Open" state is critical for agent recovery. If we simply switched from Open back to Closed after a timeout, a still-failing API would immediately receive a flood of pending agent requests (the "thundering herd" problem), potentially triggering further rate limits or blowing through your budget before the circuit trips again. By allowing only a single test request through, we ensure the service is genuinely healthy before restoring full traffic.

Here's a complete implementation of this state machine. The CircuitBreaker class acts as a protective wrapper, where its call method takes an arbitrary async function and its arguments as inputs. This version explicitly gates the half-open state so only one probe request can run at a time:

the-circuit-breaker-pattern.py
```python
import asyncio
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    """Prevents cascading failures by halting requests to failing services."""

    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # All calls fail fast
    HALF_OPEN = "half_open"  # Test with one call

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.state = self.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure_time = 0.0
        self._lock = asyncio.Lock()
        self._half_open_probe_in_flight = False

    async def call(self, func, *args, **kwargs):
        """Executes the function if the circuit is closed or half-open."""
        async with self._lock:
            now = time.monotonic()  # Monotonic clock: unaffected by wall-clock changes.

            if self.state == self.OPEN:
                if now - self.last_failure_time > self.reset_timeout:
                    self.state = self.HALF_OPEN
                    self._half_open_probe_in_flight = False
                else:
                    raise CircuitOpenError("Circuit is open - failing fast")

            if self.state == self.HALF_OPEN and self._half_open_probe_in_flight:
                raise CircuitOpenError("Half-open probe already in flight")

            # Remember whether this call is the single half-open probe.
            is_probe = self.state == self.HALF_OPEN
            if is_probe:
                self._half_open_probe_in_flight = True

        try:
            result = await func(*args, **kwargs)
        except Exception:
            async with self._lock:
                self._on_failure()
                if is_probe:
                    self._half_open_probe_in_flight = False
            raise

        async with self._lock:
            # A call admitted before the breaker tripped must not close it
            # on a late success; only the probe may do that.
            if is_probe or self.state == self.CLOSED:
                self._on_success()
            if is_probe:
                self._half_open_probe_in_flight = False
        return result

    def _on_success(self):
        self.failure_count = 0
        self.state = self.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.state == self.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = self.OPEN
```

If you run multiple app servers, keep this state in a shared store like Redis rather than in process memory. A per-process breaker protects only the worker that observed the failures.

Fallback chains: graceful degradation

When the primary approach fails, fall back to cheaper but more reliable alternatives. The chain below shows the order: try the most capable path first, validate its result, then degrade only when that path fails.

Fallback chain from primary agent to backup agent, direct model call, and static response
Each step trades capability for reliability; the first valid response exits the chain.

This chain systematically trades capability for narrower behavior. If the most advanced agent with the fullest tool set fails or hits a circuit breaker, you may downgrade to a simpler agent with fewer tools, then to a direct model call with no tools, and finally to a static status message. A static fallback is deterministic, but it cannot answer a live order or refund question. It should state that live data is unavailable and route the user to retry or support rather than inventing status.

Here's a Python implementation of this pattern. The FallbackChain class orchestrates this graceful degradation through its execute method, which takes the user's task string as input. It systematically attempts to process the task using a list of progressively simpler internal strategies, ultimately returning the first valid string response it generates, or a hardcoded default if all fail:

fallback-chains-graceful-degradation.py
```python
import logging

logger = logging.getLogger(__name__)


class FallbackChain:
    """Try multiple approaches in order of capability/cost.

    Note: run_agent and call_llm are illustrative functions representing
    your underlying execution engine.
    """

    # Pre-defined responses for last-resort fallback
    FALLBACK_RESPONSES = {
        "greeting": "Hello! I'm currently operating in degraded mode.",
        "search": "Search is currently unavailable, please try again later.",
    }
    DEFAULT_RESPONSE = "Live order status is unavailable right now. Please retry or contact support."

    async def execute(self, task: str, requires_live_data: bool = False) -> str:
        strategies = [
            ("primary_agent", self._full_agent_execution),
            ("backup_agent", self._simplified_agent),
            ("direct_generation", self._direct_llm_call),
            ("static_fallback", self._static_fallback),
        ]

        for strategy_name, strategy in strategies:
            try:
                result = await strategy(task)
                if self._validate_result(strategy_name, result, requires_live_data):
                    return result
                logger.warning(f"Strategy {strategy_name} returned invalid result")
            except Exception as e:
                logger.error(f"Strategy {strategy_name} failed: {e}")
                continue

        return self.DEFAULT_RESPONSE

    def _validate_result(
        self,
        strategy_name: str,
        result: str,
        requires_live_data: bool,
    ) -> bool:
        if not result or not result.strip():
            return False
        if requires_live_data and strategy_name in {"direct_generation", "static_fallback"}:
            return False
        return True

    async def _full_agent_execution(self, task: str) -> str:
        """Full agent with tools: most capable, most expensive."""
        # In an implementation, return only output backed by validated tool evidence.
        return "Full agent output"

    async def _simplified_agent(self, task: str) -> str:
        """Simplified agent: fewer tools, lower step limit."""
        # In an implementation, retain the same evidence checks as primary path.
        return "Simplified agent output"

    async def _direct_llm_call(self, task: str) -> str:
        """Direct LLM call without tools: cheapest, fastest."""
        # return await call_llm(task)
        return "Direct LLM output"

    async def _static_fallback(self, task: str) -> str:
        """Static response: deterministic last-resort path."""
        # classify_task is an illustrative function
        # task_type = classify_task(task)
        task_type = "unknown"
        return self.FALLBACK_RESPONSES.get(task_type, self.DEFAULT_RESPONSE)
```

The pinned constraints pattern

Long-context studies show a clear positional bias: models often do best when relevant information appears near the beginning or end of the context window, and noticeably worse when that information is buried in the middle.[8] In long-running agent traces, critical constraints can drift into that low-salience middle region as more tool output and scratchpad text accumulate.

The pinned constraints pattern addresses this by systematically re-injecting a very short set of essential rules at the end of every prompt. This takes advantage of recency effects to keep high-priority instructions salient regardless of conversation length. It's a recall aid, not an enforcement boundary. Authorization checks, tool allowlists, and output validation still need to live outside the model.

For example, if an agent must never disclose sensitive customer data, simply stating this constraint once at the start of a system prompt is insufficient. As the conversation grows, the model's attention shifts to recent turns and the original constraint fades. By appending a "pinned constraints" section to every prompt generation, you keep the critical rules fresher in the model's context.

the-pinned-constraints-pattern.py
```python
class PinnedConstraintsAgent:
    """Agent that re-injects critical constraints at the end of every prompt."""

    CRITICAL_CONSTRAINTS = """
CRITICAL CONSTRAINTS (must follow):
1. Never disclose PII (Personally Identifiable Information)
2. Always confirm actions that modify data
3. If uncertain, ask for clarification rather than guessing
4. Maximum 10 tool calls per user request
"""

    def __init__(self, base_system_prompt: str):
        self.base_system_prompt = base_system_prompt

    def build_prompt(self, conversation_history: list, current_task: str) -> str:
        """Build prompt with pinned constraints at the end."""
        prompt_parts = [
            self.base_system_prompt,
            "",
            "=== CONVERSATION HISTORY ===",
            self._format_history(conversation_history),
            "",
            "=== CURRENT TASK ===",
            current_task,
            "",
            self.CRITICAL_CONSTRAINTS,  # Pinned at the end for recency bias
        ]
        return "\n".join(prompt_parts)

    def _format_history(self, history: list) -> str:
        return "\n".join([f"{msg['role']}: {msg['content']}" for msg in history[-10:]])
```

Production tip: Keep pinned constraints concise (3-5 bullet points max). Too many constraints dilute the recency effect. Prioritize safety, compliance, and task-critical rules only.

Human-in-the-loop escalation

Not all failures can be resolved automatically. When an agent encounters high-stakes ambiguity, repeated failures, or potential safety issues, it should pause and route a bounded review packet to an authorized human reviewer.

Human-in-the-loop (HITL) architectures can intercept risky proposals when triggers are correctly enforced. They do not guarantee correctness: reviewers can be wrong, approvals can become stale, and execution can drift from the reviewed proposal. Use measurable escalation triggers and, for any later side effect, bind approval to the exact action and revalidate it at execution time.
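One way to bind approval to an exact action is to hash a canonical form of the proposal at review time and recompute that hash immediately before execution. The sketch below assumes approvals are stored as a hash plus an expiry. `action_hash`, `Approval`, and `may_execute` are illustrative names, and the expiry value is arbitrary.

```python
import hashlib
import json
from dataclasses import dataclass


def action_hash(tool_name: str, tool_args: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps({"tool": tool_name, "args": tool_args}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Approval:
    approved_hash: str
    expires_at: float  # Seconds on the caller's clock; illustrative field.


def may_execute(approval: Approval, tool_name: str, tool_args: dict, now: float) -> bool:
    """Revalidate at execution time: same action, and approval not stale."""
    if now >= approval.expires_at:
        return False
    return action_hash(tool_name, tool_args) == approval.approved_hash


reviewed = {"order_id": "A102", "amount_cents": 1999}
approval = Approval(action_hash("issue_refund", reviewed), expires_at=600.0)

print(may_execute(approval, "issue_refund", reviewed, now=30.0))                              # True
print(may_execute(approval, "issue_refund", {**reviewed, "amount_cents": 9999}, now=30.0))   # False
print(may_execute(approval, "issue_refund", reviewed, now=900.0))                             # False
```

The second call fails because the amount drifted after review, and the third because the approval went stale; both cases should return to the review queue rather than execute.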

When to escalate

Escalation should be triggered by specific, measurable conditions rather than vague "uncertainty." Effective triggers include:

  • Verifier thresholds: When an evaluated classifier or policy score crosses a documented review threshold
  • Repeated failure patterns: After 3 consecutive tool call failures or validation rejections
  • Safety-critical actions: Before executing any operation that changes user data, moves money, or accesses sensitive systems
  • Flagged uncertainty: When a verifier or policy gate flags output as requiring review

Here's an implementation of these triggers. The checks run in a fixed order, and the first matching condition produces the escalation request:

when-to-escalate.py
```python
from dataclasses import dataclass
from enum import Enum


class EscalationReason(Enum):
    POLICY_REVIEW = "policy_review"
    REPEATED_FAILURES = "repeated_failures"
    SAFETY_CHECK = "safety_check"
    VALIDATION_FAILED = "validation_failed"
    MAX_STEPS_EXCEEDED = "max_steps_exceeded"


@dataclass
class EscalationRequest:
    reason: EscalationReason
    context: dict[str, object]
    priority: int  # 1 (critical) to 5 (low)
    suggested_action: str


class HumanInTheLoop:
    """Manages escalation to human operators with context preservation."""

    def __init__(self,
                 review_threshold: float = 0.7,
                 max_retries: int = 3):
        self.review_threshold = review_threshold
        self.max_retries = max_retries
        self.escalation_queue: list[EscalationRequest] = []

    def should_escalate(self,
                        agent_state: dict[str, object],
                        failure_count: int = 0,
                        policy_score: float | None = None) -> EscalationRequest | None:
        """Determine if human intervention is needed."""

        # Check repeated failures
        if failure_count >= self.max_retries:
            return EscalationRequest(
                reason=EscalationReason.REPEATED_FAILURES,
                context=agent_state,
                priority=2,
                suggested_action="Review failed attempts and provide guidance"
            )

        # Check evaluated policy threshold
        if policy_score is not None and policy_score < self.review_threshold:
            return EscalationRequest(
                reason=EscalationReason.POLICY_REVIEW,
                context=agent_state,
                priority=3,
                suggested_action="Review policy-flagged output"
            )

        # Check safety-critical actions
        if agent_state.get("action_type") in ["delete", "modify", "transfer", "refund"]:
            return EscalationRequest(
                reason=EscalationReason.SAFETY_CHECK,
                context=agent_state,
                priority=1,
                suggested_action="Approve sensitive action before execution"
            )

        return None  # No escalation needed

    async def handle_escalation(self, request: EscalationRequest) -> dict[str, object]:
        """Submit to human review queue and await resolution."""
        self.escalation_queue.append(request)

        # In production, this would notify via Slack, PagerDuty, etc.
        # and wait for human response
        return {
            "status": "escalated",
            "ticket_id": f"ESC-{len(self.escalation_queue)}",
            "reason": request.reason.value,
            "priority": request.priority,
            "review_decision": await self._await_human_input(request)
        }

    async def _await_human_input(self, request: EscalationRequest) -> dict[str, str]:
        # Placeholder - production would integrate with ticketing system
        return {"decision": "pending", "action_hash": "not-approved"}
```

Key insight: An agent that says "I need help" is more valuable than one that silently hallucinates a "Success" message. Design your escalation UX to make it clear when and why the agent failed, preserving user trust.

Monitoring and alerting for agent systems

Production agent systems need specialized observability beyond standard APM (Application Performance Monitoring). While latency and error rates are important, they don't capture the unique failure modes of probabilistic agents. A "healthy" agent might return 200 OK status codes while being completely stuck in a logic loop or hallucinating facts. Benchmarks such as AgentBench highlight how even strong models exhibit high failure rates on realistic multi-step agent tasks when no defensive layers are present.[9][10]

Track agent-specific metrics to detect runs that are technically returning responses but failing to deliver valid results. Semantic entropy techniques measure uncertainty across sampled answers and can help prioritize review for open-ended generations; they do not verify a tool result or replace source-of-truth checks.[11]

Agent observability metrics

| Metric | Type | What it Measures | Critical Alert Threshold |
|---|---|---|---|
| Loop Rate | Counter | Percentage of traces where loop detection triggered | > 5% of requests |
| Recovery Rate | Derived ratio | Percentage of failed trajectories rescued by validation, retry, fallback, or escalation | Drops below baseline |
| Fallback Rate | Counter | Frequency of downgrading to simpler models | > 10% of requests |
| Step Count | Histogram | Number of tool calls per user turn | P99 (99th percentile) > 15 steps |
| Token Usage | Histogram | Total tokens (prompt + completion) per turn | > 80% of budget |
| Validation Rejection Rate | Derived ratio | Percentage of validated outputs rejected by factual or schema checks | > 2% of validated outputs |

Counters track cumulative events like loops, fallbacks, and validation failures. Histograms record distributions like step counts and token usage. Validation rejection rate and recovery rate are usually computed as derived ratios from validation and incident counters rather than stored directly as gauges. A useful definition is:

$$
\text{RecoveryRate} = \frac{\text{failed trajectories that ended safely}}{\text{failed trajectories that triggered an intervention}}
$$

For example, if 100 order-lookup traces hit a validation error, loop breaker, timeout, or fallback path, and 72 still return a safe answer or clean escalation, the recovery rate is 72%. This metric tells you whether your defenses are actually saving users, not merely detecting failures.
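As a sketch, the derived ratios can be computed from raw counts like this. The function name `ratio` and the choice to return `None` when the denominator is zero are assumptions. In a Prometheus setup the same division would more likely live in a query over the counters than in application code.

```python
from typing import Optional


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when there is nothing to measure."""
    if denominator == 0:
        return None
    return numerator / denominator


# Worked example from the text: 100 failed traces triggered an intervention, 72 ended safely.
recovery_rate = ratio(72, 100)
print(f"recovery rate: {recovery_rate:.0%}")  # prints: recovery rate: 72%
```

Returning `None` for an empty window keeps a quiet period from being reported as a 0% recovery rate.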

Agent observability panel grouping trajectory-health metrics like loop rate, recovery rate, and step P99 with protection-health metrics like fallback rate, token budget, and circuit state.
Good agent monitoring watches trajectory health and protection health together. HTTP status alone won't tell you whether runs are looping, degrading, or recovering safely.

The following example uses the Prometheus client to define these custom agent metrics. The AgentMetrics class sets up counters and histograms that map to the table above. Once instantiated, it exposes metric objects that take labels (like the agent's name or model) as inputs, allowing your observability stack to monitor loop rates, fallback frequency, and validation failures over time:

agent-observability-metrics.py
```python
from prometheus_client import Counter, Histogram


class AgentMetrics:
    """Tracks specialized observability metrics for LLM agents."""

    def __init__(self):
        # Per-run distributions support P95/P99 alerting
        self.step_count = Histogram(
            "agent_steps_per_run",
            "Tool calls per agent execution",
            ["agent"],
            buckets=(1, 3, 5, 8, 10, 15, 20, 30),
        )

        # Event counters
        self.loop_detected = Counter("agent_loops_total", "Loop detections", ["agent"])
        self.fallback_triggered = Counter("agent_fallbacks_total", "Fallback triggers", ["strategy"])
        self.interventions_triggered = Counter(
            "agent_interventions_total",
            "Failure-handling interventions triggered",
            ["agent", "intervention"],
        )
        self.safe_recoveries = Counter(
            "agent_safe_recoveries_total",
            "Interventions that ended in safe answer or clean escalation",
            ["agent", "intervention"],
        )
        self.validation_failures = Counter(
            "agent_validation_failures_total",
            "Outputs rejected by validation gates",
            ["agent", "reason"],
        )
        self.validated_outputs = Counter(
            "agent_validated_outputs_total",
            "Outputs checked by validation gates",
            ["agent"],
        )

        # Budget distributions. The client's default buckets are sized for
        # latencies in seconds, so token counts need explicit buckets.
        # These values are illustrative; size them to your own budget.
        self.token_usage = Histogram(
            "agent_tokens_per_run",
            "Tokens per execution",
            ["model"],
            buckets=(500, 1_000, 2_000, 5_000, 10_000, 20_000, 50_000, 100_000),
        )
```

Critical alerts

Treat these numbers as example starting points and calibrate them against your own baseline; they are not universal constants.

  1. Loop Detection Rate > 5%: A high loop rate indicates the agent's prompt or tools are ambiguous, causing it to oscillate. Treat it as a code/prompt issue, not an infrastructure issue.
  2. Fallback Rate > 10%: If the system frequently downgrades to simpler models or static responses, a primary dependency, model path, verifier, or task mix may have changed; inspect traces before assigning cause.
  3. P99 Token Usage > Budget: If the 99th percentile of requests hits the token limit, you need summarization or a tighter sliding context window before hard stops start landing on real users.
  4. Circuit Breaker OPEN: An external dependency (e.g., a search API or specific LLM endpoint) is down.

These agent-specific metrics should be piped directly into your alerting infrastructure (like PagerDuty or Datadog) alongside standard server metrics. Catching a spike in loop detections early can be the difference between a minor service degradation and a massive, unexpected API bill at the end of the month.

Common misconceptions

When engineers transition from deterministic software to building agentic pipelines, they often carry over assumptions that don't apply to probabilistic models. Recognizing and unlearning these patterns is critical for building resilient systems.

Common mistake: Believing you can "just retry on failure" like traditional APIs. Reality: Retrying the same hallucinating LLM call often repeats the same failure mode, or produces a different wrong answer. Effective recovery requires changing the strategy, either by modifying the prompt, switching models, or reducing the task scope.

Common mistake: Treating every timeout as safe to replay. Reality: If a side effect may already have committed, replay can duplicate work or create inconsistent state. Use idempotency keys, checkpoints, and explicit reconciliation against the source of truth.

Common mistake: Assuming that max step counts are enough to prevent loops. Reality: Step limits are a crude safety net. They don't prevent an agent from wasting 15 steps (and thousands of tokens) on a futile loop before hitting the limit. Semantic loop detection stops the bleeding early.

Common mistake: Wrapping all agent calls in generic try/except blocks. Reality: Catching all exceptions silently leads to "zombie agents" that continue operating with corrupted state. Failures should be explicit and handled by validation gates or circuit breakers, not swallowed by generic error handlers.
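A sketch of the alternative, assuming hypothetical exception classes `TransientToolError` and `ToolValidationError`: handle the failures you have a policy for, and let everything else propagate so the run stops instead of continuing on corrupted state.

```python
class TransientToolError(Exception):
    """Hypothetical: timeouts, 429s, 503s."""


class ToolValidationError(Exception):
    """Hypothetical: the tool rejected the arguments."""


def run_step(tool, args: dict) -> dict:
    """Return a structured outcome for known failures; let unknown ones propagate."""
    try:
        return {"status": "ok", "result": tool(args)}
    except TransientToolError as exc:
        return {"status": "retry", "reason": str(exc)}
    except ToolValidationError as exc:
        return {"status": "feedback", "reason": str(exc)}
    # Deliberately no bare `except Exception`: unexpected errors reach the caller.


def lookup(args: dict) -> str:
    """Illustrative tool that requires an 'id' argument."""
    if "id" not in args:
        raise ToolValidationError("missing required parameter 'id'")
    return "Status delayed, ETA Friday"


print(run_step(lookup, {"order_id": "A102"}))
print(run_step(lookup, {"id": "A102"}))
```

Each status maps to one of the recovery paths discussed earlier: `retry` goes through backoff, `feedback` goes back to the model as a structured error, and an uncaught exception halts the run.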

Practice: diagnose the failure

Here's a short exercise to test whether you can match symptoms to defenses. Read each scenario, decide which failure category it belongs to, and pick the right fix.

Scenario 1: Your order-lookup agent calls query_db({"id": "A102"}) three times in a row, gets the same answer each time, and keeps going.

Scenario 2: The agent's reasoning trace says "I will search the database" but the actual tool call is calculate_shipping({}).

Scenario 3: After a refund tool times out, the orchestrator retries and the customer receives two refund emails.

Scenario 4: The agent has used 120,000 tokens on a single request and your limit is 100,000.

Scenario 5: The agent invents a tool called fetch_order_magic that doesn't exist in your tool registry.

Solution sketches

Scenario 1: Infinite loop (Planning failure). The LoopBreaker would catch it on the third identical call via args_hash deduplication. You should also check why the agent isn't stopping after getting a complete answer. Is the stopping condition too vague?

Scenario 2: Action-contract disconnect (Action failure). The runtime should compare the emitted tool to the structured task contract and allowlist, blocking calculate_shipping when the accepted intent is order lookup. A critique model can add review signal, but it should not replace deterministic capability checks.

Scenario 3: Cascading failure (Memory/State failure). The refund was non-idempotent and the timeout didn't mean "nothing happened." You need idempotency keys on the refund tool and read-after-write reconciliation before retrying. Blind replay caused a duplicate side effect.

Scenario 4: Token budget exhaustion (Memory failure). The TokenBudget class would raise BudgetExhausted at the hard limit. Better yet, you should trigger summarization at the 80% soft threshold to prevent hitting the hard limit at all.

Scenario 5: Tool call hallucination (Planning failure). A tool allowlist would reject fetch_order_magic before execution, returning structured feedback: Tool 'fetch_order_magic' not available. Available tools: query_db, get_eta, suggest_alternative.

Key takeaways

To build agents that can survive production workloads, keep the following core principles in mind:

  1. Small per-step errors compound. A task of n dependent steps succeeds with probability roughly p^n, so a 95%-reliable step leaves only about an 8% chance of finishing a 50-step run (0.95^50 ≈ 0.077). Reliability, not raw reasoning, is usually the bottleneck.
  2. Generated failures can look successful. Build validation gates for claims and actions, not only exception handlers.
  3. Four failure categories (Planning, Action, Reflection, Memory) collapse into six concrete states you'll see again and again: tool misuse, loops, budget exhaustion, poisoned handoffs/state desync, external dependency faults, and action-contract drift.
  4. Loop detection needs multiple signals: exact-call hashing, approximate intent similarity, repeated-tool heuristics, step limits, and token budgets.
  5. Corrective feedback beats blind retry for deterministic mistakes. Reflexion-style critique can help ambiguous failures, but it does not authorize actions.
  6. Circuit breakers limit repeated calls to unhealthy dependencies; they do not establish correctness.
  7. Fallback chains degrade capability explicitly: live-data questions must stop or escalate when fallback paths lack evidence.
  8. Checkpointed recovery beats blind replay when state may already be poisoned or partially committed.
  9. Pinned constraints exploit recency bias to keep critical rules salient in long conversations.
  10. Human review must be bound to exact actions and revalidated at execution; escalation alone is not a guarantee.
  11. Monitoring agent-specific metrics (loop rate, fallback rate, token usage) is as important as latency and error rate.
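
The arithmetic behind the first takeaway can be checked with a few lines. This sketch assumes every step succeeds independently with the same probability, which real agent runs only approximate; both function names are made up for the example.

```python
def task_success_probability(p, n):
    """Chance that all n steps succeed, assuming independent steps."""
    return p ** n


def required_step_reliability(target, n):
    """Per-step reliability needed to hit a target end-to-end success rate."""
    return target ** (1 / n)


for n in (5, 10, 20, 50):
    print(f"{n:>2} steps at p=0.95 -> {task_success_probability(0.95, n):.3f}")

needed = required_step_reliability(0.90, 50)
print(f"per-step reliability for 90% over 50 steps: {needed:.4f}")
```

Reading the relationship in reverse is often more useful: a 90% success rate over 50 steps needs each step to succeed well over 99% of the time.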

What comes next

You now understand how to detect, classify, and recover from agent failures inside a single agent workflow. The next logical step is orchestration: once planning, retrieval, execution, review, and escalation are split across multiple workers, every failure control needs an owner, a handoff contract, and a shared view of state.

Mastery check

Key concepts

  • Retry Policies
  • Exponential Backoff
  • Fallback Chains
  • Loop Detection
  • Token Budget Limits
  • Checkpointed Recovery
  • Idempotency Keys
  • State Reconciliation
  • Circuit Breakers
  • Graceful Degradation
  • Action Contract Validation
  • Pinned Constraints Pattern
  • Human-in-the-Loop Escalation
  • Trust Gap

Evaluation rubric

  • Foundational: Implement exponential backoff with jitter for transient errors
  • Intermediate: Design a fallback chain degrading from agent -> LLM -> static response
  • Advanced: Implement loop detection with exact-call hashing plus approximate similarity checks
  • Advanced: Validate tool arguments using Pydantic/Zod schemas
  • Advanced: Calculate token usage to prevent budget exhaustion
  • Advanced: Differentiate transient retries from checkpoint-based recovery for poisoned or partially committed state
  • Advanced: Reconcile timed-out side effects against the source of truth before retrying
  • Advanced: Apply pinned constraints pattern to combat recency bias
  • Advanced: Implement human-in-the-loop escalation for high-stakes failures

Common pitfalls

  • Symptom: Agent retries same hallucinated tool call three times. Cause: Retry path adds time, not new information. Fix: Return tool-allowlist or schema feedback, or switch to a narrower fallback path.
  • Symptom: Timeout after refund or email leads to duplicate side effect. Cause: Runtime treated "no response" as "nothing happened." Fix: Use idempotency keys, reconcile against source of truth, and resume from checkpoint before any replay.
  • Symptom: Loop breaker fires only after large token burn. Cause: Runtime relies on max-step cap alone. Fix: Add exact-call hashes plus semantic similarity over recent intents.
  • Symptom: Downstream agent makes confident wrong decision from bad upstream state. Cause: Handoff trusted raw agent output. Fix: Validate each handoff and stop pipeline when source-of-truth checks fail.
  • Symptom: Auth errors or invalid arguments keep entering exponential backoff. Cause: Retry policy doesn't separate transient from deterministic failures. Fix: Retry only 429, timeouts, and short outages; route schema, auth, and policy failures to correction or escalation.
  • Symptom: Dashboards look green while customers get useless replies. Cause: Team watched only latency and HTTP status. Fix: Track loop rate, recovery rate, fallback rate, validation failures, and token budget pressure.
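
The split between transient and deterministic failures described in the pitfalls above can be sketched as follows. `ToolError`, `call_with_retry`, and `backoff_delay` are hypothetical names, and treating 502/503/504 as "short outages" is an assumption of this example, not something the article specifies.

```python
import random
import time

# Assumption: these statuses are treated as transient; everything else is not.
TRANSIENT_STATUS = {429, 502, 503, 504}


class ToolError(Exception):
    """Hypothetical error type carrying an HTTP-like status code."""

    def __init__(self, status, message=""):
        super().__init__(message or f"status {status}")
        self.status = status


def backoff_delay(attempt, base=0.5, cap=8.0):
    # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)).
    return random.random() * min(cap, base * (2 ** attempt))


def call_with_retry(fn, max_attempts=4, sleep=time.sleep):
    """Retry transient failures only; max_attempts must be at least 1."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError as err:
            last_error = err  # transient: worth another attempt
        except ToolError as err:
            if err.status not in TRANSIENT_STATUS:
                raise  # auth, schema, policy: send to correction or escalation
            last_error = err
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    raise last_error
```

Note that a timeout on a side-effecting call still needs reconciliation before replay. This sketch is only appropriate for reads or idempotent operations.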
References

ReAct: Synergizing Reasoning and Acting in Language Models

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. · 2023 · ICLR 2023

Toolformer: Language Models Can Teach Themselves to Use Tools

Schick, T., et al. · 2023 · NeurIPS 2023

Measuring AI Ability to Complete Long Tasks

Kwa, T., West, B., Becker, J., et al. (METR) · 2025

Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Yao, S., et al. · 2024 · arXiv preprint

Reflexion: Language Agents with Verbal Reinforcement Learning

Shinn, N., et al. · 2023

LangGraph Persistence

LangChain · 2026

Temporal Workflow Execution Overview

Temporal Technologies · 2026

Lost in the Middle: How Language Models Use Long Contexts

Liu, N.F., et al. · 2023 · TACL 2023

AgentBench: Evaluating LLMs as Agents

Liu, X., et al. · 2023

Detecting Hallucinations in Large Language Models Using Semantic Entropy

Farquhar, S., et al. · 2024 · Nature