Learn how to implement validation gates, retries, checkpointed recovery, state reconciliation, loop breakers, and graceful degradation when LLM agents hallucinate, stall, or drift from their tools.
Traditional software and LLM agents can both fail loudly or quietly. Agent failures are especially easy to miss when the process returns a plausible answer while selecting a nonexistent tool, looping on a task, or claiming a side effect occurred without verified evidence. Without runtime controls, one stuck run can waste budget or send a customer an unsupported answer.
Imagine you have built an order-lookup agent for an e-commerce site. A customer asks, "Where is my order A102?" The agent plans to call the query_db tool with the order ID, fetch the shipping status, and return a friendly summary. On a good day, this works perfectly. On a bad day, the agent might invent a tool called search_database, send the wrong parameter name, or loop forever rephrasing the same query while the customer waits. Papers like ReAct (Reason + Act)[1] and Toolformer[2] made tool-using LLMs practical, but they didn't make them reliable by default. Production agents need defensive architecture that assumes failure is normal, not exceptional.
This article walks through a concrete order-lookup agent, watches it fail in realistic ways, and then adds validation gates, retry policy, and bounded fallback paths. By the end, you'll know how to make failure explicit before generated claims or uncertain writes reach a customer.
Agent systems inherit ordinary software faults and add generated decisions that may be syntactically valid but unsupported. Compare the recovery controls needed at execution time:
| Failure Mode | Traditional Software | LLM Agent | Recovery Strategy |
|---|---|---|---|
| Invalid input | Rejected with error message | Agent hallucinates a "valid" response | Validation Gates: Schema checks before tool execution |
| Infinite loop | Stack overflow or out-of-memory (OOM) | Agent keeps calling the same tool with slight variations | Semantic Loop Detection: Hash canonicalized actions and compare intent similarity |
| Service unavailable | Connection timeout | Agent fabricates the API response | Circuit Breaker: Fail fast and fall back to static logic |
| Logic error | Wrong result, sometimes unnoticed | Plausible but unsupported answer or tool choice | Verifier + source check: Validate against deterministic evidence where available |
This additional risk stems from generated actions and answers. An LLM acting inside an agent can vary its tool selection or interpretation after a small change in input or sampling. Agent architecture still prevents known errors, but it also has to detect degraded trajectories while they are running.
Because you can't exhaustively unit-test every possible conversational path or generated output, the center of gravity shifts toward runtime safeguards and evaluation harnesses. In a deterministic system, you can enumerate many edge cases before deployment. In an agentic system, you still need offline evals, but you also need guardrails that constrain the model during execution. When the model inevitably drifts off course, the system surrounding it must detect the drift and either steer it back or abort the operation safely before it causes compounding harm.
Here is the single number that explains why agent reliability is hard. A task that needs n dependent steps succeeds only if every step succeeds. If each step is independent and succeeds with probability p, the whole task succeeds with probability:

$$P(\text{task succeeds}) = p^n$$
This product decays fast. Even a per-step success rate that sounds excellent collapses over a long trajectory:
| Per-step success p | 5 steps | 20 steps | 50 steps |
|---|---|---|---|
| 99% | 95% | 82% | 61% |
| 95% | 77% | 36% | 8% |
| 90% | 59% | 12% | 0.5% |
A 95%-reliable step is fine for a one-shot tool call and nearly useless for a 50-step research agent. This is why the same model can feel magical in a demo and fall apart in production: the demo was short. The model didn't get worse; the trajectory got longer.
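The table can be reproduced in a few lines. This is a minimal sketch under the same independence assumption; the function name `task_success` is illustrative and not part of any library.

```python
def task_success(p: float, n: int) -> float:
    """Probability that all n independent steps succeed, each with probability p."""
    return p ** n

# Print one row per per-step success rate, matching the table's columns.
for p in (0.99, 0.95, 0.90):
    row = [f"{task_success(p, n):.1%}" for n in (5, 20, 50)]
    print(f"p={p:.2f}:", row)
```

Real agent steps are rarely fully independent, so treat the numbers as an intuition for how quickly reliability decays rather than as a forecast.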
This compounding is why long-horizon agents still break in production. In METR's 2025 task-horizon analysis, frontier models at the time improved their 50%-reliable task length quickly, with Claude 3.7 Sonnet reaching roughly an hour of human task time on that benchmark.[3] METR's interpretation is that those gains came largely from reliability and the ability to recover from mistakes, not only raw reasoning. Reliability is the bottleneck, and part of that bottleneck is engineering, not only model quality.
The benchmark numbers make the same point. On τ-bench, a tool-agent benchmark over retail and airline tasks, GPT-4o solves a meaningful share of tasks once but its pass^8 (succeeding on all eight independent attempts of the same task) falls below 25% in retail.[4] An agent that is right most of the time is still wrong often enough that, over a long run, failure is the expected outcome unless you engineer around it.
The defenses in this article all attack the same equation. They either raise per-step reliability p (validation gates, action-contract checks, structured feedback) or shrink the number of unrecovered steps that count against you (retries, fallbacks, checkpoints, loop breakers, escalation). You cannot make p = 1, so you build a system that survives the steps where p < 1.
Throughout this article, we'll follow a single ReAct-style agent tasked with looking up e-commerce orders. The agent receives a customer question, plans a tool call, executes it, and observes the result. Here's the loop in plain English:
query_db with {"id": "A102"}.

This loop assumes the agent picks the right tool, formats the arguments correctly, and stops after getting an answer. In practice, any of those steps can fail. The rest of this article introduces failures one at a time, shows the symptom, and then adds a defense.
Before we add defenses, you should already understand two ideas from earlier in the curriculum:
If those concepts are fuzzy, revisit them first. This article builds directly on top of them.
When building and operating autonomous agents in production, failures usually collapse into four categories. Each category has a typical symptom and a root cause you'll see again and again.
| Category | Typical Symptom | Root Cause Example |
|---|---|---|
| Planning | Infinite loops | Hallucinating a tool that doesn't exist |
| Action | Syntax errors or wrong parameters | Missing a required JSON field in an API call |
| Reflection | "False success" | Agent thinks it finished but the output is empty or wrong |
| Memory | Context poisoning | Early errors cascading into later steps |
These categories map cleanly onto the six concrete failure states addressed below. Planning failures show up as tool hallucinations and infinite loops. Action failures show up as wrong parameters and task-contract drift. Reflection failures show up as the agent declaring success when it actually failed. Memory failures show up as cascading errors in multi-agent systems and context-window overflow.
The illustration below grounds those categories in one order-lookup agent, so the taxonomy stays tied to symptoms you can recognize in production.
Let's start with the simplest failure and the simplest fix. Our order-lookup agent has a tool called query_db that expects {"id": "A102"}. Suppose the agent sends {"order_id": "A102"} instead. The API returns a 400 Bad Request with the message: Missing required parameter 'id'.
A naive agent sees the error and tries the exact same call again. Nothing has changed, so it gets the exact same 400. That's the "just try again" anti-pattern. Unless the system explains why the call failed, the agent is likely to repeat the error until something else stops it.
The fix is to feed the error back into the agent's context as structured feedback, not just a raw exception string. The agent can then compare its last action with the error and the tool schema, correct the key name, and retry. That's the foundation of reflexion-style recovery: the agent reflects on its failure, diagnoses the cause, and refines its next attempt.[5]
Here's how the corrected loop looks in practice:
```text
Step 1: query_db({"order_id": "A102"}) -> 400 Bad Request: Missing required parameter 'id'.
Step 2: Agent reflects: "I used 'order_id' but the schema requires 'id'."
Step 3: query_db({"id": "A102"}) -> 200 OK: Status delayed, ETA Friday.
```

The critical insight is that the retry must carry new information. The same prompt with the same context will likely produce the same wrong answer. You need to tell the model exactly what went wrong in a format it can act on.
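One way to give the retry new information is to package the failed call, the error, and the tool contract into a single compact message. The sketch below is an assumption about what such feedback could look like; `build_tool_feedback` and its field names are hypothetical, not a provider API.

```python
import json

def build_tool_feedback(tool: str, sent_args: dict, error: str, schema: dict) -> str:
    """Package a failed call so the model sees what it sent, what failed, and the contract."""
    feedback = {
        "tool": tool,
        "sent_args": sent_args,
        "error": error,
        "expected_schema": schema,
        "instruction": "Correct the arguments to match expected_schema, then retry once.",
    }
    # Sorted keys keep the message stable across runs, which helps caching and diffing.
    return json.dumps(feedback, sort_keys=True)

message = build_tool_feedback(
    tool="query_db",
    sent_args={"order_id": "A102"},
    error="Missing required parameter 'id'",
    schema={"required": ["id"], "properties": {"id": {"type": "string"}}},
)
print(message)
```

The message goes into the agent's context in place of the raw exception, so the next attempt can compare the rejected arguments with the schema directly.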
The primary defense is strict JSON schema validation. Modern model APIs often support tool calling, but you must still validate arguments before execution and apply authorization separately. If the arguments are invalid, the system returns a structured error message to the repair loop rather than executing the call.
We can use Pydantic (in Python) or Zod (in TypeScript) to enforce these schemas. The following validation function models the actual query_db({"id": "A102"}) tool contract and returns concise correction feedback:
```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError

class ToolCallError(Exception):
    """Raised when a proposed tool call fails schema validation."""

class QueryOrderArgs(BaseModel):
    # Reject unknown keys such as "order_id" instead of silently ignoring them.
    model_config = ConfigDict(extra="forbid")
    id: str = Field(..., pattern=r"^A[0-9]{3,}$")

def validate_tool_call(raw_args: dict[str, object]) -> QueryOrderArgs:
    try:
        return QueryOrderArgs(**raw_args)
    except ValidationError as exc:
        # Report only the first offending field to keep the feedback short.
        field = exc.errors()[0]["loc"][0]
        raise ToolCallError(f"query_db requires valid field: {field}") from exc

try:
    validate_tool_call({"order_id": "A102"})
except ToolCallError as exc:
    print("rejected:", exc)
print("accepted:", validate_tool_call({"id": "A102"}).id)
```

```text
rejected: query_db requires valid field: id
accepted: A102
```

Common mistake: Returning a raw Python traceback to the LLM. Standard error messages are often too noisy for LLMs. Agentic debugging involves isolating the minimal set of root-cause failures rather than dumping every surface-level exception. Feed the model a one-sentence description of what's wrong, not a 50-line stack trace.
One of the most frequent failures is tool misuse. In practice, that usually means one of two concrete problems: the model invents a tool that doesn't exist, or it emits malformed arguments for a real tool. Because LLMs are predictive text engines, if they lack sufficient context to execute a task, they often hallucinate a plausible-sounding tool instead of asking for clarification.
For example, our order-lookup agent might output:
```jsonc
// Model generates this:
{"tool": "search_database", "args": {"query": "order A102", "format": "json"}}
// But the actual API expects:
{"tool": "query_db", "params": {"id": "A102"}}
```

The model invented search_database because the name sounds reasonable. Without a validation gate, this call would be executed against a non-existent endpoint, producing a confusing error that the agent might misinterpret.
The fix is the same schema validation we saw above, plus a tool allowlist. Before executing any tool call, check that the requested tool name exists in the registered tool set. If it doesn't, return a clear error: Tool 'search_database' is not available. Available tools: query_db, get_eta, suggest_alternative.
This transforms an ambiguous failure into structured feedback the agent can use to self-correct.
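A minimal sketch of the allowlist check described above follows. The tool names come from the example error message; `REGISTERED_TOOLS` and `check_tool_name` are hypothetical names chosen for illustration.

```python
REGISTERED_TOOLS = {"query_db", "get_eta", "suggest_alternative"}

def check_tool_name(tool_name: str, registered: set[str] = REGISTERED_TOOLS) -> dict[str, object]:
    """Return an admit decision, or a structured rejection listing the real tools."""
    if tool_name in registered:
        return {"ok": True, "tool": tool_name}
    # Sort so the feedback text is deterministic from run to run.
    available = ", ".join(sorted(registered))
    return {
        "ok": False,
        "error": f"Tool '{tool_name}' is not available. Available tools: {available}.",
    }

print(check_tool_name("search_database")["error"])
print(check_tool_name("query_db"))
```

Run this check before argument validation: there is no schema to validate against until the tool name resolves to a registered tool.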
The loop bucket contains two common states: exact repeats and semantic repeats. In both cases, the agent repeatedly calls the same tool or oscillates between two states. Here's what it looks like with our order-lookup agent:
```text
Step 1: query_db({"id": "A102"}) -> "Delayed, ETA Friday"
Step 2: query_db({"id": "A102"}) -> "Delayed, ETA Friday"
Step 3: query_db({"id": "A102"}) -> "Delayed, ETA Friday"
... (continues forever)
```

Or with semantic rephrasing:

```text
Step 1: query_db({"id": "A102"}) -> "Delayed, ETA Friday"
Step 2: search("status for order A102") -> "Delayed, ETA Friday"
Step 3: search("A102 shipment ETA") -> "Delayed, ETA Friday"
Step 4: search("where is order A102") -> "Delayed, ETA Friday"
... (continues forever)
```

These loops typically occur when the model's internal prompt fails to guide it out of a dead end. If an API returns an unhelpful response, the model might "forget" that it just tried that exact approach because its attention mechanism incorrectly weights its overarching goal higher than the recent failure. This creates a state of "zombie" execution where the agent keeps trying to solve the problem by doing the exact same thing, slightly rephrased, forever.
Multi-layered loop detection is the key. The LoopBreaker class provides a practical implementation of this defense by maintaining a stateful history of actions. Its check method takes a proposed tool name and its arguments as inputs to evaluate against past behavior. It combines exact deduplication, repeated-tool heuristics, and approximate intent similarity, then returns True if the current call would push the agent over a loop threshold:
```python
import difflib
import hashlib
import json

class LoopBreaker:
    def __init__(
        self,
        max_steps: int = 15,
        max_identical: int = 3,
        same_tool_streak: int = 5,
        semantic_threshold: float = 0.9,
    ):
        self.max_steps = max_steps
        self.max_identical = max_identical
        self.same_tool_streak = same_tool_streak
        self.semantic_threshold = semantic_threshold
        self.tool_call_history: list[dict[str, str]] = []

    def _stable_call_hash(self, tool_name: str, args: dict[str, object]) -> str:
        # Sort keys and fix separators so equal arguments always hash the same.
        canonical_args = json.dumps(
            args,
            sort_keys=True,
            default=str,
            separators=(",", ":"),
        )
        return hashlib.sha256(f"{tool_name}:{canonical_args}".encode("utf-8")).hexdigest()

    def _intent_text(self, tool_name: str, args: dict[str, object]) -> str:
        # Collect free-text fields only. Return "" when there are none, so calls
        # that differ only in structured arguments (for example, different order
        # IDs) are not mistaken for rephrasings of each other.
        texts = []
        for key in ("query", "prompt", "text", "task"):
            value = args.get(key)
            if value:
                texts.append(str(value).strip().lower())
        if not texts:
            return ""
        return " ".join([tool_name, *texts])

    def check(self, tool_name: str, args: dict[str, object]) -> bool:
        """Returns True if a loop is detected."""
        call = {
            "tool": tool_name,
            "args_hash": self._stable_call_hash(tool_name, args),
            "intent_text": self._intent_text(tool_name, args),
        }

        # Check for loops BEFORE appending to history
        # Hard step limit
        if len(self.tool_call_history) >= self.max_steps:
            return True

        # Detect repeated identical calls
        identical_attempts = 1 + sum(
            1 for h in self.tool_call_history if h["args_hash"] == call["args_hash"]
        )
        if identical_attempts >= self.max_identical:
            return True

        # Detect same-tool repetition streaks
        same_tool_attempts = 1
        for h in reversed(self.tool_call_history):
            if h["tool"] != tool_name:
                break
            same_tool_attempts += 1
        if same_tool_attempts >= self.same_tool_streak:
            return True

        # Detect approximate intent repetition for rephrased queries/prompts
        recent_window = self.tool_call_history[-(self.same_tool_streak - 1):]
        similar_attempts = 1 + sum(
            1
            for h in recent_window
            if call["intent_text"]
            and h["intent_text"]
            and difflib.SequenceMatcher(
                None, h["intent_text"], call["intent_text"]
            ).ratio() >= self.semantic_threshold
        )
        if similar_attempts >= self.max_identical:
            return True

        # Only append after checks to avoid polluting history with the failing call
        self.tool_call_history.append(call)
        return False

breaker = LoopBreaker(max_identical=3)
print("first:", breaker.check("query_db", {"id": "A102"}))
print("second:", breaker.check("query_db", {"id": "A102"}))
print("third:", breaker.check("query_db", {"id": "A102"}))
```

```text
first: False
second: False
third: True
```

Notice that these thresholds include the current attempt. That detail matters because an off-by-one bug here burns a real extra tool call, token round-trip, and latency spike before the breaker trips.
In production, this approximate string check is usually replaced with embeddings over the tool name plus a compact state summary. The goal is the same: catch query_db({"id": "A102"}) versus search("where is order A102") before the agent burns another five steps.
Common mistake: Assuming that max_steps alone is enough to prevent loops. A step limit of 15 doesn't stop an agent from wasting 15 steps (and thousands of tokens) on a futile loop. Semantic loop detection stops the bleeding early, often after 2-3 repeated attempts.
This bucket has two related states: hard context-window overflow and slow budget bleed. Because most agent architectures append every tool call, reasoning step, and observation to a continuous conversational history, prompt size grows roughly linearly with each turn. Across the full run, total token spend can become roughly quadratic because each new step re-sends most of the prior history. If a task requires 20 steps, the final step includes the previous 19 steps in context, leading to massive context window bloat.
If this growth is left unchecked, it eventually leads to context window overflows (commonly rejected with a provider error, or handled via truncation you didn't intend) or unexpectedly high API bills. A runaway agent stuck in a subtle loop can consume a large number of tokens before anyone notices.
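The quadratic growth is easy to check with arithmetic. The sketch below assumes a fixed base prompt and a constant number of tokens appended per step; both figures are illustrative, not measurements from a real agent.

```python
def cumulative_input_tokens(steps: int, base_tokens: int, tokens_per_step: int) -> int:
    """Total input tokens sent across a run when every step re-sends the full history."""
    total = 0
    for step in range(steps):
        # Prompt for this step = fixed base + everything appended by earlier steps.
        total += base_tokens + step * tokens_per_step
    return total

for steps in (5, 10, 20):
    total = cumulative_input_tokens(steps, base_tokens=1_000, tokens_per_step=500)
    print(steps, "steps ->", total, "input tokens")
```

With these assumed numbers, doubling the run from 10 to 20 steps raises total input tokens from 32,500 to 115,000, roughly 3.5 times, even though the per-step prompt only grows linearly.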
Effective budget management involves implementing two distinct layers: hard limits and soft management. Hard limits act as a circuit breaker, abruptly stopping the agent if it exceeds a predefined cost or token threshold. Soft management involves proactively monitoring the context window and taking action (like summarizing the history) before limits are breached.
The TokenBudget class below provides a programmatic safeguard to track token consumption against hard constraints. Its consume method takes the number of input and output tokens from each generation step as inputs. It updates the internal state and, if the predefined usage or cost thresholds are exceeded, it immediately interrupts execution by raising a BudgetExhausted exception:
```python
class BudgetExhausted(Exception):
    pass

class TokenBudget:
    def __init__(
        self,
        max_tokens: int = 100_000,
        max_cost_usd: float = 1.0,
        pricing: dict[str, dict[str, float]] | None = None,
    ):
        self.max_tokens = max_tokens
        self.max_cost = max_cost_usd
        self.used_tokens = 0
        self.estimated_cost = 0.0
        # Per-model rates in USD per one million tokens.
        self.pricing = pricing or {}

    def consume(self, input_tokens: int, output_tokens: int, model: str):
        """Updates token counts and raises an error if budget is exceeded."""
        self.used_tokens += input_tokens + output_tokens
        self.estimated_cost += self._calculate_cost(input_tokens, output_tokens, model)

        # Hard stop to prevent runaway billing
        if self.used_tokens > self.max_tokens:
            raise BudgetExhausted(f"Token limit exceeded: {self.used_tokens}/{self.max_tokens}")
        if self.estimated_cost > self.max_cost:
            raise BudgetExhausted(f"Cost limit exceeded: ${self.estimated_cost:.2f}/${self.max_cost:.2f}")

    def _calculate_cost(self, input_tokens: int, output_tokens: int, model: str) -> float:
        rates = self.pricing.get(model)
        if rates is None:
            # Fail closed: an unpriced model cannot be budgeted.
            raise BudgetExhausted(f"Pricing missing for model: {model}")

        input_cost = (input_tokens / 1_000_000) * rates["input"]
        output_cost = (output_tokens / 1_000_000) * rates["output"]
        return input_cost + output_cost

budget = TokenBudget(
    max_tokens=1_000,
    max_cost_usd=0.01,
    pricing={"order-model": {"input": 1.0, "output": 2.0}},
)
budget.consume(200, 50, "order-model")
print("used tokens:", budget.used_tokens)
try:
    budget.consume(900, 20, "order-model")
except BudgetExhausted as exc:
    print("blocked:", str(exc).split(":")[0])
```

```text
used tokens: 250
blocked: Token limit exceeded
```

Production tip: When the budget approaches a configured soft threshold, stop optional exploration and compact only validated facts and pending work. A generated summary is lossy and cannot replace the durable record of writes, approvals, or idempotency keys.
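The soft layer can be as small as a function that maps current usage to an action. This is a sketch under assumptions: `budget_action` and the 80% `soft_ratio` default are hypothetical, and the compaction step itself is left to your orchestrator.

```python
def budget_action(used_tokens: int, max_tokens: int, soft_ratio: float = 0.8) -> str:
    """Map current usage to a coarse action: continue, compact, or stop."""
    if used_tokens >= max_tokens:
        return "stop"       # no headroom left for another model call
    if used_tokens >= soft_ratio * max_tokens:
        return "compact"    # halt optional exploration and shrink the context
    return "continue"

for used in (250, 850, 1_000):
    print(used, "->", budget_action(used, max_tokens=1_000))
```

Checking this before each model call lets the orchestrator compact context while there is still room to finish the task, rather than discovering the limit through an exception.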
In complex architectures, such as Directed Acyclic Graphs (DAGs) or hierarchical agent swarms (where a supervisor agent delegates to worker agents), failures usually spread in two ways: a bad handoff poisons downstream reasoning, or the agent's internal state drifts away from the source of truth after a partial side effect. Both create a snowball effect where a minor mistake at the start of a pipeline compounds into a much larger failure by the end.
Consider an example failure chain in a fulfillment-support pipeline:
A second version is even more dangerous because it looks operationally healthy. Suppose an "issue refund" tool times out after the payment processor already committed the refund. If your orchestrator records that step as failed and blindly retries, you now have duplicate side effects plus an agent state that no longer matches the external system.
The solution is strict, per-agent output validation gates plus explicit state reconciliation around side effects. You should never assume an agent's output is safe to pass directly to another agent without verification, and you should never assume a timed-out write means "nothing happened."
Before passing context to the next node in the graph, run a validator. Format and authorization checks should be deterministic; factual business state should be checked against its source of truth. A model-based critic can flag candidates for review, but it cannot prove that an order was delivered or a refund committed. If validation fails, the pipeline halts and either requests a corrected proposal or escalates. For side-effecting tool calls, add read-after-write reconciliation and idempotency keys before telling downstream agents the action succeeded.
The validated_handoff function implements this pattern to secure the boundaries between components. It accepts the upstream agent's raw output and a list of validation functions as inputs. Each validator returns a small ValidationResult, so the handoff code can return the intact output if all checks pass, or a structured error dictionary meant for retry or human escalation if any validation fails:
```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ValidationResult:
    valid: bool
    reason: str = ""

Validator = Callable[[dict[str, object]], ValidationResult]

def validated_handoff(
    output: dict[str, object],
    validators: list[Validator],
) -> dict[str, object]:
    """Validate agent output before passing to next agent."""
    for validator in validators:
        result = validator(output)
        if not result.valid:
            # Option 1: Fail fast
            # raise AgentError(result.reason)

            # Option 2: Return structured error for retry
            return {
                "status": "validation_failed",
                "error": result.reason,
                "fallback": "Request human review of this output"
            }
    return output
```

This bucket has two main states: transient faults (429, timeouts, brief 503s) and persistent dependency outages where the upstream still isn't healthy after the retry window. APIs go down, rate limits are hit, and databases occasionally time out. While traditional software also faces these issues, LLM agents are particularly vulnerable because they make a high volume of sequential API calls. A single user request might trigger multiple model generations, drastically increasing the surface area for a network failure.
Many LLM APIs enforce separate request-rate and token-rate quotas (often expressed as RPM and TPM). Because an agent's context window grows with every step, late-stage agent turns consume significantly more tokens than early ones. This means an agent can suddenly hit a token-rate limit in the middle of a complex reasoning chain, even if the raw request count is still within bounds.
To handle these transient network and rate-limit errors, use exponential backoff with jitter. A simple immediate retry will likely hit the same rate limit again, whereas exponential backoff gives the API time to replenish its token buckets. Jitter (adding randomness to the retry delay) prevents the "thundering herd" problem where multiple stalled agents retry simultaneously. Only retry failures that are plausibly transient, such as 429, 502/503, or timeouts. Don't put deterministic failures like 400 schema errors or 401 auth problems on the same retry path.
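To make the delay calculation concrete, here is a small sketch of the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function names, the 1-second base, and the 30-second cap are illustrative assumptions.

```python
import random
from typing import Optional

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)

def is_retryable(status: Optional[int]) -> bool:
    """Treat rate limits, gateway errors, and timeouts (status None) as transient."""
    return status is None or status in {429, 502, 503}

for attempt in range(5):
    print("attempt", attempt, "-> sleep", round(backoff_delay(attempt), 2), "s")
print("retry 429?", is_retryable(429), "| retry 400?", is_retryable(400))
```

Because the delay is random, two agents that failed at the same moment are unlikely to retry at the same moment, which is the point of the jitter.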
Here's an example using the tenacity library in Python to wrap model calls with reliable retry logic. The call_llm_with_retry async function takes a list of conversational messages and a model string as inputs. It normalizes provider-specific transient failures into retryable exception classes, then retries with randomized exponential backoff. The call_model_api function stands in for your actual provider client:
```python
import logging
import tenacity

logger = logging.getLogger(__name__)

class RetryableModelError(Exception):
    pass

class RateLimitExceeded(RetryableModelError):
    pass

class UpstreamTimeout(RetryableModelError):
    pass

class UpstreamUnavailable(RetryableModelError):
    pass

def normalize_provider_error(exc: Exception) -> Exception:
    # String matching is a simplification; prefer status codes or typed
    # exceptions when your provider client exposes them.
    message = str(exc).lower()
    if "rate limit" in message:
        return RateLimitExceeded(str(exc))
    if "timeout" in message:
        return UpstreamTimeout(str(exc))
    if "temporarily unavailable" in message or "503" in message:
        return UpstreamUnavailable(str(exc))
    return exc

async def call_model_api(messages, model, timeout_seconds):
    """Placeholder for your provider client; replace with the real call."""
    raise NotImplementedError

@tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_random_exponential(multiplier=1, min=1, max=30),
    retry=tenacity.retry_if_exception_type(RetryableModelError),
    before_sleep=lambda retry_state: logger.warning(
        "Retry %s: %s", retry_state.attempt_number, retry_state.outcome.exception()
    ),
    reraise=True,  # surface the last normalized error instead of tenacity.RetryError
)
async def call_llm_with_retry(messages: list[dict[str, str]], model: str = "primary-model"):
    """Wrap a provider call with exponential backoff plus jitter."""
    try:
        return await call_model_api(
            messages=messages,
            model=model,
            timeout_seconds=30,
        )
    except Exception as exc:
        # Only RetryableModelError subclasses trigger another attempt.
        raise normalize_provider_error(exc) from exc
```

Blind retries only help with transient infrastructure faults. If the model keeps making the same schema mistake, the next attempt needs better information, not just more time. Reflexion-style recovery uses feedback from the previous trial to improve the next one.[5] In production, that feedback can be much simpler than a full reflection loop: a concise validation error or tool rejection reason is often enough to change the next attempt.
External faults are where retry logic usually starts, but the same decision tree applies everywhere. Retries are only the right tool when the failure is transient and nothing important has committed yet. A 429, 503, or dropped connection usually fits that pattern. A poisoned context or partially completed side effect doesn't. If the agent already believes "invoice created" even though the tool call actually timed out, replaying the next step can duplicate charges or reason from false state.
Durable runtimes handle this with checkpoints rather than blind replay. LangGraph persists graph state as checkpoints keyed by thread_id and resumes from a saved super-step boundary.[6] Temporal persists workflow execution state in event history and replays deterministic workflow code after failures.[7] Both patterns let you resume from the last confirmed-good boundary instead of restarting a long run from scratch.
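The idea of resuming from a confirmed boundary can be shown without either framework. The sketch below is not the LangGraph or Temporal API; it is a hypothetical file-based checkpoint with illustrative names (`save_checkpoint`, `load_checkpoint`, `run`) and a simulated crash.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional

Step = Callable[[dict], dict]

def save_checkpoint(path: Path, next_step: int, state: dict) -> None:
    """Write to a temp file, then rename, so readers do not see a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"next_step": next_step, "state": state}))
    tmp.replace(path)

def load_checkpoint(path: Path) -> tuple[int, dict]:
    """Return (next_step, state), or a fresh start when no checkpoint exists."""
    if not path.exists():
        return 0, {}
    data = json.loads(path.read_text())
    return data["next_step"], data["state"]

def run(path: Path, steps: list[Step], crash_before: Optional[int] = None) -> dict:
    next_step, state = load_checkpoint(path)
    for index in range(next_step, len(steps)):
        if index == crash_before:
            raise RuntimeError("worker_restarted")  # simulated crash
        state = steps[index](state)
        save_checkpoint(path, index + 1, state)  # this boundary is now confirmed
    return state

with tempfile.TemporaryDirectory() as folder:
    ckpt = Path(folder) / "run.json"
    steps = [lambda s: {**s, "looked_up": True}, lambda s: {**s, "summarized": True}]
    try:
        run(ckpt, steps, crash_before=1)
    except RuntimeError:
        pass
    print(run(ckpt, steps))  # resumes at step 1; step 0 is not repeated
```

A crash between a step's side effect and its checkpoint write would still replay that step, which is why steps with external effects also need idempotency protection.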
In practice, the rule is simple:
This classifier makes the distinction executable. Notice that an uncertain write never enters the ordinary retry path, even when its transport error looks transient.
```python
def recovery_path(error: str, side_effect_status: str, checkpoint_valid: bool) -> str:
    # An uncertain write is checked first so it can never reach the retry branch.
    if side_effect_status == "uncertain":
        return "reconcile before replay"
    if checkpoint_valid and error == "worker_restarted":
        return "resume checkpoint"
    if side_effect_status == "none" and error in {"429", "503", "timeout"}:
        return "retry with backoff"
    return "stop or escalate"

print("lookup timeout:", recovery_path("timeout", "none", False))
print("refund timeout:", recovery_path("timeout", "uncertain", True))
```

```text
lookup timeout: retry with backoff
refund timeout: reconcile before replay
```

Production tip: Pair checkpoints with idempotency keys for outbound side effects. Resume must be safe even if the previous attempt failed after the external system committed but before your agent recorded success.
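A minimal sketch of that pairing follows, using an in-memory stand-in for the external system. `FakePaymentProcessor`, `idempotency_key`, and `safe_refund` are hypothetical names; a real processor would expose its own idempotency mechanism and lookup endpoint.

```python
import hashlib

class FakePaymentProcessor:
    """In-memory stand-in for an external system that honors idempotency keys."""
    def __init__(self) -> None:
        self.refunds: dict[str, float] = {}

    def refund(self, key: str, amount: float) -> str:
        if key in self.refunds:
            return "already_committed"  # same key: no second refund
        self.refunds[key] = amount
        return "committed"

    def lookup(self, key: str) -> bool:
        return key in self.refunds

def idempotency_key(run_id: str, step: str, order_id: str) -> str:
    """Derive a stable key so a replayed step reuses the same key."""
    return hashlib.sha256(f"{run_id}:{step}:{order_id}".encode("utf-8")).hexdigest()

def safe_refund(processor: FakePaymentProcessor, key: str, amount: float) -> str:
    # Read before write: if an earlier attempt committed, do not send again.
    if processor.lookup(key):
        return "reconciled"
    return processor.refund(key, amount)

processor = FakePaymentProcessor()
key = idempotency_key("run-1", "issue_refund", "A102")
print(safe_refund(processor, key, 25.0))  # first attempt
print(safe_refund(processor, key, 25.0))  # replay after an assumed timeout
print("refunds recorded:", len(processor.refunds))
```

The key is derived from the run and step rather than generated randomly, so a resumed run produces the same key and the external system can recognize the replay.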
A particularly insidious failure occurs when the agent's proposed operation diverges from its accepted task contract. The two common states here are wrong tool choice despite a valid request and wrong argument serialization despite choosing the right tool. A model may describe a reasonable plan yet call the wrong tool or use unsupported parameters when it emits the executable payload.
Consider our order-lookup agent. Its structured intent says lookup_order for A102. However, the emitted tool call might target a calculate_shipping tool or provide a different order ID, producing irrelevant results while the response text still sounds plausible.
This disconnect happens because generated explanations do not constrain generated tool payloads. An application should validate the payload against its task contract instead of asking for hidden reasoning or trusting a confident explanation.
The primary defense is execution verification: compare the proposed tool call to a structured task contract before execution. A critic model can help triage ambiguous output, but deterministic tool, identifier, permission, and schema checks should block clear violations.
```python
class ActionMismatch(Exception):
    pass

def validate_action_contract(
    task_contract: dict[str, object],
    tool_name: str,
    tool_args: dict[str, object],
    allowed_tools: set[str],
) -> str:
    # Checks run from broadest to most specific: allowlist, intent, identifier.
    if tool_name not in allowed_tools:
        raise ActionMismatch("tool not allowed")
    if task_contract["intent"] == "lookup_order" and tool_name != "query_db":
        raise ActionMismatch("tool does not match intent")
    if tool_args.get("id") != task_contract["order_id"]:
        raise ActionMismatch("order ID changed")
    return "admitted"

contract = {"intent": "lookup_order", "order_id": "A102"}
print("lookup:", validate_action_contract(contract, "query_db", {"id": "A102"}, {"query_db"}))
try:
    validate_action_contract(contract, "calculate_shipping", {}, {"query_db"})
except ActionMismatch as exc:
    print("drift:", exc)
```

```text
lookup: admitted
drift: tool not allowed
```

So far we've treated each failure as a separate problem with a separate fix. For failures that permit another attempt, structured feedback can help the model propose a better one. The Reflexion pattern, introduced by Shinn et al. (2023), gives an agent a way to reflect on feedback and refine its strategy before trying again.[5]
The loop is simple: Generate -> Critique -> Refine.
Think of it like a writer and an editor working on the same draft. The writer (the generator) produces the first attempt. The editor (the critique) reads it, marks what's wrong, and suggests improvements. The writer then produces a second draft that addresses the feedback. In an agent, the same LLM can play both roles by switching prompts: first a "doer" prompt, then a "critic" prompt, then a "fixer" prompt.
Here's how this looks for our order-lookup agent after a 400 Bad Request:
```text
[Generate] Agent: query_db({"order_id": "A102"})
[Observe]  System: 400 Bad Request: Missing required parameter 'id'.
[Critique] Critic prompt: "The agent used 'order_id' but the schema requires 'id'.
           The agent should check the tool schema before retrying."
[Refine]   Agent: query_db({"id": "A102"})
[Observe]  System: 200 OK: Status delayed, ETA Friday.
```

The critical difference from simple retry is that the critique step produces new reasoning, not just a repeated attempt. The agent explicitly names what went wrong and how to fix it, which changes the probability distribution of the next generation.
Key insight: A model critique is another proposal, not proof. Use it to suggest a repair after an allowed failure, then rerun deterministic validators and source-of-truth checks before execution.
In production, you don't always need a full three-step Reflexion loop. For many failures, a single structured error message is enough. Use critique/refine for repeated or ambiguous proposals that remain safe to retry; high-stakes or potentially committed effects should stop for approval or reconciliation instead.
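The Generate -> Critique -> Refine loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `TOOL_SCHEMA`, `validate`, `reflexion_loop`, and the two `fake_*` stand-ins for the model prompts are hypothetical names introduced here. The point it shows is that the critic only supplies feedback, while a deterministic validator decides whether a proposal is admitted.

```python
from typing import Callable, Optional

# Hypothetical tool schema: tool name -> required argument names.
TOOL_SCHEMA = {"query_db": {"id"}}


def validate(tool: str, args: dict) -> Optional[str]:
    """Deterministic gate: return an error message, or None if the call is admitted."""
    if tool not in TOOL_SCHEMA:
        return f"Tool '{tool}' not available."
    missing = TOOL_SCHEMA[tool] - set(args)
    if missing:
        return f"Missing required parameter(s): {sorted(missing)}"
    return None


def reflexion_loop(
    generate: Callable[[Optional[str]], tuple],
    critique: Callable[[str, dict, str], str],
    max_refinements: int = 2,
) -> tuple:
    """Generate, validate, critique, and refine, with a hard cap on rounds."""
    feedback = None
    for _ in range(max_refinements + 1):
        tool, args = generate(feedback)
        error = validate(tool, args)  # the validator decides, not the critic
        if error is None:
            return tool, args
        feedback = critique(tool, args, error)
    raise RuntimeError("No valid proposal after critique rounds; escalate instead")


# Stand-ins for the "doer" and "critic" prompts (assumptions for this sketch).
def fake_generate(feedback: Optional[str]) -> tuple:
    if feedback is None:
        return "query_db", {"order_id": "A102"}  # first draft uses the wrong key
    return "query_db", {"id": "A102"}  # refined draft after critique


def fake_critique(tool: str, args: dict, error: str) -> str:
    return f"{error} Check the schema for '{tool}' before retrying."


print(reflexion_loop(fake_generate, fake_critique))
# → ('query_db', {'id': 'A102'})
```

Capping the number of refinements keeps the loop from becoming its own failure mode: a generator that never produces an admissible call ends in an explicit error that the caller can route to escalation.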
Borrowed from microservices architecture, the circuit breaker prevents a failing component from taking down the entire system. For agents, that usually means failing fast once a tool or model endpoint crosses a known error threshold instead of letting every request discover the outage independently.
The circuit breaker pattern operates as a state machine with three states that control how requests are routed: Closed (normal operation, calls pass through), Open (all calls fail fast), and Half-Open (a single test call is allowed through).
The "Half-Open" state is critical for agent recovery. If we simply switched from Open back to Closed after a timeout, a still-failing API would immediately receive a flood of pending agent requests (the "thundering herd" problem), potentially triggering further rate limits or blowing through your budget before the circuit trips again. By allowing only a single test request through, we check that the service actually responds before restoring full traffic.
Here's a complete implementation of this state machine. The CircuitBreaker class acts as a protective wrapper, where its call method takes an arbitrary async function and its arguments as inputs. This version explicitly gates the half-open state so only one probe request can run at a time:
```python
import asyncio
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without running it."""


class CircuitBreaker:
    """Prevents cascading failures by halting requests to failing services."""
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # All calls fail fast
    HALF_OPEN = "half_open"  # Test with one call

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.state = self.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure_time = 0.0
        self._lock = asyncio.Lock()
        self._half_open_probe_in_flight = False

    async def call(self, func, *args, **kwargs):
        """Executes the function if the circuit is closed or half-open."""
        async with self._lock:
            # Monotonic clock: unaffected by wall-clock adjustments.
            now = time.monotonic()

            if self.state == self.OPEN:
                if now - self.last_failure_time > self.reset_timeout:
                    self.state = self.HALF_OPEN
                    self._half_open_probe_in_flight = False
                else:
                    raise CircuitOpenError("Circuit is open - failing fast")

            if self.state == self.HALF_OPEN and self._half_open_probe_in_flight:
                raise CircuitOpenError("Half-open probe already in flight")

            # Remember whether this call is the single half-open probe.
            is_probe = self.state == self.HALF_OPEN
            if is_probe:
                self._half_open_probe_in_flight = True

        try:
            result = await func(*args, **kwargs)
        except Exception:
            async with self._lock:
                self._on_failure()
            raise
        else:
            async with self._lock:
                # A late success from a call admitted before the circuit
                # opened must not close it; only the probe may do that.
                if is_probe or self.state == self.CLOSED:
                    self._on_success()
            return result
        finally:
            # Release the probe slot even if the probe was cancelled.
            if is_probe:
                self._half_open_probe_in_flight = False

    def _on_success(self):
        self.failure_count = 0
        self.state = self.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.state == self.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = self.OPEN
```

If you run multiple app servers, keep this state in a shared store like Redis rather than in process memory. A per-process breaker protects only the worker that observed the failures.
When the primary approach fails, fall back to cheaper, more predictable alternatives. The chain runs in order: try the most capable path first, validate its result, and degrade only when that path fails.
This chain systematically trades capability for narrower behavior. If the most advanced agent with the fullest tool set fails or hits a circuit breaker, you may downgrade to a simpler agent with fewer tools, then to a direct model call with no tools, and finally to a static status message. A static fallback is deterministic, but it cannot answer a live order or refund question. It should state that live data is unavailable and route the user to retry or support rather than inventing status.
Here's a Python implementation of this pattern. The FallbackChain class orchestrates this graceful degradation through its execute method, which takes the user's task string as input. It systematically attempts to process the task using a list of progressively simpler internal strategies, ultimately returning the first valid string response it generates, or a hardcoded default if all fail:
```python
import logging

logger = logging.getLogger(__name__)


class FallbackChain:
    """Try multiple approaches in order of capability/cost.

    Note: run_agent and call_llm are illustrative functions representing
    your underlying execution engine.
    """

    # Pre-defined responses for last-resort fallback
    FALLBACK_RESPONSES = {
        "greeting": "Hello! I'm currently operating in degraded mode.",
        "search": "Search is currently unavailable, please try again later.",
    }
    DEFAULT_RESPONSE = "Live order status is unavailable right now. Please retry or contact support."

    async def execute(self, task: str, requires_live_data: bool = False) -> str:
        strategies = [
            ("primary_agent", self._full_agent_execution),
            ("backup_agent", self._simplified_agent),
            ("direct_generation", self._direct_llm_call),
            ("static_fallback", self._static_fallback),
        ]

        for strategy_name, strategy in strategies:
            try:
                result = await strategy(task)
                if self._validate_result(strategy_name, result, requires_live_data):
                    return result
                logger.warning("Strategy %s returned invalid result", strategy_name)
            except Exception as e:
                # Log and move on to the next, simpler strategy.
                logger.error("Strategy %s failed: %s", strategy_name, e)

        return self.DEFAULT_RESPONSE

    def _validate_result(
        self,
        strategy_name: str,
        result: str,
        requires_live_data: bool,
    ) -> bool:
        if not result or not result.strip():
            return False
        # Paths without tool access cannot answer questions that need live data.
        if requires_live_data and strategy_name in {"direct_generation", "static_fallback"}:
            return False
        return True

    async def _full_agent_execution(self, task: str) -> str:
        """Full agent with tools: most capable, most expensive."""
        # In an implementation, return only output backed by validated tool evidence.
        return "Full agent output"

    async def _simplified_agent(self, task: str) -> str:
        """Simplified agent: fewer tools, lower step limit."""
        # In an implementation, retain the same evidence checks as primary path.
        return "Simplified agent output"

    async def _direct_llm_call(self, task: str) -> str:
        """Direct LLM call without tools: cheapest, fastest."""
        # return await call_llm(task)
        return "Direct LLM output"

    async def _static_fallback(self, task: str) -> str:
        """Static response: deterministic last-resort path."""
        # classify_task is an illustrative function
        # task_type = classify_task(task)
        task_type = "unknown"
        return self.FALLBACK_RESPONSES.get(task_type, self.DEFAULT_RESPONSE)
```

Long-context studies show a clear positional bias: models often do best when relevant information appears near the beginning or end of the context window, and noticeably worse when that information is buried in the middle.[8] In long-running agent traces, critical constraints can drift into that low-salience middle region as more tool output and scratchpad text accumulate.
The pinned constraints pattern addresses this by systematically re-injecting a very short set of essential rules at the end of every prompt. This takes advantage of recency effects to keep high-priority instructions salient regardless of conversation length. It's a recall aid, not an enforcement boundary. Authorization checks, tool allowlists, and output validation still need to live outside the model.
For example, if an agent must never disclose sensitive customer data, simply stating this constraint once at the start of a system prompt is insufficient. As the conversation grows, the model's attention shifts to recent turns and the original constraint fades. By appending a "pinned constraints" section to every prompt generation, you keep the critical rules fresher in the model's context.
```python
class PinnedConstraintsAgent:
    """Agent that re-injects critical constraints at the end of every prompt."""

    CRITICAL_CONSTRAINTS = """
CRITICAL CONSTRAINTS (must follow):
1. Never disclose PII (Personally Identifiable Information)
2. Always confirm actions that modify data
3. If uncertain, ask for clarification rather than guessing
4. Maximum 10 tool calls per user request
"""

    def __init__(self, base_system_prompt: str):
        self.base_system_prompt = base_system_prompt

    def build_prompt(self, conversation_history: list, current_task: str) -> str:
        """Build prompt with pinned constraints at the end."""
        prompt_parts = [
            self.base_system_prompt,
            "",
            "=== CONVERSATION HISTORY ===",
            self._format_history(conversation_history),
            "",
            "=== CURRENT TASK ===",
            current_task,
            "",
            self.CRITICAL_CONSTRAINTS,  # Pinned at the end for recency bias
        ]
        return "\n".join(prompt_parts)

    def _format_history(self, history: list) -> str:
        # Only the 10 most recent messages are included.
        return "\n".join([f"{msg['role']}: {msg['content']}" for msg in history[-10:]])
```

Production tip: Keep pinned constraints concise (3-5 bullet points max). Too many constraints dilute the recency effect. Prioritize safety, compliance, and task-critical rules only.
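Because pinned constraints are a recall aid rather than an enforcement boundary, a rule such as "Maximum 10 tool calls per user request" still has to be checked by the runtime. The sketch below shows one way that might look; `ToolGate` and `ToolCallLimitExceeded` are hypothetical names introduced for illustration, and the allowlist contents are assumed.

```python
class ToolCallLimitExceeded(Exception):
    """Raised when a request tries to exceed its tool-call budget."""


class ToolGate:
    """Hypothetical runtime gate: enforces an allowlist and a per-request call cap."""

    def __init__(self, allowed_tools: set, max_calls: int = 10):
        self.allowed_tools = allowed_tools
        self.max_calls = max_calls
        self.calls_made = 0

    def admit(self, tool_name: str) -> None:
        """Call before every tool execution; raises instead of trusting the prompt."""
        if tool_name not in self.allowed_tools:
            raise PermissionError(f"Tool '{tool_name}' not available")
        if self.calls_made >= self.max_calls:
            raise ToolCallLimitExceeded(f"Limit of {self.max_calls} tool calls reached")
        self.calls_made += 1


gate = ToolGate({"query_db"}, max_calls=10)
gate.admit("query_db")  # admitted; counts as call 1 of 10
```

Creating one gate per user request keeps the counter scoped to that request, so the limit holds even if the model ignores or forgets the pinned text.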
Not all failures can be resolved automatically. When an agent encounters high-stakes ambiguity, repeated failures, or potential safety issues, it should pause and route a bounded review packet to an authorized human reviewer.
Human-in-the-loop (HITL) architectures can intercept risky proposals when triggers are correctly enforced. They do not guarantee correctness: reviewers can be wrong, approvals can become stale, and execution can drift from the reviewed proposal. Use measurable escalation triggers and, for any later side effect, bind approval to the exact action and revalidate it at execution time.
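Escalation should be triggered by specific, measurable conditions rather than vague "uncertainty." Effective triggers include a failure count that reaches the retry limit, a policy score that falls below the review threshold, and sensitive action types such as delete, modify, transfer, or refund.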
Here's an implementation that demonstrates intelligent escalation logic:
```python
from dataclasses import dataclass
from enum import Enum


class EscalationReason(Enum):
    POLICY_REVIEW = "policy_review"
    REPEATED_FAILURES = "repeated_failures"
    SAFETY_CHECK = "safety_check"
    VALIDATION_FAILED = "validation_failed"
    MAX_STEPS_EXCEEDED = "max_steps_exceeded"


@dataclass
class EscalationRequest:
    reason: EscalationReason
    context: dict[str, object]
    priority: int  # 1 (critical) to 5 (low)
    suggested_action: str


class HumanInTheLoop:
    """Manages escalation to human operators with context preservation."""

    def __init__(self,
                 review_threshold: float = 0.7,
                 max_retries: int = 3):
        self.review_threshold = review_threshold
        self.max_retries = max_retries
        self.escalation_queue: list[EscalationRequest] = []

    def should_escalate(self,
                        agent_state: dict[str, object],
                        failure_count: int = 0,
                        policy_score: float | None = None) -> EscalationRequest | None:
        """Determine if human intervention is needed."""

        # Check repeated failures
        if failure_count >= self.max_retries:
            return EscalationRequest(
                reason=EscalationReason.REPEATED_FAILURES,
                context=agent_state,
                priority=2,
                suggested_action="Review failed attempts and provide guidance"
            )

        # Check evaluated policy threshold
        if policy_score is not None and policy_score < self.review_threshold:
            return EscalationRequest(
                reason=EscalationReason.POLICY_REVIEW,
                context=agent_state,
                priority=3,
                suggested_action="Review policy-flagged output"
            )

        # Check safety-critical actions
        if agent_state.get("action_type") in ["delete", "modify", "transfer", "refund"]:
            return EscalationRequest(
                reason=EscalationReason.SAFETY_CHECK,
                context=agent_state,
                priority=1,
                suggested_action="Approve sensitive action before execution"
            )

        return None  # No escalation needed

    async def handle_escalation(self, request: EscalationRequest) -> dict[str, object]:
        """Submit to human review queue and await resolution."""
        self.escalation_queue.append(request)

        # In production, this would notify via Slack, PagerDuty, etc.
        # and wait for human response
        return {
            "status": "escalated",
            "ticket_id": f"ESC-{len(self.escalation_queue)}",
            "reason": request.reason.value,
            "priority": request.priority,
            "review_decision": await self._await_human_input(request)
        }

    async def _await_human_input(self, request: EscalationRequest) -> dict[str, str]:
        # Placeholder - production would integrate with ticketing system
        return {"decision": "pending", "action_hash": "not-approved"}
```

Key insight: An agent that says "I need help" is more valuable than one that silently hallucinates a "Success" message. Design your escalation UX to make it clear when and why the agent failed, preserving user trust.
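The escalation code above returns an `action_hash` placeholder. One way to bind an approval to the exact action and revalidate it at execution time is to hash a canonical form of the proposed tool call and check both the hash and the approval's age before running it. The sketch below assumes that approach; `action_hash`, `ApprovalStore`, and the 15-minute expiry are illustrative choices, not part of any specific framework.

```python
import hashlib
import json
import time
from typing import Optional


def action_hash(tool: str, args: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps({"tool": tool, "args": args}, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ApprovalStore:
    """Hypothetical in-memory record of reviewer approvals, keyed by action hash."""

    def __init__(self, ttl_seconds: float = 900.0):
        self.ttl_seconds = ttl_seconds
        self._approved_at = {}  # action hash -> approval timestamp

    def approve(self, tool: str, args: dict, now: Optional[float] = None) -> str:
        digest = action_hash(tool, args)
        self._approved_at[digest] = time.time() if now is None else now
        return digest

    def is_approved(self, tool: str, args: dict, now: Optional[float] = None) -> bool:
        """Revalidate at execution time: same action, approval not stale."""
        now = time.time() if now is None else now
        approved_at = self._approved_at.get(action_hash(tool, args))
        return approved_at is not None and now - approved_at <= self.ttl_seconds


store = ApprovalStore()
store.approve("refund", {"order_id": "A102", "amount_cents": 2500})
print(store.is_approved("refund", {"amount_cents": 2500, "order_id": "A102"}))  # same action
print(store.is_approved("refund", {"order_id": "A102", "amount_cents": 25000}))  # amount drifted
```

The first check prints `True` and the second prints `False`: any change to the tool name or arguments produces a different hash, so an approval for one refund cannot be reused for a different one.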
Production agent systems need specialized observability beyond standard APM (Application Performance Monitoring). While latency and error rates are important, they don't capture the unique failure modes of probabilistic agents. A "healthy" agent might return 200 OK status codes while being completely stuck in a logic loop or hallucinating facts. Benchmarks such as AgentBench report that even strong models fail often on realistic multi-step agent tasks, which is why production systems add defensive layers around them.[9][10]
Track agent-specific metrics to detect runs that are technically returning responses but failing to deliver valid results. Semantic entropy techniques measure uncertainty across sampled answers and can help prioritize review for open-ended generations; they do not verify a tool result or replace source-of-truth checks.[11]
| Metric | Type | What it Measures | Critical Alert Threshold |
|---|---|---|---|
| Loop Rate | Counter | Percentage of traces where loop detection triggered | > 5% of requests |
| Recovery Rate | Derived ratio | Percentage of failed trajectories rescued by validation, retry, fallback, or escalation | Drops below baseline |
| Fallback Rate | Counter | Frequency of downgrading to simpler models | > 10% of requests |
| Step Count | Histogram | Number of tool calls per user turn | P99 (99th percentile) > 15 steps |
| Token Usage | Histogram | Total tokens (prompt + completion) per turn | > 80% of budget |
| Validation Rejection Rate | Derived ratio | Percentage of validated outputs rejected by factual or schema checks | > 2% of validated outputs |
Counters track cumulative events like loops, fallbacks, and validation failures. Histograms record distributions like step counts and token usage. Validation rejection rate and recovery rate are usually computed as derived ratios from validation and incident counters rather than stored directly as gauges. A useful definition is: recovery rate = (interventions that ended in a safe answer or clean escalation) / (interventions triggered).
For example, if 100 order-lookup traces hit a validation error, loop breaker, timeout, or fallback path, and 72 still return a safe answer or clean escalation, the recovery rate is 72%. This metric tells you whether your defenses are actually saving users, not merely detecting failures.
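The arithmetic in this example can be sketched as a small helper. The function name `recovery_rate` and its handling of the zero-intervention case are assumptions made for illustration; in practice the two inputs would come from your counters over the same time window.

```python
from typing import Optional


def recovery_rate(interventions: int, safe_recoveries: int) -> Optional[float]:
    """Share of triggered interventions that ended in a safe answer or clean escalation.

    Returns None when no interventions occurred, since the ratio is undefined.
    """
    if interventions == 0:
        return None
    if not 0 <= safe_recoveries <= interventions:
        raise ValueError("safe_recoveries must be between 0 and interventions")
    return safe_recoveries / interventions


rate = recovery_rate(interventions=100, safe_recoveries=72)
print(f"{rate:.0%}")
# → 72%
```

Returning `None` for an empty window avoids reporting a misleading 0% or 100% when nothing went wrong in the first place.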
The following example uses the Prometheus client to define these custom agent metrics. The AgentMetrics class sets up counters and histograms that map cleanly to the table above. Once instantiated, it exposes metric objects that take labels (like the agent's name or model) as inputs, allowing your observability stack to monitor loop rates, fallback frequency, and validation failures over time:
```python
from prometheus_client import Counter, Histogram


class AgentMetrics:
    """Tracks specialized observability metrics for LLM agents."""

    def __init__(self):
        # Per-run distributions support P95/P99 alerting
        self.step_count = Histogram(
            "agent_steps_per_run",
            "Tool calls per agent execution",
            ["agent"],
            buckets=(1, 3, 5, 8, 10, 15, 20, 30),
        )

        # Event counters
        self.loop_detected = Counter("agent_loops_total", "Loop detections", ["agent"])
        self.fallback_triggered = Counter("agent_fallbacks_total", "Fallback triggers", ["strategy"])
        self.interventions_triggered = Counter(
            "agent_interventions_total",
            "Failure-handling interventions triggered",
            ["agent", "intervention"],
        )
        self.safe_recoveries = Counter(
            "agent_safe_recoveries_total",
            "Interventions that ended in safe answer or clean escalation",
            ["agent", "intervention"],
        )
        self.validation_failures = Counter(
            "agent_validation_failures_total",
            "Outputs rejected by validation gates",
            ["agent", "reason"],
        )
        self.validated_outputs = Counter(
            "agent_validated_outputs_total",
            "Outputs checked by validation gates",
            ["agent"],
        )

        # Budget distributions. The client's default buckets are sized for
        # latencies in seconds, so token counts need explicit boundaries.
        # These values are examples; size them around your own token budget.
        self.token_usage = Histogram(
            "agent_tokens_per_run",
            "Tokens per execution",
            ["model"],
            buckets=(1_000, 5_000, 10_000, 25_000, 50_000, 100_000, 200_000),
        )
```

Treat the alert thresholds and bucket boundaries above as example starting points to calibrate against your own baseline, not universal constants.
These agent-specific metrics should be piped directly into your alerting infrastructure (like PagerDuty or Datadog) alongside standard server metrics. Catching a spike in loop detections early can be the difference between a minor service degradation and a massive, unexpected API bill at the end of the month.
When engineers transition from deterministic software to building agentic pipelines, they often carry over assumptions that don't apply to probabilistic models. Recognizing and unlearning these patterns is critical for building resilient systems.
Common mistake: Believing you can "just retry on failure" like traditional APIs. Reality: Retrying the same hallucinating LLM call often repeats the same failure mode, or produces a different wrong answer. Effective recovery requires changing the strategy: modify the prompt, switch models, or reduce the task scope.
Common mistake: Treating every timeout as safe to replay. Reality: If a side effect may already have committed, replay can duplicate work or create inconsistent state. Use idempotency keys, checkpoints, and explicit reconciliation against the source of truth.
Common mistake: Assuming that max step counts are enough to prevent loops. Reality: Step limits are a crude safety net. They don't prevent an agent from wasting 15 steps (and thousands of tokens) on a futile loop before hitting the limit. Semantic loop detection stops the bleeding early.
Common mistake: Wrapping all agent calls in generic try/except blocks. Reality: Catching all exceptions silently leads to "zombie agents" that continue operating with corrupted state. Failures should be explicit and handled by validation gates or circuit breakers, not swallowed by generic error handlers.
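As an alternative to a generic try/except, failures can be classified and routed explicitly. The sketch below is one possible shape under stated assumptions: `ToolError`, `Route`, and `route_failure` are hypothetical names, and the mapping from HTTP-style status codes to routes is an example policy, not a standard.

```python
from enum import Enum


class Route(Enum):
    RETRY = "retry"          # transient: try again with backoff
    CORRECT = "correct"      # send structured feedback to the model
    ESCALATE = "escalate"    # hand off to a human or a fallback path
    RECONCILE = "reconcile"  # check the source of truth before anything else


class ToolError(Exception):
    """Hypothetical tool failure carrying a status and a side-effect hint."""

    def __init__(self, status: int, message: str, may_have_committed: bool = False):
        super().__init__(message)
        self.status = status
        self.may_have_committed = may_have_committed


def route_failure(err: ToolError, attempt: int, max_retries: int = 3) -> Route:
    """Pick a recovery path from the failure class instead of replaying blindly."""
    if err.may_have_committed:
        return Route.RECONCILE  # never replay an uncertain write
    if err.status in {401, 403}:
        return Route.ESCALATE   # auth and policy failures are not fixed by retrying
    if err.status in {400, 422}:
        return Route.CORRECT    # schema errors need a different proposal
    if err.status in {429, 503, 504}:
        return Route.RETRY if attempt < max_retries else Route.ESCALATE
    return Route.ESCALATE       # unknown failures fail loudly


print(route_failure(ToolError(400, "Missing required parameter 'id'"), attempt=1).value)
# → correct
```

Checking `may_have_committed` first reflects the second mistake above: a timeout on a write is a question about state, not a signal to retry.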
Here's a short exercise to test whether you can match symptoms to defenses. Read each scenario, decide which failure category it belongs to, and pick the right fix.
Scenario 1: Your order-lookup agent calls query_db({"id": "A102"}) three times in a row, gets the same answer each time, and keeps going.
Scenario 2: The agent's reasoning trace says "I will search the database" but the actual tool call is calculate_shipping({}).
Scenario 3: After a refund tool times out, the orchestrator retries and the customer receives two refund emails.
Scenario 4: The agent has used 120,000 tokens on a single request and your limit is 100,000.
Scenario 5: The agent invents a tool called fetch_order_magic that doesn't exist in your tool registry.
Scenario 1: Infinite loop (Planning failure). The LoopBreaker would catch it on the third identical call via args_hash deduplication. You should also check why the agent isn't stopping after getting a complete answer. Is the stopping condition too vague?
Scenario 2: Action-contract disconnect (Action failure). The runtime should compare the emitted tool to the structured task contract and allowlist, blocking calculate_shipping when the accepted intent is order lookup. A critique model can add review signal, but it should not replace deterministic capability checks.
Scenario 3: Cascading failure (Memory/State failure). The refund was non-idempotent and the timeout didn't mean "nothing happened." You need idempotency keys on the refund tool and read-after-write reconciliation before retrying. Blind replay caused a duplicate side effect.
Scenario 4: Token budget exhaustion (Memory failure). The TokenBudget class would raise BudgetExhausted at the hard limit. Better yet, you should trigger summarization at the 80% soft threshold to prevent hitting the hard limit at all.
Scenario 5: Tool call hallucination (Planning failure). A tool allowlist would reject fetch_order_magic before execution, returning structured feedback: Tool 'fetch_order_magic' not available. Available tools: query_db, get_eta, suggest_alternative.
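The fix for Scenario 3 can be sketched with an idempotency key plus a read of the source of truth before any replay. `RefundLedger` is an in-memory stand-in for a payment system, and `recover_after_timeout` and the key format are hypothetical; a real provider's API will differ.

```python
from typing import Optional


class RefundLedger:
    """In-memory stand-in for the payment system's source of truth (hypothetical)."""

    def __init__(self):
        self._by_key = {}  # idempotency key -> refund record

    def issue_refund(self, idempotency_key: str, order_id: str, amount_cents: int) -> dict:
        """Repeating a call with the same key returns the original record."""
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            return existing
        record = {"order_id": order_id, "amount_cents": amount_cents, "status": "refunded"}
        self._by_key[idempotency_key] = record
        return record

    def lookup(self, idempotency_key: str) -> Optional[dict]:
        return self._by_key.get(idempotency_key)

    def refund_count(self) -> int:
        return len(self._by_key)


def recover_after_timeout(ledger: RefundLedger, key: str, order_id: str, amount_cents: int) -> dict:
    """Reconcile first: only resend the write if the ledger has no record of it."""
    committed = ledger.lookup(key)
    if committed is not None:
        return committed  # the refund landed; do not send it again
    return ledger.issue_refund(key, order_id, amount_cents)  # same key, safe to resend


ledger = RefundLedger()
key = "refund-A102-1"  # assumed key format: one key per intended refund
ledger.issue_refund(key, "A102", 2500)  # commits, but imagine the response timed out
recover_after_timeout(ledger, key, "A102", 2500)
print(ledger.refund_count())
# → 1
```

The key is derived from the intended action, not from the attempt, so every retry of the same refund maps to the same record.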
To build agents that can survive production workloads, keep the following core principles in mind:
- A run of n dependent steps succeeds with probability roughly p^n, where p is the per-step success rate, so a 95%-reliable step is nearly useless over a 50-step run (0.95^50 ≈ 0.077). Reliability, not raw reasoning, is usually the bottleneck.
- Retry transient failures such as 429, timeouts, and short outages; route schema, auth, and policy failures to correction or escalation.

You now understand how to detect, classify, and recover from agent failures inside a single agent workflow. The next logical step is orchestration: once planning, retrieval, execution, review, and escalation are split across multiple workers, every failure control needs an owner, a handoff contract, and a shared view of state.

ReAct: Synergizing Reasoning and Acting in Language Models
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. · 2023 · ICLR 2023
Toolformer: Language Models Can Teach Themselves to Use Tools
Schick, T., et al. · 2023 · NeurIPS 2023
Measuring AI Ability to Complete Long Tasks
Kwa, T., West, B., Becker, J., et al. (METR) · 2025
Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Yao, S., et al. · 2024 · arXiv preprint
Reflexion: Language Agents with Verbal Reinforcement Learning
Shinn, N., et al. · 2023
LangGraph Persistence
LangChain · 2026
Temporal Workflow Execution Overview
Temporal Technologies · 2026
Lost in the Middle: How Language Models Use Long Contexts
Liu, N.F., et al. · 2023 · TACL 2023
AgentBench: Evaluating LLMs as Agents
Liu, X., et al. · 2023
AgentBench: Evaluating LLMs as Agents
Liu, X., et al. · 2023
Detecting Hallucinations in Large Language Models Using Semantic Entropy
Farquhar, S., et al. · 2024 · Nature