Agent Failure & Recovery

Learn how to implement validation checks, retries, checkpointed recovery, state reconciliation, loop breakers, and graceful degradation when LLM agents hallucinate, stall, or drift from their tools.

Memory preserves context across an agent run; recovery keeps that run bounded when tools, state, or generated decisions go wrong.

Traditional software and LLM agents can both fail loudly or quietly. Agent failures are especially easy to miss when the process returns a plausible answer while selecting a nonexistent tool, looping on a task, or claiming a side effect occurred without verified evidence. Without runtime controls, one stuck run can waste budget or send a user an unsupported answer.

A deploy-status agent for a CI/CD platform makes the failure surface concrete. A developer asks, "What happened to run RUN-842?" The agent plans to call the get_run tool with the run ID, fetch the build status, and return a concise summary. On a good day, this works perfectly. On a bad day, the agent might invent a tool called search_runs, send the wrong parameter name, or loop forever rephrasing the same query while the developer waits. Papers like ReAct (Reason + Act) [1] (ReAct: Synergizing Reasoning and Acting in Language Models, https://arxiv.org/abs/2210.03629) and Toolformer [2] (Toolformer: Language Models Can Teach Themselves to Use Tools, https://arxiv.org/abs/2302.04761) made tool-using LLMs practical, but they didn't make them reliable by default. Production agents need defensive architecture that assumes failure is normal, not exceptional.

Start with a concrete deploy-status agent, watch it fail in realistic ways, and then add validation checks, retry policy, circuit breakers, and bounded fallback paths. Make failure explicit before generated claims or uncertain writes reach a user.

Why agent failures require runtime checks

Agent systems inherit ordinary software faults and add generated decisions that may be syntactically valid but unsupported. Compare the recovery controls needed at execution time:

Failure Mode	Traditional Software	LLM Agent	Recovery Strategy
Invalid input	Rejected with error message	Agent hallucinates a "valid" response	Validation checks: Schema checks before tool execution
Infinite loop	CPU spin or runaway resource use	Agent keeps calling the same tool with slight variations	Semantic Loop Detection: Hash canonicalized actions and compare intent similarity
Service unavailable	Connection timeout	Agent fabricates the API response	Circuit Breaker: Fail fast and fallback to static logic
Logic error	Wrong result, sometimes unnoticed	Plausible but unsupported answer or tool choice	Verifier + source check: Validate against deterministic evidence where available

This additional risk stems from generated actions and answers. An LLM acting inside an agent can vary its tool selection or interpretation after a small change in input or sampling. Agent architecture still prevents known errors, but it also has to detect degraded trajectories while they're running.

Because you can't exhaustively unit-test every possible conversational path or generated output, the center of gravity shifts toward runtime safeguards and evaluation harnesses. In a deterministic system, you can enumerate many edge cases before deployment. In an agentic system, you still need offline evals, but you also need guardrails that constrain the model during execution. When the model drifts off course, the system surrounding it must detect the drift and either steer it back or abort the operation safely before it causes compounding harm.

Why small per-step errors compound

One number explains why agent reliability is hard. A task that needs n dependent steps succeeds only if every step succeeds. If each step is independent and succeeds with probability p, the whole task succeeds with probability:

$$P(\text{success}) \approx p^{n}$$

This product decays fast. Even a per-step success rate that sounds excellent collapses over a long trajectory:

Per-step success `p`	5 steps	20 steps	50 steps
99%	95%	82%	61%
95%	77%	36%	8%
90%	59%	12%	0.5%

A 95%-reliable step can look acceptable in a one-shot demo but becomes nearly useless for a 50-step research agent, where the whole task succeeds only about 8% of the time. The same model can look impressive in a demo and fall apart in production because the demo was short. The model didn't get worse; the trajectory got longer.

This compounding is why long-horizon agents still break in production. METR's 2025 task-horizon analysis found that the task length frontier models could complete with 50% reliability was growing quickly, with Claude 3.7 Sonnet reaching roughly an hour of human task time on that benchmark [3] (Measuring AI Ability to Complete Long Tasks, https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/). METR's interpretation is that those gains came largely from reliability and the ability to recover from mistakes, rather than raw reasoning alone. Reliability is the bottleneck, and part of that bottleneck is engineering. Model quality alone doesn't remove it.

The benchmark numbers make the same point. On $\tau$-bench, a tool-agent benchmark over retail and airline tasks, GPT-4o solves a meaningful share of tasks once, but its pass^8 (succeeding on all eight independent attempts of the same task) falls below 25% in retail [4] (Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains, https://arxiv.org/abs/2406.12045). An agent that's right most of the time is still wrong often enough that, over a long run, failure is the expected outcome unless you engineer around it.

These defenses all attack the same equation. They either raise per-step reliability p (validation checks, action-contract checks, structured feedback) or shrink the number of unrecovered steps that count against you (retries, fallbacks, checkpoints, loop breakers, escalation). You can't make p = 1, so you build a system that survives the steps where p < 1.
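
As a rough check on the table and on why recovery helps, the sketch below computes $p^{n}$ and then repeats the calculation when each step gets one extra attempt. The helper names are illustrative, and the retry model assumes attempts fail independently, which is optimistic for a model that tends to repeat the same mistake.

```python
def task_success(p: float, n: int) -> float:
    """Probability that n independent steps all succeed."""
    return p ** n


def step_success_with_retries(p: float, attempts: int) -> float:
    """Per-step success when a failed step may be attempted again.

    Assumption: each attempt fails independently with probability 1 - p.
    """
    return 1 - (1 - p) ** attempts


# 50-step task, 95% per-step reliability, no recovery.
baseline = task_success(0.95, 50)

# Same task when every step may be attempted twice.
with_retry = task_success(step_success_with_retries(0.95, 2), 50)

print(f"no recovery: {baseline:.1%}")
print(f"one retry per step: {with_retry:.1%}")
```

Under these assumptions the 50-step task moves from roughly 8% to roughly 88% end-to-end success. Real retries are correlated with the original failure, so treat the second figure as an upper bound that only holds when the retry carries new information.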

The deploy-status agent: our running example

A single ReAct-style agent is tasked with looking up CI/CD runs. The agent receives a developer question, plans a tool call, executes it, and observes the result. The loop in plain English:

Developer asks: "What happened to run RUN-842?"
Agent reasons: "I need to query CI status for run RUN-842."
Agent acts: Calls get_run with {"id": "RUN-842"}.
Agent observes: "Status: failed, failing job: unit-tests."
Agent answers: "Run RUN-842 failed in unit-tests."

This loop assumes the agent picks the right tool, formats the arguments correctly, and stops after getting an answer. In practice, any of those steps can fail. Each failure section below introduces one symptom and then adds a defense.

Before adding defenses, you should already understand two ideas from earlier in the curriculum:

The agentic loop (Planning -> Action -> Observation) from the ReAct article.
Tool execution flow (how an LLM turns a goal into a specific API schema) from the function-calling article.

If those concepts are fuzzy, revisit them first. The recovery patterns below assume that foundation.

A taxonomy of failure

The agent failures fit four operating categories. This isn't the only possible taxonomy, but it gives each symptom an owner and a recovery path.

Category	Typical Symptom	Root Cause Example
Planning	Infinite loops	Hallucinating a tool that doesn't exist
Action	Syntax errors or wrong parameters	Missing a required JSON field in an API call
Reflection	"False success"	Agent thinks it finished but the output is empty or wrong
Memory	Context poisoning	Early errors cascading into later steps

These categories map cleanly onto the six concrete failure states addressed below. Planning failures show up as tool hallucinations and infinite loops. Action failures show up as wrong parameters and task-contract drift. Reflection failures show up as the agent declaring success after a failed run. Memory failures show up as cascading errors in multi-agent systems and context-window overflow.

The illustration below grounds those categories in one deploy-status agent, so the taxonomy stays tied to symptoms you can recognize in production.

Figure: Recovery loop that stops on a failed stage before retrying. Each recovery stage can fail closed before state is saved, and only verified state feeds the next retry.

Worked example: the wrong parameter

Start with the simplest failure and the simplest fix. Our deploy-status agent has a tool called get_run that expects {"id": "RUN-842"}. Suppose the agent sends {"run_id": "RUN-842"} instead. The API returns a 400 Bad Request with the message: Missing required parameter 'id'.

A naive agent sees the error and tries the exact same call again. Nothing has changed, so it gets the exact same 400. That's the "just try again" anti-pattern. Unless the runtime explains why the call failed, the agent is likely to keep repeating the error.

Feed the error back into the agent's context as structured feedback, rather than a raw exception string. The agent can then compare its last action with the error and the tool schema, correct the key name, and retry. That's the foundation of Reflexion-style recovery: the agent reflects on its failure, diagnoses the cause, and refines its next attempt [5] (Reflexion: Language Agents with Verbal Reinforcement Learning, https://arxiv.org/abs/2303.11366).

The corrected loop looks like this:

text

Step 1: get_run({"run_id": "RUN-842"}) -> 400 Bad Request: Missing required parameter 'id'.
Step 2: Agent reflects: "I used 'run_id' but the schema requires 'id'."
Step 3: get_run({"id": "RUN-842"}) -> 200 OK: Status failed, failing job unit-tests.

The retry must carry new information. The same prompt with the same context will likely produce the same wrong answer. You need to tell the model exactly what went wrong in a format it can act on.
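
One way to package that feedback is a small dictionary the runtime builds before the retry. This is a sketch under assumptions: build_repair_feedback and its field names are hypothetical, and the required and allowed key sets would come from your own tool schema.

```python
import json


def build_repair_feedback(
    tool: str,
    sent_args: dict[str, object],
    required: set[str],
    allowed: set[str],
) -> dict[str, object]:
    """Summarize a rejected call in a form the model can act on."""
    sent_keys = set(sent_args)
    return {
        "tool": tool,
        "error": "invalid_arguments",
        # Keys the schema needs but the model did not send.
        "missing": sorted(required - sent_keys),
        # Keys the model sent that the schema does not define.
        "unexpected": sorted(sent_keys - allowed),
        "hint": f"Call {tool} with exactly these keys: {sorted(allowed)}",
    }


feedback = build_repair_feedback(
    tool="get_run",
    sent_args={"run_id": "RUN-842"},
    required={"id"},
    allowed={"id"},
)
print(json.dumps(feedback, indent=2))
```

The missing and unexpected lists give the model both halves of the diagnosis at once: 'id' was required and 'run_id' was not recognized.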

Defense strategy: schema validation

The primary defense is strict JSON schema validation. Modern model APIs often support tool calling or structured outputs, but you must still validate arguments before execution and apply authorization separately. If the arguments are invalid, the system returns a structured error message to the repair loop rather than executing the call.

Pydantic (in Python) or Zod (in TypeScript) can enforce these schemas. This validation function models the actual get_run({"id": "RUN-842"}) tool contract and returns concise correction feedback:

validate-run-tool-arguments.py

from pydantic import BaseModel, ConfigDict, Field, ValidationError

class ToolCallError(Exception):
    """Concise, model-facing description of a rejected tool call."""

class QueryRunArgs(BaseModel):
    # Reject unknown keys such as "run_id" instead of silently ignoring them.
    model_config = ConfigDict(extra="forbid")
    id: str = Field(..., pattern=r"^RUN-[0-9]{3,}$")

def validate_tool_call(raw_args: dict[str, object]) -> QueryRunArgs:
    try:
        return QueryRunArgs(**raw_args)
    except ValidationError as exc:
        # Surface only the first failing field so the feedback stays short.
        field = exc.errors()[0]["loc"][0]
        raise ToolCallError(f"get_run requires valid field: {field}") from exc

# A misnamed key is rejected before the tool runs.
try:
    validate_tool_call({"run_id": "RUN-842"})
except ToolCallError as exc:
    print("rejected:", exc)
# The correct key passes validation.
print("accepted:", validate_tool_call({"id": "RUN-842"}).id)

Output

rejected: get_run requires valid field: id
accepted: RUN-842

Common mistake: Returning a raw Python traceback to the LLM. Standard error messages are often too noisy for LLMs. Agentic debugging involves isolating the minimal set of root-cause failures rather than dumping every surface-level exception. Feed the model a one-sentence description of what's wrong, not a 50-line stack trace.

Tool call hallucination

One of the most frequent failures is tool misuse. In practice, that usually means one of two concrete problems: the model invents a tool that doesn't exist, or it emits malformed arguments for a real tool. Because LLMs are predictive text engines, if they lack sufficient context to execute a task, they often hallucinate a plausible-sounding tool instead of asking for clarification.

For example, our deploy-status agent might output:

text

// Model generates this:
{"tool": "search_runs", "args": {"query": "run RUN-842", "format": "json"}}
// But the actual API expects:
{"tool": "get_run", "params": {"id": "RUN-842"}}

The model invented search_runs because the name sounds reasonable. Without a validation check, this call would be executed against a non-existent endpoint, producing a confusing error that the agent might misinterpret.

Defense strategy

Use the same schema validation from above, plus a tool allowlist. Before executing any tool call, check whether the requested tool name exists in the registered tool set. If it doesn't, return a clear error: Tool 'search_runs' is not available. Available tools: get_run, get_logs, suggest_alternative.

This transforms an ambiguous failure into structured feedback the agent can use to self-correct.

Infinite loops (the "stuck agent")

The loop bucket contains two common states: exact repeats and semantic repeats. In both cases, the agent repeatedly calls the same tool or oscillates between two states. A deploy-status agent might produce this trace:

text

Step 1: get_run({"id": "RUN-842"}) -> "Failed in unit-tests"
Step 2: get_run({"id": "RUN-842"}) -> "Failed in unit-tests"
Step 3: get_run({"id": "RUN-842"}) -> "Failed in unit-tests"
... (continues forever)

Or with semantic rephrasing:

text

Step 1: get_run({"id": "RUN-842"}) -> "Failed in unit-tests"
Step 2: search("status for run RUN-842") -> "Failed in unit-tests"
Step 3: search("RUN-842 failing job") -> "Failed in unit-tests"
Step 4: search("where is run RUN-842") -> "Failed in unit-tests"
... (continues forever)

These loops typically occur when the runtime doesn't make progress explicit. A vague stopping condition, weak error feedback, or missing state check can let the model propose the same approach again with slightly different wording. The model isn't required to prove that its new action differs from the failed one, so the surrounding system has to detect repetition.

Defense strategy

Multi-layered loop detection is the key. The LoopBreaker class provides a practical implementation of this defense by maintaining a stateful history of actions. Its check method takes a proposed tool name and its arguments as inputs to evaluate against past behavior. It combines exact deduplication, repeated-tool heuristics, and approximate intent similarity, then returns True if the current call would push the agent over a loop threshold:

defense-strategy.py

import difflib
import hashlib
import json
import re

class LoopBreaker:
    def __init__(
        self,
        max_steps: int = 15,
        max_identical: int = 3,
        same_tool_streak: int = 5,
        semantic_threshold: float = 0.9,
    ):
        self.max_steps = max_steps
        self.max_identical = max_identical
        self.same_tool_streak = same_tool_streak
        self.semantic_threshold = semantic_threshold
        self.tool_call_history: list[dict[str, str]] = []

    def _stable_call_hash(self, tool_name: str, args: dict[str, object]) -> str:
        canonical_args = json.dumps(
            args,
            sort_keys=True,
            default=str,
            separators=(",", ":"),
        )
        return hashlib.sha256(f"{tool_name}:{canonical_args}".encode("utf-8")).hexdigest()

    def _intent_text(self, tool_name: str, args: dict[str, object]) -> str:
        argument_text = " ".join(str(value).strip().lower() for value in args.values())
        identifiers = sorted(set(re.findall(r"\b[a-z]+-\d+\b", argument_text)))
        combined = f"{tool_name.replace('_', ' ')} {argument_text}"

        # Normalize tool-specific phrasing into a task-level intent.
        if identifiers and "run" in combined:
            return f"lookup run {' '.join(identifiers)}"

        fields = [tool_name.replace("_", " ")]
        fields.extend(identifiers)
        fields.extend(
            str(args[key]).strip().lower()
            for key in ("query", "prompt", "text", "task")
            if args.get(key)
        )
        return " ".join(fields)

    def check(self, tool_name: str, args: dict[str, object]) -> bool:
        """Returns True if a loop is detected."""
        call = {
            "tool": tool_name,
            "args_hash": self._stable_call_hash(tool_name, args),
            "intent_text": self._intent_text(tool_name, args),
        }

        # Check for loops BEFORE appending to history
        # Hard step limit
        if len(self.tool_call_history) >= self.max_steps:
            return True

        # Detect repeated identical calls
        identical_attempts = 1 + sum(
            1 for h in self.tool_call_history if h["args_hash"] == call["args_hash"]
        )
        if identical_attempts >= self.max_identical:
            return True

        # Detect same-tool repetition streaks
        same_tool_attempts = 1
        for h in reversed(self.tool_call_history):
            if h["tool"] != tool_name:
                break
            same_tool_attempts += 1
        if same_tool_attempts >= self.same_tool_streak:
            return True

        # Detect approximate intent repetition for rephrased queries/prompts
        recent_window = self.tool_call_history[-(self.same_tool_streak - 1):]
        similar_attempts = 1 + sum(
            1
            for h in recent_window
            if h["intent_text"]
            and difflib.SequenceMatcher(
                None, h["intent_text"], call["intent_text"]
            ).ratio() >= self.semantic_threshold
        )
        if similar_attempts >= self.max_identical:
            return True

        # Only append after checks to avoid polluting history with the failing call
        self.tool_call_history.append(call)
        return False

breaker = LoopBreaker(max_identical=3)
print("first:", breaker.check("get_run", {"id": "RUN-842"}))
print("second:", breaker.check("get_run", {"id": "RUN-842"}))
print("third:", breaker.check("get_run", {"id": "RUN-842"}))

cross_tool = LoopBreaker(max_identical=2)
print("cross-tool first:", cross_tool.check("get_run", {"id": "RUN-842"}))
print("cross-tool repeat:", cross_tool.check("search", {"query": "where is run RUN-842"}))

Output

first: False
second: False
third: True
cross-tool first: False
cross-tool repeat: True

Notice that these thresholds include the current attempt. That detail matters because an off-by-one bug here burns a real extra tool call, token round-trip, and latency spike before the breaker trips.

The task-level normalization makes the runnable guard catch get_run({"id": "RUN-842"}) versus search("where is run RUN-842") even though the tool names and argument keys differ. For broader production traffic, replace or augment this narrow normalization with embeddings over a compact task and state summary. Evaluate that semantic detector on both true loops and legitimate follow-up actions against the same entity.

Common mistake: Assuming that max_steps alone is enough to prevent loops. A step limit of 15 doesn't stop an agent from wasting 15 steps (and thousands of tokens) on a futile loop. Semantic loop detection stops the bleeding early, often after 2-3 repeated attempts.

Token budget exhaustion

This bucket has two related states: hard context-window overflow and slow token-budget bleed. In a naive agent runtime that appends each tool call and observation to one transcript, prompt size grows roughly linearly with each turn. Across the full run, total token spend can become roughly quadratic because each new step re-sends most of the prior history. If a 20-step task replays its full transcript every time, the final step includes the previous 19 steps.

If this growth is left unchecked, it can cause context-window overflow (commonly rejected with a provider error, or handled via truncation you didn't intend) and unexpectedly high API bills. A runaway agent stuck in a subtle loop can consume a large amount of tokens before anyone notices.
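
To make the growth concrete, the sketch below models a runtime that re-sends the whole transcript on every step. The function name and the token sizes are made-up illustration values, not measurements.

```python
def transcript_cost(steps: int, system_tokens: int, tokens_per_step: int) -> tuple[int, int]:
    """Return (final_prompt_tokens, total_input_tokens) for a full-replay transcript."""
    total_input = 0
    final_prompt = 0
    prompt = system_tokens
    for _ in range(steps):
        total_input += prompt      # the whole transcript so far is sent again
        final_prompt = prompt
        prompt += tokens_per_step  # this step's call and observation are appended
    return final_prompt, total_input


final_20, total_20 = transcript_cost(steps=20, system_tokens=1_000, tokens_per_step=500)
final_40, total_40 = transcript_cost(steps=40, system_tokens=1_000, tokens_per_step=500)
print("20 steps:", final_20, total_20)
print("40 steps:", final_40, total_40)
```

With these example numbers, doubling the run from 20 to 40 steps roughly doubles the final prompt but nearly quadruples total input tokens, which is the quadratic bleed described above.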

Defense strategy

Effective budget management involves implementing two distinct layers: hard limits and soft management. Hard limits act as a circuit breaker, abruptly stopping the agent if it exceeds a predefined cost or token threshold. Soft management involves proactively monitoring the context window and taking action (like summarizing the history) before limits are breached.

The TokenBudget class below tracks actual token consumption against hard constraints. Its consume method takes the number of input and output tokens reported for each completed generation step. It updates internal state and raises BudgetExhausted when usage crosses a configured limit, blocking any further agent steps. Pair this post-call accounting with a preflight context estimate and provider-side output cap so one generation can't overshoot too far:

defense-strategy-2.py

class BudgetExhausted(Exception):
    pass

class TokenBudget:
    def __init__(
        self,
        max_tokens: int = 100_000,
        max_cost_usd: float = 1.0,
        pricing: dict[str, dict[str, float]] | None = None,
    ):
        self.max_tokens = max_tokens
        self.max_cost = max_cost_usd
        self.used_tokens = 0
        self.estimated_cost = 0.0
        self.pricing = pricing or {}

    def consume(self, input_tokens: int, output_tokens: int, model: str):
        """Updates token counts and raises an error if budget is exceeded."""
        self.used_tokens += input_tokens + output_tokens
        self.estimated_cost += self._calculate_cost(input_tokens, output_tokens, model)

        # Hard stop to prevent runaway billing
        if self.used_tokens > self.max_tokens:
            raise BudgetExhausted(f"Token limit exceeded: {self.used_tokens}/{self.max_tokens}")
        if self.estimated_cost > self.max_cost:
            raise BudgetExhausted(f"Cost limit exceeded: ${self.estimated_cost:.2f}/${self.max_cost:.2f}")

    def _calculate_cost(self, input_tokens: int, output_tokens: int, model: str) -> float:
        rates = self.pricing.get(model)
        if rates is None:
            raise BudgetExhausted(f"Pricing missing for model: {model}")

        input_cost = (input_tokens / 1_000_000) * rates["input"]
        output_cost = (output_tokens / 1_000_000) * rates["output"]
        return input_cost + output_cost

budget = TokenBudget(
    max_tokens=1_000,
    max_cost_usd=0.01,
    pricing={"agent-model": {"input": 1.0, "output": 2.0}},
)
budget.consume(200, 50, "agent-model")
print("used tokens:", budget.used_tokens)
try:
    budget.consume(900, 20, "agent-model")
except BudgetExhausted as exc:
    print("blocked:", str(exc).split(":")[0])

Output

used tokens: 250
blocked: Token limit exceeded

Production tip: When the budget approaches a configured soft threshold, stop optional exploration and compact only validated facts and pending work. A generated summary is lossy and can't replace the durable record of writes, approvals, or idempotency keys.
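
A soft threshold can be a small decision function that runs before each step. This is a sketch: budget_action and the 80% default are assumptions, and the hard stop mirrors the strict "greater than" comparison used by the token limit above.

```python
def budget_action(used_tokens: int, max_tokens: int, soft_ratio: float = 0.8) -> str:
    """Decide what the runtime should do before starting the next step."""
    if used_tokens > max_tokens:
        return "stop"      # hard limit: no further agent steps
    if used_tokens >= soft_ratio * max_tokens:
        return "compact"   # soft limit: drop optional work, keep validated facts
    return "continue"


for used in (250, 850, 1_170):
    print(used, budget_action(used, max_tokens=1_000))
```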

Cascading failures in multi-agent systems

In complex architectures, such as Directed Acyclic Graphs (DAGs) or supervisor-worker graphs, failures usually spread in two ways: a bad handoff poisons downstream reasoning, or the agent's internal state drifts away from the source of truth after a partial side effect. Both create a snowball effect where a minor mistake near the start of a pipeline compounds into a much larger failure later.

Consider an example failure chain in an incident-triage pipeline:

Monitor Agent: "CI check passed" (Hallucinates a test result event that never happened).
Gate Agent: "Tests passed, so approve deployment" (Treats the hallucinated check as source-of-truth state).
Responder Agent: "Run RUN-842 passed yesterday" (Sends a confident user-facing answer grounded in false state).

A second version is even more dangerous because it looks operationally healthy. Suppose a rollback_deploy tool times out after the deployment controller already committed the rollback. If your orchestrator records that step as failed and blindly retries, the duplicate side effects leave the agent state out of sync with the external system.

Defense strategy

Use strict, per-agent output validation checks plus explicit state reconciliation around side effects. You should never assume an agent's output is safe to pass directly to another agent without verification, and you should never assume a timed-out write means "nothing happened."

Before passing context to the next node in the graph, run a validator. Format and authorization checks should be deterministic; factual operational state should be checked against its source of truth. A model-based critic can flag candidates for review, but it can't prove that a run passed or a rollback committed. If validation fails, the pipeline halts and either requests a corrected proposal or escalates. For side-effecting tool calls, add read-after-write reconciliation and idempotency keys before telling downstream agents the action succeeded.

The validated_handoff function implements this pattern to secure the boundaries between components. It accepts the upstream agent's raw output and a list of validation functions as inputs. Each validator returns a small ValidationResult, so the handoff code can return the intact output if all checks pass, or a structured error dictionary meant for retry or human escalation if any validation fails:

defense-strategy-3.py

from dataclasses import dataclass
from typing import Callable

@dataclass
class ValidationResult:
    valid: bool
    reason: str = ""

Validator = Callable[[dict[str, object]], ValidationResult]

def validated_handoff(
    output: dict[str, object],
    validators: list[Validator],
) -> dict[str, object]:
    """Validate agent output before passing to next agent."""
    for validator in validators:
        result = validator(output)
        if not result.valid:
            # Option 1: Fail fast
            # raise AgentError(result.reason)

            # Option 2: Return structured error for retry
            return {
                "status": "validation_failed",
                "error": result.reason,
                "fallback": "Request human review of this output"
            }
    return output
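
A validator for factual operational state could compare the claimed status with the system of record. The sketch below is self-contained, so it repeats a small ValidationResult type. CI_RECORDS is a hypothetical in-memory stand-in for a real CI API lookup, and the run_id and status field names are assumptions about the upstream agent's output.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    valid: bool
    reason: str = ""


# Hypothetical stand-in for the CI system of record.
CI_RECORDS = {"RUN-842": "failed"}


def status_matches_source(output: dict[str, object]) -> ValidationResult:
    """Reject a handoff whose claimed run status disagrees with the CI record."""
    run_id = str(output.get("run_id"))
    claimed = output.get("status")
    actual = CI_RECORDS.get(run_id)
    if actual is None:
        return ValidationResult(False, f"No CI record found for {run_id}")
    if claimed != actual:
        return ValidationResult(
            False, f"Claimed status '{claimed}' but CI reports '{actual}' for {run_id}"
        )
    return ValidationResult(True)


hallucinated = status_matches_source({"run_id": "RUN-842", "status": "passed"})
grounded = status_matches_source({"run_id": "RUN-842", "status": "failed"})
print(hallucinated)
print(grounded)
```

A function with this shape can be passed in the validators list, so the hallucinated "CI check passed" from the failure chain is stopped at the first boundary and never reaches the Gate Agent.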

External service failures

This bucket has two main states: transient faults (429, timeouts, brief 503s) and persistent dependency outages where the upstream still isn't healthy after the retry window. APIs go down, rate limits are hit, and databases occasionally time out. While traditional software also faces these issues, LLM agents are particularly vulnerable because they make a high volume of sequential API calls. A single user request might trigger multiple model generations, increasing the surface area for a network failure.

Many LLM APIs enforce separate request-rate and token-rate quotas (often expressed as RPM and TPM). Because an agent's context window grows with every step, late-stage agent turns consume more tokens than early ones. An agent can hit a token-rate limit in the middle of a complex reasoning chain, even when the raw request count is still within bounds.

Defense strategy

To handle transient network and rate-limit errors, use exponential backoff with jitter. A simple immediate retry will likely hit the same rate limit again, whereas exponential backoff gives the API time to recover. Jitter (adding randomness to retry delay) prevents the "thundering herd" problem where multiple stalled agents retry simultaneously. Only retry failures that are plausibly transient, such as 429, 502/503, or timeouts. Respect provider retry guidance such as Retry-After when it's available. Don't put deterministic failures like 400 schema errors or 401 auth problems on the same retry path.

Here's an example using the tenacity library in Python to wrap model calls with reliable retry logic. The call_llm_with_retry async function takes a list of conversational messages and a model string as inputs. It normalizes provider-specific transient failures into retryable exception classes, then retries with randomized exponential backoff. The call_model_api function stands in for your actual provider client:

defense-strategy-4.py

import logging
import tenacity

logger = logging.getLogger(__name__)

class RetryableModelError(Exception):
    pass

class RateLimitExceeded(RetryableModelError):
    pass

class UpstreamTimeout(RetryableModelError):
    pass

class UpstreamUnavailable(RetryableModelError):
    pass

def normalize_provider_error(exc: Exception) -> Exception:
    status_code = getattr(exc, "status_code", None)
    message = str(exc).lower()
    if status_code == 429 or "rate limit" in message:
        return RateLimitExceeded(str(exc))
    if isinstance(exc, TimeoutError) or "timeout" in message:
        return UpstreamTimeout(str(exc))
    if status_code in {502, 503, 504} or "temporarily unavailable" in message:
        return UpstreamUnavailable(str(exc))
    return exc

@tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_random_exponential(multiplier=1, min=1, max=30),
    retry=tenacity.retry_if_exception_type(RetryableModelError),
    before_sleep=lambda retry_state: logger.warning(
        f"Retry {retry_state.attempt_number}: {retry_state.outcome.exception()}"
    ),
)
async def call_llm_with_retry(messages: list[dict[str, str]], model: str = "primary-model"):
    """Wrap a provider call with exponential backoff plus jitter."""
    try:
        return await call_model_api(
            messages=messages,
            model=model,
            timeout_seconds=30,
        )
    except Exception as exc:
        raise normalize_provider_error(exc) from exc

Blind retries only help with transient infrastructure faults. If the model keeps making the same schema mistake, the next attempt needs better information instead of more time. Reflexion-style recovery uses feedback from the previous trial to improve the next one [5] (Reflexion: Language Agents with Verbal Reinforcement Learning, https://arxiv.org/abs/2303.11366). In production, that feedback can be much simpler than a full reflection loop: a concise validation error or tool rejection reason is often enough to change the next attempt.

Prefer typed SDK exceptions and provider response headers in production. The normalizer above keeps a message fallback only to show the classification boundary without coupling the lesson to one SDK.

Failover must preserve the payload contract

After a 429 or a dead endpoint, the runtime may route traffic to a backup model from the same provider or from a second provider. That improves availability but creates a correctness risk: the backup may return a different structured-output or tool-call shape. Strict-schema support, JSON wrappers, field names, and function arguments vary across models and providers.

Validate primary and fallback responses against one canonical schema before dispatch. Give each route an adapter that maps its native output into that schema. Otherwise a malformed or plausible-but-wrong argument can reach a tool. If the fallback can't satisfy the contract, correct or escalate the validation failure. Never forward the payload unchanged.
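
The adapter idea can be sketched as follows. Both payload shapes are hypothetical and don't describe any particular provider, and CANONICAL_TOOLS stands in for whatever schema registry the runtime already uses.

```python
import json

# Canonical contract: tool name -> exact set of argument keys.
CANONICAL_TOOLS = {"get_run": {"id"}}


def adapt_primary(raw: dict) -> dict:
    # Hypothetical primary shape: {"tool": ..., "args": {...}}
    return {"tool": raw["tool"], "args": raw["args"]}


def adapt_fallback(raw: dict) -> dict:
    # Hypothetical fallback shape: {"function": {"name": ..., "arguments": "<JSON string>"}}
    fn = raw["function"]
    return {"tool": fn["name"], "args": json.loads(fn["arguments"])}


ADAPTERS = {"primary": adapt_primary, "fallback": adapt_fallback}


def normalize(route: str, raw: dict) -> dict:
    """Map a route's native output into the canonical shape, then validate it."""
    call = ADAPTERS[route](raw)
    expected = CANONICAL_TOOLS.get(call["tool"])
    if expected is None:
        raise ValueError(f"unknown tool: {call['tool']}")
    if set(call["args"]) != expected:
        raise ValueError(
            f"{call['tool']} expects keys {sorted(expected)}, got {sorted(call['args'])}"
        )
    return call


from_primary = normalize("primary", {"tool": "get_run", "args": {"id": "RUN-842"}})
from_fallback = normalize(
    "fallback", {"function": {"name": "get_run", "arguments": '{"id": "RUN-842"}'}}
)
print(from_primary == from_fallback)
```

Because validation runs after the adapter, a fallback payload with the wrong key raises the same error a primary payload would, and the dispatcher only ever sees one shape.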

Retry, resume, or roll back?

External faults are where retry logic usually starts, but the same decision tree applies everywhere. Retries are only the right tool when the failure is transient and nothing important has committed yet. A 429, 503, or dropped connection usually fits that pattern. A poisoned context or partially completed side effect doesn't. If the agent already believes "rollback queued" even though the tool call timed out, replaying the next step can duplicate production changes or reason from false state.

Durable runtimes handle this with checkpoints rather than blind replay. LangGraph persists graph state as checkpoints keyed by thread_id and resumes from a saved super-step boundary [6] (LangGraph Persistence, https://docs.langchain.com/oss/python/langgraph/persistence). Temporal persists workflow execution state in event history and replays deterministic workflow code after failures [7] (Temporal Workflow Execution Overview, https://docs.temporal.io/workflow-execution). Both patterns let you resume from the last confirmed-good boundary instead of restarting a long run from scratch.

In practice, the rule is simple:

Retry when the failure is transient and the step is side-effect free.
Resume from a checkpoint when prior state is still valid but expensive to recompute.
Reconcile or roll back when the failure happened around a non-idempotent side effect such as triggering a rollback, mutating a feature flag, or posting an incident update.

This classifier makes the distinction executable. Notice that an uncertain write never enters the ordinary retry path, even when its transport error looks transient.

choose-recovery-path.py

def recovery_path(error: str, side_effect_status: str, checkpoint_valid: bool) -> str:
    if side_effect_status == "uncertain":
        return "reconcile before replay"
    if checkpoint_valid and error == "worker_restarted":
        return "resume checkpoint"
    if side_effect_status == "none" and error in {"429", "503", "timeout"}:
        return "retry with backoff"
    return "stop or escalate"

print("lookup timeout:", recovery_path("timeout", "none", False))
print("rollback timeout:", recovery_path("timeout", "uncertain", True))

Output

lookup timeout: retry with backoff
rollback timeout: reconcile before replay

Production tip: Pair checkpoints with idempotency keys for outbound side effects. Resume must be safe even if the previous attempt failed after the external system committed but before your agent recorded success.
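A small sketch of how an idempotency key makes that resume safe. Everything here is an assumption for illustration: `FakeDeployApi` stands in for an external system that deduplicates by key, and the key format (thread ID, step number, action) is a hypothetical choice. The key is derived only from values a checkpoint would already hold, so a resumed step reproduces it.

```python
import hashlib


class FakeDeployApi:
    """Stand-in for an external system that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self.committed: dict[str, str] = {}  # idempotency key -> rollback ID
        self.rollbacks_executed = 0

    def trigger_rollback(self, run_id: str, idempotency_key: str) -> str:
        if idempotency_key in self.committed:
            # Same logical request replayed: return the original result.
            return self.committed[idempotency_key]
        self.rollbacks_executed += 1
        rollback_id = f"RB-{self.rollbacks_executed}"
        self.committed[idempotency_key] = rollback_id
        return rollback_id


def idempotency_key(thread_id: str, step: int, action: str) -> str:
    # Built from checkpointed identifiers, so a resumed step yields the same key.
    return hashlib.sha256(f"{thread_id}:{step}:{action}".encode()).hexdigest()


api = FakeDeployApi()
key = idempotency_key("thread-1", 4, "rollback:RUN-842")

# First attempt commits, but suppose the agent crashes before recording success.
first = api.trigger_rollback("RUN-842", key)
# Resume replays the step with the same key instead of a fresh one.
second = api.trigger_rollback("RUN-842", key)

print(first == second, api.rollbacks_executed)  # True 1
```

A random key generated per attempt would defeat the purpose, because the replay would look like a new request.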

Recovery triage split after agent failure: retry transient side-effect-free faults, resume from valid checkpoint, or reconcile uncertain writes before replay. — Classify failed step first. Retry transient faults, resume durable checkpoints, and reconcile uncertain writes before any replay.

Action-contract disconnect

A dangerous failure happens when the agent's proposed operation diverges from its accepted task contract. The two common states are wrong tool choice despite a valid request and wrong argument serialization despite choosing the right tool. A model may describe a reasonable plan yet call the wrong tool or use unsupported parameters when it emits the executable payload.

Consider our deploy-status agent. Its structured intent says lookup_run for RUN-842. However, the emitted tool call might target a create_release tool or provide a different run ID, producing irrelevant results while the response text still sounds plausible.

This disconnect happens because generated explanations don't constrain generated tool payloads. An application should validate the payload against its task contract instead of asking for hidden reasoning or trusting a confident explanation.

Defense strategy

The primary defense is execution verification: compare the proposed tool call to a structured task contract before execution. A critic model can help triage ambiguous output, but deterministic tool, identifier, permission, and schema checks should block clear violations.

validate-action-contract.py

class ActionMismatch(Exception):
    pass

def validate_action_contract(
    task_contract: dict[str, object],
    tool_name: str,
    tool_args: dict[str, object],
    allowed_tools: set[str],
) -> str:
    if tool_name not in allowed_tools:
        raise ActionMismatch("tool not allowed")
    if task_contract["intent"] == "lookup_run" and tool_name != "get_run":
        raise ActionMismatch("tool does not match intent")
    if tool_args.get("id") != task_contract["run_id"]:
        raise ActionMismatch("run ID changed")
    return "admitted"

contract = {"intent": "lookup_run", "run_id": "RUN-842"}
print("lookup:", validate_action_contract(contract, "get_run", {"id": "RUN-842"}, {"get_run"}))
try:
    validate_action_contract(contract, "create_release", {}, {"get_run"})
except ActionMismatch as exc:
    print("drift:", exc)

Output

lookup: admitted
drift: tool not allowed

Self-correction with the Reflexion pattern

So far we've treated each failure as a separate problem with a separate fix. For failures that permit another attempt, structured feedback can help the model propose a better one. The Reflexion pattern, introduced by Shinn et al. (2023), gives an agent a way to reflect on feedback and refine its strategy before trying again.^{[5]Reference 5Reflexion: Language Agents with Verbal Reinforcement Learning.https://arxiv.org/abs/2303.11366}

The loop is simple: Generate -> Critique -> Refine.

Self-critique uses a doer-critic-fixer loop. The generator produces the first attempt, the critique marks what's wrong, and the next generation addresses that feedback. In an agent, the same LLM can play those roles by switching prompts.

For the deploy-status agent, a 400 Bad Request recovery looks like this:

text

[Generate] Agent: get_run({"run_id": "RUN-842"})
[Observe] System: 400 Bad Request: Missing required parameter 'id'.
[Critique] Critic prompt: "The agent used 'run_id' but the schema requires 'id'.
           The agent should check the tool schema before retrying."
[Refine] Agent: get_run({"id": "RUN-842"})
[Observe] System: 200 OK: Status failed, failing job unit-tests.

Compared with simple retry, the critique step produces new reasoning rather than a repeated attempt. The agent explicitly names what went wrong and how to fix it, which changes the probability distribution of the next generation.

Critique isn't proof: A model critique is another proposal, not proof. Use it to suggest a repair after an allowed failure, then rerun deterministic validators and source-of-truth checks before execution.

In production, you don't always need a full three-step Reflexion loop. For many failures, a single structured error message is enough. Use critique/refine for repeated or ambiguous proposals that remain safe to retry; high-stakes or potentially committed effects should stop for approval or reconciliation instead.

The circuit breaker pattern

Borrowed from microservices architecture, the circuit breaker prevents a failing component from taking down the entire system. For agents, that usually means failing fast once a tool or model endpoint crosses a known error threshold instead of letting every request discover the outage independently.

Circuit breaker state machine showing closed traffic, open fail-fast mode, cooldown, and one half-open probe that closes on success or reopens on failure. — Circuit breaker only counts dependency-health failures. After a threshold it opens and fails fast; after cooldown it allows one half-open probe, closing on success or reopening on failure.

The circuit breaker pattern operates as a state machine with three distinct states to manage failure routing:

Closed (Normal): Requests flow through to the agent/LLM. Failures are counted.
Open (Failing): Requests fail fast immediately. This prevents wasting tokens on a dependency known to be unhealthy.
Half-Open (Recovery): After a timeout, allow one trial request. If it succeeds, reset to Closed. If it fails, return to Open.

The "Half-Open" state keeps agent recovery controlled. If we switched directly from Open back to Closed after a timeout, a still-failing API would immediately receive a flood of pending agent requests (the "thundering herd" problem), potentially triggering further rate limits or blowing through your budget before the circuit trips again. A single test request checks that the service is healthy before full traffic returns.

Here's a complete implementation of this state machine. The CircuitBreaker class acts as a protective wrapper, where its call method takes an arbitrary async function and its arguments as inputs. This version gates the half-open state so only one probe can run at a time. It also assigns a generation to admitted calls. Opening the circuit or starting a new half-open recovery generation invalidates completions from older generations, so an old in-flight success can't close a circuit that newer failures opened:

the-circuit-breaker-pattern.py

import asyncio
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """Prevents cascading failures by halting requests to failing services."""
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # All calls fail fast
    HALF_OPEN = "half_open"  # Test with one call

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.state = self.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure_time = 0.0
        self._lock = asyncio.Lock()
        self._half_open_probe_in_flight = False
        self._generation = 0

    async def call(self, func, *args, **kwargs):
        """Executes the function if the circuit is closed or half-open."""
        async with self._lock:
            now = time.time()

            if self.state == self.OPEN:
                if now - self.last_failure_time > self.reset_timeout:
                    self.state = self.HALF_OPEN
                    self._half_open_probe_in_flight = False
                    self._generation += 1
                else:
                    raise CircuitOpenError("Circuit is open - failing fast")

            if self.state == self.HALF_OPEN and self._half_open_probe_in_flight:
                raise CircuitOpenError("Half-open probe already in flight")

            is_half_open_probe = self.state == self.HALF_OPEN
            if is_half_open_probe:
                self._half_open_probe_in_flight = True
            call_generation = self._generation

        try:
            result = await func(*args, **kwargs)
        except Exception as exc:
            async with self._lock:
                if self._is_dependency_failure(exc):
                    self._on_failure(call_generation, is_half_open_probe)
                elif call_generation == self._generation and is_half_open_probe:
                    self._half_open_probe_in_flight = False
            raise

        async with self._lock:
            self._on_success(call_generation, is_half_open_probe)
        return result

    def _on_success(self, call_generation: int, was_half_open_probe: bool):
        if call_generation != self._generation:
            return
        if self.state == self.HALF_OPEN and not was_half_open_probe:
            return
        self.failure_count = 0
        self.state = self.CLOSED
        self._half_open_probe_in_flight = False

    def _on_failure(self, call_generation: int, was_half_open_probe: bool):
        if call_generation != self._generation:
            return
        self.failure_count += 1
        self.last_failure_time = time.time()
        if was_half_open_probe or self.failure_count >= self.failure_threshold:
            self.state = self.OPEN
            self._half_open_probe_in_flight = False
            self._generation += 1

    def _is_dependency_failure(self, exc: Exception) -> bool:
        # Add provider-specific 429 and 5xx exceptions in production.
        return isinstance(exc, (TimeoutError, ConnectionError))

async def verify_stale_success_cannot_close_newer_open_circuit():
    breaker = CircuitBreaker(failure_threshold=1)
    old_call_started = asyncio.Event()
    release_old_call = asyncio.Event()

    async def old_success():
        old_call_started.set()
        await release_old_call.wait()
        return "old success"

    async def newer_failure():
        raise TimeoutError("dependency failed")

    old_task = asyncio.create_task(breaker.call(old_success))
    await old_call_started.wait()

    try:
        await breaker.call(newer_failure)
    except TimeoutError:
        pass

    assert breaker.state == breaker.OPEN
    release_old_call.set()
    assert await old_task == "old success"
    assert breaker.state == breaker.OPEN

asyncio.run(verify_stale_success_cannot_close_newer_open_circuit())

Count only failures that indicate dependency health, such as timeouts, connection failures, and selected 5xx responses. A 400 schema error or 401 auth failure needs correction or escalation, but it shouldn't open a dependency circuit.

The generation check handles a race that the one-probe flag alone can't prevent. Several calls may already be running while the circuit is Closed. If their newer peer failures open the circuit, a slower success from the old generation may still return its own result, but it no longer has permission to reset shared breaker state. Only a success admitted in the current Closed generation or the current Half-Open probe can close the circuit.

If you run multiple app servers, keep this state in a shared store like Redis rather than in process memory. A per-process breaker protects only the worker that observed the failures.

Fallback chains: graceful degradation

When the primary approach fails, fall back to cheaper, narrower alternatives. The chain below shows the order: try the most capable path first, validate its result, then degrade only when that path fails.

Fallback chain narrowing capability after validation or dependency failure. — Fallback narrows capability after validation or dependency failure: live tools, then read-only tools, then model-only cached context, then a static handoff instead of fake success.

This chain trades open-ended behavior for narrower, safer behavior. If the full tool-using agent fails or hits a circuit breaker, you may downgrade to a simpler agent with fewer tools, then to a direct model call with no tools, and finally to a static status message. A static fallback is deterministic, but it can't answer a live deploy or rollback question. It should state that live data is unavailable and route the user to retry or on-call escalation rather than inventing status.

The next example implements the pattern in Python. The FallbackChain class runs each strategy through its execute method, which takes the user's task string as input. It tries progressively simpler internal strategies and returns the first valid string response, or a hardcoded default if all fail:

fallback-chains-graceful-degradation.py

import logging

logger = logging.getLogger(__name__)

class FallbackChain:
    """Try multiple approaches in order of capability/cost.

    Note: run_agent and call_llm are illustrative functions representing
    your underlying execution engine.
    """

    # Pre-defined responses for last-resort fallback
    FALLBACK_RESPONSES = {
        "greeting": "Hello! I'm currently operating in degraded mode.",
        "status": "Live status lookup is currently unavailable, please try again later."
    }
    DEFAULT_RESPONSE = "Live run status is unavailable right now. Please retry or contact on-call."

    async def execute(self, task: str, requires_live_data: bool = False) -> str:
        strategies = [
            ("primary_agent", self._full_agent_execution),
            ("backup_agent", self._simplified_agent),
            ("direct_generation", self._direct_llm_call),
            ("static_fallback", self._static_fallback),
        ]

        for strategy_name, strategy in strategies:
            try:
                result = await strategy(task)
                if self._validate_result(strategy_name, result, requires_live_data):
                    return result
                logger.warning("Strategy %s returned invalid result", strategy_name)
            except Exception as e:
                logger.error("Strategy %s failed: %s", strategy_name, e)
                continue

        return self.DEFAULT_RESPONSE

    def _validate_result(
        self,
        strategy_name: str,
        result: str,
        requires_live_data: bool,
    ) -> bool:
        if not result or not result.strip():
            return False
        if requires_live_data and strategy_name in {"direct_generation", "static_fallback"}:
            return False
        return True

    async def _full_agent_execution(self, task: str) -> str:
        """Full agent with tools: most capable, most expensive."""
        # In an implementation, return only output backed by validated tool evidence.
        return "Full agent output"

    async def _simplified_agent(self, task: str) -> str:
        """Simplified agent: fewer tools, lower step limit."""
        # In an implementation, retain the same evidence checks as primary path.
        return "Simplified agent output"

    async def _direct_llm_call(self, task: str) -> str:
        """Direct LLM call without tools: cheapest, fastest."""
        # return await call_llm(task)
        return "Direct LLM output"

    async def _static_fallback(self, task: str) -> str:
        """Static response: deterministic last-resort path."""
        # classify_task is an illustrative function
        # task_type = classify_task(task)
        task_type = "unknown"
        return self.FALLBACK_RESPONSES.get(task_type, self.DEFAULT_RESPONSE)

The pinned constraints pattern

Long-context studies show a clear positional bias: models often do best when relevant information appears near the beginning or end of the context window, and noticeably worse when that information is buried in the middle.^{[8]Reference 8Lost in the Middle: How Language Models Use Long Contextshttps://arxiv.org/abs/2307.03172} In long-running agent traces, critical constraints can drift into that low-salience middle region as more tool output and scratchpad text accumulate.

The pinned constraints pattern addresses this by systematically re-injecting a very short set of essential rules at the end of every prompt. This takes advantage of recency effects to keep high-priority instructions salient regardless of conversation length. It's a recall aid, not an enforcement boundary. Authorization checks, tool allowlists, and output validation still need to live outside the model.

For example, if an agent must never disclose secrets or other sensitive data, stating this constraint once at the start of a long prompt isn't a reliable enforcement strategy. As context grows, recall of an earlier instruction can become less reliable. By appending a "pinned constraints" section to every prompt, you keep critical reminders near the current task.

the-pinned-constraints-pattern.py

class PinnedConstraintsAgent:
    """Agent that re-injects critical constraints at the end of every prompt."""

    CRITICAL_CONSTRAINTS = """
CRITICAL CONSTRAINTS (must follow):
1. Never disclose PII (Personally Identifiable Information)
2. Always confirm actions that modify data
3. If uncertain, ask for clarification rather than guessing
4. Maximum 10 tool calls per user request
"""

    def __init__(self, base_system_prompt: str):
        self.base_system_prompt = base_system_prompt

    def build_prompt(self, conversation_history: list, current_task: str) -> str:
        """Build prompt with pinned constraints at the end."""
        prompt_parts = [
            self.base_system_prompt,
            "",
            "=== CONVERSATION HISTORY ===",
            self._format_history(conversation_history),
            "",
            "=== CURRENT TASK ===",
            current_task,
            "",
            self.CRITICAL_CONSTRAINTS,  # Pinned at the end for recency bias
        ]
        return "\n".join(prompt_parts)

    def _format_history(self, history: list) -> str:
        return "\n".join([f"{msg['role']}: {msg['content']}" for msg in history[-10:]])

Production tip: Keep pinned constraints concise (3-5 bullet points max). Too many constraints dilute the recency effect. Prioritize safety, compliance, and task-critical rules only.
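Because pinning is a recall aid rather than an enforcement boundary, a limit like "maximum 10 tool calls" still needs a runtime check the model can't talk its way past. A minimal sketch, assuming a hypothetical `ToolCallBudget` that the orchestrator consults before dispatching each tool call:

```python
class ToolCallLimitExceeded(Exception):
    pass


class ToolCallBudget:
    """Runtime-side counter that holds regardless of what the prompt says."""

    def __init__(self, max_calls: int = 10) -> None:
        self.max_calls = max_calls
        self.used = 0

    def admit(self, tool_name: str) -> None:
        """Call before dispatching a tool; raises once the limit is reached."""
        if self.used >= self.max_calls:
            raise ToolCallLimitExceeded(
                f"{tool_name}: limit of {self.max_calls} tool calls reached"
            )
        self.used += 1


budget = ToolCallBudget(max_calls=2)  # small limit to keep the demo short
budget.admit("get_run")
budget.admit("get_logs")
try:
    budget.admit("get_run")
except ToolCallLimitExceeded as exc:
    print("blocked:", exc)
# blocked: get_run: limit of 2 tool calls reached
```

The pinned text and the counter express the same rule, but only the counter can block a call.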

Human-in-the-loop escalation

Not all failures can be resolved automatically. When an agent encounters high-stakes ambiguity, repeated failures, or potential safety issues, it should pause and route a bounded review packet to an authorized human reviewer.

Human-in-the-loop (HITL) architectures can intercept risky proposals when triggers are correctly enforced. They don't guarantee correctness: reviewers can be wrong, approvals can become stale, and execution can drift from the reviewed proposal. Use measurable escalation triggers and, for any later side effect, bind approval to the exact action and revalidate it at execution time.
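One way to bind approval to the exact action is to hash the proposed tool call at review time and recompute the hash at execution time. The sketch below assumes a hypothetical approval record with `decision`, `action_hash`, and `expires_at` fields, and a hypothetical `rollback` tool that takes `run_id`; none of these names come from a specific ticketing system.

```python
import hashlib
import json


def action_hash(tool_name: str, tool_args: dict) -> str:
    """Stable digest of the exact proposed action."""
    canonical = json.dumps({"tool": tool_name, "args": tool_args}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def approval_still_valid(approval: dict, tool_name: str, tool_args: dict, now: float) -> bool:
    """Revalidate at execution time: approved, same action, and not expired."""
    return (
        approval["decision"] == "approved"
        and approval["action_hash"] == action_hash(tool_name, tool_args)
        and now < approval["expires_at"]
    )


approval = {
    "decision": "approved",
    "action_hash": action_hash("rollback", {"run_id": "RUN-842"}),
    "expires_at": 1_000.0,  # illustrative timestamp
}

print(approval_still_valid(approval, "rollback", {"run_id": "RUN-842"}, now=900.0))    # True
print(approval_still_valid(approval, "rollback", {"run_id": "RUN-843"}, now=900.0))    # False: action drifted
print(approval_still_valid(approval, "rollback", {"run_id": "RUN-842"}, now=1_200.0))  # False: approval is stale
```

Sorting keys before hashing keeps the digest stable when the same arguments arrive in a different order.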

When to escalate

Escalation should be triggered by specific, measurable conditions rather than vague "uncertainty." Effective triggers include:

Verifier thresholds: When an evaluated classifier or policy score crosses a documented review threshold
Repeated failure patterns: After 3 consecutive tool call failures or validation rejections
Safety-critical actions: Before executing any operation that changes user data, moves money, or accesses sensitive systems
Flagged uncertainty: When a verifier or policy gate flags output as requiring review

Here's a sketch of that trigger logic. Checks run in priority order: safety-critical action types first, then repeated failures, then the policy score:

when-to-escalate.py

from dataclasses import dataclass
from enum import Enum

class EscalationReason(Enum):
    POLICY_REVIEW = "policy_review"
    REPEATED_FAILURES = "repeated_failures"
    SAFETY_CHECK = "safety_check"
    VALIDATION_FAILED = "validation_failed"
    MAX_STEPS_EXCEEDED = "max_steps_exceeded"

@dataclass
class EscalationRequest:
    reason: EscalationReason
    context: dict[str, object]
    priority: int  # 1 (critical) to 5 (low)
    suggested_action: str

class HumanInTheLoop:
    """Manages escalation to human operators with context preservation."""

    def __init__(self,
                 review_threshold: float = 0.7,
                 max_retries: int = 3):
        self.review_threshold = review_threshold
        self.max_retries = max_retries
        self.escalation_queue: list[EscalationRequest] = []

    def should_escalate(self,
                       agent_state: dict[str, object],
                       failure_count: int = 0,
                       policy_score: float | None = None) -> EscalationRequest | None:
        """Determine if human intervention is needed."""

        # Safety-critical actions always receive the highest-priority review.
        if agent_state.get("action_type") in ["delete", "modify", "transfer", "rollback"]:
            return EscalationRequest(
                reason=EscalationReason.SAFETY_CHECK,
                context=agent_state,
                priority=1,
                suggested_action="Approve sensitive action before execution"
            )

        # Check repeated failures
        if failure_count >= self.max_retries:
            return EscalationRequest(
                reason=EscalationReason.REPEATED_FAILURES,
                context=agent_state,
                priority=2,
                suggested_action="Review failed attempts and provide guidance"
            )

        # Check evaluated policy threshold
        if policy_score is not None and policy_score < self.review_threshold:
            return EscalationRequest(
                reason=EscalationReason.POLICY_REVIEW,
                context=agent_state,
                priority=3,
                suggested_action="Review policy-flagged output"
            )

        return None  # No escalation needed

    async def handle_escalation(self, request: EscalationRequest) -> dict[str, object]:
        """Submit to human review queue and await resolution."""
        self.escalation_queue.append(request)

        # In production, this would notify via Slack, PagerDuty, etc.
        # and wait for human response
        return {
            "status": "escalated",
            "ticket_id": f"ESC-{len(self.escalation_queue)}",
            "reason": request.reason.value,
            "priority": request.priority,
            "review_decision": await self._await_human_input(request)
        }

    async def _await_human_input(self, request: EscalationRequest) -> dict[str, str]:
        # Placeholder - production would integrate with ticketing system
        return {"decision": "pending", "action_hash": "not-approved"}

Escalation beats false success: An agent that says "I need help" is better than one that silently hallucinates a "Success" message. Design your escalation UX to make it clear when and why the agent failed, preserving user trust.

Monitoring and alerting for agent systems

Production agent systems need specialized observability beyond standard APM (Application Performance Monitoring). While latency and error rates are important, they don't capture the unique failure modes of probabilistic agents. A "healthy" agent might return 200 OK status codes while being completely stuck in a logic loop or hallucinating facts. Benchmarks such as AgentBench expose how difficult realistic multi-step agent tasks remain for strong models.^{[9]Reference 9AgentBench: Evaluating LLMs as Agentshttps://arxiv.org/abs/2308.03688}

Track agent-specific metrics to detect runs that are technically returning responses but failing to deliver valid results. Semantic entropy techniques measure uncertainty across sampled answers and can help prioritize review for open-ended generations; they don't verify a tool result or replace source-of-truth checks.^{[10]Reference 10Detecting Hallucinations in Large Language Models Using Semantic Entropyhttps://www.nature.com/articles/s41586-024-07421-0}

Agent observability metrics

Metric	Type	What it Measures	Critical Alert Threshold
Loop Rate	Derived ratio	Percentage of traces where loop detection triggered	> 5% of requests
Recovery Rate	Derived ratio	Percentage of failed trajectories rescued by validation, retry, fallback, or escalation	Drops below baseline
Fallback Rate	Derived ratio	Percentage of requests downgraded to simpler models	> 10% of requests
Step Count	Histogram	Number of tool calls per user turn	P99 (99th percentile) > 15 steps
Token Usage	Histogram	Total tokens (prompt + completion) per turn	> 80% of budget
Validation Rejection Rate	Derived ratio	Percentage of validated outputs rejected by factual or schema checks	> 2% of validated outputs

Counters track cumulative events like loops, fallbacks, and validation failures. Histograms record distributions like step counts and token usage. Validation rejection rate and recovery rate are usually computed as derived ratios from validation and incident counters rather than stored directly as gauges. A useful definition is:

\text{RecoveryRate} = \frac{\text{failed trajectories that ended safely}}{\text{failed trajectories that triggered an intervention}}

For example, if 100 deploy-status traces hit a validation error, loop breaker, timeout, or fallback path, and 72 still return a safe answer or clean escalation, the recovery rate is 72%. This metric tells you whether your defenses are helping users instead of only detecting failures.

Agent health board that groups failure pressure on the left, active protections in the middle, and safe recovery outcome on the right so operators judge whether guardrails still protect users. — Agent monitoring should connect pressure to protection and then to safe outcome. Raw counts matter less than whether failed runs still end safely.

The Prometheus client can define custom agent metrics directly. AgentMetrics sets up counters and histograms that map cleanly to the table above. Once instantiated, it produces metric objects that take labels, such as the agent name or model, so your observability stack can monitor loop rates, fallback frequency, and validation failures over time:

agent-observability-metrics.py

from prometheus_client import Counter, Histogram

class AgentMetrics:
    """Tracks specialized observability metrics for LLM agents."""
    def __init__(self):
        # Per-run distributions support P95/P99 alerting
        self.step_count = Histogram(
            "agent_steps_per_run",
            "Tool calls per agent execution",
            ["agent"],
            buckets=(1, 3, 5, 8, 10, 15, 20, 30),
        )

        # Event counters
        self.loop_detected = Counter("agent_loops_total", "Loop detections", ["agent"])
        self.fallback_triggered = Counter("agent_fallbacks_total", "Fallback triggers", ["strategy"])
        self.interventions_triggered = Counter(
            "agent_interventions_total",
            "Failure-handling interventions triggered",
            ["agent", "intervention"],
        )
        self.safe_recoveries = Counter(
            "agent_safe_recoveries_total",
            "Interventions that ended in safe answer or clean escalation",
            ["agent", "intervention"],
        )
        self.validation_failures = Counter(
            "agent_validation_failures_total",
            "Outputs rejected by validation checks",
            ["agent", "reason"],
        )
        self.validated_outputs = Counter(
            "agent_validated_outputs_total",
            "Outputs checked by validation checks",
            ["agent"],
        )

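        # Budget distributions. Explicit buckets matter here: the default
        # Histogram buckets top out at 10 (they target latencies in seconds),
        # so token counts would all land in +Inf. Values below are illustrative.
        self.token_usage = Histogram(
            "agent_tokens_per_run",
            "Tokens per execution",
            ["model"],
            buckets=(1_000, 5_000, 10_000, 25_000, 50_000, 100_000, 200_000),
        )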

Critical alerts

Treat these numbers as example starting points calibrated from your own baseline, not universal constants.

Loop Detection Rate > 5%: A high loop rate indicates the agent's prompt or tools are ambiguous, causing it to oscillate. Treat it as a code/prompt issue, not an infrastructure issue.
Fallback Rate > 10%: If the system frequently downgrades to simpler models or static responses, a primary dependency, model path, verifier, or task mix may have changed; inspect traces before assigning cause.
P99 Token Usage > Budget: If the 99th percentile of requests hits the token limit, you need summarization or a tighter sliding context window before hard stops start landing on real users.
Circuit Breaker OPEN: The configured failure threshold was crossed. Investigate the dependency, client, network path, timeout budget, and breaker sensitivity before assigning the cause.

These agent-specific metrics should be piped directly into your alerting infrastructure (like PagerDuty or Datadog) alongside standard server metrics. Catching a spike in loop detections early can be the difference between a minor service degradation and a massive, unexpected API bill at the end of the month.
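As a rough sketch of how those thresholds might be evaluated over one time window, assuming the rates have already been computed from counters. The `THRESHOLDS` mapping and metric names are hypothetical, and the numbers are the example starting points from this section rather than recommended constants.

```python
# Example starting thresholds from this section; calibrate against your own baseline.
THRESHOLDS = {
    "loop_rate": 0.05,
    "fallback_rate": 0.10,
    "validation_rejection_rate": 0.02,
    "p99_steps": 15,
}


def fired_alerts(window: dict[str, float]) -> list[str]:
    """Return the names of metrics whose windowed value exceeds its threshold."""
    return [name for name, limit in THRESHOLDS.items() if window.get(name, 0) > limit]


window = {
    "loop_rate": 0.08,
    "fallback_rate": 0.04,
    "validation_rejection_rate": 0.01,
    "p99_steps": 12,
}
print(fired_alerts(window))  # ['loop_rate']
```

In a real deployment this comparison usually lives in the alerting system's rule language rather than in application code.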

Common misconceptions

When engineers transition from deterministic software to building agentic pipelines, they often carry over assumptions that don't apply to probabilistic models. Recognizing and unlearning these patterns helps teams build resilient systems.

Common mistake: Believing you can "just retry on failure" like traditional APIs. Reality: Retrying the same hallucinating LLM call often repeats the same failure mode, or produces a different wrong answer. Effective recovery requires changing the strategy, either by modifying the prompt, switching models, or reducing the task scope.

Common mistake: Treating every timeout as safe to replay. Reality: If a side effect may already have committed, replay can duplicate work or create inconsistent state. Use idempotency keys, checkpoints, and explicit reconciliation against the source of truth.

Common mistake: Assuming that max step counts are enough to prevent loops. Reality: Step limits are a crude safety net. They don't prevent an agent from wasting 15 steps (and thousands of tokens) on a futile loop before hitting the limit. Semantic loop detection stops the bleeding early.

Common mistake: Wrapping all agent calls in generic try/except blocks. Reality: Catching all exceptions silently leads to "zombie agents" that continue operating with corrupted state. Failures should be explicit and handled by validation checks or circuit breakers, not swallowed by generic error handlers.

Practice: diagnose the failure

Here's a short exercise to test whether you can match symptoms to defenses. Read each scenario, decide which failure category it belongs to, and pick the right fix.

Scenario 1: Your deploy-status agent calls get_run({"id": "RUN-842"}) three times in a row, gets the same answer each time, and keeps going.
Scenario 2: The agent's reasoning trace says "I will search the database" but the actual tool call is create_release({}).
Scenario 3: After a rollback tool times out, the orchestrator retries and the developer receives two rollback notifications.
Scenario 4: The agent has used 120,000 tokens on a single request and your limit is 100,000.
Scenario 5: The agent invents a tool called fetch_run_secret that doesn't exist in your tool registry.

Solution sketches

Scenario 1: Infinite loop (Planning failure). The LoopBreaker would catch it on the third identical call via args_hash deduplication. You should also check why the agent isn't stopping after getting a complete answer. Is the stopping condition too vague?
Scenario 2: Action-contract disconnect (Action failure). The runtime should compare the emitted tool to the structured task contract and allowlist, blocking create_release when the accepted intent is deploy status. A critique model can add review signal, but it shouldn't replace deterministic capability checks.
Scenario 3: Cascading failure (Memory/State failure). The rollback was non-idempotent and the timeout didn't mean "nothing happened." You need idempotency keys on the rollback tool and read-after-write reconciliation before retrying. Blind replay caused a duplicate side effect.
Scenario 4: Token budget exhaustion (Memory failure). The TokenBudget class would raise BudgetExhausted at the hard limit. Better yet, you should trigger summarization at the 80% soft threshold to prevent hitting the hard limit at all.
Scenario 5: Tool call hallucination (Planning failure). A tool allowlist would reject fetch_run_secret before execution, returning structured feedback: Tool 'fetch_run_secret' not available. Available tools: get_run, get_logs, suggest_alternative.
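For Scenario 4, the soft and hard thresholds can be sketched as below. The class and exception reuse the names mentioned in the solution, but this is a simplified stand-in: the constructor arguments, the `charge` method, and its return values are assumptions for illustration.

```python
class BudgetExhausted(Exception):
    pass


class TokenBudget:
    def __init__(self, hard_limit=100_000, soft_percent=80):
        self.hard_limit = hard_limit
        # Integer arithmetic avoids float rounding at the threshold.
        self.soft_limit = hard_limit * soft_percent // 100
        self.used = 0

    def charge(self, tokens):
        """Record usage. Returns 'ok' or 'summarize'; raises before crossing the hard limit."""
        projected = self.used + tokens
        if projected > self.hard_limit:
            raise BudgetExhausted(f"{projected} tokens would exceed limit {self.hard_limit}")
        self.used = projected
        return "summarize" if self.used >= self.soft_limit else "ok"


budget = TokenBudget()
print(budget.charge(70_000))  # ok
print(budget.charge(12_000))  # summarize (82,000 is past the 80,000 soft threshold)
```

Checking the projected total before recording it means the run is stopped before the 100,000-token limit is crossed, so the 120,000-token state in the scenario cannot be reached through this path.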

Agent failure controls

Small per-step errors compound. A task of n dependent steps, each succeeding independently with probability p, succeeds with probability roughly p^n, so 95%-reliable steps complete a 50-step run only about 8% of the time. Reliability, not raw reasoning, is usually the bottleneck.
Generated failures can look successful. Build validation checks for claims and actions, not exception handlers alone.
The four failure categories (Planning, Action, Reflection, Memory) show up in practice as six concrete failure states you'll see again and again: tool misuse, loops, budget exhaustion, poisoned handoffs/state desync, external dependency faults, and action-contract drift.
Loop detection needs multiple signals: exact-call hashing, approximate intent similarity, repeated-tool heuristics, step limits, and token budgets.
Corrective feedback beats blind retry for deterministic mistakes. Reflexion-style critique can help ambiguous failures, but it doesn't authorize actions.
Circuit breakers limit repeated calls to unhealthy dependencies; they don't establish correctness.
Fallback chains degrade capability explicitly. Live-data questions must stop or escalate when fallback paths lack evidence.
Checkpointed recovery beats blind replay when state may already be poisoned or partially committed.
Pinned constraints exploit recency bias to keep must-follow rules salient in long conversations.
Human review must be bound to exact actions and revalidated at execution; escalation alone isn't a guarantee.
Agent-specific metrics such as loop rate, fallback rate, and token usage matter as much as latency and error rate.
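The compounding claim in the first point above is easy to check numerically. The sketch below assumes independent steps with the same success probability, which real agent runs only approximate. The function names are illustrative.

```python
import math


def run_success(p, n):
    """Probability that n independent steps all succeed."""
    return p ** n


def max_steps(p, target):
    """Largest n for which p**n stays at or above the target success rate."""
    return math.floor(math.log(target) / math.log(p))


print(f"{run_success(0.95, 50):.3f}")  # about 0.077
print(f"{run_success(0.98, 40):.3f}")  # about 0.446
print(max_steps(0.95, 0.5))            # 13
```

Under these assumptions, a 95%-reliable step supports only about 13 dependent steps before the whole run is more likely to fail than succeed, which is why validation and recovery at each step matter more than a marginally better model.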

Handoff to orchestration

Single-agent failure handling should feel concrete: detection, classification, retry, fallback, review, and recovery all have named contracts. Orchestration extends that discipline across planning, retrieval, execution, review, and escalation; every worker needs an owner, a handoff contract, and a shared view of state.

Mastery check

Key concepts

Retry Policies
Exponential Backoff
Fallback Chains
Loop Detection
Token Budget Limits
Checkpointed Recovery
Idempotency Keys
State Reconciliation
Circuit Breakers
Graceful Degradation
Action Contract Validation
Pinned Constraints Pattern
Human-in-the-Loop Escalation
Trust Gap

Evaluation rubric

Foundational: Implement exponential backoff with jitter for transient errors
Intermediate: Design a fallback chain degrading from agent -> LLM -> static response
Advanced: Implement loop detection with exact-call hashing plus approximate similarity checks
Advanced: Validate tool arguments using Pydantic/Zod schemas
Advanced: Calculate token usage to prevent budget exhaustion
Advanced: Differentiate transient retries from checkpoint-based recovery for poisoned or partially committed state
Advanced: Reconcile timed-out side effects against the source of truth before retrying
Advanced: Apply pinned constraints pattern to combat recency bias
Advanced: Implement human-in-the-loop escalation for high-stakes failures

Follow-up questions

Common pitfalls

Symptom: Agent retries same hallucinated tool call three times. Cause: Retry path adds time, not new information. Fix: Return tool-allowlist or schema feedback, or switch to a narrower fallback path.
Symptom: Timeout after rollback or incident notification leads to duplicate side effect. Cause: Runtime treated "no response" as "nothing happened." Fix: Use idempotency keys, reconcile against source of truth, and resume from checkpoint before any replay.
Symptom: Loop breaker fires only after large token burn. Cause: Runtime relies on max-step cap alone. Fix: Add exact-call hashes plus semantic similarity over recent intents.
Symptom: Downstream agent makes confident wrong decision from bad upstream state. Cause: Handoff trusted raw agent output. Fix: Validate each handoff and stop pipeline when source-of-truth checks fail.
Symptom: Auth errors or invalid arguments keep entering exponential backoff. Cause: Retry policy doesn't separate transient from deterministic failures. Fix: Retry only rate limits (HTTP 429), timeouts on idempotent calls, and short outages; route schema, auth, and policy failures to correction or escalation.
Symptom: Dashboards look green while users get useless replies. Cause: Team watched only latency and HTTP status. Fix: Track loop rate, recovery rate, fallback rate, validation failures, and token budget pressure.
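The retry-classification pitfall above can be sketched as a small routing table plus a jittered backoff. The failure labels, `route_failure`, and `backoff_delay` are hypothetical. The delay uses the "full jitter" variant, where each wait is drawn uniformly between zero and a capped exponential bound.

```python
import random

# Failure kinds that are worth retrying as-is (assumed labels).
TRANSIENT = {"rate_limited", "timeout_idempotent_read", "service_unavailable"}

# Deterministic failures need new information or a human, not more time.
DETERMINISTIC = {
    "schema_invalid": "correct",
    "unknown_tool": "correct",
    "auth_failed": "escalate",
    "policy_denied": "escalate",
}


def route_failure(kind):
    """Return 'retry', 'correct', or 'escalate' for a classified failure."""
    if kind in TRANSIENT:
        return "retry"
    # Unknown kinds escalate rather than silently entering backoff.
    return DETERMINISTIC.get(kind, "escalate")


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


for kind in ("rate_limited", "schema_invalid", "auth_failed"):
    print(kind, "->", route_failure(kind))
```

Defaulting unclassified failures to escalation is a deliberate choice in this sketch: a new failure kind should be looked at once before anyone decides it is safe to retry.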

Next Step

Continue to Recursive Language Models (RLM)

Recovery policies keep one agent's state and actions bounded. RLM applies those controls to a programmable long-context loop that stores large inputs outside the active window and delegates targeted sub-calls under explicit budgets.

Previous: Agent Memory & Persistence


References

ReAct: Synergizing Reasoning and Acting in Language Models.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. · 2023 · ICLR 2023

Toolformer: Language Models Can Teach Themselves to Use Tools.

Schick, T., et al. · 2023 · NeurIPS 2023

Measuring AI Ability to Complete Long Tasks

Kwa, T., West, B., Becker, J., et al. (METR) · 2025

Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Yao, S., et al. · 2024 · arXiv preprint

Reflexion: Language Agents with Verbal Reinforcement Learning.

Shinn, N., et al. · 2023

LangGraph Persistence

LangChain · 2026

Temporal Workflow Execution Overview

Temporal Technologies · 2026

Lost in the Middle: How Language Models Use Long Contexts

Liu, N.F., et al. · 2023 · TACL 2023

AgentBench: Evaluating LLMs as Agents

Liu, X., et al. · 2023

Detecting Hallucinations in Large Language Models Using Semantic Entropy

Farquhar, S., et al. · 2024 · Nature


Agent Failure & Recovery

Why is "fail transparently" a better goal than "never fail" for production agents?

Why agent failures require runtime checks

What moves from test time to runtime when software becomes agentic?

Why small per-step errors compound

An agent step succeeds 98% of the time. Why might a 40-step task still fail more than half the time?

The deploy-status agent: our running example

A taxonomy of failure

A deploy agent says "deployment approved" because an earlier CI pass was hallucinated. Which failure category is this, and what defense should catch it before the reply agent sends the message?

Worked example: the wrong parameter

Why is the second get_run({"id": "RUN-842"}) attempt different from a blind retry?

Defense strategy: schema validation

The agent called get_run({"run_id": "RUN-842"}), and the tool returned Missing required parameter 'id'. What feedback should the next retry receive?

Tool call hallucination

Defense strategy

Why should hallucinated tools be rejected before any execution attempt?

Infinite loops (the "stuck agent")

Defense strategy

Why does the loop breaker track both exact argument hashes and approximate intent text?

Why does the loop breaker append a call only after checks pass?

Token budget exhaustion

Defense strategy

An agent has a 100,000-token budget and reaches 82,000 tokens before finishing. Should the system wait for the hard limit, summarize, or retry from scratch?

Why can a 20-step agent run cost more than 20 separate one-step calls?

Cascading failures in multi-agent systems

Defense strategy

External service failures

Defense strategy

Failover must preserve the payload contract

Retry, resume, or roll back?

A rollback tool times out after the deployment controller may have committed the rollback. Why is blind retry wrong, and what should the orchestrator do instead?

Which failures are good retry candidates, and which failures need a different recovery path?

Action-contract disconnect

Defense strategy

Why is action-contract validation useful even if the model's explanation looks correct?

Self-correction with the Reflexion pattern

What makes Reflexion different from a normal retry?

When is a full Reflexion loop overkill?

The circuit breaker pattern

Why should a circuit breaker use a Half-Open state instead of jumping directly from Open back to Closed after the cooldown?

Why does a circuit breaker need shared state in a multi-server deployment?

Fallback chains: graceful degradation

In the fallback chain, why should the system return the first response that passes validation instead of trying every lower-capability strategy too?

What should each fallback stage validate before returning to the user?

The pinned constraints pattern

Pinned constraints say "Never disclose PII." Is that enough to enforce privacy?

Which constraints belong in a pinned block, and which belong in code?

Human-in-the-loop escalation

When to escalate

Name two concrete triggers that should escalate an agent run to a human reviewer.

What context should an escalation request preserve for the human reviewer?

Monitoring and alerting for agent systems

Agent observability metrics

Why is recovery rate more useful than validation-failure count alone?

Critical alerts

Common misconceptions

Practice: diagnose the failure

Solution sketches

Which scenario is most dangerous financially, and why?

Agent failure controls

Handoff to orchestration

Mastery check

Key concepts

Evaluation rubric

Follow-up questions

A deploy agent keeps changing search phrasing, but every query still asks what happened to run RUN-842. Why can exact history matching miss this, and what extra signal should you add?

The incident-update tool timed out after the provider may already have posted the message. Why is retry wrong as first move, and what should recovery do instead?

Loop rate stayed flat this week, but recovery rate fell from 82% to 46%. What does that usually mean?

Circuit breaker on retrieval API is open. When is direct-model fallback acceptable, and when should system escalate instead?

How do you handle poisoned context in a long-running session without losing all useful state?

Common pitfalls

Mastery Check

Discussion