LearnAdvanced Agents & RetrievalReAct & Plan-and-Execute

🤖HardLLM Agents & Tool Use

ReAct & Plan-and-Execute

Compare ReAct for tightly coupled tool use with Plan-and-Execute for longer workflows with explicit planning and replanning.

35 min read

Learning path

Step 116 of 158 in the full curriculum

Structured Output Generation Guardrails & Safety Filters

Structured output gave the runtime a reliable shape for one model response. Agent architectures ask the next question: who decides the next step when one response isn't enough?

Agent architectures turn a one-shot model call into a stateful system that can request tools, observe results, and continue. ReAct and plan-and-execute patterns give you two control loops for multi-step product work.

A code assistant that inspects a failed CI job needs to read the error log, check the changed files, explain the failure, and open a rollback proposal if the release broke production. A plain LLM call is passive: it produces tokens, but it doesn't execute side effects on its own. Agent runtimes bridge that gap by treating the model as a reasoning engine that can choose tools, update state, and coordinate multi-step work.

Most product paths should stay deterministic. An agent becomes useful when the next safe step depends on live evidence that can't be enumerated cheaply in advance, and when your runtime can check its effects.

Before the patterns, one reality check that frames everything below. Anthropic's guidance on building effective agents draws a line between two kinds of systems^{[1]Reference 1Building Effective Agentshttps://www.anthropic.com/research/building-effective-agents}. A workflow is a system where LLMs and tools follow predefined code paths that you wrote. An agent is a system where the model dynamically directs its own process and tool use, deciding at runtime how to reach the goal. Both are useful. The guidance is blunt: find the simplest pattern that works and add complexity only when it demonstrably improves outcomes, because agentic systems trade latency and cost for flexibility. Stripped down, an agent is "just LLMs using tools based on environmental feedback in a loop." ReAct and Plan-and-Execute are two named shapes of that loop, not a special runtime category. A fixed prompt chain or a single tool-calling call is often the right answer.

Agent loop: An agent needs more than an LLM with tools. It's a loop with state, allowed actions, observations, and runtime boundaries. Some agents also need planning or long-term memory; don't add those components until the task requires them. If the steps are known in advance, a workflow is cheaper and easier to debug.

Two ideas you have already met become the agent loop. Chain-of-Thought prompting studied how intermediate reasoning steps in demonstrations can elicit multi-step reasoning before answers.^{[2]Reference 2Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.https://arxiv.org/abs/2201.11903} Function calling lets a model request an action using arguments that your code validates and executes. An agent runtime turns these ideas into a loop: request a next move, execute an allowed tool, record an observation, and repeat. Two useful loop shapes matter most here: ReAct, which decides one step at a time, and Plan-and-Execute, which drafts a roadmap before moving.

ReAct: reasoning + acting

Why one step at a time?

The ReAct (Reason + Act) pattern is an influential agent loop. Proposed by Yao et al. (2023), it interleaves reasoning traces with actions and observations on the paper's evaluated tasks.^{[3]Reference 3ReAct: Synergizing Reasoning and Acting in Language Models.https://arxiv.org/abs/2210.03629} A reasoning-only response can't inspect a live CI failure. An acting-only policy may issue tool calls without enough task context. ReAct combines a next-step decision with fresh environmental evidence.

A code agent resolving a failed release follows the same loop. It checks the CI failure (Observation), decides the error points at a migration (Reasoning), reads the migration diff (Action), then runs a targeted test before choosing the next step. ReAct applies this observe-decide-act pattern to AI.

The paper writes this pattern as Thought, Action, and Observation. Treat Thought: as explanatory notation, not as a production API contract. OpenAI's current reasoning documentation describes reasoning tokens as hidden output tokens rather than a raw trace returned to the application.^{[4]Reference 4Reasoning modelshttps://developers.openai.com/api/docs/guides/reasoning} Store the observable state you can audit: validated tool requests, tool results, bounded decision notes when a user needs an explanation, and budget usage.

The analogy that fits best is an on-call engineer debugging from live evidence. They read one signal, form a hypothesis, run the next narrow check, and update the route from what came back. ReAct is that loop. It doesn't assume the whole incident path is known upfront. It adapts after every observation.

This visual shows that feedback loop: every observation becomes the state for the next reasoning step.

ReAct loop showing state, one tool action, one observation, and a runtime gate around execution. The loop highlights fast adaptation after each observation while prompt context grows over repeated turns. — ReAct works when the next action depends on the last observation. The runtime still owns validation, step caps, retries, and tool execution between model turns.

The ReAct loop

In practice, a user's request triggers repeated next-move decisions, tool execution, and observations until the model returns a final answer or the runtime stops the loop.

A concrete trace

This paper-style ReAct trace asks the agent to triage a failed deployment. The Thought lines make the lesson legible; an application can implement the same control loop without storing private reasoning text.

text

Question: Release 2026.06.20 failed after merging PR 1842. Determine the next safe action.

Thought: I need the failing job and error first.
Action: get_ci_failure(run_id="R1842")
Observation: migration-test failed with duplicate column "team_id".

Thought: I should inspect the migration changed by the PR.
Action: read_file(path="db/migrations/20260620_add_team_id.sql")
Observation: Migration adds team_id without IF NOT EXISTS.

Thought: Idempotency is the likely cause. I need a targeted verification.
Action: run_test(name="migration_replay")
Observation: Replaying migrations on a seeded database reproduces the duplicate-column error.

Thought: The release should not proceed until the migration is made idempotent.
Action: open_patch(path="db/migrations/20260620_add_team_id.sql", change="guard ADD COLUMN")
Observation: Patch D-2041 created and migration_replay passes.

Thought: The safe next action is to review and merge the idempotency patch before retrying deploy.
Answer: Release remains blocked. Patch D-2041 makes the migration idempotent and passes migration_replay; retry deploy after review.

Notice what happens to state as the loop runs. After the first tool call, the runtime has the goal, the action, and the observation. After another call, it has another evidence record to make available to the next model turn. If every raw result is appended, context grows with the trajectory; long tasks need summaries, external state, or another control pattern. The exact token cost depends on your prompts, tools, and summarization policy.

Building the loop in code

A functional runtime validates each requested move before it executes anything. The model-facing schema in a hosted API can enforce the shape of NextMove; this dependency-free example focuses on application responsibilities: tool allowlisting, step limits, observation storage, and a final answer grounded in those observations.

execute-typed-react-moves.py

from dataclasses import dataclass, field
from typing import Callable, Literal

MoveKind = Literal["tool", "answer"]

@dataclass(frozen=True)
class NextMove:
    kind: MoveKind
    tool: str | None = None
    arguments: dict[str, str] = field(default_factory=dict)
    answer: str | None = None

@dataclass
class RuntimeState:
    goal: str
    observations: list[str] = field(default_factory=list)
    tool_calls: int = 0

def run_moves(
    moves: list[NextMove],
    tools: dict[str, Callable[[dict[str, str]], str]],
    max_steps: int = 3,
) -> tuple[str, RuntimeState]:
    state = RuntimeState(goal="Triage failed release R1842")

    for move in moves:
        if move.kind == "answer":
            if not state.observations or not move.answer:
                raise ValueError("answer requires observed evidence")
            return move.answer, state

        if state.tool_calls >= max_steps:
            return "Stopped: tool-call budget exhausted.", state
        if move.kind != "tool" or move.tool not in tools:
            raise ValueError("requested tool is not allowed")

        result = tools[move.tool](move.arguments)
        state.observations.append(f"{move.tool}: {result}")
        state.tool_calls += 1

    return "Stopped: no final answer.", state

tools = {
    "get_ci_failure": lambda args: "migration-test duplicate column team_id",
    "read_file": lambda args: "migration adds team_id without IF NOT EXISTS",
}
moves = [
    NextMove("tool", "get_ci_failure", {"run_id": "R1842"}),
    NextMove("tool", "read_file", {"path": "db/migrations/20260620_add_team_id.sql"}),
    NextMove("answer", answer="Block deploy until the migration is made idempotent."),
]
answer, state = run_moves(moves, tools)
print("tool calls:", state.tool_calls)
print("last observation:", state.observations[-1])
print("answer:", answer)

Output

tool calls: 2
last observation: read_file: migration adds team_id without IF NOT EXISTS
answer: Block deploy until the migration is made idempotent.

The model proposes NextMove records. The runtime decides whether a tool is available, whether budget remains, and whether enough evidence exists to return an answer. No free-form private reasoning trace is needed in the audit log.

When ReAct breaks

ReAct's tight interleaving of decisions and actions creates specific failure modes worth knowing:

Infinite loops. ReAct agents can get stuck repeating the same action when a tool returns unhelpful results. If a search returns no results, the agent might search again with the identical query instead of reformulating. Enforcing a maximum step count and tracking action hashes prevents this from running indefinitely.

Context overflow. Each step may append actions, observations, and summaries to the context window. For tasks requiring many steps, that context can eventually exhaust the model's limit. Strategies include summarizing older evidence, storing full results outside the prompt, or switching to a longer-context model.

Grounding failures. ReAct's key strength is grounding reasoning in actual observations rather than the model's internal beliefs. When a tool returns misleading or incomplete data, the agent can chase a false trail. Reliable implementations include sanity checks on tool outputs before feeding them back as observations.

Interface errors. If the model's requested action doesn't satisfy the expected tool schema, the runtime must reject it. JSON mode only gives valid JSON syntax; it doesn't validate correct tool fields. Use strict tool schemas where supported and validate the resulting arguments and policy in application code.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

Self-consistency without duplicated effects

Self-consistency samples multiple reasoning paths and selects a common answer, originally for reasoning tasks with a defined final response.^{[6]Reference 6Self-Consistency Improves Chain of Thought Reasoning in Language Models.https://arxiv.org/abs/2203.11171} Applying that idea to an agent requires a boundary: candidate trajectories can inspect fixed, read-only evidence or run in a sandbox, but they shouldn't each open real pull requests, roll back services, or post incident updates. After selecting a proposal, the runtime still applies policy and executes at most one approved effect.

vote-on-read-only-recommendations.py

from collections import Counter
from typing import Protocol

class RecommendationAgent(Protocol):
    def recommend(self, evidence: dict[str, str]) -> str:
        ...

class ScriptedCandidates:
    def __init__(self, proposals: list[str]):
        self.proposals = iter(proposals)

    def recommend(self, evidence: dict[str, str]) -> str:
        assert evidence["ci_error"] == "duplicate column"
        return next(self.proposals)

def choose_proposal(agent: RecommendationAgent, evidence: dict[str, str], samples: int) -> str:
    proposals = [agent.recommend(evidence) for _ in range(samples)]
    return Counter(proposals).most_common(1)[0][0]

facts = {"ci_error": "duplicate column", "migration": "missing idempotency guard"}
candidates = ScriptedCandidates(["patch_migration", "rollback_release", "patch_migration"])
selected = choose_proposal(candidates, facts, samples=3)
print("selected proposal:", selected)
print("real effects executed during vote:", 0)

Output

selected proposal: patch_migration
real effects executed during vote: 0

Self-consistency multiplies model and read-only tool work by the number of samples. It can support a reviewable recommendation, but it isn't permission to repeat writes.

Reflexion: learning from a failed attempt

Self-consistency samples many trajectories in parallel and votes. Reflexion (Shinn et al., 2023)^{[7]Reference 7Reflexion: Language Agents with Verbal Reinforcement Learning.https://arxiv.org/abs/2303.11366} takes the opposite approach across attempts: when a trajectory fails, the agent writes a short natural-language reflection on why it failed and stores that note in memory, then retries with the reflection added to its context. The original paper calls this "verbal reinforcement learning," because the agent improves through written self-critique rather than weight updates. On the HumanEval coding benchmark, Reflexion reported 91% pass@1, above the 80% GPT-4 baseline reported in the same paper.

The mechanism fits the deployment-debugging example directly. Suppose a ReAct attempt declares a release fixed after reading one stale CI log, and an evaluator flags it as wrong because the targeted test still fails. A Reflexion-style agent records a note such as "verify the current run before proposing a deploy retry," then carries that note into the next attempt. Reflexion works best when three conditions hold: you get a clear pass or fail signal, the notes stay short and task-specific, and retries are allowed. It's a memory technique layered on top of ReAct, not a replacement control loop.

store-reflections-only-after-failure.py

from dataclasses import dataclass, field

@dataclass
class AttemptMemory:
    lessons: list[str] = field(default_factory=list)

def record_evaluated_lesson(memory: AttemptMemory, passed: bool, lesson: str) -> None:
    if not passed:
        memory.lessons.append(lesson)

def next_attempt_context(memory: AttemptMemory) -> str:
    return memory.lessons[-1] if memory.lessons else "No prior evaluated failure."

memory = AttemptMemory()
record_evaluated_lesson(memory, passed=False, lesson="Verify current CI before retrying deploy.")
record_evaluated_lesson(memory, passed=True, lesson="Ignore: successful trial.")
print("stored lessons:", len(memory.lessons))
print("next attempt reminder:", next_attempt_context(memory))

Output

stored lessons: 1
next attempt reminder: Verify current CI before retrying deploy.

The failure signal comes from an evaluator or environment check, not from the agent deciding that its own unsupported story sounds convincing.

Plan-and-Execute

Why plan first?

ReAct is useful, but it's local: it chooses the next action from the latest observation rather than from a committed global plan. For complex tasks such as "audit every failed workflow from yesterday and draft recovery actions," a ReAct agent might get lost in one flaky test and forget the overall release-readiness workflow.

If ReAct is like an on-call engineer resolving the next visible signal, Plan-and-Execute is like an incident runbook. First, the planner maps the recovery steps, then executors handle CI checks, log reads, owner lookups, patch drafting, and rollout decisions in order.

Plan-and-Execute decouples planning from execution. It's a practical runtime pattern related to plan-first prompting techniques such as Plan-and-Solve,^{[8]Reference 8Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models.https://arxiv.org/abs/2305.04091} but it doesn't imply one standard protocol. A plan may be a linear checklist or a dependency graph:

Planner: An LLM generates a multi-step plan based on the user request.
Executor: An agent (often a ReAct agent itself) executes each step of the plan.
Replanner: The system reviews the results and updates the plan if necessary.

These roles don't have to be different models. Small systems often use one model in separate planning and execution prompts, while cost-optimized systems route planning to a stronger model and execution to cheaper specialists.

This visual shows the key separation: a planner owns the global shape, executors own local work, and verifier/replanner steps keep the plan from going stale.

Plan-and-Execute flow showing one planner producing an explicit task list, parallel local executors handling bounded checks, a verifier patching only remaining work, and final synthesis after verified steps complete. — Plan-and-Execute separates global route from local work. Executors stay bounded, while verification and replanning patch only unfinished steps.

The Plan-and-Execute flow

This architecture cleanly separates planning from execution. A planner drafts the initial steps, executors work through them, and a validation loop checks whether the remaining plan still makes sense. That global plan reduces goal drift on long tasks, but it doesn't eliminate it. A weak initial plan can still send every executor in the wrong direction.

The same release, planned

Return to the R1842 example. A Plan-and-Execute agent would first emit a plan, then execute it:

Planner output

text

Check the failed CI run.
Read changed migration files.
Run a targeted migration replay test.
Decide action: patch, roll back, or escalate.
Execute the approved action and confirm.

Executor trace

text

Step 1: get_ci_failure("R1842") -> "migration-test duplicate column team_id"
Step 2: read_file("db/migrations/20260620_add_team_id.sql") -> "ADD COLUMN team_id"
Step 3: run_test("migration_replay") -> "fails before patch"
Step 4: decide_action(...) -> "patch migration idempotency"
Step 5: open_patch("db/migrations/20260620_add_team_id.sql") -> "D-2041, replay passes"

The executor for Step 4 might itself be a small ReAct agent that reasons about the inputs and picks the best action. That's a common hybrid pattern: a global planner keeps the big picture, while local ReAct loops handle individual decisions.

A useful plan records dependencies rather than pretending all five steps are sequential. In this release example, CI metadata, changed files, and deployment health can be fetched independently. The action decision must wait for all three.

run-ready-plan-steps.py

from dataclasses import dataclass

@dataclass(frozen=True)
class PlanStep:
    id: str
    action: str
    depends_on: tuple[str, ...] = ()

def ready_steps(plan: list[PlanStep], completed: set[str]) -> list[str]:
    return [
        step.id
        for step in plan
        if step.id not in completed and set(step.depends_on) <= completed
    ]

plan = [
    PlanStep("ci", "get_ci_failure"),
    PlanStep("diff", "read_changed_files"),
    PlanStep("health", "check_deploy_health"),
    PlanStep("decision", "choose_recovery", ("ci", "diff", "health")),
    PlanStep("effect", "open_patch_or_rollback", ("decision",)),
]
print("ready first:", ready_steps(plan, set()))
print("ready after facts:", ready_steps(plan, {"ci", "diff", "health"}))

Output

ready first: ['ci', 'diff', 'health']
ready after facts: ['decision']

This representation exposes concurrency without overstating it. The first three reads can run together only if the runtime supports concurrent execution and the tools don't share a conflicting resource.

Comparing the two approaches

Choosing between ReAct and Plan-and-Execute requires balancing cost, task complexity, and reliability. ReAct is useful for short, evidence-dependent decisions, but it can drift on long tasks because it lacks an explicit global plan. Plan-and-Execute gives you that plan, but it can produce brittle execution if you don't pair it with validation and replanning.

Feature	ReAct	Plan-and-Execute
Control Flow	Next move follows the latest observation	Global plan first, then localized execution
Use Case	Search, debugging, exploratory workflows, API navigation	Research, ETL, code migration, long-horizon tasks with decomposable subtasks
Failure Mode	Local loop or goal drift after many steps	Brittle initial plan or stale plan after the environment changes
Token Cost	Grows when the loop carries a large trajectory forward	Can bound each executor's context when outputs are externalized, but planning and replanning add calls
Latency	Sequential when every tool result gates the next move	Planner adds a serial step; independent executor work can run in parallel when dependencies allow
Error Recovery	Immediate pivot on the next step	Requires an explicit validation or replanning loop

Control-loop selection board that first splits work by whether live evidence changes the route, then separates agentic work into ReAct, Plan-and-Execute, or Hybrid depending on how much structure can be planned upfront. — Start with feedback. If fresh evidence can't change the next safe action, keep route in deterministic workflow code. If it can, use ReAct when structure is discovered step by step, Plan-and-Execute when route is stable upfront, and a hybrid when global structure is stable but local steps still need live adaptation.

Read it as a gate, not a spectrum. First ask whether a fresh tool result can change next safe action. If not, planner overhead buys nothing and ordinary workflow code is better. If yes, then ask how much route is known upfront: ReAct for discovered paths, Plan-and-Execute for stable decomposition, hybrid when both appear in same task.

When to use each

Use ReAct when:

The environment provides immediate feedback after each action
Task length is bounded by an explicit runtime budget
You need to ground decisions in live data (search, database lookups)
The optimal path isn't predictable upfront

Use Plan-and-Execute when:

The task has a clear overall structure (audit failed workflows, draft a recovery report)
Steps have dependencies that benefit from upfront sequencing
You can parallelize independent sub-tasks
Executor steps are well-scoped and locally checkable (running tests, scraping pages, querying APIs with known schemas)

The brittleness problem

Plan-and-Execute's main weakness is the brittleness of the initial plan. If the planner misinterprets the user's intent, the entire execution pipeline is misaligned. Worse, downstream executor steps often depend on outputs from earlier steps, so a wrong assumption in Step 1 can invalidate Steps 2 through N. A ReAct loop gets an earlier opportunity to react to new evidence, but it can still repeat a bad decision without runtime checks.

Production implementations address this by:

Constraining the planner's output format: Use structured prompts that limit the planner to a fixed set of step templates rather than free-form generation.
Adding a validation step: After the planner generates the initial plan, a second pass checks that each step is achievable and that prerequisites are satisfied.
Triggering replanning early: Rather than waiting for an executor to fail, check intermediate outputs against the original goal and replan if the delta exceeds a threshold.

When evidence breaks an assumption, patch only unfinished work. The completed CI query remains an observation; a revised plan shouldn't issue the same write again.

patch-only-unfinished-plan-steps.py

from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    id: str
    action: str

def patch_remaining(steps: list[Step], completed: set[str], ci_api_down: bool) -> list[Step]:
    remaining = [step for step in steps if step.id not in completed]
    if not ci_api_down:
        return remaining
    return [
        Step("cached_logs", "read_last_known_ci_logs"),
        Step("mark_deferred", "flag_runs_for_retry"),
        *[step for step in remaining if step.id not in {"fetch_ci_logs", "close_incident"}],
    ]

original = [
    Step("load_runs", "query_failed_workflows"),
    Step("fetch_ci_logs", "fetch_ci_logs"),
    Step("close_incident", "close_recovered_incidents"),
    Step("summary", "draft_summary"),
]
revised = patch_remaining(original, {"load_runs"}, ci_api_down=True)
print("completed kept:", ["load_runs"])
print("remaining actions:", [step.action for step in revised])

Output

completed kept: ['load_runs']
remaining actions: ['read_last_known_ci_logs', 'flag_runs_for_retry', 'draft_summary']

How tool calls flow

Calling a tool is shorthand. The side effect still happens in your runtime (your Python or TypeScript code). At the API boundary, the model may emit raw JSON, a structured tool-call object, or another structured output rather than plain text.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs} Your application is the part that validates arguments, executes the external call, and feeds the result back into the next model turn.

To make this work, the LLM needs to know exactly what tools are available and how to use them. In practice, the runtime passes structured tool definitions either in the prompt or through the provider's tool-calling API. Those definitions are usually JSON-schema-like rather than the full JSON Schema spec, and the model uses the field descriptions, enums, and required keys to construct a valid request.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

Tool use connects model decisions to fresh, authorized evidence and controlled effects. A CI-status tool can return the latest failed step; a write tool can open a rollback proposal only after application policy approves it.

Tool definition

Tools need an argument contract so the LLM knows the allowed request shape. Tool-calling APIs can expose such schemas directly to the model. In a supported strict schema mode, the provider enforces the declared structure and types. The application still validates authorization, resource existence, semantic constraints, and policy, including whether this actor may inspect the requested CI run.

This example deliberately leaves include_logs optional because it's an application-level schema. If you submit a schema in OpenAI strict mode, every property must appear in required; represent a conceptually optional field with a nullable type instead.^{[5]Reference 5Structured outputshttps://developers.openai.com/api/docs/guides/structured-outputs}

tool-definition.py

ci_status_tool_schema = {
    "name": "get_ci_status",
    "description": "Get the current status for a CI run",
    "parameters": {
        "type": "object",
        "properties": {
            "run_id": {
                "type": "string",
                "description": "The CI run identifier"
            },
            "include_logs": {
                "type": "boolean"
            }
        },
        "required": ["run_id"],
        "additionalProperties": False
    }
}

def execute_ci_lookup(arguments: dict[str, object], authorized_runs: set[str]) -> str:
    allowed_keys = {"run_id", "include_logs"}
    if "run_id" not in arguments or set(arguments) - allowed_keys:
        return "reject: invalid tool arguments"

    run_id = arguments["run_id"]
    include_logs = arguments.get("include_logs", False)
    if not isinstance(run_id, str) or not isinstance(include_logs, bool):
        return "reject: invalid tool arguments"
    if run_id not in authorized_runs:
        return "deny: run not authorized for actor"
    return f"allow: return status for {run_id}"

print(execute_ci_lookup({"run_id": "R1842"}, {"R1842"}))
print(execute_ci_lookup({"run_id": "R1842", "include_logs": "yes"}, {"R1842"}))
print(execute_ci_lookup({"run_id": "R9999"}, {"R1842"}))

Output

allow: return status for R1842
reject: invalid tool arguments
deny: run not authorized for actor

The execution loop

The underlying mechanics of "Tool Use" involve a hidden round-trip handled by your application:

User: "Why did CI run R1842 fail?"
LLM: Returns a tool call record or JSON such as {"tool": "get_ci_status", "args": {"run_id": "R1842"}}
Runtime: Pauses generation. Parses the tool call payload. Calls API. Gets "failed: migration-test duplicate column".
Runtime: Feeds the result back as a tool-result message containing "failed: migration-test duplicate column"
LLM: "Run R1842 failed in migration-test because the migration re-adds an existing column."

This hidden round-trip relies entirely on your application code. The LLM doesn't execute the API request itself; it merely generates the structured request representing the intended call. The host application needs to securely execute the external call, manage connection timeouts, handle authentication, and then format the response back into a format that the LLM can ingest for its next reasoning step.

Common mistake: Treating schema validation as sufficient. Constrained outputs reduce syntax errors, but your runtime still needs to handle semantically bad arguments, missing auth, and tool timeouts.

A write needs another boundary: retrying the same agent turn must not create duplicate effects. Give each intended effect an idempotency key owned by your application, not invented afresh on every model retry.

make-agent-writes-idempotent.py

def open_rollback_once(
    release_id: str,
    idempotency_key: str,
    applied: dict[str, str],
) -> str:
    if idempotency_key in applied:
        return f"replay: {applied[idempotency_key]}"
    proposal_id = f"RB-{len(applied) + 2041}"
    applied[idempotency_key] = proposal_id
    return f"created: {proposal_id} for {release_id}"

effects: dict[str, str] = {}
key = "approve-rollback:release-2026-06-20:policy-v3"
print(open_rollback_once("release-2026-06-20", key, effects))
print(open_rollback_once("release-2026-06-20", key, effects))
print("rollback proposals created:", len(effects))

Output

created: RB-2041 for release-2026-06-20
replay: RB-2041
rollback proposals created: 1

When agents break

An unconstrained loop can spend far beyond the intended request budget or issue duplicate writes. Production systems need runtime controls before an agent is allowed to affect deployments, repositories, or incident communications.

Agents operate in dynamic environments where external state can change between steps. When a tool returns an unexpected format, or an API call times out, a naive agent might blindly retry the exact same action or invent a response. Because an observation influences later moves, errors can compound. A single unsupported conclusion can derail the execution plan.

To build reliable agents, engineers need strong guardrails at the runtime layer. This means treating the LLM as an unreliable sub-component rather than a deterministic program. You need to validate all structured outputs, enforce hard step and token budgets, and return actionable error messages back to the model when a failure occurs.

enforce-read-and-write-budgets.py

from dataclasses import dataclass

@dataclass
class Budget:
    reads_left: int
    writes_left: int

def authorize_action(kind: str, budget: Budget) -> str:
    if kind == "read" and budget.reads_left > 0:
        budget.reads_left -= 1
        return "allow read"
    if kind == "write" and budget.writes_left > 0:
        budget.writes_left -= 1
        return "allow write"
    return f"stop: {kind} budget exhausted"

budget = Budget(reads_left=2, writes_left=1)
print(authorize_action("read", budget))
print(authorize_action("write", budget))
print(authorize_action("write", budget))

Output

allow read
allow write
stop: write budget exhausted

Failure Mode	Symptom	Cause	Fix
Infinite Loops	Agent repeats the same action (e.g., `search("R1842 failure")`) forever.	Tool returns an unhelpful result and the agent doesn't reformulate.	Cycle Detection: Detect repeated recent patterns without progress, then stop or change strategy.
Hallucinated Tools	Agent calls `VideoGenerator()` when no such tool exists.	Model invents a tool name that wasn't in the allowlist.	Allowlisted Tools: Reject unknown tool names before execution and return a bounded error observation.
Context Overflow	Conversation history exceeds token limit.	Every step appends raw observations or large summaries.	External State + Summary: Keep authoritative results outside the prompt and pass a bounded summary plus recent evidence.
Goal Drift	Agent forgets the original user intent after many steps.	Long trajectory pushes the original query out of the model's attention window.	Periodic Goal Restatement: Inject the original user query or a compact goal summary back into the next model call every K steps.
Interface Errors	Requested action fails its tool schema.	Model emits missing, invalid, or unsupported arguments.	Strict Tool Contract: Use provider strict schemas when available, then validate arguments and policy in runtime code.
Brittle Plans	Plan-and-Execute planner misinterprets intent, cascading failures through all steps.	Planner made a wrong assumption at T=0 and executors blindly followed it.	Plan Validation: Add a second-pass check that each plan step is achievable; trigger replanning early rather than waiting for executor failure.

Detecting cycles in code

To prevent loops, track normalized actions in a short recent window. The detector below catches repeated single actions and short alternating patterns. A production detector also needs progress signals, because an agent can repeat a valid read while receiving new pages of results.

detecting-cycles-in-code.py

def normalized(call: dict[str, object]) -> tuple[str, tuple[tuple[str, str], ...]]:
    args = call.get("args", {})
    assert isinstance(args, dict)
    return str(call["tool"]), tuple(sorted((str(k), str(v)) for k, v in args.items()))

def repeated_recent_pattern(history: list[dict[str, object]], max_period: int = 3) -> bool:
    actions = [normalized(call) for call in history]
    for period in range(1, min(max_period, len(actions) // 2) + 1):
        if actions[-period:] == actions[-2 * period:-period]:
            return True
    return False

same = [
    {"tool": "get_ci_failure", "args": {"run_id": "R1842"}},
    {"tool": "get_ci_failure", "args": {"run_id": "R1842"}},
]
alternating = [
    {"tool": "get_ci_failure", "args": {"run_id": "R1842"}},
    {"tool": "read_file", "args": {"path": "db/migrations/20260620_add_team_id.sql"}},
    {"tool": "get_ci_failure", "args": {"run_id": "R1842"}},
    {"tool": "read_file", "args": {"path": "db/migrations/20260620_add_team_id.sql"}},
]
progress = [
    {"tool": "get_ci_failure", "args": {"run_id": "R1842"}},
    {"tool": "read_file", "args": {"path": "db/migrations/20260620_add_team_id.sql"}},
]
print("same-action loop:", repeated_recent_pattern(same))
print("alternating loop:", repeated_recent_pattern(alternating))
print("progress loop:", repeated_recent_pattern(progress))

Output

same-action loop: True
alternating loop: True
progress loop: False

Try it yourself

The partial ReAct trace below comes from a different release. Three steps are shown, but the fourth step contains a mistake. Read the trace, identify what goes wrong, and decide what the agent should have done instead.

text

Question: Release R2099 failed in production after a config change.

Thought: I need the alert and affected service.
Action: get_alert(alert_id="A2099")
Observation: p95 latency is high in us-east for api-gateway.

Thought: I should check whether the release reached all regions.
Action: get_deploy_state(release_id="R2099")
Observation: us-east runs commit abc123; eu-west runs previous commit def456.

Thought: eu-west is healthy on an older commit, so I will close the incident as rollback complete.
Action: close_incident(alert_id="A2099", resolution="rollback complete")
Observation: Incident closed while us-east latency remains high.

Thought: ???

What went wrong? The agent compared a healthy region on an older commit with an unhealthy region on the new commit, but never verified that the affected region was rolled back or recovered. A different region's health doesn't prove the incident is resolved. The agent should have checked the service health in us-east after rollback or kept the incident open.

The fix: Add an enforced verify_region_recovered precondition before the close action. A prompt reminder is useful context, but it can't block a write when the model ignores it.

Assume the monitoring system returns current latency for the affected region. The runtime can block closure until the target region meets its recovery threshold.

enforce-region-recovery-before-close.py

def incident_action(region_latency_ms: int, max_p95_ms: int) -> str:
    if region_latency_ms <= max_p95_ms:
        return "eligible for reviewed closure"
    return "keep incident open"

observed = {"us-east": 940, "eu-west": 120}
affected_region = "us-east"
print("healthy elsewhere:", observed["eu-west"] <= 200)
print("affected recovered:", observed[affected_region] <= 200)
print("action:", incident_action(observed[affected_region], max_p95_ms=200))

Output

healthy elsewhere: True
affected recovered: False
action: keep incident open

Agent architecture choices

Start with the simplest pattern. An agent is "just LLMs using tools based on environmental feedback in a loop"^{[1]Reference 1Building Effective Agentshttps://www.anthropic.com/research/building-effective-agents}. If the steps are known in advance, a fixed workflow or a single tool-calling call beats an autonomous loop on cost, latency, and debuggability. Reach for ReAct or Plan-and-Execute only when the next step genuinely depends on what the model observes.
ReAct fits short, interactive tasks whose next move depends on fresh evidence. In practice, start with a small tool-calling loop and add orchestration only when evaluations justify it.^{[1]Reference 1Building Effective Agentshttps://www.anthropic.com/research/building-effective-agents}
Plan-and-Execute fits work with a stable global structure and checkable local steps. It separates planning from execution while requiring validation and replanning.
Reflexion^{[7]Reference 7Reflexion: Language Agents with Verbal Reinforcement Learning.https://arxiv.org/abs/2303.11366} adds a memory of written lessons from failed attempts on top of a ReAct loop. It's a refinement, not a separate control architecture.
Multi-agent systems can use a Plan-and-Execute shape when planner, executor, and verifier roles become separate workers. These control-loop terms set up later orchestration work.
Memory matters for both architectures. ReAct trajectories grow one observation at a time, while Plan-and-Execute systems need somewhere to store intermediate outputs between phases. A dedicated article on agent memory and persistence follows later in the path.
Observability needs owned state. Log validated tool calls, planner outputs, redacted observations, approvals, retries, and budget usage. Don't make raw chain-of-thought your audit artifact.

Mastery check

Key concepts

ReAct Pattern
Plan-and-Execute
Chain-of-Thought
Tool Use
Self-Consistency
Reflexion
Cycle Detection
Replanning

Evaluation rubric

Foundational: Implement a ReAct-style loop with validated actions, observations, and runtime-owned budgets
Intermediate: Design a Plan-and-Execute architecture with distinct Planner and Executor roles
Advanced: Compare the latency, token-cost, and recovery trade-offs between ReAct and Plan-and-Execute
Advanced: Explain how localized executor context and external memory reduce prompt growth
Advanced: Implement failure recovery mechanisms like replanning, cycle detection, and step-limit guardrails

Follow-up questions

Common pitfalls

Symptom: a ReAct agent gets lost in long audits or migrations. Cause: the task needs a stable global plan, but the loop only sees the next local step. Fix: switch to Plan-and-Execute or use a planner plus local ReAct executors.
Symptom: tool calls repeat until the budget is gone. Cause: loop detection and hard step limits live only in the prompt, not in runtime code. Fix: enforce step caps, cycle detection, and timeout budgets outside the model.
Symptom: the agent gets slower and forgets earlier intent after many turns. Cause: every raw observation and intermediate update was appended forever. Fix: retain authoritative state outside the prompt, summarize older evidence, restate the goal periodically, and keep only recent detail in context.
Symptom: every executor follows a bad plan faithfully. Cause: the initial Plan-and-Execute plan was treated as truth instead of a draft. Fix: validate the plan early and patch the remaining plan when new evidence breaks old assumptions.

Next Step

Continue to Guardrails & Safety Filters

There, you'll master layered guardrails for production LLM systems, including <span data-glossary="prompt-injection">prompt injection</span> defense, PII controls, structured output constraints, and policy-driven enforcement.

PreviousStructured Output Generation

Share this article

X Facebook LinkedIn Bluesky Reddit Hacker News Email

References

Building Effective Agents

Anthropic · 2024

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.

Wei, J., et al. · 2022 · NeurIPS

ReAct: Synergizing Reasoning and Acting in Language Models.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. · 2023 · ICLR 2023

Reasoning models

OpenAI · 2026

Structured outputs

OpenAI · 2024

Self-Consistency Improves Chain of Thought Reasoning in Language Models.

Wang, X., et al. · 2022

Reflexion: Language Agents with Verbal Reinforcement Learning.

Shinn, N., et al. · 2023

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models.

Wang, L., et al. · 2023

Discussion

Questions and insights from fellow learners.

Discussion loads when you reach this section.

ReAct & Plan-and-Execute

Why is "agent = LLM + tools" an incomplete mental model?

What makes the deployment-debugging assistant an agent instead of a single tool call?

You need to summarize a changelog, translate the summary, then post it to a fixed release channel. The three steps never change. Workflow or agent?

ReAct: reasoning + acting

Why one step at a time?

Why is ReAct a good fit when the failure state is uncertain?

The ReAct loop

A concrete trace

In the R1842 trace, why does ReAct wait for the CI observation before deciding whether to patch or roll back?

What state must the runtime preserve between ReAct turns?

Building the loop in code

In the minimal Python loop, which parts are deterministic software and which part is probabilistic?

When ReAct breaks

A ReAct agent calls get_ci_failure("R1842") three times and receives the same log each time. Which failure mode is this, and what should the runtime do?

What is the difference between a parsing error and a grounding failure?

Self-consistency without duplicated effects

Why is self-consistency risky for a ReAct agent that can open PRs, roll back releases, or page teams?

When can self-consistency improve ReAct without creating production risk?

Reflexion: learning from a failed attempt

How is Reflexion different from self-consistency, and what does it require to work?

Plan-and-Execute

Why plan first?

What is the planner responsible for, and what should it avoid doing?

The Plan-and-Execute flow

The same release, planned

Planner output

Executor trace

Why might Step 4, decide_action(...), be its own ReAct loop?

Comparing the two approaches

You need to audit all failed workflows from yesterday, group them by failure mode, draft recovery actions, and send one summary. Which architecture fits better, and why?

Why doesn't Plan-and-Execute automatically reduce latency?

When to use each

What question should you ask before choosing Plan-and-Execute?

The brittleness problem

An executor discovers that the CI API is down, but the original plan still says "fetch CI logs for every failed workflow." What should happen next?

What should a replanner patch: the whole plan or the remaining plan?

How tool calls flow

Tool definition

The execution loop

The model emits a valid tool call schema: get_ci_status({"run_id": "R1842"}). Name two problems that can still happen after schema validation passes.

Why is "the model called a tool" technically imprecise?

When agents break

Why are step caps and budgets runtime controls rather than prompt controls?

Which failure mode is characteristic of Plan-and-Execute, and which one is common in ReAct?

Detecting cycles in code

Why is even the improved cycle detector above incomplete?

Try it yourself

What invariant was missing before close_incident?

Agent architecture choices

Mastery check

Key concepts

Evaluation rubric

Follow-up questions

When can Plan-and-Execute fit better than ReAct?

How do you prevent agents from entering infinite loops?

How should agents handle conflicting information from different tools?

What is the role of 'reflection' in reliable agent architectures?

Common pitfalls

Mastery Check

Discussion