Compare ReAct for tightly coupled tool use with Plan-and-Execute for longer workflows with explicit planning and replanning.
Agent architectures turn a one-shot model call into a stateful system that can request tools, observe results, and continue. This chapter compares ReAct and plan-and-execute patterns so you can choose a control loop for multi-step product work.
Imagine asking a smart assistant to not just draft a reply about a delayed shipment, but to check the order record, request a carrier update, and create a replacement-shipment task if the package is lost. A plain LLM call is passive: it produces text, but it doesn't execute side effects on its own. Agent runtimes bridge that gap by treating the model as a reasoning engine that can choose tools, update state, and coordinate multi-step work.
Most product paths should stay deterministic. An agent becomes useful when the next safe step depends on live evidence that can't be enumerated cheaply in advance, and when your runtime can check its effects.
Before the patterns, one reality check that frames everything below. Anthropic's guidance on building effective agents draws a line between two kinds of systems[1]. A workflow is a system where LLMs and tools follow predefined code paths that you wrote. An agent is a system where the model dynamically directs its own process and tool use, deciding at runtime how to reach the goal. Both are useful. The guidance is blunt: find the simplest pattern that works and add complexity only when it demonstrably improves outcomes, because agentic systems trade latency and cost for flexibility. Stripped down, an agent is "just LLMs using tools based on environmental feedback in a loop." ReAct and Plan-and-Execute are two named shapes of that loop, not magic. A fixed prompt chain or a single tool-calling call is often the right answer.
Key insight: An agent isn't just an LLM with tools. It's a loop with state, allowed actions, observations, and runtime boundaries. Some agents also need planning or long-term memory; don't add those components until the task requires them. If the steps are known in advance, a workflow is cheaper and easier to debug.
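To make the workflow side of that distinction concrete, here is a minimal sketch of a fixed code path for the delayed-shipment case. The helper names lookup_tracking and draft_reply, the 24-hour staleness threshold, and the returned fields are hypothetical stand-ins rather than any real API. The draft_reply stub marks the single place where a model call would go.

```python
def lookup_tracking(order_id: str) -> dict[str, object]:
    # Hypothetical stub standing in for a real carrier lookup.
    return {"order_id": order_id, "status": "in_transit", "hours_since_scan": 48}

def draft_reply(facts: dict[str, object]) -> str:
    # Hypothetical stub marking where one model call would draft the text.
    return f"Order {facts['order_id']} is {facts['status']}; we are checking with the carrier."

def delayed_order_workflow(order_id: str, stale_after_hours: int = 24) -> list[str]:
    """Fixed path: the code, not the model, decides which step runs next."""
    steps = ["lookup_tracking"]
    facts = lookup_tracking(order_id)
    if facts["hours_since_scan"] > stale_after_hours:
        # The branch condition is written in advance by the developer.
        steps.append("open_carrier_inquiry")
    steps.append("draft_reply")
    draft_reply(facts)
    return steps

print(delayed_order_workflow("A102"))
```

Every branch here is visible in the source, which is what makes a workflow cheaper to test than a loop whose next step is chosen at runtime.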
This chapter builds on two ideas you have already met. Chain-of-Thought prompting studied visible intermediate reasoning traces before answers.[2] Tool calling lets a model request an action using arguments that your code validates and executes. An agent runtime turns these ideas into a loop: request a next move, execute an allowed tool, record an observation, and repeat. We will look at two useful loop shapes: ReAct, which decides one step at a time, and Plan-and-Execute, which drafts a roadmap before moving.
The ReAct (Reason + Act) pattern is an influential agent loop. Proposed by Yao et al. (2023), it interleaves reasoning with actions and observations on the paper's evaluated tasks.[3] A reasoning-only response can't check a live carrier scan. An acting-only policy may issue tool calls without enough task context. ReAct combines a next-step decision with fresh environmental evidence.
Think of a fulfillment agent resolving a delayed order. It checks tracking (Observation), decides the carrier scan is stale (Reasoning), opens a carrier inquiry (Action), then reads the new status before choosing the next step. ReAct applies this same observe-decide-act loop to AI.
The paper writes this pattern as Thought, Action, and Observation. Treat Thought: as explanatory notation, not as a production API contract. OpenAI's current reasoning documentation describes reasoning tokens as hidden output tokens rather than a raw trace returned to the application.[4] Store the observable state you can audit: validated tool requests, tool results, bounded decision notes when a user needs an explanation, and budget usage.
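One way to picture that auditable state is a small record per step. The sketch below is an assumption about shape, not a standard: the AuditRecord name, its fields, and the 120-character cap on decision notes are all illustrative choices.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class AuditRecord:
    step: int
    tool: str
    arguments: dict[str, str]
    result: str
    decision_note: str  # short explanation for a user, not a raw reasoning trace
    tool_calls_used: int

def to_log_line(record: AuditRecord, max_note_chars: int = 120) -> str:
    payload = asdict(record)
    # Keep the note bounded so the log can't grow into a free-form trace.
    payload["decision_note"] = payload["decision_note"][:max_note_chars]
    return json.dumps(payload, sort_keys=True)

record = AuditRecord(
    step=1,
    tool="get_tracking",
    arguments={"order_id": "A102"},
    result="in_transit, last scan 48h ago",
    decision_note="Carrier scan is stale; checking inventory next.",
    tool_calls_used=1,
)
line = to_log_line(record)
print(line)
```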
The analogy that fits best is a warehouse picker who doesn't have a complete pick list. They walk to the first bin, read the label, decide what to grab, and only then plan the next aisle based on what they just learned. ReAct is that picker. It doesn't assume it knows the whole warehouse layout. It adapts after every shelf.
The figure below shows that feedback loop: every observation becomes the state for the next reasoning step.
In practice, a user's request triggers repeated next-move decisions, tool execution, and observations until the model returns a final answer or the runtime stops the loop.
Here is a paper-style ReAct trace where the agent must resolve a delayed order. The Thought lines make the lesson legible; an application can implement the same control loop without storing private reasoning text.
```text
Question: Order A102 is delayed. The customer needs it by Friday.

Thought: I need to check the current carrier status first.
Action: get_tracking(order_id="A102")
Observation: Carrier scan shows "in_transit", last updated 48 hours ago.

Thought: The scan is stale. I should check inventory to see if we can reship.
Action: check_inventory(sku="A102-SKU")
Observation: 12 units in warehouse WH-7.

Thought: We have stock. I need the customer's shipping preference.
Action: get_customer_profile(order_id="A102")
Observation: Customer selected "standard shipping" but noted "needed by Friday" in comments.

Thought: Standard won't arrive by Friday. Express shipping from WH-7 would arrive Thursday.
Action: create_reshipment(order_id="A102", warehouse="WH-7", shipping="express")
Observation: Reshipment R-2041 created. ETA Thursday.

Thought: The customer will receive the package on time.
Answer: Order A102 has been reshipped from warehouse WH-7 with express shipping. New tracking: R-2041. ETA Thursday.
```

Notice what happens to state as the loop runs. After the first tool call, the runtime has the goal, the action, and the observation. After another call, it has another evidence record to make available to the next model turn. If every raw result is appended, context grows with the trajectory; long tasks need summaries, external state, or another control pattern. The exact token cost depends on your prompts, tools, and summarization policy.
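A rough sketch of the external-state option follows. It assumes a hypothetical build_context helper and an in-memory dict as the store. Truncating to a preview is only a stand-in for real summarization, which would usually need its own model call or a domain-specific reducer.

```python
def build_context(
    goal: str,
    observations: list[tuple[str, str]],
    store: dict[str, str],
    keep_recent: int = 2,
    preview_chars: int = 24,
) -> list[str]:
    """Keep recent observations verbatim; move older ones out of the prompt."""
    lines = [f"goal: {goal}"]
    cutoff = max(len(observations) - keep_recent, 0)
    for index, (tool, result) in enumerate(observations):
        if index < cutoff:
            key = f"obs-{index}"
            store[key] = result  # the full result stays outside the prompt
            lines.append(f"{tool}: {result[:preview_chars]}... (full result in {key})")
        else:
            lines.append(f"{tool}: {result}")
    return lines

external_store: dict[str, str] = {}
history = [
    ("get_tracking", "in_transit, last updated 48 hours ago, origin hub scan only"),
    ("check_inventory", "12 units in warehouse WH-7"),
    ("get_customer_profile", "standard shipping, needed by Friday"),
]
context = build_context("Resolve delayed order A102", history, external_store)
for line in context:
    print(line)
```

The prompt stays bounded by keep_recent and preview_chars, while the store keeps the authoritative result for audit or later retrieval.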
A functional runtime validates each requested move before it executes anything. The model-facing schema in a hosted API can enforce the shape of NextMove; this dependency-free example focuses on application responsibilities: tool allowlisting, step limits, observation storage, and a final answer grounded in those observations.
```python
from dataclasses import dataclass, field
from typing import Callable, Literal

MoveKind = Literal["tool", "answer"]

@dataclass(frozen=True)
class NextMove:
    kind: MoveKind
    tool: str | None = None
    arguments: dict[str, str] = field(default_factory=dict)
    answer: str | None = None

@dataclass
class RuntimeState:
    goal: str
    observations: list[str] = field(default_factory=list)
    tool_calls: int = 0

def run_moves(
    moves: list[NextMove],
    tools: dict[str, Callable[[dict[str, str]], str]],
    max_steps: int = 3,
) -> tuple[str, RuntimeState]:
    state = RuntimeState(goal="Resolve delayed order A102 before Friday")

    for move in moves:
        if move.kind == "answer":
            if not state.observations or not move.answer:
                raise ValueError("answer requires observed evidence")
            return move.answer, state

        if state.tool_calls >= max_steps:
            return "Stopped: tool-call budget exhausted.", state
        if move.kind != "tool" or move.tool not in tools:
            raise ValueError("requested tool is not allowed")

        result = tools[move.tool](move.arguments)
        state.observations.append(f"{move.tool}: {result}")
        state.tool_calls += 1

    return "Stopped: no final answer.", state

tools = {
    "get_tracking": lambda args: "scan stale for 48h",
    "check_inventory": lambda args: "12 units available in WH-7",
}
moves = [
    NextMove("tool", "get_tracking", {"order_id": "A102"}),
    NextMove("tool", "check_inventory", {"sku": "A102-SKU"}),
    NextMove("answer", answer="Eligible for a reviewed express reshipment."),
]
answer, state = run_moves(moves, tools)
print("tool calls:", state.tool_calls)
print("last observation:", state.observations[-1])
print("answer:", answer)
```

```text
tool calls: 2
last observation: check_inventory: 12 units available in WH-7
answer: Eligible for a reviewed express reshipment.
```

The model proposes NextMove records. The runtime decides whether a tool is available, whether budget remains, and whether enough evidence exists to return an answer. No free-form private reasoning trace is needed in the audit log.
ReAct's tight interleaving of decisions and actions creates specific failure modes worth knowing:
Infinite loops. ReAct agents can get stuck repeating the same action when a tool returns unhelpful results. If a search returns no results, the agent might search again with the identical query instead of reformulating. Enforcing a maximum step count and tracking action hashes prevents this from running indefinitely.
Context overflow. Each step may append actions, observations, and summaries to the context window. For tasks requiring many steps, that context can eventually exhaust the model's limit. Strategies include summarizing older evidence, storing full results outside the prompt, or switching to a longer-context model.
Grounding failures. ReAct's key strength is grounding reasoning in actual observations rather than the model's internal beliefs. When a tool returns misleading or incomplete data, the agent can chase a false trail. Reliable implementations include sanity checks on tool outputs before feeding them back as observations.
Interface errors. If the model's requested action doesn't satisfy the expected tool schema, the runtime must reject it. JSON mode only gives valid JSON syntax; it doesn't ensure correct tool fields. Use strict tool schemas where supported and validate the resulting arguments and policy in application code.[5]
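For the grounding failure described above, a sanity check can sit between the tool and the observation log. This is a minimal sketch under stated assumptions: the check_tracking_result name, the set of allowed statuses, the hours_since_scan field, and the 24-hour threshold are all hypothetical.

```python
def check_tracking_result(result: dict[str, object], max_age_hours: int = 24) -> str:
    """Turn a raw tool result into an observation the model can't over-trust."""
    allowed_status = {"in_transit", "delivered", "exception"}
    status = result.get("status")
    age = result.get("hours_since_scan")
    if not isinstance(status, str) or status not in allowed_status:
        return "error: tracking result has an unknown status"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(age, bool) or not isinstance(age, (int, float)) or age < 0:
        return "error: tracking result has no usable scan age"
    if age > max_age_hours:
        return f"warning: status {status} is {age}h old; treat as stale"
    return f"ok: status {status}, scanned {age}h ago"

print(check_tracking_result({"status": "in_transit", "hours_since_scan": 48}))
print(check_tracking_result({"status": "teleported", "hours_since_scan": 1}))
print(check_tracking_result({"status": "delivered", "hours_since_scan": 2}))
```

Labeling a stale result as a warning gives the next model turn an explicit signal instead of leaving it to infer freshness from raw data.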
Self-consistency samples multiple reasoning paths and selects a common answer, originally for reasoning tasks with a defined final response.[6] Applying that idea to an agent requires a boundary: candidate trajectories can inspect fixed, read-only evidence or run in a sandbox, but they shouldn't each issue real refunds, replacements, or messages. After selecting a proposal, the runtime still applies policy and executes at most one approved effect.
```python
from collections import Counter
from typing import Protocol

class RecommendationAgent(Protocol):
    def recommend(self, evidence: dict[str, str]) -> str:
        ...

class ScriptedCandidates:
    def __init__(self, proposals: list[str]):
        self.proposals = iter(proposals)

    def recommend(self, evidence: dict[str, str]) -> str:
        assert evidence["inventory"] == "available"
        return next(self.proposals)

def choose_proposal(agent: RecommendationAgent, evidence: dict[str, str], samples: int) -> str:
    proposals = [agent.recommend(evidence) for _ in range(samples)]
    return Counter(proposals).most_common(1)[0][0]

facts = {"tracking": "stale scan", "inventory": "available"}
candidates = ScriptedCandidates(["reship_for_review", "escalate", "reship_for_review"])
selected = choose_proposal(candidates, facts, samples=3)
print("selected proposal:", selected)
print("real effects executed during vote:", 0)
```

```text
selected proposal: reship_for_review
real effects executed during vote: 0
```

Self-consistency multiplies model and read-only tool work by the number of samples. It can support a reviewable recommendation, but it isn't permission to repeat writes.
Self-consistency samples many trajectories in parallel and votes. Reflexion (Shinn et al., 2023)[7] takes the opposite approach across attempts: when a trajectory fails, the agent writes a short natural-language reflection on why it failed and stores that note in memory, then retries with the reflection added to its context. The original paper calls this "verbal reinforcement learning," because the agent improves through written self-critique rather than weight updates. On the HumanEval coding benchmark, Reflexion reported 91% pass@1, above the 80% GPT-4 baseline the paper cites.
The mechanism fits the order-recovery example directly. Suppose a ReAct attempt closes a non-delivery ticket using a stale carrier scan, and an evaluator flags it as wrong. A Reflexion-style agent records a lesson such as "don't trust a carrier scan older than 24 hours; check for a fresher event first," then carries that lesson into the next attempt. Reflexion works best when three conditions hold: you get a clear pass or fail signal, the lessons stay short and task-specific, and retries are allowed. It's a memory technique layered on top of ReAct, not a replacement control loop.
```python
from dataclasses import dataclass, field

@dataclass
class AttemptMemory:
    lessons: list[str] = field(default_factory=list)

def record_evaluated_lesson(memory: AttemptMemory, passed: bool, lesson: str) -> None:
    if not passed:
        memory.lessons.append(lesson)

def next_attempt_context(memory: AttemptMemory) -> str:
    return memory.lessons[-1] if memory.lessons else "No prior evaluated failure."

memory = AttemptMemory()
record_evaluated_lesson(memory, passed=False, lesson="Verify delivery distance before closing dispute.")
record_evaluated_lesson(memory, passed=True, lesson="Ignore: successful trial.")
print("stored lessons:", len(memory.lessons))
print("next attempt reminder:", next_attempt_context(memory))
```

```text
stored lessons: 1
next attempt reminder: Verify delivery distance before closing dispute.
```

The failure signal comes from an evaluator or environment check, not from the agent deciding that its own unsupported story sounds convincing.
ReAct is powerful, but it's local: it chooses the next action from the latest observation rather than from a committed global plan. For complex tasks such as "audit every delayed order from yesterday and draft recovery actions," a ReAct agent might get lost in one carrier exception and forget the overall recovery workflow.
If ReAct is like a support rep resolving the next visible issue, Plan-and-Execute is like a warehouse incident runbook. First, the planner maps the recovery steps, then executors handle tracking checks, inventory checks, customer messaging, and refund decisions in order.
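Plan-and-Execute decouples planning from execution. It's a practical runtime pattern related to plan-first prompting techniques such as Plan-and-Solve,[8] but it doesn't imply one standard protocol. A plan may be a linear checklist or a dependency graph. Either way, the pattern separates three roles: a planner that drafts the steps, executors that carry them out, and a verifier or replanner that checks whether the remaining plan still holds.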
These roles don't have to be different models. Small systems often use one model in separate planning and execution prompts, while cost-optimized systems route planning to a stronger model and execution to cheaper specialists.
The figure below shows the key separation: a planner owns the global shape, executors own local work, and verifier/replanner steps keep the plan from going stale.
This architecture cleanly separates planning from execution. A planner drafts the initial steps, executors work through them, and a validation loop checks whether the remaining plan still makes sense. That global plan reduces goal drift on long tasks, but it doesn't eliminate it. A weak initial plan can still send every executor in the wrong direction.
Return to the Order A102 example. A Plan-and-Execute agent would first emit a plan, then execute it:
```text
1. Check carrier status for A102.
2. Check inventory for the ordered SKU.
3. Read customer profile for shipping preference and urgency.
4. Decide action: reship, refund, or escalate.
5. Execute the chosen action and confirm.
```

```text
Step 1: get_tracking("A102") -> "in_transit, stale scan"
Step 2: check_inventory("A102-SKU") -> "12 units in WH-7"
Step 3: get_customer_profile("A102") -> "standard shipping, needs by Friday"
Step 4: decide_action(...) -> "reship from WH-7 with express"
Step 5: create_reshipment("A102", "WH-7", "express") -> "R-2041, ETA Thursday"
```

The executor for Step 4 might itself be a small ReAct agent that reasons about the inputs and picks the best action. That's a common hybrid pattern: a global planner keeps the big picture, while local ReAct loops handle individual decisions.
A useful plan records dependencies rather than pretending all five steps are sequential. In this order example, tracking, inventory, and customer preference can be fetched independently. The action decision must wait for all three.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlanStep:
    id: str
    action: str
    depends_on: tuple[str, ...] = ()

def ready_steps(plan: list[PlanStep], completed: set[str]) -> list[str]:
    return [
        step.id
        for step in plan
        if step.id not in completed and set(step.depends_on) <= completed
    ]

plan = [
    PlanStep("tracking", "get_tracking"),
    PlanStep("inventory", "check_inventory"),
    PlanStep("profile", "get_customer_profile"),
    PlanStep("decision", "choose_recovery", ("tracking", "inventory", "profile")),
    PlanStep("effect", "create_reshipment", ("decision",)),
]
print("ready first:", ready_steps(plan, set()))
print("ready after facts:", ready_steps(plan, {"tracking", "inventory", "profile"}))
```

```text
ready first: ['tracking', 'inventory', 'profile']
ready after facts: ['decision']
```

This representation exposes concurrency without overstating it. The first three reads can run together only if the runtime supports concurrent execution and the tools don't share a conflicting resource.
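If the runtime does support concurrency, one possible scheduler runs each batch of ready steps as a wave. The sketch below uses a thread pool from the standard library. The run_in_waves name, the dependency dict, and the stub actions are illustrative assumptions, and it presumes the tools are safe to call from separate threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_in_waves(
    dependencies: dict[str, tuple[str, ...]],
    actions: dict[str, Callable[[], str]],
    max_workers: int = 3,
) -> tuple[list[list[str]], dict[str, str]]:
    completed: set[str] = set()
    waves: list[list[str]] = []
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(completed) < len(dependencies):
            ready = [
                step
                for step, needs in dependencies.items()
                if step not in completed and set(needs) <= completed
            ]
            if not ready:
                raise ValueError("plan has a cycle or a missing dependency")
            # Steps in one wave have no unmet dependencies, so they may overlap.
            outputs = pool.map(lambda name: actions[name](), ready)
            for step, output in zip(ready, outputs):
                results[step] = output
            completed.update(ready)
            waves.append(ready)
    return waves, results

dependencies = {
    "tracking": (),
    "inventory": (),
    "profile": (),
    "decision": ("tracking", "inventory", "profile"),
    "effect": ("decision",),
}
# Hypothetical stub actions; real ones would call tools.
actions = {name: (lambda name=name: f"{name} done") for name in dependencies}
waves, results = run_in_waves(dependencies, actions)
print(waves)
```

The decision step starts only after the whole first wave has returned, which matches the dependency rule in the plan above.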
Choosing between ReAct and Plan-and-Execute requires balancing cost, task complexity, and reliability. ReAct is useful for short, evidence-dependent decisions, but it can drift on long tasks because it lacks an explicit global plan. Plan-and-Execute gives you that plan, but it can produce brittle execution if you don't pair it with validation and replanning.
| Feature | ReAct | Plan-and-Execute |
|---|---|---|
| Control Flow | Next move follows the latest observation | Global plan first, then localized execution |
| Use Case | Search, debugging, exploratory workflows, API navigation | Research, ETL, code migration, long-horizon tasks with decomposable subtasks |
| Failure Mode | Local loop or goal drift after many steps | Brittle initial plan or stale plan after the environment changes |
| Token Cost | Grows when the loop carries a large trajectory forward | Can bound each executor's context when outputs are externalized, but planning and replanning add calls |
| Latency | Sequential when every tool result gates the next move | Planner adds a serial step; independent executor work can run in parallel when dependencies allow |
| Error Recovery | Immediate pivot on the next step | Requires an explicit validation or replanning loop |
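Use ReAct when the task is short, the next step depends on evidence you can't know in advance, and an immediate pivot matters more than a global plan, as in search, debugging, exploratory workflows, and API navigation.
Use Plan-and-Execute when the task is long-horizon and splits into subtasks whose dependencies can be stated up front, as in research, ETL, and code migration, and when you can afford an explicit validation or replanning loop.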
Plan-and-Execute's main weakness is the brittleness of the initial plan. If the planner misinterprets the user's intent, the entire execution pipeline is misaligned. Worse, downstream executor steps often depend on outputs from earlier steps, so a wrong assumption in Step 1 can invalidate Steps 2 through N. A ReAct loop gets an earlier opportunity to react to new evidence, but it can still repeat a bad decision without runtime checks.
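The cascade from a wrong early step can be computed from the plan's dependencies. The following sketch assumes a hypothetical invalidated_by helper and a plain dict mapping each step to the steps it depends on.

```python
def invalidated_by(failed: str, dependencies: dict[str, tuple[str, ...]]) -> list[str]:
    """Return every step that directly or transitively depends on the failed step."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for step, needs in dependencies.items():
            if step not in affected and affected & set(needs):
                affected.add(step)
                changed = True
    # Report in plan order, excluding the failed step itself.
    return [step for step in dependencies if step in affected and step != failed]

dependencies = {
    "tracking": (),
    "inventory": (),
    "profile": (),
    "decision": ("tracking", "inventory", "profile"),
    "effect": ("decision",),
}
print(invalidated_by("tracking", dependencies))
print(invalidated_by("effect", dependencies))
```

Steps outside the returned list, such as the inventory read when tracking fails, keep their results and don't need to be rerun.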
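Robust implementations address this by validating the plan before execution starts, rechecking its assumptions as observations arrive, and replanning only the unfinished steps.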
When evidence breaks an assumption, patch only unfinished work. The completed carrier query remains an observation; a revised plan shouldn't issue the same write again.
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    id: str
    action: str

def patch_remaining(steps: list[Step], completed: set[str], carrier_api_down: bool) -> list[Step]:
    remaining = [step for step in steps if step.id not in completed]
    if not carrier_api_down:
        return remaining
    return [
        Step("cached_tracking", "read_last_known_scan"),
        Step("mark_deferred", "flag_orders_for_retry"),
        *[step for step in remaining if step.id not in {"fetch_tracking", "close_ticket"}],
    ]

original = [
    Step("load_orders", "query_delayed_orders"),
    Step("fetch_tracking", "fetch_carrier_scans"),
    Step("close_ticket", "close_recovered_orders"),
    Step("summary", "draft_summary"),
]
revised = patch_remaining(original, {"load_orders"}, carrier_api_down=True)
print("completed kept:", ["load_orders"])
print("remaining actions:", [step.action for step in revised])
```

```text
completed kept: ['load_orders']
remaining actions: ['read_last_known_scan', 'flag_orders_for_retry', 'draft_summary']
```

While we conceptually say an agent "calls" a tool, the side effect still happens in your runtime (your Python or TypeScript code). At the API boundary, the model may emit raw JSON, a structured tool-call object, or another schema-constrained payload rather than plain prose.[5] Your application is the part that validates arguments, executes the external call, and feeds the result back into the next model turn.
To make this work, the LLM needs to know exactly what tools are available and how to use them. In practice, the runtime passes structured tool definitions either in the prompt or through the provider's tool-calling API. Those definitions are usually JSON-schema-like rather than the full JSON Schema spec, and the model uses the field descriptions, enums, and required keys to construct a valid request.[5]
Tool use connects model decisions to fresh, authorized evidence and controlled effects. An order-status tool can return the latest carrier event; a write tool can create a reshipment only after application policy approves it.
Tools need an argument contract so the LLM knows the allowed request shape. Tool-calling APIs can expose such schemas directly to the model. The example below defines a closed JSON-schema-like contract, then shows application checks that still matter: type validation and whether this merchant may see the requested order.
```python
order_status_tool_schema = {
    "name": "get_order_status",
    "description": "Get the current fulfillment status for an order",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "The merchant order ID"
            },
            "include_tracking": {
                "type": "boolean"
            }
        },
        "required": ["order_id"],
        "additionalProperties": False
    }
}

def execute_order_lookup(arguments: dict[str, object], merchant_orders: set[str]) -> str:
    allowed_keys = {"order_id", "include_tracking"}
    if "order_id" not in arguments or set(arguments) - allowed_keys:
        return "reject: invalid tool arguments"

    order_id = arguments["order_id"]
    include_tracking = arguments.get("include_tracking", False)
    if not isinstance(order_id, str) or not isinstance(include_tracking, bool):
        return "reject: invalid tool arguments"
    if order_id not in merchant_orders:
        return "deny: order not authorized for merchant"
    return f"allow: return status for {order_id}"

print(execute_order_lookup({"order_id": "A102"}, {"A102"}))
print(execute_order_lookup({"order_id": "A102", "include_tracking": "yes"}, {"A102"}))
print(execute_order_lookup({"order_id": "B900"}, {"A102"}))
```

```text
allow: return status for A102
reject: invalid tool arguments
deny: order not authorized for merchant
```

The underlying mechanics of "Tool Use" involve a hidden round-trip handled by your application:
1. The model emits a structured request such as {"tool": "get_order_status", "args": {"order_id": "A102"}}.
2. Your application validates the request, calls the order system, and receives "delayed, ETA Friday".
3. The application passes "delayed, ETA Friday" back to the model as the observation for its next turn.

This hidden round-trip relies entirely on your application code. The LLM doesn't execute the API request itself; it merely generates the structured request representing the intended call. The host application needs to securely execute the external call, manage connection timeouts, handle authentication, and then format the response back into a format that the LLM can ingest for its next reasoning step.
Common mistake: Treating schema validation as sufficient. Constrained outputs reduce syntax errors, but your runtime still needs to handle semantically bad arguments, missing auth, and tool timeouts.
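Timeouts are one of those runtime responsibilities. The sketch below wraps a tool call with a deadline using the standard library's thread pool and returns a bounded error observation instead of raising. The call_with_timeout name and the message wording are assumptions. Note that this approach only stops waiting; it doesn't cancel the underlying call, so a real client should also set its own network timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

def call_with_timeout(tool: Callable[[], str], timeout_seconds: float) -> str:
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(tool)
    try:
        return f"ok: {future.result(timeout=timeout_seconds)}"
    except FutureTimeout:
        return f"error: tool timed out after {timeout_seconds}s; do not assume a result"
    except Exception as exc:
        # Report the failure class without leaking a full stack trace to the model.
        return f"error: tool failed with {type(exc).__name__}"
    finally:
        # This does not kill the worker thread; it only stops waiting for it.
        pool.shutdown(wait=False)

def slow_tool() -> str:
    time.sleep(0.3)  # simulated slow carrier API
    return "late result"

def broken_tool() -> str:
    raise ConnectionError("carrier API unreachable")

print(call_with_timeout(lambda: "delayed, ETA Friday", timeout_seconds=1.0))
print(call_with_timeout(slow_tool, timeout_seconds=0.05))
print(call_with_timeout(broken_tool, timeout_seconds=1.0))
```

Returning the failure as an observation lets the next model turn choose a different step, while the explicit wording discourages it from inventing the missing result.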
A write needs another boundary: retrying the same agent turn must not create duplicate effects. Give each intended effect an idempotency key owned by your application, not invented afresh on every model retry.
```python
def create_reshipment_once(
    order_id: str,
    idempotency_key: str,
    applied: dict[str, str],
) -> str:
    if idempotency_key in applied:
        return f"replay: {applied[idempotency_key]}"
    shipment_id = f"R-{len(applied) + 2041}"
    applied[idempotency_key] = shipment_id
    return f"created: {shipment_id} for {order_id}"

effects: dict[str, str] = {}
key = "approve-reship:A102:policy-v3"
print(create_reshipment_once("A102", key, effects))
print(create_reshipment_once("A102", key, effects))
print("shipments created:", len(effects))
```

```text
created: R-2041 for A102
replay: R-2041
shipments created: 1
```

An unconstrained loop can spend far beyond the intended request budget or issue duplicate writes. Production systems need runtime controls before an agent is allowed to affect orders, refunds, or customer communication.
Agents operate in dynamic environments where external state can change between steps. When a tool returns an unexpected format, or an API call times out, a naive agent might blindly retry the exact same action or invent a response. Because an observation influences later moves, errors can compound. A single unsupported conclusion can derail the execution plan.
To build reliable agents, engineers need strong guardrails at the runtime layer. This means treating the LLM as an unreliable sub-component rather than a deterministic program. You need to validate all structured outputs, enforce hard limits on execution steps, and provide clear, actionable error messages back to the model when a failure occurs.
```python
from dataclasses import dataclass

@dataclass
class Budget:
    reads_left: int
    writes_left: int

def authorize_action(kind: str, budget: Budget) -> str:
    if kind == "read" and budget.reads_left > 0:
        budget.reads_left -= 1
        return "allow read"
    if kind == "write" and budget.writes_left > 0:
        budget.writes_left -= 1
        return "allow write"
    return f"stop: {kind} budget exhausted"

budget = Budget(reads_left=2, writes_left=1)
print(authorize_action("read", budget))
print(authorize_action("write", budget))
print(authorize_action("write", budget))
```

```text
allow read
allow write
stop: write budget exhausted
```

| Failure Mode | Symptom | Cause | Fix |
|---|---|---|---|
| Infinite Loops | Agent repeats the same action (e.g., search("order A102 status")) forever. | Tool returns an unhelpful result and the agent doesn't reformulate. | Cycle Detection: Detect repeated recent patterns without progress, then stop or change strategy. |
| Hallucinated Tools | Agent calls VideoGenerator() when no such tool exists. | Model invents a tool name that wasn't in the allowlist. | Allowlisted Tools: Reject unknown tool names before execution and return a bounded error observation. |
| Context Overflow | Conversation history exceeds token limit. | Every step appends raw observations or large summaries. | External State + Summary: Keep authoritative results outside the prompt and pass a bounded summary plus recent evidence. |
| Goal Drift | Agent forgets the original user intent after many steps. | Long trajectory pushes the original query out of the model's attention window. | Periodic Goal Restatement: Inject the original user query or a compact goal summary back into the next model call every K steps. |
| Interface Errors | Requested action fails its tool schema. | Model emits missing, invalid, or unsupported arguments. | Strict Tool Contract: Use provider strict schemas when available, then validate arguments and policy in runtime code. |
| Brittle Plans | Plan-and-Execute planner misinterprets intent, cascading failures through all steps. | Planner made a wrong assumption at T=0 and executors blindly followed it. | Plan Validation: Add a second-pass check that each plan step is achievable; trigger replanning early rather than waiting for executor failure. |
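The goal-restatement fix in the table can be as small as a function that assembles each turn's input. This is an illustrative sketch only: build_turn_messages, the reminder wording, and the default of every third step are assumptions, and a real system would emit provider-specific message objects rather than plain strings.

```python
def build_turn_messages(
    goal: str,
    recent_observations: list[str],
    step: int,
    restate_every: int = 3,
) -> list[str]:
    """Prepend the original goal on step 0 and on every K-th step after it."""
    messages: list[str] = []
    if step % restate_every == 0:
        messages.append(f"Reminder of the original goal: {goal}")
    messages.extend(recent_observations)
    return messages

goal = "Resolve delayed order A102 before Friday"
observations = ["get_tracking: scan stale for 48h"]
for step in range(4):
    print(step, build_turn_messages(goal, observations, step))
```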
To prevent loops, track normalized actions in a short recent window. The detector below catches repeated single actions and short alternating patterns. A production detector also needs progress signals, because an agent can repeat a valid read while receiving new pages of results.
```python
def normalized(call: dict[str, object]) -> tuple[str, tuple[tuple[str, str], ...]]:
    args = call.get("args", {})
    assert isinstance(args, dict)
    return str(call["tool"]), tuple(sorted((str(k), str(v)) for k, v in args.items()))

def repeated_recent_pattern(history: list[dict[str, object]], max_period: int = 3) -> bool:
    actions = [normalized(call) for call in history]
    for period in range(1, min(max_period, len(actions) // 2) + 1):
        if actions[-period:] == actions[-2 * period:-period]:
            return True
    return False

same = [
    {"tool": "get_tracking", "args": {"order_id": "A102"}},
    {"tool": "get_tracking", "args": {"order_id": "A102"}},
]
alternating = [
    {"tool": "get_tracking", "args": {"order_id": "A102"}},
    {"tool": "check_inventory", "args": {"sku": "A102-SKU"}},
    {"tool": "get_tracking", "args": {"order_id": "A102"}},
    {"tool": "check_inventory", "args": {"sku": "A102-SKU"}},
]
progress = [
    {"tool": "get_tracking", "args": {"order_id": "A102"}},
    {"tool": "check_inventory", "args": {"sku": "A102-SKU"}},
]
print("same-action loop:", repeated_recent_pattern(same))
print("alternating loop:", repeated_recent_pattern(alternating))
print("progress loop:", repeated_recent_pattern(progress))
```

```text
same-action loop: True
alternating loop: True
progress loop: False
```

Below is a partial ReAct trace for a different order. Three steps are shown, and the third contains a mistake. Read the trace, identify what goes wrong, and decide what the agent should have done instead.
```text
Question: Order B305 is marked delivered, but the customer says it never arrived.

Thought: I need to check the carrier's proof-of-delivery photo.
Action: get_pod(order_id="B305")
Observation: Photo shows package left at front door. GPS tag: 48.8566° N, 2.3522° E.

Thought: The customer claims non-delivery. I should verify the delivery address.
Action: get_order_address(order_id="B305")
Observation: Delivery address: 12 Merchant Way, Paris 75004.

Thought: The GPS tag matches Paris. I'll close the ticket as delivered.
Action: close_ticket(order_id="B305", resolution="delivered")
Observation: Ticket closed. Customer satisfaction score dropped.

Thought: ???
```

What went wrong? The agent checked that the GPS tag was in Paris, but it never compared the GPS coordinates to the exact delivery address. A package left at the wrong building in the same city is still a failed delivery. The agent should have checked the distance to the address coordinates. If the distance was above a threshold, it should have flagged a possible misdelivery rather than closing the ticket.
The fix: Add an enforced verify_delivery_distance precondition before the close action. A prompt reminder is useful context, but it can't block a write when the model ignores it.
Assume an address service returns reference coordinates for the delivery address. The runtime can calculate the distance and route distant proof to investigation instead of trusting the model's interpretation.
```python
from math import asin, cos, radians, sin, sqrt

def distance_meters(left: tuple[float, float], right: tuple[float, float]) -> float:
    lat1, lon1 = map(radians, left)
    lat2, lon2 = map(radians, right)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    haversine = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(haversine))

def dispute_action(proof: tuple[float, float], expected: tuple[float, float], max_meters: float) -> str:
    separation = distance_meters(proof, expected)
    if separation <= max_meters:
        return "eligible for reviewed closure"
    return "investigate possible misdelivery"

proof_photo_location = (48.8566, 2.3522)
delivery_address_location = (48.8560, 2.3590)
separation = distance_meters(proof_photo_location, delivery_address_location)
print("distance over 100m:", separation > 100)
print("action:", dispute_action(proof_photo_location, delivery_address_location, max_meters=100))
```

```text
distance over 100m: True
action: investigate possible misdelivery
```

[1] Building Effective Agents. Anthropic · 2024

[2] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Wei, J., et al. · 2022 · NeurIPS

[3] ReAct: Synergizing Reasoning and Acting in Language Models. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. · 2023 · ICLR 2023

[4] Reasoning models. OpenAI · 2026

[5] Structured outputs. OpenAI · 2024

[6] Self-Consistency Improves Chain of Thought Reasoning in Language Models. Wang, X., et al. · 2022

[7] Reflexion: Language Agents with Verbal Reinforcement Learning. Shinn, N., et al. · 2023

[8] Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. Wang, L., et al. · 2023