LeetLLM
LearnFeaturesBlog
LeetLLM

Your go-to resource for mastering AI & LLM systems.

Product

  • Learn
  • Features
  • Blog

Legal

  • Terms of Service
  • Privacy Policy

© 2026 LeetLLM. All rights reserved.

All Topics
Your Progress
0%

0 of 155 articles completed

🛠️Computing Foundations0/6
NumPy and Tensor ShapesCUDA for ML TrainingMPS & Metal for ML on MacData Structures for AISQL and Data ModelingAlgorithms for ML Engineers
📊Math & Statistics0/8
Gradients and BackpropVectors, Matrices & TensorsLinear Algebra for MLAdam, Momentum, SchedulersProbability for Machine LearningStatistics and UncertaintyDistributions and SamplingHypothesis Tests, Intervals, and pass@k
📚Preparation & Prerequisites0/13
Neural Networks from ScratchCNNs from ScratchTraining & BackpropagationSoftmax, Cross-Entropy & OptimizationRNNs, LSTMs, GRUs, and Sequence ModelingAutoencoders and VAEsThe Transformer Architecture End-to-EndLanguage Modeling & Next TokensFrom GPT to Modern LLMsPrompt Engineering FundamentalsCalling LLM APIs in ProductionFirst AI App End-to-EndThe LLM Lifecycle
🧮ML Algorithms & Evaluation0/11
Linear Regression from ScratchLogistic Regression and MetricsDecision Trees, Forests, and BoostingReinforcement Learning BasicsValidation and LeakageClustering and PCACore Retrieval AlgorithmsDecoding AlgorithmsExperiment Design and A/B TestingPyTorch Training LoopsDataset Pipelines and Data Quality
📦Production ML Systems0/6
Feature Engineering for Production MLBatch and Streaming Feature PipelinesGradient Boosted Trees in ProductionRanking and Recommendation SystemsForecasting and Anomaly DetectionMonitoring Predictive Models
🧪Core LLM Foundations0/8
The Bitter Lesson & ComputeBPE, WordPiece, and SentencePieceStatic to Contextual EmbeddingsPerplexity & Model EvaluationFile Ingestion for AIChunking StrategiesLLM Benchmarks & LimitationsInstruction Tuning & Chat Templates
🧰Applied LLM Engineering0/23
Dimensionality Reduction for EmbeddingsCoT, ToT & Self-Consistency PromptingFunction Calling & Tool UseMCP & Tool Protocol StandardsPrompt Injection DefenseResponsible AI GovernanceData Labeling and Human FeedbackEvaluating AI AgentsProduction RAG PipelinesHybrid Search: Dense + SparseReranking and Cross-Encoders for RAGRAG Evaluation for Reliable AnswersLLM-as-a-Judge EvaluationBias & Fairness in LLMsHallucination Detection & MitigationLLM Observability & MonitoringExperiment Tracking with MLflow and W&BMixed Precision TrainingModel Versioning & DeploymentSemantic Caching & Cost OptimizationLLM Cost Engineering & Token EconomicsModel Gateways, Routing, and FallbacksDesign an Automated Support Agent
🎓Portfolio Capstones0/9
Capstone: Delivery ETA PredictionCapstone: Product RankingCapstone: Demand ForecastingCapstone: Image Damage ClassifierCapstone: Production ML PipelineCapstone: Document QACapstone: Eval DashboardCapstone: Fine-Tuned ClassifierCapstone: Production Agent
🧠Transformer Deep Dives0/8
Sentence Embeddings & Contrastive LossEmbedding Similarity & QuantizationScaled Dot-Product AttentionVision Transformers and Image EncodersPositional Encoding: RoPE & ALiBiLayer Normalization: Pre-LN vs Post-LNMechanistic InterpretabilityDecoding Strategies: Greedy to Nucleus
🧬Advanced Training & Adaptation0/16
Scaling Laws & Compute-Optimal TrainingPre-training Data at ScaleBuild GPT from Scratch LabContinued Pretraining for Domain ShiftSynthetic Data PipelinesSupervised Fine-Tuning PipelineDistributed Training: FSDP & ZeROLoRA & Parameter-Efficient TuningReward Modeling from Preference DataRLHF & DPO AlignmentConstitutional AI & Red TeamingRLVR & Verifiable RewardsKnowledge Distillation for LLMsModel Merging and Weight InterpolationPrompt Optimization with DSPyRecursive Language Models (RLM)
🤖Advanced Agents & Retrieval0/14
Vector DB Internals: HNSW & IVFAdvanced RAG: HyDE & Self-RAGGraphRAG & Knowledge GraphsRAG Security & Access ControlStructured Output GenerationReAct & Plan-and-ExecuteGuardrails & Safety FiltersCode Generation & SandboxingComputer-Use / GUI / Browser AgentsHuman-in-the-Loop Agent ArchitectureAI Coding Workflow with AgentsAgent Memory & PersistenceAgent Failure & RecoveryMulti-Agent Orchestration
⚡Inference & Production Scale0/20
Inference: TTFT, TPS & KV CacheMulti-Query & Grouped-Query AttentionKV Cache & PagedAttentionPrefix Caching and Prompt CachingFlashAttention & Memory EfficiencyContinuous Batching & SchedulingScaling LLM InferenceModel Parallelism for LLM InferenceModel Quantization: GPTQ, AWQ & GGUFLocal LLM DeploymentSLM Specialization & Edge DeploymentSpeculative DecodingLong Context Window ManagementContext EngineeringMixture of Experts ArchitectureMamba & State Space ModelsReasoning & Test-Time ComputeAdvanced MLOps & DevOps for AIGPU Serving & AutoscalingA/B Testing for LLMs
🏗️System Design Capstones0/9
Content Moderation SystemCode Completion SystemMulti-Tenant LLM PlatformLLM-Powered Search EngineVision-Language Models & CLIPMultimodal LLM ArchitectureDiffusion Models & Image GenerationReal-Time Voice AI AgentReasoning & Test-Time Compute
🎤AI Lab Interviewing0/4
AI Lab Coding Interview: Python SystemsAI Lab System Design InterviewAI Lab Behavioral InterviewAI Lab Technical Presentation
Back to Topics
LearnAdvanced Agents & RetrievalReAct & Plan-and-Execute
🤖HardLLM Agents & Tool Use

ReAct & Plan-and-Execute

Compare ReAct for tightly coupled tool use with Plan-and-Execute for longer workflows with explicit planning and replanning.

33 min read
Learning path
Step 114 of 155 in the full curriculum
Structured Output GenerationGuardrails & Safety Filters

Agent architectures turn a one-shot model call into a stateful system that can request tools, observe results, and continue. This chapter compares ReAct and plan-and-execute patterns so you can choose a control loop for multi-step product work.

Imagine asking a smart assistant to not just draft a reply about a delayed shipment, but to check the order record, request a carrier update, and create a replacement-shipment task if the package is lost. A plain LLM call is passive: it produces , but it doesn't execute side effects on its own. Agent runtimes bridge that gap by treating the model as a reasoning engine that can choose tools, update state, and coordinate multi-step work.

Most product paths should stay deterministic. An agent becomes useful when the next safe step depends on live evidence that can't be enumerated cheaply in advance, and when your runtime can check its effects.

Before the patterns, one reality check that frames everything below. Anthropic's guidance on building effective agents draws a line between two kinds of systems[1]. A workflow is a system where LLMs and tools follow predefined code paths that you wrote. An agent is a system where the model dynamically directs its own process and tool use, deciding at runtime how to reach the goal. Both are useful. The guidance is blunt: find the simplest pattern that works and add complexity only when it demonstrably improves outcomes, because agentic systems trade latency and cost for flexibility. Stripped down, an agent is "just LLMs using tools based on environmental feedback in a loop." ReAct and Plan-and-Execute are two named shapes of that loop, not magic. A fixed prompt chain or a single tool-calling call is often the right answer.

Key insight: An agent isn't just an LLM with tools. It's a loop with state, allowed actions, observations, and runtime boundaries. Some agents also need planning or long-term memory; don't add those components until the task requires them. If the steps are known in advance, a workflow is cheaper and easier to debug.

This article builds on two ideas you have already met. Chain-of-Thought prompting studied visible intermediate reasoning traces before answers.[2] lets a model request an action using arguments that your code validates and executes. An agent runtime turns these ideas into a loop: request a next move, execute an allowed tool, record an observation, and repeat. We will look at two useful loop shapes: ReAct, which decides one step at a time, and Plan-and-Execute, which drafts a roadmap before moving.

ReAct: reasoning + acting

Why one step at a time?

The ReAct (Reason + Act) pattern is an influential agent loop. Proposed by Yao et al. (2023), it interleaves reasoning with actions and observations on the paper's evaluated tasks.[3] A reasoning-only response can't check a live carrier scan. An acting-only policy may issue tool calls without enough task context. ReAct combines a next-step decision with fresh environmental evidence.

Think of a fulfillment agent resolving a delayed order. It checks tracking (Observation), decides the carrier scan is stale (Reasoning), opens a carrier inquiry (Action), then reads the new status before choosing the next step. ReAct applies this same observe-decide-act loop to AI.

The paper writes this pattern as Thought, Action, and Observation. Treat Thought: as explanatory notation, not as a production API contract. OpenAI's current reasoning documentation describes reasoning tokens as hidden output tokens rather than a raw trace returned to the application.[4] Store the observable state you can audit: validated tool requests, tool results, bounded decision notes when a user needs an explanation, and budget usage.

The analogy that fits best is a warehouse picker who doesn't have a complete pick list. They walk to the first bin, read the label, decide what to grab, and only then plan the next aisle based on what they just learned. ReAct is that picker. It doesn't assume it knows the whole warehouse layout. It adapts after every shelf.

The figure below shows that feedback loop: every observation becomes the state for the next reasoning step.

ReAct control loop showing current state, tool action, and observation, with separate cards for prompt growth, fast adaptation, and runtime guardrails. ReAct control loop showing current state, tool action, and observation, with separate cards for prompt growth, fast adaptation, and runtime guardrails.
ReAct is strongest when the next useful action depends on the previous tool result. The runtime owns validation, step caps, and tool execution between model turns.

The ReAct loop

In practice, a user's request triggers repeated next-move decisions, tool execution, and observations until the model returns a final answer or the runtime stops the loop.

A concrete trace

Here is a paper-style ReAct trace where the agent must resolve a delayed order. The Thought lines make the lesson legible; an application can implement the same control loop without storing private reasoning text.

text
1Question: Order A102 is delayed. The customer needs it by Friday. 2 3Thought: I need to check the current carrier status first. 4Action: get_tracking(order_id="A102") 5Observation: Carrier scan shows "in_transit", last updated 48 hours ago. 6 7Thought: The scan is stale. I should check inventory to see if we can reship. 8Action: check_inventory(sku="A102-SKU") 9Observation: 12 units in warehouse WH-7. 10 11Thought: We have stock. I need the customer's shipping preference. 12Action: get_customer_profile(order_id="A102") 13Observation: Customer selected "standard shipping" but noted "needed by Friday" in comments. 14 15Thought: Standard won't arrive by Friday. Express shipping from WH-7 would arrive Thursday. 16Action: create_reshipment(order_id="A102", warehouse="WH-7", shipping="express") 17Observation: Reshipment R-2041 created. ETA Thursday. 18 19Thought: The customer will receive the package on time. 20Answer: Order A102 has been reshipped from warehouse WH-7 with express shipping. New tracking: R-2041. ETA Thursday.

Notice what happens to state as the loop runs. After the first tool call, the runtime has the goal, the action, and the observation. After another call, it has another evidence record to make available to the next model turn. If every raw result is appended, context grows with the trajectory; long tasks need summaries, external state, or another control pattern. The exact token cost depends on your prompts, tools, and summarization policy.

Building the loop in code

A functional runtime validates each requested move before it executes anything. The model-facing schema in a hosted API can enforce the shape of NextMove; this dependency-free example focuses on application responsibilities: tool allowlisting, step limits, observation storage, and a final answer grounded in those observations.

execute-typed-react-moves.py
1from dataclasses import dataclass, field 2from typing import Callable, Literal 3 4MoveKind = Literal["tool", "answer"] 5 6@dataclass(frozen=True) 7class NextMove: 8 kind: MoveKind 9 tool: str | None = None 10 arguments: dict[str, str] = field(default_factory=dict) 11 answer: str | None = None 12 13@dataclass 14class RuntimeState: 15 goal: str 16 observations: list[str] = field(default_factory=list) 17 tool_calls: int = 0 18 19def run_moves( 20 moves: list[NextMove], 21 tools: dict[str, Callable[[dict[str, str]], str]], 22 max_steps: int = 3, 23) -> tuple[str, RuntimeState]: 24 state = RuntimeState(goal="Resolve delayed order A102 before Friday") 25 26 for move in moves: 27 if move.kind == "answer": 28 if not state.observations or not move.answer: 29 raise ValueError("answer requires observed evidence") 30 return move.answer, state 31 32 if state.tool_calls >= max_steps: 33 return "Stopped: tool-call budget exhausted.", state 34 if move.kind != "tool" or move.tool not in tools: 35 raise ValueError("requested tool is not allowed") 36 37 result = tools[move.tool](move.arguments) 38 state.observations.append(f"{move.tool}: {result}") 39 state.tool_calls += 1 40 41 return "Stopped: no final answer.", state 42 43tools = { 44 "get_tracking": lambda args: "scan stale for 48h", 45 "check_inventory": lambda args: "12 units available in WH-7", 46} 47moves = [ 48 NextMove("tool", "get_tracking", {"order_id": "A102"}), 49 NextMove("tool", "check_inventory", {"sku": "A102-SKU"}), 50 NextMove("answer", answer="Eligible for a reviewed express reshipment."), 51] 52answer, state = run_moves(moves, tools) 53print("tool calls:", state.tool_calls) 54print("last observation:", state.observations[-1]) 55print("answer:", answer)
Output
1tool calls: 2 2last observation: check_inventory: 12 units available in WH-7 3answer: Eligible for a reviewed express reshipment.

The model proposes NextMove records. The runtime decides whether a tool is available, whether budget remains, and whether enough evidence exists to return an answer. No free-form private reasoning trace is needed in the audit log.

When ReAct breaks

ReAct's tight interleaving of decisions and actions creates specific failure modes worth knowing:

Infinite loops. ReAct agents can get stuck repeating the same action when a tool returns unhelpful results. If a search returns no results, the agent might search again with the identical query instead of reformulating. Enforcing a maximum step count and tracking action hashes prevents this from running indefinitely.

Context overflow. Each step may append actions, observations, and summaries to the context window. For tasks requiring many steps, that context can eventually exhaust the model's limit. Strategies include summarizing older evidence, storing full results outside the prompt, or switching to a longer-context model.

Grounding failures. ReAct's key strength is grounding reasoning in actual observations rather than the model's internal beliefs. When a tool returns misleading or incomplete data, the agent can chase a false trail. Reliable implementations include sanity checks on tool outputs before feeding them back as observations.

Interface errors. If the model's requested action doesn't satisfy the expected tool schema, the runtime must reject it. JSON mode only gives valid JSON syntax; it doesn't ensure correct tool fields. Use strict tool schemas where supported and validate the resulting arguments and policy in application code.[5]

Self-consistency without duplicated effects

Self-consistency samples multiple reasoning paths and selects a common answer, originally for reasoning tasks with a defined final response.[6] Applying that idea to an agent requires a boundary: candidate trajectories can inspect fixed, read-only evidence or run in a sandbox, but they shouldn't each issue real refunds, replacements, or messages. After selecting a proposal, the runtime still applies policy and executes at most one approved effect.

vote-on-read-only-recommendations.py
1from collections import Counter 2from typing import Protocol 3 4class RecommendationAgent(Protocol): 5 def recommend(self, evidence: dict[str, str]) -> str: 6 ... 7 8class ScriptedCandidates: 9 def __init__(self, proposals: list[str]): 10 self.proposals = iter(proposals) 11 12 def recommend(self, evidence: dict[str, str]) -> str: 13 assert evidence["inventory"] == "available" 14 return next(self.proposals) 15 16def choose_proposal(agent: RecommendationAgent, evidence: dict[str, str], samples: int) -> str: 17 proposals = [agent.recommend(evidence) for _ in range(samples)] 18 return Counter(proposals).most_common(1)[0][0] 19 20facts = {"tracking": "stale scan", "inventory": "available"} 21candidates = ScriptedCandidates(["reship_for_review", "escalate", "reship_for_review"]) 22selected = choose_proposal(candidates, facts, samples=3) 23print("selected proposal:", selected) 24print("real effects executed during vote:", 0)
Output
1selected proposal: reship_for_review 2real effects executed during vote: 0

Self-consistency multiplies model and read-only tool work by the number of samples. It can support a reviewable recommendation, but it isn't permission to repeat writes.

Reflexion: learning from a failed attempt

Self-consistency samples many trajectories in parallel and votes. Reflexion (Shinn et al., 2023)[7] takes the opposite approach across attempts: when a trajectory fails, the agent writes a short natural-language reflection on why it failed and stores that note in memory, then retries with the reflection added to its context. The original paper calls this "verbal reinforcement learning," because the agent improves through written self-critique rather than weight updates. On the HumanEval coding benchmark, Reflexion reported 91% pass@1, above the 80% GPT-4 baseline the paper cites.

The mechanism fits the order-recovery example directly. Suppose a ReAct attempt closes a non-delivery ticket using a stale carrier scan, and an evaluator flags it as wrong. A Reflexion-style agent records a lesson such as "don't trust a carrier scan older than 24 hours; check for a fresher event first," then carries that lesson into the next attempt. Reflexion works best when three conditions hold: you get a clear pass or fail signal, the lessons stay short and task-specific, and retries are allowed. It's a memory technique layered on top of ReAct, not a replacement control loop.

store-reflections-only-after-failure.py
1from dataclasses import dataclass, field 2 3@dataclass 4class AttemptMemory: 5 lessons: list[str] = field(default_factory=list) 6 7def record_evaluated_lesson(memory: AttemptMemory, passed: bool, lesson: str) -> None: 8 if not passed: 9 memory.lessons.append(lesson) 10 11def next_attempt_context(memory: AttemptMemory) -> str: 12 return memory.lessons[-1] if memory.lessons else "No prior evaluated failure." 13 14memory = AttemptMemory() 15record_evaluated_lesson(memory, passed=False, lesson="Verify delivery distance before closing dispute.") 16record_evaluated_lesson(memory, passed=True, lesson="Ignore: successful trial.") 17print("stored lessons:", len(memory.lessons)) 18print("next attempt reminder:", next_attempt_context(memory))
Output
1stored lessons: 1 2next attempt reminder: Verify delivery distance before closing dispute.

The failure signal comes from an evaluator or environment check, not from the agent deciding that its own unsupported story sounds convincing.

Plan-and-Execute

Why plan first?

ReAct is powerful, but it's local: it chooses the next action from the latest observation rather than from a committed global plan. For complex tasks such as "audit every delayed order from yesterday and draft recovery actions," a ReAct agent might get lost in one carrier exception and forget the overall recovery workflow.

If ReAct is like a support rep resolving the next visible issue, Plan-and-Execute is like a warehouse incident runbook. First, the planner maps the recovery steps, then executors handle tracking checks, inventory checks, customer messaging, and refund decisions in order.

Plan-and-Execute decouples planning from execution. It's a practical runtime pattern related to plan-first prompting techniques such as Plan-and-Solve,[8] but it doesn't imply one standard protocol. A plan may be a linear checklist or a dependency graph:

  1. Planner: An LLM generates a multi-step plan based on the user request.
  2. Executor: An agent (often a ReAct agent itself) executes each step of the plan.
  3. Replanner: The system reviews the results and updates the plan if necessary.

These roles don't have to be different models. Small systems often use one model in separate planning and execution prompts, while cost-optimized systems route planning to a stronger model and execution to cheaper specialists.

The figure below shows the key separation: a planner owns the global shape, executors own local work, and verifier/replanner steps keep the plan from going stale.

Plan-and-Execute architecture showing planner, explicit task plan, local executors, verifier or replanner, and final synthesis, with cards for global view, local work, and replan trigger. Plan-and-Execute architecture showing planner, explicit task plan, local executors, verifier or replanner, and final synthesis, with cards for global view, local work, and replan trigger.
Plan-and-Execute keeps long work organized by separating the global route from local execution and small replans.

The Plan-and-Execute flow

This architecture cleanly separates planning from execution. A planner drafts the initial steps, executors work through them, and a validation loop checks whether the remaining plan still makes sense. That global plan reduces goal drift on long tasks, but it doesn't eliminate it. A weak initial plan can still send every executor in the wrong direction.

The same order, planned

Return to the Order A102 example. A Plan-and-Execute agent would first emit a plan, then execute it:

Planner output

text
11. Check carrier status for A102. 22. Check inventory for the ordered SKU. 33. Read customer profile for shipping preference and urgency. 44. Decide action: reship, refund, or escalate. 55. Execute the chosen action and confirm.

Executor trace

text
1Step 1: get_tracking("A102") -> "in_transit, stale scan" 2Step 2: check_inventory("A102-SKU") -> "12 units in WH-7" 3Step 3: get_customer_profile("A102") -> "standard shipping, needs by Friday" 4Step 4: decide_action(...) -> "reship from WH-7 with express" 5Step 5: create_reshipment("A102", "WH-7", "express") -> "R-2041, ETA Thursday"

The executor for Step 4 might itself be a small ReAct agent that reasons about the inputs and picks the best action. That's a common hybrid pattern: a global planner keeps the big picture, while local ReAct loops handle individual decisions.

A useful plan records dependencies rather than pretending all five steps are sequential. In this order example, tracking, inventory, and customer preference can be fetched independently. The action decision must wait for all three.

run-ready-plan-steps.py
1from dataclasses import dataclass 2 3@dataclass(frozen=True) 4class PlanStep: 5 id: str 6 action: str 7 depends_on: tuple[str, ...] = () 8 9def ready_steps(plan: list[PlanStep], completed: set[str]) -> list[str]: 10 return [ 11 step.id 12 for step in plan 13 if step.id not in completed and set(step.depends_on) <= completed 14 ] 15 16plan = [ 17 PlanStep("tracking", "get_tracking"), 18 PlanStep("inventory", "check_inventory"), 19 PlanStep("profile", "get_customer_profile"), 20 PlanStep("decision", "choose_recovery", ("tracking", "inventory", "profile")), 21 PlanStep("effect", "create_reshipment", ("decision",)), 22] 23print("ready first:", ready_steps(plan, set())) 24print("ready after facts:", ready_steps(plan, {"tracking", "inventory", "profile"}))
Output
1ready first: ['tracking', 'inventory', 'profile'] 2ready after facts: ['decision']

This representation exposes concurrency without overstating it. The first three reads can run together only if the runtime supports concurrent execution and the tools don't share a conflicting resource.

Comparing the two approaches

Choosing between ReAct and Plan-and-Execute requires balancing cost, task complexity, and reliability. ReAct is useful for short, evidence-dependent decisions, but it can drift on long tasks because it lacks an explicit global plan. Plan-and-Execute gives you that plan, but it can produce brittle execution if you don't pair it with validation and replanning.

FeatureReActPlan-and-Execute
Control FlowNext move follows the latest observationGlobal plan first, then localized execution
Use CaseSearch, debugging, exploratory workflows, API navigationResearch, ETL, code migration, long-horizon tasks with decomposable subtasks
Failure ModeLocal loop or goal drift after many stepsBrittle initial plan or stale plan after the environment changes
Token CostGrows when the loop carries a large trajectory forwardCan bound each executor's context when outputs are externalized, but planning and replanning add calls
LatencySequential when every tool result gates the next movePlanner adds a serial step; independent executor work can run in parallel when dependencies allow
Error RecoveryImmediate pivot on the next stepRequires an explicit validation or replanning loop
Decision guide comparing ReAct and Plan-and-Execute by dependency shape, task horizon, and hybrid use, with a final rule to pick based on whether next step depends on fresh observation. Decision guide comparing ReAct and Plan-and-Execute by dependency shape, task horizon, and hybrid use, with a final rule to pick based on whether next step depends on fresh observation.
Choose ReAct when each next step depends on fresh observations. Choose Plan-and-Execute when the task has a stable global structure and independent substeps that can be checked or parallelized.

When to use each

Use ReAct when:

  • The environment provides immediate feedback after each action
  • Task length is bounded by an explicit runtime budget
  • You need to ground decisions in live data (search, database lookups)
  • The optimal path isn't predictable upfront

Use Plan-and-Execute when:

  • The task has a clear overall structure (audit shipments, draft a recovery report)
  • Steps have dependencies that benefit from upfront sequencing
  • You can parallelize independent sub-tasks
  • Executor steps are well-scoped and locally checkable (running tests, scraping pages, querying APIs with known schemas)

The brittleness problem

Plan-and-Execute's main weakness is the brittleness of the initial plan. If the planner misinterprets the user's intent, the entire execution pipeline is misaligned. Worse, downstream executor steps often depend on outputs from earlier steps, so a wrong assumption in Step 1 can invalidate Steps 2 through N. A ReAct loop gets an earlier opportunity to react to new evidence, but it can still repeat a bad decision without runtime checks.

Robust implementations address this by:

  • Constraining the planner's output format: Use structured prompts that limit the planner to a fixed set of step templates rather than free-form generation.
  • Adding a validation step: After the planner generates the initial plan, a second pass checks that each step is achievable and that prerequisites are satisfied.
  • Triggering replanning early: Rather than waiting for an executor to fail, check intermediate outputs against the original goal and replan if the delta exceeds a threshold.

When evidence breaks an assumption, patch only unfinished work. The completed carrier query remains an observation; a revised plan shouldn't issue the same write again.

patch-only-unfinished-plan-steps.py
1from dataclasses import dataclass 2 3@dataclass(frozen=True) 4class Step: 5 id: str 6 action: str 7 8def patch_remaining(steps: list[Step], completed: set[str], carrier_api_down: bool) -> list[Step]: 9 remaining = [step for step in steps if step.id not in completed] 10 if not carrier_api_down: 11 return remaining 12 return [ 13 Step("cached_tracking", "read_last_known_scan"), 14 Step("mark_deferred", "flag_orders_for_retry"), 15 *[step for step in remaining if step.id not in {"fetch_tracking", "close_ticket"}], 16 ] 17 18original = [ 19 Step("load_orders", "query_delayed_orders"), 20 Step("fetch_tracking", "fetch_carrier_scans"), 21 Step("close_ticket", "close_recovered_orders"), 22 Step("summary", "draft_summary"), 23] 24revised = patch_remaining(original, {"load_orders"}, carrier_api_down=True) 25print("completed kept:", ["load_orders"]) 26print("remaining actions:", [step.action for step in revised])
Output
1completed kept: ['load_orders'] 2remaining actions: ['read_last_known_scan', 'flag_orders_for_retry', 'draft_summary']

How tool calls actually flow

While we conceptually say an agent "calls" a tool, the side effect still happens in your runtime (your Python or TypeScript code). At the API boundary, the model may emit raw JSON, a structured tool-call object, or another schema-constrained payload rather than plain prose.[5] Your application is the part that validates arguments, executes the external call, and feeds the result back into the next model turn.

To make this work, the LLM needs to know exactly what tools are available and how to use them. In practice, the runtime passes structured tool definitions either in the prompt or through the provider's tool-calling API. Those definitions are usually JSON-schema-like rather than the full JSON Schema spec, and the model uses the field descriptions, enums, and required keys to construct a valid request.[5]

Tool use connects model decisions to fresh, authorized evidence and controlled effects. An order-status tool can return the latest carrier event; a write tool can create a reshipment only after application policy approves it.

Tool definition

Tools need an argument contract so the LLM knows the allowed request shape. Tool-calling APIs can expose such schemas directly to the model. The example below defines a closed JSON-schema-like contract, then shows application checks that still matter: type validation and whether this merchant may see the requested order.

tool-definition.py
1order_status_tool_schema = { 2 "name": "get_order_status", 3 "description": "Get the current fulfillment status for an order", 4 "parameters": { 5 "type": "object", 6 "properties": { 7 "order_id": { 8 "type": "string", 9 "description": "The merchant order ID" 10 }, 11 "include_tracking": { 12 "type": "boolean" 13 } 14 }, 15 "required": ["order_id"], 16 "additionalProperties": False 17 } 18} 19 20def execute_order_lookup(arguments: dict[str, object], merchant_orders: set[str]) -> str: 21 allowed_keys = {"order_id", "include_tracking"} 22 if "order_id" not in arguments or set(arguments) - allowed_keys: 23 return "reject: invalid tool arguments" 24 25 order_id = arguments["order_id"] 26 include_tracking = arguments.get("include_tracking", False) 27 if not isinstance(order_id, str) or not isinstance(include_tracking, bool): 28 return "reject: invalid tool arguments" 29 if order_id not in merchant_orders: 30 return "deny: order not authorized for merchant" 31 return f"allow: return status for {order_id}" 32 33print(execute_order_lookup({"order_id": "A102"}, {"A102"})) 34print(execute_order_lookup({"order_id": "A102", "include_tracking": "yes"}, {"A102"})) 35print(execute_order_lookup({"order_id": "B900"}, {"A102"}))
Output
1allow: return status for A102 2reject: invalid tool arguments 3deny: order not authorized for merchant

The execution loop

The underlying mechanics of "Tool Use" involve a hidden round-trip handled by your application:

  1. User: "Where is order A102?"
  2. LLM: Returns a tool call record or JSON such as {"tool": "get_order_status", "args": {"order_id": "A102"}}
  3. Runtime: Pauses generation. Parses the tool call payload. Calls API. Gets "delayed, ETA Friday".
  4. Runtime: Feeds the result back as a tool-result message containing "delayed, ETA Friday"
  5. LLM: "Order A102 is delayed and now expected Friday."

This hidden round-trip relies entirely on your application code. The LLM doesn't execute the API request itself; it merely generates the structured request representing the intended call. The host application needs to securely execute the external call, manage connection timeouts, handle authentication, and then format the response back into a format that the LLM can ingest for its next reasoning step.

Common mistake: Treating schema validation as sufficient. Constrained outputs reduce syntax errors, but your runtime still needs to handle semantically bad arguments, missing auth, and tool timeouts.

A write needs another boundary: retrying the same agent turn must not create duplicate effects. Give each intended effect an idempotency key owned by your application, not invented afresh on every model retry.

make-agent-writes-idempotent.py
1def create_reshipment_once( 2 order_id: str, 3 idempotency_key: str, 4 applied: dict[str, str], 5) -> str: 6 if idempotency_key in applied: 7 return f"replay: {applied[idempotency_key]}" 8 shipment_id = f"R-{len(applied) + 2041}" 9 applied[idempotency_key] = shipment_id 10 return f"created: {shipment_id} for {order_id}" 11 12effects: dict[str, str] = {} 13key = "approve-reship:A102:policy-v3" 14print(create_reshipment_once("A102", key, effects)) 15print(create_reshipment_once("A102", key, effects)) 16print("shipments created:", len(effects))
Output
1created: R-2041 for A102 2replay: R-2041 3shipments created: 1

When agents break

An unconstrained loop can spend far beyond the intended request budget or issue duplicate writes. Production systems need runtime controls before an agent is allowed to affect orders, refunds, or customer communication.

Agents operate in dynamic environments where external state can change between steps. When a tool returns an unexpected format, or an API call times out, a naive agent might blindly retry the exact same action or invent a response. Because an observation influences later moves, errors can compound. A single unsupported conclusion can derail the execution plan.

To build reliable agents, engineers need strong guardrails at the runtime layer. This means treating the LLM as an unreliable sub-component rather than a deterministic program. You need to validate all structured outputs, enforce hard limits on execution steps, and provide clear, actionable error messages back to the model when a failure occurs.

enforce-read-and-write-budgets.py
1from dataclasses import dataclass 2 3@dataclass 4class Budget: 5 reads_left: int 6 writes_left: int 7 8def authorize_action(kind: str, budget: Budget) -> str: 9 if kind == "read" and budget.reads_left > 0: 10 budget.reads_left -= 1 11 return "allow read" 12 if kind == "write" and budget.writes_left > 0: 13 budget.writes_left -= 1 14 return "allow write" 15 return f"stop: {kind} budget exhausted" 16 17budget = Budget(reads_left=2, writes_left=1) 18print(authorize_action("read", budget)) 19print(authorize_action("write", budget)) 20print(authorize_action("write", budget))
Output
1allow read 2allow write 3stop: write budget exhausted
Failure ModeSymptomCauseFix
Infinite LoopsAgent repeats the same action (e.g., search("order A102 status")) forever.Tool returns an unhelpful result and the agent doesn't reformulate.Cycle Detection: Detect repeated recent patterns without progress, then stop or change strategy.
Hallucinated ToolsAgent calls VideoGenerator() when no such tool exists.Model invents a tool name that wasn't in the allowlist.Allowlisted Tools: Reject unknown tool names before execution and return a bounded error observation.
Context OverflowConversation history exceeds token limit.Every step appends raw observations or large summaries.External State + Summary: Keep authoritative results outside the prompt and pass a bounded summary plus recent evidence.
Goal DriftAgent forgets the original user intent after many steps.Long trajectory pushes the original query out of the model's attention window.Periodic Goal Restatement: Inject the original user query or a compact goal summary back into the next model call every K steps.
Interface ErrorsRequested action fails its tool schema.Model emits missing, invalid, or unsupported arguments.Strict Tool Contract: Use provider strict schemas when available, then validate arguments and policy in runtime code.
Brittle PlansPlan-and-Execute planner misinterprets intent, cascading failures through all steps.Planner made a wrong assumption at T=0 and executors blindly followed it.Plan Validation: Add a second-pass check that each plan step is achievable; trigger replanning early rather than waiting for executor failure.

Detecting cycles in code

To prevent loops, track normalized actions in a short recent window. The detector below catches repeated single actions and short alternating patterns. A production detector also needs progress signals, because an agent can repeat a valid read while receiving new pages of results.

detecting-cycles-in-code.py
1def normalized(call: dict[str, object]) -> tuple[str, tuple[tuple[str, str], ...]]: 2 args = call.get("args", {}) 3 assert isinstance(args, dict) 4 return str(call["tool"]), tuple(sorted((str(k), str(v)) for k, v in args.items())) 5 6def repeated_recent_pattern(history: list[dict[str, object]], max_period: int = 3) -> bool: 7 actions = [normalized(call) for call in history] 8 for period in range(1, min(max_period, len(actions) // 2) + 1): 9 if actions[-period:] == actions[-2 * period:-period]: 10 return True 11 return False 12 13same = [ 14 {"tool": "get_tracking", "args": {"order_id": "A102"}}, 15 {"tool": "get_tracking", "args": {"order_id": "A102"}}, 16] 17alternating = [ 18 {"tool": "get_tracking", "args": {"order_id": "A102"}}, 19 {"tool": "check_inventory", "args": {"sku": "A102-SKU"}}, 20 {"tool": "get_tracking", "args": {"order_id": "A102"}}, 21 {"tool": "check_inventory", "args": {"sku": "A102-SKU"}}, 22] 23progress = [ 24 {"tool": "get_tracking", "args": {"order_id": "A102"}}, 25 {"tool": "check_inventory", "args": {"sku": "A102-SKU"}}, 26] 27print("same-action loop:", repeated_recent_pattern(same)) 28print("alternating loop:", repeated_recent_pattern(alternating)) 29print("progress loop:", repeated_recent_pattern(progress))
Output
1same-action loop: True 2alternating loop: True 3progress loop: False

Try it yourself

Below is a partial ReAct trace for a different order. Three steps are shown, but the fourth step contains a mistake. Read the trace, identify what goes wrong, and decide what the agent should have done instead.

text
1Question: Order B305 is marked delivered, but the customer says it never arrived. 2 3Thought: I need to check the carrier's proof-of-delivery photo. 4Action: get_pod(order_id="B305") 5Observation: Photo shows package left at front door. GPS tag: 48.8566° N, 2.3522° E. 6 7Thought: The customer claims non-delivery. I should verify the delivery address. 8Action: get_order_address(order_id="B305") 9Observation: Delivery address: 12 Merchant Way, Paris 75004. 10 11Thought: The GPS tag matches Paris. I'll close the ticket as delivered. 12Action: close_ticket(order_id="B305", resolution="delivered") 13Observation: Ticket closed. Customer satisfaction score dropped. 14 15Thought: ???

What went wrong? The agent checked that the GPS tag was in Paris, but it never compared the GPS coordinates to the exact delivery address. A package left at the wrong building in the same city is still a failed delivery. The agent should have checked the distance to the address coordinates. If the distance was above a threshold, it should have flagged a possible misdelivery rather than closing the ticket.

The fix: Add an enforced verify_delivery_distance precondition before the close action. A prompt reminder is useful context, but it can't block a write when the model ignores it.

Assume an address service returns reference coordinates for the delivery address. The runtime can calculate the distance and route distant proof to investigation instead of trusting the model's interpretation.

enforce-proof-of-delivery-distance.py
1from math import asin, cos, radians, sin, sqrt 2 3def distance_meters(left: tuple[float, float], right: tuple[float, float]) -> float: 4 lat1, lon1 = map(radians, left) 5 lat2, lon2 = map(radians, right) 6 dlat, dlon = lat2 - lat1, lon2 - lon1 7 haversine = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2 8 return 2 * 6_371_000 * asin(sqrt(haversine)) 9 10def dispute_action(proof: tuple[float, float], expected: tuple[float, float], max_meters: float) -> str: 11 separation = distance_meters(proof, expected) 12 if separation <= max_meters: 13 return "eligible for reviewed closure" 14 return "investigate possible misdelivery" 15 16proof_photo_location = (48.8566, 2.3522) 17delivery_address_location = (48.8560, 2.3590) 18separation = distance_meters(proof_photo_location, delivery_address_location) 19print("distance over 100m:", separation > 100) 20print("action:", dispute_action(proof_photo_location, delivery_address_location, max_meters=100))
Output
1distance over 100m: True 2action: investigate possible misdelivery

Key takeaways

  • Start with the simplest pattern. An agent is "just LLMs using tools based on environmental feedback in a loop"[1]. If the steps are known in advance, a fixed workflow or a single tool-calling call beats an autonomous loop on cost, latency, and debuggability. Reach for ReAct or Plan-and-Execute only when the next step genuinely depends on what the model observes.
  • ReAct is a useful pattern for short, interactive tasks whose next move depends on fresh evidence. In practice, start with a small tool-calling loop and add orchestration only when evaluations justify it.[1]
  • Plan-and-Execute fits work with a stable global structure and checkable local steps. It separates planning from execution while requiring validation and replanning.
  • Reflexion[7] adds a memory of written lessons from failed attempts on top of a ReAct loop. It is a refinement, not a separate control architecture.
  • Multi-agent systems can use a Plan-and-Execute shape when planner, executor, and verifier roles become separate workers. This chapter gives you the control-loop vocabulary you will need before later orchestration work.
  • Memory matters for both architectures. ReAct trajectories grow one observation at a time, while Plan-and-Execute systems need somewhere to store intermediate outputs between phases. A dedicated article on agent memory and persistence follows later in the path.
  • Observability is key. Log state you actually control: validated tool calls, planner outputs, redacted observations, approvals, retries, and budget usage. Don't make raw chain-of-thought your audit artifact.

Mastery check

Key concepts

  • ReAct Pattern
  • Plan-and-Execute
  • Chain-of-Thought
  • Tool Use
  • Self-Consistency
  • Reflexion
  • Cycle Detection
  • Replanning

Evaluation rubric

  • Foundational: Implement a ReAct-style loop with validated actions, observations, and runtime-owned budgets
  • Intermediate: Design a Plan-and-Execute architecture with distinct Planner and Executor roles
  • Advanced: Compare the latency, token-cost, and recovery trade-offs between ReAct and Plan-and-Execute
  • Advanced: Explain how localized executor context and external memory reduce prompt growth
  • Advanced: Implement failure recovery mechanisms like replanning, cycle detection, and step-limit guardrails

Follow-up questions

Common pitfalls

  • Symptom: a ReAct agent gets lost in long audits or migrations. Cause: the task needs a stable global plan, but the loop only sees the next local step. Fix: switch to Plan-and-Execute or use a planner plus local ReAct executors.
  • Symptom: tool calls repeat until the budget is gone. Cause: loop detection and hard step limits live only in the prompt, not in runtime code. Fix: enforce step caps, cycle detection, and timeout budgets outside the model.
  • Symptom: the agent gets slower and forgets earlier intent after many turns. Cause: every raw observation and intermediate update was appended forever. Fix: retain authoritative state outside the prompt, summarize older evidence, restate the goal periodically, and keep only recent detail in context.
  • Symptom: every executor follows a bad plan faithfully. Cause: the initial Plan-and-Execute plan was treated as truth instead of a draft. Fix: validate the plan early and patch the remaining plan when new evidence breaks old assumptions.
Next Step
Continue to Guardrails & Safety Filters

There, you'll master layered guardrails for production LLM systems, including prompt injection defense, PII controls, structured output constraints, and policy-driven enforcement.

PreviousStructured Output Generation
Share this article
XFacebookLinkedInBlueskyRedditHacker NewsEmail
References

Building Effective Agents

Anthropic · 2024

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.

Wei, J., et al. · 2022 · NeurIPS

ReAct: Synergizing Reasoning and Acting in Language Models.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. · 2023 · ICLR 2023

Reasoning models

OpenAI · 2026

Structured outputs

OpenAI · 2024

Self-Consistency Improves Chain of Thought Reasoning in Language Models.

Wang, X., et al. · 2022

Reflexion: Language Agents with Verbal Reinforcement Learning.

Shinn, N., et al. · 2023

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models.

Wang, L., et al. · 2023