How to Build an AI Agent from Scratch

Build a working AI agent from the raw loop: define tools, let the model choose one, execute it in Python, append the observation, and add the guardrails that keep agents reliable.

By LeetLLM Team. Published February 19, 2026. Updated May 26, 2026. 25 min read.

Imagine a support assistant with access to an order database, a billing API, an audit log, and a policy archive. You don't have to spell out every database query in advance. You give it a goal, it decides which tool to use next, your code executes that tool, and the result becomes context for the next decision. That's the core of an AI agent.

Frameworks hide this loop behind helpful abstractions. This article strips those abstractions away. We'll build the loop directly, wire real tool calls, hit the first failure modes, and end with a production-oriented scaffold you can compare against LangGraph, CrewAI, or an OpenAI Agents SDK project later.

Before we start, you need to be comfortable writing Python, making HTTP requests, and reading JSON. If you've ever written a script that calls an API and parses the response, you have enough background. We will use the OpenAI Python client, but the same loop works with any model that supports tool calling.

By the end, you can explain the agent loop in plain English, implement a small tool-using agent, and recognize why production agents need limits, logging, memory, and human approval around the core.

What is an AI agent?

Figure: Agent concept diagram showing a user goal flowing into an LLM decision, then code execution, then an observation that feeds the next step.
The agent loop separates decision from execution: the model picks the next step, your code does the real work, and the observation feeds the next turn.

Strip away framework names and an AI agent is small in concept. It's made of three parts:

  1. An LLM (Large Language Model) that can select the next step
  2. A set of tools your code can execute when the LLM requests them (search, calculate, read files, call APIs)
  3. A loop that runs until the task is done or the agent gives up

| Capability | Single LLM Call | Agent Loop |
| --- | --- | --- |
| Multi-step reasoning | Limited to one response | Iterates until completion |
| External actions | No direct tool execution | Calls APIs, files, and services |
| Error recovery | User must retry manually | Can retry or switch strategy |
| State handling | Context only in one prompt | Maintains a running message/tool history |

That's it. The LLM reads the prompt and decides whether it needs a tool. If it does, it requests the call, your code runs it, and the result goes back to the model, which decides again. This cycle repeats until the LLM produces a final answer.

The important boundary is this: the LLM doesn't "do" anything itself. It decides what should happen next. The tools do the work. The loop connects decisions to actions to observations to more decisions.
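To make that boundary concrete before any API is involved, here is a minimal sketch of the loop with the model replaced by a scripted stand-in. The names `scripted_model`, `run_loop`, and the decision dict shape are illustrative assumptions, not part of any SDK; the real OpenAI version appears later in the article.

```python
import json

def get_order_status(order_id: str) -> str:
    """Stand-in tool that returns a canned order record."""
    return json.dumps({"order_id": order_id, "delivered": "2025-04-10"})

TOOLS = {"get_order_status": get_order_status}

def scripted_model(messages: list[dict]) -> dict:
    """Hypothetical stand-in for an LLM call.

    It requests one tool call, then answers once a tool result is present.
    """
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "name": "get_order_status",
                "args": {"order_id": "12345"}}
    return {"type": "final", "content": "Order 12345 was delivered on 2025-04-10."}

def run_loop(user_message: str, max_steps: int = 5) -> tuple[str, list[dict]]:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        decision = scripted_model(messages)                    # the model decides
        if decision["type"] == "final":
            return decision["content"], messages
        result = TOOLS[decision["name"]](**decision["args"])   # your code executes
        messages.append({"role": "tool", "content": result})   # the observation feeds back
    return "Step limit reached.", messages

answer, history = run_loop("When was order 12345 delivered?")
print(answer)
```

The stand-in never touches `TOOLS` itself. It only returns a description of what it wants, which is the same separation the real loop keeps.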

Tools like LangGraph, CrewAI, and OpenAI's tool-calling APIs all implement this same loop with different abstractions. Understanding the raw loop makes every framework easier to learn because you can see which part of the loop the framework is managing for you.

One caution before you reach for the loop at all. Anthropic's guidance is to find the simplest solution first and only add complexity when it earns its keep, which sometimes means not building an agent.[1] A workflow with fixed code paths is more predictable and cheaper than a model that decides its own steps. Reach for an open-ended loop when the task genuinely needs the model to choose what to do next based on what it observes. The rest of this article builds that loop so you understand it, not so you reach for it by default.

A trace through one turn

Before we look at code, let's walk through a concrete scenario. Suppose a customer asks: "Can I still return order 12345?"

Our agent has three tools:

  • get_order_status(order_id) looks up an order in the database
  • read_file(path) reads the billing policy from a file
  • get_current_date() returns today's date

Here is how one loop iteration plays out:

  1. User message: "Can I still return order 12345?"
  2. Agent thinks: "I need the delivery date for order 12345 and the billing policy to check the time window."
  3. Agent acts: It calls get_order_status with {"order_id": "12345"}
  4. Observation: The database returns {"total": 89.99, "items": 3, "delivered": "2025-04-10"}
  5. Agent thinks: "The order was delivered on April 10. I need the policy to know how many days I have."
  6. Agent acts: It calls read_file with {"path": "policy.txt"}
  7. Observation: The file returns "Returns accepted within 30 days of delivery."
  8. Agent acts: It calls get_current_date with {}
  9. Observation: The date tool returns "2025-05-11".
  10. Agent thinks: "April 10 to May 11 is 31 days. That exceeds the 30-day window."
  11. Final answer: "Order 12345 was delivered on April 10, so the 30-day return window closed on May 10. This order is no longer eligible for a return."

Notice what happened. The LLM never touched the database or the file system directly. It emitted structured requests. Our Python code executed those requests, appended the results to the conversation, and called the LLM again. The LLM selected steps and wrote text. The code handled the real work.

The simplest possible agent

Let's start with the smallest thing that could possibly work. We'll build an agent that can use exactly one tool: a calculator. In our support scenario, this might compute a shipping surcharge or a prorated refund. By constraining the environment to a single tool, we can focus entirely on the core loop that drives the agent's decision-making process.

To make this work, we provide the LLM with a strict definition of what the tool is, what it does, and what inputs it expects. We do this by passing a JSON schema that describes our function. When the model determines that a calculation is necessary, it generates a structured request matching that schema rather than only writing plain text.[2]

The code below sets up the OpenAI Python client and defines a calculator tool. The example uses Chat Completions because the raw message loop is easy to inspect directly: every assistant message, tool call, and observation is a plain dict you append to a list. OpenAI now recommends the newer Responses API for tool-calling and multi-turn projects, and it manages more of this state for you, but the control flow is identical. Define tools, let the model choose, execute the call, append the result, and continue. To run it locally, install the openai package, set OPENAI_API_KEY, and use a tool-capable model your account can access.[2]

the-simplest-possible-agent.py
```python
from openai import OpenAI
import ast
import json
import operator

client = OpenAI()

ALLOWED_BINARY_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}

ALLOWED_UNARY_OPERATORS = {
    ast.UAdd: operator.pos,
    ast.USub: operator.neg,
}

def evaluate_node(node: ast.AST) -> int | float:
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value

    if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED_BINARY_OPERATORS:
        left = evaluate_node(node.left)
        right = evaluate_node(node.right)
        return ALLOWED_BINARY_OPERATORS[type(node.op)](left, right)

    if isinstance(node, ast.UnaryOp) and type(node.op) in ALLOWED_UNARY_OPERATORS:
        return ALLOWED_UNARY_OPERATORS[type(node.op)](evaluate_node(node.operand))

    raise ValueError("Only numeric arithmetic is supported.")

def safe_calculate(expression: str) -> str:
    if len(expression) > 120:
        raise ValueError("Expression too long.")
    tree = ast.parse(expression, mode="eval")
    result = evaluate_node(tree.body)
    return str(result)

# Define a tool
tools = [
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a math expression and return the result",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "A Python math expression, e.g. '2 + 3 * 4'"
                    }
                },
                "required": ["expression"],
                "additionalProperties": False
            }
        }
    }
]

def calculate(expression: str) -> str:
    """Evaluate a limited arithmetic expression."""
    try:
        return safe_calculate(expression)
    except (SyntaxError, ValueError, ZeroDivisionError) as e:
        return f"Error: {e}"

# The agent loop
def run_agent(user_message: str) -> str:
    """Run the agent loop until the model returns a final answer."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant. "
                                      "Use the calculate tool when you need to do math."},
        {"role": "user", "content": user_message}
    ]

    while True:
        response = client.chat.completions.create(
            model="gpt-4.1",  # or another tool-capable model you can access
            messages=messages,
            tools=tools,
        )

        choice = response.choices[0]

        # If the model wants to call a tool
        if choice.finish_reason == "tool_calls":
            assistant_message = choice.message.model_dump(exclude_none=True)
            messages.append(assistant_message)
            for tool_call in choice.message.tool_calls:
                args = json.loads(tool_call.function.arguments)
                result = calculate(args["expression"])

                # Add the tool result
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result,
                })
        elif choice.finish_reason == "stop":
            # Model is done, return the final answer
            return choice.message.content
        else:
            return f"Agent stopped unexpectedly: {choice.finish_reason}"

# Test it
answer = run_agent("What is 42 * 17 + 389?")
print(answer)
```

The exact sentence depends on the model, but the answer should include 1103.

This is already a working agent loop. The while True block repeats model calls. The if choice.finish_reason == "tool_calls" branch is where the model asks to act. Your code parses the arguments, executes the tool, appends the observation, and calls the model again.

The calculator uses an AST whitelist instead of eval(). That matters even in a tutorial. Readers copy code, and tool inputs are model-generated strings. Treat those strings as untrusted unless a sandbox, parser, or allowlist proves otherwise.

The mistake to avoid is stopping here. A raw loop without error handling, timeouts, cost tracking, loop limits, structured logging, and human approval is a demo, not a production system. We'll add those layers after the core pattern is clear.

Adding real tools

A calculator is fun, but a useful agent needs tools that interact with the world. Let's add three more tools that fit our support scenario: an order lookup, a policy reader, and a date helper. The following snippets show how to define these operations. Each tool accepts simple string arguments and returns string results, expanding what our agent can accomplish.

adding-real-tools.py
```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Tool registry: maps tool names to functions
TOOL_REGISTRY = {}

def tool(func):
    """Decorator that registers a function as a tool."""
    TOOL_REGISTRY[func.__name__] = func
    return func

@tool
def calculate(expression: str) -> str:
    """Evaluate a limited arithmetic expression and return the result."""
    try:
        return safe_calculate(expression)
    except (SyntaxError, ValueError, ZeroDivisionError) as e:
        return f"Error: {e}"

# Mock order database
ORDER_DB = {
    "12345": {"total": 89.99, "items": 3, "delivered": "2025-04-10"},
    "67890": {"total": 149.50, "items": 1, "delivered": "2025-05-01"},
}

@tool
def get_order_status(order_id: str) -> str:
    """Look up an order by ID and return its total, item count, and delivery date."""
    order = ORDER_DB.get(order_id)
    if not order:
        return f"Error: Order {order_id} not found."
    return json.dumps(order)

POLICY_DIR = Path("policies").resolve()

@tool
def read_file(path: str) -> str:
    """Read a policy file from the approved policy directory."""
    try:
        target = (POLICY_DIR / path).resolve()
        target.relative_to(POLICY_DIR)
        content = target.read_text(encoding="utf-8")
        if len(content) > 3000:
            return content[:3000] + "\n... (truncated)"
        return content
    except FileNotFoundError:
        return f"Error: File not found: {path}"
    except ValueError:
        return f"Error: Path outside policy directory: {path}"

@tool
def get_current_date() -> str:
    """Get the current UTC date in ISO 8601 format."""
    return datetime.now(timezone.utc).date().isoformat()
```

The @tool decorator is a pattern you'll see in many agent frameworks. It registers tools in a lookup table so the agent loop can find the right function when the model asks for it.
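The registry tells your code which function to run, but the model still needs a JSON schema for each tool. One way to keep the two in sync is to derive the schema from the function signature. The sketch below is an assumption-heavy illustration: `build_tool_schema` is a hypothetical helper, and it treats every parameter as a required string, which matches the tools in this article but not tools in general.

```python
import inspect

TOOL_REGISTRY = {}

def tool(func):
    """Decorator that registers a function as a tool."""
    TOOL_REGISTRY[func.__name__] = func
    return func

def build_tool_schema(func) -> dict:
    """Build a Chat Completions-style tool definition from a signature.

    Assumption: every parameter is a required string.
    """
    params = inspect.signature(func).parameters
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func) or "",
            "parameters": {
                "type": "object",
                "properties": {name: {"type": "string"} for name in params},
                "required": list(params),
                "additionalProperties": False,
            },
        },
    }

@tool
def get_order_status(order_id: str) -> str:
    """Look up an order by ID."""
    return f"status for {order_id}"

# One schema per registered tool, ready to pass as the `tools` argument.
tools = [build_tool_schema(func) for func in TOOL_REGISTRY.values()]
print(tools[0]["function"]["name"])
```

Because the description comes from the docstring, a vague docstring becomes a vague tool description, so the docstrings are worth writing carefully.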

Now the agent loop becomes generic. We implement a dispatcher function that takes the tool's name and its arguments, and dynamically executes it. If the requested tool is missing from our registry, it gracefully returns an error string.

adding-real-tools-2.py
```python
def execute_tool(name: str, args: dict) -> str:
    """Look up and execute a tool by name."""
    if name not in TOOL_REGISTRY:
        return f"Error: Unknown tool '{name}'"
    try:
        return TOOL_REGISTRY[name](**args)
    except TypeError as e:
        return f"Error: Bad arguments for '{name}': {e}"
```

The ReAct pattern

What we've built so far follows the same outer loop popularized by ReAct (Reasoning + Acting)[3]. In the original paper, the model emits explicit reasoning traces alongside actions. In production systems that reasoning may be hidden, summarized, or replaced with planner state, but the outer loop is the same:

Figure: ReAct loop diagram showing think, act, observe, and continue-or-stop as a bounded cycle.
ReAct is a bounded think-act-observe loop. Each tool result becomes context for the next model call.

Each iteration of the loop has three phases:

  1. Think (Reason): The LLM reads the conversation history and decides what to do next
  2. Act: It calls a tool with specific arguments
  3. Observe: The tool result is added to the conversation, and the loop continues

Conceptually, the agent's next step might be: "I need to search for both values, then calculate the product." Whatever form that reasoning takes, the important part is the loop: each tool result feeds back into the next decision.

For a deeper architecture comparison, the LeetLLM lesson on ReAct, Plan-and-Execute, and other agentic architectures covers when ReAct works, when it doesn't, and what alternatives exist for complex multi-step tasks.

Where it breaks

Here's where things get interesting. Once you move beyond calculator demos, five failure modes show up quickly. These aren't edge cases. They're the problems you hit first when a tool-using loop meets messy real-world tasks.

Figure: Agent failure modes visual summarizing loop repetition, hallucinated tools, context overflow, malformed JSON arguments, and wrong tool choice.
Most early agent failures come from loop control, tool validation, context growth, JSON parsing, and tool-routing policy.

Failure 1: The infinite loop

The agent calls the same tool with the same arguments, gets the same result, and tries again without making progress.

What causes it

The model doesn't recognize that the tool result already answered the question, or it doesn't know how to recover from an error.

The fix

Add a maximum iteration count so the loop stops after a set number of steps, which bounds both runtime and token cost. We pass the user message into our loop and enforce a strict loop boundary. If the task is incomplete after these iterations, we return a fallback string instead of waiting indefinitely.

the-fix.py
```python
MAX_ITERATIONS = 10

def run_agent(user_message: str) -> str:
    messages = [...]

    for iteration in range(MAX_ITERATIONS):
        response = client.chat.completions.create(...)
        choice = response.choices[0]

        if choice.finish_reason != "tool_calls":
            return choice.message.content

        # ... process tool calls ...

    return "I wasn't able to complete this task within the step limit."
```
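An iteration cap stops the loop eventually, but it still pays for every repeated step. A complementary sketch, not part of the original scaffold, is to notice when the same tool is requested with the same arguments and return an error observation instead of executing it again. `RepeatDetector` and its `max_repeats` threshold are hypothetical names chosen for illustration.

```python
import json
from collections import Counter

class RepeatDetector:
    """Counts identical (tool name, arguments) requests within one run."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def is_stuck(self, name: str, args: dict) -> bool:
        """Record a call and report whether it has repeated too many times."""
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} count as the same call.
        key = (name, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        return self.seen[key] > self.max_repeats

detector = RepeatDetector(max_repeats=2)
calls = [("get_order_status", {"order_id": "12345"})] * 3
flags = [detector.is_stuck(name, args) for name, args in calls]
print(flags)  # [False, False, True]
```

When `is_stuck` returns `True`, one option is to append a tool message telling the model it already has this result, which gives it a chance to change course before the step limit is reached.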

Failure 2: Hallucinated tool calls

The model invents a tool that doesn't exist ("I'll use the query_database tool to..."), and the loop crashes when the lookup fails.

What causes it

The tool list you pass to the model doesn't match the model's expectations, or the model is trying to use tools it has seen in training data.

The fix

Validate tool names before execution, and return a helpful error back to the model so it can realize its mistake and recover. The updated lookup function below verifies the incoming tool name. When an invalid tool is requested, it provides an error message containing the valid options, enabling the agent to self-correct.

the-fix-2.py
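```python
def execute_tool(name: str, args: dict) -> str:
    if name not in TOOL_REGISTRY:
        return (f"Error: Tool '{name}' doesn't exist. "
                f"Available tools: {list(TOOL_REGISTRY.keys())}")
    try:
        return TOOL_REGISTRY[name](**args)
    except TypeError as e:
        # Keep the bad-arguments handling from the first version of this function.
        return f"Error: Bad arguments for '{name}': {e}"
```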

Failure 3: Context window overflow

After enough tool calls, the conversation history exceeds the model's context window. The agent either crashes or starts losing the original instructions.

What causes it

Each tool result adds tokens. A policy file might return 500 tokens. After 10 tool calls, you've consumed 5,000+ tokens of context just on tool results.

The fix

Summarize or truncate tool results before adding them to context. The following function cuts off long strings to avoid exceeding the maximum token limit. It takes the raw text output from a tool and ensures it stays below a specified character threshold, returning a safer, shortened string.

the-fix-3.py
```python
def truncate_result(result: str, max_chars: int = 2000) -> str:
    """Truncate tool results to prevent context overflow."""
    if len(result) > max_chars:
        return result[:max_chars] + "\n... (truncated)"
    return result
```
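Cutting from the end is the simplest option, but some tool outputs put the important part last, such as an error at the bottom of a log. As a variation under that assumption, the sketch below keeps the head and the tail and drops the middle. `truncate_middle` is a hypothetical helper, not something the scaffold later in this article uses.

```python
def truncate_middle(result: str, max_chars: int = 2000) -> str:
    """Keep the start and end of a long tool result and drop the middle."""
    if len(result) <= max_chars:
        return result

    marker = "\n... (truncated) ...\n"
    keep = max_chars - len(marker)
    if keep <= 0:
        # Budget too small to fit the marker, so fall back to a plain cut.
        return result[:max_chars]

    head = keep // 2 + keep % 2   # the head gets the extra character when keep is odd
    tail = keep // 2
    ending = result[-tail:] if tail else ""
    return result[:head] + marker + ending

log = "A" * 3000 + "END"
short = truncate_middle(log, max_chars=200)
print(len(short))  # 200
```

Unlike the version above, the output here never exceeds `max_chars`, because the marker is counted against the budget.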

Failure 4: Argument parsing errors

The model generates malformed JSON for tool arguments. This happens often with smaller models.

What causes it

LLMs don't always produce valid JSON, particularly under complex argument schemas.

The fix

Retry with an error message so the model can attempt to fix its JSON format. Better yet, enable strict structured tool schemas when your provider supports them.[4] By wrapping the JSON parser in a try-except block, we catch decoding failures. We then append the error as a tool response, prompting the model to re-generate valid arguments in the next iteration.

the-fix-4.py
```python
# Fragment: this runs inside the `for tool_call in ...` loop,
# so `continue` moves on to the next requested tool call.
try:
    args = json.loads(tool_call.function.arguments)
except json.JSONDecodeError as e:
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": f"Error parsing arguments: {e}. Please try again with valid JSON."
    })
    continue
```
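Valid JSON can still be the wrong shape: a missing key, an extra key, or a list where an object was expected. A small sketch of a second check, assuming the tool's `parameters` schema is available at call time, is shown below. `parse_tool_arguments` is a hypothetical helper that only checks required and unknown keys; it isn't a full JSON Schema validator.

```python
import json
from typing import Optional

def parse_tool_arguments(raw: str, parameters: dict) -> tuple[Optional[dict], Optional[str]]:
    """Return (args, None) on success or (None, error_message) on failure."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"Error parsing arguments: {e}. Please try again with valid JSON."

    if not isinstance(args, dict):
        return None, "Error: arguments must be a JSON object."

    missing = [key for key in parameters.get("required", []) if key not in args]
    if missing:
        return None, f"Error: missing required arguments: {missing}"

    unknown = [key for key in args if key not in parameters.get("properties", {})]
    if unknown:
        return None, f"Error: unknown arguments: {unknown}"

    return args, None

schema = {
    "type": "object",
    "properties": {"expression": {"type": "string"}},
    "required": ["expression"],
}

ok = parse_tool_arguments('{"expression": "2 + 2"}', schema)
bad_json = parse_tool_arguments('{"expression": ', schema)
missing_key = parse_tool_arguments("{}", schema)
print(ok)
print(missing_key[1])
```

The error string goes back to the model as the tool result, the same way the parsing error does in the fragment above.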

Failure 5: Wrong tool selection

The model picks the wrong tool for the job. It tries to use calculate for a date comparison, or calls read_file for a fact that's already in the conversation history.

What causes it

Bad tool descriptions. If the description is vague ("do math stuff"), the model can't decide when to use it.

The fix

Write precise tool descriptions that include when to use the tool, not just what it does. Here's a comparison between a vague description and an effective one. The updated description clarifies expected inputs and scenarios, which helps the LLM select the correct operation for its current step.

```text
# Bad
"description": "Do calculations"

# Good
"description": "Evaluate a Python math expression and return the numeric result. "
               "Use this when you need to perform arithmetic, compute percentages, "
               "or solve math problems. Input must be a valid Python expression."
```

The LeetLLM lesson on Agent Failure States, Retries, and Fallback Strategies catalogs more failure modes and recovery strategies. This article focuses on the first five because they appear as soon as you move from a calculator demo to real tools.

Adding memory

Our agent has a problem: it forgets everything between runs. Each invocation starts fresh. For many use cases that's fine, but conversational and task-continuation agents usually need more than one memory layer. In practice it's useful to separate short-term context, retrieval memory, and durable application state.

| Memory Type | Mechanism | Lifetime | Primary Use Case | Cost Impact |
| --- | --- | --- | --- | --- |
| Short-term | Appending to messages array | Current session | Context for multi-turn chats | Increases token cost linearly |
| Retrieval memory | Storing embeddings in a vector DB | Persistent across sessions | Recalling relevant past facts, docs, or episodes | Requires separate DB storage/retrieval |
| Durable state | Relational DB or key-value store | Persistent across sessions | Exact user preferences, permissions, workflow state | Requires transactional storage and schema design |

Short-term memory (conversation history)

Short-term memory keeps the messages array around between calls instead of rebuilding it each time. Here's how we initialize the agent with a persistent message list. We define a class that stores the system prompt upon creation. The run method then appends the user's input and maintains the ongoing context across multiple questions.

This approach works for short interactions, but it scales linearly in cost and latency. Every new turn forces the model to re-process the entire conversation history, which eventually hits context window limits. For simple agents, however, this in-memory list is the most direct way to give the model a sense of continuity.

short-term-memory-conversation-history.py
```python
class Agent:
    """A conversational agent that maintains short-term memory (context history)."""

    def __init__(self, system_prompt: str):
        """Initialize the agent with a system prompt."""
        self.messages = [
            {"role": "system", "content": system_prompt}
        ]

    def run(self, user_message: str) -> str:
        """Process a user message and run the agent loop."""
        self.messages.append({"role": "user", "content": user_message})

        for iteration in range(MAX_ITERATIONS):
            response = self._call_llm()
            # ... process response ...

        return "I wasn't able to finish within the step limit."
```

Retrieval memory (persistent)

For information that persists across sessions, you need a database. A simple retrieval layer stores embeddings and lets the agent fetch semantically related context when it needs it. We can expose this as a tool the agent can decide to call. The remember tool accepts a string fact and inserts its embedding, while the recall tool searches for related memories based on a text query.

By treating memory as another set of tools, the agent can request saves and recalls through the same loop, subject to your product policy. In more advanced setups, a separate process might summarize conversation history and extract these facts without requiring explicit tool calls from the primary reasoning agent. The following sketch shows the remember and recall tool interface.

retrieval-memory-persistent.py
```python
# Sketch only: `db`, `embed`, and `now` are placeholders for your
# vector store client, embedding function, and clock.

@tool
def remember(fact: str) -> str:
    """Store an important fact for future reference."""
    db.insert(embed(fact), metadata={"fact": fact, "timestamp": now()})
    return f"Remembered: {fact}"

@tool
def recall(query: str) -> str:
    """Search memory for relevant past facts."""
    results = db.search(embed(query), top_k=5)
    return "\n".join(r.metadata["fact"] for r in results)
```

One warning: a vector database is great for fuzzy recall, but it's the wrong source of truth for exact state like account balances, permission flags, or whether an order was already shipped. Keep exact business facts in a relational or key-value store and expose that state through tools.
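To experiment with the remember and recall flow without a vector database, a toy in-memory store is enough. The sketch below swaps real embeddings for a bag-of-words count vector and ranks by cosine similarity. `InMemoryStore`, this `embed`, and `cosine` are stand-ins invented for illustration, and word overlap is a much weaker signal than a learned embedding.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class InMemoryStore:
    """Keeps (vector, fact) pairs in a list and searches them linearly."""

    def __init__(self):
        self.items = []

    def insert(self, fact: str) -> None:
        self.items.append((embed(fact), fact))

    def search(self, query: str, top_k: int = 5) -> list[str]:
        query_vector = embed(query)
        ranked = sorted(self.items,
                        key=lambda item: cosine(query_vector, item[0]),
                        reverse=True)
        return [fact for _, fact in ranked[:top_k]]

store = InMemoryStore()
store.insert("Customer prefers email over phone calls")
store.insert("Order 12345 was delivered on 2025-04-10")
store.insert("Customer asked about the return policy last week")

top = store.search("when was order 12345 delivered", top_k=1)
print(top)  # ['Order 12345 was delivered on 2025-04-10']
```

Swapping in a real embedding model and vector store changes `embed` and `InMemoryStore`, while the tool interface the agent sees can stay the same.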

The LeetLLM lesson on Agent Memory and Persistence Patterns covers the full spectrum, from simple conversation history to episodic memory and working memory architectures.

Making it production-ready

The 200-line agent works for demos. For production, you need several more layers. Here's a checklist:

Figure: Production agent guardrail stack showing the core tool loop surrounded by budget, timeout, approval, memory, logging, and evaluation layers.
A production agent is not only a tool loop. The loop sits inside a control stack: memory, budget, timeout, logging, approval, and evaluation each block a different class of failure.

Cost controls

LLM calls are expensive. A runaway agent can burn through hundreds of dollars in minutes. To prevent this, implement a cost tracking class that monitors token usage and halts the agent if it exceeds a specified budget. The CostTracker examines the usage metrics from each API response. Because providers expose usage fields and pricing tables a little differently, keep the per-model rates in configuration instead of hardcoding them deep in the loop.

cost-controls.py
```python
class CostLimitExceeded(RuntimeError):
    pass

class CostTracker:
    """Tracks LLM API usage costs across multiple calls."""

    def __init__(
        self,
        max_cost_usd: float = 1.0,
        input_cost_per_million: float = 0.0,
        output_cost_per_million: float = 0.0,
    ):
        """Initialize tracker with a maximum allowed cost in USD."""
        self.total_cost = 0.0
        self.max_cost = max_cost_usd
        self.input_cost_per_million = input_cost_per_million
        self.output_cost_per_million = output_cost_per_million

    def track(self, response):
        """Calculate and accumulate the cost of a single API response."""
        usage = response.usage
        input_tokens = getattr(usage, "input_tokens", None)
        if input_tokens is None:
            input_tokens = getattr(usage, "prompt_tokens", 0)

        output_tokens = getattr(usage, "output_tokens", None)
        if output_tokens is None:
            output_tokens = getattr(usage, "completion_tokens", 0)

        cost = (
            input_tokens * self.input_cost_per_million +
            output_tokens * self.output_cost_per_million
        ) / 1_000_000
        self.total_cost += cost

        if self.total_cost > self.max_cost:
            raise CostLimitExceeded(
                f"Agent exceeded cost limit: ${self.total_cost:.4f} > ${self.max_cost}"
            )
```
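A quick worked example shows how the formula behaves. The rates below are made-up placeholders, not real prices for any model, and `estimate_cost_usd` is a hypothetical helper that mirrors the arithmetic inside `CostTracker.track`.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_cost_per_million: float,
    output_cost_per_million: float,
) -> float:
    """Cost of one call, given per-million-token rates in USD."""
    return (
        input_tokens * input_cost_per_million
        + output_tokens * output_cost_per_million
    ) / 1_000_000

# Hypothetical rates for illustration only. Look up your provider's pricing.
INPUT_RATE = 2.00    # USD per million input tokens
OUTPUT_RATE = 8.00   # USD per million output tokens

# One loop step that sends 12,000 tokens of history and gets 500 tokens back.
per_call = estimate_cost_usd(12_000, 500, INPUT_RATE, OUTPUT_RATE)
calls_before_limit = int(1.0 // per_call)

print(f"{per_call:.3f}")     # 0.028
print(calls_before_limit)    # 35
```

In a real run the input side grows every step because the history is resent, so the number of steps a budget buys is smaller than this flat estimate suggests.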

Observability

When an agent fails, you need to know why. Log each loop step to create an audit trail of the agent's actions, arguments, and outcomes. By adding standard Python logging calls to each phase, we capture the chosen tools, their arguments, and the resulting values. This gives you a clean execution trace without pretending you can or should log the model's hidden chain-of-thought.

observability.py
```python
import logging

logger = logging.getLogger("agent")

# Fragment: this is the loop body from run_agent() with logging added.
for iteration in range(MAX_ITERATIONS):
    logger.info(f"Iteration {iteration + 1}/{MAX_ITERATIONS}")

    response = client.chat.completions.create(...)
    choice = response.choices[0]

    if choice.finish_reason == "tool_calls":
        for tool_call in choice.message.tool_calls:
            logger.info(f"  Tool: {tool_call.function.name}")
            logger.info(f"  Args: {tool_call.function.arguments}")
            result = execute_tool(...)
            logger.info(f"  Result: {result[:200]}")
    else:
        # content can be None, so fall back to an empty string before slicing
        logger.info(f"  Final answer: {(choice.message.content or '')[:200]}")
```

Timeouts

A tool might hang. A web search might time out. You need per-tool timeouts and a total execution timeout. A portable pattern is to run the tool in a worker and wait with a timeout. This helper function takes the tool's name and arguments alongside a timeout duration. It either returns the successful result or catches the timeout to yield an error string.

timeouts.py
```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeoutError

TOOL_EXECUTOR = ThreadPoolExecutor(max_workers=8)

def execute_tool_with_timeout(name: str, args: dict, timeout_seconds: int = 30) -> str:
    future = TOOL_EXECUTOR.submit(execute_tool, name, args)
    try:
        return future.result(timeout=timeout_seconds)
    except FuturesTimeoutError:
        return f"Error: Tool '{name}' timed out after {timeout_seconds}s"
```

This pattern lets your main loop move on, but it doesn't terminate arbitrary Python code that's already stuck in a thread. For untrusted or side-effectful tools, use a separate process or sandbox you can kill cleanly.
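The helper above covers the per-tool timeout. For the total execution timeout, one possible sketch is a small deadline object created at the start of a run and consulted before each step. `Deadline` and its injectable `clock` parameter are hypothetical; the injected clock exists only so the behavior can be demonstrated without sleeping.

```python
import time

class Deadline:
    """Tracks how much wall-clock time a whole agent run has left."""

    def __init__(self, total_seconds: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + total_seconds

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def expired(self) -> bool:
        return self.remaining() == 0.0

    def tool_timeout(self, per_tool_seconds: float) -> float:
        """Never let one tool wait longer than the time left for the run."""
        return min(per_tool_seconds, self.remaining())

# Demonstration with a fake clock instead of real waiting.
fake_now = [0.0]
deadline = Deadline(60, clock=lambda: fake_now[0])

fake_now[0] = 45.0
print(deadline.tool_timeout(30))  # 15.0

fake_now[0] = 61.0
print(deadline.expired())  # True
```

In the loop, the idea would be to check `deadline.expired()` at the top of each iteration and pass `deadline.tool_timeout(30)` as the per-tool timeout, so a slow tool near the end of the budget can't push the run far past it.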

Human-in-the-loop

For high-stakes actions (deleting files, sending emails, making payments), require human approval before execution. You can intercept specific tool calls and pause the process to request explicit confirmation. The updated function checks the requested tool against a restricted set. It pauses to prompt the human user and proceeds only if explicit permission is granted.

human-in-the-loop.py
```python
DANGEROUS_TOOLS = {"delete_file", "send_email", "execute_sql"}

def execute_tool_safe(name: str, args: dict) -> str:
    if name in DANGEROUS_TOOLS:
        print(f"\nApproval required: agent wants to call {name}({args})")
        approval = input("Approve? (y/n): ")
        if approval.lower() != "y":
            return "Action was rejected by the user."

    return execute_tool_with_timeout(name, args)
```
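Calling `input()` works in a terminal but not in a web service or a test suite. A hedged variation is to pass the approval decision in as a callback, so the same gate can be backed by a CLI prompt, a ticket queue, or a test stub. `make_gated_executor`, `approve`, and `fake_execute` are hypothetical names used only in this sketch.

```python
from typing import Callable

DANGEROUS_TOOLS = {"delete_file", "send_email", "execute_sql"}

def make_gated_executor(
    execute: Callable[[str, dict], str],
    approve: Callable[[str, dict], bool],
) -> Callable[[str, dict], str]:
    """Wrap an executor so dangerous tools run only when approve() returns True."""
    def gated(name: str, args: dict) -> str:
        if name in DANGEROUS_TOOLS and not approve(name, args):
            return "Action was rejected by the user."
        return execute(name, args)
    return gated

def fake_execute(name: str, args: dict) -> str:
    """Stand-in for the real tool executor."""
    return f"ran {name}"

# A stub policy that rejects every dangerous action.
deny_all = make_gated_executor(fake_execute, approve=lambda name, args: False)

print(deny_all("send_email", {"to": "a@example.com"}))     # Action was rejected by the user.
print(deny_all("get_order_status", {"order_id": "12345"}))  # ran get_order_status
```

The rejection message is returned as a normal tool result, so the model can explain to the user that the action was not carried out.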

The LeetLLM lesson on Human-in-the-Loop Agent Architecture covers escalation policies, approval workflows, and how to design agent UX that builds user trust.

Evaluation

A final-answer check isn't enough for agents. Two very different trajectories can produce the same answer, and a lucky answer can hide terrible tool use. Evaluate both the outcome and the path: did the agent choose the right tool, stop at the right time, and recover correctly when something failed?

For general-purpose agents, benchmarks like GAIA measure multi-step reasoning and tool use.[5] For coding agents, SWE-bench measures whether the system can resolve real repository tasks end to end.[6] If you need scalable review, an LLM judge can score trajectories or outputs, but sample with humans too because judge models have blind spots of their own.[7]
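To make the outcome-versus-path distinction concrete, here is a small sketch that scores a finished message history on both. It assumes the Chat Completions message shape used throughout this article, where assistant messages carry a `tool_calls` list. `extract_tool_calls` and `score_trajectory` are hypothetical helpers, and exact sequence matching is a deliberately strict rule that a real evaluation suite would likely relax.

```python
def extract_tool_calls(messages: list[dict]) -> list[str]:
    """Return the tool names requested across a transcript, in order."""
    names = []
    for message in messages:
        for call in message.get("tool_calls") or []:
            names.append(call["function"]["name"])
    return names

def score_trajectory(
    messages: list[dict],
    expected_tools: list[str],
    expected_substring: str,
) -> dict:
    """Score the final answer and the tool path separately."""
    called = extract_tool_calls(messages)
    final_answer = messages[-1].get("content") or ""
    return {
        "answer_ok": expected_substring in final_answer,
        "tools_ok": called == expected_tools,
        "extra_calls": max(0, len(called) - len(expected_tools)),
    }

# A "lucky" run: the answer is right, but no tool was ever called.
lucky_run = [
    {"role": "user", "content": "Can I still return order 12345?"},
    {"role": "assistant", "content": "No, the return window has expired."},
]

expected = ["get_order_status", "read_file", "get_current_date"]
print(score_trajectory(lucky_run, expected, "expired"))
```

An outcome-only check would pass this run. The path check flags it, which is the case the paragraph above warns about.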

The complete agent

Putting it all together, here is a production-oriented scaffold. This version still fits in roughly 200 lines of actual logic, but now it wires together the guardrails from earlier sections: bounded loops, short-term memory, cost tracking, JSON validation, and timeout-aware tool execution.

By encapsulating the agent logic within a class, we can maintain state across multiple conversational turns. The class constructor initializes the agent with a system prompt, a predefined set of tools, and constraints on iterations and budget.

The run method orchestrates the lifecycle for a given user message. It takes the user's input, enters the iterative decision loop, safely executes any requested tools, appends the results to the context, and returns a final answer.

the-complete-agent.py
```python
class Agent:
    """A production-oriented agent with guardrails and short-term memory."""

    def __init__(
        self,
        system_prompt: str,
        tools: list[dict],
        model: str = "gpt-4.1",
        max_iterations: int = 10,
        max_cost_usd: float = 1.0,
        input_cost_per_million: float = 0.0,
        output_cost_per_million: float = 0.0,
    ):
        """Initialize the agent with constraints and available tools."""
        self.system_prompt = system_prompt
        self.tools = tools
        self.model = model
        self.max_iterations = max_iterations
        self.cost_tracker = CostTracker(
            max_cost_usd,
            input_cost_per_million,
            output_cost_per_million,
        )
        self.messages = [{"role": "system", "content": system_prompt}]

    def _call_llm(self):
        """Call the model with the current conversation state."""
        return client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            tools=self.tools,
        )

    def run(self, user_message: str) -> str:
        """Run the agent loop for a given user message."""
        self.messages.append({"role": "user", "content": user_message})

        for _ in range(self.max_iterations):
            response = self._call_llm()
            try:
                self.cost_tracker.track(response)
            except CostLimitExceeded as e:
                return str(e)

            choice = response.choices[0]

            if choice.finish_reason == "stop":
                self.messages.append(choice.message.model_dump(exclude_none=True))
                return choice.message.content or ""

            if choice.finish_reason != "tool_calls":
                return f"Agent stopped unexpectedly: {choice.finish_reason}"

            self.messages.append(choice.message.model_dump(exclude_none=True))

            for tool_call in choice.message.tool_calls or []:
                try:
                    args = json.loads(tool_call.function.arguments)
                except json.JSONDecodeError as e:
                    result = (
                        f"Error parsing arguments: {e}. "
                        "Please try again with valid JSON."
                    )
                else:
                    result = execute_tool_safe(
                        tool_call.function.name, args
                    )

                self.messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": truncate_result(result),
                })

        return "I reached the maximum number of steps. Here's what I found so far..."
```

This is enough to back a real CLI or HTTP service. In a deployed system you would still add persistence, auth, retries with backoff, and structured traces around this loop.
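Of those additions, retries with backoff are the easiest to sketch in isolation. The helper below is a minimal illustration, not part of the agent above: `TransientError` and `call_with_backoff` are hypothetical names, and in a real service you would map your client's rate-limit and timeout exceptions onto whatever you choose to retry.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (rate limits, timeouts)."""


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying TransientError with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles on each failed attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries from concurrent agents.
            sleep(delay + random.uniform(0, delay / 2))


# Demo: a stand-in for a model call that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky, sleep=waits.append)
print(result, calls["n"], len(waits))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. In the `Agent` class, the natural place for a wrapper like this would be around `_call_llm`.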

Try it yourself

The best way to solidify this is to build something small and see where it breaks.

Exercise: Extend the order-support agent with a new tool called check_return_eligibility(order_id). This tool does four things:

  1. Call get_order_status to find the delivery date.
  2. Call read_file to read the billing policy from policy.txt and extract the return window (for example, 30 days).
  3. Call get_current_date to check today's date.
  4. Return either "Eligible for return" or "Return window expired on YYYY-MM-DD".

Solution sketch

You don't need to change the LLM loop at all. A simple approach is to implement check_return_eligibility as a regular Python function that calls the other tools directly, then register it with the @tool decorator. The agent will then be able to use this compound tool in one shot, or you can let the LLM orchestrate the three individual calls itself and observe which approach produces cleaner behavior.

If you let the LLM orchestrate the three calls, watch for these behaviors:

  • Does it call get_order_status first, or does it try to read the policy before it has the delivery date?
  • Does it correctly parse the ISO date strings and compare them?
  • Does it stop after one eligibility check, or does it loop redundantly?
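The last question, redundant looping, is easy to check mechanically once a run has finished. Here is a small sketch that scans a message history for tool calls repeated with the same arguments. It assumes the messages have the dictionary shape that `model_dump` produces for Chat Completions responses, and `find_repeated_tool_calls` is a hypothetical helper name.

```python
import json
from collections import Counter


def find_repeated_tool_calls(messages):
    """Return {(tool_name, canonical_args): count} for calls made more than once."""
    seen = Counter()
    for message in messages:
        if message.get("role") != "assistant":
            continue
        for call in message.get("tool_calls") or []:
            fn = call["function"]
            try:
                # Canonicalize so key order and whitespace do not hide duplicates.
                args = json.dumps(json.loads(fn["arguments"]), sort_keys=True)
            except json.JSONDecodeError:
                args = fn["arguments"]  # keep malformed arguments as-is
            seen[(fn["name"], args)] += 1
    return {key: count for key, count in seen.items() if count > 1}


# Demo history: the model asks for the same order twice.
history = [
    {"role": "user", "content": "Can I return order 12345?"},
    {"role": "assistant", "tool_calls": [
        {"id": "a", "type": "function",
         "function": {"name": "get_order_status", "arguments": '{"order_id": "12345"}'}},
    ]},
    {"role": "tool", "tool_call_id": "a", "content": "{}"},
    {"role": "assistant", "tool_calls": [
        {"id": "b", "type": "function",
         "function": {"name": "get_order_status", "arguments": '{ "order_id":"12345" }'}},
        {"id": "c", "type": "function",
         "function": {"name": "get_current_date", "arguments": "{}"}},
    ]},
]

repeats = find_repeated_tool_calls(history)
print(repeats)
```

A non-empty result doesn't always mean a bug, since some tools are worth calling again, but it is a cheap signal to log while you experiment.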

Before you hand this to the LLM, test the business rule without a model. That separates "my Python is wrong" from "the model chose the wrong tool path."

solution-sketch.py

```python
from datetime import date, datetime, timedelta
import json
import re

# In-memory stand-ins for the order database and the policy file.
ORDER_DB = {
    "12345": {"total": 89.99, "items": 3, "delivered": "2025-04-10"},
    "67890": {"total": 149.50, "items": 1, "delivered": "2025-05-01"},
}

POLICY_TEXT = "Returns accepted within 30 days of delivery."

def get_order_status(order_id: str) -> str:
    order = ORDER_DB.get(order_id)
    if order is None:
        return f"Error: Order {order_id} not found."
    return json.dumps(order)

def read_file(path: str) -> str:
    if path != "policy.txt":
        return f"Error: File not found: {path}"
    return POLICY_TEXT

def get_current_date() -> str:
    # Fixed date so the example output is reproducible.
    return date(2025, 5, 11).isoformat()

def extract_return_window_days(policy_text: str) -> int:
    match = re.search(r"within (\d+) days", policy_text)
    if match is None:
        raise ValueError("Policy does not include a return window.")
    return int(match.group(1))

def check_return_eligibility(order_id: str, today: date | None = None) -> str:
    status = get_order_status(order_id)
    if status.startswith("Error:"):
        # Pass lookup failures through instead of crashing in json.loads.
        return status
    order = json.loads(status)
    delivered = datetime.fromisoformat(order["delivered"]).date()
    window_days = extract_return_window_days(read_file("policy.txt"))
    current_day = today or datetime.fromisoformat(get_current_date()).date()
    expires = delivered + timedelta(days=window_days)

    # The last day of the window still counts as eligible.
    if current_day <= expires:
        return f"Eligible for return until {expires.isoformat()}"
    return f"Return window expired on {expires.isoformat()}"

expired = check_return_eligibility("12345")
eligible = check_return_eligibility("67890")

print(expired)
print(eligible)
print("expired_check:", expired == "Return window expired on 2025-05-10")
print("eligible_check:", eligible == "Eligible for return until 2025-05-31")
```

Output

```
Return window expired on 2025-05-10
Eligible for return until 2025-05-31
expired_check: True
eligible_check: True
```

Expected output for a first attempt

A beginner agent often returns a verbose explanation instead of a concise yes/no. That's fine. The goal is to see the tool calls line up with the reasoning.

Key takeaways

After building this from scratch, these are the practical lessons:

1. Tool descriptions carry routing policy

The prompt and model matter, but tool descriptions are routing policy. A bad tool description causes the model to pick the wrong tool or pass wrong arguments, and then the entire agent goes off the rails. Invest time here.
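To make that concrete, here is the same tool described two ways in the Chat Completions `tools` format. The description wording is illustrative only. The returned fields mirror the order records used in the exercise above, and the billing-dispute rule is an assumption added to show how a description can steer the model away from a tool.

```python
# Vague: the model has to guess when this applies and what the argument means.
vague_tool = {
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Gets order info.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}

# Specific: says what comes back, when to use it, and when not to.
specific_tool = {
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": (
            "Look up one order by its ID and return its total, item count, "
            "and delivery date as JSON. Use this before answering any question "
            "about delivery or returns. Do not use it for billing disputes."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "Order ID as a string of digits, for example '12345'.",
                },
            },
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}

tools = [specific_tool]  # this list is what gets passed as the tools argument
```

Both schemas are valid. Only the second one tells the model what problem the tool solves, which is the part that drives tool selection.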

2. Start with one tool and add more gradually

Starting with 1-2 tools and adding more as needed usually produces cleaner behavior than starting with a large tool catalog. When you have too many tools, the model struggles with selection.

The agent loop is small. The hard engineering lives around it. The core loop (call the model, execute a tool, repeat) is the easy part. Cost controls, timeouts, error handling, observability, and human approval are where production reliability comes from.

3. Smaller models need more guardrails

Smaller or less tool-tuned models need stricter schemas, more examples in tool descriptions, and tighter output parsing. If you're building with open-weight models, budget extra time for tool reliability.

4. Have an escape hatch

Max iterations, cost caps, and timeouts aren't optional. An agent without limits is a liability. Set conservative limits first and relax them based on observed behavior.
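The agent class above covers max iterations and cost. A wall-clock limit can be added in the same spirit. This is a minimal sketch with a hypothetical `RunBudget` helper. It only checks between steps, so a single hung request still needs its own timeout on the HTTP client.

```python
import time


class RunBudget:
    """Tracks step count and wall-clock time for one agent run."""

    def __init__(self, max_steps=10, max_seconds=30.0, clock=time.monotonic):
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self._clock = clock  # injectable so tests can use a fake clock
        self._start = clock()
        self.steps = 0

    def check(self):
        """Return a stop reason, or None if the run may take another step."""
        if self.steps >= self.max_steps:
            return "max_steps"
        if self._clock() - self._start >= self.max_seconds:
            return "timeout"
        return None

    def step(self):
        """Record that one loop iteration has completed."""
        self.steps += 1


# Demo with a fake clock: pretend each model call plus tool call takes 4 seconds.
fake_now = [0.0]
budget = RunBudget(max_steps=5, max_seconds=10.0, clock=lambda: fake_now[0])

while (reason := budget.check()) is None:
    budget.step()
    fake_now[0] += 4.0

print(budget.steps, reason)
```

Returning a reason string rather than raising keeps the loop's exit paths explicit, which makes it easier to tell the user why the agent stopped.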

Now that you understand the core agent loop, MCP and Tool Protocol Standards will make more sense because MCP standardizes how tools, resources, and prompts are exposed to agents. Structured Output and Constrained Generation is the natural next concept for making tool argument parsing more reliable.

Going further

What we built here is a single-agent ReAct loop. This is a common baseline architecture, and it works well for straightforward multi-step tasks where each observation can guide the next action. Production agent systems often use more structured orchestration patterns for complex workloads.[8]

For example, when a task requires rigorous planning or extensive coordination, you might encounter these patterns and standards:

  • Plan-and-Execute: The agent first creates a plan (a list of steps), then executes each step. Often a better fit for longer tasks whose steps can be laid out before any tool runs.
  • Multi-Agent Systems: Multiple specialized agents that coordinate. One researches, another writes, a third reviews.
  • DAG-Based Orchestration: Complex workflows modeled as directed acyclic graphs with conditional branching.
  • Tool Protocol Standards: MCP (Model Context Protocol)[9] is an open standard for exposing tools, resources, and prompts through a common client-server interface, which makes integrations more portable.
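To see how the first pattern differs from the loop in this article, here is a toy Plan-and-Execute sketch. The planner is a hard-coded stand-in for what would normally be a model call, and the tools are stubs, so treat every name here as hypothetical.

```python
def plan(goal: str) -> list[dict]:
    """Stand-in for a planner LLM call: returns an ordered list of tool steps."""
    return [
        {"tool": "get_order_status", "args": {"order_id": "12345"}},
        {"tool": "get_current_date", "args": {}},
    ]


# Stub tools that return canned observations.
TOOLS = {
    "get_order_status": lambda order_id: f"order {order_id}: delivered 2025-04-10",
    "get_current_date": lambda: "2025-05-11",
}


def execute(steps: list[dict]) -> list[tuple[str, str]]:
    """Run every planned step in order and collect the observations."""
    observations = []
    for step in steps:
        result = TOOLS[step["tool"]](**step["args"])
        observations.append((step["tool"], result))
    return observations


observations = execute(plan("Can order 12345 be returned?"))
for tool_name, result in observations:
    print(tool_name, "->", result)
```

The structural difference from the ReAct loop is where the model sits: here it is consulted once up front, while the loop built in this article consults it after every observation.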

These advanced architectures might seem intimidating at first, but they all build on the same foundation: an LLM, a set of tools, and an iterative loop. Once you understand this core loop, the rest is mostly about arranging the same building blocks into different configurations.

You now have a working mental model of that foundation. The next articles in the LeetLLM path go deeper: ReAct and Plan-and-Execute architectures shows when to switch from a simple loop to a planner, Function Calling and Tool Use dissects how providers structure tool schemas and enforce argument shapes, and Agent Failure States and Recovery catalogs the full set of production failure modes and tested recovery strategies.

References

[1] Anthropic. Building Effective Agents. 2024.

[2] OpenAI. Function calling. OpenAI API Docs, 2026.

[3] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. 2022. ICLR 2023.

[4] OpenAI. Structured outputs. 2024.

[5] Mialon, G., et al. GAIA: a Benchmark for General AI Assistants. 2023. ICLR 2024.

[6] Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? 2024. ICLR 2024.

[7] Zhu, K., et al. JudgeLM: Fine-tuned Large Language Models are Scalable Judges. 2024.

[8] Masterman, T., et al. The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling. 2024. arXiv preprint.

[9] Anthropic. Introducing the Model Context Protocol. 2024.