Build a working AI agent from the raw loop: define tools, let the model choose one, execute it in Python, append the observation, and add the guardrails that keep agents reliable.

Imagine a support assistant with access to an order database, a billing API, an audit log, and a policy archive. You don't have to spell out every database query in advance. You give it a goal, it decides which tool to use next, your code executes that tool, and the result becomes context for the next decision. That's the core of an AI agent.
Frameworks hide this loop behind helpful abstractions. This article strips those abstractions away. We'll build the loop directly, wire real tool calls, hit the first failure modes, and end with a production-oriented scaffold you can compare against LangGraph, CrewAI, or an OpenAI Agents SDK project later.
Before we start, you need to be comfortable writing Python, making HTTP requests, and reading JSON. If you've ever written a script that calls an API and parses the response, you have enough background. We will use the OpenAI Python client, but the same loop works with any model that supports tool calling.
By the end, you can explain the agent loop in plain English, implement a small tool-using agent, and recognize why production agents need limits, logging, memory, and human approval around the core.
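Strip away framework names and an AI agent is small in concept. It's made of three parts: an LLM that decides what should happen next, tools that do the work, and a loop that feeds each tool result back to the LLM. The table below shows what that loop adds over a single LLM call.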
| Capability | Single LLM Call | Agent Loop |
|---|---|---|
| Multi-step reasoning | Limited to one response | Iterates until completion |
| External actions | No direct tool execution | Calls APIs, files, and services |
| Error recovery | User must retry manually | Can retry or switch strategy |
| State handling | Context only in one prompt | Maintains a running message/tool history |
That's it. The LLM reads a prompt, decides whether it needs to use a tool, calls it, reads the result, and decides again. This cycle repeats until the LLM produces a final answer.
The important boundary is this: the LLM doesn't "do" anything itself. It decides what should happen next. The tools do the work. The loop connects decisions to actions to observations to more decisions.
Tools like LangGraph, CrewAI, and OpenAI's tool-calling APIs all implement this same loop with different abstractions. Understanding the raw loop makes every framework easier to learn because you can see which part of the loop the framework is managing for you.
One caution before you reach for the loop at all. Anthropic's guidance is to find the simplest solution first and only add complexity when it earns its keep, which sometimes means not building an agent.[1] A workflow with fixed code paths is more predictable and cheaper than a model that decides its own steps. Reach for an open-ended loop when the task genuinely needs the model to choose what to do next based on what it observes. The rest of this article builds that loop so you understand it, not so you reach for it by default.
Before we look at code, let's walk through a concrete scenario. Suppose a customer asks: "Can I still return order 12345?"
Our agent has three tools:
- get_order_status(order_id) looks up an order in the database
- read_file(path) reads the billing policy from a file
- get_current_date() returns today's date
Here is how the loop plays out:
1. The LLM requests get_order_status with {"order_id": "12345"}. Our code returns {"total": 89.99, "items": 3, "delivered": "2025-04-10"}.
2. The LLM requests read_file with {"path": "policy.txt"}. Our code returns the policy text.
3. The LLM requests get_current_date with {}. Our code returns "2025-05-11".
4. With all three observations in context, the LLM writes the final answer.
Notice what happened. The LLM never touched the database or the file system directly. It emitted structured requests. Our Python code executed those requests, appended the results to the conversation, and called the LLM again. The LLM selected steps and wrote text. The code handled the real work.
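The trace above can be replayed without any model. The following is a minimal sketch in which a hypothetical `scripted_model` function stands in for the LLM and returns the three tool requests in order. The tool bodies, the policy string, and the final answer text are illustrative assumptions, not output from a real model.

```python
import json

# Stand-in tools; real versions would query a database or the file system.
def get_order_status(order_id):
    return json.dumps({"total": 89.99, "items": 3, "delivered": "2025-04-10"})

def read_file(path):
    return "Returns accepted within 30 days of delivery."  # assumed policy text

def get_current_date():
    return "2025-05-11"

TOOLS = {f.__name__: f for f in (get_order_status, read_file, get_current_date)}

# Hypothetical script: the decisions a model might make, in order.
SCRIPT = [
    ("get_order_status", {"order_id": "12345"}),
    ("read_file", {"path": "policy.txt"}),
    ("get_current_date", {}),
]

def scripted_model(history):
    """Return the next tool request, or a final answer once the script is done."""
    step = sum(1 for m in history if m["role"] == "tool")
    if step < len(SCRIPT):
        name, args = SCRIPT[step]
        return {"tool": name, "args": args}
    return {"final": "The 30-day return window for order 12345 has closed."}

def run(question):
    history = [{"role": "user", "content": question}]
    while True:
        decision = scripted_model(history)                    # 1. the "model" decides
        if "final" in decision:
            return decision["final"], history
        result = TOOLS[decision["tool"]](**decision["args"])  # 2. our code acts
        history.append(                                       # 3. the observation is appended
            {"role": "tool", "name": decision["tool"], "content": result}
        )

answer, history = run("Can I still return order 12345?")
for message in history:
    print(message)
print(answer)
```

Swapping `scripted_model` for a real model call is the only change the later sections make to this control flow.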
Let's start with the smallest thing that could possibly work. We'll build an agent that can use exactly one tool: a calculator. In our support scenario, this might compute a shipping surcharge or a prorated refund. By constraining the environment to a single tool, we can focus entirely on the core loop that drives the agent's decision-making process.
To make this work, we provide the LLM with a strict definition of what the tool is, what it does, and what inputs it expects. We do this by passing a JSON schema that describes our function. When the model determines that a calculation is necessary, it generates a structured request matching that schema rather than only writing plain text.[2]
The code below sets up the OpenAI Python client and defines a calculator tool. The example uses Chat Completions because the raw message loop is easy to inspect directly: every assistant message, tool call, and observation is a plain dict you append to a list. OpenAI now recommends the newer Responses API for tool-calling and multi-turn projects, and it manages more of this state for you, but the control flow is identical. Define tools, let the model choose, execute the call, append the result, and continue. To run it locally, install the openai package, set OPENAI_API_KEY, and use a tool-capable model your account can access.[2]

```python
from openai import OpenAI
import ast
import json
import operator

client = OpenAI()

ALLOWED_BINARY_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}

ALLOWED_UNARY_OPERATORS = {
    ast.UAdd: operator.pos,
    ast.USub: operator.neg,
}

def evaluate_node(node: ast.AST) -> int | float:
    # Plain numeric literals only (bool and str are rejected).
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value

    if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED_BINARY_OPERATORS:
        left = evaluate_node(node.left)
        right = evaluate_node(node.right)
        return ALLOWED_BINARY_OPERATORS[type(node.op)](left, right)

    if isinstance(node, ast.UnaryOp) and type(node.op) in ALLOWED_UNARY_OPERATORS:
        return ALLOWED_UNARY_OPERATORS[type(node.op)](evaluate_node(node.operand))

    # Anything else (names, calls, attributes) is refused.
    raise ValueError("Only numeric arithmetic is supported.")

def safe_calculate(expression: str) -> str:
    if len(expression) > 120:
        raise ValueError("Expression too long.")
    tree = ast.parse(expression, mode="eval")
    result = evaluate_node(tree.body)
    return str(result)

# Define a tool
tools = [
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a math expression and return the result",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "A Python math expression, e.g. '2 + 3 * 4'"
                    }
                },
                "required": ["expression"],
                "additionalProperties": False
            }
        }
    }
]

def calculate(expression: str) -> str:
    """Evaluate a limited arithmetic expression."""
    try:
        return safe_calculate(expression)
    except (SyntaxError, ValueError, ZeroDivisionError) as e:
        return f"Error: {e}"

# The agent loop
def run_agent(user_message: str) -> str:
    """Run the agent loop until the model returns a final answer."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant. "
                                      "Use the calculate tool when you need to do math."},
        {"role": "user", "content": user_message}
    ]

    while True:
        response = client.chat.completions.create(
            model="gpt-4.1",  # or another tool-capable model you can access
            messages=messages,
            tools=tools,
        )

        choice = response.choices[0]

        # If the model wants to call a tool
        if choice.finish_reason == "tool_calls":
            assistant_message = choice.message.model_dump(exclude_none=True)
            messages.append(assistant_message)
            for tool_call in choice.message.tool_calls:
                args = json.loads(tool_call.function.arguments)
                result = calculate(args["expression"])

                # Add the tool result
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result,
                })
        elif choice.finish_reason == "stop":
            # Model is done, return the final answer
            return choice.message.content
        else:
            return f"Agent stopped unexpectedly: {choice.finish_reason}"

# Test it
answer = run_agent("What is 42 * 17 + 389?")
print(answer)
```

The exact sentence depends on the model, but the answer should include 1103.
This is already a working agent loop. The while True block repeats model calls. The if choice.finish_reason == "tool_calls" branch is where the model asks to act. Your code parses the arguments, executes the tool, appends the observation, and calls the model again.
The calculator uses an AST whitelist instead of eval(). That matters even in a tutorial. Readers copy code, and tool inputs are model-generated strings. Treat those strings as untrusted unless a sandbox, parser, or allowlist proves otherwise.
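To see why the allowlist matters, the sketch below re-implements a compact version of the evaluator and feeds it an expression that eval() would execute. The name `calculate_compact` is hypothetical, and supporting only four operators is an assumption made to keep the example short.

```python
import ast
import operator

# Reduced operator allowlist (an assumption for brevity).
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def evaluate(node):
    """Walk the parsed tree, allowing only numbers and allowlisted operators."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    raise ValueError("Only numeric arithmetic is supported.")

def calculate_compact(expression):
    """Return the result as a string, or an error string the model can read."""
    try:
        return str(evaluate(ast.parse(expression, mode="eval").body))
    except (SyntaxError, ValueError, ZeroDivisionError) as e:
        return f"Error: {e}"

print(calculate_compact("42 * 17 + 389"))
# A function call parses fine but is not a number or an allowlisted operator.
print(calculate_compact("__import__('os').getcwd()"))
```

The second call never runs any code from the string. The parser produces a call node, the walker does not recognize it, and the tool returns an error the model can react to.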
The mistake to avoid is stopping here. A raw loop without error handling, timeouts, cost tracking, loop limits, structured logging, and human approval is a demo, not a production system. We'll add those layers after the core pattern is clear.
A calculator is fun, but a useful agent needs tools that interact with the world. Let's add three more tools that fit our support scenario: an order lookup, a policy reader, and a date helper. The following snippets show how to define these operations. Each tool accepts simple string arguments and returns string results, expanding what our agent can accomplish.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Tool registry: maps tool names to functions
TOOL_REGISTRY = {}

def tool(func):
    """Decorator that registers a function as a tool."""
    TOOL_REGISTRY[func.__name__] = func
    return func

@tool
def calculate(expression: str) -> str:
    """Evaluate a limited arithmetic expression and return the result."""
    try:
        return safe_calculate(expression)  # defined in the calculator example
    except (SyntaxError, ValueError, ZeroDivisionError) as e:
        return f"Error: {e}"

# Mock order database
ORDER_DB = {
    "12345": {"total": 89.99, "items": 3, "delivered": "2025-04-10"},
    "67890": {"total": 149.50, "items": 1, "delivered": "2025-05-01"},
}

@tool
def get_order_status(order_id: str) -> str:
    """Look up an order by ID and return its total, item count, and delivery date."""
    order = ORDER_DB.get(order_id)
    if not order:
        return f"Error: Order {order_id} not found."
    return json.dumps(order)

POLICY_DIR = Path("policies").resolve()

@tool
def read_file(path: str) -> str:
    """Read a policy file from the approved policy directory."""
    try:
        target = (POLICY_DIR / path).resolve()
        # relative_to raises ValueError if target escapes POLICY_DIR.
        target.relative_to(POLICY_DIR)
        content = target.read_text(encoding="utf-8")
        if len(content) > 3000:
            return content[:3000] + "\n... (truncated)"
        return content
    except FileNotFoundError:
        return f"Error: File not found: {path}"
    except ValueError:
        return f"Error: Path outside policy directory: {path}"

@tool
def get_current_date() -> str:
    """Get the current UTC date in ISO 8601 format."""
    return datetime.now(timezone.utc).date().isoformat()
```

The @tool decorator is a pattern you'll see in many agent frameworks. It registers tools in a lookup table so the agent loop can find the right function when the model asks for it.
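The registry gives the loop a function to call, but the model still needs a schema for each tool. One way to keep the two in sync is to derive the schema from the function itself. The sketch below uses a hypothetical helper, `build_tool_schema`, and assumes every parameter is a required string, which matches the tools in this article but not tools in general.

```python
import inspect

def build_tool_schema(func):
    """Build a Chat Completions style tool definition from a function.

    Assumption: every parameter is a required string argument.
    """
    params = inspect.signature(func).parameters
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            # The docstring doubles as the tool description.
            "description": inspect.getdoc(func) or "",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {name: {"type": "string"} for name in params},
                "required": list(params),
                "additionalProperties": False,
            },
        },
    }

def get_order_status(order_id: str) -> str:
    """Look up an order by ID and return its total, item count, and delivery date."""
    return "{}"  # body omitted; only the signature and docstring matter here

schema = build_tool_schema(get_order_status)
print(schema["function"]["name"], schema["function"]["parameters"]["required"])
```

Generating the tools list with something like `[build_tool_schema(f) for f in TOOL_REGISTRY.values()]` would mean a tool cannot be registered without also being described. Per-argument descriptions would still need to be added by hand.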
Now the agent loop becomes generic. We implement a dispatcher function that takes the tool's name and its arguments, and dynamically executes it. If the requested tool is missing from our registry, it gracefully returns an error string.

```python
def execute_tool(name: str, args: dict) -> str:
    """Look up and execute a tool by name."""
    if name not in TOOL_REGISTRY:
        return f"Error: Unknown tool '{name}'"
    try:
        return TOOL_REGISTRY[name](**args)
    except TypeError as e:
        # Raised when the model passes missing or unexpected arguments.
        return f"Error: Bad arguments for '{name}': {e}"
```

What we've built so far follows the same outer loop popularized by ReAct (Reasoning + Acting)[3]. In the original paper, the model emits explicit reasoning traces alongside actions. In production systems that reasoning may be hidden, summarized, or replaced with planner state, but the outer loop is the same.
Each iteration of the loop has three phases: the model reasons about what to do next (thought), it requests a tool call (action), and your code returns the result for the next turn (observation).
Conceptually, the agent's next step might be: "I need to search for both values, then calculate the product." The important part is the loop: each tool result feeds back into the next decision.
For a deeper architecture comparison, the LeetLLM lesson on ReAct, Plan-and-Execute, and other agentic architectures covers when ReAct works, when it doesn't, and what alternatives exist for complex multi-step tasks.
Here's where things get interesting. Once you move beyond calculator demos, five failure modes show up quickly. These aren't edge cases. They're the problems you hit first when a tool-using loop meets messy real-world tasks.
The first failure is a loop that never ends: the agent calls the same tool with the same arguments, gets the same result, and tries again without making progress.
This happens when the model doesn't recognize that the tool result already answered the question, or when it doesn't know how to recover from an error.
The fix is a maximum iteration count, which stops the loop after a set number of steps and bounds token cost. We pass the user message into our loop and enforce a strict loop boundary. If the task is incomplete after these iterations, we return a fallback string instead of waiting indefinitely.
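
```python
MAX_ITERATIONS = 10

def run_agent(user_message: str) -> str:
    """Run the agent loop, giving up after MAX_ITERATIONS model calls."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_message},
    ]

    for iteration in range(MAX_ITERATIONS):
        response = client.chat.completions.create(
            model="gpt-4.1", messages=messages, tools=tools
        )
        choice = response.choices[0]

        if choice.finish_reason != "tool_calls":
            return choice.message.content or ""

        # Record the request, then answer every tool call before looping.
        messages.append(choice.message.model_dump(exclude_none=True))
        for tool_call in choice.message.tool_calls:
            args = json.loads(tool_call.function.arguments)
            result = execute_tool(tool_call.function.name, args)
            messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})

    # The loop ended without a final answer.
    return "I wasn't able to complete this task within the step limit."
```

The second failure is a hallucinated tool: the model invents a tool that doesn't exist ("I'll use the query_database tool to...") and the loop crashes when the lookup fails.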
This happens when the tool list given to the model doesn't match the model's expectations, or when the model reaches for tools it has seen in training data.
To fix it, validate tool names before execution and return a helpful error to the model so it can recognize its mistake and recover. The updated lookup function below verifies the incoming tool name. When an invalid tool is requested, it returns an error message containing the valid options, enabling the agent to self-correct.
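
```python
def execute_tool(name: str, args: dict) -> str:
    if name not in TOOL_REGISTRY:
        # List the real tools so the model can pick a valid one next turn.
        return (f"Error: Tool '{name}' doesn't exist. "
                f"Available tools: {list(TOOL_REGISTRY.keys())}")
    try:
        return TOOL_REGISTRY[name](**args)
    except TypeError as e:
        # Keep the bad-argument handling from the first version.
        return f"Error: Bad arguments for '{name}': {e}"
```

The third failure is context overflow: after enough tool calls, the conversation history exceeds the model's context window. The agent either crashes or starts losing the original instructions.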
Each tool result adds tokens. A policy file might return 500 tokens. After 10 tool calls, you've consumed 5,000+ tokens of context just on tool results.
Summarize or truncate tool results before adding them to context. The following function cuts off long strings to avoid exceeding the maximum token limit. It takes the raw text output from a tool and ensures it stays below a specified character threshold, returning a safer, shortened string.

```python
def truncate_result(result: str, max_chars: int = 2000) -> str:
    """Truncate tool results to prevent context overflow."""
    if len(result) > max_chars:
        return result[:max_chars] + "\n... (truncated)"
    return result
```

The fourth failure is malformed arguments: the model generates invalid JSON for tool arguments. This happens often with smaller models.
LLMs don't always produce valid JSON, particularly under complex argument schemas.
Retry with an error message so the model can attempt to fix its JSON format. Better yet, enable strict structured tool schemas when your provider supports them.[4] By wrapping the JSON parser in a try-except block, we catch decoding failures. We then append the error as a tool response, prompting the model to re-generate valid arguments in the next iteration.
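
```python
# Inside the agent loop, for each requested tool call:
for tool_call in choice.message.tool_calls:
    try:
        args = json.loads(tool_call.function.arguments)
    except json.JSONDecodeError as e:
        # Report the parse failure as this call's result so the model can retry.
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": f"Error parsing arguments: {e}. Please try again with valid JSON."
        })
        continue

    result = execute_tool(tool_call.function.name, args)
    messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})
```

The fifth failure is wrong tool selection: the model picks the wrong tool for the job. It tries to use calculate for a date comparison, or calls read_file for a fact that's already in the conversation history.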
The usual cause is a bad tool description. If the description is vague ("do math stuff"), the model can't decide when to use the tool.
Write precise tool descriptions that include when to use the tool, not just what it does. Here's a comparison between a vague description and an effective one. The updated description clarifies expected inputs and scenarios, which helps the LLM select the correct operation for its current step.

```python
# Bad: says nothing about inputs or when to use the tool
BAD_DESCRIPTION = "Do calculations"

# Good: states what the tool does, when to use it, and what input it expects
GOOD_DESCRIPTION = (
    "Evaluate a Python math expression and return the numeric result. "
    "Use this when you need to perform arithmetic, compute percentages, "
    "or solve math problems. Input must be a valid Python expression."
)
```

The LeetLLM lesson on Agent Failure States, Retries, and Fallback Strategies catalogs more failure modes and recovery strategies. This article focuses on the first five because they appear as soon as you move from a calculator demo to real tools.
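Returning to the first failure mode: an iteration cap stops a stuck agent, but it does not tell the model why it is stuck. A complementary option is to detect identical repeated calls and return a corrective message. The sketch below uses a hypothetical `RepeatGuard` class, and the default threshold of two repeats is an arbitrary assumption.

```python
import json
from collections import Counter

class RepeatGuard:
    """Counts identical (tool name, arguments) requests within one run."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def check(self, name: str, args: dict):
        """Return an error string if this exact call repeated too often, else None."""
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal.
        key = (name, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            return (
                f"Error: '{name}' was already called {self.max_repeats} times "
                "with these arguments. Use the earlier result or try a different step."
            )
        return None

guard = RepeatGuard(max_repeats=2)
for _ in range(3):
    warning = guard.check("get_order_status", {"order_id": "12345"})
print(warning)
```

In the loop, you would call `guard.check(...)` before `execute_tool(...)` and, when it returns a string, append that string as the tool result instead of running the tool again.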
Our agent has a problem: it forgets everything between runs. Each invocation starts fresh. For many use cases that's fine, but conversational and task-continuation agents usually need more than one memory layer. In practice it's useful to separate short-term context, retrieval memory, and durable application state.
| Memory Type | Mechanism | Lifetime | Primary Use Case | Cost Impact |
|---|---|---|---|---|
| Short-term | Appending to messages array | Current session | Context for multi-turn chats | Increases token cost linearly |
| Retrieval memory | Storing embeddings in a vector DB | Persistent across sessions | Recalling relevant past facts, docs, or episodes | Requires separate DB storage/retrieval |
| Durable state | Relational DB or key-value store | Persistent across sessions | Exact user preferences, permissions, workflow state | Requires transactional storage and schema design |
Short-term memory keeps the messages array around between calls instead of rebuilding it each time. Here's how we initialize the agent with a persistent message list. We define a class that stores the system prompt upon creation. The run method then appends the user's input and maintains the ongoing context across multiple questions.
This approach works for short interactions, but it scales linearly in cost and latency. Every new turn forces the model to re-process the entire conversation history, which eventually hits context window limits. For simple agents, however, this in-memory list is the most direct way to give the model a sense of continuity.

```python
class Agent:
    """A conversational agent that maintains short-term memory (context history)."""

    def __init__(self, system_prompt: str):
        """Initialize the agent with a system prompt."""
        self.messages = [
            {"role": "system", "content": system_prompt}
        ]

    def run(self, user_message: str) -> str:
        """Process a user message and run the agent loop."""
        self.messages.append({"role": "user", "content": user_message})

        for iteration in range(MAX_ITERATIONS):
            response = self._call_llm()
            # ... process response ...

        return "I wasn't able to finish within the step limit."
```

For information that persists across sessions, you need a database. A simple retrieval layer stores embeddings and lets the agent fetch semantically related context when it needs it. We can expose this as a tool the agent can decide to call. The remember tool accepts a string fact and inserts its embedding, while the recall tool searches for related memories based on a text query.
By treating memory as another set of tools, the agent can request saves and recalls through the same loop, subject to your product policy. In more advanced setups, a separate process might summarize conversation history and extract these facts without requiring explicit tool calls from the primary reasoning agent. The following sketch shows the remember and recall tool interface.

```python
# db, embed, and now are placeholders for your vector store client,
# embedding function, and clock. They are not defined in this article.

@tool
def remember(fact: str) -> str:
    """Store an important fact for future reference."""
    db.insert(embed(fact), metadata={"fact": fact, "timestamp": now()})
    return f"Remembered: {fact}"

@tool
def recall(query: str) -> str:
    """Search memory for relevant past facts."""
    results = db.search(embed(query), top_k=5)
    return "\n".join(r.metadata["fact"] for r in results)
```

One warning: a vector database is great for fuzzy recall, but it's the wrong source of truth for exact state like account balances, permission flags, or whether an order was already shipped. Keep exact business facts in a relational or key-value store and expose that state through tools.
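As a minimal sketch of that last point, the example below keeps an exact "shipped" flag in SQLite and exposes it through a tool-shaped function. The table layout, the `get_shipping_state` name, and the sample rows are assumptions for illustration. An in-memory database keeps the example self-contained.

```python
import json
import sqlite3

# In-memory database for the demo; use a file path or a real database in practice.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT PRIMARY KEY, shipped INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?)", [("12345", 1), ("67890", 0)])
conn.commit()

def get_shipping_state(order_id: str) -> str:
    """Return the exact shipped flag for an order as a JSON string."""
    # A parameterized query keeps model-generated input out of the SQL text.
    row = conn.execute(
        "SELECT shipped FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    if row is None:
        return f"Error: Order {order_id} not found."
    return json.dumps({"order_id": order_id, "shipped": bool(row[0])})

print(get_shipping_state("12345"))
print(get_shipping_state("00000"))
```

The lookup is an exact key match, so the answer is either the stored value or a clear "not found". A similarity search cannot give that guarantee.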
The LeetLLM lesson on Agent Memory and Persistence Patterns covers the full spectrum, from simple conversation history to episodic memory and working memory architectures.
The 200-line agent works for demos. For production, you need several more layers. Here's a checklist:
LLM calls are expensive. A runaway agent can burn through hundreds of dollars in minutes. To prevent this, implement a cost tracking class that monitors token usage and halts the agent if it exceeds a specified budget. The CostTracker examines the usage metrics from each API response. Because providers expose usage fields and pricing tables a little differently, keep the per-model rates in configuration instead of hardcoding them deep in the loop.

```python
class CostLimitExceeded(RuntimeError):
    pass

class CostTracker:
    """Tracks LLM API usage costs across multiple calls."""

    def __init__(
        self,
        max_cost_usd: float = 1.0,
        input_cost_per_million: float = 0.0,
        output_cost_per_million: float = 0.0,
    ):
        """Initialize tracker with a maximum allowed cost in USD."""
        self.total_cost = 0.0
        self.max_cost = max_cost_usd
        self.input_cost_per_million = input_cost_per_million
        self.output_cost_per_million = output_cost_per_million

    def track(self, response):
        """Calculate and accumulate the cost of a single API response."""
        usage = response.usage
        # Providers name these fields differently; try both spellings.
        input_tokens = getattr(usage, "input_tokens", None)
        if input_tokens is None:
            input_tokens = getattr(usage, "prompt_tokens", 0)

        output_tokens = getattr(usage, "output_tokens", None)
        if output_tokens is None:
            output_tokens = getattr(usage, "completion_tokens", 0)

        cost = (
            input_tokens * self.input_cost_per_million +
            output_tokens * self.output_cost_per_million
        ) / 1_000_000
        self.total_cost += cost

        if self.total_cost > self.max_cost:
            raise CostLimitExceeded(
                f"Agent exceeded cost limit: ${self.total_cost:.4f} > ${self.max_cost}"
            )
```

When an agent fails, you need to know why. Log each loop step to create an audit trail of the agent's actions, arguments, and outcomes. By adding standard Python logging calls to each phase, we capture the chosen tools, their arguments, and the resulting values. This gives you a clean execution trace without pretending you can or should log the model's hidden chain-of-thought.

```python
import logging

logging.basicConfig(level=logging.INFO)  # without this, INFO records are dropped
logger = logging.getLogger("agent")

# Inside run_agent, where messages and tools are already defined:
for iteration in range(MAX_ITERATIONS):
    logger.info("Iteration %d/%d", iteration + 1, MAX_ITERATIONS)

    response = client.chat.completions.create(
        model="gpt-4.1", messages=messages, tools=tools
    )
    choice = response.choices[0]

    if choice.finish_reason == "tool_calls":
        messages.append(choice.message.model_dump(exclude_none=True))
        for tool_call in choice.message.tool_calls:
            logger.info("  Tool: %s", tool_call.function.name)
            logger.info("  Args: %s", tool_call.function.arguments)
            result = execute_tool(
                tool_call.function.name, json.loads(tool_call.function.arguments)
            )
            logger.info("  Result: %s", result[:200])
            messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})
    else:
        # content can be None, so fall back to an empty string before slicing.
        logger.info("  Final answer: %s", (choice.message.content or "")[:200])
        break
```

A tool might hang. A web search might time out. You need per-tool timeouts and a total execution timeout. A portable pattern is to run the tool in a worker and wait with a timeout. This helper function takes the tool's name and arguments alongside a timeout duration. It either returns the successful result or catches the timeout to yield an error string.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeoutError

TOOL_EXECUTOR = ThreadPoolExecutor(max_workers=8)

def execute_tool_with_timeout(name: str, args: dict, timeout_seconds: int = 30) -> str:
    future = TOOL_EXECUTOR.submit(execute_tool, name, args)
    try:
        return future.result(timeout=timeout_seconds)
    except FuturesTimeoutError:
        return f"Error: Tool '{name}' timed out after {timeout_seconds}s"
```

This pattern lets your main loop move on, but it doesn't terminate arbitrary Python code that's already stuck in a thread. For untrusted or side-effectful tools, use a separate process or sandbox you can kill cleanly.
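One way to get a tool you can actually kill is to run it as a child process. The sketch below is an assumption-laden illustration: `run_script_with_timeout` is a hypothetical helper, and passing the tool as a Python source string is only for the demo. It relies on the standard library's `subprocess.run`, which kills the child when its timeout expires. This gives you termination, not a security sandbox.

```python
import subprocess
import sys

def run_script_with_timeout(code: str, timeout_seconds: float = 5.0) -> str:
    """Run Python source in a child process and kill it if it runs too long."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return f"Error: tool timed out after {timeout_seconds}s"
    if completed.returncode != 0:
        return f"Error: tool exited with code {completed.returncode}"
    return completed.stdout.strip()

print(run_script_with_timeout("print(2 + 2)"))
print(run_script_with_timeout("import time; time.sleep(10)", timeout_seconds=0.5))
```

The second call returns after about half a second with an error string, and no sleeping worker is left behind, which is the difference from the thread-based helper.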
For high-stakes actions (deleting files, sending emails, making payments), require human approval before execution. You can intercept specific tool calls and pause the process to request explicit confirmation. The updated function checks the requested tool against a restricted set. It pauses to prompt the human user and proceeds only if explicit permission is granted.

```python
DANGEROUS_TOOLS = {"delete_file", "send_email", "execute_sql"}

def execute_tool_safe(name: str, args: dict) -> str:
    if name in DANGEROUS_TOOLS:
        print(f"\nApproval required: agent wants to call {name}({args})")
        approval = input("Approve? (y/n): ")
        if approval.lower() != "y":
            return "Action was rejected by the user."

    return execute_tool_with_timeout(name, args)
```

The LeetLLM lesson on Human-in-the-Loop Agent Architecture covers escalation policies, approval workflows, and how to design agent UX that builds user trust.
A final-answer check isn't enough for agents. Two very different trajectories can produce the same answer, and a lucky answer can hide terrible tool use. Evaluate both the outcome and the path: did the agent choose the right tool, stop at the right time, and recover correctly when something failed?
For general-purpose agents, benchmarks like GAIA measure multi-step reasoning and tool use.[5] For coding agents, SWE-bench measures whether the system can resolve real repository tasks end to end.[6] If you need scalable review, an LLM judge can score trajectories or outputs, but sample with humans too because judge models have blind spots of their own.[7]
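For your own agent, a small trajectory check is often a useful first step before reaching for a benchmark. The sketch below scores one run on both its tool path and its final answer. The `score_trajectory` function, its three criteria, and the substring check on the answer are all simplifying assumptions.

```python
def score_trajectory(actual_calls, expected_calls, final_answer, must_contain):
    """Score one run on its tool path and its outcome.

    actual_calls and expected_calls are lists of tool names in call order.
    """
    return {
        "path_exact": actual_calls == expected_calls,
        "no_extra_calls": len(actual_calls) <= len(expected_calls),
        "outcome_ok": must_contain.lower() in final_answer.lower(),
    }

expected = ["get_order_status", "read_file", "get_current_date"]
lucky = ["get_current_date"]  # plausible answer, but the order was never looked up

good_run = score_trajectory(expected, expected, "The return window has expired.", "expired")
lucky_run = score_trajectory(lucky, expected, "Sorry, that window expired.", "expired")
print(good_run)
print(lucky_run)
```

Both runs pass the outcome check, but only the first passes the path check. That is the "lucky answer" case a final-answer-only evaluation would miss.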
Putting it all together, here is a production-oriented scaffold. This version still fits in roughly 200 lines of actual logic, but now it wires together the guardrails from earlier sections: bounded loops, short-term memory, cost tracking, JSON validation, and timeout-aware tool execution.
By encapsulating the agent logic within a class, we can maintain state across multiple conversational turns. The class constructor initializes the agent with a system prompt, a predefined set of tools, and constraints on iterations and budget.
The run method orchestrates the lifecycle for a given user message. It takes the user's input, enters the iterative decision loop, safely executes any requested tools, appends the results to the context, and returns a final answer.

```python
class Agent:
    """A production-oriented agent with guardrails and short-term memory."""

    def __init__(
        self,
        system_prompt: str,
        tools: list[dict],
        model: str = "gpt-4.1",
        max_iterations: int = 10,
        max_cost_usd: float = 1.0,
        input_cost_per_million: float = 0.0,
        output_cost_per_million: float = 0.0,
    ):
        """Initialize the agent with constraints and available tools."""
        self.system_prompt = system_prompt
        self.tools = tools
        self.model = model
        self.max_iterations = max_iterations
        self.cost_tracker = CostTracker(
            max_cost_usd,
            input_cost_per_million,
            output_cost_per_million,
        )
        self.messages = [{"role": "system", "content": system_prompt}]

    def _call_llm(self):
        """Call the model with the current conversation state."""
        return client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            tools=self.tools,
        )

    def run(self, user_message: str) -> str:
        """Run the agent loop for a given user message."""
        self.messages.append({"role": "user", "content": user_message})

        for _ in range(self.max_iterations):
            response = self._call_llm()
            try:
                self.cost_tracker.track(response)
            except CostLimitExceeded as e:
                return str(e)

            choice = response.choices[0]

            if choice.finish_reason == "stop":
                self.messages.append(choice.message.model_dump(exclude_none=True))
                return choice.message.content or ""

            if choice.finish_reason != "tool_calls":
                return f"Agent stopped unexpectedly: {choice.finish_reason}"

            self.messages.append(choice.message.model_dump(exclude_none=True))

            for tool_call in choice.message.tool_calls or []:
                try:
                    args = json.loads(tool_call.function.arguments)
                except json.JSONDecodeError as e:
                    result = (
                        f"Error parsing arguments: {e}. "
                        "Please try again with valid JSON."
                    )
                else:
                    result = execute_tool_safe(
                        tool_call.function.name, args
                    )

                self.messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": truncate_result(result),
                })

        return "I reached the maximum number of steps. Here's what I found so far..."
```

This is enough to back a real CLI or HTTP service. In a deployed system you would still add persistence, auth, retries with backoff, and structured traces around this loop.
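Of those additions, retries with backoff are the quickest to sketch. The helper below is a hypothetical wrapper, `call_with_backoff`, that retries a callable on transient errors with exponential delays plus jitter. Which exception types count as transient depends on your client library, so the `retry_on` default here is an assumption, and the `sleep` parameter exists only so the demo can run without waiting.

```python
import random
import time

def call_with_backoff(func, max_attempts=4, base_delay=0.5, sleep=time.sleep,
                      retry_on=(ConnectionError, TimeoutError)):
    """Call func(), retrying on transient errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)

# Demo with a fake call that fails twice, then succeeds.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []  # record the delays instead of sleeping
outcome = call_with_backoff(flaky_call, sleep=delays.append)
print(outcome, len(delays))
```

In the Agent class, the natural place for this would be around the body of `_call_llm`, so a transient network failure does not consume one of the loop's iterations.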
The best way to solidify this is to build something small and see where it breaks.
Exercise: Extend the order-support agent with a new tool called check_return_eligibility(order_id). This tool does four things:
1. Calls get_order_status to find the delivery date.
2. Calls read_file to read the billing policy from policy.txt and extract the return window (for example, 30 days).
3. Calls get_current_date to check today's date.
4. Compares today's date with the end of the return window and reports whether the order is still eligible.
You don't need to change the LLM loop at all. A simple approach is to implement check_return_eligibility as a regular Python function that calls the other tools directly, then register it with the @tool decorator. The agent will then be able to use this compound tool in one shot, or you can let the LLM orchestrate the three individual calls itself and observe which approach produces cleaner behavior.
If you let the LLM orchestrate the three calls, watch for these behaviors:
- Does it call get_order_status first, or does it try to read the policy before it knows the account ID?
Before you hand this to the LLM, test the business rule without a model. That separates "my Python is wrong" from "the model chose the wrong tool path."
```python
from datetime import date, datetime, timedelta
import json
import re

# In-memory stand-ins for the order database and the policy file.
ORDER_DB = {
    "12345": {"total": 89.99, "items": 3, "delivered": "2025-04-10"},
    "67890": {"total": 149.50, "items": 1, "delivered": "2025-05-01"},
}

POLICY_TEXT = "Returns accepted within 30 days of delivery."

def get_order_status(order_id: str) -> str:
    order = ORDER_DB.get(order_id)
    if order is None:
        return f"Error: Order {order_id} not found."
    return json.dumps(order)

def read_file(path: str) -> str:
    if path != "policy.txt":
        return f"Error: File not found: {path}"
    return POLICY_TEXT

def get_current_date() -> str:
    # Fixed date so the example output is reproducible.
    return date(2025, 5, 11).isoformat()

def extract_return_window_days(policy_text: str) -> int:
    match = re.search(r"within (\d+) days", policy_text)
    if match is None:
        raise ValueError("Policy does not include a return window.")
    return int(match.group(1))

def check_return_eligibility(order_id: str, today: date | None = None) -> str:
    status = get_order_status(order_id)
    if status.startswith("Error:"):
        # Pass the tool error through instead of crashing in json.loads.
        return status
    order = json.loads(status)
    delivered = datetime.fromisoformat(order["delivered"]).date()
    window_days = extract_return_window_days(read_file("policy.txt"))
    current_day = today or datetime.fromisoformat(get_current_date()).date()
    expires = delivered + timedelta(days=window_days)

    # The last day of the window still counts as eligible.
    if current_day <= expires:
        return f"Eligible for return until {expires.isoformat()}"
    return f"Return window expired on {expires.isoformat()}"

expired = check_return_eligibility("12345")
eligible = check_return_eligibility("67890")

print(expired)
print(eligible)
print("expired_check:", expired == "Return window expired on 2025-05-10")
print("eligible_check:", eligible == "Eligible for return until 2025-05-31")
```

```
Return window expired on 2025-05-10
Eligible for return until 2025-05-31
expired_check: True
eligible_check: True
```

A beginner agent often returns a verbose explanation instead of a concise yes/no. That's fine. The goal is to see the tool calls line up with the reasoning.
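One way to check that the tool calls line up is to record each tool name as the loop executes it, then assert on that sequence. The sketch below is only an illustration and not part of the agent built earlier: the helpers called_before and repeated_calls and both example traces are hypothetical, and it assumes your loop appends every executed tool name to a plain list.

```python
from collections import Counter


def called_before(trace: list[str], first: str, second: str) -> bool:
    """Return True if `first` is called before the first call to `second`."""
    if first not in trace or second not in trace:
        return False
    return trace.index(first) < trace.index(second)


def repeated_calls(trace: list[str]) -> list[str]:
    """Return tool names that appear more than once, in first-seen order."""
    counts = Counter(trace)
    return [name for name in dict.fromkeys(trace) if counts[name] > 1]


# Hypothetical traces, as an agent loop might record them.
good_trace = ["get_order_status", "read_file", "get_current_date"]
bad_trace = ["read_file", "read_file", "get_current_date", "get_order_status"]

print(called_before(good_trace, "get_order_status", "read_file"))
print(called_before(bad_trace, "get_order_status", "read_file"))
print(repeated_calls(bad_trace))
```

A check like this turns "the model wandered" into a concrete assertion you can run against saved traces whenever you change a prompt or a tool description.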
After building this from scratch, these are the practical lessons:
The prompt and model matter, but tool descriptions are routing policy. A bad tool description causes the model to pick the wrong tool or pass wrong arguments, and then the entire agent goes off the rails. Invest time here.
Starting with 1-2 tools and adding more as needed usually produces cleaner behavior than starting with a large tool catalog. When you have too many tools, the model struggles with selection.
The agent loop is small. The hard engineering lives around it. The core loop (call the model, execute a tool, repeat) is the easy part. Cost controls, timeouts, error handling, observability, and human approval are where production reliability comes from.
Smaller or less tool-tuned models need stricter schemas, more examples in tool descriptions, and tighter output parsing. If you're building with open-weight models, budget extra time for tool reliability.
Max iterations, cost caps, and timeouts aren't optional. An agent without limits is a liability. Set conservative limits first and relax them based on observed behavior.
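The limits in that last lesson can be sketched as a small budget object that the loop charges before every model call. This is a minimal illustration under stated assumptions: RunBudget, BudgetExceeded, and the default numbers are hypothetical names and values, and the per-step cost is something your own code would estimate from token usage.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when an agent run crosses one of its limits."""


@dataclass
class RunBudget:
    # Conservative, illustrative defaults; tune them from observed behavior.
    max_iterations: int = 5
    max_cost_usd: float = 0.50
    max_seconds: float = 30.0
    iterations: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, step_cost_usd: float) -> None:
        """Record one loop iteration and stop the run if any limit is exceeded."""
        self.iterations += 1
        self.cost_usd += step_cost_usd
        if self.iterations > self.max_iterations:
            raise BudgetExceeded("max iterations reached")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost cap reached")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded("time limit reached")


budget = RunBudget(max_iterations=3)
steps_completed = 0
stop_reason = None
try:
    while True:  # stand-in for the agent loop; a real loop would call the model here
        budget.charge(step_cost_usd=0.01)
        steps_completed += 1
except BudgetExceeded as exc:
    stop_reason = str(exc)

print(steps_completed, stop_reason)
```

Raising an exception keeps the stop condition in one place, so the surrounding code can log the reason and return a partial answer instead of looping silently.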
Now that you understand the core agent loop, MCP and Tool Protocol Standards will make more sense because MCP standardizes how tools, resources, and prompts are exposed to agents. Structured Output and Constrained Generation is the natural next concept for making tool argument parsing more reliable.
What we built here is a single-agent ReAct loop. This is a common baseline architecture, and it works well for straightforward multi-step tasks where each observation can guide the next action. Production agent systems often use more structured orchestration patterns for complex workloads.[8]
For example, when a task requires rigorous planning or extensive coordination, you might encounter advanced frameworks:
These advanced architectures might seem intimidating at first, but they all build on the same foundation: an LLM, a set of tools, and an iterative loop. Once you understand this core loop, the rest is mostly about arranging the same building blocks into different configurations.
You now have a working mental model of that foundation. The next articles in the LeetLLM path go deeper: ReAct and Plan-and-Execute architectures shows when to switch from a simple loop to a planner, Function Calling and Tool Use dissects how providers structure tool schemas and enforce argument shapes, and Agent Failure States and Recovery catalogs the full set of production failure modes and tested recovery strategies.
1. Anthropic. Building Effective Agents. 2024.
2. OpenAI. Function calling. 2026. OpenAI API Docs.
3. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. 2022. ICLR 2023.
4. OpenAI. Structured outputs. 2024.
5. Mialon, G., et al. GAIA: a Benchmark for General AI Assistants. 2023. ICLR 2024.
6. Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? 2024. ICLR 2024.
7. Zhu, K., et al. JudgeLM: Fine-tuned Large Language Models are Scalable Judges. 2024.
8. Masterman, T., et al. The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling. 2024. arXiv preprint.
9. Anthropic. Introducing the Model Context Protocol. 2024.