We built a working AI agent from an empty file, no frameworks, no abstractions, just an LLM, a loop, and some tools. Here's exactly how it works, where it breaks, and what we learned about making agents reliable.
Imagine hiring a smart assistant who sits at a desk, reads your instructions, and has access to a phone, a calculator, and a filing cabinet. You don't have to explain exactly how to dial the phone or push buttons on the calculator; you just give them a goal, and they figure out which tools to use until the job is done. That's what an AI agent is.
I wanted to understand how these agents work under the hood. Not the framework version. Not the "import LangChain and call it a day" version. The real thing: the loop, the tool calls, the failure modes. So I opened an empty Python file and built one from scratch.
💡 Key insight: What follows is the complete walkthrough: the code, the architecture, the bugs, and the hard lessons about why agents break in ways that normal software doesn't. By the end, you'll have a working agent in about 200 lines of code, and a much better mental model for what's happening inside LangGraph, CrewAI, or any other framework when you use them later.
Strip away the marketing hype and an AI agent is surprisingly simple in concept. It's made of three parts: an LLM that makes decisions, tools that do the work, and a loop that connects them. Here's how that loop compares to a single LLM call:
| Capability | Single LLM Call | Agent Loop |
|---|---|---|
| Multi-step reasoning | Limited to one response | Iterates until completion |
| External actions | No direct tool execution | Calls APIs, files, and services |
| Error recovery | User must retry manually | Can retry or switch strategy |
| State handling | Context only in one prompt | Maintains a running message/tool history |
That's it. The LLM reads a prompt, decides whether it needs to use a tool, calls it, reads the result, and decides again. This cycle repeats until the LLM produces a final answer.
The key insight: the LLM doesn't "do" anything itself. It just decides what to do. The tools do the actual work. The loop is the glue that connects decisions to actions to observations to more decisions.
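That decide-act-observe cycle fits in a dozen lines. The sketch below uses a scripted `fake_llm` stub (an illustration, not a real API call) so the control flow runs on its own before we wire in a real model:

```python
def fake_llm(history):
    """Scripted stand-in for a chat-completion call: it requests one
    tool call, then produces a final answer from the observation."""
    if not any(m["role"] == "tool" for m in history):
        return {"action": "calculate", "args": {"expression": "2 + 2"}}
    return {"answer": f"The result is {history[-1]['content']}"}

def agent_loop(user_message: str) -> str:
    history = [{"role": "user", "content": user_message}]
    while True:
        decision = fake_llm(history)                         # 1. decide
        if "answer" in decision:                             # model is done
            return decision["answer"]
        result = str(eval(decision["args"]["expression"]))   # 2. act (calculator stand-in)
        history.append({"role": "tool", "content": result})  # 3. observe

print(agent_loop("What is 2 + 2?"))  # The result is 4
```

Everything we build from here is this loop with a real model plugged in where `fake_llm` sits.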
💡 Key insight: Tools like LangGraph, CrewAI, and OpenAI Assistants all implement this same loop with varying levels of abstraction. Understanding the raw loop makes every framework easier to learn.
Let's start with the smallest thing that could possibly work. We'll build an agent that can use exactly one tool: a calculator. By constraining the environment, we can focus entirely on the core loop that drives the agent's decision-making process.
To make this work, we need to provide the LLM with a strict definition of what the tool is, what it does, and what inputs it expects. We do this by passing a JSON schema that describes our function. When the model determines that a calculation is necessary, it generates a structured request matching this schema rather than just outputting plain text.
The code below sets up the basic OpenAI client and defines our calculator tool. It takes in a math expression as a string and returns the evaluated result as a string, providing the foundation for our agent's actions.
```python
import openai
import json

client = openai.OpenAI()

# Define a tool
tools = [
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a math expression and return the result",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "A Python math expression, e.g. '2 + 3 * 4'"
                    }
                },
                "required": ["expression"]
            }
        }
    }
]

def calculate(expression: str) -> str:
    """Safely evaluate a math expression."""
    try:
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {e}"

# The agent loop
def run_agent(user_message: str) -> str:
    """Run the agent loop until it completes the task or runs out of iterations."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant. "
                                      "Use the calculate tool when you need to do math."},
        {"role": "user", "content": user_message}
    ]

    while True:
        response = client.chat.completions.create(
            model="gpt-5.4",  # or any tool-capable model
            messages=messages,
            tools=tools,
        )

        choice = response.choices[0]

        # If the model wants to call a tool
        if choice.finish_reason == "tool_calls":
            # Append the assistant's tool-call message once, before the results
            messages.append(choice.message)
            for tool_call in choice.message.tool_calls:
                args = json.loads(tool_call.function.arguments)
                result = calculate(args["expression"])

                # Add the tool result to the conversation
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result,
                })
        else:
            # Model is done, return the final answer
            return choice.message.content

# Test it
answer = run_agent("What is 42 * 17 + 389?")
print(answer)  # "42 * 17 + 389 = 714 + 389 = 1,103"
```
This is 60 lines and it's already a working agent. The `while True` loop is the agent loop. The `finish_reason == "tool_calls"` branch is where the agent decides to act. Everything else is plumbing.
⚠️ Common mistake: Treating this as production-ready. It still lacks error handling, timeouts, cost tracking, loop limits, structured logging, and about 20 other things you need in production. We'll add those. But first, let's understand the core pattern.
A calculator is fun, but a useful agent needs tools that interact with the world. Let's add three more tools: web search, a file reader, and a date/time helper. The following snippets show how to define these operations. Each tool accepts simple string arguments and returns string results, expanding what our agent can accomplish.
```python
import requests
import os
from datetime import datetime
from pathlib import Path

# Tool registry: maps tool names to functions
TOOL_REGISTRY = {}

def tool(func):
    """Decorator that registers a function as a tool."""
    TOOL_REGISTRY[func.__name__] = func
    return func

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression and return the result."""
    try:
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {e}"

@tool
def search_web(query: str) -> str:
    """Search the web and return a summary of results."""
    # In production, use Brave Search, Tavily, or SerpAPI
    response = requests.post(
        "https://api.tavily.com/search",
        json={"query": query, "max_results": 3},
        headers={"Authorization": f"Bearer {os.environ.get('TAVILY_API_KEY')}"}
    )
    results = response.json().get("results", [])
    return "\n".join(
        f"- {r['title']}: {r['content'][:200]}"
        for r in results
    )

@tool
def read_file(path: str) -> str:
    """Read the contents of a file."""
    try:
        content = Path(path).read_text()
        if len(content) > 3000:
            return content[:3000] + "\n... (truncated)"
        return content
    except FileNotFoundError:
        return f"Error: File not found: {path}"

@tool
def get_current_time() -> str:
    """Get the current date and time."""
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
```
The `@tool` decorator is a pattern you'll see in every agent framework. It registers tools in a lookup table so the agent loop can find the right function when the model asks for it.
Now the agent loop becomes generic. We implement a dispatcher function that takes the tool's name and its arguments, and dynamically executes it. If the requested tool is missing from our registry, it gracefully returns an error string.
```python
def execute_tool(name: str, args: dict) -> str:
    """Look up and execute a tool by name."""
    if name not in TOOL_REGISTRY:
        return f"Error: Unknown tool '{name}'"
    return TOOL_REGISTRY[name](**args)
```
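Since every tool here takes simple string arguments, the registry can even generate the JSON schemas the API needs, using each function's signature and docstring. This is a sketch that assumes all parameters are strings; richer tools would need real type mapping:

```python
import inspect

def build_tool_schemas(registry: dict) -> list[dict]:
    """Derive OpenAI-style function schemas from registered tools.
    Assumes every parameter is a string, which holds for our tools."""
    schemas = []
    for name, func in registry.items():
        params = inspect.signature(func).parameters
        schemas.append({
            "type": "function",
            "function": {
                "name": name,
                "description": inspect.getdoc(func) or "",
                "parameters": {
                    "type": "object",
                    "properties": {p: {"type": "string"} for p in params},
                    "required": list(params),
                },
            },
        })
    return schemas
```

With this, adding a new capability is one decorated function: both the dispatcher and the schema list pick it up automatically.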
What we've built so far is an implementation of the ReAct (Reasoning + Acting) pattern[1], a foundational approach that interleaves reasoning traces with external tool actions.
Each iteration of the loop has three phases: the model reasons about what to do next, acts by calling a tool, and observes the result before deciding again.
The model might think: "The user wants to know the population of Tokyo multiplied by the area of France. I need to search for both values, then calculate the product." That multi-step reasoning happens naturally because each tool result feeds back into the context.
💡 Key insight: Our article on ReAct, Plan-and-Execute, and other agentic architectures covers when ReAct works, when it doesn't, and what alternatives exist for complex multi-step tasks.
Here's where things get interesting. I ran this agent on about 50 different tasks. It worked well on straightforward requests. On anything complex, it failed in one of five predictable ways.
The agent calls the same tool with the same arguments, gets the same result, and tries again. And again. And again.
The model doesn't recognize that the tool result already answered the question, or it doesn't know how to recover from an error.
Add a maximum iteration count that forcefully terminates the loop after a set number of steps, preventing infinite loops and bounding the maximum token cost. If the task is incomplete after these iterations, we return a fallback string instead of hanging forever.
```python
MAX_ITERATIONS = 10

def run_agent(user_message: str) -> str:
    messages = [...]

    for iteration in range(MAX_ITERATIONS):
        response = client.chat.completions.create(...)
        choice = response.choices[0]

        if choice.finish_reason != "tool_calls":
            return choice.message.content

        # ... process tool calls ...

    return "I wasn't able to complete this task within the step limit."
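A harder variant of the same failure is the agent burning its iteration budget on the exact same call. One cheap guard (a sketch, not part of the original loop) is to fingerprint each tool call and short-circuit repeats:

```python
import json

def is_repeat_call(seen_calls: set, name: str, args: dict) -> bool:
    """Return True if this exact (tool, arguments) pair was already
    executed. The loop keeps `seen_calls` alive across iterations."""
    key = (name, json.dumps(args, sort_keys=True))
    if key in seen_calls:
        return True
    seen_calls.add(key)
    return False
```

When a repeat is detected, append a tool message like "You already called this tool with the same arguments" instead of re-executing, which nudges the model to change strategy.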
The model invents a tool that doesn't exist ("I'll use the query_database tool to...") and crashes when the lookup fails.
The tool list in the system prompt doesn't match the model's expectations, or the model is trying to use tools it's seen in training data.
Validate tool names before execution, and return a helpful error back to the model so it can realize its mistake and recover. The updated lookup function below verifies the incoming tool name. When an invalid tool is requested, it provides an error message containing the valid options, enabling the agent to self-correct.
```python
def execute_tool(name: str, args: dict) -> str:
    if name not in TOOL_REGISTRY:
        return (f"Error: Tool '{name}' doesn't exist. "
                f"Available tools: {list(TOOL_REGISTRY.keys())}")
    return TOOL_REGISTRY[name](**args)
```
After enough tool calls, the conversation history exceeds the model's context window. The agent either crashes or starts losing the original instructions.
Each tool result adds tokens. A web search might return 500 tokens. After 10 tool calls, you've consumed 5,000+ tokens of context just on tool results.
Summarize or truncate tool results before adding them to context. The following function simply cuts off long strings to avoid exceeding the maximum token limit. It takes the raw text output from a tool and ensures it stays below a specified character threshold, returning a safer, shortened string.
```python
def truncate_result(result: str, max_tokens: int = 500) -> str:
    """Truncate tool results to prevent context overflow."""
    if len(result) > max_tokens * 4:  # rough char-to-token ratio
        return result[:max_tokens * 4] + "\n... (truncated)"
    return result
```
The model generates malformed JSON for tool arguments. This happens more often than you'd expect, especially with smaller models.
LLMs don't always produce valid JSON, particularly under complex argument schemas.
Retry with an error message so the model can attempt to fix its JSON format. Alternatively, use a constrained generation mode like structured outputs if your provider supports it. By wrapping the JSON parser in a try-except block, we catch decoding failures. We then append the error as a tool response, prompting the model to re-generate valid arguments in the next iteration.
```python
try:
    args = json.loads(tool_call.function.arguments)
except json.JSONDecodeError as e:
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": f"Error parsing arguments: {e}. Please try again with valid JSON."
    })
    continue
```
The model picks the wrong tool for the job. It tries to calculate a string comparison, or search_web for something it already knows.
Bad tool descriptions. If the description is vague ("do math stuff"), the model can't decide when to use it.
Write precise tool descriptions that include when to use the tool, not just what it does. Here's a comparison between a vague description and an effective one. The updated description clarifies expected inputs and scenarios, ensuring the LLM reliably selects the correct operation for its current step.
```python
# Bad
"description": "Do calculations"

# Good
"description": "Evaluate a Python math expression and return the numeric result. "
               "Use this when you need to perform arithmetic, compute percentages, "
               "or solve math problems. Input must be a valid Python expression."
```
💡 Key insight: Our article on Agent Failure States, Retries, and Fallback Strategies catalogs 12 failure modes and production-tested recovery strategies.
Our agent has a problem: it forgets everything between runs. Each invocation starts fresh. For many use cases that's fine, but for a conversational agent, you need memory. There are two types: short-term and long-term.
| Memory Type | Mechanism | Lifetime | Primary Use Case | Cost Impact |
|---|---|---|---|---|
| Short-term | Appending to messages array | Current session | Context for multi-turn chats | Increases token cost linearly |
| Long-term | Storing embeddings in a vector DB | Persistent across sessions | Recalling facts and preferences | Requires separate DB storage/retrieval |
Short-term memory is just keeping the `messages` array around between calls instead of rebuilding it each time. Simple but effective. Here's how we initialize the agent with a persistent message list. We define a class that stores the system prompt upon creation. The run method then appends the user's input and maintains the ongoing context across multiple questions.
While this approach works perfectly for short interactions, it scales linearly in cost and latency. Every new turn forces the model to re-process the entire conversation history, which eventually hits the context window limits. For simple agents, however, this in-memory list is the easiest way to give the model a sense of continuity.
```python
class Agent:
    """A conversational agent that maintains short-term memory (context history)."""

    def __init__(self, system_prompt: str):
        """Initialize the agent with a system prompt."""
        self.messages = [
            {"role": "system", "content": system_prompt}
        ]

    def run(self, user_message: str) -> str:
        """Process a user message and run the agent loop."""
        self.messages.append({"role": "user", "content": user_message})

        for iteration in range(MAX_ITERATIONS):
            response = self._call_llm()
            # ... process response ...

        return final_answer
```
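One mitigation for that linear growth is a sliding window that keeps the system prompt plus only the most recent turns. This is a simplistic sketch; production systems usually summarize the dropped turns instead of discarding them outright:

```python
def trim_history(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep the system prompt and the most recent turns, dropping
    older messages once the history exceeds max_messages."""
    if len(messages) <= max_messages:
        return messages
    system, rest = messages[0], messages[1:]
    return [system] + rest[-(max_messages - 1):]
```

Call this on `self.messages` before each LLM request to cap per-turn token cost. The trade-off: the agent silently forgets anything that falls out of the window.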
For information that should persist across sessions, you need a database. The simplest approach: store key facts in a vector database and retrieve them at the start of each conversation. We can expose this as a tool the agent itself can decide to call. The remember tool accepts a string fact and inserts its embedding, while the recall tool searches for related memories based on a text query.
By treating memory as just another set of tools, the agent decides on its own what is worth saving for the future. In more advanced setups, a separate background process might asynchronously summarize the conversation history and extract these facts without requiring explicit tool calls from the primary reasoning agent. The following code implements remember and recall tools interacting with a simple vector database to achieve this.
```python
@tool
def remember(fact: str) -> str:
    """Store an important fact for future reference."""
    db.insert(embed(fact), metadata={"fact": fact, "timestamp": now()})
    return f"Remembered: {fact}"

@tool
def recall(query: str) -> str:
    """Search memory for relevant past facts."""
    results = db.search(embed(query), top_k=5)
    return "\n".join(r.metadata["fact"] for r in results)
```
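The `db` and `embed` calls above are placeholders for a real vector database and embedding model. To make the idea concrete without standing one up, here's a toy in-memory version that ranks facts by naive word overlap instead of embedding similarity (purely illustrative):

```python
class ToyMemory:
    """In-memory stand-in for a vector DB: stores raw facts and
    ranks them by word overlap with the query."""

    def __init__(self):
        self.facts = []

    def insert(self, fact: str) -> None:
        self.facts.append(fact)

    def search(self, query: str, top_k: int = 5) -> list[str]:
        query_words = set(query.lower().split())
        scored = sorted(
            self.facts,
            key=lambda f: len(query_words & set(f.lower().split())),
            reverse=True,
        )
        return scored[:top_k]
```

Swapping in real embeddings changes only the scoring function; the remember/recall tool interface stays identical.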
💡 Key insight: Our article on Agent Memory and Persistence Patterns covers the full spectrum, from simple conversation history to episodic memory and working memory architectures.
The 200-line agent works for demos. For production, you need several more layers. Here's a checklist:
LLM calls are expensive. A runaway agent can burn through hundreds of dollars in minutes. To prevent this, implement a cost tracking class that monitors token usage and halts the agent if it exceeds a specified budget. The CostTracker examines the usage metrics from each API response. If the accumulated token cost surpasses the allowed maximum, it raises an exception to immediately stop execution.
```python
class CostLimitExceeded(Exception):
    """Raised when the agent's accumulated spend passes its budget."""

class CostTracker:
    """Tracks LLM API usage costs across multiple calls."""

    def __init__(self, max_cost_usd: float = 1.0):
        """Initialize tracker with a maximum allowed cost in USD."""
        self.total_cost = 0.0
        self.max_cost = max_cost_usd

    def track(self, response):
        """Calculate and accumulate the cost of a single API response."""
        usage = response.usage
        # Example rates: adjust based on your provider's pricing
        cost = (usage.prompt_tokens * 0.0025 / 1000 +
                usage.completion_tokens * 0.01 / 1000)
        self.total_cost += cost

        if self.total_cost > self.max_cost:
            raise CostLimitExceeded(
                f"Agent exceeded cost limit: ${self.total_cost:.4f} > ${self.max_cost}"
            )
```
When an agent fails, you need to know why. Log every step of the loop to create an audit trail of the agent's actions, arguments, and outcomes. By adding standard Python logging calls to each phase, we capture the chosen tools, their arguments, and the resulting values. This provides a clear timeline of the model's reasoning process.
```python
import logging

logger = logging.getLogger("agent")

for iteration in range(MAX_ITERATIONS):
    logger.info(f"Iteration {iteration + 1}/{MAX_ITERATIONS}")

    response = client.chat.completions.create(...)
    choice = response.choices[0]

    if choice.finish_reason == "tool_calls":
        for tool_call in choice.message.tool_calls:
            logger.info(f"  Tool: {tool_call.function.name}")
            logger.info(f"  Args: {tool_call.function.arguments}")
            result = execute_tool(...)
            logger.info(f"  Result: {result[:200]}")
    else:
        logger.info(f"  Final answer: {choice.message.content[:200]}")
```
A tool might hang forever. A web search might time out. You need per-tool timeouts and a total execution timeout. Python's signal module can be used to wrap tool executions with a strict time limit. This helper function takes the tool's name and arguments alongside a timeout duration. It either returns the successful result or catches the alarm to yield an error string.
```python
import signal

def timeout_handler(signum, frame):
    raise TimeoutError("Tool execution timed out")

def execute_tool_with_timeout(name: str, args: dict, timeout_seconds: int = 30) -> str:
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(timeout_seconds)
    try:
        return execute_tool(name, args)
    except TimeoutError:
        return f"Error: Tool '{name}' timed out after {timeout_seconds}s"
    finally:
        signal.alarm(0)
```
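One caveat: `signal.alarm` only works on Unix, and only on the main thread. A more portable sketch uses `concurrent.futures`, with the trade-off that a timed-out worker thread is abandoned rather than killed, so this only bounds how long the agent loop waits:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

def run_with_timeout(func, kwargs: dict, timeout_seconds: float = 30) -> str:
    """Run a tool function, bounding how long the caller waits.
    The worker thread is abandoned (not killed) on timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except FuturesTimeout:
        return f"Error: tool timed out after {timeout_seconds}s"
    finally:
        # Don't block waiting for a hung worker
        pool.shutdown(wait=False)
```

For truly hostile tools (infinite loops, native hangs), run them in a subprocess you can terminate instead of a thread.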
For high-stakes actions (deleting files, sending emails, making payments), you should require human approval before execution. You can intercept specific tool calls and pause the process to request explicit confirmation. The updated function checks the requested tool against a restricted set. It pauses to prompt the human user and only proceeds if explicit permission is granted.
```python
DANGEROUS_TOOLS = {"delete_file", "send_email", "execute_sql"}

def execute_tool_safe(name: str, args: dict) -> str:
    if name in DANGEROUS_TOOLS:
        print(f"\n⚠️ Agent wants to call: {name}({args})")
        approval = input("Approve? (y/n): ")
        if approval.lower() != "y":
            return "Action was rejected by the user."

    return execute_tool(name, args)
```
🎯 Production tip: Our article on Human-in-the-Loop Agent Architecture covers escalation policies, approval workflows, and how to design agent UX that builds user trust.
Putting it all together, here is the production-ready scaffold. This version requires about 200 lines of actual logic. While it isn't much larger than our naive implementation, it introduces the critical guardrails needed to prevent infinite loops, manage context limits, and track execution costs.
By encapsulating the agent logic within a class, we can easily maintain state across multiple conversational turns. The class constructor initializes the agent with a system prompt, a predefined set of tools, and hard constraints on iterations and budget.
The run method orchestrates the entire lifecycle for a given user message. It takes the user's input, enters the iterative decision loop, safely executes any requested tools, appends the results to the context, and ultimately returns a final synthesized answer back to the user.
```python
class Agent:
    """A production-ready agent with cost controls, iteration limits, and short-term memory."""

    def __init__(
        self,
        system_prompt: str,
        tools: list[dict],
        model: str = "gpt-5.4",
        max_iterations: int = 10,
        max_cost_usd: float = 1.0,
    ):
        """Initialize the agent with constraints and available tools."""
        self.system_prompt = system_prompt
        self.tools = tools
        self.model = model
        self.max_iterations = max_iterations
        self.cost_tracker = CostTracker(max_cost_usd)
        self.messages = [{"role": "system", "content": system_prompt}]

    def run(self, user_message: str) -> str:
        """Run the agent loop for a given user message."""
        self.messages.append({"role": "user", "content": user_message})

        for i in range(self.max_iterations):
            response = self._call_llm()
            self.cost_tracker.track(response)
            choice = response.choices[0]

            # Record the assistant message, whether it's an answer or tool calls
            self.messages.append(choice.message)

            if choice.finish_reason != "tool_calls":
                return choice.message.content

            for tool_call in choice.message.tool_calls:
                try:
                    args = json.loads(tool_call.function.arguments)
                except json.JSONDecodeError:
                    result = "Error: Invalid JSON arguments"
                else:
                    result = execute_tool_safe(
                        tool_call.function.name, args
                    )

                result = truncate_result(result)
                self.messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result,
                })

        return "I reached the maximum number of steps. Here's what I found so far..."
```
After building this from scratch, here are the things that surprised us most:
Not the prompt. Not the model. The tool descriptions. A bad tool description causes the model to pick the wrong tool or pass wrong arguments, and then the entire agent goes off the rails. Invest time here.
We tried starting with 8 tools and the agent got confused. Starting with 1-2 tools and adding more only when needed produced much better results. When you have too many tools, the model struggles with selection.
💡 Key insight: The agent loop is trivially simple. Everything hard is in the edges. The core loop (call LLM, execute tool, repeat) is 30 lines. The cost controls, timeouts, error handling, observability, and human-in-the-loop approval are where all the complexity lives.
GPT-5.4 handles ambiguous tool calls gracefully. An open-weight model like Qwen3.5 needs stricter schemas, more examples in tool descriptions, and tighter output parsing. If you're building with open-source models, budget extra time for tool reliability.
Max iterations, cost caps, and timeouts aren't optional. An agent without limits is a liability. Set conservative limits first and relax them based on observed behavior.
💡 Key insight: Now that you understand the core agent loop, explore MCP and Tool Protocol Standards for the emerging standard for connecting agents to external tools, and Structured Output and Constrained Generation for making tool argument parsing more reliable.
What we built here is a single-agent ReAct loop. This is the most fundamental architecture in the field, and it works exceptionally well for general-purpose queries and straightforward multi-step reasoning. However, production agent systems often rely on more sophisticated orchestration patterns to handle complex workloads.[2]
For example, when a task requires rigorous planning or extensive coordination, you might reach for more advanced patterns: Plan-and-Execute agents that draft a complete plan before acting, multi-agent systems that split work across specialized agents, and DAG-based orchestration that runs independent steps in parallel.
These advanced architectures might seem intimidating at first, but they all build on the exact same foundation: an LLM, a set of tools, and an iterative loop. Once you understand this core loop, the rest is simply arranging these basic building blocks into different configurations.
Ready to go deeper? LeetLLM's agents section covers the full spectrum: ReAct and Plan-and-Execute architectures, Function Calling and Tool Use, Failure States and Recovery, Multi-Agent DAGs, and Human-in-the-Loop design. Start with our free articles and unlock the full curriculum when you're ready.
[1] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
[2] Masterman, T., et al. (2024). The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling. arXiv preprint.
[3] Anthropic (2024). Introducing the Model Context Protocol.