
How to Build an AI Agent from Scratch

We built a working AI agent from an empty file: no frameworks, no abstractions, just an LLM, a loop, and some tools. Here's exactly how it works, where it breaks, and what we learned about making agents reliable.

LeetLLM Team · February 19, 2026 · 25 min read

I wanted to understand how AI agents actually work. Not the framework version. Not the "import LangChain and call it a day" version. The real thing. The loop. The tool calls. The failure modes.

So I opened an empty Python file and built one from scratch.

What follows is the complete walkthrough: the code, the architecture, the bugs, and the hard lessons about why agents break in ways that normal software doesn't. By the end, you'll have a working agent in about 200 lines of code, and a much better mental model for what's happening inside LangGraph, CrewAI, or any other framework when you use them later.

What is an AI agent, really?

[Figure: AI agent concept diagram. Perception (input processing), reasoning (LLM core), and action (tool execution) connected in a feedback loop with the environment.]

Strip away the marketing hype and an AI agent is surprisingly simple in concept. It's made of three parts:

  1. An LLM that can reason about what to do next
  2. A set of tools the LLM can invoke (search, calculate, read files, call APIs)
  3. A loop that runs until the task is done or the agent gives up

| Capability | Single LLM Call | Agent Loop |
| --- | --- | --- |
| Multi-step reasoning | Limited to one response | Iterates until completion |
| External actions | No direct tool execution | Calls APIs, files, and services |
| Error recovery | User must retry manually | Can retry or switch strategy |
| State handling | Context only in one prompt | Maintains a running message/tool history |

That's it. The LLM reads a prompt, decides whether it needs to use a tool, calls it, reads the result, and decides again. This cycle repeats until the LLM produces a final answer[1].

The key insight: the LLM doesn't "do" anything itself. It just decides what to do. The tools do the actual work. The loop is the glue that connects decisions to actions to observations to more decisions.

🎯 Framework comparison: Tools like LangGraph, CrewAI, and OpenAI Assistants all implement this same loop with varying levels of abstraction. Understanding the raw loop makes every framework easier to learn.

The simplest possible agent

Let's start with the smallest thing that could possibly work. An agent that can use a single tool: a calculator.

```python
import openai
import json

client = openai.OpenAI()

# Define a tool
tools = [
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a math expression and return the result",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "A Python math expression, e.g. '2 + 3 * 4'"
                    }
                },
                "required": ["expression"]
            }
        }
    }
]

def calculate(expression: str) -> str:
    """Safely evaluate a math expression."""
    try:
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {e}"

# The agent loop
def run_agent(user_message: str) -> str:
    """Run the agent loop until the model produces a final answer."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant. "
                                      "Use the calculate tool when you need to do math."},
        {"role": "user", "content": user_message}
    ]

    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
        )

        choice = response.choices[0]

        # If the model wants to call a tool
        if choice.finish_reason == "tool_calls":
            # Append the assistant's tool-call message once, then each result
            messages.append(choice.message)
            for tool_call in choice.message.tool_calls:
                args = json.loads(tool_call.function.arguments)
                result = calculate(args["expression"])
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result,
                })
        else:
            # Model is done, return the final answer
            return choice.message.content

# Test it
answer = run_agent("What is 42 * 17 + 389?")
print(answer)  # "42 * 17 + 389 = 714 + 389 = 1103"
```

This is about 60 lines and it's already a working agent. The `while True` loop is the agent loop. The `finish_reason == "tool_calls"` branch is where the agent decides to act. Everything else is plumbing.

āš ļø What we just skipped: Error handling, timeouts, cost tracking, loop limits, structured logging, and about 20 other things you need in production. We'll add those. But first, let's understand the core pattern.

Adding real tools

A calculator is fun, but a useful agent needs tools that interact with the world. Let's add three more tools: web search, a file reader, and a date/time helper.

```python
import os
import requests
from datetime import datetime
from pathlib import Path

TAVILY_API_KEY = os.environ.get("TAVILY_API_KEY", "")

# Tool registry: maps tool names to functions
TOOL_REGISTRY = {}

def tool(func):
    """Decorator that registers a function as a tool."""
    TOOL_REGISTRY[func.__name__] = func
    return func

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression and return the result."""
    try:
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {e}"

@tool
def search_web(query: str) -> str:
    """Search the web and return a summary of results."""
    # In production, use Brave Search, Tavily, or SerpAPI
    response = requests.get(
        "https://api.tavily.com/search",
        params={"query": query, "max_results": 3},
        headers={"Authorization": f"Bearer {TAVILY_API_KEY}"}
    )
    results = response.json().get("results", [])
    return "\n".join(
        f"- {r['title']}: {r['content'][:200]}"
        for r in results
    )

@tool
def read_file(path: str) -> str:
    """Read the contents of a file."""
    try:
        content = Path(path).read_text()
        if len(content) > 3000:
            return content[:3000] + "\n... (truncated)"
        return content
    except FileNotFoundError:
        return f"Error: File not found: {path}"

@tool
def get_current_time() -> str:
    """Get the current date and time."""
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
```

The @tool decorator is a pattern you'll see in every agent framework. It registers tools in a lookup table so the agent loop can find the right function when the model asks for it.
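A natural extension of the registry pattern, sketched here as an illustration (it is not in the original code and assumes every parameter is a string for brevity), is deriving the JSON schema the API expects directly from each registered function's signature and docstring, so the tool list never drifts out of sync with the code:

```python
import inspect

def build_tool_schemas(registry: dict) -> list[dict]:
    """Derive OpenAI-style tool schemas from registered functions.

    Simplifying assumption: every parameter is typed as a string.
    """
    schemas = []
    for name, func in registry.items():
        params = inspect.signature(func).parameters
        schemas.append({
            "type": "function",
            "function": {
                "name": name,
                "description": inspect.getdoc(func) or "",
                "parameters": {
                    "type": "object",
                    "properties": {p: {"type": "string"} for p in params},
                    "required": list(params),
                },
            },
        })
    return schemas

def calculate(expression: str) -> str:
    """Evaluate a math expression and return the result."""
    return str(eval(expression, {"__builtins__": {}}, {}))

schemas = build_tool_schemas({"calculate": calculate})
print(schemas[0]["function"]["name"])  # calculate
```

Frameworks do essentially this under the hood, mapping Python type hints to richer JSON Schema types instead of assuming strings.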

Now the agent loop becomes generic:

```python
def execute_tool(name: str, args: dict) -> str:
    """Look up and execute a tool by name."""
    if name not in TOOL_REGISTRY:
        return f"Error: Unknown tool '{name}'"
    return TOOL_REGISTRY[name](**args)
```

The ReAct pattern

What we've built so far is actually an implementation of the ReAct (Reasoning + Acting) pattern[2]. It's worth naming because it's the most widely used agent architecture:


Each iteration of the loop has three phases:

  1. Think (Reason): The LLM reads the conversation history and decides what to do next
  2. Act: It calls a tool with specific arguments
  3. Observe: The tool result is added to the conversation, and the loop continues

The model might think: "The user wants to know the population of Tokyo multiplied by the area of France. I need to search for both values, then calculate the product." That multi-step reasoning happens naturally because each tool result feeds back into the context.
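To make the three phases concrete, here is a hypothetical message history after one think-act-observe cycle for that question (the content is illustrative, not a real model transcript; only the shape of the messages matters):

```python
# Hypothetical message history after one think-act-observe cycle.
# Roles follow the OpenAI chat format; content is illustrative only.
trace = [
    {"role": "user", "content": "Population of Tokyo times the area of France?"},
    # Think + Act: the model emits a tool call instead of a final answer
    {"role": "assistant", "tool_calls": [
        {"id": "call_1", "function": {"name": "search_web",
                                      "arguments": '{"query": "population of Tokyo"}'}}
    ]},
    # Observe: the tool result is appended and fed into the next LLM call
    {"role": "tool", "tool_call_id": "call_1",
     "content": "- Tokyo population: about 14 million (city proper)"},
]

roles = [m["role"] for m in trace]
print(roles)  # ['user', 'assistant', 'tool']
```

The next iteration would see this whole history, search for France's area, and finally call the calculator.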

💡 Deep architecture dive: Our article on ReAct, Plan-and-Execute, and other agentic architectures covers when ReAct works, when it doesn't, and what alternatives exist for complex multi-step tasks.

Where it breaks

Here's where things get interesting. I ran this agent on about 50 different tasks. It worked well on straightforward requests. On anything complex, it failed in one of five predictable ways.

[Figure: Common agent failure modes (infinite loops, token exhaustion, hallucinated tool calls, and cascading errors), each with mitigation strategies.]

Failure 1: The infinite loop

The agent calls the same tool with the same arguments, gets the same result, and tries again. And again. And again.

What causes it: The model doesn't recognize that the tool result already answered the question, or it doesn't know how to recover from an error.

The fix: Add a maximum iteration count that forcibly terminates the loop after a set number of steps, preventing runaway loops and bounding the worst-case token cost.

```python
MAX_ITERATIONS = 10

def run_agent(user_message: str) -> str:
    messages = [...]

    for iteration in range(MAX_ITERATIONS):
        response = client.chat.completions.create(...)
        choice = response.choices[0]

        if choice.finish_reason != "tool_calls":
            return choice.message.content

        # ... process tool calls ...

    return "I wasn't able to complete this task within the step limit."
```

Failure 2: Hallucinated tool calls

The model invents a tool that doesn't exist ("I'll use the query_database tool to...") and crashes when the lookup fails.

What causes it: The tool list in the system prompt doesn't match the model's expectations, or the model is trying to use tools it's seen in training data.

The fix: Validate tool names before execution, and return a helpful error back to the model so it can realize its mistake and recover.

```python
def execute_tool(name: str, args: dict) -> str:
    if name not in TOOL_REGISTRY:
        return (f"Error: Tool '{name}' does not exist. "
                f"Available tools: {list(TOOL_REGISTRY.keys())}")
    return TOOL_REGISTRY[name](**args)
```

Failure 3: Context window overflow

After enough tool calls, the conversation history exceeds the model's context window. The agent either crashes or starts losing the original instructions.

What causes it: Each tool result adds tokens. A web search might return 500 tokens. After 10 tool calls, you've consumed 5,000+ tokens of context just on tool results.
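A quick back-of-the-envelope sketch (using the common rough heuristic of about 4 characters per token) shows how fast this compounds:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return len(text) // 4

# Ten tool calls, each returning a ~2,000-character web search result:
tool_results = ["x" * 2000 for _ in range(10)]
total = sum(estimate_tokens(r) for r in tool_results)
print(total)  # 5000 tokens of context consumed by tool results alone
```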

The fix: Summarize or truncate tool results before adding them to context. The following function simply cuts off long strings to avoid exceeding the maximum token limit:

```python
def truncate_result(result: str, max_tokens: int = 500) -> str:
    """Truncate tool results to prevent context overflow."""
    if len(result) > max_tokens * 4:  # rough char-to-token ratio
        return result[:max_tokens * 4] + "\n... (truncated)"
    return result
```

Failure 4: Argument parsing errors

The model generates malformed JSON for tool arguments. This happens more often than you'd expect, especially with smaller models.

What causes it: LLMs don't always produce valid JSON, particularly under complex argument schemas.

The fix: Retry with an error message so the model can attempt to fix its JSON format. Alternatively, use a constrained generation mode like structured outputs if your provider supports it.

```python
try:
    args = json.loads(tool_call.function.arguments)
except json.JSONDecodeError as e:
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": f"Error parsing arguments: {e}. Please try again with valid JSON."
    })
    continue
```

Failure 5: Wrong tool selection

The model picks the wrong tool for the job. It tries to use `calculate` for a string comparison, or calls `search_web` for something it already knows.

What causes it: Bad tool descriptions. If the description is vague ("do math stuff"), the model can't decide when to use it.

The fix: Write precise tool descriptions that include when to use the tool, not just what it does. Here is a comparison between a vague description and an effective one:

```python
# Bad
"description": "Do calculations"

# Good
"description": "Evaluate a Python math expression and return the numeric result. "
               "Use this when you need to perform arithmetic, compute percentages, "
               "or solve math problems. Input must be a valid Python expression."
```

💡 More on failure modes: Our article on Agent Failure States, Retries, and Fallback Strategies catalogs 12 failure modes and production-tested recovery strategies.

Adding memory

Our agent has a problem: it forgets everything between runs. Each invocation starts fresh. For many use cases that's fine, but for a conversational agent, you need memory.

There are two types:

Short-term memory (conversation history)

This is just keeping the messages array around between calls instead of rebuilding it each time. Simple but effective. Here is how we initialize the agent with a persistent message list:

```python
class Agent:
    """A conversational agent that maintains short-term memory (context history)."""

    def __init__(self, system_prompt: str):
        """Initialize the agent with a system prompt."""
        self.messages = [
            {"role": "system", "content": system_prompt}
        ]

    def run(self, user_message: str) -> str:
        """Process a user message and run the agent loop."""
        self.messages.append({"role": "user", "content": user_message})

        for iteration in range(MAX_ITERATIONS):
            response = self._call_llm()
            # ... process response ...

        return final_answer
```

Long-term memory (persistent)

For information that should persist across sessions, you need a database. The simplest approach: store key facts in a vector database and retrieve them at the start of each conversation. We can expose this as a tool the agent itself can decide to call:

```python
# `db`, `embed`, and `now` are placeholders for your vector database client,
# embedding function, and clock.

@tool
def remember(fact: str) -> str:
    """Store an important fact for future reference."""
    db.insert(embed(fact), metadata={"fact": fact, "timestamp": now()})
    return f"Remembered: {fact}"

@tool
def recall(query: str) -> str:
    """Search memory for relevant past facts."""
    results = db.search(embed(query), top_k=5)
    return "\n".join(r.metadata["fact"] for r in results)
```

💡 Memory patterns: Our article on Agent Memory and Persistence Patterns covers the full spectrum, from simple conversation history to episodic memory and working memory architectures.

Making it production-ready

The 200-line agent works for demos. For production, you need several more layers. Here's a checklist:

Cost controls

LLM calls are expensive. A runaway agent can burn through hundreds of dollars in minutes. To prevent this, implement a cost tracking class that monitors token usage and halts the agent if it exceeds a specified budget:

```python
class CostLimitExceeded(Exception):
    """Raised when the agent exceeds its cost budget."""

class CostTracker:
    """Tracks LLM API usage costs across multiple calls."""

    def __init__(self, max_cost_usd: float = 1.0):
        """Initialize tracker with a maximum allowed cost in USD."""
        self.total_cost = 0.0
        self.max_cost = max_cost_usd

    def track(self, response):
        """Calculate and accumulate the cost of a single API response."""
        usage = response.usage
        # Example per-token rates; substitute your model's actual pricing
        cost = (usage.prompt_tokens * 0.0025 / 1000 +
                usage.completion_tokens * 0.01 / 1000)
        self.total_cost += cost

        if self.total_cost > self.max_cost:
            raise CostLimitExceeded(
                f"Agent exceeded cost limit: ${self.total_cost:.4f} > ${self.max_cost}"
            )
```

Observability

When an agent fails, you need to know why. Log every step of the loop to create an audit trail of the agent's actions, arguments, and outcomes:

```python
import logging

logger = logging.getLogger("agent")

for iteration in range(MAX_ITERATIONS):
    logger.info(f"Iteration {iteration + 1}/{MAX_ITERATIONS}")

    response = client.chat.completions.create(...)

    if choice.finish_reason == "tool_calls":
        for tool_call in choice.message.tool_calls:
            logger.info(f"  Tool: {tool_call.function.name}")
            logger.info(f"  Args: {tool_call.function.arguments}")
            result = execute_tool(...)
            logger.info(f"  Result: {result[:200]}")
    else:
        logger.info(f"  Final answer: {choice.message.content[:200]}")
```

Timeouts

A tool might hang forever. A web search might time out. You need per-tool timeouts and a total execution timeout. Python's signal module can be used to wrap tool executions with a strict time limit:

```python
import signal

def timeout_handler(signum, frame):
    raise TimeoutError("Tool execution timed out")

def execute_tool_with_timeout(name: str, args: dict, timeout_seconds: int = 30) -> str:
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(timeout_seconds)
    try:
        return execute_tool(name, args)
    except TimeoutError:
        return f"Error: Tool '{name}' timed out after {timeout_seconds}s"
    finally:
        signal.alarm(0)
```

Human-in-the-loop

For high-stakes actions (deleting files, sending emails, making payments), you should require human approval before execution. You can intercept specific tool calls and pause the process to request explicit confirmation:

```python
DANGEROUS_TOOLS = {"delete_file", "send_email", "execute_sql"}

def execute_tool_safe(name: str, args: dict) -> str:
    if name in DANGEROUS_TOOLS:
        print(f"\n⚠️ Agent wants to call: {name}({args})")
        approval = input("Approve? (y/n): ")
        if approval.lower() != "y":
            return "Action was rejected by the user."

    return execute_tool(name, args)
```

💡 Production patterns: Our article on Human-in-the-Loop Agent Architecture covers escalation policies, approval workflows, and how to design agent UX that builds user trust.

The complete agent

Putting it all together, here's the production-ready scaffold. This is about 200 lines of actual logic, which is not much more than the naive version, but it handles the critical failure modes.

```python
class Agent:
    """A production-ready agent with cost controls, iteration limits, and short-term memory."""

    def __init__(
        self,
        system_prompt: str,
        tools: list[dict],
        model: str = "gpt-4o",
        max_iterations: int = 10,
        max_cost_usd: float = 1.0,
    ):
        """Initialize the agent with constraints and available tools."""
        self.system_prompt = system_prompt
        self.tools = tools
        self.model = model
        self.max_iterations = max_iterations
        self.cost_tracker = CostTracker(max_cost_usd)
        self.messages = [{"role": "system", "content": system_prompt}]

    def run(self, user_message: str) -> str:
        """Run the agent loop for a given user message."""
        self.messages.append({"role": "user", "content": user_message})

        for i in range(self.max_iterations):
            response = self._call_llm()
            self.cost_tracker.track(response)
            choice = response.choices[0]

            if choice.finish_reason != "tool_calls":
                self.messages.append(choice.message)
                return choice.message.content

            self.messages.append(choice.message)

            for tool_call in choice.message.tool_calls:
                try:
                    args = json.loads(tool_call.function.arguments)
                except json.JSONDecodeError:
                    result = "Error: Invalid JSON arguments"
                else:
                    result = execute_tool_safe(
                        tool_call.function.name, args
                    )

                result = truncate_result(result)
                self.messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result,
                })

        return "I reached the maximum number of steps. Here's what I found so far..."
```

Key takeaways

After building this from scratch, here are the things that surprised us most:

1. Tool descriptions are the most important part of the system. Not the prompt. Not the model. The tool descriptions. A bad tool description causes the model to pick the wrong tool or pass wrong arguments, and then the entire agent goes off the rails. Invest time here.

2. Start with one tool and add more gradually. We tried starting with 8 tools and the agent got confused. Starting with 1-2 tools and adding more only when needed produced much better results. When you have too many tools, the model struggles with selection.

💡 Key insight: The agent loop is trivially simple. Everything hard is in the edges. The core loop (call LLM, execute tool, repeat) is 30 lines. The cost controls, timeouts, error handling, observability, and human-in-the-loop approval are where all the complexity lives.

3. Smaller models need more guardrails. GPT-4o handles ambiguous tool calls gracefully. A 7B parameter model needs stricter schemas, more examples in tool descriptions, and tighter output parsing. If you're building with open-source models, budget extra time for tool reliability.

4. Always have an escape hatch. Max iterations, cost caps, and timeouts aren't optional. An agent without limits is a liability. Set conservative limits first and relax them based on observed behavior.

💡 Keep building: Now that you understand the core agent loop, explore MCP and Tool Protocol Standards for the emerging standard for connecting agents to external tools, and Structured Output and Constrained Generation for making tool argument parsing more reliable.

Going further

What we built here is a single-agent ReAct loop. Production agent systems often use more sophisticated patterns[3]:

  • Plan-and-Execute: The agent first creates a plan (a list of steps), then executes each step. Better for multi-step tasks.
  • Multi-Agent Systems: Multiple specialized agents that coordinate. One researches, another writes, a third reviews.
  • DAG-Based Orchestration: Complex workflows modeled as directed acyclic graphs with conditional branching.
  • Tool Protocol Standards: MCP (Model Context Protocol)[4] standardizes how agents discover and invoke tools, making agent-tool integration more portable.

These all build on the same foundation: an LLM, tools, and a loop. Once you understand the loop, the rest is architecture patterns on top.
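To make the contrast with ReAct concrete, here is a minimal Plan-and-Execute sketch. The planner and executor below are toy stand-ins for LLM calls (hypothetical, not from the code above): the plan is produced once up front, then each step runs in order with earlier results available as context:

```python
def plan_and_execute(task: str, planner, executor) -> str:
    """Plan-and-Execute sketch.

    planner(task) -> list[str] of steps (one LLM call, up front);
    executor(step, results_so_far) -> str result for that step.
    """
    steps = planner(task)
    results: list[str] = []
    for step in steps:
        # Each step sees the results of earlier steps, like ReAct's
        # observations, but the plan itself is fixed in advance.
        results.append(executor(step, results))
    return results[-1] if results else ""

# Toy stand-ins for the LLM calls:
def toy_planner(task: str) -> list[str]:
    return ["look up population of Tokyo", "look up area of France", "multiply them"]

def toy_executor(step: str, results: list[str]) -> str:
    return f"done: {step}"

print(plan_and_execute("Tokyo population times France area", toy_planner, toy_executor))
# done: multiply them
```

The trade-off versus ReAct: fewer wasted iterations on tasks with a clear structure, but the fixed plan cannot react to surprises mid-execution unless you add a replanning step.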


Ready to go deeper? LeetLLM's agents section covers the full spectrum: ReAct and Plan-and-Execute architectures, Function Calling and Tool Use, Failure States and Recovery, Multi-Agent DAGs, and Human-in-the-Loop design. Start with our free articles and unlock the full curriculum when you're ready.

References

[1] Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools.

[2] Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models.

[3] Wang, L., et al. (2024). A Survey on Large Language Model based Autonomous Agents.

[4] Anthropic (2024). Introducing the Model Context Protocol.