
How to Build an AI Agent from Scratch

Build a working AI agent from the raw loop: define tools, let the model choose one, execute it in Python, append the observation, and add the guardrails that keep agents reliable.

LeetLLM Team · February 19, 2026 · Updated July 14, 2026 · 10 min read

A repo assistant has access to file search, test logs, issue history, and a command runner. You give it a goal; it chooses the next tool; your code executes that tool; and the result becomes context for the next decision. That's the core of an AI agent.

Frameworks make this comfortable. Start with the raw loop so LangGraph, CrewAI, OpenAI tool calling, and MCP-style tool servers are easier to reason about later [1][2]. The important lesson is simple: the model chooses; your runtime acts.

What an agent is

Figure: Bounded agent loop where a user goal reaches the model, code runs the selected tool, and an observation returns to the next decision. The agent loop separates decision from execution: the model picks the next step, code does the real work, and the observation feeds the next turn.

Strip away framework names and an agent has three parts:

A large language model (LLM) that can decide whether to answer or request a tool
A set of tools your code can execute, such as search, read file, run tests, or call APIs
A bounded loop that repeats until the task is done or a safety limit stops it

Figure: Diagram showing User goal, Model decides answer or tool, Runtime validates tool + args, and Tool executes bounded side effect.

Compared with a single LLM call, an agent loop can iterate through observations, call tools through your runtime, retry or switch tools after errors, and carry message plus tool history across steps.

The LLM doesn't touch files, databases, or APIs by itself. It emits structured intent. Your Python code validates that intent, runs allowed tools, and appends the result back into the conversation.

Use this loop only when it earns the complexity. Anthropic's agent guidance starts with the simplest effective system: deterministic workflows beat open-ended agents when the path is already known [3]. Reach for an agent when the task genuinely requires deciding the next step from new evidence.

💡 Key insight: An agent is a runtime contract, not a smarter prompt. The model chooses intent; your code owns validation, execution, memory, limits, and audit logs.

One turn in the loop

Suppose a developer asks, "Can this retry PR break duplicate-execution safety?"

The agent has three tools:

read_file(path) reads a repository file
search_repo(query) finds call sites and tests
run_tests(target) runs an approved test target

A useful trajectory might look like this:

Model requests read_file({"path": "retry_backoff.py"}).
Runtime returns the helper code.
Model searches for retry_backoff submit.
Runtime returns two side-effecting call sites.
Model runs tests/test_retry_backoff.py.
Runtime reports that timeout tests pass, but duplicate execution is untested.
Model answers: "Block merge until side-effecting callers pass an idempotency key and duplicate-execution tests exist."

The model chose the next steps and wrote the answer. File reads, search, and test execution stayed in the runtime.
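
The trajectory above can be replayed without any API by swapping the model for a scripted stand-in. This is only a sketch: `scripted_model`, `replay`, the stub tools, and their canned return strings are hypothetical, chosen to mirror the steps listed above rather than to show a real implementation.

```python
from typing import Callable

# Hypothetical stub tools with canned results; real ones would touch the repo.
def read_file(path: str) -> str:
    return f"<contents of {path}>"

def search_repo(query: str) -> str:
    return "2 side-effecting call sites"

def run_tests(target: str) -> str:
    return "timeout tests pass; duplicate execution untested"

TOOLS: dict[str, Callable[..., str]] = {
    "read_file": read_file,
    "search_repo": search_repo,
    "run_tests": run_tests,
}

# Scripted stand-in for the model: one decision per turn, in order.
SCRIPT = [
    {"tool": "read_file", "args": {"path": "retry_backoff.py"}},
    {"tool": "search_repo", "args": {"query": "retry_backoff submit"}},
    {"tool": "run_tests", "args": {"target": "tests/test_retry_backoff.py"}},
    {"answer": "Block merge until side-effecting callers pass an "
               "idempotency key and duplicate-execution tests exist."},
]

def scripted_model(history: list[dict]) -> dict:
    # Pick the next scripted decision from the number of observations so far.
    observations = sum(1 for item in history if item["role"] == "tool")
    return SCRIPT[observations]

def replay(goal: str, max_steps: int = 6) -> tuple[str, list[dict]]:
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        decision = scripted_model(history)
        if "answer" in decision:
            return decision["answer"], history
        # The runtime, not the model, executes the tool.
        result = TOOLS[decision["tool"]](**decision["args"])
        history.append({"role": "tool", "name": decision["tool"], "content": result})
    return "Stopped after step limit.", history
```

Calling `replay("Can this retry PR break duplicate-execution safety?")` returns the final answer plus a history holding three tool observations, which makes the split visible: decisions come from the (scripted) model, while every side effect happens inside `replay`.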

Build the core loop

The smallest useful implementation needs a tool schema, a registry of Python functions, and a loop that preserves every model output item before appending the matching tool result. That ordering matters because a function_call_output must carry the call_id of the request that produced it. The Responses API also returns reasoning and message items that belong to the next turn, so don't rebuild history from visible text alone [1].

This example defaults to GPT-5.6 Luna for a small, cost-sensitive tool call. Set OPENAI_MODEL to a pinned model ID that fits your workload and evals instead of scattering model names through application code [4].

minimal-tool-agent.py

import json
import os
from openai import OpenAI

client = OpenAI()
MODEL = os.getenv("OPENAI_MODEL", "gpt-5.6-luna")

tools = [{
    "type": "function",
    "name": "estimate_test_runtime",
    "description": "Estimate total runtime for a focused test target.",
    "strict": True,
    "parameters": {
        "type": "object",
        "properties": {
            "test_count": {"type": "integer"},
            "average_ms": {"type": "number"},
        },
        "required": ["test_count", "average_ms"],
        "additionalProperties": False,
    },
}]

def estimate_test_runtime(test_count: int, average_ms: float) -> str:
    seconds = test_count * average_ms / 1000
    return f"{seconds:.1f}s"

TOOL_REGISTRY = {
    "estimate_test_runtime": estimate_test_runtime,
}

def run_agent(user_message: str, max_steps: int = 6) -> str:
    history = [
        {"role": "user", "content": user_message},
    ]

    for _ in range(max_steps):
        response = client.responses.create(
            model=MODEL,
            instructions="Use tools when exact calculation helps.",
            input=history,
            tools=tools,
        )
        history.extend(response.output)
        calls = [item for item in response.output if item.type == "function_call"]

        if not calls:
            return response.output_text

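        for call in calls:
            name = call.name
            try:
                # Arguments are model-generated, so parse them defensively.
                args = json.loads(call.arguments)
                if name not in TOOL_REGISTRY:
                    result = f"Error: unknown tool {name}"
                else:
                    result = TOOL_REGISTRY[name](**args)
            except (json.JSONDecodeError, TypeError) as exc:
                # Return failures as observations so the model can recover.
                result = f"Error: {exc}"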

            history.append({
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": result,
            })

    return "Stopped after step limit."

print(run_agent("How long will 42 tests take at 31ms each?"))

The exact wording depends on the model, but the final answer should include roughly 1.3s. This is already an agent: the model chooses a tool, the runtime executes it, and the observation feeds the next model call. Keeping response.output in history preserves the provider's function-call, reasoning, and message items in their original order.

Strict tool schemas help because tool inputs are model-generated [1]. Treat them as untrusted until the schema, parser, allowlist, sandbox, or approval gate says otherwise.
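
One way to treat arguments as untrusted is to re-check them in the runtime even when the provider enforces a schema. The sketch below assumes a hand-written `ARG_SPECS` table and a `parse_tool_args` helper; both are hypothetical names, and a schema-validation library could replace the manual checks.

```python
import json
from typing import Optional

# Hypothetical per-tool argument spec: argument name -> accepted Python types.
ARG_SPECS = {
    "estimate_test_runtime": {"test_count": int, "average_ms": (int, float)},
}

def parse_tool_args(name: str, raw_arguments: str) -> tuple[Optional[dict], Optional[str]]:
    """Return (args, None) when valid, or (None, error message) otherwise."""
    spec = ARG_SPECS.get(name)
    if spec is None:
        return None, f"unknown tool {name!r}; allowed: {sorted(ARG_SPECS)}"
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return None, f"arguments are not valid JSON: {exc.msg}"
    if not isinstance(args, dict):
        return None, "arguments must be a JSON object"
    if set(args) != set(spec):
        return None, f"expected keys {sorted(spec)}, got {sorted(args)}"
    for key, expected in spec.items():
        value = args[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return None, f"{key} has the wrong type"
    return args, None
```

The error string is meant to go back to the model as the tool observation, so a bad request costs one turn instead of crashing the process.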

Design real tools carefully

Adding repo tools is less about clever prompting and more about safe interfaces. Good tools are narrow, typed, bounded, and explicit about when to use them.

Common repo tools should stay narrow:

read_file(path): reject paths outside the repo and truncate long files.
search_repo(query): use fixed-string rg, a timeout, and an output cap.
run_tests(target): allowlist test paths and return only the useful tail.
Side-effect tools: require human approval and an audit log before patching, deleting, deploying, or sending.
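
As an illustration of the first and third rules, here is a minimal sketch of a bounded file reader and an output tail helper. The `repo_root` parameter, the `MAX_CHARS` value, and the `useful_tail` name are assumptions made for the example, not part of the original tool list.

```python
from pathlib import Path

MAX_CHARS = 4_000  # assumed cap; tune it to your context budget

def read_file(repo_root: Path, path: str, max_chars: int = MAX_CHARS) -> str:
    root = repo_root.resolve()
    target = (root / path).resolve()
    # resolve() collapses ".." segments and symlinks, so escapes are caught here.
    if not target.is_relative_to(root):
        return "Error: path is outside the repository"
    if not target.is_file():
        return "Error: not a file"
    text = target.read_text(encoding="utf-8", errors="replace")
    if len(text) > max_chars:
        dropped = len(text) - max_chars
        return text[:max_chars] + f"\n[truncated {dropped} characters]"
    return text

def useful_tail(output: str, max_lines: int = 40) -> str:
    # Test runners usually print the failure summary last.
    lines = output.splitlines()
    return "\n".join(lines[-max_lines:])
```

Both helpers return error text rather than raising, so a rejected path becomes an observation the model can react to. `Path.is_relative_to` needs Python 3.9 or newer.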

The tool description is routing policy. "Search repository text" is weaker than "Use this when you don't know which file contains a symbol or call site." The model needs to know what the tool does and when it should pick it.
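
A tool definition written with that routing guidance might look like the following sketch. It reuses the schema shape from the example earlier in this article; the description wording and the `query` parameter description are illustrative assumptions.

```python
# Hypothetical definition for the search_repo tool described above.
search_repo_tool = {
    "type": "function",
    "name": "search_repo",
    "description": (
        "Search repository text for a fixed string. "
        "Use this when you don't know which file contains a symbol or call site. "
        "Don't use it to read a file whose path you already know; "
        "use read_file instead."
    ),
    "strict": True,
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Exact text to find, not a regular expression.",
            },
        },
        "required": ["query"],
        "additionalProperties": False,
    },
}
```

The description states what the tool does, when to pick it, and when to pick a sibling tool instead, which is the part a one-line summary leaves out.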

The ReAct pattern

This loop matches the outer shape popularized by ReAct (Reasoning + Acting): model decision, action, observation, then another decision [5]. Modern APIs often hide or compress the reasoning trace, but the control loop is still visible.

Figure: Bounded ReAct control loop where thinking selects a tool action, runtime returns an observation, and stop gates decide whether to answer or continue. ReAct is a stateful control loop, not a four-step checklist. Tool intent crosses from model to runtime, observations return to message history, and step, cost, and timeout gates decide whether another model call is allowed.

Each iteration has three phases:

Decide: The model reads state and chooses a tool or final answer.
Act: The runtime validates the requested tool and arguments, then executes the tool.
Observe: The runtime appends the tool result and calls the model again.

Pattern	Control shape	Best use	Main risk
Deterministic workflow	Fixed steps, optional model calls	Known paths such as classify, retrieve, draft, check	Brittle when evidence changes the next step
ReAct loop	Model picks each next action from observations	Search, inspect, test, explain, and recover tasks	Loops, wrong tools, and growing context
Plan-and-execute	Planner creates steps, workers execute them	Longer tasks with separable subtasks	Stale plans after tool results
Graph workflow	Explicit nodes and transitions	Production flows with review, retry, and approval gates	More code and state-machine upkeep

For architecture variants, the LeetLLM lesson on ReAct, Plan-and-Execute, and other agentic architectures covers when to move from this loop to a planner or graph.

Where agents break first

Once you leave calculator demos, the same failures show up quickly.

Figure: Agent failure map showing a tool loop breaking through loops, fake tools, overflowing context, malformed JSON, and wrong tool routing. Most early agent failures come from loop control, tool validation, context growth, JSON parsing, and tool-routing policy.

Early failures have predictable controls:

Repeated calls: stop with a step cap, cost cap, and progress check.
Hallucinated tool names: reject the request and return the allowed set.
Context overflow: truncate file dumps and logs before they crowd out the task.
Bad arguments: use strict schemas plus validation-error retries [6].
Wrong tool choice: include when-to-use guidance in tool descriptions.

These controls turn an open-ended loop into a bounded runtime. The code isn't glamorous: check the tool name, parse arguments, cap the loop, cap output length, and return errors as observations instead of throwing raw exceptions into the process.
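
Two of these controls, the progress check and the output cap, fit in a few lines. The sketch below is one possible shape; `RepeatGuard`, `cap_output`, and both limits are hypothetical names and values, and a real progress check might compare results as well as requests.

```python
import json

MAX_OUTPUT_CHARS = 2_000  # assumed cap on any single observation
MAX_REPEATS = 2           # assumed: an identical call is allowed at most twice

def cap_output(text: str, limit: int = MAX_OUTPUT_CHARS) -> str:
    if len(text) <= limit:
        return text
    return text[:limit] + "\n[output truncated]"

class RepeatGuard:
    """Counts identical (tool, arguments) requests within one task."""

    def __init__(self, max_repeats: int = MAX_REPEATS) -> None:
        self.max_repeats = max_repeats
        self.seen: dict[str, int] = {}

    def allow(self, name: str, args: dict) -> bool:
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} the same call.
        key = name + ":" + json.dumps(args, sort_keys=True)
        self.seen[key] = self.seen.get(key, 0) + 1
        return self.seen[key] <= self.max_repeats
```

When `allow` returns `False`, the runtime can skip execution and return an observation such as "Error: repeated call, try a different step", which keeps the stop decision in code.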

Memory isn't one thing

An agent has short-term state by default: the message history (the history list in the example above). That's enough for one task, but long-running products need separate memory layers.

Keep current task context in the message history, knowing token cost grows every turn. Retrieval memory is better for fuzzy recall across docs or past sessions, not exact truth. Put permissions, workflow state, deployed revisions, and anything that needs transactions in a database or key-value store.

Don't use a vector database as the source of truth for exact state. Retrieval is for fuzzy recall. Exact state belongs in an authoritative system exposed through tools.
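
For example, a deployed-revision lookup could be backed by a small transactional table and exposed as a tool. This sketch uses Python's built-in `sqlite3`; the `deployments` table, its columns, and the function names are assumptions made for illustration.

```python
import sqlite3

def open_state(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS deployments ("
        "service TEXT PRIMARY KEY, revision TEXT NOT NULL)"
    )
    return conn

def set_revision(conn: sqlite3.Connection, service: str, revision: str) -> None:
    # "with conn" commits on success and rolls back if the statement fails.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO deployments (service, revision) VALUES (?, ?)",
            (service, revision),
        )

def get_deployed_revision(conn: sqlite3.Connection, service: str) -> str:
    """Tool body: an exact lookup, returned as an observation string."""
    row = conn.execute(
        "SELECT revision FROM deployments WHERE service = ?", (service,)
    ).fetchone()
    return row[0] if row else f"Error: no deployment recorded for {service}"
```

The model never sees the database; it requests `get_deployed_revision` and receives either the exact value or an explicit error, never a nearest-neighbor guess.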

Make the loop production-ready

The raw loop is only the middle of a production agent. Reliability comes from controls around it.

Figure: Production agent control stack with memory, budgets, timeouts, approvals, logging, and trajectory evaluation around a tool loop. A production agent isn't only a tool loop. The loop sits inside a control stack: memory, budget, timeout, logging, approval, and evaluation each block a different class of failure.

Production controls wrap the loop:

Cost: track token usage and stop at a budget.
Observability: log step, tool, arguments, truncated result, and stop reason.
Timeouts: add per-tool timeouts plus a total task deadline.
Approval: pause before patching, deleting, deploying, or sending.
Evaluation: score both final answer and tool trajectory so a lucky answer with a risky path still fails review.
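
The cost, deadline, and observability controls can start as a small bookkeeping object checked once per loop iteration. The following is a sketch under assumptions: `RunBudget`, `log_step`, and the stop-reason strings are hypothetical, and the token counts would come from whatever usage report your provider returns.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RunBudget:
    max_tokens: int
    max_seconds: float
    used_tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int) -> None:
        self.used_tokens += tokens

    def stop_reason(self) -> Optional[str]:
        # Checked before each model call; None means the loop may continue.
        if self.used_tokens >= self.max_tokens:
            return "token_budget_exhausted"
        if time.monotonic() - self.started >= self.max_seconds:
            return "deadline_exceeded"
        return None

def log_step(step: int, tool: str, args: dict, result: str,
             max_result_chars: int = 200) -> dict:
    """Build one structured trace record with a truncated result."""
    return {
        "step": step,
        "tool": tool,
        "args": args,
        "result": result[:max_result_chars],
    }
```

Recording the stop reason alongside the per-step records makes it possible to tell a finished task from one that ran out of budget.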

A thread timeout can keep your loop moving, but it can't reliably kill arbitrary stuck Python code. For untrusted or side-effectful tools, run the tool in a separate process or sandbox you can terminate.
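
A minimal sketch of the separate-process approach, using only the standard library: `subprocess.run` kills the child process when its `timeout` expires. The `run_in_subprocess` name and the error strings are assumptions; note that this terminates the direct child only, so tools that spawn their own children need process groups or a real sandbox.

```python
import subprocess
import sys

def run_in_subprocess(argv: list[str], timeout_s: float) -> str:
    try:
        completed = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # on expiry, run() kills the child and raises
        )
    except subprocess.TimeoutExpired:
        return f"Error: tool exceeded {timeout_s}s and was terminated"
    if completed.returncode != 0:
        # Keep only the end of stderr so a long traceback stays bounded.
        return f"Error: exit code {completed.returncode}\n{completed.stderr[-500:]}"
    return completed.stdout

if __name__ == "__main__":
    # A stuck tool is stopped after half a second instead of blocking the loop.
    print(run_in_subprocess([sys.executable, "-c", "import time; time.sleep(30)"], 0.5))
```

Passing the command as a list (not a shell string) also avoids shell interpretation of model-generated arguments.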

Final-answer checks aren't enough for agents. Two trajectories can produce the same answer, and one may be much riskier. As a safety and system-design recommendation, evaluate whether the agent chose an allowed tool, stopped at the right time, recovered from errors, and used evidence correctly. This recommendation isn't a conclusion established by final-answer benchmarks such as GAIA or SWE-bench, and an LLM judge needs its own calibration before it can grade any part of the run.

⚠️ Common mistake: Evaluating only the final answer. A risky trajectory can get lucky once, so review tool sequence, retries, approvals, and stop reason beside answer quality.
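
A rule-based trajectory check can run next to any answer-quality score. The sketch below assumes a trace of per-step records with `step`, `tool`, and optional `approved` keys, plus a hypothetical `apply_patch` side-effect tool; the tool sets and the `final_answer` stop reason are illustrative, not a standard.

```python
ALLOWED_TOOLS = {"read_file", "search_repo", "run_tests"}
NEEDS_APPROVAL = {"apply_patch"}  # hypothetical side-effect tool

def check_trajectory(trace: list[dict], stop_reason: str) -> list[str]:
    """Return a list of violations; an empty list means the path passed."""
    violations = []
    for step in trace:
        tool = step["tool"]
        if tool not in ALLOWED_TOOLS | NEEDS_APPROVAL:
            violations.append(f"step {step['step']}: disallowed tool {tool}")
        if tool in NEEDS_APPROVAL and not step.get("approved", False):
            violations.append(f"step {step['step']}: {tool} ran without approval")
    if stop_reason != "final_answer":
        violations.append(f"run ended with {stop_reason}")
    return violations
```

A run whose answer looks right still fails review if this list is non-empty, which is the behavior the evaluation bullet above asks for.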

Practical rule

Start with one or two tools and watch the trace. Add more tools only when the task needs them, put approvals in front of side effects, keep exact state outside retrieval memory, and stop repeated loops in the runtime instead of waiting for the model to stop on its own.

The single-agent ReAct loop is the smallest useful version. More structured systems, such as Plan-and-Execute, multi-agent orchestration, DAG workflows, and MCP, arrange the same pieces into stricter control flows for larger workloads [7][2].

Next, ReAct and Plan-and-Execute architectures shows when to switch from a simple loop to a planner, Function Calling and Tool Use digs into schemas and argument shapes, and Agent Failure States and Recovery catalogs recovery patterns.


References

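[1] Function calling. OpenAI · 2026 · OpenAI API Docs. https://developers.openai.com/api/docs/guides/function-calling

[2] Introducing the Model Context Protocol. Anthropic · 2024. https://www.anthropic.com/news/model-context-protocol

[3] Building Effective Agents. Anthropic · 2024. https://www.anthropic.com/research/building-effective-agents

[4] GPT-5.6 Luna Model. OpenAI · 2026. https://developers.openai.com/api/docs/models/gpt-5.6-luna

[5] ReAct: Synergizing Reasoning and Acting in Language Models. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. · 2022 · ICLR 2023. https://arxiv.org/abs/2210.03629

[6] Structured outputs. OpenAI · 2024. https://developers.openai.com/api/docs/guides/structured-outputs

[7] The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling. Masterman, T., et al. · 2024 · arXiv preprint. https://arxiv.org/abs/2404.11584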
