Function Calling & Tool Use

Build a safe tool-calling runtime that validates model requests, executes controlled actions, feeds observations back, and evaluates complete workflows.

Careful reasoning over supplied facts still couldn't explain eval run R42: the latest result was missing. No amount of careful prompting can invent a trustworthy live metric. Software has to fetch it.

Function calling gives a language model a typed way to request that fetch. The model doesn't run the evaluation service. It proposes an action such as get_eval_run(run_id="R42"); your runtime checks the request, executes an allowed tool, returns the observation, and asks the model to answer from the result.

This permission boundary is the start of agent engineering. Once an LLM can request reads or writes against real systems, correctness includes parsing, authorization, retries, side effects, latency, and evaluation of the whole trajectory.

A tool call is a request, not an execution

Release assistant Luna is answering, "Why did eval run R42 fail?" The assistant needs a live eval lookup. A good loop has four visible events:

User asks a question that depends on external state.
Model emits a named action with typed arguments.
Runtime validates and executes that action.
Model receives the tool observation and writes a grounded reply.

Figure: Function-calling flow where a user question leads the model to propose a typed eval-run lookup, the runtime executes it inside the trust boundary, and the returned observation grounds the final answer. Caption: The model can request a tool call, but only the runtime may execute it and attach the observation that grounds the reply.

This distinction matters. If the model requests promote_model, it still hasn't changed production traffic. Your application gets a final chance to reject the wrong run, a failed gate, a duplicate action, or a write that needs approval.

Define the smallest useful tool contract

A model chooses tools from the definitions you provide. Each definition needs a name, a description that says when to use it, and an argument schema. Keep the schema narrow: if an eval lookup only needs a run ID, don't expose promotion fields or free-form SQL.

Figure: Tool schema diagram where a compact contract allows one exact eval-run lookup call and rejects malformed calls with extra keys or a missing run ID. Caption: Keep tool contracts narrow. A call with the exact shape passes; malformed calls stop before runtime policy checks even begin.

This logical definition is provider-neutral. Hosted APIs serialize similar information in their own request envelope. If you adapt it to OpenAI strict mode, list every property in required, represent optional values with a nullable type, and keep additionalProperties: false; provider schema subsets differ [1].

define-eval-run-tool.py

TOOL = {
    "name": "get_eval_run",
    "description": "Read live evaluation status for one model run. Do not use for promotion.",
    "parameters": {
        "type": "object",
        "properties": {
            "run_id": {"type": "string"},
            "include_failures": {"type": "boolean"},
        },
        "required": ["run_id"],
        "additionalProperties": False,
    },
}

parameters = TOOL["parameters"]
print(f"tool_name: {TOOL['name']}")
print(f"required: {parameters['required']}")
print(f"accepts_extra_fields: {parameters['additionalProperties']}")

Output

tool_name: get_eval_run
required: ['run_id']
accepts_extra_fields: False

A schema is a contract for shape. It helps the model construct a call and helps your runtime reject malformed input. It doesn't prove that the caller owns the project or that a write action is permitted.
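
The strict-mode adaptation described earlier (every property listed in required, optional values made nullable, additionalProperties kept false) can be sketched as a small transformation. This is a minimal sketch, not a provider SDK call: to_strict_parameters is a hypothetical helper, and it assumes each property declares a plain "type" string or list. Check the result against your provider's current schema subset before relying on it.

```python
import copy

# Provider-neutral parameters, repeated here so the sketch runs on its own.
PARAMETERS = {
    "type": "object",
    "properties": {
        "run_id": {"type": "string"},
        "include_failures": {"type": "boolean"},
    },
    "required": ["run_id"],
    "additionalProperties": False,
}

def to_strict_parameters(parameters: dict) -> dict:
    """Hypothetical helper: return a strict-mode-style copy of a parameter schema."""
    strict = copy.deepcopy(parameters)  # never mutate the shared definition
    originally_required = set(strict.get("required", []))
    for name, spec in strict["properties"].items():
        if name in originally_required:
            continue
        # Optional fields stay expressible by allowing an explicit null.
        declared = spec["type"]
        types = list(declared) if isinstance(declared, list) else [declared]
        if "null" not in types:
            types.append("null")
        spec["type"] = types
    # Every property is listed as required; extra keys stay forbidden.
    strict["required"] = list(strict["properties"])
    strict["additionalProperties"] = False
    return strict

strict = to_strict_parameters(PARAMETERS)
print(f"required: {strict['required']}")
print(f"include_failures type: {strict['properties']['include_failures']['type']}")
print(f"original required: {PARAMETERS['required']}")
```

With this shape the model must always send include_failures, possibly as null, so the runtime validator has to treat null as "not provided" rather than as a type error.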

Parse and validate before dispatch

The model's output crosses a trust boundary. Even when an API offers constrained or strict structured output, your runtime still owns semantic validation and permission checks. Start with the simplest read-only dispatcher: accept one known tool, allow only named fields, and verify field types.

validate-a-read-call.py

class CallRejected(ValueError):
    pass

def validate_eval_call(call: dict[str, object]) -> dict[str, object]:
    if call.get("name") != "get_eval_run":
        raise CallRejected("unknown tool")
    args = call.get("args")
    if not isinstance(args, dict):
        raise CallRejected("args must be an object")
    allowed = {"run_id", "include_failures"}
    unknown = set(args) - allowed
    if unknown:
        raise CallRejected(f"unknown fields: {sorted(unknown)}")
    if not isinstance(args.get("run_id"), str):
        raise CallRejected("run_id must be a string")
    if "include_failures" in args and not isinstance(args["include_failures"], bool):
        raise CallRejected("include_failures must be a boolean")
    return args

candidates = [
    {"name": "get_eval_run", "args": {"run_id": "R42", "include_failures": True}},
    {"name": "get_eval_run", "args": {"run_id": "R42", "promote_now": True}},
]
for call in candidates:
    try:
        args = validate_eval_call(call)
        print(f"accepted: {args['run_id']}")
    except CallRejected as exc:
        print(f"rejected: {exc}")

Output

accepted: R42
rejected: unknown fields: ['promote_now']

Notice that this validator doesn't try to repair a bad call silently. The runtime can hand the rejection back to the model as a structured observation, so the model may correct the call on a later, bounded turn.

Build the complete tool loop

Function calling becomes concrete only when you run the full state transition. In a hosted-model integration, the first model response contains a tool request and the second model response consumes a tool result. To keep this lab executable without credentials, the model below is scripted while the runtime path is real.

complete-tool-loop.py

import json

EVAL_RUNS = {
    "R42": {"status": "failed", "metric": "citation_precision", "score": "0.81"},
}

class ScriptedModel:
    def __init__(self) -> None:
        self.turns = 0

    def respond(self, messages: list[dict[str, object]]) -> dict[str, object]:
        self.turns += 1
        observations = [item for item in messages if item["role"] == "tool"]
        if not observations:
            return {
                "role": "assistant",
                "tool_call": {
                    "id": "eval-1",
                    "name": "get_eval_run",
                    "args": {"run_id": "R42"},
                },
            }
        result = json.loads(str(observations[-1]["content"]))
        return {
            "role": "assistant",
            "content": (
                f"Eval run R42 {result['status']} because {result['metric']} "
                f"scored {result['score']}."
            ),
        }

def execute_eval_tool(call: dict[str, object]) -> dict[str, str]:
    if call.get("name") != "get_eval_run":
        raise ValueError("tool not allowed")
    args = call.get("args")
    if not isinstance(args, dict) or set(args) != {"run_id"}:
        raise ValueError("expected only run_id")
    run_id = args["run_id"]
    if not isinstance(run_id, str) or run_id not in EVAL_RUNS:
        raise ValueError("unknown run")
    return EVAL_RUNS[run_id]

model = ScriptedModel()
messages: list[dict[str, object]] = [
    {"role": "user", "content": "Why did eval run R42 fail?"}
]

first = model.respond(messages)
call = first["tool_call"]
messages.append(first)
observation = execute_eval_tool(call)  # runtime executes, not model
messages.append(
    {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(observation)}
)
final = model.respond(messages)

print(f"requested_tool: {call['name']}")
print(f"tool_status: {observation['status']}")
print(f"answer: {final['content']}")
print(f"model_turns: {model.turns}")

Output

requested_tool: get_eval_run
tool_status: failed
answer: Eval run R42 failed because citation_precision scored 0.81.
model_turns: 2

The transcript is the essential pattern: user message, assistant tool request, tool observation, assistant answer. Preserve a call ID so each observation stays linked to the request that produced it, especially when multiple reads run concurrently. Toolformer showed that models can learn where API calls help during generation [2], and ReAct made the reason/action/observation loop explicit for tool-using tasks [3]. The engineering burden remains in your runtime.
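
To illustrate why the call ID matters, here is a minimal sketch that re-pairs observations with their requests when results arrive out of order. The pair_by_call_id helper, the second call eval-2, and its run R43 are assumptions made for illustration; the message shape follows the scripted transcript above.

```python
# Two read requests issued in the same assistant turn (eval-2 is hypothetical).
calls = [
    {"id": "eval-1", "name": "get_eval_run", "args": {"run_id": "R42"}},
    {"id": "eval-2", "name": "get_eval_run", "args": {"run_id": "R43"}},
]
# Observations listed in completion order: the second call finished first.
observations = [
    {"role": "tool", "tool_call_id": "eval-2", "content": {"status": "passed"}},
    {"role": "tool", "tool_call_id": "eval-1", "content": {"status": "failed"}},
]

def pair_by_call_id(calls: list[dict], observations: list[dict]) -> list[tuple[dict, dict]]:
    """Hypothetical helper: link each observation to the call that produced it."""
    known_ids = {call["id"] for call in calls}
    by_id: dict[str, dict] = {}
    for observation in observations:
        call_id = observation["tool_call_id"]
        if call_id not in known_ids:
            raise ValueError(f"observation for unknown call: {call_id}")
        if call_id in by_id:
            raise ValueError(f"duplicate observation for call: {call_id}")
        by_id[call_id] = observation
    missing = [call["id"] for call in calls if call["id"] not in by_id]
    if missing:
        raise ValueError(f"missing observations: {missing}")
    # Return pairs in request order, regardless of completion order.
    return [(call, by_id[call["id"]]) for call in calls]

pairs = pair_by_call_id(calls, observations)
for call, observation in pairs:
    print(f"{call['id']} {call['args']['run_id']}: {observation['content']['status']}")
```

Pairing by position instead of by ID would attach the R43 result to the R42 request here, which is exactly the kind of silent mix-up the ID prevents.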

Structure isn't permission

Valid arguments can still request a harmful or unauthorized action. An eval lookup is read-only; promote_model changes production traffic. Writes need more checks:

caller owns the target project;
run satisfies release policy;
promotion target is computed from trusted run metadata, not model text;
a human confirms when policy requires approval; and
an idempotency key keeps a retry from applying the same write twice.

Figure: A model-promotion request moves from a proposed call into a runtime gate stack for shape, ownership, policy, approval, and idempotency. Failed checks stop in a reject path, a clean first pass promotes one candidate, and a repeated approved request replays the stored result instead of issuing a second promotion. Caption: Schema-valid writes still need runtime gates. Reject bad calls early, execute one trusted promotion, and let idempotency turn retries into replays instead of duplicate side effects.

guard-a-model-promotion-write.py

from dataclasses import dataclass

@dataclass(frozen=True)
class Session:
    project_id: str

MODEL_RUNS = {
    "R42": {"project_id": "search", "gates_passed": True, "candidate": "reranker-v7"},
}
PROMOTIONS: dict[str, str] = {}

def promote_model(session: Session, run_id: str, confirmed: bool) -> str:
    run = MODEL_RUNS.get(run_id)
    if run is None:
        return "blocked: unknown run"
    if session.project_id != run["project_id"]:
        return "blocked: project ownership failed"
    if not run["gates_passed"]:
        return "blocked: release policy failed"
    if not confirmed:
        return "blocked: confirmation required"
    key = f"promotion:{run_id}"
    if key in PROMOTIONS:
        return f"replayed: promotion already exists for {PROMOTIONS[key]}"
    PROMOTIONS[key] = run["candidate"]
    return f"promoted: {PROMOTIONS[key]}"

print(promote_model(Session("search"), "Z99999", confirmed=True))
print(promote_model(Session("ads"), "R42", confirmed=True))
print(promote_model(Session("search"), "R42", confirmed=False))
print(promote_model(Session("search"), "R42", confirmed=True))
print(promote_model(Session("search"), "R42", confirmed=True))

Output

blocked: unknown run
blocked: project ownership failed
blocked: confirmation required
promoted: reranker-v7
replayed: promotion already exists for reranker-v7

Schema enforcement isn't a security boundary. It can narrow a JSON shape; only application logic knows ownership, policy, approval, and whether a write already happened.

Pass validated arguments as parameters, never strings

Validation narrows what the model can ask for, but how the runtime uses those arguments matters just as much. When a tool touches a database, a shell, or any other interpreter, pass the validated arguments as bound parameters, never as interpolated strings. A run_id that cleared your schema check still becomes an injection vector the moment you build f"SELECT * FROM runs WHERE id = '{run_id}'" or hand it to a shell with subprocess.run(cmd, shell=True). Use the driver's placeholder binding, for example cursor.execute("SELECT * FROM runs WHERE id = %s", (run_id,)) (the placeholder token depends on the driver; sqlite3 uses ? instead of %s), and pass process arguments as an argv list with shell=False. The model proposes values; parameterized execution keeps those values as data instead of letting them turn into code.

Return errors as observations, with limits

Tool calls fail in ordinary ways: the model uses an old field name, a run ID doesn't exist, or a service times out. A useful runtime returns a typed rejection rather than a stack trace. The next model turn can correct the request, but it should get only a small retry budget.

correct-a-rejected-call.py

EVAL_RUNS = {"R42": {"status": "failed"}}

def execute(call: dict[str, object]) -> dict[str, object]:
    if call.get("name") != "get_eval_run":
        return {"ok": False, "error": "tool not allowed"}
    args = call.get("args")
    if not isinstance(args, dict):
        return {"ok": False, "error": "args must be an object"}
    unknown = sorted(set(args) - {"run_id"})
    if unknown:
        return {"ok": False, "error": f"unknown fields: {unknown}"}
    if "run_id" not in args:
        return {"ok": False, "error": "required field: run_id"}
    run_id = args["run_id"]
    if not isinstance(run_id, str) or run_id not in EVAL_RUNS:
        return {"ok": False, "error": "unknown run"}
    return {"ok": True, "status": EVAL_RUNS[run_id]["status"]}

model_attempts = [
    {"name": "get_eval_run", "args": {}},
    {"name": "get_eval_run", "args": {"run_id": "R42"}},
]
for turn, call in enumerate(model_attempts, start=1):
    observation = execute(call)
    print(f"turn_{turn}: {observation}")
    if observation["ok"]:
        break

Output

turn_1: {'ok': False, 'error': 'required field: run_id'}
turn_2: {'ok': True, 'status': 'failed'}

Correction is useful only while it changes the request. If a model repeats the same rejected call, stop before it burns external capacity or triggers repeated writes.

stop-a-repeated-tool-loop.py

import json

attempts = [
    {"name": "get_eval_run", "args": {"eval_id": "R42"}},
    {"name": "get_eval_run", "args": {"eval_id": "R42"}},
    {"name": "get_eval_run", "args": {"run_id": "R42"}},
]
seen: set[str] = set()
max_turns = 3

def rejection_for(call: dict[str, object]) -> str | None:
    if call.get("name") != "get_eval_run":
        return "tool not allowed"
    args = call.get("args")
    if not isinstance(args, dict):
        return "args must be an object"
    unknown = sorted(set(args) - {"run_id"})
    if unknown:
        return f"unknown fields: {unknown}"
    if "run_id" not in args:
        return "required field: run_id"
    return None

for turn, call in enumerate(attempts[:max_turns], start=1):
    error = rejection_for(call)
    if error is None:
        print(f"turn_{turn}: accepted")
        break
    fingerprint = json.dumps(call, sort_keys=True)
    if fingerprint in seen:
        print(f"turn_{turn}: stopped repeated rejected call")
        break
    seen.add(fingerprint)
    print(f"turn_{turn}: rejected: {error}")

Output

turn_1: rejected: unknown fields: ['eval_id']
turn_2: stopped repeated rejected call

Track a turn cap, repeated-call detection, timeout budget, and cost budget for each request. A model that can't recover should return a safe fallback or hand the request to a person.

Parallelize independent reads, not dependent writes

Alex asks, "Compare eval runs R42 and R43, then promote the safer one if it passes." Those two eval reads are independent. Your runtime may execute them concurrently after validating both calls. A request to promote one candidate based on those results can't run at the same time: the write depends on what the reads discover.

parallel-read-only-lookups.py

import asyncio

STATUSES = {
    "R42": "failed",
    "R43": "passed",
}

async def get_eval_status(run_id: str) -> tuple[str, str]:
    await asyncio.sleep(0)
    return run_id, STATUSES[run_id]

async def main() -> None:
    run_ids = ["R42", "R43"]
    rows = await asyncio.gather(*(get_eval_status(run_id) for run_id in run_ids))
    for run_id, status in sorted(rows):
        print(f"{run_id}: {status}")
    print("write_action: wait for validated read results")

asyncio.run(main())

Output

R42: failed
R43: passed
write_action: wait for validated read results

Concurrency saves elapsed time only when actions are independent. For writes on the same model target, serialize execution and apply idempotency rules.
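
Serializing writes on one target can be sketched with a per-target asyncio.Lock plus an idempotency record. This is a minimal in-process illustration under stated assumptions: the target name search-prod, the candidate reranker-v8, and the promote helper are hypothetical, and a real deployment would need a lock and key store shared across processes.

```python
import asyncio
from collections import defaultdict

# One lock per promotion target, created lazily inside the running event loop.
TARGET_LOCKS: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
# Idempotency record: key -> candidate that was already promoted.
APPLIED: dict[str, str] = {}

async def promote(target: str, candidate: str, idempotency_key: str) -> str:
    # Only one write per target may be inside this block at a time.
    async with TARGET_LOCKS[target]:
        if idempotency_key in APPLIED:
            return f"replayed: {APPLIED[idempotency_key]}"
        await asyncio.sleep(0)  # stand-in for the real, slow write
        APPLIED[idempotency_key] = candidate
        return f"promoted: {candidate}"

async def main() -> list[str]:
    # Two identical write requests arrive concurrently, e.g. after a retry.
    return await asyncio.gather(
        promote("search-prod", "reranker-v8", "promotion:R43"),
        promote("search-prod", "reranker-v8", "promotion:R43"),
    )

results = asyncio.run(main())
for line in results:
    print(line)
```

Without the lock, both coroutines could pass the idempotency check before either records its write; holding the lock across the check and the write is what turns the second request into a replay.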

Expose a small allowed toolbox

As an agent grows, passing every internal action to the model wastes context and expands the space of possible mistakes. Filter for authorization first, then route among permitted tools relevant to the current request. Retrieval-augmented API selection appears in Gorilla, which evaluated models against large API collections [4].

Figure: A tool-routing flow starts with a catalog containing get_eval_run, promote_model, and search_eval_policy. Authorization removes the forbidden promotion tool first. Only the allowed tools are ranked against an eval-failure query, and the single relevant read tool is exposed to the model. Caption: Authorization runs before relevance. Forbidden tools never make it into the ranking step or model context.

route-only-allowed-tools.py

import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    description: str

catalog = [
    Tool("get_eval_run", "eval run metric failure score status"),
    Tool("promote_model", "promote model release production traffic"),
    Tool("search_eval_policy", "find release policy gate thresholds"),
]
allowed = {"get_eval_run", "search_eval_policy"}
query = "Why did eval run R42 fail its metric?"
query_terms = set(re.findall(r"[a-z_]+", query.lower()))

ranked = sorted(
    (
        (len(query_terms & set(tool.description.split())), tool.name)
        for tool in catalog
        if tool.name in allowed
    ),
    reverse=True,
)
visible_tools = [name for score, name in ranked if score > 0]

print(f"model_visible_tools: {visible_tools}")
print(f"promotion_tool_exposed: {'promote_model' in visible_tools}")

Output

model_visible_tools: ['get_eval_run']
promotion_tool_exposed: False

Routing isn't authorization. The allowlist is applied first. A highly relevant but forbidden write tool must stay unavailable.

Evaluate trajectory, not final text alone

A pleasant answer can hide a wrong tool call, an unsafe write, or a result invented without an observation. Evaluate the events your runtime controls:

Check	Passing behavior
Tool choice	Requests `get_eval_run` for a live eval-status question
Arguments	Uses allowed keys and exact run ID
Execution safety	Makes no unauthorized write
Grounding	Final response reflects returned status
Efficiency	Stays within round, latency, and cost budgets

BFCL evaluates function-selection and executable calling behavior [5], while Tau-Bench examines longer, policy-constrained interactions with tools and users [6]. Your own release gate needs the schemas, policy failures, and side-effect boundaries your users will hit.

score-a-tool-trajectory.py

def score(events: list[dict[str, object]]) -> tuple[bool, str]:
    if [event.get("type") for event in events] != ["call", "observation", "answer"]:
        return False, "wrong call sequence"

    call, observation, answer = events
    if call.get("name") != "get_eval_run":
        return False, "wrong tool"
    if call.get("args") != {"run_id": "R42"}:
        return False, "wrong arguments"
    if not call.get("id") or observation.get("call_id") != call["id"]:
        return False, "observation mismatched call"
    if observation.get("run_id") != "R42" or observation.get("status") != "failed":
        return False, "missing grounded observation"
    if answer.get("run_id") != observation.get("run_id"):
        return False, "answer ignored observation"
    if answer.get("status") != observation.get("status"):
        return False, "answer ignored observation"
    return True, "trajectory passed"

good = [
    {
        "type": "call",
        "id": "eval-1",
        "name": "get_eval_run",
        "args": {"run_id": "R42"},
    },
    {"type": "observation", "call_id": "eval-1", "run_id": "R42", "status": "failed"},
    {"type": "answer", "run_id": "R42", "status": "failed", "text": "Eval run R42 failed."},
]
bad_order = [
    {"type": "observation", "call_id": "eval-1", "run_id": "R42", "status": "failed"},
    {"type": "call", "id": "eval-1", "name": "get_eval_run", "args": {"run_id": "R42"}},
    {"type": "answer", "run_id": "R42", "status": "failed", "text": "Eval run R42 failed."},
]
bad_semantics = [
    {"type": "call", "id": "eval-1", "name": "get_eval_run", "args": {"run_id": "R42"}},
    {"type": "observation", "call_id": "eval-1", "run_id": "R42", "status": "failed"},
    {"type": "answer", "run_id": "R42", "status": "passed", "text": "R42 did not fail; an older check was marked failed."},
]
print(score(good))
print(score(bad_order))
print(score(bad_semantics))

Output

(True, 'trajectory passed')
(False, 'wrong call sequence')
(False, 'answer ignored observation')

Once correctness is scored, add serving constraints. A runtime that succeeds after ten retries isn't ready for a customer-facing workflow.

release-gate-tool-runtime.py

runs = [
    {"passed": True, "unsafe_writes": 0, "rounds": 2, "latency_ms": 430, "cost_cents": 2.1},
    {"passed": True, "unsafe_writes": 0, "rounds": 2, "latency_ms": 510, "cost_cents": 2.4},
    {"passed": False, "unsafe_writes": 0, "rounds": 3, "latency_ms": 680, "cost_cents": 3.8},
    {"passed": True, "unsafe_writes": 0, "rounds": 2, "latency_ms": 470, "cost_cents": 2.2},
]

success_rate = sum(run["passed"] for run in runs) / len(runs)
unsafe_writes = sum(run["unsafe_writes"] for run in runs)
max_rounds = max(run["rounds"] for run in runs)
max_latency_ms = max(run["latency_ms"] for run in runs)
max_cost_cents = max(run["cost_cents"] for run in runs)
latency_budget_ms = 600
cost_budget_cents = 3.0
ready = (
    success_rate >= 0.75
    and unsafe_writes == 0
    and max_rounds <= 3
    and max_latency_ms <= latency_budget_ms
    and max_cost_cents <= cost_budget_cents
)

print(f"success_rate: {success_rate:.0%}")
print(f"unsafe_writes: {unsafe_writes}")
print(f"max_rounds: {max_rounds}")
print(f"max_latency_ms: {max_latency_ms}")
print(f"latency_budget_ms: {latency_budget_ms}")
print(f"max_cost_cents: {max_cost_cents:.1f}")
print(f"cost_budget_cents: {cost_budget_cents:.1f}")
print(f"release_candidate: {ready}")

Output

success_rate: 75%
unsafe_writes: 0
max_rounds: 3
max_latency_ms: 680
latency_budget_ms: 600
max_cost_cents: 3.8
cost_budget_cents: 3.0
release_candidate: False

The failed release is deliberate: one run exceeds latency and cost budgets even though aggregate success reaches the threshold. These fixtures test controller behavior, not model quality. Replace them with held-out tasks, actual model calls, sandboxed tool results, and labeled policy outcomes before shipping.

From local functions to reusable tools

The examples used in-process Python functions because the execution boundary is easiest to understand there. Real agent products need many capabilities owned by different teams and consumed by more than one host. Rebuilding a custom adapter for every host and service doesn't scale.

The next lesson introduces the Model Context Protocol (MCP). The mental model stays the same: the model proposes a typed action, and a trusted runtime decides whether to execute it. MCP standardizes how hosts discover and invoke externally provided capabilities.

What to remember

The model requests; the runtime executes. Never give a text generator implicit authority over real effects.
Schemas narrow shape, not policy. Validate ownership, release gates, approvals, and idempotency before writes.
Observations close the loop. A grounded answer must follow a returned tool result.
Recovery needs budgets. Reject bad calls with structured errors, then cap retries, repeated actions, latency, and cost.
Evaluate traces. Tool choice, arguments, results, safety, and serving cost all belong in the release gate.

Mastery check

Key concepts

Tool call proposal versus application execution
JSON-like input schemas and server-side validation
Assistant request, tool observation, final-answer loop
Read versus write permissions
Confirmation and idempotency for side effects
Structured errors and bounded correction
Parallel safe reads and serialized writes
Allowed-tool routing
Trajectory-based evaluation
Function calling as prerequisite for MCP

Evaluation rubric

Foundational: Explains why a model can't supply a live eval metric without a tool observation.
Intermediate: Defines and validates a narrow get_eval_run contract.
Intermediate: Implements a complete tool request, execution, observation, and answer loop.
Advanced: Protects a promotion write with ownership, policy, confirmation, and idempotency checks.
Advanced: Builds a trajectory release gate with correctness, unsafe-write, round, latency, and cost measures.

Common pitfalls

Treating a tool call as a completed action: The assistant claims a model was promoted before execution. Make the runtime return an observation before wording the result as complete.
Trusting schema-valid writes: Correct JSON can still target the wrong run. Apply application authorization and policy checks.
Blind retries: A rejected call repeats until budget is gone. Fingerprint rejected requests and cap tool rounds.
Parallelizing effects: Two write calls race on the same model target. Parallelize independent reads only.
Scoring only the final sentence: A correct-looking answer can be ungrounded. Evaluate call, arguments, observation, and outcome.

Practice extension

Extend complete-tool-loop.py with a second read-only tool, get_release_policy(run_id), and a protected write tool, promote_model(run_id). Build five labeled fixtures: two straightforward eval-status questions, one unknown run, one promotion candidate that passes gates, and one candidate that fails gates. Your artifact is a trajectory report containing final outcome, blocked writes, retries, tool rounds, and maximum latency for every fixture.

Next Step

Continue to MCP & Tool Protocol Standards

You can now implement a safe local tool loop and evaluate its trajectories; next comes the host/server boundary that standardizes tools across integrations.


References

[1] OpenAI. Function calling. OpenAI API Docs, 2026. https://developers.openai.com/api/docs/guides/function-calling

[2] Schick, T., et al. Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. https://arxiv.org/abs/2302.04761

[3] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. https://arxiv.org/abs/2210.03629

[4] Patil, S. G., et al. Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint, 2023. https://arxiv.org/abs/2305.15334

[5] Patil, S. G., et al. Berkeley Function-Calling Leaderboard. UC Berkeley Gorilla repository, 2024. https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard

[6] Yao, S., et al. Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint, 2024. https://arxiv.org/abs/2406.12045

print(promote_model(Session("search"), "R42", confirmed=False))
print(promote_model(Session("search"), "R42", confirmed=True))
print(promote_model(Session("search"), "R42", confirmed=True))

Output

blocked: unknown run
blocked: project ownership failed
blocked: confirmation required
promoted: reranker-v7
replayed: promotion already exists for reranker-v7

Schema enforcement isn't a security boundary. It can narrow a JSON shape; only application logic knows ownership, policy, approval, and whether a write already happened.
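
To make that split concrete, here is a minimal sketch that runs a shape check and a policy check on the same request. The `RUN_OWNERS` table and the `shape_ok` and `policy_ok` helpers are hypothetical names introduced only for this illustration; a real runtime would read ownership from its own trusted store.

```python
# Hypothetical ownership table kept by the application, not by the model.
RUN_OWNERS = {"R42": "search"}

def shape_ok(args: object) -> bool:
    """Schema-style check: exactly one string field named run_id."""
    return (
        isinstance(args, dict)
        and set(args) == {"run_id"}
        and isinstance(args["run_id"], str)
    )

def policy_ok(caller_project: str, args: dict[str, str]) -> bool:
    """Application check: the caller's project must own the run."""
    return RUN_OWNERS.get(args["run_id"]) == caller_project

request = {"run_id": "R42"}  # well-formed for every caller
for caller in ("search", "ads"):
    print(f"{caller}: shape={shape_ok(request)} policy={policy_ok(caller, request)}")
```

Both callers send an identical, well-formed request, so the shape check passes twice. Only the policy check separates them: it prints `policy=True` for `search` and `policy=False` for `ads`.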

Pass validated arguments as parameters, never strings

Return errors as observations, with limits

correct-a-rejected-call.py

EVAL_RUNS = {"R42": {"status": "failed"}}

def execute(call: dict[str, object]) -> dict[str, object]:
    if call.get("name") != "get_eval_run":
        return {"ok": False, "error": "tool not allowed"}
    args = call.get("args")
    if not isinstance(args, dict):
        return {"ok": False, "error": "args must be an object"}
    unknown = sorted(set(args) - {"run_id"})
    if unknown:
        return {"ok": False, "error": f"unknown fields: {unknown}"}
    if "run_id" not in args:
        return {"ok": False, "error": "required field: run_id"}
    run_id = args["run_id"]
    if not isinstance(run_id, str) or run_id not in EVAL_RUNS:
        return {"ok": False, "error": "unknown run"}
    return {"ok": True, "status": EVAL_RUNS[run_id]["status"]}

model_attempts = [
    {"name": "get_eval_run", "args": {}},
    {"name": "get_eval_run", "args": {"run_id": "R42"}},
]
for turn, call in enumerate(model_attempts, start=1):
    observation = execute(call)
    print(f"turn_{turn}: {observation}")
    if observation["ok"]:
        break

Output

turn_1: {'ok': False, 'error': 'required field: run_id'}
turn_2: {'ok': True, 'status': 'failed'}

Correction is useful only while it changes the request. If a model repeats the same rejected call, stop before it burns external capacity or triggers repeated writes.

stop-a-repeated-tool-loop.py

import json

attempts = [
    {"name": "get_eval_run", "args": {"eval_id": "R42"}},
    {"name": "get_eval_run", "args": {"eval_id": "R42"}},
    {"name": "get_eval_run", "args": {"run_id": "R42"}},
]
seen: set[str] = set()
max_turns = 3

def rejection_for(call: dict[str, object]) -> str | None:
    if call.get("name") != "get_eval_run":
        return "tool not allowed"
    args = call.get("args")
    if not isinstance(args, dict):
        return "args must be an object"
    unknown = sorted(set(args) - {"run_id"})
    if unknown:
        return f"unknown fields: {unknown}"
    if "run_id" not in args:
        return "required field: run_id"
    return None

for turn, call in enumerate(attempts[:max_turns], start=1):
    error = rejection_for(call)
    if error is None:
        print(f"turn_{turn}: accepted")
        break
    fingerprint = json.dumps(call, sort_keys=True)
    if fingerprint in seen:
        print(f"turn_{turn}: stopped repeated rejected call")
        break
    seen.add(fingerprint)
    print(f"turn_{turn}: rejected: {error}")

Output

turn_1: rejected: unknown fields: ['eval_id']
turn_2: stopped repeated rejected call

Track a turn cap, repeated-call detection, timeout budget, and cost budget for each request. A model that can't recover should return a safe fallback or hand the request to a person.

Parallelize independent reads, not dependent writes

parallel-read-only-lookups.py

import asyncio

STATUSES = {
    "R42": "failed",
    "R43": "passed",
}

async def get_eval_status(run_id: str) -> tuple[str, str]:
    await asyncio.sleep(0)
    return run_id, STATUSES[run_id]

async def main() -> None:
    run_ids = ["R42", "R43"]
    rows = await asyncio.gather(*(get_eval_status(run_id) for run_id in run_ids))
    for run_id, status in sorted(rows):
        print(f"{run_id}: {status}")
    print("write_action: wait for validated read results")

asyncio.run(main())

Output

R42: failed
R43: passed
write_action: wait for validated read results

Concurrency saves elapsed time only when actions are independent. For writes on the same model target, serialize execution and apply idempotency rules.
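
A minimal sketch of that serialization, assuming an in-process runtime: one `asyncio.Lock` per promotion target plus an idempotency table. The names `TARGET_LOCKS`, `APPLIED`, `promote`, and the `search-prod` target are hypothetical, and a multi-process deployment would need a shared lock or a database constraint instead.

```python
import asyncio
from collections import defaultdict

# One lock per target, so writes to different targets don't block each other.
TARGET_LOCKS: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
APPLIED: dict[str, str] = {}  # idempotency key -> candidate already promoted
LOG: list[str] = []

async def promote(target: str, candidate: str, key: str) -> None:
    async with TARGET_LOCKS[target]:
        # The check and the write happen under the same lock.
        if key in APPLIED:
            LOG.append(f"replayed: {key}")
            return
        await asyncio.sleep(0)  # stand-in for the real write
        APPLIED[key] = candidate
        LOG.append(f"promoted: {candidate} on {target}")

async def main() -> None:
    # Two concurrent requests for the same write, e.g. a retry racing the original.
    await asyncio.gather(
        promote("search-prod", "reranker-v7", key="promotion:R42"),
        promote("search-prod", "reranker-v7", key="promotion:R42"),
    )
    for line in LOG:
        print(line)

asyncio.run(main())
```

The run prints `promoted: reranker-v7 on search-prod` and then `replayed: promotion:R42`. Without the lock, both coroutines would pass the `key in APPLIED` check before either recorded its write, and the promotion would be applied twice.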

Expose a small allowed toolbox

route-only-allowed-tools.py

import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    description: str

catalog = [
    Tool("get_eval_run", "eval run metric failure score status"),
    Tool("promote_model", "promote model release production traffic"),
    Tool("search_eval_policy", "find release policy gate thresholds"),
]
allowed = {"get_eval_run", "search_eval_policy"}
query = "Why did eval run R42 fail its metric?"
query_terms = set(re.findall(r"[a-z_]+", query.lower()))

ranked = sorted(
    (
        (len(query_terms & set(tool.description.split())), tool.name)
        for tool in catalog
        if tool.name in allowed
    ),
    reverse=True,
)
visible_tools = [name for score, name in ranked if score > 0]

print(f"model_visible_tools: {visible_tools}")
print(f"promotion_tool_exposed: {'promote_model' in visible_tools}")

Output

model_visible_tools: ['get_eval_run']
promotion_tool_exposed: False

Routing isn't authorization. Here the allowlist filters the catalog before relevance ranking runs, so a highly relevant but forbidden write tool never reaches the model.
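
Hiding a tool from the model doesn't stop the model from naming it anyway, so it is safer to check the allowlist a second time at execution. A minimal sketch, where the `dispatch` helper and its return shape are hypothetical:

```python
ALLOWED = {"get_eval_run", "search_eval_policy"}

def dispatch(call: dict[str, object]) -> dict[str, object]:
    name = call.get("name")
    # Re-check at execution time; don't assume the model only requests
    # tools that routing chose to show it.
    if not isinstance(name, str) or name not in ALLOWED:
        return {"ok": False, "error": "tool not allowed"}
    return {"ok": True, "tool": name}

print(dispatch({"name": "promote_model", "args": {"run_id": "R42"}}))
print(dispatch({"name": "get_eval_run", "args": {"run_id": "R42"}}))
```

The first call is rejected with `tool not allowed` even though its arguments are well-formed. The second call passes the gate and would continue to argument validation.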

Evaluate trajectory, not final text alone

A pleasant answer can hide a wrong tool call, an unsafe write, or a result invented without an observation. Evaluate the events your runtime controls:

Check	Passing behavior
Tool choice	Requests `get_eval_run` for a live eval-status question
Arguments	Uses allowed keys and exact run ID
Execution safety	Makes no unauthorized write
Grounding	Final response reflects returned status
Efficiency	Stays within round, latency, and cost budgets

score-a-tool-trajectory.py

def score(events: list[dict[str, object]]) -> tuple[bool, str]:
    if [event.get("type") for event in events] != ["call", "observation", "answer"]:
        return False, "wrong call sequence"

    call, observation, answer = events
    if call.get("name") != "get_eval_run":
        return False, "wrong tool"
    if call.get("args") != {"run_id": "R42"}:
        return False, "wrong arguments"
    if not call.get("id") or observation.get("call_id") != call["id"]:
        return False, "observation mismatched call"
    if observation.get("run_id") != "R42" or observation.get("status") != "failed":
        return False, "missing grounded observation"
    if answer.get("run_id") != observation.get("run_id"):
        return False, "answer ignored observation"
    if answer.get("status") != observation.get("status"):
        return False, "answer ignored observation"
    return True, "trajectory passed"

good = [
    {
        "type": "call",
        "id": "eval-1",
        "name": "get_eval_run",
        "args": {"run_id": "R42"},
    },
    {"type": "observation", "call_id": "eval-1", "run_id": "R42", "status": "failed"},
    {"type": "answer", "run_id": "R42", "status": "failed", "text": "Eval run R42 failed."},
]
bad_order = [
    {"type": "observation", "call_id": "eval-1", "run_id": "R42", "status": "failed"},
    {"type": "call", "id": "eval-1", "name": "get_eval_run", "args": {"run_id": "R42"}},
    {"type": "answer", "run_id": "R42", "status": "failed", "text": "Eval run R42 failed."},
]
bad_semantics = [
    {"type": "call", "id": "eval-1", "name": "get_eval_run", "args": {"run_id": "R42"}},
    {"type": "observation", "call_id": "eval-1", "run_id": "R42", "status": "failed"},
    {"type": "answer", "run_id": "R42", "status": "passed", "text": "R42 did not fail; an older check was marked failed."},
]
print(score(good))
print(score(bad_order))
print(score(bad_semantics))

Output

(True, 'trajectory passed')
(False, 'wrong call sequence')
(False, 'answer ignored observation')

Once correctness is scored, add serving constraints. A runtime that succeeds after ten retries isn't ready for a customer-facing workflow.

release-gate-tool-runtime.py

runs = [
    {"passed": True, "unsafe_writes": 0, "rounds": 2, "latency_ms": 430, "cost_cents": 2.1},
    {"passed": True, "unsafe_writes": 0, "rounds": 2, "latency_ms": 510, "cost_cents": 2.4},
    {"passed": False, "unsafe_writes": 0, "rounds": 3, "latency_ms": 680, "cost_cents": 3.8},
    {"passed": True, "unsafe_writes": 0, "rounds": 2, "latency_ms": 470, "cost_cents": 2.2},
]

success_rate = sum(run["passed"] for run in runs) / len(runs)
unsafe_writes = sum(run["unsafe_writes"] for run in runs)
max_rounds = max(run["rounds"] for run in runs)
max_latency_ms = max(run["latency_ms"] for run in runs)
max_cost_cents = max(run["cost_cents"] for run in runs)
latency_budget_ms = 600
cost_budget_cents = 3.0
ready = (
    success_rate >= 0.75
    and unsafe_writes == 0
    and max_rounds <= 3
    and max_latency_ms <= latency_budget_ms
    and max_cost_cents <= cost_budget_cents
)

print(f"success_rate: {success_rate:.0%}")
print(f"unsafe_writes: {unsafe_writes}")
print(f"max_rounds: {max_rounds}")
print(f"max_latency_ms: {max_latency_ms}")
print(f"latency_budget_ms: {latency_budget_ms}")
print(f"max_cost_cents: {max_cost_cents:.1f}")
print(f"cost_budget_cents: {cost_budget_cents:.1f}")
print(f"release_candidate: {ready}")

Output

success_rate: 75%
unsafe_writes: 0
max_rounds: 3
max_latency_ms: 680
latency_budget_ms: 600
max_cost_cents: 3.8
cost_budget_cents: 3.0
release_candidate: False
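
A single boolean hides which constraint blocked the release. The sketch below re-checks the same aggregates and names each failing gate. The `measured` values are copied from the output above; the `gates` list and its layout are a hypothetical structure, not part of the original listing.

```python
# Aggregates copied from the printed output above.
measured = {
    "success_rate": 0.75,
    "unsafe_writes": 0,
    "max_rounds": 3,
    "max_latency_ms": 680,
    "max_cost_cents": 3.8,
}

# Each gate pairs a metric with a pass predicate and a readable limit.
gates = [
    ("success_rate", lambda v: v >= 0.75, ">= 0.75"),
    ("unsafe_writes", lambda v: v == 0, "== 0"),
    ("max_rounds", lambda v: v <= 3, "<= 3"),
    ("max_latency_ms", lambda v: v <= 600, "<= 600"),
    ("max_cost_cents", lambda v: v <= 3.0, "<= 3.0"),
]

failures = [
    f"{name}={measured[name]} (limit {limit})"
    for name, passes, limit in gates
    if not passes(measured[name])
]

print(f"release_candidate: {not failures}")
for failure in failures:
    print(f"blocked_by: {failure}")
```

For these runs it reports two blockers, `max_latency_ms=680 (limit <= 600)` and `max_cost_cents=3.8 (limit <= 3.0)`. Both come from the third run, which also failed its trajectory check.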

From local functions to reusable tools

What to remember

The model requests; runtime executes. Never give a text generator implicit authority over real effects.
Schemas narrow shape, not policy. Validate ownership, release gates, approvals, and idempotency before writes.
Observations close the loop. A grounded answer must follow a returned tool result.
Recovery needs budgets. Reject bad calls with structured errors, then cap retries, repeated actions, latency, and cost.
Evaluate traces. Tool choice, arguments, results, safety, and serving cost all belong in the release gate.

Mastery check

Key concepts

Tool call proposal versus application execution
JSON-like input schemas and server-side validation
Assistant request, tool observation, final-answer loop
Read versus write permissions
Confirmation and idempotency for side effects
Structured errors and bounded correction
Parallel safe reads and serialized writes
Allowed-tool routing
Trajectory-based evaluation
Function calling as prerequisite for MCP

Evaluation rubric

Foundational: Explains why a model can't supply a live eval metric without a tool observation.
Intermediate: Defines and validates a narrow get_eval_run contract.
Intermediate: Implements a complete tool request, execution, observation, and answer loop.
Advanced: Protects a promotion write with ownership, policy, confirmation, and idempotency checks.
Advanced: Builds a trajectory release gate with correctness, unsafe-write, round, latency, and cost measures.

Common pitfalls

Treating a tool call as a completed action: The assistant claims a model was promoted before execution. Make the runtime return an observation before wording the result as complete.
Trusting schema-valid writes: Correct JSON can still target the wrong run. Apply application authorization and policy checks.
Blind retries: A rejected call repeats until budget is gone. Fingerprint rejected requests and cap tool rounds.
Parallelizing effects: Two write calls race on the same model target. Parallelize independent reads only.
Scoring only the final sentence: A correct-looking answer can be ungrounded. Evaluate call, arguments, observation, and outcome.

Practice extension

Next Step

Continue to MCP & Tool Protocol Standards

You can now implement a safe local tool loop and evaluate its trajectories; next comes the host/server boundary that standardizes tools across integrations.


