Build a safe tool-calling runtime that validates model requests, executes controlled actions, feeds observations back, and evaluates complete workflows.
In the previous lesson, you learned to spend reasoning effort on a decision made from supplied facts. One fact was deliberately missing: the latest carrier scan for order A10234. No amount of careful prompting can invent a trustworthy live scan. Software has to fetch it.
Function calling gives a language model a typed way to request that fetch. The model doesn't run the order API. It proposes an action such as get_order_status(order_id="A10234"); your runtime checks the request, executes an allowed tool, returns the observation, and asks the model to answer from the result.
This boundary is the start of agent engineering. Once an LLM can request reads or writes against real systems, correctness includes parsing, authorization, retries, side effects, latency, and evaluation of the whole trajectory.
ShopFlow support agent Luna is handling ticket #48291: "My package is late. Where is it?" The assistant needs a live order lookup. A good loop has four visible events: the user's message, the model's tool request, the runtime's tool observation, and the model's final answer.
The distinction between requesting an action and executing it matters. If the model requests create_refund, it still hasn't refunded anyone. Your application gets a final chance to reject a wrong customer, an ineligible order, a duplicate action, or a write that needs approval.
A model chooses tools from the definitions you provide. Each definition needs a name, a description that says when to use it, and an argument schema. Keep the schema narrow: if a status lookup only needs an order ID, don't expose refund fields or free-form input.
The following logical definition is provider-neutral. Hosted APIs serialize similar information in their own request envelope.
```python
TOOL = {
    "name": "get_order_status",
    "description": "Read live shipment status for a customer order. Do not use for refunds.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "include_tracking": {"type": "boolean"},
        },
        "required": ["order_id"],
        "additionalProperties": False,
    },
}

parameters = TOOL["parameters"]
print(f"tool_name: {TOOL['name']}")
print(f"required: {parameters['required']}")
print(f"accepts_extra_fields: {parameters['additionalProperties']}")
```

```text
tool_name: get_order_status
required: ['order_id']
accepts_extra_fields: False
```

A schema is a contract for shape. It helps the model construct a call and helps your runtime reject malformed input. It doesn't prove that the customer owns the order or that an action is permitted.
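To make the "contract for shape" idea concrete, the sketch below checks arguments directly against the parameters object instead of hard-coding field names. It is a minimal illustration, not a full JSON Schema validator: it handles only the `string` and `boolean` types this lesson uses, and the names `PARAMETERS`, `TYPE_MAP`, and `shape_errors` are hypothetical.

```python
# Assumption: a copy of the "parameters" object from the tool definition above.
PARAMETERS = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "include_tracking": {"type": "boolean"},
    },
    "required": ["order_id"],
    "additionalProperties": False,
}

# Only the two JSON types this lesson's schema uses; a real validator covers more.
TYPE_MAP = {"string": str, "boolean": bool}


def shape_errors(schema: dict, args: dict) -> list[str]:
    """Return every shape problem found, or an empty list if the shape is valid."""
    errors: list[str] = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in properties:
            # Extra fields are rejected only when the schema forbids them explicitly.
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected field: {name}")
            continue
        type_name = properties[name]["type"]
        if not isinstance(value, TYPE_MAP[type_name]):
            errors.append(f"{name} must be {type_name}")
    return errors


good = shape_errors(PARAMETERS, {"order_id": "A10234"})
bad = shape_errors(PARAMETERS, {"include_tracking": "yes", "refund_now": True})
print(good)
print(bad)
```

A passing shape check says nothing about ownership or policy; those checks still belong to application logic.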
The model's output crosses a trust boundary. Even when an API offers constrained or strict schema enforcement, your runtime still owns semantic validation and permission checks. Start with the simplest read-only dispatcher: accept one known tool, allow only named fields, and verify field types.
```python
class CallRejected(ValueError):
    pass

def validate_status_call(call: dict[str, object]) -> dict[str, object]:
    if call.get("name") != "get_order_status":
        raise CallRejected("unknown tool")
    args = call.get("args")
    if not isinstance(args, dict):
        raise CallRejected("args must be an object")
    allowed = {"order_id", "include_tracking"}
    unknown = set(args) - allowed
    if unknown:
        raise CallRejected(f"unknown fields: {sorted(unknown)}")
    if not isinstance(args.get("order_id"), str):
        raise CallRejected("order_id must be a string")
    if "include_tracking" in args and not isinstance(args["include_tracking"], bool):
        raise CallRejected("include_tracking must be a boolean")
    return args

candidates = [
    {"name": "get_order_status", "args": {"order_id": "A10234", "include_tracking": True}},
    {"name": "get_order_status", "args": {"order_id": "A10234", "refund_now": True}},
]
for call in candidates:
    try:
        args = validate_status_call(call)
        print(f"accepted: {args['order_id']}")
    except CallRejected as exc:
        print(f"rejected: {exc}")
```

```text
accepted: A10234
rejected: unknown fields: ['refund_now']
```

Notice that this validator doesn't attempt to repair a bad call silently. A rejected request becomes a structured observation, so the model may correct it on a later bounded turn.
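One way a rejection could become a structured observation is sketched below. It assumes a tool-role message with `tool_call_id` and JSON `content` fields, the shape this lesson uses in its loop; the helper `rejection_observation` and the stand-in `validate` function are hypothetical, and hosted APIs define their own envelopes.

```python
import json


class CallRejected(ValueError):
    """Raised by a validator when a model-proposed call is refused."""


def validate(call: dict[str, object]) -> dict[str, object]:
    # Stand-in validator for this sketch: only checks that order_id is a string.
    args = call.get("args")
    if not isinstance(args, dict) or not isinstance(args.get("order_id"), str):
        raise CallRejected("order_id must be a string")
    return args


def rejection_observation(call_id: str, exc: CallRejected) -> dict[str, object]:
    # Assumed message shape: role, tool_call_id, and a JSON string as content.
    payload = {"ok": False, "error": str(exc)}
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(payload)}


# The model sent an integer where a string order ID was expected.
call = {"id": "status-1", "name": "get_order_status", "args": {"order_id": 10234}}
message = None
try:
    validate(call)
except CallRejected as exc:
    # Nothing is executed; the model only sees a short, typed reason.
    message = rejection_observation(str(call["id"]), exc)

print(message)
```

The rejection carries the reason but no stack trace or internal detail, which keeps the next model turn focused on fixing the arguments.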
Function calling becomes concrete only when you run the full state transition. In a hosted-model integration, the first model response contains a tool request and the second model response consumes a tool result. To keep this lab executable without credentials, the model below is scripted while the runtime path is real.
```python
import json

ORDERS = {
    "A10234": {"status": "delayed", "carrier": "FastShip", "eta": "Friday"},
}

class ScriptedModel:
    def __init__(self) -> None:
        self.turns = 0

    def respond(self, messages: list[dict[str, object]]) -> dict[str, object]:
        self.turns += 1
        observations = [item for item in messages if item["role"] == "tool"]
        if not observations:
            return {
                "role": "assistant",
                "tool_call": {
                    "id": "status-1",
                    "name": "get_order_status",
                    "args": {"order_id": "A10234"},
                },
            }
        result = json.loads(str(observations[-1]["content"]))
        return {
            "role": "assistant",
            "content": (
                f"Order A10234 is {result['status']} with {result['carrier']} "
                f"and is now expected {result['eta']}."
            ),
        }

def execute_status_tool(call: dict[str, object]) -> dict[str, str]:
    if call.get("name") != "get_order_status":
        raise ValueError("tool not allowed")
    args = call.get("args")
    if not isinstance(args, dict) or set(args) != {"order_id"}:
        raise ValueError("expected only order_id")
    order_id = args["order_id"]
    if not isinstance(order_id, str) or order_id not in ORDERS:
        raise ValueError("unknown order")
    return ORDERS[order_id]

model = ScriptedModel()
messages: list[dict[str, object]] = [
    {"role": "user", "content": "Where is my order A10234?"}
]

first = model.respond(messages)
call = first["tool_call"]
messages.append(first)
observation = execute_status_tool(call)  # runtime executes, not model
messages.append(
    {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(observation)}
)
final = model.respond(messages)

print(f"requested_tool: {call['name']}")
print(f"tool_status: {observation['status']}")
print(f"answer: {final['content']}")
print(f"model_turns: {model.turns}")
```

```text
requested_tool: get_order_status
tool_status: delayed
answer: Order A10234 is delayed with FastShip and is now expected Friday.
model_turns: 2
```

The transcript is the essential pattern: user message, assistant tool request, tool observation, assistant answer. Preserve a call ID so each observation stays linked to the request that produced it, especially when multiple reads run concurrently. Toolformer showed that models can learn where API calls help during generation, and ReAct made the reason/action/observation loop explicit for tool-using tasks.[1][2] The engineering burden remains in your runtime.
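The call-ID point can be illustrated with a small sketch. It assumes two status reads whose observations arrive in completion order rather than request order; the IDs, the second order B77120, and the dictionary shapes are illustrative, not a provider format.

```python
# Two requests issued in this order.
calls = [
    {"id": "status-1", "name": "get_order_status", "args": {"order_id": "A10234"}},
    {"id": "status-2", "name": "get_order_status", "args": {"order_id": "B77120"}},
]

# Observations may come back in a different order than the requests.
observations = [
    {"tool_call_id": "status-2", "content": {"status": "out_for_delivery"}},
    {"tool_call_id": "status-1", "content": {"status": "delayed"}},
]

by_id = {obs["tool_call_id"]: obs for obs in observations}
requested = {call["id"] for call in calls}

# Refuse to continue if any request lacks a result or any result lacks a request.
missing = requested - set(by_id)
unmatched = set(by_id) - requested
if missing or unmatched:
    raise ValueError(f"missing={sorted(missing)} unmatched={sorted(unmatched)}")

# Pair each request with its own observation, regardless of arrival order.
paired = [
    (call["args"]["order_id"], by_id[call["id"]]["content"]["status"])
    for call in calls
]
print(paired)
```

Pairing by position instead of by ID would attach the "out_for_delivery" status to order A10234 in this example.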
Valid arguments can still request a harmful or unauthorized action. A status lookup is read-only; create_refund changes customer money. Writes need more checks:
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Session:
    customer_id: str

ORDERS = {
    "A10234": {"customer_id": "C17", "eligible": True, "paid_cents": 4900},
}
REFUNDS: dict[str, int] = {}

def create_refund(session: Session, order_id: str, confirmed: bool) -> str:
    order = ORDERS.get(order_id)
    if order is None:
        return "blocked: unknown order"
    if session.customer_id != order["customer_id"]:
        return "blocked: order ownership failed"
    if not order["eligible"]:
        return "blocked: refund policy failed"
    if not confirmed:
        return "blocked: confirmation required"
    key = f"refund:{order_id}"
    if key in REFUNDS:
        return f"replayed: refund already exists for {REFUNDS[key]} cents"
    REFUNDS[key] = order["paid_cents"]
    return f"created: refund for {REFUNDS[key]} cents"

print(create_refund(Session("C17"), "Z99999", confirmed=True))
print(create_refund(Session("C99"), "A10234", confirmed=True))
print(create_refund(Session("C17"), "A10234", confirmed=False))
print(create_refund(Session("C17"), "A10234", confirmed=True))
print(create_refund(Session("C17"), "A10234", confirmed=True))
```

```text
blocked: unknown order
blocked: order ownership failed
blocked: confirmation required
created: refund for 4900 cents
replayed: refund already exists for 4900 cents
```

This is why schema enforcement isn't a security boundary. It can narrow a JSON shape; only application logic knows ownership, policy, approval, and whether a write already happened.
Tool calls fail in ordinary ways: the model uses an old field name, an order ID doesn't exist, or a service times out. A useful runtime returns a typed rejection rather than a stack trace. The next model turn can correct the request, but it should get only a small retry budget.
```python
ORDERS = {"A10234": {"status": "delayed"}}

def execute(call: dict[str, object]) -> dict[str, object]:
    if call.get("name") != "get_order_status":
        return {"ok": False, "error": "tool not allowed"}
    args = call.get("args")
    if not isinstance(args, dict):
        return {"ok": False, "error": "args must be an object"}
    unknown = sorted(set(args) - {"order_id"})
    if unknown:
        return {"ok": False, "error": f"unknown fields: {unknown}"}
    if "order_id" not in args:
        return {"ok": False, "error": "required field: order_id"}
    order_id = args["order_id"]
    if not isinstance(order_id, str) or order_id not in ORDERS:
        return {"ok": False, "error": "unknown order"}
    return {"ok": True, "status": ORDERS[order_id]["status"]}

model_attempts = [
    {"name": "get_order_status", "args": {}},
    {"name": "get_order_status", "args": {"order_id": "A10234"}},
]
for turn, call in enumerate(model_attempts, start=1):
    observation = execute(call)
    print(f"turn_{turn}: {observation}")
    if observation["ok"]:
        break
```

```text
turn_1: {'ok': False, 'error': 'required field: order_id'}
turn_2: {'ok': True, 'status': 'delayed'}
```

Correction is useful only while it changes the request. If a model repeats the same rejected call, stop before it burns external capacity or triggers repeated writes.
```python
import json

attempts = [
    {"name": "get_order_status", "args": {"tracking_id": "A10234"}},
    {"name": "get_order_status", "args": {"tracking_id": "A10234"}},
    {"name": "get_order_status", "args": {"order_id": "A10234"}},
]
seen: set[str] = set()
max_turns = 3

def rejection_for(call: dict[str, object]) -> str | None:
    if call.get("name") != "get_order_status":
        return "tool not allowed"
    args = call.get("args")
    if not isinstance(args, dict):
        return "args must be an object"
    unknown = sorted(set(args) - {"order_id"})
    if unknown:
        return f"unknown fields: {unknown}"
    if "order_id" not in args:
        return "required field: order_id"
    return None

for turn, call in enumerate(attempts[:max_turns], start=1):
    error = rejection_for(call)
    if error is None:
        print(f"turn_{turn}: accepted")
        break
    fingerprint = json.dumps(call, sort_keys=True)
    if fingerprint in seen:
        print(f"turn_{turn}: stopped repeated rejected call")
        break
    seen.add(fingerprint)
    print(f"turn_{turn}: rejected: {error}")
```

```text
turn_1: rejected: unknown fields: ['tracking_id']
turn_2: stopped repeated rejected call
```

Track a turn cap, repeated-call detection, timeout budget, and cost budget for each request. A model that can't recover should return a safe fallback or hand the ticket to a person.
Alex asks, "Compare my two shipments, A10234 and B77120." Those two status reads are independent. Your runtime may execute them concurrently after validating both calls. A request to cancel one shipment based on those results cannot run at the same time: the write depends on what the reads discover.
```python
import asyncio

STATUSES = {
    "A10234": "delayed",
    "B77120": "out_for_delivery",
}

async def get_order_status(order_id: str) -> tuple[str, str]:
    await asyncio.sleep(0)
    return order_id, STATUSES[order_id]

async def main() -> None:
    order_ids = ["A10234", "B77120"]
    rows = await asyncio.gather(*(get_order_status(order_id) for order_id in order_ids))
    for order_id, status in sorted(rows):
        print(f"{order_id}: {status}")
    print("write_action: wait for validated read results")

asyncio.run(main())
```

```text
A10234: delayed
B77120: out_for_delivery
write_action: wait for validated read results
```

Concurrency saves elapsed time only when actions are independent. For writes on the same order, serialize execution and apply idempotency rules.
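Serializing writes on one order could look like the sketch below, which pairs a per-order `asyncio.Lock` with an idempotency key. The `cancel_shipment` tool, the key format, and the in-memory `APPLIED` store are assumptions for illustration; a production system would keep the key in durable storage.

```python
import asyncio
from collections import defaultdict

# One lock per order ID, created on first use inside the running event loop.
LOCKS: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
APPLIED: dict[str, str] = {}  # idempotency key -> order ID already written


async def cancel_shipment(order_id: str, idempotency_key: str) -> str:
    # Hypothetical write tool: only one write per order runs at a time.
    async with LOCKS[order_id]:
        if idempotency_key in APPLIED:
            return "replayed"
        await asyncio.sleep(0)  # stand-in for the real carrier API call
        APPLIED[idempotency_key] = order_id
        return "cancelled"


async def main() -> list[str]:
    # Two duplicate requests arrive concurrently for the same order.
    return await asyncio.gather(
        cancel_shipment("A10234", "cancel:A10234"),
        cancel_shipment("A10234", "cancel:A10234"),
    )


results = asyncio.run(main())
print(results)
```

Without the lock, both coroutines could pass the key check before either one records the write, and the shipment would be cancelled twice.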
As an agent grows, passing every internal action to the model wastes context and expands the space of possible mistakes. Filter for authorization first, then route among permitted tools relevant to the current request. Retrieval-augmented API selection appears in Gorilla, which evaluated models against large API collections.[3]
```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    description: str

catalog = [
    Tool("get_order_status", "shipment status carrier tracking delayed package"),
    Tool("create_refund", "issue refund payment return"),
    Tool("search_policy", "find returns policy eligibility rules"),
]
allowed = {"get_order_status", "search_policy"}
query = "Where is my delayed package shipment?"
query_terms = set(re.findall(r"[a-z_]+", query.lower()))

ranked = sorted(
    (
        (len(query_terms & set(tool.description.split())), tool.name)
        for tool in catalog
        if tool.name in allowed
    ),
    reverse=True,
)
visible_tools = [name for score, name in ranked if score > 0]

print(f"model_visible_tools: {visible_tools}")
print(f"refund_tool_exposed: {'create_refund' in visible_tools}")
```

```text
model_visible_tools: ['get_order_status']
refund_tool_exposed: False
```

Routing is not authorization. The allowlist is applied first. A highly relevant but forbidden write tool must stay unavailable.
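Hiding a tool from the prompt does not stop a model from naming it anyway, so the same checks can run again at dispatch time. The sketch below assumes the runtime remembers both the session allowlist and the tools it offered this turn; `ALLOWED`, `VISIBLE`, and `authorize_dispatch` are hypothetical names.

```python
# Assumption: the session allowlist and the routed subset shown to the model.
ALLOWED = {"get_order_status", "search_policy"}
VISIBLE = ["get_order_status"]


def authorize_dispatch(requested: str) -> tuple[bool, str]:
    """Re-check a requested tool name before any code runs."""
    if requested not in ALLOWED:
        # Authorization comes first, no matter how relevant the tool looks.
        return False, "blocked: not authorized for this session"
    if requested not in VISIBLE:
        # Permitted in principle, but the model was never offered it this turn.
        return False, "blocked: tool was not offered this turn"
    return True, "ok"


for name in ["get_order_status", "search_policy", "create_refund"]:
    print(f"{name}: {authorize_dispatch(name)}")
```

Only the first request is dispatched: search_policy is permitted but was not offered, and create_refund fails authorization outright.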
A pleasant answer can hide a wrong tool call, an unsafe write, or a result invented without an observation. Evaluate the events your runtime controls:
| Check | Passing behavior |
|---|---|
| Tool choice | Requests get_order_status for a live shipment question |
| Arguments | Uses allowed keys and exact order ID |
| Execution safety | Makes no unauthorized write |
| Grounding | Final response reflects returned status |
| Efficiency | Stays within round, latency, and cost budgets |
BFCL evaluates function-selection and executable calling behavior, while Tau-Bench examines longer, policy-constrained interactions with tools and users.[4][5] Your own release gate needs the ShopFlow schemas and policy failures your users will hit.
```python
def score(events: list[dict[str, object]]) -> tuple[bool, str]:
    calls = [event for event in events if event["type"] == "call"]
    observations = [event for event in events if event["type"] == "observation"]
    answers = [event for event in events if event["type"] == "answer"]
    if len(calls) != 1 or calls[0].get("name") != "get_order_status":
        return False, "wrong call sequence"
    if calls[0].get("args") != {"order_id": "A10234"}:
        return False, "wrong arguments"
    if not calls[0].get("id") or not observations:
        return False, "observation mismatched call"
    if observations[0].get("call_id") != calls[0]["id"]:
        return False, "observation mismatched call"
    if observations[0].get("status") != "delayed":
        return False, "missing grounded observation"
    if not answers or "delayed" not in str(answers[0].get("text", "")).lower():
        return False, "answer ignored observation"
    return True, "trajectory passed"

good = [
    {
        "type": "call",
        "id": "status-1",
        "name": "get_order_status",
        "args": {"order_id": "A10234"},
    },
    {"type": "observation", "call_id": "status-1", "status": "delayed"},
    {"type": "answer", "text": "Your order is delayed."},
]
bad = [
    {"type": "answer", "text": "Your order should arrive today."},
]
print(score(good))
print(score(bad))
```

```text
(True, 'trajectory passed')
(False, 'wrong call sequence')
```

Once correctness is scored, add serving constraints. A runtime that succeeds after ten retries is not ready for a customer-facing workflow.
```python
runs = [
    {"passed": True, "unsafe_writes": 0, "rounds": 2, "latency_ms": 430, "cost_cents": 2.1},
    {"passed": True, "unsafe_writes": 0, "rounds": 2, "latency_ms": 510, "cost_cents": 2.4},
    {"passed": False, "unsafe_writes": 0, "rounds": 3, "latency_ms": 680, "cost_cents": 3.8},
    {"passed": True, "unsafe_writes": 0, "rounds": 2, "latency_ms": 470, "cost_cents": 2.2},
]

success_rate = sum(run["passed"] for run in runs) / len(runs)
unsafe_writes = sum(run["unsafe_writes"] for run in runs)
max_rounds = max(run["rounds"] for run in runs)
max_latency_ms = max(run["latency_ms"] for run in runs)
max_cost_cents = max(run["cost_cents"] for run in runs)
latency_budget_ms = 600
cost_budget_cents = 3.0
ready = (
    success_rate >= 0.75
    and unsafe_writes == 0
    and max_rounds <= 3
    and max_latency_ms <= latency_budget_ms
    and max_cost_cents <= cost_budget_cents
)

print(f"success_rate: {success_rate:.0%}")
print(f"unsafe_writes: {unsafe_writes}")
print(f"max_rounds: {max_rounds}")
print(f"max_latency_ms: {max_latency_ms}")
print(f"latency_budget_ms: {latency_budget_ms}")
print(f"max_cost_cents: {max_cost_cents:.1f}")
print(f"cost_budget_cents: {cost_budget_cents:.1f}")
print(f"release_candidate: {ready}")
```

```text
success_rate: 75%
unsafe_writes: 0
max_rounds: 3
max_latency_ms: 680
latency_budget_ms: 600
max_cost_cents: 3.8
cost_budget_cents: 3.0
release_candidate: False
```

The failed release is deliberate: one run exceeds latency and cost budgets even though aggregate success reaches the threshold. These fixtures test controller behavior, not model quality. Replace them with held-out tickets, actual model calls, sandboxed tool results, and labeled policy outcomes before shipping.
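Aggregate maxima show that a gate failed but not which run caused it. A per-run breakdown, sketched below with the same fixture values and budgets, makes the offending trajectory easy to find; the `violations` helper and the `run_N` labels are illustrative.

```python
LATENCY_BUDGET_MS = 600
COST_BUDGET_CENTS = 3.0

# Same fixture values as the release gate above, trimmed to the fields used here.
runs = [
    {"passed": True, "latency_ms": 430, "cost_cents": 2.1},
    {"passed": True, "latency_ms": 510, "cost_cents": 2.4},
    {"passed": False, "latency_ms": 680, "cost_cents": 3.8},
    {"passed": True, "latency_ms": 470, "cost_cents": 2.2},
]


def violations(run: dict[str, float]) -> list[str]:
    """List every gate this single run breaks."""
    reasons: list[str] = []
    if not run["passed"]:
        reasons.append("failed")
    if run["latency_ms"] > LATENCY_BUDGET_MS:
        reasons.append("latency")
    if run["cost_cents"] > COST_BUDGET_CENTS:
        reasons.append("cost")
    return reasons


# Keep only the runs that break at least one gate.
report = {
    f"run_{index}": violations(run)
    for index, run in enumerate(runs, start=1)
    if violations(run)
}
print(report)
```

Here a single trajectory accounts for the failure, the latency overrun, and the cost overrun, which points to one workflow to debug rather than a general slowdown.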
This lesson used in-process Python functions because the execution boundary is easiest to understand there. Real agent products need many capabilities owned by different teams and consumed by more than one host. Rebuilding a custom adapter for every host and service does not scale.
The next lesson introduces the Model Context Protocol (MCP). The mental model stays the same: model proposes a typed action and a trusted runtime decides whether to execute it. MCP standardizes how hosts discover and invoke externally provided capabilities.
get_order_status contract.
Extend complete-tool-loop.py with a second read-only tool, get_return_policy(order_id), and a protected write tool, create_return_label(order_id). Build five labeled ticket fixtures: two straightforward status questions, one unknown order, one eligible return, and one ineligible return. Your artifact is a trajectory report containing final outcome, blocked writes, retries, tool rounds, and maximum latency for every fixture.
[1] Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023.
[2] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
[3] Patil, S. G., et al. (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint.
[4] Patil, S. G., et al. (2024). Berkeley Function-Calling Leaderboard. UC Berkeley Gorilla repository.
[5] Yao, S., et al. (2024). Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint.