LeetLLM

© 2026 LeetLLM. All rights reserved.

Medium · LLM Agents & Tool Use

Function Calling & Tool Use

Build a safe tool-calling runtime that validates model requests, executes controlled actions, feeds observations back, and evaluates complete workflows.


In the previous lesson, you learned to spend reasoning effort on a decision made from supplied facts. One fact was deliberately missing: the latest carrier scan for order A10234. No amount of careful prompting can invent a trustworthy live scan. Software has to fetch it.

Function calling gives a language model a typed way to request that fetch. The model doesn't run the order API. It proposes an action such as get_order_status(order_id="A10234"); your runtime checks the request, executes an allowed tool, returns the observation, and asks the model to answer from the result.

This boundary is the start of agent engineering. Once an LLM can request reads or writes against real systems, correctness includes parsing, authorization, retries, side effects, latency, and evaluation of the whole trajectory.

A tool call is a request, not an execution

ShopFlow support agent Luna is handling ticket #48291: "My package is late. Where is it?" The assistant needs a live order lookup. A good loop has four visible events:

  1. User asks a question that depends on external state.
  2. Model emits a named action with typed arguments.
  3. Runtime validates and executes that action.
  4. Model receives the tool observation and writes a grounded reply.
Figure: Function-calling loop for a delayed parcel where the model proposes get_order_status, the runtime validates and executes it, then the model answers from the observation.
A model proposes the call; application code owns execution and the resulting side effects.

This distinction matters. If the model requests create_refund, it still hasn't refunded anyone. Your application gets a final chance to reject a wrong customer, an ineligible order, a duplicate action, or a write that needs approval.
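A minimal sketch of that final gate, assuming a hypothetical in-memory registry. The names `HANDLERS`, `APPROVAL_REQUIRED`, and `dispatch` are illustrative and not part of any provider SDK, and argument validation is left out here because the later sections cover it. The point is that a proposal is plain data: nothing changes until the runtime chooses to call a handler.

```python
from typing import Callable

# Hypothetical registry: tool name -> handler owned by the application.
HANDLERS: dict[str, Callable[..., str]] = {
    "get_order_status": lambda order_id: f"status for {order_id}: delayed",
    "create_refund": lambda order_id: f"refund issued for {order_id}",
}
# Assumed policy: these tools never run without explicit approval.
APPROVAL_REQUIRED = {"create_refund"}


def dispatch(proposal: dict[str, object], approved: bool = False) -> str:
    """Decide whether a model proposal becomes a real call."""
    name = proposal.get("name")
    args = proposal.get("args")
    if not isinstance(name, str) or name not in HANDLERS:
        return "rejected: unknown tool"
    if not isinstance(args, dict):
        return "rejected: args must be an object"
    if name in APPROVAL_REQUIRED and not approved:
        return "held: approval required"
    return HANDLERS[name](**args)


proposal = {"name": "create_refund", "args": {"order_id": "A10234"}}
print(dispatch(proposal))                 # the proposal alone changes nothing
print(dispatch(proposal, approved=True))  # the runtime chose to execute
```

The first call prints `held: approval required`; only the second, explicitly approved call reaches the handler.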

Define the smallest useful tool contract

A model chooses tools from the definitions you provide. Each definition needs a name, a description that says when to use it, and an argument schema. Keep the schema narrow: if a status lookup only needs an order ID, don't expose refund fields or free-form inputs.

Figure: Tool schema for get_order_status showing its purpose, input fields, required order ID, and rejection of additional arguments.
A compact schema narrows what the model may request and makes runtime validation straightforward.

The following logical definition is provider-neutral. Hosted APIs serialize similar information in their own request envelope.

define-order-status-tool.py

```python
TOOL = {
    "name": "get_order_status",
    "description": "Read live shipment status for a customer order. Do not use for refunds.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "include_tracking": {"type": "boolean"},
        },
        "required": ["order_id"],
        "additionalProperties": False,
    },
}

parameters = TOOL["parameters"]
print(f"tool_name: {TOOL['name']}")
print(f"required: {parameters['required']}")
print(f"accepts_extra_fields: {parameters['additionalProperties']}")
```

Output

```text
tool_name: get_order_status
required: ['order_id']
accepts_extra_fields: False
```

A schema is a contract for shape. It helps the model construct a call and helps your runtime reject malformed input. It doesn't prove that the customer owns the order or that an action is permitted.
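To make "contract for shape" concrete, here is a small sketch of a runtime check driven by the schema itself. It assumes a copy of the parameter schema from the definition above and understands only the two JSON Schema types that appear there; `PYTHON_TYPES` and `shape_errors` are hypothetical helpers, and a production runtime would more likely rely on a full JSON Schema validator.

```python
# Assumed copy of the tool's parameter schema from the definition above.
PARAMETERS = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "include_tracking": {"type": "boolean"},
    },
    "required": ["order_id"],
    "additionalProperties": False,
}
# Only the JSON Schema types this sketch understands.
PYTHON_TYPES = {"string": str, "boolean": bool}


def shape_errors(args: dict[str, object], schema: dict) -> list[str]:
    """Return shape problems only; ownership and policy are out of scope."""
    errors = []
    properties = schema["properties"]
    for field in schema["required"]:
        if field not in args:
            errors.append(f"missing: {field}")
    for field, value in args.items():
        if field not in properties:
            if schema["additionalProperties"] is False:
                errors.append(f"unexpected: {field}")
            continue
        expected = PYTHON_TYPES[properties[field]["type"]]
        if not isinstance(value, expected):
            errors.append(f"wrong type: {field}")
    return errors


print(shape_errors({"order_id": "A10234"}, PARAMETERS))
print(shape_errors({"include_tracking": "yes", "amount": 5}, PARAMETERS))
```

An empty list means only that the shape is acceptable. The same call can still target an order the caller does not own.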

Parse and validate before dispatch

The model's output crosses a trust boundary. Even when an API constrains output to your schema or offers a strict mode, your runtime still owns semantic validation and permission checks. Start with the simplest read-only dispatcher: accept one known tool, allow only named fields, and verify field types.

validate-a-read-call.py

```python
class CallRejected(ValueError):
    pass

def validate_status_call(call: dict[str, object]) -> dict[str, object]:
    if call.get("name") != "get_order_status":
        raise CallRejected("unknown tool")
    args = call.get("args")
    if not isinstance(args, dict):
        raise CallRejected("args must be an object")
    allowed = {"order_id", "include_tracking"}
    unknown = set(args) - allowed
    if unknown:
        raise CallRejected(f"unknown fields: {sorted(unknown)}")
    if not isinstance(args.get("order_id"), str):
        raise CallRejected("order_id must be a string")
    if "include_tracking" in args and not isinstance(args["include_tracking"], bool):
        raise CallRejected("include_tracking must be a boolean")
    return args

candidates = [
    {"name": "get_order_status", "args": {"order_id": "A10234", "include_tracking": True}},
    {"name": "get_order_status", "args": {"order_id": "A10234", "refund_now": True}},
]
for call in candidates:
    try:
        args = validate_status_call(call)
        print(f"accepted: {args['order_id']}")
    except CallRejected as exc:
        print(f"rejected: {exc}")
```

Output

```text
accepted: A10234
rejected: unknown fields: ['refund_now']
```

Notice that this validator doesn't attempt to repair a bad call silently. A rejected request becomes a structured observation, so the model may correct it on a later bounded turn.
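One way to turn a rejection into a structured observation is sketched below. The message shape (`role`, `tool_call_id`, `content`) mirrors a typical chat transcript rather than any specific provider's wire format, and `to_observation` plus the `retryable` flag are assumptions made for illustration.

```python
import json


class CallRejected(ValueError):
    """Same idea as the validator's exception above."""


def to_observation(call_id: str, exc: CallRejected) -> dict[str, str]:
    # Hypothetical payload: a short, typed error instead of a stack trace.
    payload = {"ok": False, "error": str(exc), "retryable": True}
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(payload)}


try:
    raise CallRejected("unknown fields: ['refund_now']")
except CallRejected as exc:
    message = to_observation("status-1", exc)

print(message["content"])
```

The model sees the reason for the rejection on its next turn, while the runtime keeps the decision about whether another attempt is allowed.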

Build the complete tool loop

Function calling becomes concrete only when you run the full state transition. In a hosted-model integration, the first model response contains a tool request and the second model response consumes a tool result. To keep this lab executable without credentials, the model below is scripted while the runtime path is real.

complete-tool-loop.py

```python
import json

ORDERS = {
    "A10234": {"status": "delayed", "carrier": "FastShip", "eta": "Friday"},
}

class ScriptedModel:
    def __init__(self) -> None:
        self.turns = 0

    def respond(self, messages: list[dict[str, object]]) -> dict[str, object]:
        self.turns += 1
        observations = [item for item in messages if item["role"] == "tool"]
        if not observations:
            return {
                "role": "assistant",
                "tool_call": {
                    "id": "status-1",
                    "name": "get_order_status",
                    "args": {"order_id": "A10234"},
                },
            }
        result = json.loads(str(observations[-1]["content"]))
        return {
            "role": "assistant",
            "content": (
                f"Order A10234 is {result['status']} with {result['carrier']} "
                f"and is now expected {result['eta']}."
            ),
        }

def execute_status_tool(call: dict[str, object]) -> dict[str, str]:
    if call.get("name") != "get_order_status":
        raise ValueError("tool not allowed")
    args = call.get("args")
    if not isinstance(args, dict) or set(args) != {"order_id"}:
        raise ValueError("expected only order_id")
    order_id = args["order_id"]
    if not isinstance(order_id, str) or order_id not in ORDERS:
        raise ValueError("unknown order")
    return ORDERS[order_id]

model = ScriptedModel()
messages: list[dict[str, object]] = [
    {"role": "user", "content": "Where is my order A10234?"}
]

first = model.respond(messages)
call = first["tool_call"]
messages.append(first)
observation = execute_status_tool(call)  # runtime executes, not model
messages.append(
    {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(observation)}
)
final = model.respond(messages)

print(f"requested_tool: {call['name']}")
print(f"tool_status: {observation['status']}")
print(f"answer: {final['content']}")
print(f"model_turns: {model.turns}")
```

Output

```text
requested_tool: get_order_status
tool_status: delayed
answer: Order A10234 is delayed with FastShip and is now expected Friday.
model_turns: 2
```

The transcript is the essential pattern: user message, assistant tool request, tool observation, assistant answer. Preserve a call ID so each observation stays linked to the request that produced it, especially when multiple reads run concurrently. Toolformer showed that models can learn where API calls help during generation, and ReAct made the reason/action/observation loop explicit for tool-using tasks.[1][2] The engineering burden remains in your runtime.
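The value of the call ID shows up once results arrive out of order. The sketch below assumes two hypothetical calls, `call-1` and `call-2`, and uses an artificial delay so the first request finishes last; the observations are then filed by `tool_call_id` rather than by arrival order.

```python
import asyncio
import json

STATUSES = {"A10234": "delayed", "B77120": "out_for_delivery"}  # fixture data


async def run_call(call: dict) -> dict:
    # Finish in a different order than requested to mimic real latency.
    await asyncio.sleep(0.02 if call["id"] == "call-1" else 0.0)
    status = STATUSES[call["args"]["order_id"]]
    return {
        "role": "tool",
        "tool_call_id": call["id"],
        "content": json.dumps({"status": status}),
    }


async def main() -> dict[str, str]:
    calls = [
        {"id": "call-1", "args": {"order_id": "A10234"}},
        {"id": "call-2", "args": {"order_id": "B77120"}},
    ]
    tasks = [asyncio.create_task(run_call(call)) for call in calls]
    by_id: dict[str, str] = {}
    for finished in asyncio.as_completed(tasks):
        message = await finished
        # File each observation under the request that produced it.
        by_id[message["tool_call_id"]] = message["content"]
    return by_id


results = asyncio.run(main())
print(results["call-1"])
print(results["call-2"])
```

Whatever order the reads complete in, `call-1` still maps to the status of A10234 and `call-2` to B77120.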

Structure isn't permission

Valid arguments can still request a harmful or unauthorized action. A status lookup is read-only; create_refund changes customer money. Writes need more checks:

  • caller owns the target order;
  • order satisfies refund policy;
  • amount is computed from trusted order data, not model text;
  • a human confirms when policy requires approval; and
  • an idempotency key prevents retrying the same write twice.
Figure: Refund request checked for schema, ownership and policy, confirmation, and idempotency before any write executes.
A valid function call can reach a write tool only after the runtime validates who may do what, once.
guard-a-refund-write.py

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Session:
    customer_id: str

ORDERS = {
    "A10234": {"customer_id": "C17", "eligible": True, "paid_cents": 4900},
}
REFUNDS: dict[str, int] = {}

def create_refund(session: Session, order_id: str, confirmed: bool) -> str:
    order = ORDERS.get(order_id)
    if order is None:
        return "blocked: unknown order"
    if session.customer_id != order["customer_id"]:
        return "blocked: order ownership failed"
    if not order["eligible"]:
        return "blocked: refund policy failed"
    if not confirmed:
        return "blocked: confirmation required"
    key = f"refund:{order_id}"
    if key in REFUNDS:
        return f"replayed: refund already exists for {REFUNDS[key]} cents"
    REFUNDS[key] = order["paid_cents"]
    return f"created: refund for {REFUNDS[key]} cents"

print(create_refund(Session("C17"), "Z99999", confirmed=True))
print(create_refund(Session("C99"), "A10234", confirmed=True))
print(create_refund(Session("C17"), "A10234", confirmed=False))
print(create_refund(Session("C17"), "A10234", confirmed=True))
print(create_refund(Session("C17"), "A10234", confirmed=True))
```

Output

```text
blocked: unknown order
blocked: order ownership failed
blocked: confirmation required
created: refund for 4900 cents
replayed: refund already exists for 4900 cents
```

This is why schema enforcement isn't a security boundary. It can narrow a JSON shape; only application logic knows ownership, policy, approval, and whether a write already happened.
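The rule that the amount comes from trusted order data can be sketched in a few lines. Here `amount_cents` is a hypothetical argument a model might volunteer; the runtime never uses it for the write and only records whether it disagreed with the system of record.

```python
ORDERS = {"A10234": {"paid_cents": 4900}}  # trusted system of record (fixture)


def refund_amount(order_id: str, model_args: dict[str, object]) -> tuple[int, bool]:
    """Return the amount to refund and whether the model's figure disagreed."""
    trusted = ORDERS[order_id]["paid_cents"]
    proposed = model_args.get("amount_cents")  # hypothetical field, never trusted
    mismatch = proposed is not None and proposed != trusted
    return trusted, mismatch


amount, mismatch = refund_amount(
    "A10234", {"order_id": "A10234", "amount_cents": 49000}
)
print(f"refund_cents: {amount}")
print(f"model_amount_mismatch: {mismatch}")
```

A mismatch is a useful signal to log or escalate, but it never changes what the runtime pays out.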

Return errors as observations, with limits

Tool calls fail in ordinary ways: the model uses an old field name, an order ID doesn't exist, or a service times out. A useful runtime returns a typed rejection rather than a stack trace. The next model turn can correct the request, but it should get only a small retry budget.

correct-a-rejected-call.py

```python
ORDERS = {"A10234": {"status": "delayed"}}

def execute(call: dict[str, object]) -> dict[str, object]:
    if call.get("name") != "get_order_status":
        return {"ok": False, "error": "tool not allowed"}
    args = call.get("args")
    if not isinstance(args, dict):
        return {"ok": False, "error": "args must be an object"}
    unknown = sorted(set(args) - {"order_id"})
    if unknown:
        return {"ok": False, "error": f"unknown fields: {unknown}"}
    if "order_id" not in args:
        return {"ok": False, "error": "required field: order_id"}
    order_id = args["order_id"]
    if not isinstance(order_id, str) or order_id not in ORDERS:
        return {"ok": False, "error": "unknown order"}
    return {"ok": True, "status": ORDERS[order_id]["status"]}

model_attempts = [
    {"name": "get_order_status", "args": {}},
    {"name": "get_order_status", "args": {"order_id": "A10234"}},
]
for turn, call in enumerate(model_attempts, start=1):
    observation = execute(call)
    print(f"turn_{turn}: {observation}")
    if observation["ok"]:
        break
```

Output

```text
turn_1: {'ok': False, 'error': 'required field: order_id'}
turn_2: {'ok': True, 'status': 'delayed'}
```

Correction is useful only while it changes the request. If a model repeats the same rejected call, stop before it burns external capacity or triggers repeated writes.

stop-a-repeated-tool-loop.py

```python
import json

attempts = [
    {"name": "get_order_status", "args": {"tracking_id": "A10234"}},
    {"name": "get_order_status", "args": {"tracking_id": "A10234"}},
    {"name": "get_order_status", "args": {"order_id": "A10234"}},
]
seen: set[str] = set()
max_turns = 3

def rejection_for(call: dict[str, object]) -> str | None:
    if call.get("name") != "get_order_status":
        return "tool not allowed"
    args = call.get("args")
    if not isinstance(args, dict):
        return "args must be an object"
    unknown = sorted(set(args) - {"order_id"})
    if unknown:
        return f"unknown fields: {unknown}"
    if "order_id" not in args:
        return "required field: order_id"
    return None

for turn, call in enumerate(attempts[:max_turns], start=1):
    error = rejection_for(call)
    if error is None:
        print(f"turn_{turn}: accepted")
        break
    fingerprint = json.dumps(call, sort_keys=True)
    if fingerprint in seen:
        print(f"turn_{turn}: stopped repeated rejected call")
        break
    seen.add(fingerprint)
    print(f"turn_{turn}: rejected: {error}")
```

Output

```text
turn_1: rejected: unknown fields: ['tracking_id']
turn_2: stopped repeated rejected call
```

Track a turn cap, repeated-call detection, timeout budget, and cost budget for each request. A model that can't recover should return a safe fallback or hand the ticket to a person.

Parallelize independent reads, not dependent writes

Alex asks, "Compare my two shipments, A10234 and B77120." Those two status reads are independent. Your runtime may execute them concurrently after validating both calls. A request to cancel one shipment based on those results cannot run at the same time: the write depends on what the reads discover.

parallel-read-only-lookups.py

```python
import asyncio

STATUSES = {
    "A10234": "delayed",
    "B77120": "out_for_delivery",
}

async def get_order_status(order_id: str) -> tuple[str, str]:
    await asyncio.sleep(0)
    return order_id, STATUSES[order_id]

async def main() -> None:
    order_ids = ["A10234", "B77120"]
    rows = await asyncio.gather(*(get_order_status(order_id) for order_id in order_ids))
    for order_id, status in sorted(rows):
        print(f"{order_id}: {status}")
    print("write_action: wait for validated read results")

asyncio.run(main())
```

Output

```text
A10234: delayed
B77120: out_for_delivery
write_action: wait for validated read results
```

Concurrency saves elapsed time only when actions are independent. For writes on the same order, serialize execution and apply idempotency rules.
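Serializing writes on one order can be sketched with a lock per order plus an idempotency key. Everything here is a hypothetical in-memory stand-in (`ORDER_LOCKS`, `APPLIED`, `cancel_shipment`); a real system would more likely lean on a database transaction or a distributed lock.

```python
import asyncio
from collections import defaultdict

# Hypothetical in-memory state for illustration only.
ORDER_LOCKS: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
APPLIED: dict[str, str] = {}
LOG: list[str] = []


async def cancel_shipment(order_id: str, idempotency_key: str) -> None:
    async with ORDER_LOCKS[order_id]:  # one write at a time per order
        if idempotency_key in APPLIED:
            LOG.append(f"{order_id}: replayed")
            return
        await asyncio.sleep(0)  # stand-in for the external write
        APPLIED[idempotency_key] = order_id
        LOG.append(f"{order_id}: cancelled")


async def main() -> None:
    # Two racing requests for the same write share one key.
    await asyncio.gather(
        cancel_shipment("A10234", "cancel:A10234"),
        cancel_shipment("A10234", "cancel:A10234"),
    )


asyncio.run(main())
print(LOG)
```

Without the lock, both coroutines could pass the `APPLIED` check before either records its write. With it, the second request waits and is treated as a replay.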

Expose a small allowed toolbox

As an agent grows, passing every internal action to the model wastes context and expands the space of possible mistakes. Filter for authorization first, then route among permitted tools relevant to the current request. Retrieval-augmented API selection appears in Gorilla, which evaluated models against large API collections.[3]

Figure: Tool routing flow where authorization filters a catalog and relevance ranking exposes only get_order_status for a shipment-status question.
The model should choose from a small permitted set, not discover restricted write tools by accident.
route-only-allowed-tools.py

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    description: str

catalog = [
    Tool("get_order_status", "shipment status carrier tracking delayed package"),
    Tool("create_refund", "issue refund payment return"),
    Tool("search_policy", "find returns policy eligibility rules"),
]
allowed = {"get_order_status", "search_policy"}
query = "Where is my delayed package shipment?"
query_terms = set(re.findall(r"[a-z_]+", query.lower()))

ranked = sorted(
    (
        (len(query_terms & set(tool.description.split())), tool.name)
        for tool in catalog
        if tool.name in allowed
    ),
    reverse=True,
)
visible_tools = [name for score, name in ranked if score > 0]

print(f"model_visible_tools: {visible_tools}")
print(f"refund_tool_exposed: {'create_refund' in visible_tools}")
```

Output

```text
model_visible_tools: ['get_order_status']
refund_tool_exposed: False
```

Routing is not authorization. The allowlist is applied first. A highly relevant but forbidden write tool must stay unavailable.
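Hiding a tool from the model is not enough on its own, because a model can still emit a name it was never shown. A minimal sketch, assuming a hypothetical per-session allowlist and handler table, repeats the authorization check at dispatch time:

```python
# Hypothetical per-session allowlist and handlers for illustration.
ALLOWED = {"get_order_status", "search_policy"}
HANDLERS = {
    "get_order_status": lambda args: {"ok": True, "status": "delayed"},
    "create_refund": lambda args: {"ok": True, "refund": "created"},
}


def dispatch(call: dict, visible_tools: list[str]) -> dict:
    name = call.get("name")
    # Authorization is checked again here, independent of routing.
    if not isinstance(name, str) or name not in ALLOWED:
        return {"ok": False, "error": "tool not permitted"}
    if name not in visible_tools or name not in HANDLERS:
        return {"ok": False, "error": "tool not offered for this request"}
    return HANDLERS[name](call.get("args", {}))


visible = ["get_order_status"]
print(dispatch({"name": "create_refund", "args": {"order_id": "A10234"}}, visible))
print(dispatch({"name": "get_order_status", "args": {"order_id": "A10234"}}, visible))
```

The refund request is refused even though a handler exists for it, since the session was never permitted to use that tool.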

Evaluate the trajectory, not just the final sentence

A pleasant answer can hide a wrong tool call, an unsafe write, or a result invented without an observation. Evaluate the events your runtime controls:

Check             Passing behavior
Tool choice       Requests get_order_status for a live shipment question
Arguments         Uses allowed keys and exact order ID
Execution safety  Makes no unauthorized write
Grounding         Final response reflects returned status
Efficiency        Stays within round, latency, and cost budgets

BFCL evaluates function-selection and executable calling behavior, while Tau-Bench examines longer, policy-constrained interactions with tools and users.[4][5] Your own release gate needs the ShopFlow schemas and policy failures your users will hit.

score-a-tool-trajectory.py

```python
def score(events: list[dict[str, object]]) -> tuple[bool, str]:
    calls = [event for event in events if event["type"] == "call"]
    observations = [event for event in events if event["type"] == "observation"]
    answers = [event for event in events if event["type"] == "answer"]
    if len(calls) != 1 or calls[0].get("name") != "get_order_status":
        return False, "wrong call sequence"
    if calls[0].get("args") != {"order_id": "A10234"}:
        return False, "wrong arguments"
    if not calls[0].get("id") or not observations:
        return False, "observation mismatched call"
    if observations[0].get("call_id") != calls[0]["id"]:
        return False, "observation mismatched call"
    if observations[0].get("status") != "delayed":
        return False, "missing grounded observation"
    if not answers or "delayed" not in str(answers[0].get("text", "")).lower():
        return False, "answer ignored observation"
    return True, "trajectory passed"

good = [
    {
        "type": "call",
        "id": "status-1",
        "name": "get_order_status",
        "args": {"order_id": "A10234"},
    },
    {"type": "observation", "call_id": "status-1", "status": "delayed"},
    {"type": "answer", "text": "Your order is delayed."},
]
bad = [
    {"type": "answer", "text": "Your order should arrive today."},
]
print(score(good))
print(score(bad))
```

Output

```text
(True, 'trajectory passed')
(False, 'wrong call sequence')
```

Once correctness is scored, add serving constraints. A runtime that succeeds after ten retries is not ready for a customer-facing workflow.

release-gate-tool-runtime.py

```python
runs = [
    {"passed": True, "unsafe_writes": 0, "rounds": 2, "latency_ms": 430, "cost_cents": 2.1},
    {"passed": True, "unsafe_writes": 0, "rounds": 2, "latency_ms": 510, "cost_cents": 2.4},
    {"passed": False, "unsafe_writes": 0, "rounds": 3, "latency_ms": 680, "cost_cents": 3.8},
    {"passed": True, "unsafe_writes": 0, "rounds": 2, "latency_ms": 470, "cost_cents": 2.2},
]

success_rate = sum(run["passed"] for run in runs) / len(runs)
unsafe_writes = sum(run["unsafe_writes"] for run in runs)
max_rounds = max(run["rounds"] for run in runs)
max_latency_ms = max(run["latency_ms"] for run in runs)
max_cost_cents = max(run["cost_cents"] for run in runs)
latency_budget_ms = 600
cost_budget_cents = 3.0
ready = (
    success_rate >= 0.75
    and unsafe_writes == 0
    and max_rounds <= 3
    and max_latency_ms <= latency_budget_ms
    and max_cost_cents <= cost_budget_cents
)

print(f"success_rate: {success_rate:.0%}")
print(f"unsafe_writes: {unsafe_writes}")
print(f"max_rounds: {max_rounds}")
print(f"max_latency_ms: {max_latency_ms}")
print(f"latency_budget_ms: {latency_budget_ms}")
print(f"max_cost_cents: {max_cost_cents:.1f}")
print(f"cost_budget_cents: {cost_budget_cents:.1f}")
print(f"release_candidate: {ready}")
```

Output

```text
success_rate: 75%
unsafe_writes: 0
max_rounds: 3
max_latency_ms: 680
latency_budget_ms: 600
max_cost_cents: 3.8
cost_budget_cents: 3.0
release_candidate: False
```

The failed release is deliberate: one run exceeds latency and cost budgets even though aggregate success reaches the threshold. These fixtures test controller behavior, not model quality. Replace them with held-out tickets, actual model calls, sandboxed tool results, and labeled policy outcomes before shipping.

From local functions to reusable tools

This lesson used in-process Python functions because the execution boundary is easiest to understand there. Real agent products need many capabilities owned by different teams and consumed by more than one host. Rebuilding a custom adapter for every host and service does not scale.

The next lesson introduces the Model Context Protocol (MCP). The mental model stays the same: the model proposes a typed action, and a trusted runtime decides whether to execute it. MCP standardizes how hosts discover and invoke externally provided capabilities.

What to remember

  • The model requests; the runtime executes. Never give a text generator implicit authority over real effects.
  • Schemas narrow shape, not policy. Validate ownership, eligibility, approvals, and idempotency before writes.
  • Observations close the loop. A grounded answer must follow a returned tool result.
  • Recovery needs budgets. Reject bad calls with structured errors, then cap retries, repeated actions, latency, and cost.
  • Evaluate traces. Tool choice, arguments, results, safety, and serving cost all belong in the release gate.

Mastery check

Key concepts

  • Tool call proposal versus application execution
  • JSON-like input schemas and server-side validation
  • Assistant request, tool observation, final-answer loop
  • Read versus write permissions
  • Confirmation and idempotency for side effects
  • Structured errors and bounded correction
  • Parallel safe reads and serialized writes
  • Allowed-tool routing
  • Trajectory-based evaluation
  • Function calling as prerequisite for MCP

Evaluation rubric

  • Foundational: Explains why a model cannot supply a live carrier scan without a tool observation.
  • Intermediate: Defines and validates a narrow get_order_status contract.
  • Intermediate: Implements a complete tool request, execution, observation, and answer loop.
  • Advanced: Protects a refund write with ownership, policy, confirmation, and idempotency checks.
  • Advanced: Builds a trajectory release gate with correctness, unsafe-write, round, and latency measures.

Common pitfalls

  • Treating a tool call as a completed action: The assistant claims a refund happened before execution. Make the runtime return an observation before wording the result as complete.
  • Trusting schema-valid writes: Correct JSON can still target the wrong order. Apply application authorization and policy checks.
  • Blind retries: A rejected call repeats until budget is gone. Fingerprint rejected requests and cap tool rounds.
  • Parallelizing effects: Two write calls race on the same order. Parallelize independent reads only.
  • Scoring only the final sentence: A correct-looking answer can be ungrounded. Evaluate call, arguments, observation, and outcome.

Practice extension

Extend complete-tool-loop.py with a second read-only tool, get_return_policy(order_id), and a protected write tool, create_return_label(order_id). Build five labeled ticket fixtures: two straightforward status questions, one unknown order, one eligible return, and one ineligible return. Your artifact is a trajectory report containing final outcome, blocked writes, retries, tool rounds, and maximum latency for every fixture.


You can now implement a safe local tool loop and evaluate its trajectories; next you will learn how hosts and external capability servers standardize that same boundary across integrations.

References

[1] Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023.
[2] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
[3] Patil, S. G., et al. (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint.
[4] Patil, S. G., et al. (2024). Berkeley Function-Calling Leaderboard. UC Berkeley Gorilla repository.
[5] Yao, S., et al. (2024). Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint.