Build an AI Agent with Tool Use from Scratch
Hand-roll a production-minded tool-using agent in plain Python with the OpenAI SDK — no framework — and learn exactly how the control loop, dispatch, and guardrails fit together.
What you'll build
You'll build a working AI agent in plain Python that can decide when to call your functions (weather lookup, safe arithmetic), execute them, feed the results back to the model, and iterate until it produces a final answer. No agent framework — just the OpenAI SDK and a hand-rolled control loop so you understand exactly what happens on each turn, plus guardrails (step limits, a tool allowlist, argument validation, error containment, and timeouts) you'd actually ship.
Prerequisites
- Python 3.10+ (we use modern type hints). Check with python3 --version.
- An OpenAI API key with billing enabled. Set it as an environment variable; never hardcode it.
- OpenAI Python SDK ≥ 1.40 (the v1 client API). Earlier 0.x syntax is incompatible.
- Familiarity with JSON Schema and HTTP — this is an advanced tutorial, so we won't explain what an environment variable is.
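OS notes: the Python code is cross-platform; only the shell setup differs. In Windows PowerShell, activate the environment with .venv\Scripts\Activate.ps1 and set the key with $env:OPENAI_API_KEY = "sk-..." (setx also works, but it only takes effect in shells opened afterwards). On macOS/Linux: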
python3 -m venv .venv
source .venv/bin/activate
pip install "openai>=1.40" "requests>=2.31"
export OPENAI_API_KEY="sk-..." # your key
Step 1 — Understand the loop you're about to build
Tool use (a.k.a. function calling) is not magic. The model never executes anything. The contract is:
- You send the conversation plus a list of tool schemas.
- The model replies either with a final text answer or with one or more tool_calls (a name + JSON arguments).
- You execute those calls in your own code.
- You append each result back to the conversation as a tool message keyed by tool_call_id.
- You call the model again. Repeat until it stops asking for tools.
The agent is just this loop with a termination condition. Everything else — guardrails, retries, logging — hangs off it.
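To make the contract concrete before any API key is involved, here is a dry run of the same loop against a scripted stand-in for the model. Everything in it is hypothetical: fake_model, the TOOLS table, and the simplified message dicts only mimic the shape of the real traffic, and the real loop in Step 4 differs in detail.

```python
import json

# Hypothetical one-tool registry for the dry run.
TOOLS = {"add": lambda a, b: {"sum": a + b}}


def fake_model(messages):
    """Scripted stand-in: request one tool call, then answer from its result."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {"id": "call_1", "name": "add", "arguments": '{"a": 2, "b": 3}'}
            ],
        }
    total = json.loads(tool_results[-1]["content"])["sum"]
    return {"role": "assistant", "content": f"The sum is {total}."}


def run(user_input, max_steps=4):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        reply = fake_model(messages)
        messages.append(reply)
        calls = reply.get("tool_calls")
        if not calls:  # termination condition: no more tool requests
            return reply["content"], messages
        for call in calls:  # your code executes the call, not the model
            result = TOOLS[call["name"]](**json.loads(call["arguments"]))
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    return "Stopped: step limit reached.", messages


answer, transcript = run("What is 2 + 3?")
print(answer)                              # The sum is 5.
print([m["role"] for m in transcript])     # ['user', 'assistant', 'tool', 'assistant']
```

The transcript shows the pattern every later step preserves: an assistant message that requests tools is always followed by one tool message per request before the model is called again.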
Step 2 — Define your tools (real functions + JSON Schema)
A tool is two things: a callable, and a schema that tells the model how to call it. Keep schemas tight; additionalProperties: false and enums reduce hallucinated arguments.
Two correctness details matter here. First, bool is a subclass of int in Python, so a naive isinstance(node.value, (int, float)) would happily evaluate True + 1; we exclude bool explicitly. Second, ast.parse validates the grammar of an expression but tells you nothing about its cost: 9**9**9 is a few characters of text that evaluates to an integer with roughly 370 million decimal digits, enough to stall your process or exhaust its memory. We guard exponentiation explicitly.
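Both details are easy to confirm in isolation. The snippet below is a small standard-library check, not part of the tutorial's files; the variable names are chosen only for illustration.

```python
import ast

# bool is a subclass of int, so an int check alone lets True through.
print(isinstance(True, int))                    # True

tree = ast.parse("True + 1", mode="eval")
left = tree.body.left
print(type(left).__name__, repr(left.value))    # Constant True

# ** is right-associative: 9**9**9 parses as 9 ** (9**9).
power = ast.parse("9**9**9", mode="eval").body
print(isinstance(power.op, ast.Pow))            # True
print(isinstance(power.right, ast.BinOp))       # True (the inner 9**9)
print(9 ** 9)                                   # 387420489, the outer exponent
```

Because the inner 9**9 is evaluated first, a bound on the exponent alone is enough to stop the outer power before it is computed.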
# tools.py
import ast
import operator

# --- Safe arithmetic (never use eval() on model output) ---
_BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}

# Bounds to prevent CPU/memory DoS via exponentiation, e.g. 9**9**9.
_MAX_EXP = 100
_MAX_BASE = 10 ** 6


def _safe_pow(base, exp):
    if abs(exp) > _MAX_EXP or abs(base) > _MAX_BASE:
        raise ValueError("Exponent or base too large")
    result = operator.pow(base, exp)
    # A negative base with a fractional exponent yields a complex number,
    # which json.dumps cannot serialize later in the loop.
    if isinstance(result, complex):
        raise ValueError("Result is not a real number")
    return result


def _eval_node(node):
    if isinstance(node, ast.Expression):
        return _eval_node(node.body)
    # Exclude bool: it subclasses int, and 'True + 1' should be rejected.
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)) \
            and not isinstance(node.value, bool):
        return node.value
    if isinstance(node, ast.BinOp):
        left, right = _eval_node(node.left), _eval_node(node.right)
        if isinstance(node.op, ast.Pow):
            return _safe_pow(left, right)
        if type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("Unsupported expression")


def calculate(expression: str) -> dict:
    """Evaluate a basic arithmetic expression safely."""
    result = _eval_node(ast.parse(expression, mode="eval"))
    return {"expression": expression, "result": result}


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Mock weather lookup. Replace with a real API call + timeout."""
    # The model can still send a value outside the schema enum; check it here
    # so an unknown unit is not silently treated as fahrenheit.
    if unit not in ("celsius", "fahrenheit"):
        return {"error": f"Unsupported unit '{unit}'"}
    fake_db = {"berlin": 18, "tokyo": 24, "cairo": 33}
    temp_c = fake_db.get(city.strip().lower())
    if temp_c is None:
        return {"error": f"No data for '{city}'"}
    temp = temp_c if unit == "celsius" else round(temp_c * 9 / 5 + 32, 1)
    return {"city": city, "temperature": temp, "unit": unit}
Note that calculate still raises on bad input (ValueError for unsupported expressions, SyntaxError from ast.parse on malformed text, ZeroDivisionError for 1/0). That's fine — those exceptions are deliberately contained by the dispatcher in Step 3, not by the tool itself. Keep tool functions focused on their job and let the dispatcher own the error boundary.
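To see why the error boundary has to be broad, it helps to look at which exception types plain Python raises for the kinds of input a model might send. The helper below, exception_name, is a hypothetical utility written for this illustration; it does not call the tutorial's calculate.

```python
import ast


def exception_name(thunk):
    """Run thunk and return the name of the exception it raises, or None."""
    try:
        thunk()
    except Exception as exc:
        return type(exc).__name__
    return None


print(exception_name(lambda: ast.parse("1 +", mode="eval")))    # SyntaxError
print(exception_name(lambda: 1 / 0))                            # ZeroDivisionError
print(exception_name(lambda: 1e6 ** 100))                       # OverflowError
print(exception_name(lambda: ast.parse("2 + 2", mode="eval")))  # None
```

Three unrelated exception classes from three short inputs is the practical argument for catching Exception at one boundary rather than enumerating types inside each tool.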
Now the schemas the model sees. The name must match a key in your dispatch table (Step 3).
# schemas.py
TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current temperature for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
                "additionalProperties": False,
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a basic arithmetic expression like '3 * (4 + 2)'.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string"},
                },
                "required": ["expression"],
                "additionalProperties": False,
            },
        },
    },
]
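Since the schema names and the dispatch table are maintained in two places, a startup check can catch drift between them early. The function below is one possible sketch; check_schemas_match_registry and the stand-in schemas and registry are assumptions for illustration, not part of the tutorial's files.

```python
def check_schemas_match_registry(schemas, registry):
    """Raise if the tools advertised to the model differ from the callable ones."""
    advertised = {s["function"]["name"] for s in schemas}
    callable_names = set(registry)
    missing = advertised - callable_names        # model can ask, you cannot run
    unadvertised = callable_names - advertised   # you can run, model never sees
    if missing or unadvertised:
        raise RuntimeError(
            f"schema/registry mismatch: missing={sorted(missing)}, "
            f"unadvertised={sorted(unadvertised)}"
        )


# Hypothetical stand-ins with the same shape as TOOL_SCHEMAS and a dispatch table.
schemas = [
    {"type": "function", "function": {"name": "get_weather"}},
    {"type": "function", "function": {"name": "calculate"}},
]
registry = {"get_weather": lambda **kw: {}, "calculate": lambda **kw: {}}

check_schemas_match_registry(schemas, registry)  # passes silently
```

Calling a check like this once at import time turns a confusing runtime "Unknown tool" error into an immediate failure during development.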
Step 3 — Build the dispatcher with an allowlist
Never dispatch by getattr or eval on a model-provided name. Use an explicit registry — that is your security boundary. Anything not in the map is refused.
The dispatcher is also your error containment boundary. It must catch anything a tool can throw, not just argument-shape errors. Catching only json.JSONDecodeError and TypeError would let a ValueError, SyntaxError, ZeroDivisionError, or OverflowError raised inside a tool propagate up and crash run_agent, which defeats the point of letting the model self-correct. So we wrap the actual call in a broad except Exception and convert whatever it catches into a ToolError.
# dispatch.py
import json

from tools import calculate, get_weather

TOOL_REGISTRY = {
    "get_weather": get_weather,
    "calculate": calculate,
}


class ToolError(Exception):
    pass


def execute_tool(name: str, raw_args: str) -> dict:
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        raise ToolError(f"Unknown tool: {name!r}")
    try:
        args = json.loads(raw_args or "{}")
    except json.JSONDecodeError as e:
        raise ToolError(f"Invalid JSON arguments: {e}") from e
    if not isinstance(args, dict):
        raise ToolError("Arguments must be a JSON object")
    try:
        return fn(**args)
    except TypeError as e:
        # Usually wrong/missing/extra keyword arguments for the signature
        # (a TypeError raised inside the tool lands here too).
        raise ToolError(f"Bad arguments for {name}: {e}") from e
    except Exception as e:
        # Anything the tool itself raises (ValueError, SyntaxError,
        # ZeroDivisionError, OverflowError, network errors, ...).
        # Contain it so the loop can recover instead of crashing.
        raise ToolError(f"{name} failed: {type(e).__name__}: {e}") from e
Because every failure path raises ToolError, the agent loop (Step 4) can convert it to a tool message and let the model read it, apologize, retry with corrected arguments, or report the limitation — instead of taking the whole process down.
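One refinement worth considering is checking the arguments against the function signature before the call, so that a signature mismatch can be told apart from a TypeError raised inside the tool. The sketch below uses inspect.signature(...).bind from the standard library; validate_kwargs and the stand-in get_weather are illustrative names, and bind only checks argument names and arity, not value types.

```python
import inspect


def validate_kwargs(fn, args: dict) -> None:
    """Raise ValueError if args cannot be bound to fn's signature."""
    try:
        inspect.signature(fn).bind(**args)
    except TypeError as exc:
        raise ValueError(f"Bad arguments for {fn.__name__}: {exc}") from exc


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Stand-in with the same signature as the tutorial's tool."""
    return {"city": city, "unit": unit}


validate_kwargs(get_weather, {"city": "Berlin"})   # binds cleanly, returns None

try:
    validate_kwargs(get_weather, {"town": "Berlin"})
except ValueError as exc:
    print(exc)   # names the offending or missing argument
```

With a pre-check like this in place, the dispatcher could reserve its catch-all branch for failures that happen while the tool is actually running.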
Step 4 — Write the agent loop with guardrails
This is the core. The critical correctness detail: the model can return several tool_calls in one assistant message (parallel tool calling). You must execute all of them and append one tool message per tool_call_id before the next API call — otherwise the API rejects the request for a missing tool response.
One implementation note for advanced readers: the SDK returns a ChatCompletionMessage pydantic object. Current SDK versions do accept that object back in the messages list and serialize it for you, but mixing a raw model object with hand-built dict messages relies on implicit serialization behavior. To be explicit and future-proof, we append msg.model_dump(exclude_none=True) — a plain dict — so the entire messages array is uniform and obvious.
# agent.py
import json

from openai import OpenAI

from schemas import TOOL_SCHEMAS
from dispatch import execute_tool, ToolError

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o-mini"
MAX_STEPS = 6  # guardrail: hard cap on tool-use turns

SYSTEM_PROMPT = (
    "You are a precise assistant. Use the provided tools when they help. "
    "Do not invent tool results. If a tool returns an error, explain it plainly."
)


def run_agent(user_input: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_input},
    ]
    for step in range(MAX_STEPS):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=TOOL_SCHEMAS,
            tool_choice="auto",
            temperature=0,
        )
        msg = response.choices[0].message
        # Append an explicit dict rather than the raw pydantic object.
        messages.append(msg.model_dump(exclude_none=True))

        # No tool calls => the model produced its final answer.
        if not msg.tool_calls:
            return msg.content or ""

        # Execute every tool call this turn, in order.
        for call in msg.tool_calls:
            name = call.function.name
            try:
                result = execute_tool(name, call.function.arguments)
                content = json.dumps(result)
            except ToolError as e:
                content = json.dumps({"error": str(e)})
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": content,
            })

    # Guardrail tripped: ran out of steps without a final answer.
    return "Stopped: reached the maximum number of reasoning steps."


if __name__ == "__main__":
    print(run_agent("What's the weather in Tokyo in fahrenheit, "
                    "and what is 19 * 4 + 7?"))
Because execute_tool raises ToolError for every tool failure, the single except ToolError here is sufficient as long as each tool returns JSON-serializable data: a malformed expression, division by zero, or an oversized exponent all arrive as a clean {"error": ...} tool message the model can react to.
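One edge sits outside ToolError: json.dumps raises TypeError when a tool returns a value it cannot encode (a set, a datetime, a complex number), and that exception would escape the loop. A possible hardening is a small serialization helper; to_tool_content is a hypothetical name, and passing allow_nan=False is an assumption made here so that NaN and infinity are also turned into error payloads instead of non-standard JSON.

```python
import json


def to_tool_content(result) -> str:
    """Serialize a tool result, falling back to an error payload on failure."""
    try:
        return json.dumps(result, allow_nan=False)
    except (TypeError, ValueError) as exc:
        return json.dumps({"error": f"Unserializable tool result: {exc}"})


print(to_tool_content({"result": 83}))             # {"result": 83}
print(to_tool_content({"result": (-8) ** 0.5}))    # error payload: complex number
print(to_tool_content({"result": float("inf")}))   # error payload: out-of-range float
```

Swapping a helper like this in for the bare json.dumps(result) call keeps the "every failure becomes a tool message" property even for tools that return awkward types.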
Why these guardrails matter:
| Guardrail | Failure it prevents |
|---|---|
| MAX_STEPS loop cap | Infinite tool-calling loops that burn tokens/money |
| TOOL_REGISTRY allowlist | Arbitrary function execution from a hallucinated name |
| Broad except in execute_tool | Tool exceptions crashing the whole agent |
| _safe_pow bounds | CPU/memory DoS from 9**9**9-style inputs |
| temperature=0 | Avoidable run-to-run variation in tool selection during testing (reduced, not eliminated) |
| Returning errors as tool content | Hard failures on tool errors; the model can self-correct instead |
Step 5 — Add a real network tool (with a timeout)
Replace the mock get_weather with a real call and a timeout: a hung dependency must never hang your agent. Use any geocoding/weather provider; the timeout is the part that matters. (The dispatcher's broad except contains exceptions the HTTP library raises, but it cannot rescue a call that never returns, and requests waits indefinitely unless you pass a timeout.)
import requests


def get_weather(city: str, unit: str = "celsius") -> dict:
    try:
        resp = requests.get(
            "https://example-weather-api.test/current",
            params={"city": city, "unit": unit},
            timeout=5,  # hard network timeout
        )
        resp.raise_for_status()
        return resp.json()
    except requests.Timeout:
        return {"error": "Weather service timed out"}
    except requests.RequestException as e:
        return {"error": f"Weather service failed: {e}"}
Keep returning structured {"error": ...} dicts where you can: the agent reads them and recovers gracefully, and they're cleaner than relying on the dispatcher's catch-all.
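A network timeout only covers tools that expose one. For tools that do not, a wall-clock limit can be approximated by running the call in a worker thread and giving up on waiting. This is a sketch under stated assumptions: call_with_timeout, _POOL, and slow_tool are hypothetical names, and the worker thread is not killed when the wait expires, so this bounds the agent's latency rather than the tool's resource use.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_POOL = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s: float, **kwargs) -> dict:
    """Run fn(**kwargs) in a worker thread; stop waiting after timeout_s seconds."""
    future = _POOL.submit(fn, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker keeps running in the background; we only stop waiting.
        return {"error": f"{fn.__name__} timed out after {timeout_s}s"}


def slow_tool(delay: float) -> dict:
    """Hypothetical tool that simulates a slow dependency."""
    time.sleep(delay)
    return {"ok": True}


print(call_with_timeout(slow_tool, 1.0, delay=0.01))   # {'ok': True}
print(call_with_timeout(slow_tool, 0.05, delay=0.5))   # {'error': 'slow_tool timed out after 0.05s'}
```

For tools that can truly hang or consume unbounded CPU, a subprocess that can be terminated is the stronger option, since threads cannot be forcibly stopped in Python.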
Verify it works
Run it:
python agent.py
Expected behavior (text will vary slightly):
In Tokyo it's currently 75.2°F, and 19 * 4 + 7 equals 83.
To prove the loop fired, log each call.function.name inside the tool-execution block:
print(f"[tool] {name}({call.function.arguments})")
You should see two tool lines, then the final response — confirming the model planned, called both tools, and synthesized the answer from real results rather than guessing.
To confirm error containment, try a hostile input and verify it does not crash:
print(run_agent("Compute 9**9**9 and also 1/0."))
You should get a graceful explanation from the model (driven by the {"error": ...} tool messages), not a traceback.
Troubleshooting
AuthenticationError / 401. Your key isn't in the environment for this shell. Re-run export OPENAI_API_KEY="sk-..." in the same terminal, or confirm with echo $OPENAI_API_KEY. Don't paste the key into code.
BadRequestError: ... 'tool' messages must respond to a preceding 'tool_calls'. You appended a tool message whose tool_call_id doesn't match, or you skipped a parallel call. Ensure you (1) append the full assistant message before the tool results, and (2) emit exactly one tool message per call.id in msg.tool_calls.
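When this error shows up, a quick way to find the gap is to scan the message history for tool calls that never received a reply. The checker below is a debugging sketch that assumes messages are plain dicts with call ids at tool_calls[i]["id"], as produced by model_dump; unanswered_tool_calls is a hypothetical helper name.

```python
def unanswered_tool_calls(messages: list[dict]) -> list[str]:
    """Return tool_call ids that have no matching 'tool' message after them."""
    pending: list[str] = []
    for msg in messages:
        if msg.get("role") == "assistant":
            for call in msg.get("tool_calls") or []:
                pending.append(call["id"])
        elif msg.get("role") == "tool":
            if msg.get("tool_call_id") in pending:
                pending.remove(msg["tool_call_id"])
    return pending


# Two parallel calls, but only the first one was answered.
history = [
    {"role": "user", "content": "Weather in Tokyo, and 19 * 4 + 7?"},
    {"role": "assistant", "tool_calls": [{"id": "call_a"}, {"id": "call_b"}]},
    {"role": "tool", "tool_call_id": "call_a", "content": "{}"},
]
print(unanswered_tool_calls(history))   # ['call_b']
```

Running a check like this just before each API call during development points directly at the call id that still needs a tool message.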
The model answers without calling tools when it should. Tighten the tool description fields and the system prompt, or force a tool with tool_choice={"type": "function", "function": {"name": "calculate"}}. Vague descriptions are the most common cause of missed tool use.
Agent loops until MAX_STEPS. Usually a tool keeps returning errors the model can't fix (bad schema, an always-failing dependency). Inspect the tool message contents; make error strings actionable ("city not found" beats "error 500"). Because ToolError messages include the exception type, that detail reaches the model directly.
Next steps
- Structured final outputs: add response_format={"type": "json_schema", ...} so the agent's last message is validated JSON, not prose.
- Streaming: swap to stream=True and accumulate tool_calls deltas for responsive UIs (note: assembling streamed tool-call fragments is fiddly; handle it carefully).
- The Responses API: OpenAI's newer client.responses.create endpoint offers built-in tools and state management; the loop concept is identical.
- Observability: wrap each turn with tracing (OpenTelemetry, Langfuse, or LangSmith) so you can audit which tools ran with which arguments in production.
- Concurrency & sandboxing: for I/O-heavy tools, execute parallel tool_calls with asyncio/a thread pool, and run any code-executing tool inside a sandboxed subprocess or container, never in-process.
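As a starting point for the concurrency item, the sketch below runs one turn's tool calls in a thread pool and returns the tool messages in the original call order. It assumes simplified call dicts with id, name, and arguments keys and an execute callable shaped like execute_tool; run_calls_in_parallel and fake_execute are hypothetical names.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def run_calls_in_parallel(calls, execute):
    """Run execute(name, raw_args) per call; return tool messages in call order."""
    def one(call):
        try:
            content = json.dumps(execute(call["name"], call["arguments"]))
        except Exception as exc:  # contain per-call failures
            content = json.dumps({"error": f"{type(exc).__name__}: {exc}"})
        return {"role": "tool", "tool_call_id": call["id"], "content": content}

    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(one, calls))  # map preserves input order


def fake_execute(name, raw_args):
    """Hypothetical dispatcher with a single tool."""
    args = json.loads(raw_args)
    if name == "double":
        return {"value": args["x"] * 2}
    raise ValueError(f"unknown tool {name}")


calls = [
    {"id": "call_1", "name": "double", "arguments": '{"x": 21}'},
    {"id": "call_2", "name": "nope", "arguments": "{}"},
]
for message in run_calls_in_parallel(calls, fake_execute):
    print(message["tool_call_id"], message["content"])
# call_1 {"value": 42}
# call_2 {"error": "ValueError: unknown tool nope"}
```

Each call still produces exactly one tool message keyed by its id, so the invariant from Step 4 holds whether the calls run sequentially or concurrently.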
You now have the real mechanics of an agent — frameworks like LangGraph just formalize this same loop with persistence and branching on top.
Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.