Chapter 23. Tool Use & Agents, Models That Act
Everything you have learned so far describes models that take input and produce output: text in, text out; image in, text out; audio in, audio out. But a model that can only generate text is fundamentally limited. It cannot check the weather, query a database, send an email, run code, or look up today’s stock price. It can only produce words. Tool use is the capability that breaks this barrier: it allows a language model to call external functions, receive the results, and incorporate those results into its response. Agents take this further, chaining multiple tool calls together in a loop to accomplish complex, multi-step tasks autonomously. This chapter explains exactly how both work, from the low-level API mechanics to the protocols and frameworks that are making 2026 the year of agentic AI.
Why Models Need Tools
A language model’s knowledge is frozen at training time. If you ask GPT-5.4 “What is the current price of Apple stock?”, it cannot answer accurately because its training data has a cutoff date. It does not have access to the internet, your company’s database, your calendar, or any live system. It can only generate text based on patterns learned during training.
This is not a minor limitation. Most useful tasks in the real world require access to live data or the ability to take actions:
- Looking up information: Current weather, stock prices, flight status, sports scores
- Querying databases: Customer records, inventory levels, sales figures
- Performing calculations: Complex math, statistical analysis, financial modeling
- Running code: Executing Python scripts, running tests, deploying applications
- Taking actions: Sending emails, creating calendar events, filing support tickets, making purchases
- Searching the web: Finding recent news, research papers, product reviews
Without tool use, a language model can only guess at answers to these questions based on its training data. With tool use, it can call a function that retrieves the actual answer, then incorporate that answer into a natural language response.
The concept is simple: instead of the model generating a text response directly, it generates a structured request to call a specific function with specific arguments. Your application executes that function, returns the result to the model, and the model then generates its final response using the real data.
Function Calling: The Foundation
Function calling (also called “tool calling”) is the mechanism that allows a language model to request the execution of external functions. OpenAI introduced this capability on June 13, 2023, with the release of the gpt-4-0613 and gpt-3.5-turbo-0613 models. It was the first widely available implementation of structured tool use in a commercial LLM API.
How Function Calling Works
The process has four steps:
1. You define the available tools. When making an API call, you include a list of function definitions. Each definition specifies the function’s name, a description of what it does, and a JSON Schema describing its parameters.
2. The model decides whether to call a tool. Based on the user’s message and the available tool definitions, the model either generates a normal text response or generates a structured tool call request. The model does not execute the function itself; it outputs a JSON object specifying which function to call and what arguments to pass.
3. Your application executes the function. You receive the model’s tool call request, execute the actual function in your code (querying an API, running a database query, performing a calculation), and collect the result.
4. You send the result back to the model. You add the tool result to the conversation and make another API call. The model now has the real data and generates its final response.
import json
def function_calling_demo():
    """
    Demonstrate the four-step function calling flow.
    This shows the exact message structure used in the OpenAI API.
    """
    # Step 1: Define available tools
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_stock_price",
                "description": "Get the current stock price for a given ticker symbol",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "ticker": {
                            "type": "string",
                            "description": "Stock ticker symbol (e.g., AAPL, GOOGL)"
                        }
                    },
                    "required": ["ticker"]
                }
            }
        }
    ]
    # Step 2: The model generates a tool call (not a text response)
    model_response = {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "get_stock_price",
                    "arguments": '{"ticker": "AAPL"}'
                }
            }
        ]
    }
    # Step 3: Your application executes the function
    # The arguments arrive as a JSON string, so parse them first
    args = json.loads(model_response["tool_calls"][0]["function"]["arguments"])
    # In production, this would call a real stock API with args["ticker"]
    result = {"ticker": args["ticker"], "price": 237.42, "currency": "USD"}
    # Step 4: Send the result back to the model
    tool_result_message = {
        "role": "tool",
        "tool_call_id": "call_abc123",
        "content": json.dumps(result)
    }
    print("Function Calling Flow")
    print("=" * 60)
    print("\n Step 1: Define tools")
    print(f" Tool: {tools[0]['function']['name']}")
    print(f" Params: {list(tools[0]['function']['parameters']['properties'].keys())}")
    print("\n Step 2: Model generates tool call")
    print(f" Function: {model_response['tool_calls'][0]['function']['name']}")
    print(f" Arguments: {model_response['tool_calls'][0]['function']['arguments']}")
    print("\n Step 3: Application executes function")
    print(f" Result: {result}")
    print("\n Step 4: Send result back, model generates final response")
    print(f" Tool message: {tool_result_message}")
    print(f" 'Apple ({result['ticker']}) is currently trading at ${result['price']} USD.'")
function_calling_demo()
The critical insight is that the model never executes the function. It only generates a structured request. Your application is always in control of what actually runs. This is a deliberate safety design: the model can suggest actions, but a human-controlled system decides whether to execute them.
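Because the application decides what runs, a tool call is best treated as untrusted input. The following is a minimal sketch of that idea, not a definitive implementation: it assumes a hypothetical TOOL_REGISTRY allowlist and a stub get_stock_price function, checks the requested name and argument keys before executing anything, and always answers with a tool message so the model can see what went wrong.

```python
import json

def get_stock_price(ticker: str) -> dict:
    """Stub tool (assumption): a real version would query a market-data API."""
    return {"ticker": ticker, "price": 237.42, "currency": "USD"}

# Hypothetical allowlist: tool name -> (callable, set of required argument names)
TOOL_REGISTRY = {
    "get_stock_price": (get_stock_price, {"ticker"}),
}

def execute_tool_call(tool_call: dict) -> dict:
    """Validate a model-generated tool call, run it if allowed, and build the tool message."""
    name = tool_call["function"]["name"]
    if name not in TOOL_REGISTRY:
        payload = {"error": f"unknown tool: {name}"}
    else:
        func, required = TOOL_REGISTRY[name]
        try:
            args = json.loads(tool_call["function"]["arguments"])
        except json.JSONDecodeError:
            args = None
        # Every parameter is required in this sketch, so the keys must match exactly
        if not isinstance(args, dict) or set(args) != required:
            payload = {"error": "arguments do not match the expected parameters"}
        else:
            payload = func(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(payload),
    }
```

Returning the error as a tool result, rather than raising, gives the model a chance to correct its request on the next turn.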
Source: OpenAI introduced function calling on June 13, 2023, with gpt-4-0613 and gpt-3.5-turbo-0613 models. “Developers can now describe functions to gpt-4-0613 and gpt-3.5-turbo-0613, and have the model intelligently choose to output a JSON object containing arguments to call those functions” (confirmed from openai.com/blog/function-calling-and-other-api-updates, business-standard.com, voicebot.ai).
The Evolution of Function Calling
Function calling has evolved significantly since its June 2023 debut:
June 13, 2023: Initial launch. OpenAI released function calling with the functions parameter in the Chat Completions API. Models could call one function at a time. The parameter was named functions and the response field was function_call.
November 6, 2023: Parallel function calling. At OpenAI DevDay, the functions parameter was replaced by tools (with functions deprecated). The new tools parameter supported parallel function calling: the model could request multiple function calls in a single response. For example, if a user asks “What is the weather in New York and London?”, the model can call get_weather("New York") and get_weather("London") simultaneously, rather than making two sequential round trips.
August 6, 2024: Structured Outputs. OpenAI introduced Structured Outputs, which guarantees that the model’s function call arguments exactly match the JSON Schema you provide. Before this, the model would usually produce valid JSON, but occasionally hallucinate extra fields or use wrong types. With strict: true in the tool definition, the output is guaranteed to conform to the schema with 100% reliability.
March 11, 2025: Responses API. OpenAI launched the Responses API, which combines the simplicity of Chat Completions with the tool-use capabilities of the Assistants API. The Responses API supports built-in tools (web search, file search, computer use) alongside custom function tools, all in a single unified interface. OpenAI announced that the Assistants API (launched at DevDay 2023) would be deprecated, with a sunset date of August 26, 2026.
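To make the parallel function calling entry in this timeline concrete, here is a hedged sketch of how an application might handle one assistant message that requests two calls. The get_weather stub, its canned data, and the TOOLS mapping are assumptions for illustration; the message shape follows the tool_calls structure shown earlier in this chapter.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def get_weather(city: str) -> dict:
    """Stub tool with canned data (assumption); a real version would call a weather API."""
    canned = {"New York": "18C, partly cloudy", "London": "12C, cloudy"}
    return {"city": city, "conditions": canned.get(city, "unknown")}

TOOLS = {"get_weather": get_weather}

# An assistant message that requests two tool calls in a single turn
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "get_weather", "arguments": '{"city": "New York"}'}},
        {"id": "call_2", "type": "function",
         "function": {"name": "get_weather", "arguments": '{"city": "London"}'}},
    ],
}

def run_one(tool_call: dict) -> dict:
    """Execute a single requested call and wrap the result as a tool message."""
    func = TOOLS[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(func(**args)),
    }

def run_tool_calls(message: dict) -> list:
    """Run all requested calls concurrently; pool.map keeps the request order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_one, message["tool_calls"]))

tool_messages = run_tool_calls(assistant_message)
```

Each result carries the tool_call_id of the request it answers, which is how the model matches results to calls when several come back at once.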
def function_calling_evolution():
    """
    Timeline of function calling capabilities in the OpenAI API.
    """
    print("Function Calling Evolution")
    print("=" * 70)
    milestones = [
        ("Jun 13, 2023", "Initial launch",
         "functions parameter, one call at a time"),
        ("Nov 6, 2023", "Parallel function calling",
         "tools parameter replaces functions, multiple calls per turn"),
        ("Aug 6, 2024", "Structured Outputs",
         "strict: true guarantees schema conformance"),
        ("Mar 11, 2025", "Responses API",
         "Unified API with built-in + custom tools"),
        ("Aug 26, 2026", "Assistants API sunset",
         "Responses API becomes the standard"),
    ]
    for date, event, detail in milestones:
        print(f" {date:<16} {event}")
        print(f" {'':16} {detail}")
        print()
function_calling_evolution()
Source: Parallel function calling introduced at OpenAI DevDay November 6, 2023, with the tools parameter replacing functions (confirmed from github.com/gaborcselle, community.openai.com). Structured Outputs launched August 6, 2024, guaranteeing 100% schema conformance (confirmed from openai.com/index/introducing-structured-outputs-in-the-api, cookbook.openai.com). Responses API launched March 11, 2025, combining Chat Completions simplicity with Assistants API tool-use capabilities (confirmed from openai.com/index/new-tools-for-building-agents, community.openai.com, datacamp.com). Assistants API sunset date August 26, 2026 (confirmed from community.openai.com/t/assistants-api-beta-deprecation-august-26-2026-sunset, syntackle.com, zoho.com).
How Other Providers Implement Tool Use
OpenAI was first, but every major provider now supports function calling with similar mechanics:
Anthropic (Claude). Claude’s tool use entered public beta in May 2024 (with the anthropic-beta: tools-2024-05-16 header, later graduating to general availability in June 2024). Claude uses a tool_use content block in its response, containing the tool name, a unique ID, and the input arguments. The developer defines tools with a name, description, and input_schema (JSON Schema). Claude also supports forcing tool use (tool_choice: {"type": "tool", "name": "..."}) and parallel tool calls.
Google (Gemini). Gemini supports function calling through its API with a similar structure: you define function declarations, the model returns function call objects, you execute them and return the results. Gemini also supports automatic function execution in some SDK configurations, where the SDK handles the tool call loop automatically.
Open-source models. Models like LLaMA 4, Qwen 3.5, and Mistral Small 4 support function calling through their chat templates. The model generates tool calls in a structured format (typically JSON within special tokens), and the serving framework (vLLM, TGI, or similar) parses and routes them.
The key point is that function calling is now a universal capability across all frontier models. The exact API format differs between providers, but the underlying pattern is identical: define tools, model requests a call, you execute it, you return the result.
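Since only the envelope differs between providers, a tool definition written for one API can usually be translated mechanically into another. The sketch below converts the OpenAI-style definition used earlier in this chapter into the name, description, and input_schema shape described for Claude. The helper name openai_tool_to_anthropic is hypothetical, and the sketch ignores provider-specific extras such as strict mode.

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Translate an OpenAI-style function tool into an Anthropic-style tool definition."""
    if tool.get("type") != "function":
        raise ValueError("only function tools can be converted in this sketch")
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Both formats describe parameters with JSON Schema, so it passes through unchanged
        "input_schema": fn["parameters"],
    }

openai_tool = {
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the current stock price for a given ticker symbol",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}

anthropic_tool = openai_tool_to_anthropic(openai_tool)
```

The JSON Schema for the parameters is reused as-is; only the surrounding keys change.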
def provider_tool_use_comparison():
    """
    Compare tool use API formats across major providers.
    """
    print("Tool Use API Comparison (March 2026)")
    print("=" * 70)
    providers = [
        ("OpenAI", "tools parameter", "tool_calls array",
         "tool message", "Jun 2023"),
        ("Anthropic", "tools parameter", "tool_use content block",
         "tool_result content block", "May 2024"),
        ("Google", "function_declarations", "function_call part",
         "function_response part", "Dec 2023"),
        ("Open-source", "chat template", "JSON in special tokens",
         "tool response token", "Varies"),
    ]
    print(f" {'Provider':<14} {'Define Tools':<22} {'Model Output':<24} "
          f"{'Return Result':<24} {'Since':<10}")
    print(" " + "-" * 92)
    for prov, define, output, result, since in providers:
        print(f" {prov:<14} {define:<22} {output:<24} "
              f"{result:<24} {since:<10}")
    print("\n All providers follow the same four-step pattern:")
    print(" Define tools -> Model requests call -> Execute -> Return result")
provider_tool_use_comparison()
Source: Anthropic Claude tool use entered public beta May 2024 with anthropic-beta: tools-2024-05-16 header, graduating to general availability in June 2024 (confirmed from docs.anthropic.com/claude/docs/tool-use, scademy.ai, enterpriseai.news). Gemini function calling available since Gemini 1.0 API launch December 13, 2023 (confirmed from ai.google.dev documentation, blockchain.news).
The Model Context Protocol (MCP): A Universal Standard
Function calling solves the problem of a model requesting tool execution. But it creates a new problem: every tool integration is custom. If you want your AI assistant to connect to Slack, GitHub, a database, and a calendar, you need to write four separate integrations, each with its own authentication, error handling, and data formatting. If you switch from one AI provider to another, you need to rewrite all four integrations.
This is the N x M integration problem: N AI applications times M tools equals N x M custom integrations. Ten AI applications connecting to twenty tools requires up to two hundred unique integrations.
The Model Context Protocol (MCP), announced by Anthropic in November 2024 as an open standard, solves this problem. MCP provides a universal interface for connecting AI models to external tools, data sources, and systems. Instead of building custom integrations for each combination of AI application and tool, you build one MCP server per tool and one MCP client per AI application. Any MCP client can connect to any MCP server.
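The arithmetic behind this claim is short enough to check directly. A small sketch (the function name integrations_needed is just illustrative) compares point-to-point integrations, which grow as N x M, with a shared protocol, which needs one client per application plus one server per tool:

```python
def integrations_needed(num_apps: int, num_tools: int) -> dict:
    """Count integrations with and without a shared protocol."""
    return {
        # Every application needs its own connector for every tool
        "point_to_point": num_apps * num_tools,
        # One client per application plus one server per tool
        "shared_protocol": num_apps + num_tools,
    }

counts = integrations_needed(num_apps=10, num_tools=20)
print(counts)  # {'point_to_point': 200, 'shared_protocol': 30}
```

For the ten-application, twenty-tool example, that is 200 custom integrations against 30 protocol implementations, and the gap widens as either number grows.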
The analogy that has stuck is “USB-C for AI.” Before USB-C, every device had its own proprietary connector. USB-C standardized the physical and electrical interface so any device can connect to any peripheral. MCP does the same for AI tool integration: it standardizes the protocol so any AI application can connect to any tool.
MCP Architecture
MCP uses a client-server architecture built on JSON-RPC 2.0 (the same message format used by the Language Server Protocol that powers code editors like VS Code). The architecture has three components:
Host: The AI application that the user interacts with (e.g., Claude Desktop, a chatbot, an IDE). The host creates and manages MCP clients.
Client: A connector within the host application that maintains a one-to-one connection with an MCP server. The client handles protocol negotiation, capability discovery, and message routing.
Server: A service that provides tools, data, or context to the AI model. Each server exposes a specific set of capabilities through a standardized interface.
def mcp_architecture():
    """
    Illustrate the MCP client-server architecture.
    """
    print("MCP Architecture")
    print("=" * 65)
    print()
    print("          User")
    print("           |")
    print("           v")
    print("  +------------------+")
    print("  |       Host       |  (Claude Desktop, IDE, chatbot)")
    print("  |                  |")
    print("  |  +------------+  |")
    print("  |  | MCP Client |--+---> MCP Server A (GitHub)")
    print("  |  +------------+  |")
    print("  |  +------------+  |")
    print("  |  | MCP Client |--+---> MCP Server B (Database)")
    print("  |  +------------+  |")
    print("  |  +------------+  |")
    print("  |  | MCP Client |--+---> MCP Server C (Calendar)")
    print("  |  +------------+  |")
    print("  +------------------+")
    print()
    print(" Each client maintains a 1:1 connection with one server.")
    print(" The host can have multiple clients for multiple servers.")
    print(" All communication uses JSON-RPC 2.0 messages.")
mcp_architecture()
MCP’s Three Primitives
MCP defines three core primitives that servers can expose:
Tools are functions that the AI model can call. They are model-initiated: the model decides when to call a tool based on the user’s request. Tools perform actions or computations and return results. Examples: searching a database, creating a file, sending a message, running a calculation.
Resources are data that the host application can attach to the model’s context. They are application-initiated: the host decides what resources to include, not the model. Resources provide read-only context. Examples: file contents, database records, API documentation, configuration files.
Prompts are reusable templates that users can select. They are user-initiated: the human picks a prompt template to structure their interaction. Examples: “Summarize this document,” “Review this pull request,” “Generate a test plan.”
This three-way control model is deliberate. Tools give the model agency (it decides when to act). Resources give the application control over context (it decides what data to provide). Prompts give the user control over workflow (they decide what task to perform). This layered approach keeps humans in the loop at every level.
def mcp_primitives():
    """
    Explain MCP's three core primitives with examples.
    """
    print("MCP's Three Primitives")
    print("=" * 65)
    primitives = [
        ("Tools", "Model-initiated",
         "Model decides when to call",
         ["search_database(query)", "create_file(path, content)",
          "send_email(to, subject, body)"]),
        ("Resources", "Application-initiated",
         "Host decides what context to attach",
         ["file://project/README.md", "db://users/profile/123",
          "api://docs/openai/chat-completions"]),
        ("Prompts", "User-initiated",
         "Human picks a workflow template",
         ["Summarize this document", "Review this PR",
          "Generate test cases for this function"]),
    ]
    for name, control, desc, examples in primitives:
        print(f"\n {name} ({control})")
        print(f" {desc}")
        for ex in examples:
            print(f" - {ex}")
mcp_primitives()
Source: MCP announced by Anthropic in November 2024 as an open standard (confirmed from wikipedia.org/wiki/Model_Context_Protocol, anthropic.com, modelcontextprotocol.io). Uses JSON-RPC 2.0 messages with Host/Client/Server architecture inspired by the Language Server Protocol (confirmed from modelcontextprotocol.io/specification/2025-03-26, wikipedia.org/wiki/Model_Context_Protocol). Three primitives: Tools (model-initiated), Resources (application-initiated), Prompts (user-initiated) (confirmed from mcpserverspot.com/learn/architecture/mcp-building-blocks, ggprompts.github.io/htmlstyleguides/techguides/mcp.html).
MCP Adoption: From Anthropic Project to Industry Standard
MCP’s adoption trajectory has been remarkable. What started as an Anthropic project in November 2024 became the de facto industry standard within a year:
- November 2024: Anthropic releases MCP as an open-source specification under the MIT license.
- March 2025: OpenAI announces support for MCP in its products and APIs, ending the fragmentation between OpenAI’s proprietary tool-calling format and MCP.
- March 26, 2025: The MCP specification receives its first major revision, introducing the Streamable HTTP transport (replacing the original HTTP+SSE transport), improved authorization flows, and richer metadata.
- April 2025: Google DeepMind and Microsoft adopt MCP, making it supported by all three major AI providers.
- Mid-2025: Over 5,000 active MCP servers listed in the Glama MCP Server Directory, with more than 115 production-grade vendor implementations.
- November 25, 2025: The MCP specification receives a second major update introducing task-based workflows for long-running operations, Client ID Metadata Documents for improved authorization, sampling with tools, and server identity.
- December 9, 2025: Anthropic donates MCP to the newly formed Agentic AI Foundation (AAIF) under the Linux Foundation. The AAIF has eight founding platinum members ($350,000 each): AWS, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, and OpenAI. Anthropic contributes MCP, OpenAI contributes AGENTS.md (a plain-text convention for giving coding agents project-specific instructions), and Block contributes goose (an open-source, local-first AI agent framework).
- March 2026: Over 3,000 unique servers registered in the official MCP registry (3,012 as of March 2026, with 84.6% having source code available), with over 10,000 MCP servers in the broader ecosystem. The protocol has accumulated 97 million monthly SDK downloads.
def mcp_adoption_timeline():
    """
    Track MCP's adoption from Anthropic project to industry standard.
    """
    print("MCP Adoption Timeline")
    print("=" * 70)
    events = [
        ("Nov 2024", "Anthropic releases MCP",
         "Open-source, MIT license"),
        ("Mar 2025", "OpenAI adopts MCP",
         "Ends proprietary tool-calling fragmentation"),
        ("Mar 26, 2025", "Spec update: Streamable HTTP",
         "Replaces HTTP+SSE transport"),
        ("Apr 2025", "Google DeepMind + Microsoft adopt",
         "All three major providers now support MCP"),
        ("Mid-2025", "5,000+ MCP servers",
         "115+ production vendor implementations"),
        ("Nov 25, 2025", "Spec update: tasks, auth, identity",
         "Long-running workflows, Client ID Metadata"),
        ("Dec 9, 2025", "Donated to Linux Foundation AAIF",
         "Co-founded by Anthropic, Block, OpenAI"),
        ("Mar 2026", "3,000+ registry, 10,000+ ecosystem",
         "97M monthly SDK downloads"),
    ]
    for date, event, detail in events:
        print(f" {date:<16} {event}")
        print(f" {'':16} {detail}")
        print()
mcp_adoption_timeline()
The speed of MCP adoption is unusual for a protocol standard. HTTP took years to become universal. REST APIs took a decade to become the dominant web architecture. MCP went from announcement to universal adoption by all major AI providers in under a year. This happened because the N x M integration problem was so painful that the industry was desperate for a standard, and because Anthropic released MCP as a genuinely open specification (MIT license, no vendor lock-in) rather than trying to control it.
Source: OpenAI adopted MCP by March 2025 (confirmed from alexcloudstar.com, pento.ai, cuttlesoft.com). Google DeepMind and Microsoft adopted MCP by mid-2025 (confirmed from alexcloudstar.com, pento.ai, getadblock.com). 5,000+ active MCP servers by mid-2025 with 115+ production vendor implementations (confirmed from onhealthcare.tech). March 26, 2025 spec update introduced Streamable HTTP transport replacing HTTP+SSE (confirmed from modelcontextprotocol.io/specification/2025-03-26/changelog, cloudflare.com, shuttle.dev). November 25, 2025 spec update introduced task-based workflows, Client ID Metadata Documents, and improved authorization (confirmed from modelcontextprotocol.io, workos.com, auth0.com, aaronparecki.com). AAIF formed December 9, 2025, with eight founding platinum members at $350,000 each: AWS, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, and OpenAI; founding project contributions from Anthropic (MCP), Block (goose), and OpenAI (AGENTS.md) (confirmed from linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation, simonwillison.net, dbta.com, agnost.ai). 3,012 unique servers in official registry as of March 2026, 84.6% with source code available (confirmed from nimblebrain.ai). 10,000+ servers in broader ecosystem (confirmed from segmentstream.com, lws.academy, dreamfactory.com). 97 million monthly SDK downloads (confirmed from ekamoira.com, lws.academy, spikeapi.com).
Building an MCP Server: What It Looks Like
To make MCP concrete, here is what a minimal MCP server looks like. This server exposes a single tool that retrieves the current weather for a given city. In Chapter 29, you will build a complete working MCP server; this example shows the structure.
# weather_server.py
# A minimal MCP server using the FastMCP Python library (the fastmcp package)
# Install: pip install fastmcp
from fastmcp import FastMCP
# Create the server
mcp = FastMCP("Weather Service")
@mcp.tool()
def get_weather(city: str) -> str:
    """Get the current weather for a city.
    Args:
        city: The name of the city (e.g., 'London', 'Tokyo')
    Returns:
        A string describing the current weather conditions.
    """
    # In production, this would call a real weather API
    weather_data = {
        "London": "Cloudy, 12C, 78% humidity",
        "Tokyo": "Sunny, 22C, 45% humidity",
        "New York": "Partly cloudy, 18C, 62% humidity",
    }
    return weather_data.get(city, f"Weather data not available for {city}")
if __name__ == "__main__":
    mcp.run()
That is the entire server. The @mcp.tool() decorator registers the function as an MCP tool. The function’s type hints and docstring are automatically converted into the JSON Schema that MCP clients use to discover the tool’s capabilities. When an AI model connected to this server decides it needs weather information, it calls get_weather with a city name, and the server returns the result.
On the client side, an AI application (like Claude Desktop) connects to this server and discovers the available tools. When a user asks “What is the weather in London?”, the model sees that a get_weather tool is available, generates a tool call with city: "London", the MCP client sends the request to the server, the server executes the function and returns “Cloudy, 12C, 78% humidity”, and the model incorporates this into its response.
The power of MCP is that this same server works with any MCP-compatible client: Claude Desktop, an IDE plugin, a custom chatbot, or any other application that implements the MCP client protocol. You write the server once, and it works everywhere.
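Under the hood, the exchange between client and server is a pair of JSON-RPC 2.0 messages. The sketch below shows roughly what the get_weather call looks like on the wire, using the tools/call method from the MCP specification. It is a toy illustration rather than an MCP implementation: the handler is hypothetical, and it skips the initialization handshake, capability negotiation, and transport that a real client and server perform.

```python
import json

def get_weather(city: str) -> str:
    """Same canned lookup as the example server, trimmed to one city."""
    weather_data = {"London": "Cloudy, 12C, 78% humidity"}
    return weather_data.get(city, f"Weather data not available for {city}")

TOOLS = {"get_weather": get_weather}

def build_tool_call_request(request_id: int, name: str, arguments: dict) -> str:
    """What a client sends when the model asks to call a tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })

def handle_request(raw: str) -> str:
    """Toy server-side handler (hypothetical): dispatch tools/call and wrap the result."""
    request = json.loads(raw)
    if request.get("method") != "tools/call":
        # -32601 is the standard JSON-RPC "Method not found" error code
        return json.dumps({
            "jsonrpc": "2.0",
            "id": request.get("id"),
            "error": {"code": -32601, "message": "Method not found"},
        })
    params = request["params"]
    text = TOOLS[params["name"]](**params["arguments"])
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request["id"],  # the response echoes the request id
        "result": {"content": [{"type": "text", "text": text}], "isError": False},
    })

request = build_tool_call_request(1, "get_weather", {"city": "London"})
response = json.loads(handle_request(request))
```

The id field ties each response to its request, which lets a client keep several calls in flight over one connection.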
Multi-Step Agents: The ReAct Loop
A single function-calling round trip lets a model request one tool call, or one batch of parallel calls. But most real-world tasks require multiple steps. “Book me a flight from New York to London next Tuesday” requires searching for flights, comparing options, selecting one, entering passenger details, and confirming the booking. Each step depends on the result of the previous step.
An agent is a system that uses a language model in a loop: the model reasons about the current situation, decides what action to take, executes that action (via a tool call), observes the result, and then reasons again about what to do next. This loop continues until the task is complete.
The most influential formalization of this pattern is ReAct (Reasoning + Acting), introduced by Yao et al. in October 2022 (arXiv:2210.03629). ReAct interleaves reasoning traces (“Thought”) with actions (“Action”) and observations (“Observation”) in a structured loop:
- Thought: The model reasons about the current state and what it should do next.
- Action: The model calls a tool or takes an action.
- Observation: The model receives the result of the action.
- Repeat: The model generates a new Thought based on the Observation, decides on the next Action, and continues until the task is complete.
def react_loop_demo():
    """
    Demonstrate the ReAct (Reasoning + Acting) agent loop.
    This shows the step-by-step process of a multi-step agent.
    """
    print("ReAct Agent Loop: 'What is the population of the capital of France?'")
    print("=" * 70)
    steps = [
        ("Thought 1",
         "I need to find the capital of France first, then look up\n"
         "                 its population."),
        ("Action 1",
         "search(query='capital of France')"),
        ("Observation 1",
         "The capital of France is Paris."),
        ("Thought 2",
         "Now I know the capital is Paris. I need to find the\n"
         "                 population of Paris."),
        ("Action 2",
         "search(query='population of Paris 2025')"),
        ("Observation 2",
         "The population of Paris is approximately 2.1 million\n"
         "                 (city proper) or 12.3 million (metro area)."),
        ("Thought 3",
         "I now have the answer. The capital of France is Paris,\n"
         "                 with a population of about 2.1 million (city proper)."),
        ("Action 3",
         "finish(answer='The capital of France is Paris, with a\n"
         "                 population of approximately 2.1 million.')"),
    ]
    for label, content in steps:
        print(f"\n {label:<16}{content}")
    print("\n\n The agent made 2 tool calls and 3 reasoning steps")
    print(" to answer a question that required multi-step reasoning.")
react_loop_demo()
The ReAct pattern is powerful because it combines the model’s reasoning ability (chain-of-thought, as discussed in Chapter 16) with the ability to gather real information from the environment. The model does not have to guess or hallucinate; it can look things up, verify its assumptions, and adjust its plan based on what it finds.
The Agent Loop in Code
Here is what the core agent loop looks like in practice. This is the fundamental pattern that all agent frameworks implement:
import json
def agent_loop(model, tools, user_message, max_steps=10):
    """
    The core agent loop. This is the pattern that powers all
    agentic AI systems, from simple chatbots to autonomous agents.
    Args:
        model: The language model to use for reasoning (any client object
            with a chat(messages, tools) method; a placeholder interface)
        tools: Dictionary of available tool functions
        user_message: The user's initial request
        max_steps: Maximum number of tool-call iterations
    """
    messages = [{"role": "user", "content": user_message}]
    for step in range(max_steps):
        # Ask the model what to do next
        response = model.chat(messages=messages, tools=tools)
        # If the model generated text (no tool call), we are done
        if not response.tool_calls:
            return response.content
        # The model wants to call one or more tools
        messages.append(response)  # Add the assistant's tool call
        for tool_call in response.tool_calls:
            # Look up the requested tool; report unknown names back to the model
            func = tools.get(tool_call.function.name)
            if func is None:
                result = f"Error: unknown tool '{tool_call.function.name}'"
            else:
                args = json.loads(tool_call.function.arguments)
                result = func(**args)
            # Add the tool result to the conversation
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result)
            })
    return "Maximum steps reached without completing the task."
This loop is deceptively simple, but it is the foundation of every agent system. The model acts as the “brain” that decides what to do, and the tools act as the “hands” that interact with the world. The conversation history (the messages list) serves as the agent’s “memory,” accumulating context from each step.
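To watch the loop run without an API key, one option is to drive it with a scripted stand-in for the model. Everything model-related below is hypothetical: ScriptedModel simply replays canned responses, make_tool_call builds objects with the same attributes the loop reads, and search is a stub with two hard-coded facts. The loop itself is a compact restatement of the pattern above that also returns the message history for inspection.

```python
import json
from types import SimpleNamespace

def make_tool_call(call_id, name, arguments):
    """Build an object with the attributes the loop reads (id, function.name, function.arguments)."""
    return SimpleNamespace(
        id=call_id,
        function=SimpleNamespace(name=name, arguments=json.dumps(arguments)),
    )

class ScriptedModel:
    """Hypothetical stand-in for an LLM client: replays fixed responses in order."""
    def __init__(self, responses):
        self._responses = iter(responses)

    def chat(self, messages, tools):
        return next(self._responses)

def search(query: str) -> str:
    """Stub search tool with two canned facts."""
    facts = {
        "capital of France": "The capital of France is Paris.",
        "population of Paris": "About 2.1 million (city proper).",
    }
    return facts.get(query, "No results.")

def run_agent(model, tools, user_message, max_steps=10):
    """Same loop as above, returning (answer, messages) so the history can be examined."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        response = model.chat(messages=messages, tools=tools)
        if not response.tool_calls:
            return response.content, messages
        messages.append(response)
        for call in response.tool_calls:
            args = json.loads(call.function.arguments)
            result = tools[call.function.name](**args)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    return "Maximum steps reached without completing the task.", messages

script = [
    SimpleNamespace(content=None, tool_calls=[
        make_tool_call("call_1", "search", {"query": "capital of France"})]),
    SimpleNamespace(content=None, tool_calls=[
        make_tool_call("call_2", "search", {"query": "population of Paris"})]),
    SimpleNamespace(content="Paris, the capital of France, has about 2.1 million residents.",
                    tool_calls=[]),
]

answer, history = run_agent(
    ScriptedModel(script), {"search": search},
    "What is the population of the capital of France?")
```

After the run, history holds five entries: the user message, two assistant tool-call turns, and the two tool results, which is the accumulated “memory” the final answer is based on.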
Source: ReAct (Reasoning + Acting) introduced by Yao et al., arXiv:2210.03629, October 2022. The framework interleaves reasoning traces with actions and observations in a structured loop (confirmed from emergentmind.com/topics/react-workflow, mbrenndoerfer.com, zenvanriel.nl/glossary/react-pattern).
Computer Use: Models That Operate Software
A specialized form of tool use is computer use: the ability for a model to interact with a graphical user interface (GUI) by looking at the screen, moving the mouse, clicking buttons, and typing text. Instead of calling structured API functions, the model operates software the same way a human would.
Anthropic: Computer Use Pioneer
Anthropic launched Computer Use in public beta on October 22, 2024, alongside the upgraded Claude 3.5 Sonnet. This made Claude the first frontier AI model to offer autonomous desktop control.
Computer Use works through a screenshot-action loop:
- You send Claude a screenshot of the current screen.
- Claude analyzes the screenshot and returns a structured action: click at coordinates (x, y), type a string, press a key, or scroll.
- Your code executes the action on the actual computer.
- You take a new screenshot and send it back to Claude.
- Repeat until the task is complete.
This is fundamentally different from API-based tool use. With function calling, the model interacts with structured data (JSON in, JSON out). With computer use, the model interacts with pixels. It must visually identify buttons, text fields, menus, and other UI elements, then decide where to click or what to type. This requires the vision capabilities discussed in Chapter 21.
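The control flow of the screenshot-action loop can be sketched without a real screen or model. In this simplified, hypothetical simulation, FakeScreen stands in for the desktop (its "screenshot" is a small dictionary rather than pixels) and scripted_policy stands in for the model's decision. Neither reflects any provider's actual API; the point is only the capture, decide, execute, repeat cycle.

```python
from dataclasses import dataclass, field

@dataclass
class FakeScreen:
    """Stand-in for a real desktop: tracks focus and typed text instead of pixels."""
    focused: bool = False
    text: str = ""
    log: list = field(default_factory=list)

    def screenshot(self) -> dict:
        # A real implementation would return image bytes for the model to inspect
        return {"search_box_focused": self.focused, "search_box_text": self.text}

    def apply(self, action: dict) -> None:
        """Execute one action, as the application would with mouse and keyboard events."""
        self.log.append(action["type"])
        if action["type"] == "click":
            self.focused = True
        elif action["type"] == "type" and self.focused:
            self.text += action["text"]

def scripted_policy(observation: dict, goal_text: str) -> dict:
    """Hypothetical stand-in for the model: choose the next action from the observation."""
    if observation["search_box_text"] == goal_text:
        return {"type": "done"}
    if not observation["search_box_focused"]:
        return {"type": "click", "x": 450, "y": 320}
    return {"type": "type", "text": goal_text}

def run_computer_use(screen: FakeScreen, goal_text: str, max_steps: int = 10):
    """Capture, decide, execute, repeat. Returns the number of actions taken, or None."""
    for step in range(max_steps):
        observation = screen.screenshot()
        action = scripted_policy(observation, goal_text)
        if action["type"] == "done":
            return step
        screen.apply(action)
    return None

screen = FakeScreen()
actions_taken = run_computer_use(screen, "flights to London")
```

As with the agent loop, a step cap matters here: a model that keeps misreading the screen would otherwise click indefinitely.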
OpenAI: Operator and CUA
OpenAI launched Operator on January 23, 2025, as a research preview for ChatGPT Pro subscribers. Operator is a web-based AI agent that can carry out online tasks: booking reservations, filling out forms, managing grocery orders, and navigating websites.
Operator is powered by the Computer-Using Agent (CUA) model, which combines GPT-4o’s vision capabilities with reinforcement learning. CUA is trained to interact with graphical user interfaces (buttons, menus, text fields) the way humans do. Unlike API-based integrations that require each website to provide a structured interface, CUA can interact with any website by interpreting its visual layout, even websites it has never encountered before.
On March 11, 2025, OpenAI released the CUA model as a built-in tool in the Responses API, making computer use available to developers building their own applications. The tool accepts a screenshot and returns structured actions (click, type, scroll, keypress), following the same screenshot-action loop as Anthropic’s implementation.
def computer_use_loop():
    """
    Illustrate the computer use screenshot-action loop.
    """
    print("Computer Use: Screenshot-Action Loop")
    print("=" * 60)
    steps = [
        ("1. Capture", "Take screenshot of current screen",
         "Image: 1920x1080 pixels"),
        ("2. Analyze", "Model examines the screenshot",
         "Identifies UI elements, text, layout"),
        ("3. Decide", "Model generates structured action",
         'click(x=450, y=320) or type("search query")'),
        ("4. Execute", "Application performs the action",
         "Mouse moves, clicks, or keyboard types"),
        ("5. Observe", "Take new screenshot",
         "See the result of the action"),
        ("6. Repeat", "Loop until task is complete",
         "Or until model signals completion"),
    ]
    for step, action, detail in steps:
        print(f" {step:<12} {action}")
        print(f" {'':12} {detail}")
        print()
    print(" This loop enables models to operate ANY software,")
    print(" not just software with APIs or MCP servers.")
computer_use_loop()
Why Computer Use Matters
Computer use is significant because it provides a universal interface. Most software does not have an API. Most internal enterprise tools, legacy systems, and desktop applications can only be operated through their GUI. Computer use allows AI agents to interact with these systems without requiring any integration work. The model just looks at the screen and operates the software like a human would.
The tradeoff is speed and reliability. Computer use is slower than API-based tool calling (each step requires taking a screenshot, sending it to the model, and waiting for a response) and more error-prone (the model may misidentify UI elements or click in the wrong place). As discussed in Chapter 21, GPT-5.4 scores 75% on OSWorld-Verified (a benchmark for autonomous computer use), which surpasses the 72.4% human baseline but still means roughly one in four tasks fails.
In practice, computer use is best suited for tasks where no API or MCP server exists, and where the cost of building a custom integration exceeds the cost of slower, less reliable GUI-based automation.
On March 18, 2026, OpenAI released GPT-5.4 mini and GPT-5.4 nano, extending computer use capabilities to smaller, cheaper models. GPT-5.4 mini scores 72.1% on OSWorld-Verified, nearly matching the flagship’s 75%, while GPT-5.4 nano drops to 39%. This means computer use is becoming accessible at lower price points, though the smallest models are not yet reliable enough for autonomous GUI tasks.
Source: Anthropic launched Computer Use in public beta on October 22, 2024, alongside upgraded Claude 3.5 Sonnet, making Claude the first frontier AI model to offer autonomous desktop control (confirmed from getathenic.com, digitalapplied.com, techcrunch.com, simonwillison.net). OpenAI launched Operator on January 23, 2025, as a research preview for Pro subscribers, powered by the Computer-Using Agent (CUA) model (confirmed from openai.com/index/computer-using-agent, technologyreview.com, gigazine.net, winbuzzer.com). CUA released as a built-in tool in the Responses API on March 11, 2025 (confirmed from openai.com/index/new-tools-for-building-agents). GPT-5.4 scores 75% on OSWorld-Verified, surpassing 72.4% human baseline; GPT-5.4 mini scores 72.1%; GPT-5.4 nano scores 39% (confirmed from computertech.co, aihaven.com, openai.com/index/introducing-gpt-5-4-mini-and-nano, blockchain.news, ghacks.net).
Agent Frameworks and SDKs
Building an agent from scratch (writing the loop, managing tool calls, handling errors, tracking state) is tedious and error-prone. Several frameworks and SDKs have emerged to handle the boilerplate, letting developers focus on defining the agent’s tools and behavior.
OpenAI Agents SDK
OpenAI released the Agents SDK on March 11, 2025, as an open-source Python framework for building multi-agent applications. The SDK is built around four core primitives:
- Agents: An LLM configured with instructions, tools, and optional handoff targets. Each agent is a self-contained unit with a specific role.
- Handoffs: A mechanism for one agent to delegate a task to another agent. This enables multi-agent workflows where specialized agents handle different parts of a task.
- Guardrails: Input and output validation that runs in parallel with agent execution. If a guardrail check fails, the agent loop terminates early, preventing unsafe or invalid actions.
- Tracing: Built-in observability for visualizing, debugging, and monitoring agent workflows. Traces capture every step of the agent loop, including tool calls, model responses, and handoffs.
The SDK also includes built-in support for MCP server tool calling, which works the same way as function tools. This means agents built with the SDK can connect to any MCP server without additional integration work.
# Example: A simple agent using the OpenAI Agents SDK
# Install: pip install openai-agents (requires OPENAI_API_KEY in the environment)
from agents import Agent, Runner, function_tool

# Define a tool as a Python function decorated with @function_tool
@function_tool
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    return f"Weather in {city}: Sunny, 22C"

# Create an agent with instructions and tools
weather_agent = Agent(
    name="Weather Assistant",
    instructions="You help users check the weather. "
                 "Use the get_weather tool when asked about weather.",
    tools=[get_weather],
)

# Run the agent (this handles the full agent loop)
result = Runner.run_sync(weather_agent, "What is the weather in Tokyo?")
print(result.final_output)
# Example output (exact wording varies): "The weather in Tokyo is sunny and 22C."
The SDK automates the agent loop: it sends the user’s message to the model, checks if the model wants to call a tool, executes the tool, sends the result back, and repeats until the model generates a final text response. Developers define the agent’s behavior through instructions and tools; the SDK handles the orchestration.
At DevDay on October 6, 2025, OpenAI announced AgentKit, a comprehensive platform that adds a visual Agent Builder (a drag-and-drop canvas for composing agent workflows), inline evaluation configuration, and deployment tools. AgentKit builds on top of the Agents SDK, providing a higher-level interface for teams that want to build agents without writing code for every component.
Source: OpenAI Agents SDK released March 11, 2025, open-source Python framework with Agents, Handoffs, Guardrails, and Tracing primitives (confirmed from openai.com/index/new-tools-for-building-agents, siddharthbharath.com, hal9.com, datatunnel.io). Built-in MCP server tool calling support (confirmed from openai.github.io/openai-agents-python). AgentKit announced at DevDay October 6, 2025, includes Agent Builder visual canvas (confirmed from openai.com/index/introducing-agentkit, devcontentops.io, aibreaking.org, digitalapplied.com).
Anthropic Claude Agent SDK
Anthropic released the Claude Agent SDK in September 2025 (initially launched as “Claude Code SDK” in February 2025, then rebranded to reflect its broader scope). The SDK powers Claude Code, Anthropic’s agentic coding tool, but is designed for building general-purpose agents: finance assistants, research tools, customer support bots, and productivity agents.
The Claude Agent SDK provides:
- A structured agent loop with automatic tool execution
- Support for subagents (agents that spawn other agents for subtasks)
- Checkpointing (saving and resuming agent state)
- Integration with Claude’s extended thinking capabilities (Chapter 16)
- Built-in MCP support for connecting to external tools
Source: Claude Agent SDK released September 29, 2025, alongside Claude Code 2.0, initially as “Claude Code SDK” in February 2025, rebranded to reflect broader applications (confirmed from aiwiki.ai, digitalapplied.com, c-sharpcorner.com, myaiexp.com, fastmcp.me, anthropic.com/engineering/building-agents-with-the-claude-agent-sdk).
Google Agent Development Kit (ADK)
Google released the Agent Development Kit (ADK) on April 9, 2025, at Google Cloud Next ‘25. ADK is an open-source Python framework for building, evaluating, and deploying AI agents and multi-agent systems. Key features include:
- Bidirectional audio and video streaming for real-time multimodal agent interactions
- A UI Playground for testing and debugging agents
- Built-in support for Google’s tools (search, code execution) and third-party ecosystems (LangChain, CrewAI)
- Model-agnostic design (optimized for Gemini but works with other models)
- Native MCP and A2A protocol support
Source: Google ADK released April 9, 2025, at Google Cloud Next ‘25, open-source Python framework for multi-agent applications (confirmed from developers.googleblog.com/en/agent-development-kit-easy-to-build-multi-agent-applications, aitech.fyi, geeky-gadgets.com).
Third-Party Frameworks
Beyond the provider SDKs, several open-source frameworks have become popular for building agents:
LangGraph (by LangChain): Treats agent workflows as stateful graphs. You define nodes (agents, tools, decision points) and edges (transitions between nodes). Best for complex, multi-step workflows that need explicit state management and conditional routing.
CrewAI: Organizes agents into role-based teams. Each agent has a specific role (researcher, writer, reviewer), and they collaborate to complete tasks. Best for workflows that map naturally to team-based collaboration.
AutoGen (by Microsoft): Frames agent interactions as multi-agent conversations. Agents talk to each other (and to humans in the loop) until a task is done. Best for conversational multi-agent patterns like group debates or consensus-building.
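To make the "stateful graph" idea concrete, here is a minimal sketch in plain Python. It does not use LangGraph's actual API; the `NODES`, `EDGES`, and `run_graph` names are assumptions made for illustration. Each node is a function that takes the shared state and returns an updated copy, and each edge is a function that inspects the state and names the next node, which is what conditional routing amounts to.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]


def research(state: State) -> State:
    """Node: append one (fake) finding to the shared state."""
    notes = list(state.get("notes", []))
    notes.append(f"finding {len(notes) + 1}")
    return {**state, "notes": notes}


def write(state: State) -> State:
    """Node: turn the collected notes into a report."""
    return {**state, "report": "; ".join(state["notes"])}


def route_after_research(state: State) -> str:
    """Conditional edge: keep researching until there are two findings."""
    return "write" if len(state["notes"]) >= 2 else "research"


NODES: Dict[str, Callable[[State], State]] = {"research": research, "write": write}
EDGES: Dict[str, Callable[[State], str]] = {
    "research": route_after_research,
    "write": lambda state: "END",
}


def run_graph(start: str, state: State, max_steps: int = 10) -> State:
    """Walk the graph from `start`, passing state along, until END or the step budget."""
    node = start
    for _ in range(max_steps):
        if node == "END":
            break
        state = NODES[node](state)   # run the node
        node = EDGES[node](state)    # pick the next node from the new state
    return state


final_state = run_graph("research", {"notes": []})
print(final_state["report"])
```

The explicit state dictionary is what distinguishes this pattern from a free-form conversation between agents: every routing decision can be traced back to a value in the state.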
def agent_framework_comparison():
    """
    Compare the major agent frameworks available in March 2026.
    """
    print("Agent Framework Comparison (March 2026)")
    print("=" * 75)
    frameworks = [
        ("OpenAI Agents SDK", "Mar 2025", "Agent loop + handoffs",
         "OpenAI models", "Production"),
        ("Claude Agent SDK", "Sep 2025", "Subagents + checkpoints",
         "Claude models", "Production"),
        ("Google ADK", "Apr 2025", "Multi-agent + streaming",
         "Model-agnostic", "Production"),
        ("LangGraph", "2024", "Stateful graphs",
         "Model-agnostic", "Production"),
        ("CrewAI", "2024", "Role-based teams",
         "Model-agnostic", "Production"),
        ("AutoGen (Microsoft)", "2023", "Conversational agents",
         "Model-agnostic", "Research"),
    ]
    print(f" {'Framework':<22} {'Released':<10} {'Pattern':<24} "
          f"{'Models':<16} {'Maturity':<12}")
    print(" " + "-" * 82)
    for name, released, pattern, models, maturity in frameworks:
        print(f" {name:<22} {released:<10} {pattern:<24} "
              f"{models:<16} {maturity:<12}")

agent_framework_comparison()
Source: LangGraph, CrewAI, and AutoGen are the three leading open-source agent frameworks as of 2026 (confirmed from similarlabs.com, calmops.com, pecollective.com, softmaxdata.com). LangGraph uses stateful graphs, CrewAI uses role-based teams, AutoGen uses conversational patterns (confirmed from iterathon.tech, markaicode.com, mayhemcode.com).
Agent-to-Agent Communication: The A2A Protocol
MCP standardizes how AI models connect to tools. But what about AI agents connecting to other AI agents? As multi-agent systems become more common, agents built by different vendors, using different frameworks, need a way to discover each other, communicate, and collaborate.
Google introduced the Agent2Agent (A2A) Protocol on April 9, 2025, at Google Cloud Next ‘25. A2A is an open protocol that enables AI agents to communicate with each other, securely exchange information, and coordinate actions, regardless of their underlying framework or vendor. On June 23, 2025, Google donated A2A to the Linux Foundation at the Open Source Summit North America, establishing neutral governance. The founding members of the Linux Foundation Agent2Agent project include AWS, Cisco, Google, Microsoft, Salesforce, SAP, and ServiceNow, with over 100 organizations supporting the protocol as of mid-2025.
While MCP handles the connection between an agent and its tools (vertical integration), A2A handles the connection between agents (horizontal integration). The two protocols are complementary:
- MCP: Agent connects to tools and data sources. “I need to query the database.”
- A2A: Agent connects to other agents. “I need the research agent to gather information, then the writing agent to draft the report.”
A2A defines a standard way for agents to:
- Discover each other’s capabilities through “Agent Cards” (JSON metadata describing what an agent can do)
- Negotiate task assignments and delegation
- Exchange messages, data, and status updates
- Coordinate multi-step workflows across agent boundaries
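As a rough illustration of capability discovery, the sketch below models Agent Cards as plain dictionaries and searches them by skill tag. The cards are simplified and the field names, URLs, and the `find_agents` helper are assumptions made for this example; consult the A2A specification for the actual Agent Card schema.

```python
import json
from typing import Dict, List

# Simplified, illustrative Agent Cards. Field names and URLs are assumptions,
# not the normative A2A schema.
AGENT_CARDS: List[Dict] = [
    {
        "name": "research-agent",
        "description": "Gathers and summarizes information from the web.",
        "url": "https://agents.example.com/research",
        "skills": [{"id": "web-research", "tags": ["search", "summarize"]}],
    },
    {
        "name": "writing-agent",
        "description": "Drafts reports from structured notes.",
        "url": "https://agents.example.com/writing",
        "skills": [{"id": "report-drafting", "tags": ["write", "edit"]}],
    },
]


def find_agents(cards: List[Dict], tag: str) -> List[str]:
    """Return the names of agents that advertise a skill with the given tag."""
    return [
        card["name"]
        for card in cards
        if any(tag in skill["tags"] for skill in card["skills"])
    ]


# A client agent would fetch cards as JSON; round-trip here to show they are plain data.
cards = json.loads(json.dumps(AGENT_CARDS))
print(find_agents(cards, "summarize"))
print(find_agents(cards, "write"))
```

Because a card is only metadata, an agent can decide whom to delegate to before exchanging any task messages.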
def mcp_vs_a2a():
    """
    Compare MCP (agent-to-tool) and A2A (agent-to-agent) protocols.
    """
    print("MCP vs A2A: Complementary Protocols")
    print("=" * 65)
    print("\n MCP (Model Context Protocol)")
    print(" Direction: Agent <-> Tools/Data")
    print(" Purpose: Connect AI to external functions and data sources")
    print(" Example: Agent calls a database query tool")
    print(" Released: November 2024 (Anthropic)")
    print("\n A2A (Agent2Agent Protocol)")
    print(" Direction: Agent <-> Agent")
    print(" Purpose: Enable agents to discover and collaborate")
    print(" Example: Research agent delegates to analysis agent")
    print(" Released: April 9, 2025 (Google)")
    print(" Governance: Linux Foundation (donated June 2025)")
    print("\n Together, MCP + A2A enable:")
    print(" - Agents that use tools (MCP)")
    print(" - Agents that collaborate with other agents (A2A)")
    print(" - Multi-agent systems with shared tool access")

mcp_vs_a2a()
Source: Google announced the Agent2Agent (A2A) Protocol on April 9, 2025, at Google Cloud Next ‘25, an open protocol for AI agents to communicate, exchange information, and coordinate actions across different frameworks and vendors (confirmed from hackernoon.com, blog.wadan.co.jp, justin3go.com, a2a-protocol.com, educative.io). Google donated A2A to the Linux Foundation on June 23, 2025, at Open Source Summit North America, with founding members AWS, Cisco, Google, Microsoft, Salesforce, SAP, and ServiceNow; over 100 organizations supporting the protocol (confirmed from developers.googleblog.com/en/google-cloud-donates-a2a-to-linux-foundation, itsfoss.com, a2aprotocol.ai, forbes.com, aijourn.com).
Real-World Agents: What Exists Today
The agent frameworks and protocols described above are not theoretical. Several production agent systems are deployed and used by millions of people as of March 2026.
OpenAI Codex: Autonomous Coding Agent
OpenAI launched Codex on May 16, 2025, as a cloud-based software engineering agent. Codex is powered by codex-1, a version of the o3 reasoning model (Chapter 16) specifically trained using reinforcement learning on real-world coding tasks.
Codex operates differently from coding assistants like GitHub Copilot. Instead of suggesting code completions as you type, Codex takes entire tasks: “Write a function that parses CSV files and generates summary statistics,” “Fix the failing test in test_auth.py,” or “Refactor the database module to use connection pooling.” Each task runs in its own isolated cloud sandbox environment, preloaded with your repository. Codex reads the codebase, writes code, runs tests, debugs failures, and proposes pull requests for review.
The key architectural feature is that Codex runs asynchronously. You assign a task and continue working on other things. Codex works in the background, and when it finishes, it presents the result (a code diff, a passing test suite, a pull request) for your review. You can assign multiple tasks in parallel, each running in its own sandbox.
Codex supports over a dozen programming languages and processes up to 192,000 tokens of context. It was initially available to ChatGPT Pro, Team, and Enterprise users (the Team plan was later renamed to Business on August 29, 2025).
Source: OpenAI Codex launched May 16, 2025, powered by codex-1 (a version of o3 trained with reinforcement learning on real-world coding tasks), runs in isolated cloud sandbox environments (confirmed from openai.com/index/introducing-codex, siliconangle.com, maginative.com, ainews.com). Supports over a dozen programming languages including Python, JavaScript, Go, Perl, PHP, Ruby, Swift, TypeScript, and Shell; 192,000 tokens context (confirmed from openai.com/blog/openai-codex, milvus.io, aboutchromebooks.com). Available to Pro, Team, and Enterprise users (confirmed from swipeinsight.app, medianama.com). Team plan renamed to Business on August 29, 2025 (confirmed from help.openai.com/fr-ca/articles/12111915-chatgpt-business-rename-faq, openai.com/index/introducing-chatgpt-team).
OpenAI Codex CLI: Open-Source Terminal Agent
A month before the cloud-based Codex, OpenAI released Codex CLI on April 16, 2025, as an open-source, terminal-based coding agent. While cloud Codex runs tasks asynchronously in sandboxed environments, Codex CLI operates directly in your local terminal, similar to Claude Code. It reads, modifies, and executes code on your machine: file access and command execution happen locally, and only the model API requests leave your environment.
Codex CLI was initially built in Node.js and TypeScript, then later rewritten in Rust for performance. It was published under the Apache 2.0 license. It supports three safety modes that control how much autonomy the agent has:
- Suggest mode: The agent can only read files and suggest changes. It cannot modify anything without your approval.
- Auto-edit mode: The agent can read and write files, but must ask permission before running commands.
- Full-auto mode: The agent can read, write, and execute commands autonomously within a sandboxed environment.
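The three modes amount to a small permission table: for each mode, which operations may run without asking. The sketch below encodes that table as described in the list above. It is an illustration of the policy, not Codex CLI's actual implementation; the `Mode`, `Op`, and `needs_approval` names are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    SUGGEST = "suggest"
    AUTO_EDIT = "auto-edit"
    FULL_AUTO = "full-auto"


class Op(Enum):
    READ = "read"
    WRITE = "write"
    EXECUTE = "execute"


# Operations each mode may perform without asking the user first.
AUTO_ALLOWED = {
    Mode.SUGGEST: {Op.READ},
    Mode.AUTO_EDIT: {Op.READ, Op.WRITE},
    Mode.FULL_AUTO: {Op.READ, Op.WRITE, Op.EXECUTE},
}


def needs_approval(mode: Mode, op: Op) -> bool:
    """True if the agent must pause and ask before performing `op` in `mode`."""
    return op not in AUTO_ALLOWED[mode]


for mode in Mode:
    gated = [op.value for op in Op if needs_approval(mode, op)]
    print(f"{mode.value:<10} asks before: {gated or 'nothing'}")
```

Keeping the policy in one table, separate from the agent loop, makes it easy to audit exactly what each autonomy level permits.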
The open-source nature of Codex CLI means developers can inspect the code, contribute improvements, and run it with any OpenAI API key. This contrasts with cloud Codex, which is a closed product available only through ChatGPT subscriptions.
Source: OpenAI Codex CLI released April 16, 2025, open-source under Apache 2.0 license, initially Node.js/TypeScript, later rewritten in Rust (confirmed from handwiki.org/wiki/OpenAI_Codex, archlinux.org/packages/openai-codex “Licenses: Apache-2.0”, openaicli.com/docs, cometapi.com, gigazine.net). Three safety modes: suggest, auto-edit, full-auto (confirmed from informatecdigital.com, openreplay.com).
Claude Code: Agentic Terminal Coding
Anthropic’s Claude Code takes a different approach. It was first released as a beta research preview on February 24, 2025, alongside Claude 3.7 Sonnet, and became generally available on May 22, 2025. Instead of running in the cloud, Claude Code operates directly in your terminal as a command-line tool. It reads your entire codebase, makes multi-file edits, runs tests, creates commits, and submits pull requests, all without leaving your development environment.
Claude Code is powered by Claude’s latest models (currently Claude Opus 4.6) and uses the agent loop pattern described earlier in this chapter. It reasons about the task, decides what files to read, makes edits, runs tests to verify the changes, and iterates until the tests pass. The September 2025 update (Claude Code 2.0) added checkpointing (saving and resuming agent state) and subagents (spawning specialized sub-agents for subtasks).
Source: Claude Code released as beta research preview February 24, 2025, generally available May 22, 2025 (confirmed from aiwiki.ai, druce.ai, spectrumailab.com). Powered by Claude Opus 4.6 (confirmed from mindtwo.com). Claude Code 2.0 with checkpoints and subagents released September 2025 (confirmed from digitalapplied.com).
OpenAI Operator: Web Browsing Agent
As described in the computer use section, OpenAI’s Operator (launched January 23, 2025) is a web-based agent that can navigate websites and perform tasks: booking reservations, filling forms, managing grocery orders, and more. Operator uses the CUA model to interpret screenshots and generate mouse/keyboard actions, enabling it to interact with any website without requiring an API.
Devin: Autonomous Software Engineer
Devin, built by Cognition Labs, was announced in early 2024 as the “first fully autonomous AI software engineer.” Devin takes entire development tasks from description to pull request: it plans the approach, writes code, runs tests, debugs failures, and submits the result. The initial version was priced at $500/month for early access. On April 3, 2025, Cognition released Devin 2.0 in beta. The release cut the price to $20/month for the Core plan (with pay-as-you-go pricing at $2.25 per “Agent Compute Unit” after that) and introduced an agent-native IDE, interactive planning, code search, and a built-in wiki.
Devin’s approach differs from Codex and Claude Code in that it operates as a fully autonomous agent with its own development environment, rather than integrating into the developer’s existing workflow. You assign a task via a chat interface, and Devin works independently in its own sandboxed environment, presenting the completed work for review.
Source: Devin announced early 2024 by Cognition Labs as “first fully autonomous AI software engineer,” initially $500/month (confirmed from trickle.so, eesel.ai). Devin 2.0 released in beta April 3, 2025, price reduced to $20/month Core plan with pay-as-you-go at $2.25 per Agent Compute Unit, added agent-native IDE, interactive planning, code search, built-in wiki (confirmed from siliconangle.com, communeify.com, echoapi.com, newyorkdawn.com).
Multi-Agent Systems: Teams of Agents
A single agent with tools can accomplish a lot, but complex tasks often benefit from multiple specialized agents working together. A multi-agent system divides a complex task among several agents, each with its own role, tools, and expertise.
Why Multiple Agents?
The motivation is the same as why companies have teams instead of one person doing everything. A single agent trying to handle research, analysis, writing, and code review will have a very long system prompt, many tools, and a context window that fills up quickly. Splitting the work across specialized agents has several advantages:
- Focused expertise: Each agent has a narrow role with specific instructions and tools. A “researcher” agent has web search tools. A “coder” agent has code execution tools. A “reviewer” agent has code analysis tools.
- Smaller context: Each agent only needs context relevant to its role, rather than the entire task history.
- Parallel execution: Independent subtasks can be assigned to different agents running simultaneously.
- Easier debugging: When something goes wrong, you can identify which agent failed and why.
How Multi-Agent Systems Work
The typical architecture has a coordinator (or “orchestrator”) agent that receives the user’s request, breaks it into subtasks, delegates each subtask to a specialized agent, collects the results, and synthesizes a final response.
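The coordinator pattern can be sketched with ordinary functions standing in for agents. Everything here is a stub under stated assumptions: in a real system each specialist would be a model call with its own tools, and the coordinator's plan would itself come from a model rather than a hard-coded list. The function and variable names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, str]


# Stub specialists: each takes a subtask plus the results gathered so far.
def research_agent(subtask: str, context: Context) -> str:
    return f"findings for: {subtask}"


def analyst_agent(subtask: str, context: Context) -> str:
    return f"analysis of [{context['research']}]"


def writer_agent(subtask: str, context: Context) -> str:
    return f"report based on [{context['analysis']}]"


SPECIALISTS: Dict[str, Callable[[str, Context], str]] = {
    "research": research_agent,
    "analysis": analyst_agent,
    "report": writer_agent,
}


def coordinator(request: str) -> str:
    """Break the request into subtasks, delegate each one, and return the final result."""
    plan: List[Tuple[str, str]] = [
        ("research", f"gather data on {request}"),
        ("analysis", "analyze the data"),
        ("report", "write the report"),
    ]
    context: Context = {}
    for role, subtask in plan:
        # Each specialist sees only the accumulated results, not the full history.
        context[role] = SPECIALISTS[role](subtask, context)
    return context["report"]


final_report = coordinator("cloud GPU providers")
print(final_report)
```

Note that the specialists never call each other directly; all results flow back through the coordinator, which is what makes this architecture predictable and easy to trace.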
def multi_agent_architecture():
    """
    Illustrate a multi-agent system with a coordinator and specialists.
    """
    print("Multi-Agent Architecture: Research Report Generator")
    print("=" * 65)
    print("""
    User: "Write a competitive analysis of cloud GPU providers"
                   |
                   v
        +---------------------+
        |  Coordinator Agent  |  Breaks task into subtasks,
        |   (orchestrator)    |  delegates, synthesizes
        +---------------------+
         |         |         |
         v         v         v
    +--------+ +-------+ +-------+
    |Research| |Analyst| |Writer |
    | Agent  | | Agent | | Agent |
    +--------+ +-------+ +-------+
    Tools:     Tools:    Tools:
    - web      - calc    - format
      search   - chart   - edit
    - scrape     gen     - cite
    """)
    print(" Flow:")
    print(" 1. Coordinator receives user request")
    print(" 2. Coordinator delegates 'gather data' to Research Agent")
    print(" 3. Research Agent searches web, returns findings")
    print(" 4. Coordinator delegates 'analyze data' to Analyst Agent")
    print(" 5. Analyst Agent processes data, generates charts")
    print(" 6. Coordinator delegates 'write report' to Writer Agent")
    print(" 7. Writer Agent drafts report using research + analysis")
    print(" 8. Coordinator reviews and returns final report to user")

multi_agent_architecture()
Handoffs: The OpenAI Approach
The OpenAI Agents SDK implements multi-agent coordination through handoffs. A handoff is a mechanism where one agent transfers control to another agent. The first agent decides (based on its instructions and the current context) that a different agent is better suited for the next part of the task, and hands off the conversation.
# Multi-agent system with handoffs using OpenAI Agents SDK
from agents import Agent, Runner, function_tool

# Placeholder tools so the example is self-contained; replace with real ones
@function_tool
def web_search(query: str) -> str:
    """Search the web for a query."""
    return f"Search results for: {query}"

@function_tool
def format_document(text: str) -> str:
    """Format a draft as a report."""
    return text.strip()

# Define the Writer first so the Research Agent can reference it in handoffs
writer_agent = Agent(
    name="Writer Agent",
    instructions="You write clear, well-structured reports "
                 "based on research provided to you.",
    tools=[format_document],
)
research_agent = Agent(
    name="Research Agent",
    instructions="You research topics using web search. "
                 "When research is complete, hand off to the Writer.",
    tools=[web_search],
    handoffs=[writer_agent],  # handoffs takes Agent objects, not name strings
)

# Start the run with the Research Agent, which hands off to the Writer when done
result = Runner.run_sync(
    research_agent,
    "Research the latest developments in quantum computing"
)
print(result.final_output)
The handoff pattern is simpler than a full coordinator architecture. Instead of a central orchestrator managing all agents, each agent knows which other agents it can hand off to, and makes the decision locally. This is more flexible (agents can be added or removed without changing a central coordinator) but less predictable (the handoff chain depends on each agent’s decisions).
The Agentic Era: What It Means
The convergence of function calling, MCP, computer use, and multi-agent frameworks is creating what the industry calls the “agentic era.” This is not just a marketing term. It represents a fundamental shift in how AI systems are used: from answering questions to executing tasks.
The Scale of Adoption
Gartner predicts that 40% of enterprise applications will integrate task-specific AI agents by the end of 2026, up from less than 5% in 2025. If the prediction holds, that is at least an 8x increase in a single year, which would make it one of the fastest adoption curves in enterprise technology history.
The agentic AI market reached approximately $7-8 billion in 2025 and is projected to grow to $52-53 billion by 2030, with a compound annual growth rate of around 46%. However, the reality is more nuanced: while 65% of enterprises report running agentic AI pilots, only about 11% have moved those pilots into production. Gartner also warns that over 40% of agentic AI projects will be canceled by 2027 due to escalating costs, unclear business value, or inadequate risk controls.
What “Agentic” Actually Means
The term “agentic” is used loosely in the industry, but it has a specific technical meaning in the context of this chapter. An AI system is agentic to the degree that it:
- Makes decisions autonomously: The model decides what tools to call, in what order, without human intervention at each step.
- Operates in a loop: The system iterates (reason, act, observe, repeat) rather than producing a single response.
- Interacts with the environment: The system reads from and writes to external systems (databases, APIs, files, GUIs).
- Handles errors and adapts: When a tool call fails or returns unexpected results, the system adjusts its plan rather than crashing.
A chatbot that answers questions is not agentic. A system that takes a user request, searches the web, reads several pages, synthesizes the information, writes a report, and emails it to the user is agentic. The difference is autonomy and multi-step execution.
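The four properties can be seen together in a small loop. In the sketch below, `choose_next_step` is a hard-coded stand-in for the model's reasoning, and both tools are stubs (one always fails, to exercise the error path); all names are hypothetical. What matters is the shape: the loop decides, acts, observes, and feeds failures back as observations instead of crashing.

```python
from typing import Callable, Dict, List, Tuple


def flaky_search(query: str) -> str:
    """Stub tool that always fails, to exercise error handling."""
    raise TimeoutError("search backend timed out")


def cached_search(query: str) -> str:
    """Stub fallback tool."""
    return f"cached results for {query!r}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "search": flaky_search,
    "cached_search": cached_search,
}


def choose_next_step(goal: str, observations: List[str]) -> Tuple[str, str]:
    """Stand-in for the model: pick the next tool from what has been observed so far."""
    if not observations:
        return ("search", goal)
    last = observations[-1]
    if last.startswith("ERROR"):
        return ("cached_search", goal)   # adapt the plan after a failure
    return ("finish", last)


def run_agent(goal: str, max_steps: int = 5) -> str:
    observations: List[str] = []
    for _ in range(max_steps):
        tool, arg = choose_next_step(goal, observations)     # reason
        if tool == "finish":
            return arg
        try:
            observations.append(TOOLS[tool](arg))            # act, then observe
        except Exception as exc:
            # Record the failure as an observation so the next decision can react to it.
            observations.append(f"ERROR from {tool}: {exc}")
    return "gave up: step budget exhausted"


answer = run_agent("GPU prices")
print(answer)
```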
def agentic_spectrum():
    """
    Show the spectrum from simple chatbot to fully autonomous agent.
    """
    print("The Agentic Spectrum")
    print("=" * 70)
    levels = [
        ("Level 0: Chatbot",
         "Single turn, no tools",
         "Q&A, text generation",
         "ChatGPT basic chat"),
        ("Level 1: Tool-augmented",
         "Single tool call per turn",
         "Weather lookup, calculation",
         "Function calling"),
        ("Level 2: Multi-step agent",
         "Multiple tool calls in a loop",
         "Research, data analysis",
         "ReAct pattern"),
        ("Level 3: Multi-agent system",
         "Multiple agents collaborating",
         "Complex workflows, reports",
         "CrewAI, Agents SDK"),
        ("Level 4: Autonomous agent",
         "Long-running, self-directed",
         "Software engineering, operations",
         "Codex, Claude Code, Devin"),
    ]
    for level, desc, use_case, example in levels:
        print(f"\n {level}")
        print(f" {desc}")
        print(f" Use case: {use_case}")
        print(f" Example: {example}")

agentic_spectrum()
Source: Gartner predicts 40% of enterprise applications will integrate task-specific AI agents by end of 2026, up from less than 5% in 2025 (confirmed from gartner.com/en/newsroom/press-releases/2025-08-26, businessworld.in, flowtivity.ai). Agentic AI market reached approximately $7.84 billion in 2025, projected to $52.62 billion by 2030 at 46.3% CAGR (confirmed from marketsandmarkets.com AI Agents Market report). 65% of enterprises running agentic AI pilots, only 11% in production (confirmed from detroitcomputing.com, anyreach.ai). Over 40% of agentic AI projects predicted to be canceled by 2027 due to escalating costs, unclear business value, or inadequate risk controls, announced June 25, 2025 (confirmed from gartner.com/en/newsroom/press-releases/2025-06-25, ppc.land, forbes.com).
Measuring Agent Performance: Benchmarks
As agents move from demos to production, the question becomes: how do you measure whether an agent actually works? Several benchmarks have emerged to evaluate agent capabilities on real-world tasks.
SWE-bench: Can the Agent Fix Real Bugs?
SWE-bench (Software Engineering Benchmark) is the most influential benchmark for coding agents. It evaluates whether an AI agent can resolve real GitHub issues from open-source Python repositories. Each task gives the agent an issue description and a full repository; the agent must produce a code patch that passes the project’s test suite.
SWE-bench Verified is a curated subset of 500 tasks validated by human annotators. As of early 2026, the top scores on SWE-bench Verified are:
- Claude Opus 4.6: 80.8%
- Claude Opus 4.6 (Thinking): 79.2%
- GPT-5.4: 77.2%
- Gemini 3 Flash: 76.2-78% (76.2% on the vals.ai leaderboard using the standardized mini-SWE-agent v2 harness; 78% per Google’s own evaluation)
These numbers mean the best agents can autonomously resolve roughly four out of five of the benchmark’s real-world software issues. That is impressive, but it also means one in five tasks still fails, which matters enormously in production.
Notably, OpenAI announced in February 2026 that it would stop reporting SWE-bench Verified scores, citing significant contamination and test case flaws that undermine the benchmark’s reliability for measuring frontier capabilities. OpenAI’s audit found that 59.4% of the tasks its models failed contained fundamentally broken tests that reject correct solutions. OpenAI now endorses SWE-bench Pro (from Scale AI) as a harder, more diverse replacement that includes tasks across more repositories and programming languages, with longer task durations (1-4+ hours) and substantially less evidence of contamination. This highlights a broader challenge: as agents improve, the benchmarks used to measure them must evolve too.
OSWorld: Can the Agent Use a Computer?
OSWorld tests whether an AI agent can complete real desktop tasks by interpreting screenshots and generating mouse/keyboard actions. As discussed in the computer use section, GPT-5.4 scores 75% on OSWorld-Verified (surpassing the 72.4% human baseline), GPT-5.4 mini scores 72.1%, and GPT-5.4 nano drops to 39%.
The Benchmark Gap
The gap between benchmark performance and real-world reliability remains significant. An agent that scores 77% on SWE-bench may perform very differently on your specific codebase, with your specific coding conventions, dependencies, and edge cases. Benchmarks measure average performance across a standardized set of tasks; production performance depends on the specific distribution of tasks your agent encounters.
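A short calculation shows why the task distribution matters. The per-category success rates and both task mixes below are made-up numbers chosen only to illustrate the arithmetic; they are not measurements of any model or benchmark, and `expected_success` is a hypothetical helper.

```python
from typing import Dict


def expected_success(rates: Dict[str, float], task_mix: Dict[str, float]) -> float:
    """Average success rate, weighting each task category by its share of the workload."""
    total_weight = sum(task_mix.values())
    return sum(rates[category] * weight for category, weight in task_mix.items()) / total_weight


# Hypothetical per-category success rates for one agent.
rates = {"bug_fix": 0.85, "refactor": 0.70, "new_feature": 0.50}

# Hypothetical task mixes (relative weights): a benchmark vs. one team's real workload.
benchmark_mix = {"bug_fix": 70, "refactor": 20, "new_feature": 10}
production_mix = {"bug_fix": 20, "refactor": 30, "new_feature": 50}

print(f"benchmark score:   {expected_success(rates, benchmark_mix):.1%}")
print(f"production score:  {expected_success(rates, production_mix):.1%}")
```

The agent is identical in both cases; only the mix of tasks changes, yet the headline number moves substantially.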
This is why production agent deployments almost always include human review checkpoints, automated testing, and gradual rollout, rather than fully autonomous operation from day one.
Source: SWE-bench Verified scores: Claude Opus 4.6 80.8% (confirmed from anthropic.com, digitalapplied.com, neuralstackly.com, beehiiv.com/walterslabreport), Claude Opus 4.6 (Thinking) 79.2%, GPT-5.4 77.2%, Gemini 3 Flash 76.2% on vals.ai leaderboard using standardized mini-SWE-agent v2 harness (confirmed from vals.ai/benchmarks/swebench, updated February 2026); Gemini 3 Flash 78% per Google’s own evaluation (confirmed from developers.googleblog.com/gemini-3-flash-is-now-available-in-gemini-cli, spectrumailab.com, digitalapplied.com). OpenAI stopped reporting SWE-bench Verified scores in February 2026, citing 59.4% of audited failed tasks having flawed test cases and evidence of training data contamination across all major frontier models; endorses SWE-bench Pro as replacement (confirmed from openai.com/index/why-we-no-longer-evaluate-swe-bench-verified, blockchain.news, thenextgentechinsider.com, the-decoder.com, Latent Space podcast with Mia Glaese and Olivia Watkins). GPT-5.4 mini 72.1% OSWorld-Verified, GPT-5.4 nano 39% (confirmed from blockchain.news, ghacks.net, buildfastwithai.com).
Challenges and Limitations of Agents
Agents are powerful, but they are far from reliable. Understanding their limitations is essential for building systems that work in practice.
Compounding Errors
Each step in an agent loop has a probability of failure. If each step succeeds 95% of the time and failures are independent, a 10-step task succeeds only 0.95^10 ≈ 60% of the time, and a 20-step task only 0.95^20 ≈ 36% of the time. This is the fundamental challenge of multi-step agents: the overall success rate falls exponentially with the number of steps.
def compounding_error_rates():
    """
    Show how error rates compound over multiple agent steps.
    """
    print("Compounding Error Rates in Multi-Step Agents")
    print("=" * 55)
    step_success_rates = [0.99, 0.97, 0.95, 0.90, 0.85]
    step_counts = [1, 5, 10, 20, 50]
    print(f" {'Steps':<8}", end="")
    for rate in step_success_rates:
        print(f" {rate*100:.0f}%/step", end="")
    print()
    print(" " + "-" * 50)
    for steps in step_counts:
        print(f" {steps:<8}", end="")
        for rate in step_success_rates:
            overall = rate ** steps * 100
            print(f" {overall:>7.1f}%", end="")
        print()
    print(f"\n Even at 95% per-step accuracy, a 20-step task")
    print(f" succeeds only 36% of the time.")
    print(f" This is why agents need error recovery and retries.")
compounding_error_rates()
This is why production agent systems include retry logic, error recovery, checkpointing (saving state so you can resume from the last successful step), and human-in-the-loop checkpoints (pausing to ask the user for confirmation before critical actions).
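To make the benefit of retries concrete, here is a minimal sketch of the arithmetic. It assumes that attempts are independent and that a failed attempt can be retried cleanly with no side effects; neither assumption holds for every real tool, and the function names are illustrative only.

```python
def step_success_with_retries(p: float, max_retries: int) -> float:
    """Probability that one step succeeds within 1 + max_retries attempts.

    Assumes independent attempts and side-effect-free failures.
    """
    return 1.0 - (1.0 - p) ** (max_retries + 1)


def task_success(p: float, steps: int, max_retries: int = 0) -> float:
    """Probability that every step of a multi-step task succeeds."""
    return step_success_with_retries(p, max_retries) ** steps


if __name__ == "__main__":
    # Compare a 20-step task at 95% per-step accuracy with 0, 1, and 2 retries.
    for retries in (0, 1, 2):
        rate = task_success(0.95, steps=20, max_retries=retries)
        print(f"retries={retries}: {rate * 100:.1f}% task success")
```

Under these assumptions, allowing a single retry per step lifts the 20-step task from roughly 36% to about 95% overall success, which is why even simple retry logic pays off.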
Context Window Exhaustion
Each step in the agent loop adds messages to the conversation: the model’s reasoning, the tool call request, and the tool result. For complex tasks with many steps, the conversation can grow to tens of thousands of tokens, eventually filling the model’s context window. When this happens, the model loses access to earlier parts of the conversation, which can cause it to repeat actions, forget its plan, or lose track of what it has already accomplished.
This is the “context rot” problem discussed in Chapter 20, applied to agents. The solutions are the same: summarizing earlier steps, using prompt caching (Chapter 19) to reduce costs, and designing agents that complete tasks in fewer steps.
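The summarization approach can be sketched as follows. This is an illustration under stated assumptions, not any framework's actual API: the message format, the four-characters-per-token estimate, and the `summarize` callback (which in practice would be a model call) are all hypothetical.

```python
from typing import Callable

Message = dict[str, str]  # hypothetical shape: {"role": ..., "content": ...}


def estimate_tokens(messages: list[Message]) -> int:
    """Crude estimate assuming about 4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4


def compact_history(
    messages: list[Message],
    budget_tokens: int,
    keep_recent: int,
    summarize: Callable[[list[Message]], str],
) -> list[Message]:
    """Replace older messages with one summary once the history exceeds the budget.

    The first message (the original instructions) and the last keep_recent
    messages are preserved verbatim.
    """
    if estimate_tokens(messages) <= budget_tokens:
        return messages
    split = len(messages) - keep_recent
    older, recent = messages[1:split], messages[split:]
    if not older:
        return messages  # nothing left to summarize
    summary = {
        "role": "user",
        "content": "Summary of earlier steps: " + summarize(older),
    }
    return [messages[0], summary, *recent]
```

Calling `compact_history` before each model call keeps the conversation under a fixed budget, at the price of losing detail from the summarized steps.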
Cost
Agent workflows are expensive. Each step in the loop requires an API call, and each call processes the full conversation history (including all previous tool calls and results). A 10-step agent workflow might process 50,000-100,000 tokens total, costing $0.10-$0.50 at current API prices. A complex 50-step workflow could cost $1-$5 or more. For high-volume applications, these costs add up quickly.
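The arithmetic behind these estimates can be sketched in a few lines. The token counts and per-million-token prices used below are placeholder assumptions chosen for illustration, not quotes from any provider, and the sketch ignores prompt caching.

```python
def estimate_agent_cost(
    steps: int,
    prompt_tokens: int,
    tokens_added_per_step: int,
    output_tokens_per_step: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> tuple[int, float]:
    """Return (total tokens processed, estimated cost in dollars).

    tokens_added_per_step is how much the history grows each step
    (model output plus tool result); output_tokens_per_step is the
    portion billed at the output price.
    """
    total_input = 0
    history = prompt_tokens
    for _ in range(steps):
        total_input += history            # the full history is re-sent on every call
        history += tokens_added_per_step  # tool call and result are appended
    total_output = steps * output_tokens_per_step
    cost = (
        total_input * input_price_per_million
        + total_output * output_price_per_million
    ) / 1_000_000
    return total_input + total_output, cost


if __name__ == "__main__":
    # Hypothetical prices: $2 per million input tokens, $8 per million output tokens.
    for n in (10, 50):
        tokens, dollars = estimate_agent_cost(n, 2000, 1000, 300, 2.0, 8.0)
        print(f"{n} steps: {tokens:,} tokens, ${dollars:.2f}")
```

With these placeholder numbers, a 10-step run processes 68,000 tokens for about $0.15, while a 50-step run processes about 1.34 million tokens for about $2.77. Because the history is re-sent on every call, input tokens grow roughly quadratically with the number of steps.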
The cost structure also creates a trade-off: the more capable the agent (more steps, more tools, more reasoning), the more expensive it is to run. This is why many production systems use a tiered approach: a fast, cheap model handles simple requests, and a more capable (and expensive) model handles complex ones.
Safety and Control
An agent that can take actions in the world (sending emails, modifying files, making purchases) introduces safety risks that do not exist with a simple chatbot. If the model misunderstands the user’s intent or hallucinates a tool call, it could take harmful actions: deleting files, sending incorrect emails, or making unauthorized purchases.
This is why production agent systems typically include:
- Confirmation prompts: Asking the user to approve high-stakes actions before execution
- Sandboxing: Running agent actions in isolated environments (like Codex’s cloud sandboxes) where they cannot affect production systems
- Guardrails: Validating tool call arguments before execution (e.g., checking that a file deletion target is within an allowed directory)
- Audit logging: Recording every action the agent takes for review and accountability
- Rate limiting: Preventing the agent from taking too many actions in a short period
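Several of these controls can be combined in one wrapper around tool execution. The sketch below is illustrative only: the sandbox directory, the tool names, and the `confirm` callback are hypothetical, and a real system would write the audit log to durable storage rather than a list. `Path.is_relative_to` requires Python 3.9 or later.

```python
import json
import time
from pathlib import Path
from typing import Callable

ALLOWED_ROOT = Path("/tmp/agent_workspace")        # hypothetical sandbox directory
HIGH_STAKES_TOOLS = {"delete_file", "send_email"}  # hypothetical tool names


def path_is_allowed(target: str, root: Path = ALLOWED_ROOT) -> bool:
    """Guardrail: accept only paths that resolve to a location inside root."""
    resolved = (root / target).resolve()
    return resolved.is_relative_to(root.resolve())


def execute_tool_call(
    name: str,
    args: dict,
    tools: dict[str, Callable[..., str]],
    confirm: Callable[[str, dict], bool],
    audit_log: list[str],
) -> str:
    """Validate, optionally confirm, execute, and log a single tool call."""

    def record(outcome: str) -> str:
        # Audit logging: every decision is recorded, including rejections.
        entry = {"time": time.time(), "tool": name, "args": args, "outcome": outcome}
        audit_log.append(json.dumps(entry))
        return outcome

    if name not in tools:
        return record("rejected: unknown tool")
    if "path" in args and not path_is_allowed(args["path"]):
        return record("rejected: path outside allowed directory")
    if name in HIGH_STAKES_TOOLS and not confirm(name, args):
        return record("rejected: user declined")
    return record(tools[name](**args))
```

The rejection message is returned to the caller as the tool result, so the model can see why the action was blocked and choose a different step.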
The fundamental tension is between autonomy and safety. A fully autonomous agent is more useful (it can complete tasks without human intervention) but more dangerous (it can take harmful actions without human oversight). The industry is still figuring out the right balance, and different applications require different levels of autonomy.
Putting It All Together: The Full Agent Stack
Here is how all the pieces fit together in a modern agent system as of March 2026:
def full_agent_stack():
    """
    Show the complete technology stack for a modern agent system.
    """
    print("The Full Agent Stack (March 2026)")
    print("=" * 65)
    layers = [
        ("User Interface",
         "Chat UI, voice, IDE integration",
         "ChatGPT, Claude Desktop, VS Code"),
        ("Agent Framework",
         "Loop orchestration, handoffs, guardrails",
         "OpenAI Agents SDK, Claude Agent SDK, LangGraph"),
        ("Model (Brain)",
         "Reasoning, planning, tool selection",
         "GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro"),
        ("Tool Protocol",
         "Standardized tool discovery and execution",
         "MCP (agent-to-tool), A2A (agent-to-agent)"),
        ("Tools and Data",
         "External functions, APIs, databases",
         "MCP servers, REST APIs, databases, GUIs"),
        ("Safety Layer",
         "Guardrails, sandboxing, audit logging",
         "Input validation, confirmation prompts, rate limits"),
    ]
    for i, (layer, desc, examples) in enumerate(layers):
        print(f"\n Layer {i+1}: {layer}")
        print(f" {desc}")
        print(f" Examples: {examples}")
full_agent_stack()
The stack works as follows:
1. The user makes a request through the interface (“Book me a flight to London next Tuesday”).
2. The agent framework sends the request to the model.
3. The model reasons about the task and decides to call a tool (e.g., search for flights).
4. The framework routes the tool call through MCP to the appropriate server.
5. The MCP server executes the function and returns the result.
6. The framework sends the result back to the model.
7. The model reasons about the result and decides on the next step.
8. Steps 3-7 repeat until the task is complete.
9. Throughout the loop, the safety layer validates each action before it is executed.
10. The final result is presented to the user.
This is the architecture behind production agent systems such as Codex, Claude Code, and Operator. The specific components vary (different models, different frameworks, different tools), but the overall pattern is the same.
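The request loop described in the steps above can be sketched in plain Python. This is a toy illustration, not the API of any SDK: the message and decision dictionaries, the `scripted_model` stand-in, and the `search_flights` tool are all hypothetical, and tools are called directly rather than through an MCP server.

```python
from typing import Callable


def run_agent(
    request: str,
    model: Callable[[list[dict]], dict],
    tools: dict[str, Callable[..., str]],
    is_safe: Callable[[str, dict], bool],
    max_steps: int = 10,
) -> str:
    """Minimal agent loop: ask the model, run the tool it requests, feed the result back."""
    messages = [{"role": "user", "content": request}]
    for _ in range(max_steps):
        decision = model(messages)             # the model reasons and picks the next step
        if decision["type"] == "final":
            return decision["content"]         # the final result goes back to the user
        name, args = decision["tool"], decision["args"]
        if name in tools and is_safe(name, args):  # safety check before execution
            result = tools[name](**args)           # a real stack would route this through MCP
        else:
            result = f"error: tool call {name!r} was blocked"
        messages.append({"role": "assistant", "content": f"call {name} {args}"})
        messages.append({"role": "tool", "content": result})
    return "stopped: step limit reached"


def scripted_model(messages: list[dict]) -> dict:
    """Stand-in for a real model: requests one search, then answers."""
    if messages[-1]["role"] == "tool":
        return {"type": "final", "content": "Found: " + messages[-1]["content"]}
    return {"type": "tool_call", "tool": "search_flights",
            "args": {"destination": "London"}}


def search_flights(destination: str) -> str:
    """Hypothetical tool returning a canned result."""
    return f"2 flights to {destination}"


if __name__ == "__main__":
    answer = run_agent(
        "Book me a flight to London next Tuesday",
        scripted_model,
        {"search_flights": search_flights},
        is_safe=lambda name, args: True,
    )
    print(answer)
```

The `max_steps` cap is a design choice worth keeping in any real loop: it bounds both cost and the damage a confused model can do.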
Key Takeaways
Function calling allows language models to request the execution of external functions. The model generates a structured JSON request specifying the function name and arguments; your application executes the function and returns the result. OpenAI introduced this on June 13, 2023, and it is now supported by all major providers, including Anthropic and Google, as well as by many open-source models.
The evolution of function calling progressed from single calls (June 2023) to parallel calls (November 2023, DevDay) to guaranteed schema conformance via Structured Outputs (August 2024) to the unified Responses API (March 2025). The Assistants API, launched at DevDay 2023, is being sunset on August 26, 2026, in favor of the Responses API.
The Model Context Protocol (MCP), announced by Anthropic in November 2024, is an open standard that solves the N x M integration problem by providing a universal interface for connecting AI models to tools. MCP uses JSON-RPC 2.0 with a Host/Client/Server architecture and defines three primitives: Tools (model-initiated actions), Resources (application-initiated context), and Prompts (user-initiated templates).
MCP adoption was extraordinarily fast. OpenAI adopted it by March 2025, Google DeepMind and Microsoft by mid-2025. By December 2025, Anthropic donated MCP to the Agentic AI Foundation (AAIF) under the Linux Foundation, with eight founding platinum members: AWS, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, and OpenAI. As of March 2026, 3,012 servers are in the official registry (84.6% with source code available), over 10,000 in the broader ecosystem, and the protocol has 97 million monthly SDK downloads.
Computer use enables models to operate software through its GUI by interpreting screenshots and generating mouse/keyboard actions. Anthropic pioneered this with Computer Use (October 22, 2024), and OpenAI followed with Operator (January 23, 2025) powered by the CUA model. GPT-5.4 scores 75% on OSWorld-Verified (above the 72.4% human baseline), but computer use remains slower and less reliable than API-based tool calling.
The ReAct pattern (Reasoning + Acting, Yao et al., October 2022) formalizes the agent loop: Thought (reason about the situation), Action (call a tool), Observation (receive the result), Repeat. This is the fundamental pattern behind all modern agent systems.
Agent frameworks handle the boilerplate of building agents. The OpenAI Agents SDK (March 2025) provides Agents, Handoffs, Guardrails, and Tracing. The Claude Agent SDK (September 2025) adds subagents and checkpointing. Google ADK (April 2025) offers multi-agent support with bidirectional streaming. Third-party frameworks include LangGraph (stateful graphs), CrewAI (role-based teams), and AutoGen (conversational agents).
The A2A Protocol (Google, April 9, 2025) complements MCP by standardizing agent-to-agent communication. MCP connects agents to tools (vertical); A2A connects agents to other agents (horizontal). Together, they enable multi-agent systems with shared tool access. Google donated A2A to the Linux Foundation in June 2025, with over 100 supporting organizations including AWS, Cisco, Microsoft, Salesforce, SAP, and ServiceNow.
Production agents exist today. OpenAI Codex (May 2025) runs autonomous coding tasks in cloud sandboxes using codex-1 (an o3 variant trained with reinforcement learning). OpenAI Codex CLI (April 2025) is an open-source terminal agent under the Apache 2.0 license. Claude Code (February 2025, GA May 2025) operates in your terminal for agentic coding. Devin 2.0 (April 3, 2025) offers fully autonomous software engineering starting at $20/month. OpenAI Operator (January 2025) browses the web and performs online tasks.
Multi-agent systems divide complex tasks among specialized agents. A coordinator agent breaks the task into subtasks, delegates to specialists (researcher, analyst, writer), and synthesizes the results. The OpenAI Agents SDK implements this through handoffs; CrewAI uses role-based teams; LangGraph uses stateful graphs.
Gartner predicts 40% of enterprise applications will integrate AI agents by end of 2026, up from less than 5% in 2025. The agentic AI market reached approximately $7.84 billion in 2025, projected to $52.62 billion by 2030 at 46.3% CAGR. However, only 11% of enterprise pilots have reached production, and Gartner warns over 40% of agentic AI projects will be canceled by 2027 due to costs and unclear value.
Agent limitations are real. Errors compound exponentially (95% per-step accuracy yields only 36% success over 20 steps). Context windows fill up during long agent loops. Costs accumulate with each step. Safety requires confirmation prompts, sandboxing, guardrails, and audit logging. The tension between autonomy and safety remains the central challenge.
Agent benchmarks are maturing but imperfect. SWE-bench Verified measures coding agents on real GitHub issues (top score: Claude Opus 4.6 at 80.8%). OSWorld measures computer use (GPT-5.4 at 75%, above the 72.4% human baseline). However, OpenAI stopped reporting SWE-bench Verified scores in February 2026 due to contamination concerns (59.4% of audited failed tasks had flawed tests), endorsing SWE-bench Pro as a harder replacement. This highlights that benchmarks must evolve alongside the agents they measure. Production performance depends on your specific tasks, not benchmark averages.
The full agent stack in March 2026 consists of: a user interface, an agent framework (for orchestration), a model (for reasoning), a tool protocol (MCP for tools, A2A for agents), external tools and data, and a safety layer. This architecture powers every production agent system, from coding assistants to web browsing agents.
What’s Next
You now understand how models go beyond generating text to taking actions in the world: the function calling mechanism that lets models request tool execution, the MCP protocol that standardizes tool integration, the ReAct loop that powers multi-step agents, and the frameworks and protocols that enable multi-agent collaboration. In Chapter 24, we will look at the infrastructure that makes all of this possible: the GPU clusters, model parallelism strategies, batching techniques, and quantization methods that serve these models to millions of users simultaneously.