Most LLM tool calling tutorials show you the basics: define a tool, call it, return the result. That works fine when your model needs one piece of data. But real-world agents rarely need just one thing.
Say your user asks “What’s the weather in Tokyo, London, and New York?” A naive implementation makes three sequential API calls to your weather service. Each takes 200-500ms, so your user waits up to 1.5 seconds for no good reason.
Parallel tool calling lets the model request multiple tool executions in a single response. You get back an array of tool calls, run them all concurrently, and send every result back in one message. The user gets their answer in ~500ms instead of 1500ms.
OpenAI’s GPT-4o and Claude 3.5+ both support this natively. The model decides when to request multiple tools at once – you just need to handle the response correctly.
First, set up your tool definitions. These go in the tools parameter of your chat completion request. Each tool is a JSON Schema describing its name, purpose, and parameters.
```python
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'Tokyo'"
                    },
                    "units": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_population",
            "description": "Get the population of a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["city"]
            }
        }
    }
]
```
When you send a message like “Compare the weather and population of Tokyo and Paris,” the model can emit up to four tool calls in a single response – get_weather and get_population for each city.
Here’s the full flow: send the request, detect tool calls, execute them, and return results. The key detail is that each tool call has a unique id. You must reference that id when sending results back.
```python
import json

def get_weather(city: str, units: str = "celsius") -> dict:
    """Simulated weather lookup. Replace with a real API call."""
    weather_data = {
        "Tokyo": {"temp": 18, "condition": "cloudy"},
        "Paris": {"temp": 12, "condition": "rainy"},
        "New York": {"temp": 8, "condition": "sunny"},
        "London": {"temp": 10, "condition": "foggy"},
    }
    data = weather_data.get(city, {"temp": 20, "condition": "unknown"})
    if units == "fahrenheit":
        data["temp"] = round(data["temp"] * 9 / 5 + 32)
    data["units"] = units
    data["city"] = city
    return data

def get_population(city: str) -> dict:
    """Simulated population lookup."""
    populations = {
        "Tokyo": 13_960_000,
        "Paris": 2_161_000,
        "New York": 8_336_000,
        "London": 8_982_000,
    }
    return {"city": city, "population": populations.get(city, 0)}

# Map function names to actual callables
tool_functions = {
    "get_weather": get_weather,
    "get_population": get_population,
}

messages = [
    {"role": "user", "content": "Compare the weather and population of Tokyo and Paris."}
]

# Step 1: Send the request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
)
assistant_message = response.choices[0].message

# Step 2: Check if the model wants to call tools
if assistant_message.tool_calls:
    # Append the assistant message (with tool_calls) to the conversation
    messages.append(assistant_message)

    # Step 3: Execute each tool call and collect results
    for tool_call in assistant_message.tool_calls:
        func_name = tool_call.function.name
        func_args = json.loads(tool_call.function.arguments)

        # Call the actual function
        result = tool_functions[func_name](**func_args)

        # Step 4: Append each result with the matching tool_call id
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    # Step 5: Send everything back for the final answer
    final_response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
    )
    print(final_response.choices[0].message.content)
```
The model receives all tool results at once and synthesizes them into a single answer. This is already faster than sequential calling because the model batches its requests. But the tool executions themselves still run in a loop. Let’s fix that.
When your tools make real network calls (HTTP requests to APIs, database queries), you want them running at the same time. Use asyncio.gather to execute all tool calls concurrently.
```python
import asyncio
import json

import httpx
from openai import AsyncOpenAI

aclient = AsyncOpenAI()

async def get_weather_async(city: str, units: str = "celsius") -> dict:
    """Fetch weather from an API. Using httpx for async HTTP."""
    async with httpx.AsyncClient() as http:
        # wttr.in takes the city in the URL path, e.g. wttr.in/Tokyo?format=j1
        resp = await http.get(
            f"https://wttr.in/{city}",
            params={"format": "j1"},
            timeout=10.0,
        )
        data = resp.json()
    current = data["current_condition"][0]
    temp_key = "temp_C" if units == "celsius" else "temp_F"
    return {
        "city": city,
        "temp": int(current[temp_key]),
        "condition": current["weatherDesc"][0]["value"],
        "units": units,
    }

async def get_population_async(city: str) -> dict:
    """Simulated async population lookup."""
    await asyncio.sleep(0.1)  # Simulate network delay
    populations = {
        "Tokyo": 13_960_000,
        "Paris": 2_161_000,
        "New York": 8_336_000,
        "London": 8_982_000,
    }
    return {"city": city, "population": populations.get(city, 0)}

# Async function registry
async_tool_functions = {
    "get_weather": get_weather_async,
    "get_population": get_population_async,
}

async def run_parallel_tools(tool_calls):
    """Execute all tool calls concurrently and return results."""
    tasks = []
    for tc in tool_calls:
        func_name = tc.function.name
        func_args = json.loads(tc.function.arguments)
        coro = async_tool_functions[func_name](**func_args)
        tasks.append((tc.id, coro))

    # Run all coroutines concurrently
    results = await asyncio.gather(*(coro for _, coro in tasks))

    # Pair each result with its tool_call_id
    return [
        {
            "role": "tool",
            "tool_call_id": tc_id,
            "content": json.dumps(result),
        }
        for (tc_id, _), result in zip(tasks, results)
    ]

async def chat_with_parallel_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    response = await aclient.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,  # same tool definitions from above
    )
    assistant_message = response.choices[0].message
    if not assistant_message.tool_calls:
        return assistant_message.content
    messages.append(assistant_message)

    # Execute all tools concurrently
    tool_results = await run_parallel_tools(assistant_message.tool_calls)
    messages.extend(tool_results)

    # Get final response
    final = await aclient.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
    )
    return final.choices[0].message.content

# Run it
answer = asyncio.run(
    chat_with_parallel_tools("What's the weather in Tokyo, London, and New York?")
)
print(answer)
```
With three cities, this runs all three weather lookups simultaneously. If each takes 300ms, total wall time is ~300ms instead of ~900ms.
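You can verify that speedup without any API keys. This sketch (all names are illustrative) simulates three ~300ms lookups and times the sequential and concurrent versions side by side:

```python
import asyncio
import time

async def fake_lookup(city: str) -> str:
    await asyncio.sleep(0.3)  # stand-in for a ~300ms API call
    return f"{city}: ok"

async def sequential(cities: list[str]) -> list[str]:
    # Awaiting each call before starting the next
    return [await fake_lookup(c) for c in cities]

async def concurrent(cities: list[str]) -> list[str]:
    # Starting all calls at once with gather
    return await asyncio.gather(*(fake_lookup(c) for c in cities))

cities = ["Tokyo", "London", "New York"]

start = time.perf_counter()
asyncio.run(sequential(cities))
seq_time = time.perf_counter() - start

start = time.perf_counter()
asyncio.run(concurrent(cities))
par_time = time.perf_counter() - start

print(f"sequential: {seq_time:.2f}s, concurrent: {par_time:.2f}s")
```

The sequential version takes roughly the sum of the delays; the concurrent version takes roughly the longest single delay.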
You can influence whether the model calls tools at all, and OpenAI gives you a parallel_tool_calls parameter to explicitly control batching.
```python
# Force the model to call a specific tool
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
    parallel_tool_calls=True,  # default is True for supported models
)

# Disable parallel tool calls (force sequential)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    parallel_tool_calls=False,
)
```
Set parallel_tool_calls=False when tool ordering matters – for example, when the second tool depends on the first tool’s output. The model will then emit one tool call at a time.
Sometimes the model needs multiple rounds of tool calls. The first round gathers data, and based on those results, it requests more tools. Here’s a pattern that handles arbitrary rounds.
```python
async def multi_round_tool_loop(user_message: str, max_rounds: int = 5) -> str:
    messages = [{"role": "user", "content": user_message}]

    for _ in range(max_rounds):
        response = await aclient.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
        )
        assistant_message = response.choices[0].message
        messages.append(assistant_message)

        # If no tool calls, we're done
        if not assistant_message.tool_calls:
            return assistant_message.content

        # Execute this round's tool calls in parallel
        tool_results = await run_parallel_tools(assistant_message.tool_calls)
        messages.extend(tool_results)

    # If we hit max_rounds, get a final answer without tools
    response = await aclient.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    return response.choices[0].message.content
```
Cap max_rounds to prevent runaway loops. Five rounds is generous for most use cases.
Claude supports parallel tool calls through the same pattern. The API shape is different, but the concept is identical. Claude returns a list of tool_use content blocks when it wants multiple tools.
```python
import json

import anthropic

ac = anthropic.Anthropic()

claude_tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                }
            },
            "required": ["city"]
        }
    },
    {
        "name": "get_population",
        "description": "Get the population of a city",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
]

response = ac.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=claude_tools,
    messages=[
        {"role": "user", "content": "Weather and population for Tokyo and Paris?"}
    ],
)

# Collect all tool_use blocks from the response
tool_use_blocks = [
    block for block in response.content if block.type == "tool_use"
]

# Execute each tool and build result blocks
tool_results = []
for block in tool_use_blocks:
    result = tool_functions[block.name](**block.input)
    tool_results.append({
        "type": "tool_result",
        "tool_use_id": block.id,
        "content": json.dumps(result),
    })

# Send results back
final_response = ac.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=claude_tools,
    messages=[
        {"role": "user", "content": "Weather and population for Tokyo and Paris?"},
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": tool_results},
    ],
)

# Extract text from the response
text_blocks = [b.text for b in final_response.content if b.type == "text"]
print("\n".join(text_blocks))
```
Notice the structural difference: Claude nests tool results inside a user message with tool_result type blocks, while OpenAI uses a top-level tool role. Both APIs return parallel calls as arrays you can execute concurrently.
## Common Errors and Fixes
tool_call_id mismatch: Every tool result must reference the exact id from the corresponding tool_call. If you mix these up, the API returns a 400 error. Double-check your ID mapping.
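A cheap guard is to validate the mapping before making the follow-up request. The ids below are made up for illustration:

```python
import json

# Hypothetical ids the model requested in its tool_calls array
requested_ids = ["call_abc123", "call_def456"]

# Results you built while executing the tools
tool_results = [
    {"role": "tool", "tool_call_id": "call_abc123", "content": json.dumps({"temp": 18})},
    {"role": "tool", "tool_call_id": "call_def456", "content": json.dumps({"temp": 12})},
]

# Every requested id must get exactly one result, and no stray ids may sneak in
returned_ids = [r["tool_call_id"] for r in tool_results]
assert sorted(returned_ids) == sorted(requested_ids), "tool_call_id mismatch"
```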
Forgetting to append the assistant message: Before sending tool results, you must include the assistant’s message (the one containing tool_calls) in the conversation. Skipping it breaks the message flow.
```python
# Wrong -- missing assistant message
messages.append({"role": "tool", "tool_call_id": "call_abc", "content": "..."})

# Correct
messages.append(assistant_message)  # includes tool_calls
messages.append({"role": "tool", "tool_call_id": "call_abc", "content": "..."})
```
JSON parsing failures: tool_call.function.arguments is a JSON string, not a dict. Always parse it with json.loads(). The model occasionally produces malformed JSON – wrap it in a try/except.
```python
try:
    func_args = json.loads(tool_call.function.arguments)
except json.JSONDecodeError:
    func_args = {}  # fallback or re-prompt the model
```
Exceptions in asyncio.gather: By default, asyncio.gather raises the first exception it encounters; the remaining tasks keep running, but you lose their results. Use return_exceptions=True if you want partial results.
```python
results = await asyncio.gather(*coroutines, return_exceptions=True)
for r in results:
    if isinstance(r, Exception):
        # Handle the failed tool call gracefully
        pass
```
Rate limits on concurrent calls: Firing 10 parallel HTTP requests to the same API can trigger rate limits. Add a semaphore to cap concurrency.
```python
sem = asyncio.Semaphore(5)

async def rate_limited_call(func, **kwargs):
    async with sem:
        return await func(**kwargs)
```
Claude tool results in wrong message format: Claude expects tool results inside the user role as a list of tool_result objects. Sending them as top-level tool role messages (like OpenAI) returns a validation error.
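To make the contrast concrete, here are the two result-message shapes side by side (ids are placeholders):

```python
# Correct for Claude: tool results nested inside a *user* message
claude_message = {
    "role": "user",
    "content": [
        {
            "type": "tool_result",
            "tool_use_id": "toolu_abc123",  # placeholder id
            "content": '{"temp": 18}',
        }
    ],
}

# What OpenAI expects instead: a top-level "tool" role message.
# Sending this shape to Claude fails validation.
openai_message = {
    "role": "tool",
    "tool_call_id": "call_abc123",  # placeholder id
    "content": '{"temp": 18}',
}
```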