The OpenAI Assistants API gives you a stateful, server-side agent runtime. Unlike the Chat Completions API where you manage conversation history yourself, Assistants handle threading, tool execution, and state persistence on OpenAI’s servers. You create an assistant once, spin up threads per conversation, and let the API manage the back-and-forth – including automatic code execution and file handling.

This is the right choice when you need persistent conversations, built-in code execution, or file retrieval without building that infrastructure yourself. If you want something lighter-weight or fully client-side, stick with Chat Completions and function calling. But for anything with multi-turn tool use and state, Assistants save you a lot of plumbing.

Set Up the Client

pip install openai
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env var

Set your API key:

export OPENAI_API_KEY="sk-your-key-here"

That’s all you need. The openai package handles auth, retries, and streaming. Make sure you’re on version 1.x or later – the Assistants API doesn’t exist in the 0.x SDK.
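Before wiring up anything else, it can help to fail fast with a clear message when the key isn't set. A minimal sketch (the helper name is mine, not part of the SDK):

```python
import os

def require_api_key() -> str:
    """Fail fast with a clear message if OPENAI_API_KEY is missing."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set -- export it before creating the client."
        )
    return key
```

Call this once at startup, before constructing the client, so a missing key surfaces as one obvious error instead of a failure buried in your first API call.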

Create an Assistant with Tools

An assistant is a configured agent that persists across conversations. You define its instructions, model, and which tools it can use. There are three tool types: the built-in code_interpreter and file_search, plus custom function tools that you implement yourself.

assistant = client.beta.assistants.create(
    name="Data Analyst",
    instructions=(
        "You are a data analyst agent. When given data or questions about data, "
        "write and execute Python code to answer. Always show your work. "
        "If you need to make calculations, use code_interpreter."
    ),
    model="gpt-4o",
    tools=[
        {"type": "code_interpreter"},
        {
            "type": "function",
            "function": {
                "name": "get_stock_price",
                "description": "Get the current stock price for a given ticker symbol.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "ticker": {
                            "type": "string",
                            "description": "Stock ticker symbol, e.g. AAPL"
                        }
                    },
                    "required": ["ticker"]
                }
            }
        }
    ]
)

print(f"Assistant ID: {assistant.id}")
# Save this ID -- you reuse it across sessions

A few opinions here: always set explicit instructions. The default behavior without instructions is unpredictable. And use gpt-4o for agents that need tool calling – the smaller models hallucinate tool arguments more often and struggle with multi-step reasoning.

Run a Conversation with Threads

Threads are conversation containers. Each thread holds a message history, and you can create as many as you need per assistant. OpenAI stores the thread server-side, so you just need the thread ID to resume later.

# Create a thread
thread = client.beta.threads.create()

# Add a user message
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What's the stock price of AAPL, and calculate what 1000 shares would be worth?"
)

# Start a run
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
)

The run kicks off the agent loop on OpenAI’s servers. The assistant reads the thread, decides which tools to call, and starts executing. But it doesn’t return the result immediately – you need to poll for completion.
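Because the thread lives on OpenAI's servers, a later session can pick the conversation back up with nothing but the stored IDs. A minimal sketch (the function name is mine):

```python
def resume_conversation(client, thread_id: str, assistant_id: str, user_message: str):
    """Append a new user message to an existing thread and start a fresh run."""
    client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=user_message,
    )
    # The new run sees the full stored history of the thread.
    return client.beta.threads.runs.create(
        thread_id=thread_id,
        assistant_id=assistant_id,
    )
```

This is the payoff of server-side state: "resume" is just a message plus a run, with no conversation history to reload from your own storage.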

Poll for Run Completion

Runs are asynchronous. After creating one, you poll until the run either finishes or needs your input. The statuses to watch are completed, requires_action (you need to handle a function call), failed, cancelled, and expired.

import time

def wait_for_run(client, thread_id, run_id):
    """Poll a run until it completes or requires action."""
    while True:
        run = client.beta.threads.runs.retrieve(
            thread_id=thread_id,
            run_id=run_id
        )

        if run.status == "completed":
            return run
        elif run.status == "requires_action":
            return run
        elif run.status in ("failed", "cancelled", "expired"):
            raise RuntimeError(f"Run ended with status: {run.status} - {run.last_error}")

        time.sleep(1)  # Don't hammer the API

run = wait_for_run(client, thread.id, run.id)

Polling every second is fine for most use cases. If you need real-time updates in a UI, use the streaming API instead (covered below). Don’t poll faster than once per second – you’ll hit rate limits and waste API calls.
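If you're juggling many concurrent runs, a capped exponential backoff keeps the retrieve calls within budget; recent versions of the SDK also ship a create_and_poll helper that does the polling for you. A sketch of the backoff variant (helper names are mine):

```python
import time

def backoff_delays(initial: float = 1.0, cap: float = 10.0):
    """Yield 1s, 2s, 4s, ... between polls, capped at `cap` seconds."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, cap)

def wait_for_run_backoff(client, thread_id, run_id):
    """Like wait_for_run, but with increasing sleeps between polls."""
    for delay in backoff_delays():
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
        if run.status in ("completed", "requires_action"):
            return run
        if run.status in ("failed", "cancelled", "expired"):
            raise RuntimeError(f"Run ended with status: {run.status} - {run.last_error}")
        time.sleep(delay)
```

The tradeoff is latency on long runs: with a 10-second cap you may learn about completion up to 10 seconds late, which is fine for batch work but not for interactive use.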

Handle Function Tool Calls

When the assistant decides to call your custom function, the run enters the requires_action status. You extract the function call arguments, execute your function locally, and submit the results back.

import json

def handle_tool_calls(client, thread_id, run):
    """Execute function tool calls and submit results."""
    tool_outputs = []

    for tool_call in run.required_action.submit_tool_outputs.tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)

        if function_name == "get_stock_price":
            # Your actual implementation here
            result = fetch_stock_price(arguments["ticker"])
            tool_outputs.append({
                "tool_call_id": tool_call.id,
                "output": json.dumps({"price": result, "currency": "USD"})
            })
        else:
            tool_outputs.append({
                "tool_call_id": tool_call.id,
                "output": json.dumps({"error": f"Unknown function: {function_name}"})
            })

    # Submit all results at once
    run = client.beta.threads.runs.submit_tool_outputs(
        thread_id=thread_id,
        run_id=run.id,
        tool_outputs=tool_outputs,
    )
    return run


def fetch_stock_price(ticker: str) -> float:
    """Stub -- replace with a real API call (e.g., yfinance, Alpha Vantage)."""
    prices = {"AAPL": 237.50, "GOOGL": 191.30, "MSFT": 452.80}
    return prices.get(ticker, 0.0)

Always return JSON strings in the output field. The assistant parses the output and uses it in its next reasoning step. If your function fails, return an error message in the output rather than raising an exception – the assistant can often recover and explain the error to the user.
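That advice can be centralized: wrap every tool execution so failures become error payloads instead of exceptions (the helper name is mine):

```python
import json

def safe_tool_output(tool_call_id: str, fn, *args, **kwargs) -> dict:
    """Run a tool function; on failure, package the error as JSON instead of raising."""
    try:
        output = json.dumps(fn(*args, **kwargs))
    except Exception as exc:
        # The assistant sees this payload and can explain the failure to the user.
        output = json.dumps({"error": f"{type(exc).__name__}: {exc}"})
    return {"tool_call_id": tool_call_id, "output": output}
```

With this in place, the dispatch loop above reduces to one safe_tool_output call per tool_call, and a flaky downstream API can't kill the whole run.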

The Full Agent Loop

Putting it all together, here’s the complete autonomous agent loop that handles multiple rounds of tool calls:

def run_agent(client, assistant_id, user_message):
    """Run a full agent conversation with tool call handling."""
    thread = client.beta.threads.create()

    client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content=user_message,
    )

    run = client.beta.threads.runs.create(
        thread_id=thread.id,
        assistant_id=assistant_id,
    )

    while True:
        run = wait_for_run(client, thread.id, run.id)

        if run.status == "completed":
            break
        elif run.status == "requires_action":
            run = handle_tool_calls(client, thread.id, run)
            # Loop continues -- the assistant may call more tools

    # Fetch the assistant's final response
    messages = client.beta.threads.messages.list(thread_id=thread.id)

    for msg in messages.data:
        if msg.role == "assistant":
            for block in msg.content:
                if block.type == "text":
                    return block.text.value

    return None


# Usage
response = run_agent(
    client,
    assistant.id,
    "What's the stock price of AAPL, and calculate what 1000 shares would be worth?"
)
print(response)

The key thing to notice: the while True loop handles cases where the assistant calls a function, gets the result, then decides it needs to call another function (or use code_interpreter) before producing a final answer. Real agents often chain 3-5 tool calls per query.
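In production it's worth capping those rounds so a misbehaving run can't loop forever. A sketch reusing the wait_for_run and handle_tool_calls helpers defined above (the cap value is arbitrary):

```python
def run_until_done(client, thread_id, run, max_rounds: int = 10):
    """Drive a run to completion, bailing out after max_rounds tool rounds."""
    # wait_for_run and handle_tool_calls are the helpers defined earlier.
    for _ in range(max_rounds):
        run = wait_for_run(client, thread_id, run.id)
        if run.status == "completed":
            return run
        if run.status == "requires_action":
            run = handle_tool_calls(client, thread_id, run)
    raise RuntimeError(f"Run not complete after {max_rounds} tool rounds; giving up")
```

Since typical queries chain 3-5 tool calls, a cap of 10 leaves headroom while still catching runaway loops before they burn tokens indefinitely.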

Use Streaming for Real-Time Output

Polling works but the user sees nothing until the run completes. Streaming gives you token-by-token output and real-time tool call events:

from openai import AssistantEventHandler

class AgentStreamHandler(AssistantEventHandler):
    def on_text_created(self, text):
        print("\nassistant > ", end="", flush=True)

    def on_text_delta(self, delta, snapshot):
        print(delta.value, end="", flush=True)

    def on_tool_call_created(self, tool_call):
        print(f"\n[calling {tool_call.type}]", flush=True)

    def on_tool_call_delta(self, delta, snapshot):
        if delta.type == "code_interpreter" and delta.code_interpreter:
            if delta.code_interpreter.input:
                print(delta.code_interpreter.input, end="", flush=True)


# Stream a run
with client.beta.threads.runs.stream(
    thread_id=thread.id,
    assistant_id=assistant.id,
    event_handler=AgentStreamHandler(),
) as stream:
    stream.until_done()

Streaming is strictly better for user-facing applications. The only reason to use polling is batch processing where latency doesn’t matter.
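One caveat: if a streamed run hits requires_action for one of your function tools, the stream ends and you have to submit the outputs before streaming resumes. Recent SDK versions expose submit_tool_outputs_stream for exactly this; check that your version has it. A sketch:

```python
def continue_stream_after_tools(client, thread_id, run, tool_outputs, handler):
    """Submit function results, then keep streaming the rest of the run."""
    with client.beta.threads.runs.submit_tool_outputs_stream(
        thread_id=thread_id,
        run_id=run.id,
        tool_outputs=tool_outputs,
        event_handler=handler,
    ) as stream:
        stream.until_done()
```

The tool_outputs list has the same shape as in the polling flow (tool_call_id plus a JSON output string), so the handle_tool_calls logic from earlier carries over unchanged.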

Common Errors

Run expired after 10 minutes. Runs expire roughly ten minutes after creation (the run object carries an expires_at timestamp). If your function calls take too long, the run expires and can't be resumed, so submit tool outputs promptly; if the real work is genuinely slow, return a quick acknowledgment as the tool output and finish the work out of band.

Invalid tool_call_id when submitting outputs. You must use the exact tool_call.id from the run’s required_action. Don’t generate your own IDs. Each tool call has a unique ID and the API validates it.

Thread not found errors. Thread IDs are scoped to your organization. If you’re switching between API keys or orgs, threads from one won’t be visible in another. Store thread IDs alongside the org context.

Rate limit exceeded during polling. If you’re polling too aggressively across many concurrent runs, you’ll burn through your rate limit. Use exponential backoff or switch to streaming, which uses a single long-lived connection.

Function arguments are malformed JSON. The model sometimes generates invalid JSON for complex parameter schemas. Keep your function parameter schemas simple – flat objects with clear descriptions work best. If you need nested objects, provide explicit examples in the function description.
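Rather than letting json.loads crash the loop, you can parse defensively and hand a structured error back to the assistant, which will usually retry with valid arguments (the helper name is mine):

```python
import json

def parse_tool_arguments(raw: str):
    """Return (arguments, None) on success, or (None, error_output) on bad JSON."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        # Submit this string as the tool output so the model can self-correct.
        error = json.dumps({"error": f"Could not parse arguments: {exc.msg}"})
        return None, error
```

Slot this in where handle_tool_calls currently calls json.loads directly: on failure, skip executing the function and submit the error payload as that call's output.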

code_interpreter runs but produces wrong results. The sandbox has a fixed, largely undocumented package set: common data-science libraries like numpy, pandas, matplotlib, and scipy are available, but don't assume any specialized library is. If an import fails, the assistant will often fall back to a manual implementation, which may be wrong. Spell out in your assistant instructions which libraries it should rely on.

When to Use Assistants vs Chat Completions

Use the Assistants API when you need persistent threads (customer support, tutoring), built-in code execution, or file handling. The server manages state, so you don’t need a database for conversation history.

Use Chat Completions with function calling when you want full control over the loop, lower latency (no polling overhead), or you're building something stateless. Chat Completions also tends to cost less in practice: there are no code_interpreter session or file storage fees, and you decide exactly how much history is sent (and billed) each turn, whereas a long-running thread keeps feeding its growing history back into the model.

Don’t mix them in the same application for the same use case. Pick one pattern and stick with it.