The OpenAI Realtime API gives you a persistent WebSocket connection to GPT-4o with native audio understanding. Instead of transcribing speech to text, sending it to a chat endpoint, and then synthesizing the response back to audio, the Realtime API handles all of that in a single round trip. Latency drops dramatically.
Here’s the fastest way to connect and get a response:
Install the dependency with pip install websockets. Set your API key as OPENAI_API_KEY in your environment. That’s all you need to get a working connection.
Configuring the Session
After the WebSocket connects, the server sends a session.created event with default settings. You almost always want to override those defaults with a session.update event. This is where you set the model behavior, voice, and which modalities you want.
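A representative session.update, sent right after connecting. Values like the instructions string, the voice, and the 800 ms silence window are illustrative choices, not required defaults:

```python
import json

# Field values below are assumptions; tune them for your application.
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["text", "audio"],
        "instructions": "You are a friendly voice assistant. Keep answers brief.",
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {"model": "whisper-1"},
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 800,
        },
    },
}

# After the socket is open: await ws.send(payload)
payload = json.dumps(session_update)
```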
Key settings to know:
- modalities controls what the model can receive and produce. Set ["text", "audio"] for voice apps. Use ["text"] if you only need text output.
- voice picks the TTS voice for audio responses. Options include alloy, echo, shimmer, ash, ballad, coral, sage, and verse.
- turn_detection with server_vad lets the server detect when the user stops speaking and automatically triggers a response. Set silence_duration_ms higher (like 800) if users are getting cut off mid-sentence.
- input_audio_transcription is optional but useful. When enabled, the server sends conversation.item.input_audio_transcription.completed events with the text of what the user said.
Sending Audio Input
For voice applications, you stream raw audio to the server using input_audio_buffer.append events. The audio must be base64-encoded PCM16 at 24kHz mono (the default format). Here’s how to capture from a microphone and stream it:
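A sketch of microphone streaming, assuming PyAudio (pip install pyaudio) for capture; append_event is a helper name introduced here for illustration:

```python
import base64
import json

def append_event(chunk: bytes) -> str:
    """Wrap one raw PCM16 chunk as an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })

async def stream_microphone(ws, seconds: float = 5.0) -> None:
    """Capture 24 kHz mono PCM16 from the default mic and stream it."""
    import pyaudio  # pip install pyaudio

    RATE = 24000   # must match the API's default pcm16 @ 24 kHz format
    CHUNK = 1024   # frames per read (~43 ms of audio)

    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    try:
        for _ in range(int(RATE / CHUNK * seconds)):
            chunk = stream.read(CHUNK, exception_on_overflow=False)
            await ws.send(append_event(chunk))
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()
```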
If you have server VAD enabled, the server automatically detects speech boundaries and triggers a response. If you disabled VAD (set turn_detection to null), you need to manually commit the buffer and request a response:
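With VAD off, the two follow-up events are plain JSON. A minimal sketch, where ws is your open connection:

```python
import json

# Finalize the buffered audio as a user turn, then request a response.
commit = json.dumps({"type": "input_audio_buffer.commit"})
create = json.dumps({"type": "response.create"})

# await ws.send(commit)
# await ws.send(create)
```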
Audio responses come back as response.audio.delta events with base64-encoded chunks. Decode them and write to your audio output device or save to a file.
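A sketch of saving those deltas to a WAV file with the standard library's wave module; write_audio is an illustrative helper name:

```python
import base64
import wave

def write_audio(events, path="reply.wav"):
    """Collect response.audio.delta chunks and save them as a mono WAV."""
    pcm = b"".join(
        base64.b64decode(e["delta"])
        for e in events
        if e.get("type") == "response.audio.delta"
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit PCM
        wav.setframerate(24000)  # Realtime API default rate
        wav.writeframes(pcm)
    return path
```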
Adding Function Calling
The Realtime API supports tool use the same way the Chat Completions API does. You define tools in the session config, and the model calls them when relevant. This is where things get powerful – your voice assistant can look up data, control devices, or call external APIs mid-conversation.
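A sketch of a session.update that registers a hypothetical get_weather tool. Note that the Realtime tool schema is flat (name, description, and parameters at the top level) rather than nested under a function key as in Chat Completions:

```python
# get_weather is an illustrative tool, not a real API.
session_update = {
    "type": "session.update",
    "session": {
        "tools": [{
            "type": "function",
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        }],
        "tool_choice": "auto",
    },
}
```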
The flow works like this: the model receives your message, decides to call get_weather, and sends a response.function_call_arguments.done event. You execute the function locally, send the result as a function_call_output conversation item, then trigger another response.create. The model generates its final answer using the tool result.
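A sketch of that round trip on the client side; handle_function_call is a hypothetical helper that turns the done event into the two messages you send back:

```python
import json

def handle_function_call(event, get_weather):
    """Given a response.function_call_arguments.done event, run the local
    tool and build the function_call_output item plus the follow-up
    response.create. get_weather is your local implementation."""
    args = json.loads(event["arguments"])
    result = get_weather(**args)
    output_item = {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],
            "output": json.dumps(result),
        },
    }
    # Send output_item first, then the response.create, over the socket.
    return output_item, {"type": "response.create"}
```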
You can define multiple tools. The model picks which ones to call based on the conversation context, just like with the Chat Completions API.
Common Errors and Fixes
401 Unauthorized on connect – Your API key is missing or invalid. Make sure you’re passing the Authorization header, not a query parameter. The Realtime API requires the OpenAI-Beta: realtime=v1 header too.
websockets.exceptions.InvalidStatusCode: 429 – You’ve hit the rate limit. The Realtime API has separate rate limits from the REST API. Back off and retry with exponential delay:
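One way to sketch that backoff. The 429 detection here just inspects the exception text, so adapt it to the exact exception class your websockets version raises:

```python
import asyncio
import random

def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Exponential delays: 1s, 2s, 4s, ... capped at 30s."""
    return [min(cap, base * 2 ** i) for i in range(max_retries)]

async def connect_with_retry(connect):
    """Call an async connect() factory, retrying on 429 handshake failures."""
    for delay in backoff_delays():
        try:
            return await connect()
        except Exception as exc:  # e.g. a websockets handshake error with 429
            if "429" not in str(exc):
                raise
            # Add jitter so parallel clients don't retry in lockstep.
            await asyncio.sleep(delay + random.uniform(0, 0.5))
    raise RuntimeError("rate limited: retries exhausted")
```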
Audio sounds garbled or too fast – You’re probably using the wrong sample rate. The Realtime API defaults to PCM16 at 24kHz. If your input is 16kHz or 44.1kHz, it’ll sound wrong on both ends. Match RATE = 24000 in your PyAudio config.
input_audio_buffer.speech_stopped fires too early – Users get cut off mid-sentence. Increase silence_duration_ms in the turn detection config. The default 500ms is aggressive for some speakers. Try 800ms or 1000ms.
Connection drops after 15 minutes – The Realtime API has a maximum session duration. For long-running applications, implement reconnection logic that creates a new session and replays the conversation context.
conversation.item.create returns an error about invalid content type – When sending audio items manually, make sure the content type is input_audio with base64 data, not input_text. For text messages, use input_text. Mixing them up causes silent failures or error events.
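For reference, the two item shapes side by side; field values are illustrative:

```python
import base64

# Text items use input_text; audio items use input_audio with base64 PCM16.
text_item = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": "What's the weather?"}],
    },
}

audio_item = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [{
            "type": "input_audio",
            "audio": base64.b64encode(b"\x00\x00" * 240).decode("ascii"),
        }],
    },
}
```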
No response.audio.delta events received – Check that your session modalities include "audio". If you configured modalities: ["text"], the model only returns text. Update the session to ["text", "audio"] for voice output.
Related Guides
- How to Use the Anthropic Tool Use API for Agentic Workflows
- How to Use the Anthropic Prompt Caching API with Context Blocks
- How to Use the Fireworks AI API for Fast Open-Source LLMs
- How to Use the Stability AI API for Image and Video Generation
- How to Use the AWS Bedrock Converse API for Multi-Model Chat
- How to Use the Cerebras API for Fast LLM Inference
- How to Use the xAI Grok API for Chat and Function Calling
- How to Use the Weights and Biases Prompts API for LLM Tracing
- How to Use the Anthropic Multi-Turn Conversation API with Tool Use
- How to Use the Together AI API for Open-Source LLMs