The OpenAI Realtime API gives you a persistent WebSocket connection to GPT-4o with native audio understanding. Instead of transcribing speech to text, sending it to a chat endpoint, and then synthesizing the response back to audio, the Realtime API handles all of that in a single round trip. Latency drops dramatically.
Here’s the fastest way to connect and get a response:
Install the dependency with pip install websockets. Set your API key as OPENAI_API_KEY in your environment. That’s all you need to get a working connection.
Configuring the Session
After the WebSocket connects, the server sends a session.created event with default settings. You almost always want to override those defaults with a session.update event. This is where you set the model behavior, voice, and which modalities you want.
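A representative session.update, sent right after connecting. Values like the instructions string, the voice, and the 800 ms silence window are illustrative choices, not required defaults:

```python
import json

# Field values below are assumptions; tune them for your application.
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["text", "audio"],
        "instructions": "You are a friendly voice assistant. Keep answers brief.",
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {"model": "whisper-1"},
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "silence_duration_ms": 800,
        },
    },
}

# After the socket is open: await ws.send(payload)
payload = json.dumps(session_update)
```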
Key settings to know:
- modalities controls what the model can receive and produce. Set ["text", "audio"] for voice apps. Use ["text"] if you only need text output.
- voice picks the TTS voice for audio responses. Options include alloy, echo, shimmer, ash, ballad, coral, sage, and verse.
- turn_detection with server_vad lets the server detect when the user stops speaking and automatically triggers a response. Set silence_duration_ms higher (like 800) if users are getting cut off mid-sentence.
- input_audio_transcription is optional but useful. When enabled, the server sends conversation.item.input_audio_transcription.completed events with the text of what the user said.
Sending Audio Input
For voice applications, you stream raw audio to the server using input_audio_buffer.append events. The audio must be base64-encoded PCM16 at 24kHz mono (the default format). Here’s how to capture from a microphone and stream it:
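A sketch of microphone streaming, assuming PyAudio (pip install pyaudio) for capture; append_event is a helper name introduced here for illustration:

```python
import base64
import json

def append_event(chunk: bytes) -> str:
    """Wrap one raw PCM16 chunk as an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })

async def stream_microphone(ws, seconds: float = 5.0) -> None:
    """Capture 24 kHz mono PCM16 from the default mic and stream it."""
    import pyaudio  # pip install pyaudio

    RATE = 24000   # must match the API's default pcm16 @ 24 kHz format
    CHUNK = 1024   # frames per read (~43 ms of audio)

    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    try:
        for _ in range(int(RATE / CHUNK * seconds)):
            chunk = stream.read(CHUNK, exception_on_overflow=False)
            await ws.send(append_event(chunk))
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()
```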
If you have server VAD enabled, the server automatically detects speech boundaries and triggers a response. If you disabled VAD (set turn_detection to null), you need to manually commit the buffer and request a response:
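With VAD off, the two follow-up events are plain JSON. A minimal sketch, where ws is your open connection:

```python
import json

# Finalize the buffered audio as a user turn, then request a response.
commit = json.dumps({"type": "input_audio_buffer.commit"})
create = json.dumps({"type": "response.create"})

# await ws.send(commit)
# await ws.send(create)
```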
Audio responses come back as response.audio.delta events with base64-encoded chunks. Decode them and write to your audio output device or save to a file.
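A sketch of saving those deltas to a WAV file with the standard library's wave module; write_audio is an illustrative helper name:

```python
import base64
import wave

def write_audio(events, path="reply.wav"):
    """Collect response.audio.delta chunks and save them as a mono WAV."""
    pcm = b"".join(
        base64.b64decode(e["delta"])
        for e in events
        if e.get("type") == "response.audio.delta"
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit PCM
        wav.setframerate(24000)  # Realtime API default rate
        wav.writeframes(pcm)
    return path
```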
Adding Function Calling
The Realtime API supports tool use the same way the Chat Completions API does. You define tools in the session config, and the model calls them when relevant. This is where things get powerful – your voice assistant can look up data, control devices, or call external APIs mid-conversation.
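A sketch of a session.update that registers a hypothetical get_weather tool. Note that the Realtime tool schema is flat (name, description, and parameters at the top level) rather than nested under a function key as in Chat Completions:

```python
# get_weather is an illustrative tool, not a real API.
session_update = {
    "type": "session.update",
    "session": {
        "tools": [{
            "type": "function",
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        }],
        "tool_choice": "auto",
    },
}
```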
The flow works like this: the model receives your message, decides to call get_weather, and sends a response.function_call_arguments.done event. You execute the function locally, send the result as a function_call_output conversation item, then trigger another response.create. The model generates its final answer using the tool result.
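A sketch of that round trip on the client side; handle_function_call is a hypothetical helper that turns the done event into the two messages you send back:

```python
import json

def handle_function_call(event, get_weather):
    """Given a response.function_call_arguments.done event, run the local
    tool and build the function_call_output item plus the follow-up
    response.create. get_weather is your local implementation."""
    args = json.loads(event["arguments"])
    result = get_weather(**args)
    output_item = {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],
            "output": json.dumps(result),
        },
    }
    # Send output_item first, then the response.create, over the socket.
    return output_item, {"type": "response.create"}
```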
You can define multiple tools. The model picks which ones to call based on the conversation context, just like with the Chat Completions API.
Common Errors and Fixes
401 Unauthorized on connect – Your API key is missing or invalid. Make sure you’re passing the Authorization header, not a query parameter. The Realtime API requires the OpenAI-Beta: realtime=v1 header too.
websockets.exceptions.InvalidStatusCode: 429 – You’ve hit the rate limit. The Realtime API has separate rate limits from the REST API. Back off and retry with exponential delay:
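One way to sketch that backoff. The 429 detection here just inspects the exception text, so adapt it to the exact exception class your websockets version raises:

```python
import asyncio
import random

def backoff_delays(max_retries=5, base=1.0, cap=30.0):
    """Exponential delays: 1s, 2s, 4s, ... capped at 30s."""
    return [min(cap, base * 2 ** i) for i in range(max_retries)]

async def connect_with_retry(connect):
    """Call an async connect() factory, retrying on 429 handshake failures."""
    for delay in backoff_delays():
        try:
            return await connect()
        except Exception as exc:  # e.g. a websockets handshake error with 429
            if "429" not in str(exc):
                raise
            # Add jitter so parallel clients don't retry in lockstep.
            await asyncio.sleep(delay + random.uniform(0, 0.5))
    raise RuntimeError("rate limited: retries exhausted")
```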
Audio sounds garbled or too fast – You’re probably using the wrong sample rate. The Realtime API defaults to PCM16 at 24kHz. If your input is 16kHz or 44.1kHz, it’ll sound wrong on both ends. Match RATE = 24000 in your PyAudio config.
input_audio_buffer.speech_stopped fires too early – Users get cut off mid-sentence. Increase silence_duration_ms in the turn detection config. The default 500ms is aggressive for some speakers. Try 800ms or 1000ms.
Connection drops after 15 minutes – The Realtime API has a maximum session duration. For long-running applications, implement reconnection logic that creates a new session and replays the conversation context.
conversation.item.create returns an error about invalid content type – When sending audio items manually, make sure the content type is input_audio with base64 data, not input_text. For text messages, use input_text. Mixing them up causes silent failures or error events.
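For reference, the two item shapes side by side; field values are illustrative:

```python
import base64

# Text items use input_text; audio items use input_audio with base64 PCM16.
text_item = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": "What's the weather?"}],
    },
}

audio_item = {
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [{
            "type": "input_audio",
            "audio": base64.b64encode(b"\x00\x00" * 240).decode("ascii"),
        }],
    },
}
```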
No response.audio.delta events received – Check that your session modalities include "audio". If you configured modalities: ["text"], the model only returns text. Update the session to ["text", "audio"] for voice output.
Related Guides
- How to Use the Anthropic Tool Use API for Agentic Workflows
- How to Use the Anthropic Prompt Caching API with Context Blocks
- How to Use the Fireworks AI API for Fast Open-Source LLMs
- How to Use the Stability AI API for Image and Video Generation
- How to Use the AWS Bedrock Converse API for Multi-Model Chat
- How to Use the Cerebras API for Fast LLM Inference
- How to Use the xAI Grok API for Chat and Function Calling
- How to Use the Weights and Biases Prompts API for LLM Tracing
- How to Use the Anthropic Multi-Turn Conversation API with Tool Use
- How to Use the Together AI API for Open-Source LLMs