Waiting for a full LLM response before showing anything to the user feels broken. A 2000-token answer from GPT-4.1 or Claude can take 5-10 seconds to generate. Streaming fixes this by sending tokens as they are produced, using Server-Sent Events (SSE) over a single HTTP connection. Time-to-first-token drops to under a second.
Here is the fastest way to stream from both major providers, plus how to build your own streaming proxy with FastAPI.
Streaming with the OpenAI SDK
Pass stream=True and you get back an iterator of event objects instead of a single response. The key event type is response.output_text.delta – each one carries a delta string containing the next chunk of text. The flush=True matters: without it, Python buffers stdout and you lose the streaming effect in terminals.
Streaming with the Anthropic SDK
Anthropic’s SDK uses a context-managed stream, which is the better pattern because it guarantees cleanup of the underlying HTTP connection.
The .text_stream property gives you just the text deltas, stripping away event metadata. If you need the raw SSE events (for tracking input tokens, stop reasons, etc.), use stream.events() instead.
One thing worth calling out: Anthropic uses messages.stream() as a separate method, not a stream=True flag. This is a deliberate API design choice that makes it harder to accidentally ignore the stream context manager.
How SSE Works Under the Hood
Both providers use Server-Sent Events. The HTTP response has Content-Type: text/event-stream and the body is a series of events, each a data: line terminated by a blank line:
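The exact field names in the payload vary by provider; the shape on the wire looks roughly like this:

```
data: {"type":"response.output_text.delta","delta":"Hel"}

data: {"type":"response.output_text.delta","delta":"lo"}

data: [DONE]
```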
Each data: line is a JSON object. The blank lines between events are part of the SSE spec – they signal event boundaries. The SDKs handle parsing all of this, but understanding the wire format helps when you need to debug proxy issues or build custom clients.
Async Streaming
For web servers and concurrent workloads, use the async clients. Both SDKs support this natively.
The OpenAI async version is nearly identical – swap OpenAI for AsyncOpenAI and add async/await:
Use async streaming whenever you are handling multiple concurrent requests. The sync versions block the event loop and will tank your server throughput.
Building a Streaming FastAPI Endpoint
The most common real-world pattern is proxying LLM streams through your own API. FastAPI’s StreamingResponse pairs perfectly with SSE.
Run it with uvicorn main:app --reload and consume it from JavaScript:
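A sketch of the client side with fetch() and a ReadableStream reader (the /chat path, request shape, and function name are assumptions):

```javascript
// Read a streamed response body chunk-by-chunk and hand decoded text to a callback.
async function consumeStream(response, onChunk) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break; // server closed the stream
    onChunk(decoder.decode(value, { stream: true }));
  }
}

// In the browser, against a POST streaming endpoint:
// const response = await fetch("/chat", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify({ prompt: "Explain SSE" }),
// });
// await consumeStream(response, (text) => outputElement.append(text));
```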
Set Cache-Control: no-cache and Connection: keep-alive on the response. Without these, reverse proxies like nginx will buffer the entire response and defeat the purpose of streaming.
Handling Partial Chunks
Tokens do not always arrive as clean words. You will get chunks like " th", "ink", "ing" for the word “thinking.” This matters when you are doing any post-processing on the stream, such as rendering Markdown or detecting code blocks.
The safest approach: accumulate text in a buffer and only process complete units.
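A minimal sketch of that buffering idea, yielding complete sentences from a token stream (the sentence-boundary regex is a simplification):

```python
import re

def sentences_from_stream(chunks):
    """Accumulate partial chunks and yield only complete sentences."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A sentence is complete once we see ., !, or ? followed by whitespace.
        while True:
            match = re.search(r"[.!?]\s", buffer)
            if not match:
                break  # no complete sentence yet; keep buffering
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream
```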
For Markdown rendering, you need a similar approach but split on code fence boundaries (triple backticks) so you do not try to render half a code block.
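A sketch of that fence-aware split (function name is illustrative): segments before the last fence boundary are safe to render, and everything after it stays pending.

```python
def split_on_fences(buffer: str):
    """Split accumulated text at ``` fences.

    Returns (segments, pending): segments is a list of (is_code, text)
    pairs that are safe to render; pending is the trailing text that may
    still be inside an unterminated block.
    """
    parts = buffer.split("```")
    complete, pending = parts[:-1], parts[-1]
    # Odd-indexed parts fall between an opening and closing fence.
    segments = [(i % 2 == 1, seg) for i, seg in enumerate(complete)]
    return segments, pending
```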
Common Errors
openai.APIConnectionError: Connection error – Your network is blocking the SSE connection, or a proxy is terminating it early. Check if you are behind a corporate firewall. SSE requires long-lived HTTP connections, and some proxies kill them after 30 seconds.
anthropic.APIStatusError: 529 Overloaded – Anthropic’s servers are at capacity. Implement exponential backoff. Unlike non-streaming requests, you cannot retry mid-stream – you have to restart from the beginning.
Chunks arrive but the stream hangs before [DONE] – Usually a max_tokens issue. The model hit the token limit and stopped generating, but some SDK versions do not emit a clear stop event. Always set max_tokens explicitly and check the stop_reason in the final message event.
TypeError: 'async_generator' object is not iterable – You used for instead of async for on an async stream. This is the most common mistake when switching between sync and async code paths.
nginx buffering kills streaming – Add proxy_buffering off; to your nginx config, or send an X-Accel-Buffering: no response header from your app. Without one of these, nginx collects the entire response before forwarding it, making streaming pointless.
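A sketch of the relevant nginx location block (the path and upstream address are assumptions):

```nginx
location /chat {
    proxy_pass http://127.0.0.1:8000;
    proxy_buffering off;        # forward chunks to the client immediately
    proxy_cache off;
    proxy_http_version 1.1;     # required for chunked upstream responses
    proxy_set_header Connection '';
}
```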
Browser EventSource does not support POST – The native EventSource API only works with GET requests. For POST-based streaming endpoints, use fetch() with ReadableStream as shown in the FastAPI section above, or use a library like sse.js.
Related Guides
- How to Use Prompt Caching to Cut LLM API Costs
- How to Build Prompt Chains with Async LLM Calls and Batching
- How to Build Prompt Templates with Python F-Strings and Chat Markup
- How to Route Prompts to the Best LLM with a Semantic Router
- How to Build Multi-Turn Chatbots with Conversation Memory
- How to Build Token-Efficient Prompt Batching with LLM APIs
- How to Build Prompt Regression Tests with LLM-as-Judge
- How to Build Prompt Versioning and Regression Testing for LLMs
- How to Build Prompt Caching Strategies for Multi-Turn LLM Sessions
- How to Build LLM Output Validators with Instructor and Pydantic