Waiting for a full LLM response before showing anything to the user feels broken. A 2000-token answer from GPT-4.1 or Claude can take 5-10 seconds to generate. Streaming fixes this by sending tokens as they are produced, using Server-Sent Events (SSE) over a single HTTP connection. Time-to-first-token drops to under a second.
Here is the fastest way to stream from both major providers, plus how to build your own streaming proxy with FastAPI.
Streaming with the OpenAI SDK
Pass stream=True and you get back an iterator of event objects instead of a single response. The key event type is response.output_text.delta – each one carries a delta string containing the next chunk of text. The flush=True matters: without it, Python buffers stdout and you lose the streaming effect in terminals.
Streaming with the Anthropic SDK
Anthropic’s SDK uses a context-managed stream, which is the better pattern because it guarantees cleanup of the underlying HTTP connection.
The .text_stream property gives you just the text deltas, stripping away event metadata. If you need the raw SSE events (for tracking input tokens, stop reasons, etc.), use stream.events() instead.
One thing worth calling out: Anthropic uses messages.stream() as a separate method, not a stream=True flag. This is a deliberate API design choice that makes it harder to accidentally ignore the stream context manager.
How SSE Works Under the Hood
Both providers use Server-Sent Events. The HTTP response has Content-Type: text/event-stream and the body is a series of events, each a data: line terminated by a blank line:
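The exact field names in the payload vary by provider; the shape on the wire looks roughly like this:

```
data: {"type":"response.output_text.delta","delta":"Hel"}

data: {"type":"response.output_text.delta","delta":"lo"}

data: [DONE]
```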
Each data: line is a JSON object. The blank lines between events are part of the SSE spec – they signal event boundaries. The SDKs handle parsing all of this, but understanding the wire format helps when you need to debug proxy issues or build custom clients.
Async Streaming
For web servers and concurrent workloads, use the async clients. Both SDKs support this natively.
The OpenAI async version is nearly identical – swap OpenAI for AsyncOpenAI and add async/await:
Use async streaming whenever you are handling multiple concurrent requests. The sync versions block the event loop and will tank your server throughput.
Building a Streaming FastAPI Endpoint
The most common real-world pattern is proxying LLM streams through your own API. FastAPI’s StreamingResponse pairs perfectly with SSE.
Run it with uvicorn main:app --reload and consume it from JavaScript:
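A sketch of the client side with fetch() and a ReadableStream reader (the /chat path, request shape, and function name are assumptions):

```javascript
// Read a streamed response body chunk-by-chunk and hand decoded text to a callback.
async function consumeStream(response, onChunk) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break; // server closed the stream
    onChunk(decoder.decode(value, { stream: true }));
  }
}

// In the browser, against a POST streaming endpoint:
// const response = await fetch("/chat", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify({ prompt: "Explain SSE" }),
// });
// await consumeStream(response, (text) => outputElement.append(text));
```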
Set Cache-Control: no-cache and Connection: keep-alive on the response. Without these, reverse proxies like nginx will buffer the entire response and defeat the purpose of streaming.
Handling Partial Chunks
Tokens do not always arrive as clean words. You will get chunks like " th", "ink", "ing" for the word “thinking.” This matters when you are doing any post-processing on the stream, such as rendering Markdown or detecting code blocks.
The safest approach: accumulate text in a buffer and only process complete units.
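A minimal sketch of that buffering idea, yielding complete sentences from a token stream (the sentence-boundary regex is a simplification):

```python
import re

def sentences_from_stream(chunks):
    """Accumulate partial chunks and yield only complete sentences."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A sentence is complete once we see ., !, or ? followed by whitespace.
        while True:
            match = re.search(r"[.!?]\s", buffer)
            if not match:
                break  # no complete sentence yet; keep buffering
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream
```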
For Markdown rendering, you need a similar approach but split on code fence boundaries (triple backticks) so you do not try to render half a code block.
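A sketch of that fence-aware split (function name is illustrative): segments before the last fence boundary are safe to render, and everything after it stays pending.

```python
def split_on_fences(buffer: str):
    """Split accumulated text at ``` fences.

    Returns (segments, pending): segments is a list of (is_code, text)
    pairs that are safe to render; pending is the trailing text that may
    still be inside an unterminated block.
    """
    parts = buffer.split("```")
    complete, pending = parts[:-1], parts[-1]
    # Odd-indexed parts fall between an opening and closing fence.
    segments = [(i % 2 == 1, seg) for i, seg in enumerate(complete)]
    return segments, pending
```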
Common Errors
openai.APIConnectionError: Connection error – Your network is blocking the SSE connection, or a proxy is terminating it early. Check if you are behind a corporate firewall. SSE requires long-lived HTTP connections, and some proxies kill them after 30 seconds.
anthropic.APIStatusError: 529 Overloaded – Anthropic’s servers are at capacity. Implement exponential backoff. Unlike non-streaming requests, you cannot retry mid-stream – you have to restart from the beginning.
Chunks arrive but the stream hangs before [DONE] – Usually a max_tokens issue. The model hit the token limit and stopped generating, but some SDK versions do not emit a clear stop event. Always set max_tokens explicitly and check the stop_reason in the final message event.
TypeError: 'async_generator' object is not iterable – You used for instead of async for on an async stream. This is the most common mistake when switching between sync and async code paths.
nginx buffering kills streaming – Add proxy_buffering off; to your nginx config, or send an X-Accel-Buffering: no response header from your app. Without one of these, nginx collects the entire response before forwarding it, making streaming pointless.
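A sketch of the relevant nginx location block (the path and upstream address are assumptions):

```nginx
location /chat {
    proxy_pass http://127.0.0.1:8000;
    proxy_buffering off;        # forward chunks to the client immediately
    proxy_cache off;
    proxy_http_version 1.1;     # required for chunked upstream responses
    proxy_set_header Connection '';
}
```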
Browser EventSource does not support POST – The native EventSource API only works with GET requests. For POST-based streaming endpoints, use fetch() with ReadableStream as shown in the FastAPI section above, or use a library like sse.js.
Related Guides
- How to Use Prompt Caching to Cut LLM API Costs
- How to Build Prompt Chains with Async LLM Calls and Batching
- How to Build Prompt Templates with Python F-Strings and Chat Markup
- How to Route Prompts to the Best LLM with a Semantic Router
- How to Build Multi-Turn Chatbots with Conversation Memory
- How to Build Token-Efficient Prompt Batching with LLM APIs
- How to Build Prompt Regression Tests with LLM-as-Judge
- How to Build Prompt Versioning and Regression Testing for LLMs
- How to Build Prompt Caching Strategies for Multi-Turn LLM Sessions
- How to Build LLM Output Validators with Instructor and Pydantic