StreamDiffusion rewrites how diffusion models handle sequential frames. Instead of running full denoising per image, it batches denoising steps across a sliding window of frames, cutting latency to the point where you get 100+ FPS text-to-image on an RTX 4090. This is the library to reach for when you need Stable Diffusion at interactive speeds.

Quick Start: Text-to-Image in a Loop

Here is the fastest path to generating images with StreamDiffusion. This uses sd-turbo with a single denoising step and the Tiny VAE for maximum throughput.

import torch
from diffusers import AutoencoderTiny, StableDiffusionPipeline
from streamdiffusion import StreamDiffusion
from streamdiffusion.image_utils import postprocess_image

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/sd-turbo").to(
    device=torch.device("cuda"),
    dtype=torch.float16,
)

stream = StreamDiffusion(
    pipe,
    t_index_list=[0],
    torch_dtype=torch.float16,
    cfg_type="none",
)

# sd-turbo is already step-distilled, so no LCM-LoRA is needed here
stream.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd").to(
    device=pipe.device, dtype=pipe.dtype
)
pipe.enable_xformers_memory_efficient_attention()

stream.prepare("a cyberpunk cityscape at sunset, neon lights, rain")

# Warmup pass — run at least len(t_index_list) * frame_buffer_size iterations
for _ in range(4):
    stream()

# Generate frames
for i in range(20):
    x_output = stream.txt2img()
    image = postprocess_image(x_output, output_type="pil")[0]
    image.save(f"frame_{i:03d}.png")

The t_index_list controls which denoising timesteps to execute. With sd-turbo, a single step ([0]) is enough. For models like KBlueLeaf/kohaku-v2.1 with LCM-LoRA, use [0, 16, 32, 45] for four-step generation.
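To make the index arithmetic concrete, here is a small plain-Python sketch — not the library's code; the real schedule comes from the diffusers scheduler attached to the pipeline — of how t_index_list selects timesteps from the inference schedule, assuming a 1000-step training schedule and 50 inference steps:

```python
# Conceptual sketch (not StreamDiffusion's implementation): t_index_list
# indexes into the num_inference_steps-long timestep table, highest noise
# first. Evenly spaced timesteps approximate what the scheduler produces.

def timestep_schedule(num_train_timesteps: int = 1000, num_inference_steps: int = 50):
    """Evenly spaced timesteps, noisiest first."""
    step = num_train_timesteps // num_inference_steps
    return [num_train_timesteps - 1 - i * step for i in range(num_inference_steps)]

schedule = timestep_schedule()

# Four-step LCM-style config: low indices are the noisiest steps.
t_index_list = [0, 16, 32, 45]
selected = [schedule[i] for i in t_index_list]
print(selected)  # → [999, 679, 359, 99]
```

Index 0 is the noisiest step; indices near num_inference_steps - 1 barely perturb the input, which is why higher indices keep img2img output closer to the source frame.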

Installation

StreamDiffusion requires Python 3.10, PyTorch with CUDA, and xformers.

conda create -n streamdiffusion python=3.10
conda activate streamdiffusion

pip3 install torch==2.1.0 torchvision==0.16.0 xformers \
  --index-url https://download.pytorch.org/whl/cu121

pip install git+https://github.com/cumulo-autumn/StreamDiffusion.git@main#egg=streamdiffusion[tensorrt]

# Install TensorRT extension for maximum performance
python -m streamdiffusion.tools.install-tensorrt

If you want to modify the source or run the bundled examples, clone the repo instead:

git clone https://github.com/cumulo-autumn/StreamDiffusion.git
cd StreamDiffusion
pip install -e .[tensorrt]
python -m streamdiffusion.tools.install-tensorrt

Image-to-Image Streaming

The img2img path takes an input image and applies the prompt as a style transfer. This is the building block for webcam pipelines and screen capture tools.

import torch
from PIL import Image
from diffusers import AutoencoderTiny, StableDiffusionPipeline
from streamdiffusion import StreamDiffusion
from streamdiffusion.image_utils import postprocess_image

pipe = StableDiffusionPipeline.from_pretrained("KBlueLeaf/kohaku-v2.1").to(
    device=torch.device("cuda"),
    dtype=torch.float16,
)

stream = StreamDiffusion(
    pipe,
    t_index_list=[32, 45],
    torch_dtype=torch.float16,
)

stream.load_lcm_lora()
stream.fuse_lora()
stream.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd").to(
    device=pipe.device, dtype=pipe.dtype
)
pipe.enable_xformers_memory_efficient_attention()

stream.prepare("oil painting style, vibrant colors, impressionist")

input_image = Image.open("photo.jpg").resize((512, 512))

# Warmup with the input image
for _ in range(2):
    stream(input_image)

# Stream — each call returns the next processed frame
x_output = stream(input_image)
result = postprocess_image(x_output, output_type="pil")[0]
result.save("stylized.png")

For img2img, t_index_list=[32, 45] is a good default. Lower values give more creative reinterpretation; higher values stick closer to the input.

Using StreamDiffusionWrapper for Cleaner Code

The StreamDiffusionWrapper in utils/wrapper.py bundles model loading, VAE swapping, LoRA loading, and acceleration into a single constructor. This is what the official examples use.

import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), "StreamDiffusion"))
from utils.wrapper import StreamDiffusionWrapper

stream = StreamDiffusionWrapper(
    model_id_or_path="stabilityai/sd-turbo",
    t_index_list=[0],
    frame_buffer_size=1,
    width=512,
    height=512,
    warmup=10,
    acceleration="xformers",  # or "tensorrt"
    mode="txt2img",
    use_lcm_lora=False,       # sd-turbo doesn't need LCM-LoRA
    use_tiny_vae=True,
    cfg_type="none",
    use_denoising_batch=True,
    seed=42,
)

stream.prepare(
    prompt="a red fox in a snowy forest, photorealistic",
    num_inference_steps=50,
    guidance_scale=1.2,
)

# Generate — the wrapper returns a PIL Image
output_image = stream()
output_image.save("fox.png")

The wrapper handles the warmup loop internally. You call prepare() once, then call the instance repeatedly to get frames.

Webcam Feed Integration

StreamDiffusion ships with a screen capture example, and adapting the same idea to a webcam is straightforward. The pattern below keeps everything in a single loop: capture a frame, run it through the stream, and display the result.

import os
import sys
import cv2
import torch
import numpy as np
from PIL import Image

sys.path.append(os.path.join(os.path.dirname(__file__), "StreamDiffusion"))
from utils.wrapper import StreamDiffusionWrapper

stream = StreamDiffusionWrapper(
    model_id_or_path="stabilityai/sd-turbo",
    t_index_list=[32, 45],
    frame_buffer_size=1,
    width=512,
    height=512,
    warmup=10,
    acceleration="xformers",
    mode="img2img",
    use_lcm_lora=False,
    use_tiny_vae=True,
    cfg_type="none",
    use_denoising_batch=True,
)

stream.prepare(
    prompt="anime style portrait, studio ghibli, soft lighting",
    num_inference_steps=50,
    guidance_scale=1.2,
)

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Convert OpenCV BGR to PIL RGB
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    pil_image = Image.fromarray(frame_rgb).resize((512, 512))

    # Process through StreamDiffusion
    output = stream(image=pil_image)

    # Display result
    output_np = np.array(output)
    output_bgr = cv2.cvtColor(output_np, cv2.COLOR_RGB2BGR)
    cv2.imshow("StreamDiffusion", output_bgr)

    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()

Frame rate depends on your GPU. On an RTX 4090 with TensorRT, expect 90+ FPS for img2img with sd-turbo. On an RTX 3080, plan for 30-40 FPS with xformers acceleration.
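To check what your own setup achieves, timing the generation loop is enough. The helper below is a generic sketch (it is not part of StreamDiffusion): pass it any zero-argument frame producer, such as `lambda: stream.txt2img()`.

```python
import time

def measure_fps(produce_frame, num_frames: int = 100, warmup: int = 10) -> float:
    """Average frames per second of `produce_frame`, excluding warmup calls."""
    for _ in range(warmup):          # let caches and engines settle first
        produce_frame()
    start = time.perf_counter()
    for _ in range(num_frames):
        produce_frame()
    elapsed = time.perf_counter() - start
    return num_frames / elapsed

# Usage with StreamDiffusion (GPU required):
# fps = measure_fps(lambda: stream.txt2img())
# print(f"{fps:.1f} FPS")
```

Measure after warmup: the first few calls include pipeline filling and, with TensorRT, engine loading, and will drag the average down.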

Performance Optimization

TensorRT Acceleration

TensorRT builds optimized CUDA engines for the UNet, which is the bottleneck in diffusion inference. When using the low-level StreamDiffusion API directly:

from streamdiffusion.acceleration.tensorrt import accelerate_with_tensorrt

stream = accelerate_with_tensorrt(
    stream,
    "engines",
    max_batch_size=2,
)

With the StreamDiffusionWrapper, just pass acceleration="tensorrt" in the constructor. The first run builds the engine files (stored in the engines/ directory), which takes several minutes. Subsequent runs load them instantly.

Stochastic Similarity Filter

When processing video or webcam input, consecutive frames are often nearly identical. The similarity filter skips redundant computation:

stream.enable_similar_image_filter(
    threshold=0.98,
    max_skip_frame=10,
)

This compares each input frame to the previous one; when similarity exceeds the threshold, the diffusion pass is skipped (stochastically — the skip probability rises with similarity) and the cached output is reused, up to the configured maximum number of consecutive skipped frames. On mostly-static scenes, this can cut GPU usage by 50% or more.
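A simplified, deterministic sketch of the skip logic makes the mechanism clear. This is not the library's implementation — the real filter is stochastic and measures similarity on image tensors; here a caller-supplied similarity function and a hard threshold stand in for both.

```python
# Toy similarity gate (not StreamDiffusion's code): return the cached
# output for near-duplicate inputs, otherwise run the expensive step.

class SimilarityGate:
    def __init__(self, similarity_fn, threshold=0.98, max_skip_frames=10):
        self.similarity_fn = similarity_fn
        self.threshold = threshold
        self.max_skip_frames = max_skip_frames
        self.prev_input = None
        self.cached_output = None
        self.skipped = 0

    def __call__(self, frame, process):
        """Skip `process` for near-duplicate frames, up to max_skip_frames in a row."""
        if (
            self.prev_input is not None
            and self.skipped < self.max_skip_frames
            and self.similarity_fn(frame, self.prev_input) >= self.threshold
        ):
            self.skipped += 1
            return self.cached_output
        self.prev_input = frame
        self.cached_output = process(frame)
        self.skipped = 0
        return self.cached_output
```

The max-skip cap matters: without it, a perfectly static scene would freeze the output forever, and slow drifts below the threshold would never be picked up.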

Batch Denoising

StreamDiffusion’s core innovation is use_denoising_batch=True. Instead of running N denoising steps sequentially for each frame, it interleaves steps across frames in a batch. Frame N gets step 1, frame N-1 gets step 2, and so on. This fills GPU utilization gaps and is the main reason it hits 100+ FPS. Keep it enabled unless you are debugging.
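A toy simulation — plain Python, no GPU, and not the library's code — makes the pipelining concrete: each call pushes a new frame into a buffer, one batched "UNet call" advances every buffered frame by one step, and a frame is emitted once it has received all N steps.

```python
# Toy model of batched denoising: with N steps, each call advances N
# in-flight frames by one step each, so after an N-call pipeline fill,
# one finished frame comes out per call.

class DenoisingPipelineToy:
    def __init__(self, num_steps: int):
        self.num_steps = num_steps
        self.in_flight = []  # list of [frame_id, steps_done]

    def __call__(self, frame_id):
        self.in_flight.append([frame_id, 0])
        for entry in self.in_flight:   # one batched call covers all of these
            entry[1] += 1
        if self.in_flight[0][1] == self.num_steps:
            done = self.in_flight.pop(0)
            return done[0]             # this frame is fully denoised
        return None                    # pipeline still filling

pipe = DenoisingPipelineToy(num_steps=4)
outputs = [pipe(i) for i in range(8)]
print(outputs)  # → [None, None, None, 0, 1, 2, 3, 4]
```

The first N - 1 calls return nothing while the pipeline fills; after that, throughput is one frame per call even though each frame still takes N steps end to end. That is the latency-for-throughput trade the paper describes.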

Common Errors and Fixes

RuntimeError: Expected all tensors to be on the same device

This happens when the VAE and pipeline land on different devices. After loading the Tiny VAE, explicitly move it:

stream.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd").to(
    device=pipe.device, dtype=pipe.dtype
)

ModuleNotFoundError: No module named 'tensorrt'

You skipped the TensorRT install step. Run python -m streamdiffusion.tools.install-tensorrt after installing the main package. If that still fails, fall back to acceleration="xformers" which gives 70-80% of TensorRT’s speed without the setup headaches.

CUDA out of memory on 8 GB GPUs

The default configuration assumes 12+ GB VRAM. Reduce memory usage by lowering frame_buffer_size to 1 and keeping width and height at 512. Using cfg_type="none" instead of "full" halves UNet memory since it skips the unconditional pass.

Warmup errors or garbled first frames

The warmup loop must run at least len(t_index_list) * frame_buffer_size iterations before outputs stabilize. If you are using four denoising steps, run at least 4 warmup calls. The StreamDiffusionWrapper handles this automatically when you set the warmup parameter.

xformers not found or version mismatch

xformers is sensitive to the exact PyTorch and CUDA versions. Install them together from the same index URL:

pip3 install torch==2.1.0 torchvision==0.16.0 xformers \
  --index-url https://download.pytorch.org/whl/cu121

Do not install xformers separately from a different source. Version mismatches cause silent failures or segfaults.