StreamDiffusion rewrites how diffusion models handle sequential frames. Instead of running full denoising per image, it batches denoising steps across a sliding window of frames, cutting latency to the point where you get 100+ FPS text-to-image on an RTX 4090. This is the library to reach for when you need Stable Diffusion at interactive speeds.
Quick Start: Text-to-Image in a Loop
Here is the fastest path to generating images with StreamDiffusion. This uses sd-turbo with a single denoising step and the Tiny VAE for maximum throughput.
The t_index_list controls which denoising timesteps to execute. With sd-turbo, a single step ([0]) is enough. For models like KBlueLeaf/kohaku-v2.1 with LCM-LoRA, use [0, 16, 32, 45] for four-step generation.
Installation
StreamDiffusion requires Python 3.10, PyTorch with CUDA, and xformers.
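A typical install sequence, assuming a CUDA 12.1 toolchain (adjust the wheel index for your driver):

```shell
# Install PyTorch and xformers from the same CUDA wheel index so versions match.
pip install torch torchvision xformers --index-url https://download.pytorch.org/whl/cu121

# Install StreamDiffusion, with the TensorRT extras if you plan to use them.
pip install "streamdiffusion[tensorrt]"
python -m streamdiffusion.tools.install-tensorrt
```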
If you want to modify the source or run the bundled examples, clone the repo instead:
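A source checkout with an editable install, so local changes take effect without reinstalling:

```shell
git clone https://github.com/cumulo-autumn/StreamDiffusion.git
cd StreamDiffusion
pip install -e .
```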
Image-to-Image Streaming
The img2img path takes an input image and applies the prompt as a style transfer. This is the building block for webcam pipelines and screen capture tools.
For img2img, t_index_list=[32, 45] is a good default. Lower values give more creative reinterpretation; higher values stick closer to the input.
Using StreamDiffusionWrapper for Cleaner Code
The StreamDiffusionWrapper in utils/wrapper.py bundles model loading, VAE swapping, LoRA loading, and acceleration into a single constructor. This is what the official examples use.
The wrapper handles the warmup loop internally. You call prepare() once, then call the instance repeatedly to get frames.
Webcam Feed Integration
StreamDiffusion ships with a screen capture example, but adapting it to a webcam is straightforward. The pattern is: capture frames in one thread, feed them through the stream in another, and display results.
Frame rate depends on your GPU. On an RTX 4090 with TensorRT, expect 90+ FPS for img2img with sd-turbo. On an RTX 3080, plan for 30-40 FPS with xformers acceleration.
Performance Optimization
TensorRT Acceleration
TensorRT builds optimized CUDA engines for the UNet, which is the bottleneck in diffusion inference. When using the low-level StreamDiffusion API directly:
With the StreamDiffusionWrapper, just pass acceleration="tensorrt" in the constructor. The first run builds the engine files (stored in the engines/ directory), which takes several minutes. Subsequent runs load them instantly.
Stochastic Similarity Filter
When processing video or webcam input, consecutive frames are often nearly identical. The similarity filter skips redundant computation:
This compares each input frame to the previous one and reuses the cached output when similarity exceeds the threshold. On mostly-static scenes, this can cut GPU usage by 50% or more.
Batch Denoising
StreamDiffusion’s core innovation is use_denoising_batch=True. Instead of running N denoising steps sequentially for each frame, it interleaves steps across frames in a batch. Frame N gets step 1, frame N-1 gets step 2, and so on. This fills GPU utilization gaps and is the main reason it hits 100+ FPS. Keep it enabled unless you are debugging.
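The interleaving can be illustrated with a toy model in plain Python (no GPU, no diffusion — the function name and structure are purely illustrative). Each "tick" stands for one batched UNet call that advances every in-flight frame by one denoising step:

```python
from collections import deque

NUM_STEPS = 4  # e.g. t_index_list=[0, 16, 32, 45]

def stream_batch(frame_ids):
    """Toy stream batching: one batched call per tick advances ALL in-flight
    frames one step, instead of finishing one frame before starting the next."""
    window = deque()             # [frame_id, steps_done] for in-flight frames
    finished = []
    pending = deque(frame_ids)
    while pending or window:
        if pending:
            window.append([pending.popleft(), 0])  # admit the next frame
        for frame in window:                        # one batched denoising call
            frame[1] += 1
        while window and window[0][1] == NUM_STEPS:  # oldest frame done
            finished.append(window.popleft()[0])
    return finished

print(stream_batch([1, 2, 3]))  # frames complete in arrival order: [1, 2, 3]
```

After the pipeline fills, every tick both admits a new frame and emits a finished one, so throughput is one frame per UNet call even though each frame still receives all four steps.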
Common Errors and Fixes
RuntimeError: Expected all tensors to be on the same device
This happens when the VAE and pipeline land on different devices. After loading the Tiny VAE, explicitly move it:
ModuleNotFoundError: No module named 'tensorrt'
You skipped the TensorRT install step. Run python -m streamdiffusion.tools.install-tensorrt after installing the main package. If that still fails, fall back to acceleration="xformers" which gives 70-80% of TensorRT’s speed without the setup headaches.
CUDA out of memory on 8 GB GPUs
The default configuration assumes 12+ GB VRAM. Reduce memory usage by lowering frame_buffer_size to 1 and keeping width and height at 512. Using cfg_type="none" instead of "full" halves UNet memory since it skips the unconditional pass.
Warmup errors or garbled first frames
The warmup loop must run at least len(t_index_list) * frame_buffer_size iterations before outputs stabilize. With four denoising steps and a frame_buffer_size of 1, that means at least four warmup calls. The StreamDiffusionWrapper handles this automatically when you set the warmup parameter.
xformers not found or version mismatch
xformers is sensitive to the exact PyTorch and CUDA versions. Install them together from the same index URL:
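For example, for a CUDA 12.1 build (swap the index URL to match your CUDA version):

```shell
# One command, one index: torch, torchvision, and xformers stay in lockstep.
pip install torch torchvision xformers --index-url https://download.pytorch.org/whl/cu121

# Verify the pair imports cleanly before blaming StreamDiffusion.
python -c "import torch, xformers; print(torch.__version__, xformers.__version__)"
```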
Do not install xformers separately from a different source. Version mismatches cause silent failures or segfaults.
Related Guides
- How to Build AI Sprite Sheet Generation with Stable Diffusion
- How to Build AI Logo Generation with Stable Diffusion and SDXL
- How to Build AI Scene Generation with Layered Diffusion
- How to Build AI Coloring Book Generation with Line Art Diffusion
- How to Build AI Pixel Art Generation with Stable Diffusion
- How to Control Image Generation with ControlNet and IP-Adapter
- How to Build AI Sticker and Emoji Generation with Stable Diffusion
- How to Build AI Sketch-to-Image Generation with ControlNet Scribble
- How to Build AI Comic Strip Generation with Stable Diffusion
- How to Build AI Wallpaper Generation with Stable Diffusion and Tiling