Generate a Video from a Single Image
Stable Video Diffusion (SVD) takes an input image and produces a short video – typically 2 to 4 seconds at 576x1024 resolution. It ships as a pipeline in the diffusers library, so the workflow is: load model, load image, call the pipeline, export frames.
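A minimal sketch of that workflow, using the SVD-XT checkpoint (the input filename and seed are placeholders):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # keeps VRAM usage manageable

# Conditioning image; SVD expects 1024x576 (width x height)
image = load_image("input.jpg").resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

export_to_video(frames, "generated.mp4", fps=7)
```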
First run downloads roughly 10GB of model weights. After that, everything loads from the Hugging Face cache. The `decode_chunk_size=8` parameter controls how many frames the VAE decodes at once – lower values use less VRAM, higher values are faster.
SVD vs SVD-XT: Pick the Right Variant
There are two model variants, and the choice matters.
| Variant | Model ID | Frames | Video Length | VRAM (fp16) |
|---|---|---|---|---|
| SVD | stabilityai/stable-video-diffusion-img2vid | 14 | ~2 seconds | ~6GB with offloading |
| SVD-XT | stabilityai/stable-video-diffusion-img2vid-xt | 25 | ~4 seconds | ~8GB with offloading |
SVD-XT is the better choice for almost everything. It was fine-tuned from SVD to generate 25 frames instead of 14, giving you smoother and longer clips. The extra VRAM cost is minimal if you enable CPU offloading.
Use plain SVD only when you need faster iteration – it generates in roughly half the time of SVD-XT.
Control Motion and Style
SVD exposes micro-conditioning parameters that let you tune how much the video moves and how closely it sticks to the source image.
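A sketch of how these parameters are passed to the pipeline call; the setup repeats the basic workflow, and the specific values are illustrative rather than recommendations:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()
image = load_image("input.jpg").resize((1024, 576))

frames = pipe(
    image,
    motion_bucket_id=180,     # more motion than the default 127
    noise_aug_strength=0.1,   # looser adherence to the source image
    fps=7,                    # conditioning framerate, not playback speed
    num_inference_steps=25,
    decode_chunk_size=8,
    generator=torch.manual_seed(42),
).frames[0]
```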
Here is what each parameter does:
- `motion_bucket_id` – Controls how much the scene moves. Default is 127; values range from 0 to 255. Set it to 180+ for dramatic motion or below 100 for subtle camera pans.
- `noise_aug_strength` – Adds noise to the conditioning image before generation. Higher values (0.1-0.3) give the model more creative freedom, but the output drifts further from your source. Default is 0.02.
- `fps` – The framerate used for conditioning during generation. This does not change the playback speed of the exported video – that's controlled by the `fps` argument in `export_to_video`. Setting this to 7 is a good default.
- `num_inference_steps` – Denoising iterations. Default is 25. Going above 30 rarely improves quality, but going below 20 degrades it noticeably.
Save as GIF
If you want a looping GIF instead of an MP4, the diffusers library has you covered.
You can also do it manually with Pillow for more control over optimization and loop count:
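A sketch using Pillow's `save` with `save_all`; the frame list is a stand-in for real pipeline output:

```python
from PIL import Image

# Stand-in for the pipeline's output: a list of PIL images
frames = [Image.new("RGB", (1024, 576), (i * 10, 0, 0)) for i in range(14)]

frames[0].save(
    "output.gif",
    save_all=True,
    append_images=frames[1:],  # remaining frames after the first
    duration=1000 // 7,        # ms per frame, ~7 fps playback
    loop=0,                    # 0 = loop forever
    optimize=True,             # shrink the palette where possible
)
```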
GIFs are convenient for previewing but the file sizes get large quickly. Stick with MP4 for anything you plan to share or store.
Memory Optimization
Video generation is memory-intensive because the model generates all frames at once. On a 24GB GPU you can run SVD-XT without much trouble. On 8-12GB cards, you need to stack optimizations.
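One way to stack the optimizations, sketched against the SVD-XT checkpoint (the input filename is a placeholder):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,   # fp16 halves weight memory
    variant="fp16",
)
pipe.enable_model_cpu_offload()      # move idle submodules to CPU
pipe.unet.enable_forward_chunking()  # chunk the UNet feed-forward layers

image = load_image("input.jpg").resize((1024, 576))

# decode_chunk_size=2: decode only 2 frames at a time in the VAE
frames = pipe(image, decode_chunk_size=2).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```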
Combining all three techniques – CPU offloading, forward chunking, and decode_chunk_size=2 – gets VRAM usage below 8GB. The tradeoff is speed: generation takes roughly 2-3x longer than running everything on the GPU.
| Optimization | VRAM Savings | Speed Impact |
|---|---|---|
| `torch.float16` (fp16) | ~50% | Faster on modern GPUs |
| `enable_model_cpu_offload()` | ~40% | Moderate slowdown |
| `enable_forward_chunking()` | ~15% | Slight slowdown |
| `decode_chunk_size=2` | ~20% | Slower VAE decoding |
| All combined | Below 8GB | 2-3x slower total |
If you have a 24GB+ GPU and want speed instead of savings, skip the offloading and compile the UNet:
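A sketch of that setup; `mode="reduce-overhead"` is one reasonable choice for repeated inference:

```python
import torch
from diffusers import StableVideoDiffusionPipeline

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")  # keep everything on the GPU; no offloading

# Compile the UNet; the first call is slow while the graph warms up
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```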
This gives a 20-25% speedup after the first inference call warms up the compiled graph.
Common Errors and Fixes
CUDA out of memory
This is the most common error. Three fixes, in order of preference:
- Enable CPU offloading: `pipe.enable_model_cpu_offload()`
- Lower `decode_chunk_size` to 2 or even 1
- Enable forward chunking: `pipe.unet.enable_forward_chunking()`
If you see absurdly high memory requests (like 39GB on a 24GB card), check your PyTorch version. PyTorch 1.x lacks Scaled Dot-Product Attention (SDPA), which causes the attention layers to allocate far more memory than necessary. Upgrading to PyTorch 2.0+ fixes this immediately.
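A quick way to check, sketched in Python; `scaled_dot_product_attention` only exists on PyTorch 2.0+:

```python
import torch
import torch.nn.functional as F

# SDPA shipped in PyTorch 2.0; if this prints False, upgrade with
#   pip install --upgrade torch
has_sdpa = hasattr(F, "scaled_dot_product_attention")
print("PyTorch:", torch.__version__)
print("SDPA available:", has_sdpa)
```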
Videos with no motion or only slow pans
SVD sometimes produces nearly static output, especially with certain source images. Fix this by increasing `motion_bucket_id` to 180-200 and bumping `noise_aug_strength` to 0.1.
If the image is very detailed or busy, the model tends to produce less motion. Try a cleaner composition with a clear subject and simple background.
Wrong resolution produces artifacts
SVD was trained on 576x1024 images. Feed it a different aspect ratio and you’ll get warped, stretched, or artifact-heavy output. Always resize your input:
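A minimal resize sketch with Pillow; the synthetic image stands in for your real source file:

```python
from PIL import Image

# Stand-in for your source image; swap in Image.open("input.jpg")
image = Image.new("RGB", (1920, 1080))

# SVD's native resolution: 1024x576 (width x height)
image = image.resize((1024, 576), Image.LANCZOS)
```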
If your source image has a different aspect ratio, crop before resizing rather than stretching. Center crop works well:
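A sketch of that approach; the helper name `center_crop_to_aspect` is ours, not a library function:

```python
from PIL import Image

def center_crop_to_aspect(img, target_w=1024, target_h=576):
    """Crop the largest centered region matching the target aspect ratio,
    then resize it to the target dimensions."""
    w, h = img.size
    target_ratio = target_w / target_h
    if w / h > target_ratio:
        # Too wide: trim the sides
        new_w = int(h * target_ratio)
        left = (w - new_w) // 2
        box = (left, 0, left + new_w, h)
    else:
        # Too tall: trim top and bottom
        new_h = int(w / target_ratio)
        top = (h - new_h) // 2
        box = (0, top, w, top + new_h)
    return img.crop(box).resize((target_w, target_h), Image.LANCZOS)

# Stand-in portrait image; swap in Image.open("input.jpg")
image = center_crop_to_aspect(Image.new("RGB", (1080, 1350)))
```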
Faces and text look bad
SVD was not trained for generating realistic faces or legible text. Faces often distort during motion, and text becomes unreadable. This is a known limitation of the model. If your use case requires faces, consider using a dedicated video generation model that handles them better.
Flickering between frames
Lower `decode_chunk_size` values can introduce flickering because the VAE decodes frames independently in small batches. If you see flickering, try increasing `decode_chunk_size` back to 4 or 8. You can also increase `num_inference_steps` to 30 for better temporal consistency.
Related Guides
- How to Edit Images with AI Inpainting Using Stable Diffusion
- How to Build AI Architectural Rendering with ControlNet and Stable Diffusion
- How to Generate and Edit Audio with Stable Audio and AudioLDM
- How to Build AI Wallpaper Generation with Stable Diffusion and Tiling
- How to Generate Images with Stable Diffusion in Python
- How to Build AI Motion Graphics Generation with Deforum Stable Diffusion
- How to Build AI Seamless Pattern Generation with Stable Diffusion
- How to Generate Images with FLUX.2 in Python
- How to Build AI Sticker and Emoji Generation with Stable Diffusion
- How to Fine-Tune Stable Diffusion with LoRA and DreamBooth