Generate a Video from a Single Image
Stable Video Diffusion (SVD) takes an input image and produces a short video – typically 2 to 4 seconds at 576x1024 resolution. It ships as a pipeline in the diffusers library, so the workflow is: load model, load image, call the pipeline, export frames.
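A minimal sketch of that workflow, using the SVD-XT checkpoint (the input filename and seed are placeholders):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # keeps VRAM usage manageable

# Conditioning image; SVD expects 1024x576 (width x height)
image = load_image("input.jpg").resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

export_to_video(frames, "generated.mp4", fps=7)
```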
First run downloads roughly 10GB of model weights. After that, everything loads from the Hugging Face cache. The `decode_chunk_size=8` parameter controls how many frames the VAE decodes at once – lower values use less VRAM, higher values are faster.
SVD vs SVD-XT: Pick the Right Variant
There are two model variants, and the choice matters.
| Variant | Model ID | Frames | Video Length | VRAM (fp16) |
|---|---|---|---|---|
| SVD | stabilityai/stable-video-diffusion-img2vid | 14 | ~2 seconds | ~6GB with offloading |
| SVD-XT | stabilityai/stable-video-diffusion-img2vid-xt | 25 | ~4 seconds | ~8GB with offloading |
SVD-XT is the better choice for almost everything. It was fine-tuned from SVD to generate 25 frames instead of 14, giving you smoother and longer clips. The extra VRAM cost is minimal if you enable CPU offloading.
Use plain SVD only when you need faster iteration – it generates in roughly half the time of SVD-XT.
Control Motion and Style
SVD exposes micro-conditioning parameters that let you tune how much the video moves and how closely it sticks to the source image.
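A sketch of how these parameters are passed to the pipeline call; the setup repeats the basic workflow, and the specific values are illustrative rather than recommendations:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()
image = load_image("input.jpg").resize((1024, 576))

frames = pipe(
    image,
    motion_bucket_id=180,     # more motion than the default 127
    noise_aug_strength=0.1,   # looser adherence to the source image
    fps=7,                    # conditioning framerate, not playback speed
    num_inference_steps=25,
    decode_chunk_size=8,
    generator=torch.manual_seed(42),
).frames[0]
```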
Here is what each parameter does:
- `motion_bucket_id` – Controls how much the scene moves. Default is 127; values range from 0 to 255. Set it to 180+ for dramatic motion or below 100 for subtle camera pans.
- `noise_aug_strength` – Adds noise to the conditioning image before generation. Higher values (0.1-0.3) give the model more creative freedom, but the output drifts further from your source. Default is 0.02.
- `fps` – The framerate used for conditioning during generation. This does not change the playback speed of the exported video – that's controlled by the `fps` argument in `export_to_video`. Setting this to 7 is a good default.
- `num_inference_steps` – Denoising iterations. Default is 25. Going above 30 rarely improves quality, but going below 20 degrades it noticeably.
Save as GIF
If you want a looping GIF instead of an MP4, the diffusers library has you covered.
You can also do it manually with Pillow for more control over optimization and loop count:
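A sketch using Pillow's `save` with `save_all`; the frame list is a stand-in for real pipeline output:

```python
from PIL import Image

# Stand-in for the pipeline's output: a list of PIL images
frames = [Image.new("RGB", (1024, 576), (i * 10, 0, 0)) for i in range(14)]

frames[0].save(
    "output.gif",
    save_all=True,
    append_images=frames[1:],  # remaining frames after the first
    duration=1000 // 7,        # ms per frame, ~7 fps playback
    loop=0,                    # 0 = loop forever
    optimize=True,             # shrink the palette where possible
)
```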
GIFs are convenient for previewing but the file sizes get large quickly. Stick with MP4 for anything you plan to share or store.
Memory Optimization
Video generation is memory-intensive because the model generates all frames at once. On a 24GB GPU you can run SVD-XT without much trouble. On 8-12GB cards, you need to stack optimizations.
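One way to stack the optimizations, sketched against the SVD-XT checkpoint (the input filename is a placeholder):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,   # fp16 halves weight memory
    variant="fp16",
)
pipe.enable_model_cpu_offload()      # move idle submodules to CPU
pipe.unet.enable_forward_chunking()  # chunk the UNet feed-forward layers

image = load_image("input.jpg").resize((1024, 576))

# decode_chunk_size=2: decode only 2 frames at a time in the VAE
frames = pipe(image, decode_chunk_size=2).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```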
Combining all three techniques – CPU offloading, forward chunking, and decode_chunk_size=2 – gets VRAM usage below 8GB. The tradeoff is speed: generation takes roughly 2-3x longer than running everything on the GPU.
| Optimization | VRAM Savings | Speed Impact |
|---|---|---|
| `torch.float16` (fp16) | ~50% | Faster on modern GPUs |
| `enable_model_cpu_offload()` | ~40% | Moderate slowdown |
| `enable_forward_chunking()` | ~15% | Slight slowdown |
| `decode_chunk_size=2` | ~20% | Slower VAE decoding |
| All combined | Below 8GB | 2-3x slower total |
If you have a 24GB+ GPU and want speed instead of savings, skip the offloading and compile the UNet:
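A sketch of that setup; `mode="reduce-overhead"` is one reasonable choice for repeated inference:

```python
import torch
from diffusers import StableVideoDiffusionPipeline

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")  # keep everything on the GPU; no offloading

# Compile the UNet; the first call is slow while the graph warms up
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```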
This gives a 20-25% speedup after the first inference call warms up the compiled graph.
Common Errors and Fixes
CUDA out of memory
This is the most common error. Three fixes, in order of preference:
- Enable CPU offloading: `pipe.enable_model_cpu_offload()`
- Lower `decode_chunk_size` to 2 or even 1
- Enable forward chunking: `pipe.unet.enable_forward_chunking()`
If you see absurdly high memory requests (like 39GB on a 24GB card), check your PyTorch version. PyTorch 1.x lacks Scaled Dot-Product Attention (SDPA), which causes the attention layers to allocate far more memory than necessary. Upgrading to PyTorch 2.0+ fixes this immediately.
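A quick way to check, sketched in Python; `scaled_dot_product_attention` only exists on PyTorch 2.0+:

```python
import torch
import torch.nn.functional as F

# SDPA shipped in PyTorch 2.0; if this prints False, upgrade with
#   pip install --upgrade torch
has_sdpa = hasattr(F, "scaled_dot_product_attention")
print("PyTorch:", torch.__version__)
print("SDPA available:", has_sdpa)
```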
Videos with no motion or only slow pans
SVD sometimes produces nearly static output, especially with certain source images. Fix this by increasing `motion_bucket_id` to 180-200 and bumping `noise_aug_strength` to 0.1.
If the image is very detailed or busy, the model tends to produce less motion. Try a cleaner composition with a clear subject and simple background.
Wrong resolution produces artifacts
SVD was trained on 576x1024 images. Feed it a different aspect ratio and you’ll get warped, stretched, or artifact-heavy output. Always resize your input:
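A minimal resize sketch with Pillow; the synthetic image stands in for your real source file:

```python
from PIL import Image

# Stand-in for your source image; swap in Image.open("input.jpg")
image = Image.new("RGB", (1920, 1080))

# SVD's native resolution: 1024x576 (width x height)
image = image.resize((1024, 576), Image.LANCZOS)
```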
If your source image has a different aspect ratio, crop before resizing rather than stretching. Center crop works well:
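A sketch of that approach; the helper name `center_crop_to_aspect` is ours, not a library function:

```python
from PIL import Image

def center_crop_to_aspect(img, target_w=1024, target_h=576):
    """Crop the largest centered region matching the target aspect ratio,
    then resize it to the target dimensions."""
    w, h = img.size
    target_ratio = target_w / target_h
    if w / h > target_ratio:
        # Too wide: trim the sides
        new_w = int(h * target_ratio)
        left = (w - new_w) // 2
        box = (left, 0, left + new_w, h)
    else:
        # Too tall: trim top and bottom
        new_h = int(w / target_ratio)
        top = (h - new_h) // 2
        box = (0, top, w, top + new_h)
    return img.crop(box).resize((target_w, target_h), Image.LANCZOS)

# Stand-in portrait image; swap in Image.open("input.jpg")
image = center_crop_to_aspect(Image.new("RGB", (1080, 1350)))
```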
Faces and text look bad
SVD was not trained for generating realistic faces or legible text. Faces often distort during motion, and text becomes unreadable. This is a known limitation of the model. If your use case requires faces, consider using a dedicated video generation model that handles them better.
Flickering between frames
Lower `decode_chunk_size` values can introduce flickering because the VAE decodes frames independently in small batches. If you see flickering, try increasing `decode_chunk_size` back to 4 or 8. You can also increase `num_inference_steps` to 30 for better temporal consistency.
Related Guides
- How to Edit Images with AI Inpainting Using Stable Diffusion
- How to Build AI Architectural Rendering with ControlNet and Stable Diffusion
- How to Generate and Edit Audio with Stable Audio and AudioLDM
- How to Build AI Wallpaper Generation with Stable Diffusion and Tiling
- How to Generate Images with Stable Diffusion in Python
- How to Build AI Motion Graphics Generation with Deforum Stable Diffusion
- How to Build AI Seamless Pattern Generation with Stable Diffusion
- How to Generate Images with FLUX.2 in Python
- How to Build AI Sticker and Emoji Generation with Stable Diffusion
- How to Fine-Tune Stable Diffusion with LoRA and DreamBooth