The Deforum Approach to AI Animation
Deforum generates video by running Stable Diffusion frame-by-frame, applying geometric transformations (pan, zoom, rotate) between frames, and feeding each output back as the input for the next. The result is smooth, coherent animation that can shift between scenes based on prompt scheduling – different text prompts at different frame numbers.
Most tutorials show the Automatic1111 web UI extension. We're skipping that entirely and building the pipeline in pure Python with diffusers, torch, PIL, numpy, OpenCV, and subprocess for FFmpeg. This gives you full programmatic control over every parameter.
Install the Python dependencies (opencv-python is included because the frame warping later on uses cv2):

```shell
pip install diffusers transformers accelerate torch torchvision pillow numpy opencv-python
```
You also need FFmpeg installed on your system for the final video export:
```shell
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg
```
Set Up the Stable Diffusion Pipeline
We use StableDiffusionImg2ImgPipeline because Deforum-style animation is fundamentally img2img – each frame is generated by conditioning on the previous frame.
```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    safety_checker=None,
)
pipe.enable_model_cpu_offload()

# Optional: compile the UNet for 20-30% speedup on PyTorch 2.x
# pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```
CPU offloading keeps VRAM usage around 4-5GB. If you have a 24GB card, skip offloading and call pipe.to("cuda") for faster generation.
Define Animation Parameters and Keyframes
Deforum’s power comes from keyframe-driven animation. You define how the camera moves at specific frame numbers, and values get interpolated between them.
```python
import numpy as np

# Animation settings
TOTAL_FRAMES = 120
WIDTH, HEIGHT = 512, 512
FPS = 15

# Keyframes: frame_number -> value
# Translation (pixels per frame)
translation_x_keys = {0: 0, 30: 2, 60: -1, 90: 3, 120: 0}
translation_y_keys = {0: 0, 30: -1, 60: 0, 90: -2, 120: 0}

# Rotation (degrees per frame)
rotation_keys = {0: 0, 30: 0.5, 60: -0.3, 90: 0.8, 120: 0}

# Zoom (1.0 = no zoom, >1 = zoom in, <1 = zoom out)
zoom_keys = {0: 1.0, 30: 1.02, 60: 1.0, 90: 1.01, 120: 1.0}

# Strength controls how much Stable Diffusion changes each frame
# Lower = more coherent (less change), higher = more creative (more change)
strength_keys = {0: 0.55, 60: 0.6, 120: 0.55}


def interpolate_keyframes(keys: dict, frame: int) -> float:
    """Linearly interpolate between keyframe values."""
    sorted_frames = sorted(keys.keys())
    if frame <= sorted_frames[0]:
        return keys[sorted_frames[0]]
    if frame >= sorted_frames[-1]:
        return keys[sorted_frames[-1]]
    for i in range(len(sorted_frames) - 1):
        f_start = sorted_frames[i]
        f_end = sorted_frames[i + 1]
        if f_start <= frame <= f_end:
            t = (frame - f_start) / (f_end - f_start)
            return keys[f_start] + t * (keys[f_end] - keys[f_start])
    return keys[sorted_frames[-1]]
```
The interpolation function handles smooth transitions between keyframes. You set values at specific frames, and everything between gets linearly blended. For more organic motion, swap linear interpolation for a cubic spline – scipy.interpolate.CubicSpline works well here.
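If you want eased motion without pulling in scipy, a cosine ease-in-out is a simple middle ground: it matches linear interpolation at every keyframe but flattens the velocity to zero as each keyframe is reached. A sketch (the function name is my own, not part of Deforum):

```python
import math


def interpolate_keyframes_eased(keys: dict, frame: int) -> float:
    """Like linear keyframe interpolation, but with cosine ease-in-out
    between neighbouring keyframes."""
    sorted_frames = sorted(keys.keys())
    if frame <= sorted_frames[0]:
        return keys[sorted_frames[0]]
    if frame >= sorted_frames[-1]:
        return keys[sorted_frames[-1]]
    for f_start, f_end in zip(sorted_frames, sorted_frames[1:]):
        if f_start <= frame <= f_end:
            t = (frame - f_start) / (f_end - f_start)
            # Remap t so motion accelerates out of one key and decelerates into the next
            eased = 0.5 * (1 - math.cos(math.pi * t))
            return keys[f_start] + eased * (keys[f_end] - keys[f_start])
    return keys[sorted_frames[-1]]


print(interpolate_keyframes_eased({0: 0, 30: 2}, 0))   # 0 (same as linear)
print(interpolate_keyframes_eased({0: 0, 30: 2}, 15))  # ~1.0 (midpoint matches linear)
print(interpolate_keyframes_eased({0: 0, 30: 2}, 30))  # 2 (same as linear)
```

The difference shows up between keyframes: values near each keyframe change more slowly, which reads as smoother camera motion.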
Between each frame, we apply 2D transformations to simulate camera movement. This warped image becomes the init image for the next Stable Diffusion pass.
```python
import cv2
import numpy as np
from PIL import Image


def apply_2d_transforms(
    image: Image.Image,
    translate_x: float,
    translate_y: float,
    rotation_deg: float,
    zoom: float,
) -> Image.Image:
    """Apply translation, rotation, and zoom to a PIL image."""
    img_array = np.array(image, dtype=np.float32)
    h, w = img_array.shape[:2]
    cx, cy = w / 2.0, h / 2.0

    # Build the affine transformation matrix:
    # rotate and zoom about the image center, then apply the pan
    cos_a = np.cos(np.radians(rotation_deg))
    sin_a = np.sin(np.radians(rotation_deg))

    # Combined rotation + zoom matrix
    M = np.array([
        [zoom * cos_a, -zoom * sin_a, (1 - zoom * cos_a) * cx + zoom * sin_a * cy + translate_x],
        [zoom * sin_a, zoom * cos_a, (1 - zoom * cos_a) * cy - zoom * sin_a * cx + translate_y],
    ], dtype=np.float32)

    warped = cv2.warpAffine(
        img_array,
        M,
        (w, h),
        flags=cv2.INTER_CUBIC,
        borderMode=cv2.BORDER_REFLECT_101,
    )
    # Cubic interpolation can overshoot [0, 255], so clip before casting
    return Image.fromarray(np.clip(warped, 0, 255).astype(np.uint8))
```
BORDER_REFLECT_101 mirrors pixels at the edges instead of filling with black. This prevents the harsh black borders you’d get with zero padding, and keeps the image looking natural as the camera pans.
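To see exactly which source pixel a reflected coordinate maps to, the reflect-101 rule is easy to reproduce in plain Python (reflect101 here is an illustrative helper of mine, not an OpenCV function):

```python
def reflect101(i: int, n: int) -> int:
    """Map an out-of-range index i into [0, n) using reflect-101 rules:
    mirror about the edge pixel without repeating it."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i = abs(i) % period
    return period - i if i >= n else i


# For a 5-pixel row, indices just past either edge mirror back inward
# (OpenCV documents this mode as: gfedcb|abcdefgh|gfedcba)
print([reflect101(i, 5) for i in range(-3, 8)])
# → [3, 2, 1, 0, 1, 2, 3, 4, 3, 2, 1]
```

Note that the edge pixel itself (index 0 or n-1) appears only once in the mirrored sequence, which is what distinguishes REFLECT_101 from plain BORDER_REFLECT.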
Set Up Prompt Scheduling
Prompt scheduling lets you change the text prompt at specific frame numbers. This is how you create scene transitions – morph from a forest into a city, or shift from day to night.
```python
# Prompt schedule: frame_number -> prompt
prompt_schedule = {
    0: "a mystical forest with glowing mushrooms, digital art, highly detailed, volumetric lighting",
    40: "an ancient temple overgrown with vines, digital art, highly detailed, volumetric lighting",
    80: "a futuristic neon city at night, cyberpunk, digital art, highly detailed, volumetric lighting",
}

negative_prompt = "blurry, low quality, watermark, text, deformed, disfigured, jpeg artifacts"


def get_prompt_for_frame(schedule: dict, frame: int) -> str:
    """Return the active prompt for a given frame number."""
    active_prompt = schedule[0]
    for f in sorted(schedule.keys()):
        if frame >= f:
            active_prompt = schedule[f]
        else:
            break
    return active_prompt
```
Keeping a consistent style suffix across all prompts (“digital art, highly detailed, volumetric lighting”) helps maintain visual coherence as the content changes. The negative prompt stays the same throughout – it prevents common artifacts across all frames.
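As an aside, the "latest keyframe at or before this frame" lookup that get_prompt_for_frame performs can also be written with the standard library's bisect module, which stays O(log n) even for long schedules. A sketch (the function name is mine):

```python
from bisect import bisect_right


def get_prompt_for_frame_bisect(schedule: dict, frame: int) -> str:
    """Equivalent lookup: the latest scheduled prompt at or before `frame`."""
    frames = sorted(schedule.keys())
    # bisect_right returns the insertion point; the key before it is active
    idx = bisect_right(frames, frame) - 1
    return schedule[frames[max(idx, 0)]]


schedule = {0: "forest", 40: "temple", 80: "city"}
print(get_prompt_for_frame_bisect(schedule, 39))   # forest
print(get_prompt_for_frame_bisect(schedule, 40))   # temple
print(get_prompt_for_frame_bisect(schedule, 119))  # city
```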
Generate the Animation Frame by Frame
This is the core loop. Each iteration: get the current prompt, apply geometric transforms to the previous frame, run img2img, and save the result.
```python
import os

import torch
from PIL import Image

output_dir = "frames"
os.makedirs(output_dir, exist_ok=True)

# Generate the first frame from text (using a blank init image with high strength)
generator = torch.Generator(device="cpu").manual_seed(42)
init_image = Image.new("RGB", (WIDTH, HEIGHT), color=(20, 20, 40))

first_frame = pipe(
    prompt=get_prompt_for_frame(prompt_schedule, 0),
    negative_prompt=negative_prompt,
    image=init_image,
    strength=0.99,  # Near-full generation for the first frame
    num_inference_steps=30,
    guidance_scale=7.5,
    generator=generator,
).images[0]

first_frame.save(os.path.join(output_dir, "frame_0000.png"))
prev_frame = first_frame
print("Frame 0000 saved")

# Generate subsequent frames
for frame_idx in range(1, TOTAL_FRAMES):
    # Get interpolated animation values
    tx = interpolate_keyframes(translation_x_keys, frame_idx)
    ty = interpolate_keyframes(translation_y_keys, frame_idx)
    rot = interpolate_keyframes(rotation_keys, frame_idx)
    zm = interpolate_keyframes(zoom_keys, frame_idx)
    strength = interpolate_keyframes(strength_keys, frame_idx)

    # Apply geometric transforms to previous frame
    warped = apply_2d_transforms(prev_frame, tx, ty, rot, zm)

    # Get the current prompt
    prompt = get_prompt_for_frame(prompt_schedule, frame_idx)

    # Run img2img on the warped frame
    generator = torch.Generator(device="cpu").manual_seed(42 + frame_idx)
    result = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        image=warped,
        strength=strength,
        num_inference_steps=20,
        guidance_scale=7.5,
        generator=generator,
    ).images[0]

    filename = f"frame_{frame_idx:04d}.png"
    result.save(os.path.join(output_dir, filename))
    prev_frame = result

    if (frame_idx + 1) % 10 == 0:
        print(f"Frame {frame_idx:04d}/{TOTAL_FRAMES} saved (prompt: {prompt[:40]}...)")

print(f"All {TOTAL_FRAMES} frames generated in {output_dir}/")
```
A few things to note about the strength parameter: at 0.55, you get strong temporal coherence – each frame looks like its predecessor with subtle changes. Push it to 0.7+ and Stable Diffusion takes more creative liberty, which makes prompt transitions more dramatic but can cause flickering. Start at 0.55 and adjust based on your results.
Each frame gets a unique seed derived from the frame index. This keeps generation deterministic and reproducible while giving each frame variation.
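The seeding scheme is easy to sanity-check in miniature with the standard library's random module standing in for torch.Generator (an analogy, not the actual latent sampler):

```python
import random

BASE_SEED = 42


def noise_for_frame(frame_idx: int) -> float:
    """Toy stand-in for per-frame latent noise: a value drawn from a
    generator seeded with BASE_SEED + frame index."""
    return random.Random(BASE_SEED + frame_idx).random()


# Re-rendering any single frame reproduces its noise exactly...
assert noise_for_frame(7) == noise_for_frame(7)
# ...while neighbouring frames still draw different noise
assert noise_for_frame(7) != noise_for_frame(8)
```

This is also why you can delete a bad frame, tweak its keyframe values, and regenerate just that frame without disturbing the rest of the sequence.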
Smooth Scene Transitions with Prompt Blending
Hard prompt switches at keyframes can cause jarring visual jumps. To smooth them out, blend prompts during transition windows using the prompt_embeds approach.
```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline


def get_blended_prompt_embeds(
    pipe: StableDiffusionImg2ImgPipeline,
    prompt_a: str,
    prompt_b: str,
    blend_factor: float,
) -> torch.Tensor:
    """Blend two prompt embeddings. blend_factor=0 is all A, 1 is all B."""
    inputs_a = pipe.tokenizer(
        prompt_a, padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    )
    inputs_b = pipe.tokenizer(
        prompt_b, padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    )
    with torch.no_grad():
        embeds_a = pipe.text_encoder(inputs_a.input_ids.to(pipe.device))[0]
        embeds_b = pipe.text_encoder(inputs_b.input_ids.to(pipe.device))[0]
    blended = (1 - blend_factor) * embeds_a + blend_factor * embeds_b
    return blended


# Example: blend between prompts over a 20-frame window
BLEND_WINDOW = 20


def get_prompt_embeds_for_frame(pipe, schedule, frame):
    """Get prompt embeddings, blending during transition windows."""
    sorted_keyframes = sorted(schedule.keys())

    # Check if we're in a transition window
    for i in range(len(sorted_keyframes) - 1):
        transition_start = sorted_keyframes[i + 1] - BLEND_WINDOW
        transition_end = sorted_keyframes[i + 1]
        if transition_start <= frame < transition_end:
            t = (frame - transition_start) / BLEND_WINDOW
            prompt_a = schedule[sorted_keyframes[i]]
            prompt_b = schedule[sorted_keyframes[i + 1]]
            return get_blended_prompt_embeds(pipe, prompt_a, prompt_b, t)

    # Not in a transition -- use the active prompt directly
    active_prompt = get_prompt_for_frame(schedule, frame)
    inputs = pipe.tokenizer(
        active_prompt, padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    )
    with torch.no_grad():
        embeds = pipe.text_encoder(inputs.input_ids.to(pipe.device))[0]
    return embeds
```
To use blended embeddings in the generation loop, pass prompt_embeds instead of prompt to the pipeline call:
```python
embeds = get_prompt_embeds_for_frame(pipe, prompt_schedule, frame_idx)

result = pipe(
    prompt_embeds=embeds,
    negative_prompt=negative_prompt,
    image=warped,
    strength=strength,
    num_inference_steps=20,
    guidance_scale=7.5,
    generator=generator,
).images[0]
```
This produces gradual morphs between scenes instead of abrupt cuts.
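Stripped of the tokenizer and text encoder, the blending math is just two pieces: a blend factor that ramps from 0 to 1 across the transition window, and a linear mix of two vectors. A dependency-free sketch (helper names are mine):

```python
def blend_factor(frame: int, next_key: int, window: int) -> float:
    """0 before the window opens, ramping linearly to 1 as `frame` reaches next_key."""
    start = next_key - window
    if frame < start:
        return 0.0
    if frame >= next_key:
        return 1.0
    return (frame - start) / window


def lerp(a: list, b: list, t: float) -> list:
    """The same (1 - t) * A + t * B mix applied to the embedding tensors above."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


# Approaching the prompt keyframe at frame 40 with a 20-frame window:
print([blend_factor(f, 40, 20) for f in (10, 20, 30, 39, 40)])
# → [0.0, 0.0, 0.5, 0.95, 1.0]
print(lerp([1.0, 0.0], [0.0, 1.0], 0.25))  # → [0.75, 0.25]
```

Because the text encoder's embedding space is roughly continuous, interpolated embeddings produce images that are plausible in-between scenes rather than superimposed double exposures.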
Custom Motion Paths
Beyond simple pans and zooms, you can define complex camera paths using parametric curves. Here’s a circular orbit that smoothly loops:
```python
import math


def circular_motion_keys(total_frames: int, radius_x: float = 3.0,
                         radius_y: float = 2.0) -> tuple[dict, dict]:
    """Generate translation keyframes for a circular camera orbit."""
    tx_keys = {}
    ty_keys = {}
    for f in range(0, total_frames + 1, 10):  # Keyframe every 10 frames
        angle = 2 * math.pi * (f / total_frames)
        tx_keys[f] = radius_x * math.cos(angle)
        ty_keys[f] = radius_y * math.sin(angle)
    return tx_keys, ty_keys


# Spiraling zoom: zoom in while orbiting
def spiral_zoom_keys(total_frames: int, start_zoom: float = 1.0,
                     end_zoom: float = 1.5) -> dict:
    """Generate zoom keyframes for a smooth spiral."""
    keys = {}
    for f in range(0, total_frames + 1, 10):
        t = f / total_frames
        # Ease-in-out with a cosine curve
        eased = 0.5 * (1 - math.cos(math.pi * t))
        keys[f] = start_zoom + eased * (end_zoom - start_zoom)
    return keys


# Use these in place of the static keyframes
translation_x_keys, translation_y_keys = circular_motion_keys(TOTAL_FRAMES)
zoom_keys = spiral_zoom_keys(TOTAL_FRAMES)
```
The circular_motion_keys function creates a smooth elliptical orbit. Combine it with the spiraling zoom for a dramatic fly-in effect. You can chain these with rotation keyframes for even more dynamic camera work.
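In the same spirit, any closed parametric curve drops straight in. For example, a Lissajous-style figure-eight (an illustrative variant of my own, not a Deforum preset):

```python
import math


def figure_eight_keys(total_frames: int, radius_x: float = 3.0,
                      radius_y: float = 2.0) -> tuple[dict, dict]:
    """Figure-eight pan: the y axis oscillates at twice the x frequency."""
    tx_keys, ty_keys = {}, {}
    for f in range(0, total_frames + 1, 10):
        angle = 2 * math.pi * (f / total_frames)
        tx_keys[f] = radius_x * math.sin(angle)
        ty_keys[f] = radius_y * math.sin(2 * angle)
    return tx_keys, ty_keys


tx, ty = figure_eight_keys(120)
# The path closes on itself, so the animation loops cleanly
print(tx[0], tx[120])  # both ~0
```

Any curve whose start and end values match produces a seamlessly looping pan, which matters if you plan to export a looping GIF later.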
Export Frames to Video with FFmpeg
Once all frames are saved as PNGs, assemble them into an MP4 with FFmpeg via subprocess.
```python
import subprocess


def frames_to_video(
    frame_dir: str,
    output_path: str,
    fps: int = 15,
    pattern: str = "frame_%04d.png",
    crf: int = 18,
) -> None:
    """Combine PNG frames into an H.264 MP4 video."""
    cmd = [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", f"{frame_dir}/{pattern}",
        "-c:v", "libx264",
        "-crf", str(crf),
        "-pix_fmt", "yuv420p",
        "-preset", "slow",
        output_path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"FFmpeg error: {result.stderr}")
        raise RuntimeError("FFmpeg encoding failed")
    print(f"Video saved to {output_path}")


frames_to_video("frames", "motion_graphics.mp4", fps=FPS)
```
The -pix_fmt yuv420p flag ensures the video plays on all devices and browsers. Without it, some players show a green screen. CRF 18 gives near-lossless quality – bump it to 23 for smaller files if quality is acceptable.
For a looping GIF (useful for previews):
```shell
ffmpeg -y -framerate 15 -i frames/frame_%04d.png -vf "fps=15,scale=512:-1:flags=lanczos,split[s0][s1];[s0]palettegen[p];[s1][p]paletteuse" motion_graphics.gif
```
Common Errors and Fixes
Flickering between frames
This happens when strength is too high. Each frame gets too much creative freedom from Stable Diffusion, breaking temporal consistency. Lower strength to 0.45-0.55. Also make sure you pass the warped previous frame – not the original untransformed one – as the init image.
Black borders after panning
If you see black edges creeping in as the camera moves, your border mode is wrong. Use cv2.BORDER_REFLECT_101 instead of the default cv2.BORDER_CONSTANT. This mirrors edge pixels and keeps the image seamless.
```python
warped = cv2.warpAffine(
    img_array, M, (w, h),
    flags=cv2.INTER_CUBIC,
    borderMode=cv2.BORDER_REFLECT_101,  # Not BORDER_CONSTANT
)
```
CUDA out of memory
The img2img pipeline with SD 1.5 at 512x512 needs about 4GB of VRAM. If you run out:
- Enable CPU offloading: pipe.enable_model_cpu_offload()
- Use torch.float16 (already set in our pipeline config)
- Reduce num_inference_steps to 15 – quality drop is minimal for img2img
FFmpeg says “No such file or directory”
The frame naming pattern must match exactly. If your frames are frame_0000.png through frame_0119.png, the pattern is frame_%04d.png. A mismatch gives a cryptic “No such file” error even though the frames exist.
```shell
# Verify your frame naming
ls frames/ | head -5
# Should output: frame_0000.png, frame_0001.png, ...
```
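Python's format spec mirrors FFmpeg's pattern syntax, which makes it easy to confirm the two agree:

```python
# FFmpeg's %04d sequence pattern corresponds to Python's zero-padded formatting
for frame_idx in (0, 7, 119):
    print(f"frame_{frame_idx:04d}.png")
# frame_0000.png
# frame_0007.png
# frame_0119.png
```

If you ever change the padding width in the generation loop (say, to %05d for longer animations), change it in the FFmpeg pattern too.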
Prompt transitions cause visual “pops”
If switching prompts creates a jarring visual jump, either lower the strength during transition frames or use the prompt blending approach from the blending section above. A 15-20 frame blend window produces smooth morphs for most prompt pairs.
Generated video plays too fast or too slow
The FPS in the generation loop and the FPS in the FFmpeg export must match. If you generate at 15 FPS pacing but export at 30 FPS, the video plays at double speed. Keep FPS consistent across both steps, or adjust frame count proportionally.
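The arithmetic is worth spelling out:

```python
def playback_seconds(num_frames: int, fps: int) -> float:
    """Duration of the exported clip."""
    return num_frames / fps


# 120 frames paced for 15 FPS last 8 seconds...
print(playback_seconds(120, 15))  # 8.0
# ...but exported at 30 FPS they fly by in 4 seconds, i.e. double speed
print(playback_seconds(120, 30))  # 4.0
```

The simplest habit is to define FPS once, as this guide does, and reference that constant in both the keyframe planning and the frames_to_video call.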