CogVideoX is one of the strongest open-source text-to-video models you can run locally. Here’s a working pipeline that generates video clips from text prompts in a short, self-contained script.

Quick Start: Generate Your First Video

Install the dependencies, then run the script below:

pip install diffusers transformers accelerate torch torchvision imageio-ffmpeg
from diffusers import CogVideoXPipeline
import torch

# Load the model (CogVideoX-2B fits on 12GB VRAM)
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)
pipe.to("cuda")

# Generate a ~6-second clip (49 frames at 8 fps)
prompt = "A golden retriever running through a sunlit meadow, slow motion, cinematic"
video = pipe(
    prompt=prompt,
    num_frames=49,
    guidance_scale=6.0,
    num_inference_steps=50
).frames[0]

# Save as MP4
from diffusers.utils import export_to_video
export_to_video(video, "golden_retriever.mp4", fps=8)

This generates a 720x480 video in about 2-3 minutes on an RTX 3090. The 2B model is the sweet spot for local generation; the 5B model gives better quality but needs 24GB+ VRAM.
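Frame counts follow a simple pattern: CogVideoX generates at 8 fps and expects num_frames of the form seconds * fps + 1, which is why the example uses 49. A small helper (just a convenience, not part of diffusers) makes that explicit:

```python
def frames_for_duration(seconds: int, fps: int = 8) -> int:
    """Return the num_frames value for a clip of the given duration.

    CogVideoX expects frame counts of the form seconds * fps + 1.
    """
    return seconds * fps + 1

print(frames_for_duration(6))   # 49
print(frames_for_duration(10))  # 81
```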

Image-to-Video: Animate Static Images

You can condition generation on a starting frame for more control:

import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video
from PIL import Image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # Offload layers to CPU to fit in VRAM

# Load your starting image
image = Image.open("forest_path.jpg").resize((720, 480))

prompt = "Camera slowly pushes forward along the forest path, leaves rustling"
video = pipe(
    prompt=prompt,
    image=image,
    num_frames=49,
    guidance_scale=6.0,
    num_inference_steps=50
).frames[0]

export_to_video(video, "forest_path_animated.mp4", fps=8)

Image-to-video gives you pixel-perfect control over the first frame and composition. The model extrapolates motion from there.
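One caveat: a plain resize to (720, 480) will stretch source images whose aspect ratio isn’t already 3:2. A small arithmetic helper (illustrative only, not part of diffusers or PIL) computes a center-crop box you can pass to PIL’s crop() before resizing:

```python
def center_crop_box(w: int, h: int, target_w: int = 720, target_h: int = 480):
    """Return a (left, top, right, bottom) box that crops a w x h image
    to the target aspect ratio without distortion."""
    target_ratio = target_w / target_h
    if w / h > target_ratio:
        # Too wide: trim equal amounts from left and right.
        new_w = round(h * target_ratio)
        left = (w - new_w) // 2
        return (left, 0, left + new_w, h)
    # Too tall (or exact): trim equal amounts from top and bottom.
    new_h = round(w / target_ratio)
    top = (h - new_h) // 2
    return (0, top, w, top + new_h)

# Usage with PIL before resizing:
# image = image.crop(center_crop_box(*image.size)).resize((720, 480))
print(center_crop_box(1000, 1000))  # (0, 166, 1000, 833)
```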

Control Video Length and Resolution

CogVideoX supports multiple aspect ratios and durations. Here’s how to generate longer clips and custom resolutions:

# Generate a longer clip (81 frames, ~10 seconds at 8 fps)
video = pipe(
    prompt="Time-lapse of clouds moving over mountains",
    num_frames=81,
    height=480,
    width=720,
    guidance_scale=6.0,
    num_inference_steps=50
).frames[0]

# For 16:9 widescreen
video_wide = pipe(
    prompt="City skyline at sunset, wide panoramic shot",
    num_frames=49,
    height=432,
    width=768,  # 16:9 aspect ratio
    guidance_scale=6.0
).frames[0]

Longer videos (81+ frames) require more VRAM. If you hit memory errors, reduce num_frames or use CPU offloading.
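As a rough starting point, you can map available VRAM to conservative settings. The thresholds below echo the figures quoted in this guide; they are heuristics, not official requirements, so tune them for your own card:

```python
def suggest_settings(vram_gb: float) -> dict:
    """Rough starting points based on the VRAM figures in this guide.

    Heuristics only; adjust for your own card and clip length.
    """
    if vram_gb < 12:
        # 8-12 GB: enable CPU offloading and drop the resolution.
        return {"offload": True, "num_frames": 49, "height": 384, "width": 640}
    if vram_gb < 24:
        # 12-24 GB: CogVideoX-2B at native resolution, no offloading needed.
        return {"offload": False, "num_frames": 49, "height": 480, "width": 720}
    # 24 GB+: headroom for longer clips (or the 5B model).
    return {"offload": False, "num_frames": 81, "height": 480, "width": 720}

print(suggest_settings(10))
```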

Optimize VRAM Usage

Running out of memory? Here are the fixes that actually work:

from diffusers import CogVideoXPipeline
import torch

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)

# Enable CPU offloading (trades speed for VRAM)
pipe.enable_model_cpu_offload()

# Or, for even lower VRAM (but slower), use sequential offloading instead:
# pipe.enable_sequential_cpu_offload()

# Generate with lower resolution
video = pipe(
    prompt="A cat sleeping on a windowsill",
    num_frames=49,  # CogVideoX works best with (num_seconds * fps + 1) frames
    height=384,  # Reduced from 480
    width=640,   # Reduced from 720
    guidance_scale=6.0
).frames[0]

With these settings, you can run CogVideoX-2B on 8GB VRAM. CPU offloading adds ~30% to generation time but makes it possible to run without upgrading hardware.

Batch Processing Multiple Prompts

Generate multiple videos in a queue without reloading the model:

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompts = [
    "A spaceship flying through an asteroid field",
    "Ocean waves crashing on a rocky shore at sunset",
    "A robot assembling a circuit board, close-up shot",
    "Rain falling on a neon-lit city street at night"
]

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

for i, prompt in enumerate(prompts):
    print(f"Generating video {i+1}/{len(prompts)}: {prompt}")

    video = pipe(
        prompt=prompt,
        num_frames=49,
        guidance_scale=6.0,
        num_inference_steps=50
    ).frames[0]

    output_path = f"output_{i:03d}.mp4"
    export_to_video(video, output_path, fps=8)

    # Clear CUDA cache between generations
    torch.cuda.empty_cache()

print("Batch complete!")

This loop keeps the model loaded in memory and only clears the CUDA cache between videos. Much faster than reloading the model each time.
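If a long batch gets interrupted, rerunning it regenerates everything from scratch. A small standard-library guard (a sketch, matching the output_{i:03d}.mp4 naming above) lets the loop skip videos that already exist on disk:

```python
import os

def pending_jobs(prompts, out_dir="."):
    """Yield (index, prompt, output_path) for videos not yet rendered,
    using the same output_{i:03d}.mp4 naming as the batch loop."""
    for i, prompt in enumerate(prompts):
        path = os.path.join(out_dir, f"output_{i:03d}.mp4")
        if not os.path.exists(path):
            yield i, prompt, path
```

Swap the loop header for `for i, prompt, path in pending_jobs(prompts):` and write each video to `path`; restarting the script then resumes where it left off.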

Common Errors and Fixes

“CUDA out of memory” Enable CPU offloading with pipe.enable_model_cpu_offload() or reduce resolution and frame count. The 2B model needs minimum 8GB VRAM with offloading, 12GB without.

“AttributeError: ‘NoneType’ object has no attribute ‘frames’” The pipeline returned None. Check that your prompt isn’t empty and that the model downloaded correctly. Try re-running from_pretrained() to ensure weights loaded.

Videos are blurry or low quality Increase num_inference_steps from 50 to 75-100. Higher steps = better quality but longer generation time. Also try bumping guidance_scale to 7.0-8.0 for stronger prompt adherence.

Generation is extremely slow You’re probably using the 5B model with CPU offloading. Switch to the 2B model (THUDM/CogVideoX-2b) or disable offloading if you have enough VRAM. On GPU without offloading, 49 frames should take 2-4 minutes.
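To confirm where the time is going, wrap the generation call in a simple timer; this uses only the standard library:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    """Print how long the wrapped block took."""
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.1f}s")

# In real use:
# with timed("generation"):
#     video = pipe(prompt=prompt, num_frames=49).frames[0]
```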

“ImportError: cannot import name 'export_to_video'” Update diffusers: pip install --upgrade diffusers. The export utility was added in version 0.21.0.

Tuning Parameters for Better Results

Guidance scale controls how closely the model follows your prompt. Start at 6.0:

  • 4.0-5.0: More creative, less literal interpretation
  • 6.0-7.0: Balanced (recommended starting point)
  • 8.0-10.0: Strict prompt adherence, sometimes over-saturated

Inference steps trade quality for speed:

  • 30 steps: Fast preview, lower quality
  • 50 steps: Good balance (default)
  • 75-100 steps: Best quality, 2x generation time
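One practical way to combine the two knobs: iterate on a prompt with cheap preview settings, then rerun the keeper at full quality. Keeping both configurations as plain dicts (preset names are mine; the values come from the ranges above) and splatting them into the call keeps the loop tidy:

```python
# Hypothetical preset names; values are drawn from the guidance above.
PREVIEW = {"num_inference_steps": 30, "guidance_scale": 6.0, "num_frames": 49}
FINAL = {"num_inference_steps": 75, "guidance_scale": 7.0, "num_frames": 49}

def build_call_kwargs(prompt: str, settings: dict) -> dict:
    """Assemble the keyword arguments for a pipe(...) call."""
    return {"prompt": prompt, **settings}

# In real use: video = pipe(**build_call_kwargs(prompt, FINAL)).frames[0]
print(build_call_kwargs("A chef cooking", PREVIEW)["num_inference_steps"])  # 30
```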

Negative prompts help avoid unwanted elements:

video = pipe(
    prompt="A chef cooking in a modern kitchen",
    negative_prompt="blurry, distorted, warped, low quality, text, watermark",
    num_frames=49,
    guidance_scale=6.5,
    num_inference_steps=75
).frames[0]