CogVideoX is one of the strongest open-source text-to-video models you can run locally. Here’s a working pipeline that generates video clips from text prompts in under 10 lines of code.
Quick Start: Generate Your First Video
Install dependencies and generate a video in seconds:
This generates a 480×720 (height × width) video in about 2-3 minutes on an RTX 3090. The 2B model is the sweet spot for local generation; the 5B model gives better quality but needs 24GB+ VRAM.
Image-to-Video: Animate Static Images
You can condition generation on a starting frame for more control:
Image-to-video gives you pixel-perfect control over the first frame and composition. The model extrapolates motion from there.
Control Video Length and Resolution
CogVideoX supports multiple aspect ratios and durations. Here’s how to generate longer clips and custom resolutions:
Longer videos (96+ frames) require more VRAM. If you hit memory errors, reduce num_frames or use CPU offloading.
Optimize VRAM Usage
Running out of memory? Here are the fixes that actually work:
With these settings, you can run CogVideoX-2B on 8GB VRAM. CPU offloading adds ~30% to generation time but makes it possible to run without upgrading hardware.
Batch Processing Multiple Prompts
Generate multiple videos in a queue without reloading the model:
This loop keeps the model loaded in memory and only clears the CUDA cache between videos. Much faster than reloading the model each time.
Common Errors and Fixes
“CUDA out of memory”
Enable CPU offloading with pipe.enable_model_cpu_offload() or reduce resolution and frame count. The 2B model needs a minimum of 8GB VRAM with offloading, 12GB without.
“AttributeError: ‘NoneType’ object has no attribute ‘frames’”
The pipeline returned None. Check that your prompt isn’t empty and that the model downloaded correctly. Try re-running from_pretrained() to ensure weights loaded.
Videos are blurry or low quality
Increase num_inference_steps from 50 to 75-100. Higher steps = better quality but longer generation time. Also try bumping guidance_scale to 7.0-8.0 for stronger prompt adherence.
Generation is extremely slow
You’re probably using the 5B model with CPU offloading. Switch to the 2B model (THUDM/CogVideoX-2b) or disable offloading if you have enough VRAM. On GPU without offloading, 48 frames should take 2-4 minutes.
“ImportError: cannot import name ‘export_to_video’”
Update diffusers: pip install --upgrade diffusers. The export utility was added in version 0.21.0.
Tuning Parameters for Better Results
Guidance scale controls how closely the model follows your prompt. Start at 6.0:
- 4.0-5.0: More creative, less literal interpretation
- 6.0-7.0: Balanced (recommended starting point)
- 8.0-10.0: Strict prompt adherence, sometimes over-saturated
Inference steps trade quality for speed:
- 30 steps: Fast preview, lower quality
- 50 steps: Good balance (default)
- 75-100 steps: Best quality, 2x generation time
Negative prompts help avoid unwanted elements:
Related Guides
- How to Build AI Pixel Art Generation with Stable Diffusion
- How to Generate Videos with Stable Video Diffusion
- How to Build AI Texture Generation for Game Assets with Stable Diffusion
- How to Build AI Sticker and Emoji Generation with Stable Diffusion
- How to Build AI Motion Graphics Generation with Deforum Stable Diffusion
- How to Build AI Sprite Sheet Generation with Stable Diffusion
- How to Build AI Interior Design Rendering with ControlNet and Stable Diffusion
- How to Build AI Wallpaper Generation with Stable Diffusion and Tiling
- How to Build AI Seamless Pattern Generation with Stable Diffusion
- How to Build AI Logo Generation with Stable Diffusion and SDXL