Estimate Motion Between Two Frames
Optical flow tells you where every pixel moved between two consecutive frames. RAFT (Recurrent All-Pairs Field Transforms) is the go-to model for this – it won best paper at ECCV 2020, and torchvision ships pretrained weights so you don’t need to clone any external repos. You load the model, pass in two frames, and get back a dense flow field with per-pixel (dx, dy) displacement vectors.
The flow_predictions list contains intermediate refinements from RAFT’s recurrent update blocks. Always grab the last one – it’s the final, most accurate estimate. The flow tensor shape is (batch, 2, H, W) where channel 0 is horizontal displacement and channel 1 is vertical displacement.
Install Dependencies
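A minimal install for CPU-only use, assuming a standard pip environment (the package names are the usual PyPI ones):

```shell
pip install torch torchvision
```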
For GPU acceleration, install the CUDA build:
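For example, for the CUDA 12.1 wheels (adjust the index URL to match your installed CUDA version):

```shell
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
```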
The pretrained RAFT Large weights are about 20MB. They download on first use and cache in your torch hub directory.
Full Working Example with Visualization
Here’s a self-contained script that generates two synthetic frames with a moving rectangle, estimates the flow, and saves the visualization:
You should see horizontal displacement values around 40 in the rectangle region (since we shifted it 40 pixels to the right) and near-zero vertical displacement everywhere. The saved image will show the rectangle region colored to indicate rightward motion.
RAFT Large vs. RAFT Small
torchvision provides two RAFT variants:
| Model | Parameters | Speed | Best for |
|---|---|---|---|
| `raft_large` | ~5.3M | Slower | Offline processing, highest accuracy |
| `raft_small` | ~0.99M | ~2x faster | Real-time or constrained hardware |
Swap in the small model with minimal code changes:
RAFT Small is noticeably less accurate on fine-grained motion and thin structures, but it’s perfectly acceptable for tracking large objects or getting a rough motion field.
Process a Video Sequence
For video, iterate over consecutive frame pairs. Here’s how to extract flow from a video file:
This processes each consecutive pair and saves the flow visualization. For long videos, you’ll want to batch frames or limit the clip length to avoid memory issues.
Understanding the Flow Output
The flow tensor has shape (batch, 2, H, W):
- Channel 0: Horizontal displacement (positive = rightward motion)
- Channel 1: Vertical displacement (positive = downward motion)
Values are in pixels. A flow value of (10.5, -3.2) at position (y, x) means the pixel at that location in frame 1 moved 10.5 pixels right and 3.2 pixels up to reach its position in frame 2.
You can compute the magnitude (speed of motion) easily:
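For example, with a one-pixel flow field matching the (10.5, -3.2) value above:

```python
import torch

# A (N, 2, H, W) flow field; here one pixel moving 10.5 px right, 3.2 px up.
flow = torch.tensor([[[[10.5]], [[-3.2]]]])

# Per-pixel speed in pixels per frame, shape (N, H, W).
magnitude = torch.sqrt(flow[:, 0] ** 2 + flow[:, 1] ** 2)
print(magnitude)  # ~10.98
```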
flow_to_image uses a color wheel encoding: the hue represents motion direction and the saturation represents magnitude. Pure red typically means rightward, cyan means leftward, and so on.
Common Errors and Fixes
RuntimeError: Expected image1 and image2 to have the same shape
Both input frames must have identical dimensions. Resize them to the same size before passing to the model:
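A sketch of the fix with `torch.nn.functional.interpolate` (the mismatched frame sizes here are made up):

```python
import torch
import torch.nn.functional as F

# Two frames of different sizes, as (N, C, H, W) float tensors.
frame1 = torch.rand(1, 3, 480, 640)
frame2 = torch.rand(1, 3, 720, 1280)

# Resize both to one target size whose H and W are multiples of 8.
target = (520, 960)
frame1 = F.interpolate(frame1, size=target, mode="bilinear", align_corners=False)
frame2 = F.interpolate(frame2, size=target, mode="bilinear", align_corners=False)
```

Note that flow values are measured in resized pixels; rescale them if you map the flow back to the original resolution.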
RuntimeError: Input height and width must be divisible by 8
RAFT’s architecture uses a correlation pyramid with multiple downsampling steps. Both spatial dimensions must be multiples of 8. Pick a resolution that satisfies this – 520x960, 512x512, or 480x640 all work.
RuntimeError: Expected all tensors to be on the same device
Model is on CUDA but inputs are still on CPU. Move both images to the same device:
OutOfMemoryError: CUDA out of memory
RAFT builds a 4D correlation volume that scales with image resolution. For a 1080x1920 input, this can eat several GB of VRAM. Reduce input resolution – 520x960 gives good results while staying under 4GB. Alternatively, switch to raft_small.
Flow looks uniform/blank despite visible motion
You probably forgot the preprocessing transforms. Raw uint8 tensors won’t produce meaningful results. Always apply the weight-specific transforms:
These handle the conversion to float and the normalization into the [-1, 1] range RAFT was trained with (mean 0.5, std 0.5 per channel) — not ImageNet statistics.
Related Guides
- How to Build a Video Frame Interpolation Pipeline with RIFE
- How to Build Video Action Recognition with SlowFast and PyTorch
- How to Build a Real-Time Pose Estimation Pipeline with MediaPipe
- How to Build Multi-Object Tracking with DeepSORT and YOLOv8
- How to Classify Images with Vision Transformers in PyTorch
- How to Build Semantic Segmentation with Segment Anything and SAM 2
- How to Build an Image Captioning Pipeline with BLIP and Transformers
- How to Build Video Analytics Pipelines with OpenCV and Deep Learning
- How to Build a Visual Grounding Pipeline with Grounding DINO
- How to Build a Product Defect Detector with YOLOv8 and OpenCV