Estimate Motion Between Two Frames
Optical flow tells you where every pixel moved between two consecutive frames. RAFT (Recurrent All-Pairs Field Transforms) is the go-to model for this – it won best paper at ECCV 2020, and torchvision ships pretrained weights so you don’t need to clone any external repos. You load the model, pass in two frames, and get back a dense flow field with per-pixel (dx, dy) displacement vectors.
The flow_predictions list contains intermediate refinements from RAFT’s recurrent update blocks. Always grab the last one – it’s the final, most accurate estimate. The flow tensor shape is (batch, 2, H, W) where channel 0 is horizontal displacement and channel 1 is vertical displacement.
Install Dependencies
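A minimal install for CPU-only use, assuming a standard pip environment (the package names are the usual PyPI ones):

```shell
pip install torch torchvision
```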
For GPU acceleration, install the CUDA build:
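For example, for the CUDA 12.1 wheels (adjust the index URL to match your installed CUDA version):

```shell
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
```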
The pretrained RAFT Large weights are about 20MB. They download on first use and cache in your torch hub directory.
Full Working Example with Visualization
Here’s a self-contained script that generates two synthetic frames with a moving rectangle, estimates the flow, and saves the visualization:
You should see horizontal displacement values around 40 in the rectangle region (since we shifted it 40 pixels to the right) and near-zero vertical displacement everywhere. The saved image will show the rectangle region colored to indicate rightward motion.
RAFT Large vs. RAFT Small
torchvision provides two RAFT variants:
| Model | Parameters | Speed | Best for |
|---|---|---|---|
| `raft_large` | ~5.3M | Slower | Offline processing, highest accuracy |
| `raft_small` | ~0.99M | ~2x faster | Real-time or constrained hardware |
Swap in the small model with minimal code changes:
RAFT Small is noticeably less accurate on fine-grained motion and thin structures, but it’s perfectly acceptable for tracking large objects or getting a rough motion field.
Process a Video Sequence
For video, iterate over consecutive frame pairs. Here’s how to extract flow from a video file:
This processes each consecutive pair and saves the flow visualization. For long videos, you’ll want to batch frames or limit the clip length to avoid memory issues.
Understanding the Flow Output
The flow tensor has shape (batch, 2, H, W):
- Channel 0: Horizontal displacement (positive = rightward motion)
- Channel 1: Vertical displacement (positive = downward motion)
Values are in pixels. A flow value of (10.5, -3.2) at position (y, x) means the pixel at that location in frame 1 moved 10.5 pixels right and 3.2 pixels up to reach its position in frame 2.
You can compute the magnitude (speed of motion) easily:
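For example, with a one-pixel flow field matching the (10.5, -3.2) value above:

```python
import torch

# A (N, 2, H, W) flow field; here one pixel moving 10.5 px right, 3.2 px up.
flow = torch.tensor([[[[10.5]], [[-3.2]]]])

# Per-pixel speed in pixels per frame, shape (N, H, W).
magnitude = torch.sqrt(flow[:, 0] ** 2 + flow[:, 1] ** 2)
print(magnitude)  # ~10.98
```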
flow_to_image uses a color wheel encoding: the hue represents motion direction and the saturation represents magnitude. Pure red typically means rightward, cyan means leftward, and so on.
Common Errors and Fixes
RuntimeError: Expected image1 and image2 to have the same shape
Both input frames must have identical dimensions. Resize them to the same size before passing to the model:
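A sketch of the fix with `torch.nn.functional.interpolate` (the mismatched frame sizes here are made up):

```python
import torch
import torch.nn.functional as F

# Two frames of different sizes, as (N, C, H, W) float tensors.
frame1 = torch.rand(1, 3, 480, 640)
frame2 = torch.rand(1, 3, 720, 1280)

# Resize both to one target size whose H and W are multiples of 8.
target = (520, 960)
frame1 = F.interpolate(frame1, size=target, mode="bilinear", align_corners=False)
frame2 = F.interpolate(frame2, size=target, mode="bilinear", align_corners=False)
```

Note that flow values are measured in resized pixels; rescale them if you map the flow back to the original resolution.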
RuntimeError: Input height and width must be divisible by 8
RAFT’s architecture uses a correlation pyramid with multiple downsampling steps. Both spatial dimensions must be multiples of 8. Pick a resolution that satisfies this – 520x960, 512x512, or 480x640 all work.
RuntimeError: Expected all tensors to be on the same device
Model is on CUDA but inputs are still on CPU. Move both images to the same device:
OutOfMemoryError: CUDA out of memory
RAFT builds a 4D correlation volume that scales with image resolution. For a 1080x1920 input, this can eat several GB of VRAM. Reduce input resolution – 520x960 gives good results while staying under 4GB. Alternatively, switch to raft_small.
Flow looks uniform/blank despite visible motion
You probably forgot the preprocessing transforms. Raw uint8 tensors won’t produce meaningful results. Always apply the weight-specific transforms:
These handle the conversion to float and the normalization into the [-1, 1] range RAFT was trained with (mean 0.5, std 0.5 per channel) — not ImageNet statistics.
Related Guides
- How to Build a Video Frame Interpolation Pipeline with RIFE
- How to Build Video Action Recognition with SlowFast and PyTorch
- How to Build a Real-Time Pose Estimation Pipeline with MediaPipe
- How to Build Multi-Object Tracking with DeepSORT and YOLOv8
- How to Classify Images with Vision Transformers in PyTorch
- How to Build Semantic Segmentation with Segment Anything and SAM 2
- How to Build an Image Captioning Pipeline with BLIP and Transformers
- How to Build Video Analytics Pipelines with OpenCV and Deep Learning
- How to Build a Visual Grounding Pipeline with Grounding DINO
- How to Build a Product Defect Detector with YOLOv8 and OpenCV