Frame interpolation synthesizes new frames between existing ones, turning choppy 24fps footage into smooth 60fps or even 120fps video. RIFE (Real-Time Intermediate Flow Estimation) does this by estimating intermediate optical flows directly – no separate flow computation step needed. It runs at 30+ FPS on a 2080 Ti for 720p 2x interpolation, which makes it practical for real video workflows.
## Install and Load RIFE
RIFE is distributed through the Practical-RIFE repository. Clone it, install dependencies, and download the model weights.
```bash
git clone https://github.com/hzwer/Practical-RIFE.git
cd Practical-RIFE
pip install -r requirements.txt
pip install opencv-python-headless
```
Download the v4.25 model (recommended for most scenes) from the release page and place the files in train_log/:
```bash
mkdir -p train_log
# Download model v4.25 from the Practical-RIFE model list
# Place RIFE_HDv3.py, IFNet_HDv3.py, and flownet.pkl into train_log/
```
Now load the model in Python:
```python
import os
import sys

import torch

# Add the Practical-RIFE directory to the path
sys.path.insert(0, os.path.abspath("Practical-RIFE"))

# The model code downloaded in the previous step lives in train_log/
from train_log.RIFE_HDv3 import Model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.set_grad_enabled(False)
if torch.cuda.is_available():
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.benchmark = True

model = Model()
model.load_model("Practical-RIFE/train_log", -1)
model.eval()
model.device()
```
The load_model method reads flownet.pkl from the specified directory. Passing -1 as the rank signals a single-process, non-distributed load (it also falls back to CPU weights when CUDA is unavailable). The model.device() call moves the network to CUDA if present.
## Read and Prepare Video Frames
Use OpenCV to read video frames and convert them to the tensor format RIFE expects. Frames go from BGR uint8 arrays to normalized float32 tensors in NCHW layout.
```python
import cv2
import numpy as np
import torch
from torch.nn import functional as F

def load_video_frames(video_path: str) -> tuple[list[np.ndarray], float, tuple[int, int]]:
    """Read all frames from a video file. Returns frames, FPS, and (height, width)."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Could not open {video_path}")
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(frame)
    cap.release()
    h, w = frames[0].shape[:2]
    return frames, fps, (h, w)

def frame_to_tensor(frame: np.ndarray, device: torch.device) -> torch.Tensor:
    """Convert a BGR uint8 frame to a normalized NCHW float32 tensor."""
    # HWC -> CHW, normalize to [0, 1]. The channel order stays BGR
    # end to end; the output is converted back the same way.
    tensor = torch.from_numpy(np.transpose(frame, (2, 0, 1))).float() / 255.0
    tensor = tensor.unsqueeze(0).to(device, non_blocking=True)
    return tensor

def pad_to_divisible(tensor: torch.Tensor, divisor: int = 32) -> tuple[torch.Tensor, int, int]:
    """Pad tensor so H and W are divisible by divisor. Returns padded tensor and pad amounts."""
    _, _, h, w = tensor.shape
    pad_h = (divisor - h % divisor) % divisor
    pad_w = (divisor - w % divisor) % divisor
    if pad_h > 0 or pad_w > 0:
        tensor = F.pad(tensor, (0, pad_w, 0, pad_h), mode="replicate")
    return tensor, pad_h, pad_w

frames, original_fps, (h, w) = load_video_frames("input_video.mp4")
print(f"Loaded {len(frames)} frames at {original_fps:.1f} FPS, resolution {w}x{h}")
```
RIFE’s IFNet requires spatial dimensions divisible by 32. The padding function handles this with replicate padding, which avoids edge artifacts better than zero padding. You crop the output back to the original size after inference.
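To make the arithmetic concrete, here is a small standalone helper (not part of the pipeline above, just the same modulo math) showing the pad amounts for common resolutions:

```python
def pad_amounts(h: int, w: int, divisor: int = 32) -> tuple[int, int]:
    """How many rows/columns of padding a frame of size h x w needs."""
    return (divisor - h % divisor) % divisor, (divisor - w % divisor) % divisor

print(pad_amounts(720, 1280))   # (16, 0): height padded 720 -> 736, width already aligned
print(pad_amounts(1080, 1920))  # (8, 0): height padded 1080 -> 1088, width already aligned
```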
## Interpolate Frames
With the model loaded and frames prepared, run RIFE to generate one intermediate frame between each consecutive pair. This doubles the frame count.
```python
import numpy as np
import torch

def interpolate_pair(model, frame1: np.ndarray, frame2: np.ndarray,
                     device: torch.device, scale: float = 1.0) -> np.ndarray:
    """Generate one intermediate frame between frame1 and frame2."""
    I0 = frame_to_tensor(frame1, device)
    I1 = frame_to_tensor(frame2, device)
    I0, pad_h, pad_w = pad_to_divisible(I0, divisor=32)
    I1, _, _ = pad_to_divisible(I1, divisor=32)
    # timestep=0.5 generates the exact midpoint frame
    mid = model.inference(I0, I1, timestep=0.5, scale=scale)
    # Remove padding
    _, _, padded_h, padded_w = mid.shape
    if pad_h > 0:
        mid = mid[:, :, :padded_h - pad_h, :]
    if pad_w > 0:
        mid = mid[:, :, :, :padded_w - pad_w]
    # Convert back to BGR uint8 (clamp first: the network can overshoot [0, 1])
    mid_np = (mid[0] * 255.0).clamp(0, 255).byte().cpu().numpy().transpose(1, 2, 0)
    return np.ascontiguousarray(mid_np)

def interpolate_2x(model, frames: list[np.ndarray],
                   device: torch.device, scale: float = 1.0) -> list[np.ndarray]:
    """Double the frame count by inserting one interpolated frame between each pair."""
    output_frames = []
    for i in range(len(frames) - 1):
        output_frames.append(frames[i])
        mid = interpolate_pair(model, frames[i], frames[i + 1], device, scale)
        output_frames.append(mid)
        if (i + 1) % 50 == 0:
            print(f"  Interpolated {i + 1}/{len(frames) - 1} pairs")
    output_frames.append(frames[-1])  # Don't forget the last frame
    return output_frames

interpolated = interpolate_2x(model, frames, device, scale=1.0)
print(f"2x interpolation: {len(frames)} -> {len(interpolated)} frames")
```
The scale parameter controls internal resolution scaling during flow estimation. Use scale=1.0 for 720p and below. For 1080p, try scale=0.5 to cut memory usage. For 4K, scale=0.5 is almost mandatory unless you have 24GB+ of VRAM.
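That guidance can be folded into a small helper for picking a default (a heuristic sketch, not an official RIFE API; choose_scale is a name introduced here):

```python
def choose_scale(height: int, width: int) -> float:
    """Heuristic: pick a flow-estimation scale from the frame size."""
    if height * width <= 1280 * 720:
        return 1.0  # 720p and below: full internal resolution
    return 0.5      # 1080p and up: halve internal resolution to save VRAM
```

Usage would look like `mid = model.inference(I0, I1, timestep=0.5, scale=choose_scale(h, w))`.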
The timestep=0.5 argument tells RIFE to generate the frame exactly halfway between the two inputs. You can pass other values (0.25, 0.75) to generate frames at different temporal positions – useful for variable-rate interpolation.
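As a sketch of that variable-rate idea, any Nx multiplier can be produced in a single pass by sweeping the timestep. Here `infer` is a hypothetical callable that wraps model.inference with the same padding and cropping as interpolate_pair, but forwards the temporal position t as the timestep:

```python
def variable_rate_positions(multiplier: int) -> list[float]:
    """Temporal positions for (multiplier - 1) intermediate frames per pair.

    multiplier=2 -> [0.5]; multiplier=4 -> [0.25, 0.5, 0.75]
    """
    return [i / multiplier for i in range(1, multiplier)]

def interpolate_nx(frames: list, infer, multiplier: int = 4) -> list:
    """Single-pass Nx interpolation given an infer(f0, f1, t) callable."""
    out = []
    for i in range(len(frames) - 1):
        out.append(frames[i])
        for t in variable_rate_positions(multiplier):
            out.append(infer(frames[i], frames[i + 1], t))
    out.append(frames[-1])
    return out
```

Compared with chaining 2x passes, each intermediate frame here is estimated directly from the two real frames, so errors do not compound across passes.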
## Write the Output Video
Assemble the interpolated frames into a new video file with double the original frame rate.
```python
import cv2

def write_video(output_path: str, frames: list, fps: float,
                size: tuple[int, int]) -> None:
    """Write frames to an MP4 video file."""
    w, h = size
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(output_path, fourcc, fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
    print(f"Wrote {len(frames)} frames to {output_path} at {fps:.1f} FPS")

# Doubling the frame count doubles the frame rate (e.g. 30 FPS -> 60 FPS)
output_fps = original_fps * 2
write_video("output_60fps.mp4", interpolated, output_fps, (w, h))
```
The mp4v codec works everywhere but produces larger files. For better compression, re-encode with FFmpeg afterward:
```bash
ffmpeg -i output_60fps.mp4 -c:v libx264 -crf 18 -preset slow output_60fps_h264.mp4
```
If you want to preserve the original audio track, merge it from the source:
```bash
ffmpeg -i output_60fps.mp4 -i input_video.mp4 -c:v libx264 -crf 18 \
  -map 0:v:0 -map 1:a:0 -shortest output_with_audio.mp4
```
## Multi-Pass Interpolation for 4x FPS
To go from 30fps to 120fps, run two passes of 2x interpolation. Each pass doubles the frame count.
```python
import time

import torch

def interpolate_multi_pass(model, frames: list, device: torch.device,
                           passes: int = 2, scale: float = 1.0) -> list:
    """Chain multiple 2x interpolation passes for higher multipliers.

    passes=1 -> 2x FPS
    passes=2 -> 4x FPS
    passes=3 -> 8x FPS
    """
    current_frames = frames
    for p in range(passes):
        start = time.monotonic()
        current_frames = interpolate_2x(model, current_frames, device, scale)
        elapsed = time.monotonic() - start
        multiplier = 2 ** (p + 1)
        print(f"Pass {p + 1}: {multiplier}x done, "
              f"{len(current_frames)} frames ({elapsed:.1f}s)")
    return current_frames

# 4x interpolation: 30fps -> 120fps
frames_4x = interpolate_multi_pass(model, frames, device, passes=2, scale=1.0)
output_fps_4x = original_fps * 4
write_video("output_120fps.mp4", frames_4x, output_fps_4x, (w, h))
print(f"4x interpolation: {original_fps:.0f} FPS -> {output_fps_4x:.0f} FPS")
```
Each pass processes more frames than the last (the second pass has twice as many pairs to interpolate as the first), so 4x takes roughly 3x as long as 2x – not 2x. For a 10-second 30fps clip (300 frames), expect:
- Pass 1: 299 interpolations -> 599 frames
- Pass 2: 598 interpolations -> 1197 frames
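In general, each pass maps n frames to 2n - 1, so p passes turn n frames into (n - 1) * 2^p + 1. A one-liner makes the counts above easy to check:

```python
def frames_after_passes(n_frames: int, passes: int) -> int:
    """Frame count after chaining `passes` rounds of 2x interpolation."""
    return (n_frames - 1) * 2 ** passes + 1

print(frames_after_passes(300, 1))  # 599
print(frames_after_passes(300, 2))  # 1197
print(frames_after_passes(300, 3))  # 2393
```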
On an RTX 3060 at 720p, this takes about 40 seconds for 300 input frames. GPU memory stays constant across passes since you process one pair at a time.
Going beyond 4x (passes=3 for 8x) is technically possible but rarely worth it. The synthesized frames start to look ghostly at extreme multipliers, especially around fast motion and scene cuts.
A practical tip: detect scene changes before interpolation. When two consecutive frames are from different scenes, skip interpolation and duplicate the first frame instead. You can detect scene cuts by comparing frame histograms:
```python
import cv2
import numpy as np

def is_scene_change(frame1: np.ndarray, frame2: np.ndarray,
                    threshold: float = 0.4) -> bool:
    """Detect scene changes using histogram correlation."""
    hist1 = cv2.calcHist([frame1], [0, 1, 2], None,
                         [8, 8, 8], [0, 256, 0, 256, 0, 256])
    hist2 = cv2.calcHist([frame2], [0, 1, 2], None,
                         [8, 8, 8], [0, 256, 0, 256, 0, 256])
    cv2.normalize(hist1, hist1)
    cv2.normalize(hist2, hist2)
    correlation = cv2.compareHist(hist1, hist2, cv2.HISTCMP_CORREL)
    return correlation < threshold
```
When is_scene_change returns True, insert a duplicate of frame1 instead of calling model.inference. This prevents the ghosting artifacts that happen when RIFE tries to blend two completely different images.
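Wired into the 2x loop, that looks like the following sketch. The interpolate and is_cut parameters stand in for wrappers around interpolate_pair and is_scene_change, passed as callables so the loop itself stays self-contained:

```python
def interpolate_2x_scene_aware(frames: list, interpolate, is_cut) -> list:
    """2x interpolation that duplicates frames across detected scene cuts.

    interpolate(f0, f1) returns the midpoint frame; is_cut(f0, f1) returns
    True at a hard cut.
    """
    out = []
    for i in range(len(frames) - 1):
        out.append(frames[i])
        if is_cut(frames[i], frames[i + 1]):
            out.append(frames[i])  # duplicate instead of blending across the cut
        else:
            out.append(interpolate(frames[i], frames[i + 1]))
    out.append(frames[-1])
    return out
```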
## Common Errors and Fixes
### RuntimeError: Sizes of tensors must match except in dimension 1
Your frames have dimensions that are not divisible by 32. RIFE’s IFNet uses multiple downsampling layers that require this alignment. Always pad inputs before inference:
```python
I0, pad_h, pad_w = pad_to_divisible(I0, divisor=32)
I1, _, _ = pad_to_divisible(I1, divisor=32)
```
Then crop the output to remove the padding. The pad_to_divisible function shown earlier handles this.
### CUDA out of memory on HD or 4K video
RIFE allocates intermediate feature maps proportional to the input resolution. For 1080p, a single pair can use 4-6GB of VRAM. Two fixes:
- Set scale=0.5 in the inference call. This halves the internal resolution for flow estimation while still producing a full-resolution output.
- Process frames on CPU for very large resolutions (slow but works): set device = torch.device("cpu").
```python
# For 4K input, use half scale
mid = model.inference(I0, I1, timestep=0.5, scale=0.5)
```
### KeyError: 'module.block0.conv0.0.weight' when loading model
This happens when the model weights were saved from distributed training with DataParallel, which adds a module. prefix to every key. The Practical-RIFE load_model method already strips this prefix. If you are loading weights manually, handle it yourself:
```python
state_dict = torch.load("flownet.pkl", map_location=device)
new_state_dict = {}
for k, v in state_dict.items():
    new_key = k.replace("module.", "")
    new_state_dict[new_key] = v
flownet.load_state_dict(new_state_dict, strict=False)
```
### Output video has no audio
OpenCV’s VideoWriter does not handle audio tracks. After writing the interpolated video, merge audio from the original using FFmpeg:
```bash
ffmpeg -i output_60fps.mp4 -i input_video.mp4 -c:v copy -c:a aac \
  -map 0:v:0 -map 1:a:0 -shortest output_with_audio.mp4
```
### Ghosting artifacts at scene cuts
RIFE blends the two input frames to produce the intermediate result. When those frames come from different scenes (a hard cut), the output is a semi-transparent overlay of both scenes. Use the histogram-based scene detection shown above to skip interpolation at cut points.