Frame interpolation synthesizes new frames between existing ones, turning choppy 24fps footage into smooth 60fps or even 120fps video. RIFE (Real-Time Intermediate Flow Estimation) does this by estimating intermediate optical flows directly – no separate flow computation step needed. It runs at 30+ FPS on a 2080 Ti for 720p 2x interpolation, which makes it practical for real video workflows.
## Install and Load RIFE
RIFE is distributed through the Practical-RIFE repository. Clone it, install dependencies, and download the model weights.
```bash
git clone https://github.com/hzwer/Practical-RIFE.git
cd Practical-RIFE
pip install -r requirements.txt
pip install opencv-python-headless
```
Download the v4.25 model (recommended for most scenes) from the release page and place the files in train_log/:
```bash
mkdir -p train_log
# Download model v4.25 from the Practical-RIFE model list
# Place RIFE_HDv3.py, IFNet_HDv3.py, and flownet.pkl into train_log/
```
Now load the model in Python:
```python
import os
import sys

import torch

# Add the Practical-RIFE directory to the path
sys.path.insert(0, os.path.abspath("Practical-RIFE"))

# The model code downloaded in the previous step lives in train_log/
from train_log.RIFE_HDv3 import Model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.set_grad_enabled(False)
if torch.cuda.is_available():
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.benchmark = True

model = Model()
model.load_model("Practical-RIFE/train_log", -1)
model.eval()
model.device()
```
The load_model method reads flownet.pkl from the specified directory. Passing -1 as the rank signals a single-process, non-distributed load (it also falls back to CPU weights when CUDA is unavailable). The model.device() call moves the network to CUDA if present.
## Read and Prepare Video Frames
Use OpenCV to read video frames and convert them to the tensor format RIFE expects. Frames go from BGR uint8 arrays to normalized float32 tensors in NCHW layout.
```python
import cv2
import numpy as np
import torch
from torch.nn import functional as F

def load_video_frames(video_path: str) -> tuple[list[np.ndarray], float, tuple[int, int]]:
    """Read all frames from a video file. Returns frames, FPS, and (height, width)."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Could not open {video_path}")
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(frame)
    cap.release()
    h, w = frames[0].shape[:2]
    return frames, fps, (h, w)

def frame_to_tensor(frame: np.ndarray, device: torch.device) -> torch.Tensor:
    """Convert a BGR uint8 frame to a normalized NCHW float32 tensor."""
    # HWC -> CHW, normalize to [0, 1]. The channel order stays BGR
    # end to end; the output is converted back the same way.
    tensor = torch.from_numpy(np.transpose(frame, (2, 0, 1))).float() / 255.0
    tensor = tensor.unsqueeze(0).to(device, non_blocking=True)
    return tensor

def pad_to_divisible(tensor: torch.Tensor, divisor: int = 32) -> tuple[torch.Tensor, int, int]:
    """Pad tensor so H and W are divisible by divisor. Returns padded tensor and pad amounts."""
    _, _, h, w = tensor.shape
    pad_h = (divisor - h % divisor) % divisor
    pad_w = (divisor - w % divisor) % divisor
    if pad_h > 0 or pad_w > 0:
        tensor = F.pad(tensor, (0, pad_w, 0, pad_h), mode="replicate")
    return tensor, pad_h, pad_w

frames, original_fps, (h, w) = load_video_frames("input_video.mp4")
print(f"Loaded {len(frames)} frames at {original_fps:.1f} FPS, resolution {w}x{h}")
```
RIFE’s IFNet requires spatial dimensions divisible by 32. The padding function handles this with replicate padding, which avoids edge artifacts better than zero padding. You crop the output back to the original size after inference.
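To make the arithmetic concrete, here is a small standalone helper (not part of the pipeline above, just the same modulo math) showing the pad amounts for common resolutions:

```python
def pad_amounts(h: int, w: int, divisor: int = 32) -> tuple[int, int]:
    """How many rows/columns of padding a frame of size h x w needs."""
    return (divisor - h % divisor) % divisor, (divisor - w % divisor) % divisor

print(pad_amounts(720, 1280))   # (16, 0): height padded 720 -> 736, width already aligned
print(pad_amounts(1080, 1920))  # (8, 0): height padded 1080 -> 1088, width already aligned
```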
## Interpolate Frames
With the model loaded and frames prepared, run RIFE to generate one intermediate frame between each consecutive pair. This doubles the frame count.
```python
import numpy as np
import torch

def interpolate_pair(model, frame1: np.ndarray, frame2: np.ndarray,
                     device: torch.device, scale: float = 1.0) -> np.ndarray:
    """Generate one intermediate frame between frame1 and frame2."""
    I0 = frame_to_tensor(frame1, device)
    I1 = frame_to_tensor(frame2, device)
    I0, pad_h, pad_w = pad_to_divisible(I0, divisor=32)
    I1, _, _ = pad_to_divisible(I1, divisor=32)
    # timestep=0.5 generates the exact midpoint frame
    mid = model.inference(I0, I1, timestep=0.5, scale=scale)
    # Remove padding
    _, _, padded_h, padded_w = mid.shape
    if pad_h > 0:
        mid = mid[:, :, :padded_h - pad_h, :]
    if pad_w > 0:
        mid = mid[:, :, :, :padded_w - pad_w]
    # Convert back to BGR uint8 (clamp first: the network can overshoot [0, 1])
    mid_np = (mid[0] * 255.0).clamp(0, 255).byte().cpu().numpy().transpose(1, 2, 0)
    return np.ascontiguousarray(mid_np)

def interpolate_2x(model, frames: list[np.ndarray],
                   device: torch.device, scale: float = 1.0) -> list[np.ndarray]:
    """Double the frame count by inserting one interpolated frame between each pair."""
    output_frames = []
    for i in range(len(frames) - 1):
        output_frames.append(frames[i])
        mid = interpolate_pair(model, frames[i], frames[i + 1], device, scale)
        output_frames.append(mid)
        if (i + 1) % 50 == 0:
            print(f"  Interpolated {i + 1}/{len(frames) - 1} pairs")
    output_frames.append(frames[-1])  # Don't forget the last frame
    return output_frames

interpolated = interpolate_2x(model, frames, device, scale=1.0)
print(f"2x interpolation: {len(frames)} -> {len(interpolated)} frames")
```
The scale parameter controls internal resolution scaling during flow estimation. Use scale=1.0 for 720p and below. For 1080p, try scale=0.5 to cut memory usage. For 4K, scale=0.5 is almost mandatory unless you have 24GB+ of VRAM.
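That guidance can be folded into a small helper for picking a default (a heuristic sketch, not an official RIFE API; choose_scale is a name introduced here):

```python
def choose_scale(height: int, width: int) -> float:
    """Heuristic: pick a flow-estimation scale from the frame size."""
    if height * width <= 1280 * 720:
        return 1.0  # 720p and below: full internal resolution
    return 0.5      # 1080p and up: halve internal resolution to save VRAM
```

Usage would look like `mid = model.inference(I0, I1, timestep=0.5, scale=choose_scale(h, w))`.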
The timestep=0.5 argument tells RIFE to generate the frame exactly halfway between the two inputs. You can pass other values (0.25, 0.75) to generate frames at different temporal positions – useful for variable-rate interpolation.
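As a sketch of that variable-rate idea, any Nx multiplier can be produced in a single pass by sweeping the timestep. Here `infer` is a hypothetical callable that wraps model.inference with the same padding and cropping as interpolate_pair, but forwards the temporal position t as the timestep:

```python
def variable_rate_positions(multiplier: int) -> list[float]:
    """Temporal positions for (multiplier - 1) intermediate frames per pair.

    multiplier=2 -> [0.5]; multiplier=4 -> [0.25, 0.5, 0.75]
    """
    return [i / multiplier for i in range(1, multiplier)]

def interpolate_nx(frames: list, infer, multiplier: int = 4) -> list:
    """Single-pass Nx interpolation given an infer(f0, f1, t) callable."""
    out = []
    for i in range(len(frames) - 1):
        out.append(frames[i])
        for t in variable_rate_positions(multiplier):
            out.append(infer(frames[i], frames[i + 1], t))
    out.append(frames[-1])
    return out
```

Compared with chaining 2x passes, each intermediate frame here is estimated directly from the two real frames, so errors do not compound across passes.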
## Write the Output Video
Assemble the interpolated frames into a new video file with double the original frame rate.
```python
import cv2

def write_video(output_path: str, frames: list, fps: float,
                size: tuple[int, int]) -> None:
    """Write frames to an MP4 video file."""
    w, h = size
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(output_path, fourcc, fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
    print(f"Wrote {len(frames)} frames to {output_path} at {fps:.1f} FPS")

# Doubling the frame count doubles the frame rate (e.g. 30 FPS -> 60 FPS)
output_fps = original_fps * 2
write_video("output_60fps.mp4", interpolated, output_fps, (w, h))
```
The mp4v codec works everywhere but produces larger files. For better compression, re-encode with FFmpeg afterward:
```bash
ffmpeg -i output_60fps.mp4 -c:v libx264 -crf 18 -preset slow output_60fps_h264.mp4
```
If you want to preserve the original audio track, merge it from the source:
```bash
ffmpeg -i output_60fps.mp4 -i input_video.mp4 -c:v libx264 -crf 18 \
  -map 0:v:0 -map 1:a:0 -shortest output_with_audio.mp4
```
## Multi-Pass Interpolation for 4x FPS
To go from 30fps to 120fps, run two passes of 2x interpolation. Each pass doubles the frame count.
```python
import time

import torch

def interpolate_multi_pass(model, frames: list, device: torch.device,
                           passes: int = 2, scale: float = 1.0) -> list:
    """Chain multiple 2x interpolation passes for higher multipliers.

    passes=1 -> 2x FPS
    passes=2 -> 4x FPS
    passes=3 -> 8x FPS
    """
    current_frames = frames
    for p in range(passes):
        start = time.monotonic()
        current_frames = interpolate_2x(model, current_frames, device, scale)
        elapsed = time.monotonic() - start
        multiplier = 2 ** (p + 1)
        print(f"Pass {p + 1}: {multiplier}x done, "
              f"{len(current_frames)} frames ({elapsed:.1f}s)")
    return current_frames

# 4x interpolation: 30fps -> 120fps
frames_4x = interpolate_multi_pass(model, frames, device, passes=2, scale=1.0)
output_fps_4x = original_fps * 4
write_video("output_120fps.mp4", frames_4x, output_fps_4x, (w, h))
print(f"4x interpolation: {original_fps:.0f} FPS -> {output_fps_4x:.0f} FPS")
```
Each pass processes more frames than the last (the second pass has twice as many pairs to interpolate as the first), so 4x takes roughly 3x as long as 2x – not 2x. For a 10-second 30fps clip (300 frames), expect:
- Pass 1: 299 interpolations -> 599 frames
- Pass 2: 598 interpolations -> 1197 frames
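In general, each pass maps n frames to 2n - 1, so p passes turn n frames into (n - 1) * 2^p + 1. A one-liner makes the counts above easy to check:

```python
def frames_after_passes(n_frames: int, passes: int) -> int:
    """Frame count after chaining `passes` rounds of 2x interpolation."""
    return (n_frames - 1) * 2 ** passes + 1

print(frames_after_passes(300, 1))  # 599
print(frames_after_passes(300, 2))  # 1197
print(frames_after_passes(300, 3))  # 2393
```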
On an RTX 3060 at 720p, this takes about 40 seconds for 300 input frames. GPU memory stays constant across passes since you process one pair at a time.
Going beyond 4x (passes=3 for 8x) is technically possible but rarely worth it. The synthesized frames start to look ghostly at extreme multipliers, especially around fast motion and scene cuts.
A practical tip: detect scene changes before interpolation. When two consecutive frames are from different scenes, skip interpolation and duplicate the first frame instead. You can detect scene cuts by comparing frame histograms:
```python
import cv2
import numpy as np

def is_scene_change(frame1: np.ndarray, frame2: np.ndarray,
                    threshold: float = 0.4) -> bool:
    """Detect scene changes using histogram correlation."""
    hist1 = cv2.calcHist([frame1], [0, 1, 2], None,
                         [8, 8, 8], [0, 256, 0, 256, 0, 256])
    hist2 = cv2.calcHist([frame2], [0, 1, 2], None,
                         [8, 8, 8], [0, 256, 0, 256, 0, 256])
    cv2.normalize(hist1, hist1)
    cv2.normalize(hist2, hist2)
    correlation = cv2.compareHist(hist1, hist2, cv2.HISTCMP_CORREL)
    return correlation < threshold
```
When is_scene_change returns True, insert a duplicate of frame1 instead of calling model.inference. This prevents the ghosting artifacts that happen when RIFE tries to blend two completely different images.
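Wired into the 2x loop, that looks like the following sketch. The interpolate and is_cut parameters stand in for wrappers around interpolate_pair and is_scene_change, passed as callables so the loop itself stays self-contained:

```python
def interpolate_2x_scene_aware(frames: list, interpolate, is_cut) -> list:
    """2x interpolation that duplicates frames across detected scene cuts.

    interpolate(f0, f1) returns the midpoint frame; is_cut(f0, f1) returns
    True at a hard cut.
    """
    out = []
    for i in range(len(frames) - 1):
        out.append(frames[i])
        if is_cut(frames[i], frames[i + 1]):
            out.append(frames[i])  # duplicate instead of blending across the cut
        else:
            out.append(interpolate(frames[i], frames[i + 1]))
    out.append(frames[-1])
    return out
```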
## Common Errors and Fixes
### RuntimeError: Sizes of tensors must match except in dimension 1
Your frames have dimensions that are not divisible by 32. RIFE’s IFNet uses multiple downsampling layers that require this alignment. Always pad inputs before inference:
```python
I0, pad_h, pad_w = pad_to_divisible(I0, divisor=32)
I1, _, _ = pad_to_divisible(I1, divisor=32)
```
Then crop the output to remove the padding. The pad_to_divisible function shown earlier handles this.
### CUDA out of memory on HD or 4K video
RIFE allocates intermediate feature maps proportional to the input resolution. For 1080p, a single pair can use 4-6GB of VRAM. Two fixes:
- Set scale=0.5 in the inference call. This halves the internal resolution for flow estimation while still producing a full-resolution output.
- Process frames on CPU for very large resolutions (slow but works): set device = torch.device("cpu").
```python
# For 4K input, use half scale
mid = model.inference(I0, I1, timestep=0.5, scale=0.5)
```
### KeyError: 'module.block0.conv0.0.weight' when loading model
This happens when the model weights were saved from distributed training with DataParallel, which adds a module. prefix to every key. The Practical-RIFE load_model method already strips this prefix. If you are loading weights manually, handle it yourself:
```python
state_dict = torch.load("flownet.pkl", map_location=device)
new_state_dict = {}
for k, v in state_dict.items():
    new_key = k.replace("module.", "")
    new_state_dict[new_key] = v
flownet.load_state_dict(new_state_dict, strict=False)
```
### Output video has no audio
OpenCV’s VideoWriter does not handle audio tracks. After writing the interpolated video, merge audio from the original using FFmpeg:
```bash
ffmpeg -i output_60fps.mp4 -i input_video.mp4 -c:v copy -c:a aac \
  -map 0:v:0 -map 1:a:0 -shortest output_with_audio.mp4
```
### Ghosting artifacts at scene cuts
RIFE blends the two input frames to produce the intermediate result. When those frames come from different scenes (a hard cut), the output is a semi-transparent overlay of both scenes. Use the histogram-based scene detection shown above to skip interpolation at cut points.