Video object removal is one of those problems that sounds straightforward until you try it frame by frame. Naive per-frame image inpainting produces flickering garbage because each frame gets filled independently with no temporal consistency. The fix: use SAM 2 to propagate a segmentation mask across the entire video, then feed those masks to a video-aware inpainting model like ProPainter that reasons about motion and neighboring frames.

Here is the high-level pipeline you will build:

import cv2
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Step 1: Extract frames
# Step 2: Segment the target object with SAM 2 video predictor
# Step 3: Generate per-frame binary masks
# Step 4: Run ProPainter to inpaint masked regions with temporal coherence
# Step 5: Reassemble frames into output video with OpenCV

The rest of this guide walks through each step with copy-paste ready code.

Extract Frames from the Source Video

Before anything else, pull frames out of the video with OpenCV. SAM 2’s video predictor expects a directory of JPEG frames.

import cv2
import os

def extract_frames(video_path: str, output_dir: str) -> dict:
    """Extract all frames from a video file. Returns metadata."""
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Could not open video: {video_path}")

    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    frame_idx = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # SAM 2 expects JPEG frames named as zero-padded numbers
        cv2.imwrite(os.path.join(output_dir, f"{frame_idx:05d}.jpg"), frame)
        frame_idx += 1

    cap.release()
    return {"fps": fps, "width": width, "height": height, "total_frames": frame_idx}


video_meta = extract_frames("input.mp4", "frames/")
print(f"Extracted {video_meta['total_frames']} frames at {video_meta['fps']} FPS")

Keep the metadata around – you need the FPS and resolution when you write the output video later.
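The zero-padded filenames are not cosmetic: frame directories are usually read back in lexicographic order, and without padding frame 10 sorts before frame 2. A quick demonstration:

```python
# Zero-padded names sort in frame order; plain names do not.
frame_ids = [0, 2, 10]

padded = sorted(f"{i:05d}.jpg" for i in frame_ids)
plain = sorted(f"{i}.jpg" for i in frame_ids)

print(padded)  # ['00000.jpg', '00002.jpg', '00010.jpg'] – correct order
print(plain)   # ['0.jpg', '10.jpg', '2.jpg'] – frame 10 before frame 2
```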

Segment the Object Across Frames with SAM 2

SAM 2’s video predictor is the key piece here. You give it a single prompt on one frame (a point click, bounding box, or mask), and it propagates the segmentation across every frame in the video. This gives you temporally consistent masks without manual annotation.

import torch
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

# Load SAM 2 video predictor
checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

frames_dir = "frames/"

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # Initialize state from the frames directory
    state = predictor.init_state(video_path=frames_dir)

    # Prompt on frame 0: click on the object you want to remove
    # (x, y) coordinates of the object center, label=1 means foreground
    _, obj_ids, mask_logits = predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[320, 240]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the mask across all frames
    video_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        # mask_logits shape: (num_objects, 1, H, W)
        # Threshold at 0.0 to get binary mask
        mask = (mask_logits[0] > 0.0).cpu().numpy().squeeze()
        video_masks[frame_idx] = mask

print(f"Generated masks for {len(video_masks)} frames")

If a point prompt is not precise enough, use a bounding box instead:

_, obj_ids, mask_logits = predictor.add_new_points_or_box(
    inference_state=state,
    frame_idx=0,
    obj_id=1,
    box=np.array([200, 150, 440, 330], dtype=np.float32),  # [x1, y1, x2, y2]
)

Bounding boxes work better for objects with complex shapes or thin features where a single point might be ambiguous.
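If you already have a rough first-frame mask (say, from a quick manual annotation), you can derive the box prompt from it instead of eyeballing coordinates. A small helper for this – mask_to_box is not part of SAM 2, just plain NumPy:

```python
import numpy as np

def mask_to_box(mask: np.ndarray) -> np.ndarray:
    """Convert a binary mask to an [x1, y1, x2, y2] box prompt."""
    ys, xs = np.where(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float32)

# Example: a rough rectangular mask on a 640x480 frame
mask = np.zeros((480, 640), dtype=bool)
mask[150:330, 200:440] = True
print(mask_to_box(mask))  # [200. 150. 439. 329.]
```

The resulting array can be passed directly as the box argument shown above.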

Prepare Masks and Run ProPainter

ProPainter is a video inpainting model that uses dual-domain propagation and temporal transformers to fill masked regions with content that stays consistent across frames. It is one of the strongest open-source options for this task.

First, save the binary masks as PNGs (white = region to remove, black = keep):

import os

import cv2
import numpy as np

mask_dir = "masks/"
os.makedirs(mask_dir, exist_ok=True)

for frame_idx, mask in video_masks.items():
    # Dilate the mask slightly to cover edge artifacts
    kernel = np.ones((5, 5), np.uint8)
    dilated = cv2.dilate(mask.astype(np.uint8) * 255, kernel, iterations=2)
    cv2.imwrite(os.path.join(mask_dir, f"{frame_idx:05d}.png"), dilated)

Dilating the mask by a few pixels is important. SAM 2 masks are tight to the object boundary, but inpainting works better when the mask slightly overshoots – it prevents ghosting at the edges.

Now run ProPainter. Clone the repo and use its inference script:

git clone https://github.com/sczhou/ProPainter.git
cd ProPainter
pip install -r requirements.txt

# Download pretrained weights
python scripts/download_model.py

# Run inpainting
python inference_propainter.py \
    --video frames/ \
    --mask masks/ \
    --output results/ \
    --height 480 \
    --width 640 \
    --fp16

The --fp16 flag cuts VRAM usage roughly in half. ProPainter processes the video in overlapping windows (default 80 frames with 4-frame overlap), so it can handle long videos without running out of memory.
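To see how that windowing covers a clip, here is an illustrative sketch of overlapping windows using the sizes quoted above (the exact partitioning inside ProPainter may differ – this is just to build intuition):

```python
def windows(n_frames: int, length: int = 80, overlap: int = 4):
    """Illustrative overlapping-window partition of a frame range."""
    stride = length - overlap
    starts = range(0, max(n_frames - overlap, 1), stride)
    return [(s, min(s + length, n_frames)) for s in starts]

print(windows(200))  # [(0, 80), (76, 156), (152, 200)]
```

The overlap gives the model shared frames between windows, which is what keeps the fill consistent across window boundaries.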

ProPainter's inference_propainter.py is written as a command-line script rather than an importable API, so the most reliable way to drive it from Python is to invoke the script as a subprocess:

import subprocess

subprocess.run(
    [
        "python", "inference_propainter.py",
        "--video", "../frames/",
        "--mask", "../masks/",
        "--output", "../results/",
        "--height", "480",
        "--width", "640",
        "--fp16",
    ],
    cwd="ProPainter/",
    check=True,  # raise if ProPainter exits with an error
)

Reassemble the Output Video

ProPainter writes inpainted frames to the output directory. Stitch them back into a video with OpenCV:

import cv2
import os
import glob

def frames_to_video(
    frames_dir: str,
    output_path: str,
    fps: float,
    width: int,
    height: int,
) -> None:
    """Combine frames into an MP4 video."""
    frame_paths = sorted(glob.glob(os.path.join(frames_dir, "*.png")))
    if not frame_paths:
        frame_paths = sorted(glob.glob(os.path.join(frames_dir, "*.jpg")))

    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(output_path, fourcc, fps, (width, height))

    for path in frame_paths:
        frame = cv2.imread(path)
        if frame is None:
            continue  # skip unreadable files
        if frame.shape[1] != width or frame.shape[0] != height:
            frame = cv2.resize(frame, (width, height))
        writer.write(frame)

    writer.release()
    print(f"Wrote {len(frame_paths)} frames to {output_path}")


frames_to_video(
    frames_dir="results/",
    output_path="output_clean.mp4",
    fps=video_meta["fps"],
    width=video_meta["width"],
    height=video_meta["height"],
)

If you need the audio from the original video, use ffmpeg to mux it back in:

ffmpeg -i output_clean.mp4 -i input.mp4 -c:v copy -c:a aac -map 0:v:0 -map 1:a:0 -shortest output_with_audio.mp4

Common Errors and Fixes

RuntimeError: CUDA out of memory when running SAM 2 on long videos.

SAM 2’s video predictor loads all frame features into GPU memory. For videos longer than a few hundred frames, process in chunks:

chunk_size = 100
all_masks = {}

for start in range(0, video_meta["total_frames"], chunk_size):
    end = min(start + chunk_size, video_meta["total_frames"])
    # Create a subdirectory with just this chunk's frames
    chunk_dir = f"frames_chunk_{start}/"
    os.makedirs(chunk_dir, exist_ok=True)
    for i in range(start, end):
        src = f"frames/{i:05d}.jpg"
        dst = os.path.join(chunk_dir, f"{i - start:05d}.jpg")
        os.symlink(os.path.abspath(src), dst)

    state = predictor.init_state(video_path=chunk_dir)
    # NOTE: a fixed point only works if the object stays near (320, 240);
    # for a moving object, update the prompt per chunk (e.g. from the
    # centroid of the previous chunk's last mask)
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[320, 240]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        mask = (mask_logits[0] > 0.0).cpu().numpy().squeeze()
        all_masks[start + frame_idx] = mask

    predictor.reset_state(state)

cv2.error: (-215:Assertion failed) !_img.empty() in function 'imwrite'

This means OpenCV got an empty frame. Check that your video path is correct and the codec is supported. Try re-encoding with ffmpeg first:

ffmpeg -i broken_video.mov -c:v libx264 -crf 18 -pix_fmt yuv420p fixed_video.mp4

ProPainter produces blurry results or visible seams.

Three things to check:

  1. Make sure masks are dilated by at least 3-5 pixels beyond the object boundary
  2. Increase --neighbor_length to 15-20 so ProPainter references more surrounding frames
  3. Run at the video’s native resolution instead of downscaling – ProPainter handles up to 1080p on a 24GB GPU with --fp16

SAM 2 loses track of the object mid-video.

Add correction prompts on frames where tracking drifts. You can add multiple prompts before calling propagate_in_video:

# Initial prompt on frame 0
predictor.add_new_points_or_box(
    inference_state=state, frame_idx=0, obj_id=1,
    points=np.array([[320, 240]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)
# Correction prompt on frame 150 where tracking drifted
predictor.add_new_points_or_box(
    inference_state=state, frame_idx=150, obj_id=1,
    points=np.array([[350, 260]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)

SAM 2 uses all prompts jointly when propagating, so correction points anchor the mask where it matters most.
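A cheap way to catch drift automatically is to watch the mask area over time: a sudden collapse usually means SAM 2 lost the object, and that frame is a good place for a correction prompt. A minimal sketch over a video_masks-style dict (find_drift_frames is a hypothetical helper, not part of SAM 2):

```python
import numpy as np

def find_drift_frames(masks: dict, drop_ratio: float = 0.5) -> list:
    """Flag frames where the mask area drops sharply vs the previous frame."""
    drifted, prev_area = [], None
    for idx in sorted(masks):
        area = int(np.count_nonzero(masks[idx]))
        if prev_area and area < prev_area * drop_ratio:
            drifted.append(idx)
        prev_area = area
    return drifted

# Synthetic example: the mask collapses at frame 3
fake = {i: np.ones((10, 10), bool) for i in range(3)}
fake[3] = np.zeros((10, 10), bool)
print(find_drift_frames(fake))  # [3]
```

Run it once after propagation, then add correction prompts at the flagged frames and propagate again.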