SAM 2 can segment every object in an image without you telling it what to look for. That makes it a strong foundation for semantic segmentation pipelines – you feed it an image, it returns pixel-perfect masks for everything it finds, and you layer classification on top.

Here is the fastest path to full-image segmentation:

import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

DEVICE = torch.device("cuda")
CHECKPOINT = "./checkpoints/sam2.1_hiera_large.pt"
CONFIG = "configs/sam2.1/sam2.1_hiera_l.yaml"

sam2_model = build_sam2(CONFIG, CHECKPOINT, device=DEVICE, apply_postprocessing=False)
mask_generator = SAM2AutomaticMaskGenerator(sam2_model)

image = np.array(Image.open("street_scene.jpg").convert("RGB"))
masks = mask_generator.generate(image)

print(f"Found {len(masks)} segments")
print(f"First mask keys: {masks[0].keys()}")
# dict_keys(['segmentation', 'area', 'bbox', 'predicted_iou', 'stability_score', ...])

Each entry in masks is a dictionary with a binary segmentation array (shape H x W), a bbox in XYWH format, an area count, and quality scores like predicted_iou and stability_score. Sort by area or score to prioritize the most prominent objects.
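That prioritization is a one-liner. A sketch with stand-in mask entries (the values here are made up; real entries come from `mask_generator.generate()`):

```python
import numpy as np

# Stand-in entries mimicking mask_generator.generate() output (hypothetical values)
masks = [
    {"segmentation": np.zeros((4, 4), dtype=bool), "area": 120, "predicted_iou": 0.88},
    {"segmentation": np.zeros((4, 4), dtype=bool), "area": 900, "predicted_iou": 0.95},
    {"segmentation": np.zeros((4, 4), dtype=bool), "area": 45,  "predicted_iou": 0.71},
]

by_area = sorted(masks, key=lambda m: m["area"], reverse=True)           # largest first
by_score = sorted(masks, key=lambda m: m["predicted_iou"], reverse=True)  # cleanest first

print([m["area"] for m in by_area])  # [900, 120, 45]
```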

Installation and Setup

SAM 2 needs Python 3.10+ and PyTorch 2.5.1+; a CUDA GPU is strongly recommended for usable speeds. The custom CUDA kernels became an optional extension in SAM 2.1, so installation no longer fails when they cannot compile.

git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e ".[notebooks]"

# Download SAM 2.1 checkpoints
cd checkpoints && ./download_ckpts.sh && cd ..

If you want to skip manual checkpoint downloads, load from Hugging Face instead:

from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

Four model sizes are available. Pick hiera_large (224M params) for best accuracy, or hiera_tiny (39M params) for speed-sensitive pipelines.
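For reference, the checkpoint and config names pair up as follows. The names below match the public facebookresearch/sam2 repository at time of writing, but verify them against your checkout:

```python
# SAM 2.1 model variants as (checkpoint file, Hydra config) pairs.
# Names taken from the facebookresearch/sam2 repo; double-check against your install.
SAM2_VARIANTS = {
    "tiny":      ("sam2.1_hiera_tiny.pt",      "configs/sam2.1/sam2.1_hiera_t.yaml"),
    "small":     ("sam2.1_hiera_small.pt",     "configs/sam2.1/sam2.1_hiera_s.yaml"),
    "base_plus": ("sam2.1_hiera_base_plus.pt", "configs/sam2.1/sam2.1_hiera_b+.yaml"),
    "large":     ("sam2.1_hiera_large.pt",     "configs/sam2.1/sam2.1_hiera_l.yaml"),
}

checkpoint, config = SAM2_VARIANTS["large"]
print(checkpoint)  # sam2.1_hiera_large.pt
```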

Automatic Mask Generation for Full-Scene Segmentation

The SAM2AutomaticMaskGenerator samples a grid of point prompts across the entire image, generates multiple candidate masks per point, then deduplicates and filters them. You get a complete segmentation map without providing any prompts.

You can tune the generator to control mask density and quality:

mask_generator = SAM2AutomaticMaskGenerator(
    model=sam2_model,
    points_per_side=32,          # grid density (32x32 = 1024 points sampled)
    points_per_batch=64,         # batch size for inference
    pred_iou_thresh=0.7,         # drop masks below this IoU score
    stability_score_thresh=0.92, # drop unstable masks
    crop_n_layers=1,             # extra crops for small objects
    min_mask_region_area=100,    # remove tiny fragments (in pixels)
)

masks = mask_generator.generate(image)

points_per_side=32 means 1024 seed points. Bump it to 64 for dense scenes with small objects, drop it to 16 for speed. The pred_iou_thresh and stability_score_thresh parameters filter out low-quality masks – raising them produces fewer but cleaner segments.

Visualizing the Segmentation Map

A common pattern is to overlay all masks as colored regions on the original image:

import matplotlib.pyplot as plt

def show_masks(image, masks, ax):
    """Overlay all generated masks on the image with random colors."""
    ax.imshow(image)
    if len(masks) == 0:
        return

    # Sort by area so large masks render first (small ones on top)
    sorted_masks = sorted(masks, key=lambda x: x["area"], reverse=True)

    for mask_data in sorted_masks:
        seg = mask_data["segmentation"]
        color = np.concatenate([np.random.random(3), [0.5]])  # RGBA with 50% opacity
        overlay = np.zeros((*seg.shape, 4))
        overlay[seg] = color
        ax.imshow(overlay)

    ax.axis("off")

fig, axes = plt.subplots(1, 2, figsize=(16, 8))
axes[0].imshow(image)
axes[0].set_title("Original")
axes[0].axis("off")

show_masks(image, masks, axes[1])
axes[1].set_title(f"SAM 2 Segmentation ({len(masks)} masks)")
plt.tight_layout()
plt.savefig("segmentation_output.png", dpi=150)
plt.show()

The masks overlap – SAM 2 is not strictly mutually exclusive like traditional semantic segmentation models. If you need non-overlapping labels, assign each pixel to the mask with the highest predicted_iou or stability_score.
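That flattening step can be sketched with a small helper (`flatten_masks` is a hypothetical name, not part of the SAM 2 API): each pixel goes to the overlapping mask with the highest `predicted_iou`, and 0 stays background.

```python
import numpy as np

def flatten_masks(masks, image_shape):
    """Resolve overlaps: each pixel gets the 1-based index of the
    overlapping mask with the highest predicted_iou; 0 = background."""
    label_map = np.zeros(image_shape, dtype=np.int32)
    best_score = np.zeros(image_shape, dtype=np.float32)
    for idx, m in enumerate(masks):
        # Pixels inside this mask where it beats the current best score
        win = m["segmentation"] & (m["predicted_iou"] > best_score)
        label_map[win] = idx + 1
        best_score[win] = m["predicted_iou"]
    return label_map

# Two overlapping toy masks: the higher-scoring one wins the shared pixels
a = np.zeros((4, 4), dtype=bool); a[0:3, 0:3] = True
b = np.zeros((4, 4), dtype=bool); b[1:4, 1:4] = True
masks = [
    {"segmentation": a, "predicted_iou": 0.80},
    {"segmentation": b, "predicted_iou": 0.95},
]
label_map = flatten_masks(masks, (4, 4))
print(label_map)  # overlap region belongs to mask 2
```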

Prompted Segmentation with Points and Boxes

When you know what you want to segment, point and box prompts give you direct control. This is useful for building interactive annotation tools or targeting specific objects in a pipeline.

Point Prompts

Pass one or more (x, y) coordinates and a label for each: 1 means foreground, 0 means background.

from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(sam2_model)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)

    # Two foreground points on the target object
    point_coords = np.array([[400, 300], [450, 350]])
    point_labels = np.array([1, 1])

    masks, scores, logits = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        multimask_output=True,
    )

# Pick the best mask
best_idx = np.argmax(scores)
best_mask = masks[best_idx]
print(f"Best mask score: {scores[best_idx]:.3f}")
print(f"Mask shape: {best_mask.shape}")  # (H, W)

With multimask_output=True, you get three candidate masks ranked by score. The model produces multiple candidates because a single point is ambiguous – it could mean the whole car or just the tire. The top-scoring mask is usually the right interpretation.

Box Prompts

Bounding boxes remove that ambiguity. Pass [x_min, y_min, x_max, y_max]:

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)

    # Tight bounding box around a car
    input_box = np.array([120, 200, 850, 600])

    masks, scores, logits = predictor.predict(
        box=input_box,
        multimask_output=False,
    )

print(f"Mask confidence: {scores[0]:.3f}")

Use multimask_output=False with boxes – the box already constrains the region, so you want a single confident mask. You can also stack points and boxes together for tricky cases like occluded objects: add a background point on the occluder to tell SAM 2 “that part is not my object.”
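A sketch of that combined prompt for an occluded object (all coordinates here are hypothetical; `predictor` is the `SAM2ImagePredictor` from above with `set_image` already called):

```python
import numpy as np

# Box around the target car, plus a background point on the occluding pedestrian
input_box = np.array([120, 200, 850, 600])               # [x_min, y_min, x_max, y_max]
point_coords = np.array([[480, 420]], dtype=np.float32)  # hypothetical occluder location
point_labels = np.array([0], dtype=np.int32)             # 0 = "not my object"

# All three go into a single call on the predictor:
# masks, scores, logits = predictor.predict(
#     box=input_box,
#     point_coords=point_coords,
#     point_labels=point_labels,
#     multimask_output=False,
# )
print(point_coords.shape, point_labels.shape)  # (1, 2) (1,)
```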

Video Segmentation with SAM2VideoPredictor

SAM 2’s headline feature over the original SAM is video support. You prompt a single frame, and the model propagates the segmentation across the entire video using a memory mechanism.

The video predictor expects frames extracted as JPEG files in a directory. Prepare your video first:

import os
import cv2

def extract_frames(video_path, output_dir, max_frames=200):
    """Extract video frames as numbered JPEG files."""
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0

    while cap.isOpened() and frame_idx < max_frames:
        ret, frame = cap.read()
        if not ret:
            break
        out_path = os.path.join(output_dir, f"{frame_idx:05d}.jpg")
        cv2.imwrite(out_path, frame)  # cap.read() returns BGR, which imwrite expects
        frame_idx += 1

    cap.release()
    print(f"Extracted {frame_idx} frames to {output_dir}")
    return frame_idx

extract_frames("input_video.mp4", "./video_frames/")

Now run the video predictor. Prompt with a point or box on a single frame, then propagate:

from sam2.build_sam import build_sam2_video_predictor

video_predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",
    "./checkpoints/sam2.1_hiera_large.pt",
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # Initialize with the frames directory
    state = video_predictor.init_state(video_path="./video_frames/")

    # Add a point prompt on frame 0 to identify the object
    frame_idx, object_ids, masks = video_predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[400, 300]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the segmentation across all frames
    video_segments = {}
    for frame_idx, object_ids, masks in video_predictor.propagate_in_video(state):
        video_segments[frame_idx] = {
            "object_ids": object_ids,
            "masks": (masks > 0.0).cpu().numpy(),
        }

print(f"Segmented {len(video_segments)} frames")
print(f"Objects tracked: {video_segments[0]['object_ids']}")

You can track multiple objects by calling add_new_points_or_box with different obj_id values. The model maintains separate memory banks for each object and handles occlusions between them.
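Because `propagate_in_video` yields a mask per tracked object on every frame, you can detect when an object disappears behind an occluder by checking for empty masks. A minimal sketch on synthetic `video_segments` data (`visible_frames` is a hypothetical helper; the real structure comes from the propagation loop above):

```python
import numpy as np

def visible_frames(video_segments, obj_id):
    """Frames where obj_id has at least one foreground pixel."""
    frames = []
    for frame_idx, data in video_segments.items():
        for oid, mask in zip(data["object_ids"], data["masks"]):
            if oid == obj_id and mask.any():
                frames.append(frame_idx)
    return frames

# Synthetic two-frame, two-object example; object 2 is fully occluded in frame 1
full = np.ones((1, 4, 4), dtype=bool)
empty = np.zeros((1, 4, 4), dtype=bool)
video_segments = {
    0: {"object_ids": [1, 2], "masks": np.stack([full, full])},
    1: {"object_ids": [1, 2], "masks": np.stack([full, empty])},
}
print(visible_frames(video_segments, 2))  # [0]
```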

Visualizing Video Segmentation

Save the segmented frames back as a video:

def save_segmented_video(frames_dir, video_segments, output_path, fps=30):
    """Overlay segmentation masks on frames and write to video."""
    frame_files = sorted(f for f in os.listdir(frames_dir) if f.endswith(".jpg"))
    first_frame = cv2.imread(os.path.join(frames_dir, frame_files[0]))
    h, w = first_frame.shape[:2]

    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(output_path, fourcc, fps, (w, h))

    for idx, fname in enumerate(frame_files):
        frame = cv2.imread(os.path.join(frames_dir, fname))
        if idx in video_segments:
            for mask in video_segments[idx]["masks"]:
                # Green overlay for the segmented region
                color_mask = np.zeros_like(frame)
                color_mask[mask.squeeze()] = [0, 255, 0]
                frame = cv2.addWeighted(frame, 1.0, color_mask, 0.4, 0)
        writer.write(frame)

    writer.release()
    print(f"Saved segmented video to {output_path}")

save_segmented_video("./video_frames/", video_segments, "segmented_output.mp4")

Building a Semantic Segmentation Pipeline

Raw SAM 2 masks are class-agnostic – they tell you where objects are but not what they are. To build a full semantic segmentation pipeline, combine SAM 2 with a classifier. Here is a practical pattern using CLIP for zero-shot labeling:

from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator
from transformers import CLIPProcessor, CLIPModel

# SAM 2 for segmentation
sam2_model = build_sam2(CONFIG, CHECKPOINT, device=DEVICE, apply_postprocessing=False)
mask_generator = SAM2AutomaticMaskGenerator(sam2_model, points_per_side=32)

# CLIP for classification
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(DEVICE)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Define your label set
labels = ["car", "person", "tree", "building", "road", "sky", "sidewalk"]

image = np.array(Image.open("city_scene.jpg").convert("RGB"))
masks = mask_generator.generate(image)

labeled_segments = []
for mask_data in masks:
    seg = mask_data["segmentation"]
    bbox = mask_data["bbox"]  # [x, y, w, h]

    # Crop the region for classification
    x, y, w, h = (int(v) for v in bbox)  # bbox values may be floats
    crop = image[y:y+h, x:x+w]
    if crop.size == 0:
        continue

    crop_pil = Image.fromarray(crop)
    inputs = clip_processor(
        text=labels,
        images=crop_pil,
        return_tensors="pt",
        padding=True,
    ).to(DEVICE)

    with torch.no_grad():
        outputs = clip_model(**inputs)
        probs = outputs.logits_per_image.softmax(dim=-1)[0]

    best_label_idx = probs.argmax().item()
    labeled_segments.append({
        "label": labels[best_label_idx],
        "confidence": probs[best_label_idx].item(),
        "mask": seg,
        "bbox": bbox,
    })

for seg in labeled_segments[:5]:
    print(f"{seg['label']}: {seg['confidence']:.2f} (area: {seg['mask'].sum()} px)")

This gives you a labeled segmentation map without training a single model on your data. Swap CLIP for a domain-specific classifier when you need higher accuracy on known classes.
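To finish the pipeline, the labeled segments can be collapsed into a per-pixel class map, with the most confident label winning overlaps. A sketch with made-up segments (`to_semantic_map` is a hypothetical helper; in practice `labeled_segments` comes from the loop above):

```python
import numpy as np

def to_semantic_map(labeled_segments, class_names, image_shape):
    """Per-pixel class indices (0 = background, i+1 = class_names[i]),
    with the most confident segment winning overlaps."""
    class_map = np.zeros(image_shape, dtype=np.int32)
    best_conf = np.zeros(image_shape, dtype=np.float32)
    for seg in labeled_segments:
        win = seg["mask"] & (seg["confidence"] > best_conf)
        class_map[win] = class_names.index(seg["label"]) + 1
        best_conf[win] = seg["confidence"]
    return class_map

# Made-up segments: a "car" and an overlapping, more confident "person"
car = np.zeros((4, 4), dtype=bool); car[:, :3] = True
person = np.zeros((4, 4), dtype=bool); person[:, 2:] = True
segments = [
    {"label": "car", "confidence": 0.60, "mask": car},
    {"label": "person", "confidence": 0.90, "mask": person},
]
class_map = to_semantic_map(segments, ["car", "person"], (4, 4))
print(class_map[0])  # [1 1 2 2]
```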

Common Errors and Fixes

RuntimeError: No CUDA GPUs are available

SAM 2 targets CUDA GPUs, and every snippet in this guide assumes one. CPU and MPS inference exist but are experimental and far slower; you can try passing device="cpu" to build_sam2 if you must. For GPU speed on a machine without one, use Google Colab or a cloud instance with at least an NVIDIA T4.

Hydra configuration error when loading the model

This usually means the config path is wrong. SAM 2 resolves configs with Hydra, so the config name is interpreted relative to the installed sam2 package, not your working directory:

# Correct -- relative config path
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"

# Also correct -- if you installed sam2 as a package, Hydra finds it automatically
model_cfg = "sam2.1/sam2.1_hiera_l"

If you get ConfigCompositionException, try adding the config search path:

from hydra import initialize_config_module
from hydra.core.global_hydra import GlobalHydra

# Reset Hydra, then point it back at the sam2 package's configs
GlobalHydra.instance().clear()
initialize_config_module("sam2", version_base="1.2")

torch.OutOfMemoryError with automatic mask generation

Automatic mask generation uses more memory than prompted segmentation because it processes many points in parallel. Reduce points_per_batch from 64 to 32, or lower points_per_side from 32 to 16. Using torch.autocast("cuda", dtype=torch.bfloat16) also helps significantly.

Video predictor expects JPEG frames, not a video file

init_state(video_path=...) expects a directory of JPEG frames, not an MP4 or AVI file. Extract frames first using OpenCV or ffmpeg:

ffmpeg -i input.mp4 -q:v 2 -start_number 0 video_frames/%05d.jpg