SAM 2 can segment every object in an image without you telling it what to look for. That makes it a strong foundation for semantic segmentation pipelines – you feed it an image, it returns pixel-perfect masks for everything it finds, and you layer classification on top.
Here is the fastest path to full-image segmentation:
```python
import numpy as np
import torch
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

DEVICE = torch.device("cuda")
CHECKPOINT = "./checkpoints/sam2.1_hiera_large.pt"
CONFIG = "configs/sam2.1/sam2.1_hiera_l.yaml"

sam2_model = build_sam2(CONFIG, CHECKPOINT, device=DEVICE, apply_postprocessing=False)
mask_generator = SAM2AutomaticMaskGenerator(sam2_model)

image = np.array(Image.open("street_scene.jpg").convert("RGB"))
masks = mask_generator.generate(image)

print(f"Found {len(masks)} segments")
print(f"First mask keys: {masks[0].keys()}")
# dict_keys(['segmentation', 'area', 'bbox', 'predicted_iou', 'stability_score', ...])
```
Each entry in masks is a dictionary with a binary segmentation array (shape H x W), a bbox in XYWH format, an area count, and quality scores like predicted_iou and stability_score. Sort by area or score to prioritize the most prominent objects.
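That sorting is a one-liner; a small helper makes it reusable (`top_masks` is a name of my choosing, not part of the SAM 2 API):

```python
def top_masks(masks, k=5, key="area"):
    """Return the k most prominent mask records, largest key first.

    Works on the list of dicts from SAM2AutomaticMaskGenerator.generate();
    useful keys are "area", "predicted_iou", and "stability_score".
    """
    return sorted(masks, key=lambda m: m[key], reverse=True)[:k]
```

Swap `key="predicted_iou"` in when you care about mask quality rather than size.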
## Installation and Setup
SAM 2 needs Python 3.10+, PyTorch 2.5.1+, and a CUDA GPU. The custom CUDA kernels are optional in SAM 2.1, so installation rarely fails.
```bash
git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e ".[notebooks]"

# Download SAM 2.1 checkpoints
cd checkpoints && ./download_ckpts.sh && cd ..
```
If you want to skip manual checkpoint downloads, load from Hugging Face instead:
```python
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
```
Four model sizes are available. Pick hiera_large (224M params) for best accuracy, or hiera_tiny (39M params) for speed-sensitive pipelines.
## Automatic Mask Generation for Full-Scene Segmentation
The SAM2AutomaticMaskGenerator samples a grid of point prompts across the entire image, generates multiple candidate masks per point, then deduplicates and filters them. You get a complete segmentation map without providing any prompts.
You can tune the generator to control mask density and quality:
```python
mask_generator = SAM2AutomaticMaskGenerator(
    model=sam2_model,
    points_per_side=32,           # grid density (32x32 = 1024 points sampled)
    points_per_batch=64,          # batch size for inference
    pred_iou_thresh=0.7,          # drop masks below this IoU score
    stability_score_thresh=0.92,  # drop unstable masks
    crop_n_layers=1,              # extra crops for small objects
    min_mask_region_area=100,     # remove tiny fragments (in pixels)
)
masks = mask_generator.generate(image)
```
points_per_side=32 means 1024 seed points. Bump it to 64 for dense scenes with small objects, drop it to 16 for speed. The pred_iou_thresh and stability_score_thresh parameters filter out low-quality masks – raising them produces fewer but cleaner segments.
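You can also tighten quality after the fact without re-running inference. This sketch (the helper name and defaults are mine) applies the same three thresholds to an already-generated `masks` list:

```python
def filter_masks(masks, min_iou=0.7, min_stability=0.92, min_area=100):
    """Keep only mask records that clear all three quality thresholds.

    Mirrors pred_iou_thresh, stability_score_thresh, and
    min_mask_region_area, but runs on an existing list of mask dicts,
    so experimenting with thresholds is free.
    """
    return [
        m for m in masks
        if m["predicted_iou"] >= min_iou
        and m["stability_score"] >= min_stability
        and m["area"] >= min_area
    ]
```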
## Visualizing the Segmentation Map
A common pattern is to overlay all masks as colored regions on the original image:
```python
import matplotlib.pyplot as plt

def show_masks(image, masks, ax):
    """Overlay all generated masks on the image with random colors."""
    ax.imshow(image)
    if len(masks) == 0:
        return
    # Sort by area so large masks render first (small ones on top)
    sorted_masks = sorted(masks, key=lambda x: x["area"], reverse=True)
    for mask_data in sorted_masks:
        seg = mask_data["segmentation"]
        color = np.concatenate([np.random.random(3), [0.5]])  # RGBA with 50% opacity
        overlay = np.zeros((*seg.shape, 4))
        overlay[seg] = color
        ax.imshow(overlay)
    ax.axis("off")

fig, axes = plt.subplots(1, 2, figsize=(16, 8))
axes[0].imshow(image)
axes[0].set_title("Original")
axes[0].axis("off")
show_masks(image, masks, axes[1])
axes[1].set_title(f"SAM 2 Segmentation ({len(masks)} masks)")
plt.tight_layout()
plt.savefig("segmentation_output.png", dpi=150)
plt.show()
```
The masks can overlap – unlike a traditional semantic segmentation model, SAM 2 does not enforce mutually exclusive labels. If you need non-overlapping labels, assign each pixel to the mask with the highest predicted_iou or stability_score.
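One way to do that flattening (a sketch; `masks_to_label_map` is my helper, not a library function):

```python
import numpy as np

def masks_to_label_map(masks, shape, score_key="predicted_iou"):
    """Flatten overlapping SAM 2 masks into one non-overlapping label map.

    Each pixel gets the 1-based index of the highest-scoring mask that
    covers it; 0 means background. `masks` is the list of dicts returned
    by SAM2AutomaticMaskGenerator.generate(); `shape` is (H, W).
    """
    label_map = np.zeros(shape, dtype=np.int32)
    best_score = np.full(shape, -np.inf)
    for idx, m in enumerate(masks, start=1):
        seg = m["segmentation"]
        # claim pixels where this mask is present and outscores the current owner
        win = seg & (m[score_key] > best_score)
        label_map[win] = idx
        best_score[win] = m[score_key]
    return label_map
```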
## Prompted Segmentation with Points and Boxes
When you know what you want to segment, point and box prompts give you direct control. This is useful for building interactive annotation tools or targeting specific objects in a pipeline.
### Point Prompts
Pass one or more (x, y) coordinates and a label for each: 1 means foreground, 0 means background.
```python
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(sam2_model)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)

    # Two foreground points on the target object
    point_coords = np.array([[400, 300], [450, 350]])
    point_labels = np.array([1, 1])

    masks, scores, logits = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        multimask_output=True,
    )

# Pick the best mask
best_idx = np.argmax(scores)
best_mask = masks[best_idx]
print(f"Best mask score: {scores[best_idx]:.3f}")
print(f"Mask shape: {best_mask.shape}")  # (H, W)
```
With multimask_output=True, you get three candidate masks, each with a confidence score. The model produces multiple candidates because a single point is ambiguous – it could mean the whole car or just the tire. The top-scoring mask is usually the right interpretation.
### Box Prompts
Bounding boxes remove that ambiguity. Pass [x_min, y_min, x_max, y_max]:
```python
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)

    # Tight bounding box around a car
    input_box = np.array([120, 200, 850, 600])

    masks, scores, logits = predictor.predict(
        box=input_box,
        multimask_output=False,
    )

print(f"Mask confidence: {scores[0]:.3f}")
```
Use multimask_output=False with boxes – the box already constrains the region, so you want a single confident mask. You can also stack points and boxes together for tricky cases like occluded objects: add a background point on the occluder to tell SAM 2 “that part is not my object.”
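A sketch of that box-plus-background-point combination (the helper name and all coordinates are illustrative; it assumes a `predictor` with `set_image()` already called):

```python
import numpy as np

def occlusion_prompts(box, occluder_xy):
    """Build predict() kwargs: a box around the target object plus one
    background point (label 0 = "not my object") on the occluder."""
    return {
        "box": np.asarray(box, dtype=np.float32),            # [x_min, y_min, x_max, y_max]
        "point_coords": np.asarray([occluder_xy], dtype=np.float32),
        "point_labels": np.asarray([0], dtype=np.int32),     # 0 = background
        "multimask_output": False,
    }

# usage, after predictor.set_image(image):
# masks, scores, logits = predictor.predict(
#     **occlusion_prompts([120, 200, 850, 600], (500, 380)))
```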
## Video Segmentation with SAM2VideoPredictor
SAM 2’s headline feature over the original SAM is video support. You prompt a single frame, and the model propagates the segmentation across the entire video using a memory mechanism.
The video predictor expects frames extracted as JPEG files in a directory. Prepare your video first:
```python
import os
import cv2

def extract_frames(video_path, output_dir, max_frames=200):
    """Extract video frames as numbered JPEG files."""
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while cap.isOpened() and frame_idx < max_frames:
        ret, frame = cap.read()
        if not ret:
            break
        # cv2 reads and writes BGR, so the frame can be saved as-is
        out_path = os.path.join(output_dir, f"{frame_idx:05d}.jpg")
        cv2.imwrite(out_path, frame)
        frame_idx += 1
    cap.release()
    print(f"Extracted {frame_idx} frames to {output_dir}")
    return frame_idx

extract_frames("input_video.mp4", "./video_frames/")
```
Now run the video predictor. Prompt with a point or box on a single frame, then propagate:
```python
from sam2.build_sam import build_sam2_video_predictor

video_predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",
    "./checkpoints/sam2.1_hiera_large.pt",
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # Initialize with the frames directory
    state = video_predictor.init_state(video_path="./video_frames/")

    # Add a point prompt on frame 0 to identify the object
    frame_idx, object_ids, masks = video_predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[400, 300]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the segmentation across all frames
    video_segments = {}
    for frame_idx, object_ids, masks in video_predictor.propagate_in_video(state):
        video_segments[frame_idx] = {
            "object_ids": object_ids,
            "masks": (masks > 0.0).cpu().numpy(),
        }

print(f"Segmented {len(video_segments)} frames")
print(f"Objects tracked: {video_segments[0]['object_ids']}")
```
You can track multiple objects by calling add_new_points_or_box with different obj_id values. The model maintains separate memory banks for each object and handles occlusions between them.
### Visualizing Video Segmentation
Save the segmented frames back as a video:
```python
def save_segmented_video(frames_dir, video_segments, output_path, fps=30):
    """Overlay segmentation masks on frames and write to video."""
    frame_files = sorted(os.listdir(frames_dir))
    first_frame = cv2.imread(os.path.join(frames_dir, frame_files[0]))
    h, w = first_frame.shape[:2]

    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(output_path, fourcc, fps, (w, h))

    for idx, fname in enumerate(frame_files):
        frame = cv2.imread(os.path.join(frames_dir, fname))
        if idx in video_segments:
            for mask in video_segments[idx]["masks"]:
                # Green overlay for the segmented region
                color_mask = np.zeros_like(frame)
                color_mask[mask.squeeze()] = [0, 255, 0]
                frame = cv2.addWeighted(frame, 1.0, color_mask, 0.4, 0)
        writer.write(frame)

    writer.release()
    print(f"Saved segmented video to {output_path}")

save_segmented_video("./video_frames/", video_segments, "segmented_output.mp4")
```
## Building a Semantic Segmentation Pipeline
Raw SAM 2 masks are class-agnostic – they tell you where objects are but not what they are. To build a full semantic segmentation pipeline, combine SAM 2 with a classifier. Here is a practical pattern using CLIP for zero-shot labeling:
```python
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator
from transformers import CLIPProcessor, CLIPModel

# SAM 2 for segmentation
sam2_model = build_sam2(CONFIG, CHECKPOINT, device=DEVICE, apply_postprocessing=False)
mask_generator = SAM2AutomaticMaskGenerator(sam2_model, points_per_side=32)

# CLIP for classification
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(DEVICE)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Define your label set
labels = ["car", "person", "tree", "building", "road", "sky", "sidewalk"]

image = np.array(Image.open("city_scene.jpg").convert("RGB"))
masks = mask_generator.generate(image)

labeled_segments = []
for mask_data in masks:
    seg = mask_data["segmentation"]
    bbox = mask_data["bbox"]  # [x, y, w, h]

    # Crop the region for classification (cast in case bbox values are floats)
    x, y, w, h = (int(v) for v in bbox)
    crop = image[y:y+h, x:x+w]
    if crop.size == 0:
        continue
    crop_pil = Image.fromarray(crop)

    inputs = clip_processor(
        text=labels,
        images=crop_pil,
        return_tensors="pt",
        padding=True,
    ).to(DEVICE)

    with torch.no_grad():
        outputs = clip_model(**inputs)
        probs = outputs.logits_per_image.softmax(dim=-1)[0]

    best_label_idx = probs.argmax().item()
    labeled_segments.append({
        "label": labels[best_label_idx],
        "confidence": probs[best_label_idx].item(),
        "mask": seg,
        "bbox": bbox,
    })

for seg in labeled_segments[:5]:
    print(f"{seg['label']}: {seg['confidence']:.2f} (area: {seg['mask'].sum()} px)")
```
This gives you a labeled segmentation map without training a single model on your data. Swap CLIP for a domain-specific classifier when you need higher accuracy on known classes.
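One cheap accuracy boost: tight SAM 2 crops strip away the context CLIP relies on. Padding each bbox before cropping often helps (the helper and its 15% default are my guesses, not tuned values):

```python
def expand_bbox(bbox, image_shape, pad=0.15):
    """Pad an XYWH bbox by `pad` of its size on each side, clipped to the image.

    Use the returned XYWH box in place of mask_data["bbox"] when cropping
    for CLIP, so the classifier sees some surrounding context.
    """
    x, y, w, h = (int(v) for v in bbox)
    dx, dy = int(w * pad), int(h * pad)
    img_h, img_w = image_shape[:2]
    x0, y0 = max(0, x - dx), max(0, y - dy)
    x1, y1 = min(img_w, x + w + dx), min(img_h, y + h + dy)
    return x0, y0, x1 - x0, y1 - y0
```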
## Common Errors and Fixes

### RuntimeError: No CUDA GPUs are available
SAM 2 requires a CUDA GPU. There is no CPU fallback for the model. If you are on a machine without a GPU, use Google Colab or a cloud instance with at least an NVIDIA T4.
### Hydra configuration error when loading the model
This usually means the config path is wrong. SAM 2 uses Hydra internally for config resolution. Make sure you pass the config name relative to the sam2/configs/ directory:
```python
# Correct -- relative config path
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"

# Also correct -- if you installed sam2 as a package, Hydra finds it automatically
model_cfg = "sam2.1/sam2.1_hiera_l"
```
If you get ConfigCompositionException, try adding the config search path:
If you get ConfigCompositionException, clear the stale Hydra state before building the model again:

```python
from hydra.core.global_hydra import GlobalHydra

# Drop any Hydra state left over from a previous initialization,
# then call build_sam2(...) again
GlobalHydra.instance().clear()
```
### torch.OutOfMemoryError with automatic mask generation
Automatic mask generation uses more memory than prompted segmentation because it processes many points in parallel. Reduce points_per_batch from 64 to 32, or lower points_per_side from 32 to 16. Using torch.autocast("cuda", dtype=torch.bfloat16) also helps significantly.
### Video predictor expects JPEG frames, not a video file
init_state(video_path=...) expects a directory of JPEG frames, not an MP4 or AVI file. Extract frames first using OpenCV or ffmpeg:
```bash
ffmpeg -i input.mp4 -q:v 2 -start_number 0 video_frames/%05d.jpg
```