## The Short Version
SAM 2 (Segment Anything Model 2) takes an image and a prompt – a point click, a bounding box, or a mask – and returns a pixel-perfect segmentation mask of the object you pointed at. It works on basically anything: people, cars, furniture, animals, text, weird stuff it has never seen before.
Here is the fastest way to get a mask from an image:
```python
import numpy as np
import torch
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load the model
checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Load your image
image = np.array(Image.open("photo.jpg"))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)

    # Click on the object you want - (x, y) coordinates
    point_coords = np.array([[500, 375]])
    point_labels = np.array([1])  # 1 = foreground, 0 = background

    masks, scores, logits = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        multimask_output=True,
    )

# masks.shape: (3, H, W) - three mask candidates, each with a confidence score
best_mask = masks[np.argmax(scores)]
```
SAM 2 returns three mask candidates by default. Pick the one with the highest score and you are good.
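If you want to see what the three candidates look like before committing to one, a short loop over masks and scores does the job. This sketch reuses the variables from the snippet above; the saved filenames are just illustrative:

```python
# Inspect all three candidates - useful when the top-scoring mask
# captures a part of the object rather than the whole thing.
for i, (mask, score) in enumerate(zip(masks, scores)):
    coverage = mask.sum() / mask.size  # fraction of the image covered by the mask
    print(f"candidate {i}: score={score:.3f}, coverage={coverage:.1%}")
    Image.fromarray((mask * 255).astype(np.uint8)).save(f"candidate_{i}.png")
```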
## Installation
SAM 2 requires Python 3.10+, PyTorch 2.5.1+, and a CUDA GPU for reasonable speed. The current release is v1.1.0 with SAM 2.1 checkpoints.
```bash
# Clone and install
git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e .

# Download model checkpoints
cd checkpoints
./download_ckpts.sh
cd ..
```
If you prefer the Hugging Face route (no manual checkpoint downloads):
```python
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
```
This pulls the weights automatically. Convenient for quick experiments.
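The object you get back is the same SAM2ImagePredictor as in the quick-start, so the rest of the workflow is unchanged. A minimal sketch, reusing photo.jpg and the single-point prompt from above:

```python
import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Weights are fetched from the Hugging Face Hub on first use and cached locally
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = np.array(Image.open("photo.jpg"))
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
```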
## Pick Your Model Size
SAM 2.1 ships four checkpoints. The tradeoff is the usual speed vs. accuracy:
| Model | Parameters | Speed (FPS) | Best For |
|---|---|---|---|
| sam2.1_hiera_tiny | 38.9M | 91.2 | Real-time apps, edge |
| sam2.1_hiera_small | 46M | 84.8 | Balanced |
| sam2.1_hiera_base_plus | 80.8M | 64.1 | General purpose |
| sam2.1_hiera_large | 224.4M | 39.5 | Maximum accuracy |
Start with hiera_large for quality, drop to hiera_tiny if you need speed.
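Switching sizes only means swapping the checkpoint and config paths. A small lookup table keeps that tidy; the filenames below follow the repo's checkpoints/ and configs/sam2.1/ layout, but double-check them against your clone (the base_plus config in particular uses "b+" in its name):

```python
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# (checkpoint, config) pairs for the four SAM 2.1 sizes
MODEL_VARIANTS = {
    "tiny":      ("./checkpoints/sam2.1_hiera_tiny.pt",      "configs/sam2.1/sam2.1_hiera_t.yaml"),
    "small":     ("./checkpoints/sam2.1_hiera_small.pt",     "configs/sam2.1/sam2.1_hiera_s.yaml"),
    "base_plus": ("./checkpoints/sam2.1_hiera_base_plus.pt", "configs/sam2.1/sam2.1_hiera_b+.yaml"),
    "large":     ("./checkpoints/sam2.1_hiera_large.pt",     "configs/sam2.1/sam2.1_hiera_l.yaml"),
}

checkpoint, model_cfg = MODEL_VARIANTS["tiny"]
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
```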
## Segment with a Bounding Box
Point prompts work well for simple objects, but boxes give you tighter control. Draw a box around the target and SAM 2 figures out exactly which pixels belong to it.
```python
import numpy as np
import torch
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("street.jpg"))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)

    # Bounding box: [x_min, y_min, x_max, y_max]
    input_box = np.array([75, 275, 1725, 850])

    masks, scores, logits = predictor.predict(
        box=input_box,
        multimask_output=False,  # single mask when using a box prompt
    )

print(f"Mask shape: {masks.shape}")   # (1, H, W)
print(f"Confidence: {scores[0]:.3f}")
```
Set multimask_output=False with box prompts. The box already constrains the region enough that you typically want a single high-confidence mask.
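If you need one mask per object for several objects in the same image, predict() can also take a batch of boxes in a single call; this mirrors the batched-prompt example in the SAM 2 notebooks. A sketch along those lines (the box coordinates are made up, and the exact output shape is worth verifying on your install):

```python
# Several boxes at once - one mask comes back per box
input_boxes = np.array([
    [75, 275, 1725, 850],   # e.g. a car
    [425, 600, 700, 875],   # e.g. a wheel
])

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=None,
        point_labels=None,
        box=input_boxes,
        multimask_output=False,
    )

print(masks.shape)  # typically (num_boxes, 1, H, W) when multiple boxes are passed
```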
## Combine Points and Boxes
For tricky cases – like segmenting a person partially hidden behind a fence – stack a box prompt with foreground/background points to guide the model.
```python
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)

    # Box around the person
    input_box = np.array([100, 50, 600, 900])

    # Foreground point on the person, background point on the fence
    point_coords = np.array([[350, 400], [250, 500]])
    point_labels = np.array([1, 0])  # 1 = foreground, 0 = background

    masks, scores, _ = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        box=input_box,
        multimask_output=False,
    )
```
Background points (label 0) tell the model “this is NOT part of the object.” That is incredibly useful for separating overlapping things.
## Save the Mask as an Image
Once you have the mask, you probably want to do something with it – save it, apply it as a cutout, or overlay it on the original.
```python
from PIL import Image
import numpy as np

def save_mask(mask, output_path):
    """Save a binary mask as a PNG."""
    mask_image = Image.fromarray((mask * 255).astype(np.uint8))
    mask_image.save(output_path)

def apply_mask_cutout(image_path, mask, output_path):
    """Cut out the segmented object with a transparent background."""
    img = Image.open(image_path).convert("RGBA")
    img_array = np.array(img)
    mask = mask.astype(bool)  # boolean indexing needs a bool mask, not a float one
    mask_rgba = np.zeros((*mask.shape, 4), dtype=np.uint8)
    mask_rgba[mask] = img_array[mask]
    mask_rgba[mask, 3] = 255  # fully opaque where mask is True
    Image.fromarray(mask_rgba).save(output_path)

# Usage
save_mask(best_mask, "mask.png")
apply_mask_cutout("photo.jpg", best_mask, "cutout.png")
```
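The third option mentioned above, overlaying the mask on the original, is not covered by the helpers yet. A minimal overlay sketch using only PIL and NumPy; the color and opacity are arbitrary choices:

```python
import numpy as np
from PIL import Image

def overlay_mask(image_path, mask, output_path, color=(30, 144, 255), alpha=0.5):
    """Blend a colored, semi-transparent mask over the original image."""
    img = np.array(Image.open(image_path).convert("RGB")).astype(np.float32)
    mask = mask.astype(bool)
    blended = img.copy()
    blended[mask] = (1 - alpha) * img[mask] + alpha * np.array(color, dtype=np.float32)
    Image.fromarray(blended.astype(np.uint8)).save(output_path)

overlay_mask("photo.jpg", best_mask, "overlay.png")
```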
If you already have transformers installed and want to skip cloning the SAM 2 repo, the Hugging Face integration works well:
```python
import torch
import requests
from PIL import Image
from transformers import Sam2Processor, Sam2Model

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Sam2Model.from_pretrained("facebook/sam2-hiera-large").to(device)
processor = Sam2Processor.from_pretrained("facebook/sam2-hiera-large")

# Load an image
url = "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/truck.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Point prompt - note the nested list structure
input_points = [[[[500, 375]]]]
input_labels = [[[1]]]

inputs = processor(
    images=raw_image,
    input_points=input_points,
    input_labels=input_labels,
    return_tensors="pt",
).to(device)

with torch.no_grad():
    outputs = model(**inputs)

masks = processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"],
)[0]

print(f"Mask tensor shape: {masks.shape}")  # (1, 3, H, W)
```
The nested list structure for input_points looks weird – it is [batch, objects, points_per_object, xy]. You get used to it.
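For example, two clicks on the same object in a single image nest like this (same structure as stated above, just more points at the innermost level):

```python
# 1 image in the batch, 1 object, 2 foreground points, each an (x, y) pair
input_points = [[[[500, 375], [520, 400]]]]
# labels mirror the nesting minus the final xy dimension
input_labels = [[[1, 1]]]
```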
## Common Errors and Fixes
**CUDA version mismatch during installation.**
```
The detected CUDA version (11.5) mismatches the version that was used to compile PyTorch (12.1)
```
This happens when your system CUDA toolkit version differs from what PyTorch was built with. Two fixes:
```bash
# Option 1: Point CUDA_HOME at the CUDA version PyTorch was built with
export CUDA_HOME=/usr/local/cuda-12.1
pip install -e .

# Option 2: Skip building the CUDA extension entirely (the model still works fine)
SAM2_BUILD_CUDA=0 pip install -e .
```
SAM 2.1 made the custom CUDA kernels optional, so the model runs without them with minimal speed impact.
**torch.OutOfMemoryError: CUDA out of memory.**
The large model needs around 6-8GB of VRAM. If you are tight on memory:
```python
# Use bfloat16 to halve memory usage (already shown above)
with torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(...)

# Or switch to a smaller model
checkpoint = "./checkpoints/sam2.1_hiera_tiny.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_t.yaml"
```
**FileNotFoundError for checkpoints.**
You forgot to download them. Run cd checkpoints && ./download_ckpts.sh, or just use from_pretrained() which handles downloads automatically.
## SAM 2 vs. SAM 1
If you are coming from the original SAM, here is what changed:
- Video support. SAM 2 tracks objects across video frames with a memory module. SAM 1 only did single images.
- 6x faster on images. Same task, way less compute.
- Better accuracy. Especially on occluded objects and fine details like hair or fur.
- SAM 2.1 update. Improved handling of visually similar objects and better occlusion reasoning compared to the initial SAM 2.0 release.
The image prediction API is nearly identical, so migrating old SAM code is straightforward – swap the imports and checkpoint paths.
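As a rough before/after sketch of that swap, assuming the original segment_anything package and its ViT-H checkpoint on the SAM 1 side (everything after set_image stays the same):

```python
# SAM 1 (segment_anything)
from segment_anything import sam_model_registry, SamPredictor
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM 2 (sam2) - the predict() calls that follow are unchanged
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
predictor = SAM2ImagePredictor(
    build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml", "./checkpoints/sam2.1_hiera_large.pt")
)
```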
## When to Use SAM 2
SAM 2 is a zero-shot segmentation model. It segments anything without training on your specific objects. That makes it perfect for:
- Interactive annotation tools – let users click to segment, then export masks for training specialized models
- Background removal – segment the subject, invert the mask, done
- Object counting – pair with a detector like YOLOv8 for bounding boxes, feed those to SAM 2 for precise masks (see the sketch after this list)
- Medical imaging prototyping – works surprisingly well on X-rays and microscopy images out of the box
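For the detector-plus-SAM-2 combination mentioned above, the glue code is short. This sketch assumes the ultralytics package for YOLO and reuses the predictor from earlier sections; the model name and image path are illustrative:

```python
import numpy as np
import torch
from PIL import Image
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # any detector that outputs xyxy boxes works
image = np.array(Image.open("street.jpg"))
boxes = detector(image)[0].boxes.xyxy.cpu().numpy()  # (N, 4) in xyxy format

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    object_masks = []
    for box in boxes:
        mask, score, _ = predictor.predict(box=box, multimask_output=False)
        object_masks.append(mask[0])  # (H, W) mask for this detection

print(f"Counted {len(object_masks)} objects with pixel-level masks")
```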
For production pipelines where you need consistent results on a fixed set of object types, you will get better performance by fine-tuning SAM 2 on your specific data. The base model is general-purpose by design, so it trades domain-specific accuracy for breadth.