The Short Version

SAM 2 (Segment Anything Model 2) takes an image and a prompt – a point click, a bounding box, or a mask – and returns a pixel-perfect segmentation mask of the object you pointed at. It works on basically anything: people, cars, furniture, animals, text, weird stuff it has never seen before.

Here is the fastest way to get a mask from an image:

import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load the model
checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Load your image
image = np.array(Image.open("photo.jpg"))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)

    # Click on the object you want — (x, y) coordinates
    point_coords = np.array([[500, 375]])
    point_labels = np.array([1])  # 1 = foreground, 0 = background

    masks, scores, logits = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        multimask_output=True,
    )

# masks.shape: (3, H, W) — three mask candidates ranked by score
best_mask = masks[np.argmax(scores)]

SAM 2 returns three mask candidates by default. Pick the one with the highest score and you are good.
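
If you want to eyeball all three candidates before committing, a quick overlay helps. A minimal sketch, assuming matplotlib is installed and that image, masks, and scores come from the snippet above:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, mask, score in zip(axes, masks, scores):
    ax.imshow(image)
    ax.imshow(mask, alpha=0.5, cmap="jet")  # translucent mask overlay
    ax.set_title(f"score = {score:.3f}")
    ax.axis("off")
plt.show()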

Installation

SAM 2 requires Python 3.10+, PyTorch 2.5.1+, and a CUDA GPU for reasonable speed. The current release is v1.1.0 with SAM 2.1 checkpoints.

# Clone and install
git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e .

# Download model checkpoints
cd checkpoints
./download_ckpts.sh
cd ..

If you prefer the Hugging Face route (no manual checkpoint downloads):

from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

This pulls the weights automatically. Convenient for quick experiments.

Pick Your Model Size

SAM 2.1 ships four checkpoints. The tradeoff is the usual speed vs. accuracy:

Model                    Parameters   Speed (FPS)   Best For
sam2.1_hiera_tiny        38.9M        91.2          Real-time apps, edge
sam2.1_hiera_small       46M          84.8          Balanced
sam2.1_hiera_base_plus   80.8M        64.1          General purpose
sam2.1_hiera_large       224.4M       39.5          Maximum accuracy

Start with hiera_large for quality, drop to hiera_tiny if you need speed.
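
Switching sizes only means swapping the checkpoint and config pair. A small helper sketch — the file names below match the repo's checkpoints/ and configs/sam2.1/ folders, but double-check them against your checkout:

MODEL_SIZES = {
    "tiny": ("./checkpoints/sam2.1_hiera_tiny.pt", "configs/sam2.1/sam2.1_hiera_t.yaml"),
    "small": ("./checkpoints/sam2.1_hiera_small.pt", "configs/sam2.1/sam2.1_hiera_s.yaml"),
    "base_plus": ("./checkpoints/sam2.1_hiera_base_plus.pt", "configs/sam2.1/sam2.1_hiera_b+.yaml"),
    "large": ("./checkpoints/sam2.1_hiera_large.pt", "configs/sam2.1/sam2.1_hiera_l.yaml"),
}

checkpoint, model_cfg = MODEL_SIZES["tiny"]
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))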

Segment with a Bounding Box

Point prompts work well for simple objects, but boxes give you tighter control. Draw a box around the target and SAM 2 figures out exactly which pixels belong to it.

import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("street.jpg"))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)

    # Bounding box: [x_min, y_min, x_max, y_max]
    input_box = np.array([75, 275, 1725, 850])

    masks, scores, logits = predictor.predict(
        box=input_box,
        multimask_output=False,  # single mask when using box prompt
    )

print(f"Mask shape: {masks.shape}")  # (1, H, W)
print(f"Confidence: {scores[0]:.3f}")

Set multimask_output=False with box prompts. The box already constrains the region enough that you typically want a single high-confidence mask.
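
If you have several objects in the same image, the predictor also accepts a batch of boxes in one call. A sketch modeled on the batched-prompt example in the official image predictor notebook — the box coordinates here are placeholders:

# Several boxes on the same image, shape (N, 4)
input_boxes = np.array([
    [75, 275, 1725, 850],
    [425, 600, 700, 875],
])

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=None,
        point_labels=None,
        box=input_boxes,
        multimask_output=False,
    )

print(masks.shape)  # (2, 1, H, W) — one mask per box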

Combine Points and Boxes

For tricky cases – like segmenting a person partially hidden behind a fence – stack a box prompt with foreground/background points to guide the model.

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)

    # Box around the person
    input_box = np.array([100, 50, 600, 900])

    # Foreground point on the person, background point on the fence
    point_coords = np.array([[350, 400], [250, 500]])
    point_labels = np.array([1, 0])  # 1=foreground, 0=background

    masks, scores, _ = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        box=input_box,
        multimask_output=False,
    )

Background points (label 0) tell the model “this is NOT part of the object.” That is incredibly useful for separating overlapping things.
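
You can also refine a mask iteratively by feeding the low-resolution logits from one prediction back in via the mask_input parameter, together with extra clicks. A minimal sketch, reusing the predictor and image from above; the click coordinates are placeholders:

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)

    # First pass: one foreground click, keep the logits
    masks, scores, logits = predictor.predict(
        point_coords=np.array([[350, 400]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
    best = np.argmax(scores)

    # Second pass: add a background click and pass the previous logits back in
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[350, 400], [250, 500]]),
        point_labels=np.array([1, 0]),
        mask_input=logits[best][None, :, :],  # low-res logits, shape (1, 256, 256)
        multimask_output=False,
    )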

Save the Mask as an Image

Once you have the mask, you probably want to do something with it – save it, apply it as a cutout, or overlay it on the original.

from PIL import Image
import numpy as np

def save_mask(mask, output_path):
    """Save a binary mask as a PNG."""
    mask_image = Image.fromarray((mask * 255).astype(np.uint8))
    mask_image.save(output_path)

def apply_mask_cutout(image_path, mask, output_path):
    """Cut out the segmented object with a transparent background."""
    img = Image.open(image_path).convert("RGBA")
    img_array = np.array(img)

    mask = mask.astype(bool)  # predict() can return float masks; boolean indexing needs bool
    mask_rgba = np.zeros((*mask.shape, 4), dtype=np.uint8)
    mask_rgba[mask] = img_array[mask]  # copy RGBA pixels inside the mask
    mask_rgba[mask, 3] = 255  # fully opaque where mask is True

    Image.fromarray(mask_rgba).save(output_path)

# Usage
save_mask(best_mask, "mask.png")
apply_mask_cutout("photo.jpg", best_mask, "cutout.png")

Using the Hugging Face Transformers API

If you already have transformers installed and want to skip cloning the SAM 2 repo, the Hugging Face integration works well:

import torch
import requests
from PIL import Image
from transformers import Sam2Processor, Sam2Model

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Sam2Model.from_pretrained("facebook/sam2-hiera-large").to(device)
processor = Sam2Processor.from_pretrained("facebook/sam2-hiera-large")

# Load an image
url = "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/truck.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Point prompt — note the nested list structure
input_points = [[[[500, 375]]]]
input_labels = [[[1]]]

inputs = processor(
    images=raw_image,
    input_points=input_points,
    input_labels=input_labels,
    return_tensors="pt",
).to(device)

with torch.no_grad():
    outputs = model(**inputs)

masks = processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"],
)[0]

print(f"Mask tensor shape: {masks.shape}")  # (1, 3, H, W)

The nested list structure for input_points looks weird – it is [batch, objects, points_per_object, xy]. You get used to it.
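
For example, prompting two separate objects in the same image with one click each (the second point here is a made-up coordinate) grows the nesting along the objects axis:

# One image, two objects, one point per object
input_points = [[[[500, 375]], [[650, 200]]]]  # [batch=1][objects=2][points=1][xy]
input_labels = [[[1], [1]]]                    # [batch=1][objects=2][points=1]

inputs = processor(
    images=raw_image,
    input_points=input_points,
    input_labels=input_labels,
    return_tensors="pt",
).to(device)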

Common Errors and Fixes

CUDA version mismatch during installation.

The detected CUDA version (11.5) mismatches the version that was used to compile PyTorch (12.1)

This happens when your system CUDA toolkit version differs from what PyTorch was built with. Two fixes:

# Option 1: Point to the right CUDA
export CUDA_HOME=/usr/local/cuda-12.1
pip install -e .

# Option 2: Skip building the CUDA extension entirely (the model still works)
SAM2_BUILD_CUDA=0 pip install -e .

SAM 2.1 made the custom CUDA extension optional; it is only used for some mask post-processing, so the model runs without it with minimal speed impact.

torch.OutOfMemoryError: CUDA out of memory.

The large model needs around 6-8GB of VRAM. If you are tight on memory:

# Use bfloat16 to halve memory usage (already shown above)
with torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(...)

# Or switch to a smaller model
checkpoint = "./checkpoints/sam2.1_hiera_tiny.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_t.yaml"

FileNotFoundError for checkpoints.

You forgot to download them. Run cd checkpoints && ./download_ckpts.sh, or just use from_pretrained() which handles downloads automatically.

SAM 2 vs. SAM 1

If you are coming from the original SAM, here is what changed:

  • Video support. SAM 2 tracks objects across video frames with a memory module. SAM 1 only did single images.
  • 6x faster on images. Same task, way less compute.
  • Better accuracy. Especially on occluded objects and fine details like hair or fur.
  • SAM 2.1 update. Improved handling of visually similar objects and better occlusion reasoning compared to the initial SAM 2.0 release.

The image prediction API is nearly identical, so migrating old SAM code is straightforward – swap the imports and checkpoint paths.
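
As a rough before/after — assuming your old code used the original segment_anything package with the ViT-H checkpoint — the predictor swap looks like this; the set_image() and predict() calls stay the same:

# SAM 1
from segment_anything import sam_model_registry, SamPredictor
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM 2
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
predictor = SAM2ImagePredictor(
    build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml", "./checkpoints/sam2.1_hiera_large.pt")
)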

When to Use SAM 2

SAM 2 is a zero-shot segmentation model. It segments anything without training on your specific objects. That makes it perfect for:

  • Interactive annotation tools – let users click to segment, then export masks for training specialized models
  • Background removal – segment the subject, invert the mask, done
  • Object counting – pair with a detector like YOLOv8 for bounding boxes, feed those to SAM 2 for precise masks (sketched after this list)
  • Medical imaging prototyping – works surprisingly well on X-rays and microscopy images out of the box
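
Here is a sketch of that detector-to-SAM-2 handoff for object counting, assuming the ultralytics package for YOLOv8 and the predictor from earlier; the model file yolov8n.pt and image path are placeholders:

import numpy as np
import torch
from PIL import Image
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")
results = detector("street.jpg")
boxes = results[0].boxes.xyxy.cpu().numpy()  # (N, 4) boxes as [x_min, y_min, x_max, y_max]

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(np.array(Image.open("street.jpg")))
    masks, scores, _ = predictor.predict(
        box=boxes,
        multimask_output=False,
    )

print(f"Detected and segmented {len(boxes)} objects")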

For production pipelines where you need consistent results on a fixed set of object types, you will get better performance by fine-tuning SAM 2 on your specific data. The base model is general-purpose by design, so it trades domain-specific accuracy for breadth.