The Short Version
SAM 2 (Segment Anything Model 2) takes an image and a prompt – a point click, a bounding box, or a mask – and returns a pixel-perfect segmentation mask of the object you pointed at. It works on basically anything: people, cars, furniture, animals, text, weird stuff it has never seen before.
Here is the fastest way to get a mask from an image:
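A minimal point-prompt sketch using the official sam2 package. The checkpoint and config paths assume the repo layout from the Installation section below, and photo.jpg and the click at (500, 375) are placeholders for your own image and coordinates:

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Paths assume you cloned the repo and ran download_ckpts.sh (see Installation)
predictor = SAM2ImagePredictor(
    build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml",
               "./checkpoints/sam2.1_hiera_large.pt")
)

image = np.array(Image.open("photo.jpg").convert("RGB"))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    # One foreground click at pixel (x=500, y=375); label 1 = foreground
    masks, scores, logits = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )

best_mask = masks[np.argmax(scores)]  # (H, W) mask for the highest-scoring candidate
```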
SAM 2 returns three mask candidates by default. Pick the one with the highest score and you are good.
Installation
SAM 2 requires Python 3.10+, PyTorch 2.5.1+, and a CUDA GPU for reasonable speed. The current release is v1.1.0 with SAM 2.1 checkpoints.
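Install from source from the official facebookresearch/sam2 repo, then fetch the checkpoints:

```shell
# Clone and install the package (Python 3.10+ and PyTorch must already be installed)
git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e .

# Download the SAM 2.1 checkpoints
cd checkpoints
./download_ckpts.sh
cd ..
```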
If you prefer the Hugging Face route (no manual checkpoint downloads):
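A sketch using the from_pretrained helper on SAM2ImagePredictor, which resolves the facebook/sam2.1-hiera-large repo on the Hugging Face Hub:

```python
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Downloads and caches the checkpoint from the Hugging Face Hub on first use
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2.1-hiera-large")
```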
This pulls the weights automatically. Convenient for quick experiments.
Pick Your Model Size
SAM 2.1 ships four checkpoints. The tradeoff is the usual speed vs. accuracy:
| Model | Parameters | Speed (FPS) | Best For |
|---|---|---|---|
| sam2.1_hiera_tiny | 38.9M | 91.2 | Real-time apps, edge |
| sam2.1_hiera_small | 46M | 84.8 | Balanced |
| sam2.1_hiera_base_plus | 80.8M | 64.1 | General purpose |
| sam2.1_hiera_large | 224.4M | 39.5 | Maximum accuracy |
Start with hiera_large for quality, drop to hiera_tiny if you need speed.
Segment with a Bounding Box
Point prompts work well for simple objects, but boxes give you tighter control. Draw a box around the target and SAM 2 figures out exactly which pixels belong to it.
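A box-prompt sketch with the same assumed paths as above; the box coordinates are hypothetical and should be replaced with your own:

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(
    build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml",
               "./checkpoints/sam2.1_hiera_large.pt")
)
image = np.array(Image.open("photo.jpg").convert("RGB"))

# Box in XYXY pixel format: (x_min, y_min, x_max, y_max)
box = np.array([425, 600, 700, 875])

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)

mask = masks[0]  # single (H, W) mask
```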
Set multimask_output=False with box prompts. The box already constrains the region enough that you typically want a single high-confidence mask.
Combine Points and Boxes
For tricky cases – like segmenting a person partially hidden behind a fence – stack a box prompt with foreground/background points to guide the model.
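A combined-prompt sketch for the person-behind-a-fence case. The coordinates are hypothetical: a box around the whole person, one foreground click on the person, and one background click on a fence slat:

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(
    build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml",
               "./checkpoints/sam2.1_hiera_large.pt")
)
image = np.array(Image.open("photo.jpg").convert("RGB"))

box = np.array([100, 50, 450, 600])            # box around the person
point_coords = np.array([[275, 200],            # click on the person
                         [275, 320]])           # click on the fence
point_labels = np.array([1, 0])                 # 1 = foreground, 0 = background

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        box=box,
        multimask_output=False,
    )
```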
Background points (label 0) tell the model “this is NOT part of the object.” That is incredibly useful for separating overlapping things.
Save the Mask as an Image
Once you have the mask, you probably want to do something with it – save it, apply it as a cutout, or overlay it on the original.
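Pure NumPy/Pillow post-processing covering all three cases. The image and mask below are synthetic stand-ins for the arrays you would get from the predictor:

```python
import numpy as np
from PIL import Image

# Stand-ins: `image` is the original RGB array, `mask` the (H, W) boolean
# mask returned by the predictor.
image = np.array(Image.new("RGB", (640, 480), (120, 160, 200)))
mask = np.zeros((480, 640), dtype=bool)
mask[100:300, 200:500] = True

# 1. Save the raw mask as a black-and-white PNG
Image.fromarray((mask * 255).astype(np.uint8)).save("mask.png")

# 2. Cut the object out onto a transparent background (mask becomes the alpha channel)
cutout = np.dstack([image, (mask * 255).astype(np.uint8)])
Image.fromarray(cutout, mode="RGBA").save("cutout.png")

# 3. Overlay the mask on the original at 50% opacity
overlay = image.copy()
highlight = np.array([30, 144, 255])  # arbitrary highlight color
overlay[mask] = (0.5 * overlay[mask] + 0.5 * highlight).astype(np.uint8)
Image.fromarray(overlay).save("overlay.png")
```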
Using the Hugging Face Transformers API
If you already have transformers installed and want to skip cloning the SAM 2 repo, the Hugging Face integration works well:
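A sketch assuming the Sam2Model and Sam2Processor classes shipped in recent transformers releases together with the facebook/sam2.1-hiera-large Hub checkpoint; exact class names and post-processing helpers may vary by version, so double-check the transformers docs:

```python
import torch
from PIL import Image
from transformers import Sam2Model, Sam2Processor  # assumed names; recent transformers only

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Sam2Model.from_pretrained("facebook/sam2.1-hiera-large").to(device)
processor = Sam2Processor.from_pretrained("facebook/sam2.1-hiera-large")

image = Image.open("photo.jpg").convert("RGB")
# Nested shape: [batch, objects, points_per_object, xy] - one image, one object, one click
input_points = [[[[500, 375]]]]

inputs = processor(images=image, input_points=input_points, return_tensors="pt").to(device)
with torch.inference_mode():
    outputs = model(**inputs)

# outputs.pred_masks holds low-resolution mask logits; use the processor's
# post_process_masks helper to upscale them back to the original image size.
```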
The nested list structure for input_points looks weird – it is [batch, objects, points_per_object, xy]. You get used to it.
Common Errors and Fixes
CUDA version mismatch during installation.
This happens when your system CUDA toolkit version differs from what PyTorch was built with. Two fixes:
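Roughly, the two options look like this; the cu121 index URL is an example (match it to your CUDA toolkit), and the SAM2_BUILD_CUDA flag comes from the repo's install notes:

```shell
# Fix 1: reinstall PyTorch wheels built against your CUDA toolkit
# (swap cu121 for the tag matching your CUDA version)
pip install --force-reinstall torch torchvision \
    --index-url https://download.pytorch.org/whl/cu121

# Fix 2: skip building the optional CUDA extension entirely
SAM2_BUILD_CUDA=0 pip install -e .
```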
SAM 2.1 made the custom CUDA kernels optional, so the model runs without them with minimal speed impact.
torch.OutOfMemoryError: CUDA out of memory.
The large model needs around 6-8GB of VRAM. If you are tight on memory:
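A few memory-saving levers, sketched with the same assumed paths and a placeholder photo.jpg: drop to a smaller checkpoint, shrink oversized inputs, run in bfloat16, and release cached allocations between images:

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# 1. Use a smaller checkpoint (base_plus or tiny instead of large)
predictor = SAM2ImagePredictor(
    build_sam2("configs/sam2.1/sam2.1_hiera_b+.yaml",
               "./checkpoints/sam2.1_hiera_base_plus.pt")
)

# 2. Downscale very large inputs before segmenting
image = Image.open("photo.jpg").convert("RGB")
image.thumbnail((1024, 1024))  # resizes in place, keeps aspect ratio

# 3. Run in bfloat16 to roughly halve activation memory
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(np.array(image))
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[400, 300]]),
        point_labels=np.array([1]),
    )

# 4. Free cached allocations between images
torch.cuda.empty_cache()
```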
FileNotFoundError for checkpoints.
You forgot to download them. Run cd checkpoints && ./download_ckpts.sh, or just use from_pretrained() which handles downloads automatically.
SAM 2 vs. SAM 1
If you are coming from the original SAM, here is what changed:
- Video support. SAM 2 tracks objects across video frames with a memory module. SAM 1 only did single images.
- 6x faster on images. Same task, way less compute.
- Better accuracy. Especially on occluded objects and fine details like hair or fur.
- SAM 2.1 update. Improved handling of visually similar objects and better occlusion reasoning compared to the initial SAM 2.0 release.
The image prediction API is nearly identical, so migrating old SAM code is straightforward – swap the imports and checkpoint paths.
When to Use SAM 2
SAM 2 is a zero-shot segmentation model. It segments anything without training on your specific objects. That makes it perfect for:
- Interactive annotation tools – let users click to segment, then export masks for training specialized models
- Background removal – segment the subject, invert the mask, done
- Object counting – pair with a detector like YOLOv8 for bounding boxes, feed those to SAM 2 for precise masks
- Medical imaging prototyping – works surprisingly well on X-rays and microscopy images out of the box
For production pipelines where you need consistent results on a fixed set of object types, you will get better performance by fine-tuning SAM 2 on your specific data. The base model is general-purpose by design, so it trades domain-specific accuracy for breadth.
Related Guides
- How to Build Semantic Segmentation with Segment Anything and SAM 2
- How to Detect Objects in Images with YOLOv8
- How to Upscale and Enhance Images with AI Super Resolution
- How to Build a Lane Detection Pipeline with OpenCV and YOLO
- How to Build Video Analytics Pipelines with OpenCV and Deep Learning
- How to Build Real-Time Object Segmentation with SAM 2 and WebSocket
- How to Classify Images with Vision Transformers in PyTorch
- How to Build a Face Recognition System with InsightFace and Python
- How to Extract Text from Images with Vision LLMs
- How to Detect Anomalies in Images with Vision Models