Get a Depth Map in Three Lines
Depth Anything V2 predicts per-pixel depth from a single RGB image. No stereo cameras, no LiDAR, no multi-view setup. You feed it one photo and it returns a dense depth map. The model was introduced in a NeurIPS 2024 paper and is fully integrated into Hugging Face Transformers, so you don’t need to clone any repos or download checkpoints manually.
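A minimal sketch, assuming the Small checkpoint from the table below; the input is a synthesized placeholder image so the snippet runs standalone – point it at any real photo instead:

```python
from PIL import Image
from transformers import pipeline

# Stand-in for a real photo so the snippet is self-contained
Image.new("RGB", (640, 480), "gray").save("photo.jpg")

# The three lines: build the pipeline, run it, grab the depth map
pipe = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")
depth = pipe("photo.jpg")["depth"]
depth.save("depth.png")
```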
That depth object is a PIL Image – a grayscale map where brighter pixels are closer and darker pixels are farther. The pipeline handles image resizing, normalization, inference, and interpolation back to original dimensions automatically.
Install the Dependencies
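All you need is Transformers, PyTorch, and Pillow:

```shell
pip install transformers torch pillow
```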
If you have a CUDA GPU, install the CUDA-enabled PyTorch build for a significant speedup:
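The exact index URL depends on your CUDA version; `cu121` below is an example for CUDA 12.1 – check pytorch.org for the one matching your driver:

```shell
pip install torch --index-url https://download.pytorch.org/whl/cu121
```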
The Small model downloads about 100MB on first run. The Large model pulls around 1.3GB. Both cache in ~/.cache/huggingface/hub/ so subsequent loads are instant.
Full Control with AutoModel
The pipeline is convenient, but when you need the raw depth tensor (for 3D reconstruction, point clouds, or custom post-processing), use the model and processor directly:
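A sketch of the direct route, using the Small checkpoint and a synthetic gradient image as a stand-in for a real photo:

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

model_id = "depth-anything/Depth-Anything-V2-Small-hf"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForDepthEstimation.from_pretrained(model_id)

# Horizontal-gradient stand-in for a real photo
gradient = np.tile(np.linspace(0, 255, 640, dtype=np.uint8), (480, 1))
image = Image.fromarray(np.stack([gradient] * 3, axis=-1))

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# predicted_depth comes back at the model's working resolution;
# interpolate to the original (height, width) for a per-pixel map
depth = torch.nn.functional.interpolate(
    outputs.predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
).squeeze()

# Normalize to 0-255 for visualization (skip this to keep raw relative depth)
d = depth.numpy()
vis = (255 * (d - d.min()) / (d.max() - d.min())).astype("uint8")
Image.fromarray(vis).save("depth.png")
```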
The predicted_depth tensor holds relative depth values (not meters). Higher values mean closer to the camera. You can skip normalization if you’re feeding this into a downstream pipeline that expects raw disparity values.
Pick the Right Model Size
Depth Anything V2 ships four variants built on DINOv2 encoders:
| Model | Params | HF Model ID | License |
|---|---|---|---|
| Small | 25M | depth-anything/Depth-Anything-V2-Small-hf | Apache 2.0 |
| Base | 97M | depth-anything/Depth-Anything-V2-Base-hf | CC-BY-NC-4.0 |
| Large | 335M | depth-anything/Depth-Anything-V2-Large-hf | CC-BY-NC-4.0 |
| Giant | 1.3B | depth-anything/Depth-Anything-V2-Giant-hf | CC-BY-NC-4.0 |
Start with Small for prototyping – it runs comfortably on a CPU and gives surprisingly good results. Use Large for production quality. The Giant model needs a beefy GPU (16GB+ VRAM) but produces the sharpest edges and finest details.
Note the licensing split: Small is Apache 2.0 (use it commercially without restrictions), while Base/Large/Giant are CC-BY-NC-4.0 (non-commercial only).
Relative vs. Metric Depth
The default models output relative depth – they tell you which pixels are closer or farther, but not the actual distance in meters. This is perfect for visual effects, background blur, or depth-based compositing.
If you need real-world distances, use the metric depth models. These are fine-tuned separately for indoor and outdoor scenes because the depth ranges differ dramatically (a living room tops out around 10 meters; an outdoor scene can stretch to 80):
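A sketch using the indoor Small variant (the model ID follows the pattern described below); the synthetic image is just a placeholder for a real indoor photo:

```python
import numpy as np
from PIL import Image
from transformers import pipeline

# Indoor metric model – outputs approximate distances in meters
pipe = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Metric-Indoor-Small-hf",
)

# Placeholder for a real indoor photo
gradient = np.tile(np.linspace(0, 255, 640, dtype=np.uint8), (480, 1))
image = Image.fromarray(np.stack([gradient] * 3, axis=-1))

result = pipe(image)
depth_m = result["predicted_depth"]  # tensor of metric depth values
print(float(depth_m.min()), float(depth_m.max()))
```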
Available metric model IDs follow the pattern Depth-Anything-V2-Metric-{Indoor,Outdoor}-{Small,Base,Large}-hf. Don’t mix them up – using the indoor model on a landscape photo will produce garbage depth values.
Run on GPU
Move the model to CUDA for a 10-20x speedup:
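A sketch with the Small checkpoint (the Large ID follows the same pattern); it falls back to CPU when no GPU is present, and uses a synthetic image as a stand-in:

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

model_id = "depth-anything/Depth-Anything-V2-Small-hf"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForDepthEstimation.from_pretrained(model_id).to(device)

gradient = np.tile(np.linspace(0, 255, 640, dtype=np.uint8), (480, 1))
image = Image.fromarray(np.stack([gradient] * 3, axis=-1))

# Inputs must live on the same device as the model
inputs = processor(images=image, return_tensors="pt").to(device)
with torch.no_grad():
    depth = model(**inputs).predicted_depth
```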
The Large model uses about 1.5GB of VRAM. If you’re tight on memory, use torch.float16:
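A sketch of half-precision loading, shown with the Small checkpoint for brevity (the same call works for Large); it only switches to float16 when a GPU is actually available, since fp16 inference on CPU is slow or unsupported:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

model_id = "depth-anything/Depth-Anything-V2-Small-hf"  # same call for Large
use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
dtype = torch.float16 if use_cuda else torch.float32

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForDepthEstimation.from_pretrained(model_id, torch_dtype=dtype).to(device)

image = Image.new("RGB", (640, 480), "gray")
# Inputs must match the model's dtype, or the forward pass will fail
inputs = processor(images=image, return_tensors="pt").to(device).to(dtype)
with torch.no_grad():
    depth = model(**inputs).predicted_depth
```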
Common Errors and Fixes
RuntimeError: Expected all tensors to be on the same device
You moved the model to CUDA but forgot to move the inputs. Both need to be on the same device:
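A sketch of the fix – move the processed inputs to `model.device` before the forward pass (synthetic image as a placeholder):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

model_id = "depth-anything/Depth-Anything-V2-Small-hf"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForDepthEstimation.from_pretrained(model_id).to(device)

image = Image.new("RGB", (640, 480), "gray")
inputs = processor(images=image, return_tensors="pt")

# The fix: put the inputs on the same device as the model
inputs = inputs.to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
```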
KeyError: 'depth' when using the pipeline
This happens on older transformers versions where the pipeline returns a different key. Update transformers:
You need transformers >= 4.42.0 for Depth Anything V2 support.
OutOfMemoryError: CUDA out of memory
Reduce image resolution before inference. The model reshapes inputs internally, but very high-res images (4K+) still consume significant VRAM during the interpolation step:
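One way to cap resolution is Pillow's `thumbnail`, which resizes in place and preserves aspect ratio; the 4K stand-in image here is synthetic, and 1024 px is an assumed long-side budget – tune it to your VRAM:

```python
from PIL import Image

# Stand-in for a 4K photo
Image.new("RGB", (3840, 2160), "gray").save("photo.jpg")

image = Image.open("photo.jpg")
# Cap the long side at 1024 px before inference; aspect ratio is preserved
image.thumbnail((1024, 1024), Image.LANCZOS)
```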
Or switch to a smaller model variant. Going from Large (335M) to Small (25M) cuts VRAM usage by roughly 5x.
No module named 'depth_anything_v2'
You’re trying to use the standalone repo instead of the Hugging Face integration. If you want the Transformers path (recommended), you don’t need to install depth_anything_v2 at all – just pip install transformers torch.
How It Works Under the Hood
Depth Anything V2 pairs a DINOv2 vision transformer encoder with a DPT (Dense Prediction Transformer) decoder. The encoder extracts multi-scale features from the input image; the decoder fuses those features into a dense depth prediction at the original resolution.
The key architectural insight from V2 is the training pipeline. V1 trained on labeled real images, which are inherently noisy. V2 instead trains a large teacher model on high-quality synthetic data (where ground-truth depth is perfect), then uses that teacher to pseudo-label 62 million real images. The student models learn from these pseudo-labels, getting the best of both worlds: synthetic-quality supervision with real-world visual diversity.
This is why V2 produces noticeably sharper edges and fewer artifacts around object boundaries compared to V1, especially on reflective surfaces and thin structures like fences and poles.
Related Guides
- How to Classify Images with Vision Transformers in PyTorch
- How to Extract Text from Images with Vision LLMs
- How to Detect Anomalies in Images with Vision Models
- How to Segment Images with SAM 2 in Python
- How to Detect Objects in Images with YOLOv8
- How to Build Semantic Segmentation with Segment Anything and SAM 2
- How to Upscale and Enhance Images with AI Super Resolution
- How to Build Multi-Object Tracking with DeepSORT and YOLOv8
- How to Build a Document Comparison Pipeline with Vision Models
- How to Build a Lane Detection Pipeline with OpenCV and YOLO