Get a Depth Map in Three Lines

Depth Anything V2 predicts per-pixel depth from a single RGB image. No stereo cameras, no LiDAR, no multi-view setup. You feed it one photo, it returns a dense depth map. The model was introduced in a NeurIPS 2024 paper and is fully integrated into Hugging Face Transformers, so you don’t need to clone any repos or download checkpoints manually.

from transformers import pipeline
from PIL import Image
import requests

pipe = pipeline(task="depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

depth = pipe(image)["depth"]
depth.save("depth_output.png")

That depth object is a PIL Image – a grayscale map where brighter pixels are closer and darker pixels are farther. The pipeline handles image resizing, normalization, inference, and interpolation back to original dimensions automatically.
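If you want numbers rather than a picture, convert the result to a NumPy array. A minimal sketch – the synthetic gradient below stands in for the real pipeline output:

```python
import numpy as np
from PIL import Image

# Stand-in for the pipeline's grayscale output: a left-to-right gradient.
depth = Image.fromarray(
    np.tile(np.arange(256, dtype="uint8"), (64, 1)), mode="L"
)

arr = np.asarray(depth)   # shape (H, W), dtype uint8
near_mask = arr > 128     # brighter pixels = closer to the camera
print(arr.shape, near_mask.mean())
```

A thresholded mask like this is enough for simple foreground/background segmentation without any extra models.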

Install the Dependencies

pip install transformers torch pillow requests

If you have a CUDA GPU, install the CUDA-enabled PyTorch build for a significant speedup:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

The Small model downloads about 100MB on first run; the Large model pulls around 1.3GB. Both are cached in ~/.cache/huggingface/hub/, so subsequent loads skip the download.

Full Control with AutoModel

The pipeline is convenient, but when you need the raw depth tensor (for 3D reconstruction, point clouds, or custom post-processing), use the model and processor directly:

from transformers import AutoImageProcessor, AutoModelForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("depth-anything/Depth-Anything-V2-Large-hf")
model = AutoModelForDepthEstimation.from_pretrained("depth-anything/Depth-Anything-V2-Large-hf")

inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Resize prediction back to original image dimensions
post_processed = image_processor.post_process_depth_estimation(
    outputs,
    target_sizes=[(image.height, image.width)],
)

predicted_depth = post_processed[0]["predicted_depth"]  # torch.Tensor (H, W)

# Normalize to 0-255 for visualization
depth_normalized = (predicted_depth - predicted_depth.min()) / (predicted_depth.max() - predicted_depth.min())
depth_image = Image.fromarray((depth_normalized.cpu().numpy() * 255).astype("uint8"))
depth_image.save("depth_map.png")

The predicted_depth tensor holds relative depth values (not meters). Higher values mean closer to the camera. You can skip normalization if you’re feeding this into a downstream pipeline that expects raw disparity values.
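One caveat worth knowing: plain min-max normalization lets a single outlier pixel wash out the whole map. A percentile-clipped version (a common trick, not something the library provides) is more robust for visualization:

```python
import numpy as np

def normalize_depth(depth, lo_pct=2.0, hi_pct=98.0):
    """Clip to percentiles before scaling so one hot pixel can't wash out the map."""
    lo, hi = np.percentile(depth, [lo_pct, hi_pct])
    clipped = np.clip(depth, lo, hi)
    return (clipped - lo) / max(hi - lo, 1e-8)

# Synthetic disparity map with one extreme outlier pixel
d = np.random.default_rng(0).random((10, 10))
d[0, 0] = 1000.0
out = normalize_depth(d)
print(out.min(), out.max())
```

With min-max scaling, that single 1000.0 pixel would compress every other value into a nearly uniform dark band; the percentile clip ignores it.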

Pick the Right Model Size

Depth Anything V2 ships four variants built on DINOv2 encoders:

Model | Params | HF Model ID                               | License
------|--------|-------------------------------------------|-------------
Small | 25M    | depth-anything/Depth-Anything-V2-Small-hf | Apache 2.0
Base  | 97M    | depth-anything/Depth-Anything-V2-Base-hf  | CC-BY-NC-4.0
Large | 335M   | depth-anything/Depth-Anything-V2-Large-hf | CC-BY-NC-4.0
Giant | 1.3B   | depth-anything/Depth-Anything-V2-Giant-hf | CC-BY-NC-4.0

Start with Small for prototyping – it runs comfortably on a CPU and gives surprisingly good results. Use Large for production quality. The Giant model needs a beefy GPU (16GB+ VRAM) but produces the sharpest edges and finest details.

Note the licensing split: Small is Apache 2.0 (use it commercially without restrictions), while Base/Large/Giant are CC-BY-NC-4.0 (non-commercial only).

Relative vs. Metric Depth

The default models output relative depth – they tell you which pixels are closer or farther, but not the actual distance in meters. This is perfect for visual effects, background blur, or depth-based compositing.
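To make the compositing use case concrete, here is a minimal sketch of depth-based background blur: blend a sharp and a blurred copy of the photo, using the depth map as the mask. The inputs below are synthetic stand-ins; in practice photo is your image and depth is the pipeline's output:

```python
import numpy as np
from PIL import Image, ImageFilter

# Synthetic stand-ins: a noise "photo" and a left-to-right depth gradient
rng = np.random.default_rng(0)
photo = Image.fromarray(rng.integers(0, 255, (120, 160, 3), dtype=np.uint8))
depth = Image.fromarray(
    np.tile(np.linspace(0, 255, 160), (120, 1)).astype(np.uint8), mode="L"
)

blurred = photo.filter(ImageFilter.GaussianBlur(radius=6))
# Bright (near) pixels keep the sharp photo, dark (far) pixels get the blur
portrait = Image.composite(photo, blurred, mask=depth)
portrait.save("portrait_mode.png")
```

Because relative depth is only ordinal, this works without any calibration – the mask just needs to rank pixels by distance, which is exactly what the model provides.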

If you need real-world distances, use the metric depth models. These are fine-tuned separately for indoor and outdoor scenes because the depth ranges differ dramatically (a living room tops out around 10 meters; an outdoor scene can stretch to 80):

# Metric depth for outdoor scenes (max ~80 meters)
pipe = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Metric-Outdoor-Large-hf",
)

result = pipe(image)
depth_meters = result["predicted_depth"]  # torch.Tensor of distances in meters
depth_image = result["depth"]             # normalized PIL image, for visualization only

Available metric model IDs follow the pattern Depth-Anything-V2-Metric-{Indoor,Outdoor}-{Small,Base,Large}-hf. Don’t mix them up – using the indoor model on a landscape photo will produce garbage depth values.
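Once you have metric depth, back-projecting pixels into a 3D point cloud is a few lines of pinhole-camera math. A minimal sketch – the intrinsics below (fx, fy, cx, cy) are made-up toy values; use your camera's calibration in practice:

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (meters) into an (H*W, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy 4x4 depth map with everything 2 meters away
points = depth_to_points(np.full((4, 4), 2.0), fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(points.shape)  # (16, 3)
```

Note this only makes geometric sense with the metric models; relative depth lacks the absolute scale a point cloud needs.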

Run on GPU

Move the model to CUDA for a 10-20x speedup:

import torch
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForDepthEstimation.from_pretrained(
    "depth-anything/Depth-Anything-V2-Large-hf"
).to(device)
image_processor = AutoImageProcessor.from_pretrained(
    "depth-anything/Depth-Anything-V2-Large-hf"
)

inputs = image_processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

The Large model uses about 1.5GB of VRAM. If you’re tight on memory, use torch.float16:

model = AutoModelForDepthEstimation.from_pretrained(
    "depth-anything/Depth-Anything-V2-Large-hf",
    torch_dtype=torch.float16,
).to("cuda")

Common Errors and Fixes

RuntimeError: Expected all tensors to be on the same device

You moved the model to CUDA but forgot to move the inputs. Both need to be on the same device:

inputs = image_processor(images=image, return_tensors="pt").to(device)

KeyError: 'depth' when using the pipeline

This happens on older transformers versions where the pipeline returns a different key. Update transformers:

pip install --upgrade transformers

You need transformers >= 4.42.0 for Depth Anything V2 support.
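If you want to fail fast with a clear message instead of a cryptic KeyError, a small version-check helper can be dropped into a script (a sketch; it ignores pre-release suffixes):

```python
def meets_minimum(version: str, minimum=(4, 42, 0)) -> bool:
    """Compare an 'X.Y.Z' version string against a minimum tuple."""
    parts = tuple(int(p) for p in version.split(".")[:3] if p.isdigit())
    return parts >= minimum

# Usage: check the installed transformers version before building the pipeline
# import transformers
# assert meets_minimum(transformers.__version__), "upgrade transformers"
print(meets_minimum("4.41.2"), meets_minimum("4.44.0"))  # False True
```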

OutOfMemoryError: CUDA out of memory

Reduce image resolution before inference. The model reshapes inputs internally, but very high-res images (4K+) still consume significant VRAM during the interpolation step:

image = image.resize((1024, 768))  # Scale down before inference

Or switch to a smaller model variant. Going from Large (335M) to Small (25M) cuts VRAM usage by roughly 5x.
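Rather than hard-coding a resolution, you can cap the longest side while preserving aspect ratio. A small helper sketch (not part of the library):

```python
from PIL import Image
import numpy as np

def cap_longest_side(image: Image.Image, max_side: int = 1024) -> Image.Image:
    """Downscale so the longest edge is at most max_side; never upscale."""
    scale = max_side / max(image.size)
    if scale >= 1.0:
        return image
    new_size = (round(image.width * scale), round(image.height * scale))
    return image.resize(new_size, Image.LANCZOS)

big = Image.fromarray(np.zeros((3000, 4000, 3), dtype="uint8"))  # 4000x3000 stand-in
small = cap_longest_side(big)
print(small.size)  # (1024, 768)
```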

No module named 'depth_anything_v2'

You’re trying to use the standalone repo instead of the Hugging Face integration. If you want the Transformers path (recommended), you don’t need to install depth_anything_v2 at all – just pip install transformers torch.

How It Works Under the Hood

Depth Anything V2 pairs a DINOv2 vision transformer encoder with a DPT (Dense Prediction Transformer) decoder. The encoder extracts multi-scale features from the input image; the decoder fuses those features into a dense depth prediction at the original resolution.
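The fusion step can be sketched conceptually as upsampling the coarsest feature map and adding in progressively finer ones. A toy NumPy illustration (not the actual DPT code, which uses learned convolutions at each stage):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (H, W) feature map."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

# Toy multi-scale features, coarse to fine, as the encoder might emit
features = [np.ones((4, 4)), np.ones((8, 8)), np.ones((16, 16))]

fused = features[0]
for finer in features[1:]:
    fused = upsample2x(fused) + finer  # fuse coarse context with fine detail

print(fused.shape)  # (16, 16)
```

The coarse maps contribute global scene context; the fine maps restore the sharp object boundaries the final depth map needs.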

The key architectural insight from V2 is the training pipeline. V1 trained on labeled real images, which are inherently noisy. V2 instead trains a large teacher model on high-quality synthetic data (where ground-truth depth is perfect), then uses that teacher to pseudo-label 62 million real images. The student models learn from these pseudo-labels, getting the best of both worlds: synthetic-quality supervision with real-world visual diversity.

This is why V2 produces noticeably sharper edges and fewer artifacts around object boundaries compared to V1, especially on reflective surfaces and thin structures like fences and poles.