Grounding DINO is one of the strongest open-set object detectors you can run today. Traditional detectors like YOLO must be trained on a fixed set of classes. Grounding DINO takes a different approach: you give it a text prompt describing what you want to find, and it localizes those objects in the image. No training data needed. No fine-tuning. Just a sentence and an image.
Here’s a working example that detects cats and dogs in an image using the Hugging Face transformers library:
```python
import torch
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from PIL import Image
import requests

model_id = "IDEA-Research/grounding-dino-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

text = "a cat. a remote control."

inputs = processor(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.3,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)

for box, score, label in zip(results[0]["boxes"], results[0]["scores"], results[0]["labels"]):
    print(f"{label}: {score:.2f} at {box.tolist()}")
```
That’s it. You pass period-separated class names as the text prompt, and Grounding DINO returns bounding boxes with confidence scores. The model handles the text-image matching internally using a DINO-style transformer backbone fused with a text encoder.
## Installing Dependencies
You need transformers, torch, Pillow, and requests. SAM, which we'll combine with Grounding DINO later, also ships in transformers, so one install covers everything:
```bash
pip install transformers torch torchvision Pillow requests matplotlib
```
For GPU inference, make sure you have the right CUDA version of PyTorch installed. Check pytorch.org for the install command matching your CUDA version. CPU works fine for testing but expect 5-10x slower inference.
## Detecting Objects with Text Prompts
The key thing to understand about Grounding DINO prompts: classes are separated by periods. Not commas, not newlines. Periods. The model was trained this way and it matters.
```python
import torch
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from PIL import Image, ImageDraw
import requests

model_id = "IDEA-Research/grounding-dino-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Period-separated class names — this format is required
text = "a cat. a remote control. a couch."

inputs = processor(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.3,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)

# Draw bounding boxes on the image
draw = ImageDraw.Draw(image)
colors = {"cat": "red", "remote control": "blue", "couch": "green"}

for box, score, label in zip(results[0]["boxes"], results[0]["scores"], results[0]["labels"]):
    x1, y1, x2, y2 = box.tolist()
    color = colors.get(label, "yellow")
    draw.rectangle([x1, y1, x2, y2], outline=color, width=3)
    draw.text((x1, y1 - 12), f"{label}: {score:.2f}", fill=color)

image.save("detected_objects.png")
print(f"Found {len(results[0]['boxes'])} objects, saved to detected_objects.png")
```
Two thresholds control detection sensitivity. box_threshold filters out low-confidence bounding boxes — lower it to catch more objects at the cost of false positives. text_threshold controls how strongly the detected region must match the text prompt. Start with 0.3/0.25 and adjust from there.
You can also use free-form descriptions instead of simple class names. Prompts like "a person wearing a red hat." or "a wooden chair next to a table." work because the text encoder understands natural language, not just class labels.
## Batch Processing Multiple Images
For real workloads you need to process directories of images and save structured results. Here’s a pipeline function that does exactly that:
```python
import json
import torch
from pathlib import Path
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from PIL import Image

def build_grounding_pipeline(model_id="IDEA-Research/grounding-dino-base"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
    return processor, model, device

def detect_objects(image_path, text_prompt, processor, model, device,
                   box_threshold=0.3, text_threshold=0.25):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=text_prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=box_threshold,
        text_threshold=text_threshold,
        target_sizes=[image.size[::-1]],
    )
    detections = []
    for box, score, label in zip(results[0]["boxes"], results[0]["scores"], results[0]["labels"]):
        detections.append({
            "label": label,
            "score": round(score.item(), 4),
            "box": [round(c, 2) for c in box.tolist()],
        })
    return detections

def process_directory(image_dir, text_prompt, output_path="results.json",
                      box_threshold=0.3, text_threshold=0.25):
    processor, model, device = build_grounding_pipeline()
    image_extensions = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}
    image_dir = Path(image_dir)
    all_results = {}
    for image_path in sorted(image_dir.iterdir()):
        if image_path.suffix.lower() not in image_extensions:
            continue
        print(f"Processing {image_path.name}...")
        detections = detect_objects(
            str(image_path), text_prompt, processor, model, device,
            box_threshold, text_threshold,
        )
        all_results[image_path.name] = detections
        print(f"  Found {len(detections)} objects")
    with open(output_path, "w") as f:
        json.dump(all_results, f, indent=2)
    print(f"Saved results for {len(all_results)} images to {output_path}")
    return all_results

# Usage
results = process_directory(
    image_dir="./images",
    text_prompt="a person. a car. a bicycle.",
    output_path="detections.json",
    box_threshold=0.35,
)
```
The output JSON looks like this for each image:
```json
{
  "street_001.jpg": [
    {"label": "person", "score": 0.8712, "box": [142.5, 89.3, 298.1, 412.7]},
    {"label": "car", "score": 0.9341, "box": [450.0, 200.5, 780.2, 490.1]}
  ]
}
```
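The `box` values are corner coordinates, `[x1, y1, x2, y2]` in pixels. If a downstream consumer expects COCO-style `[x, y, width, height]` instead, a small conversion helper does the trick (the function name is ours, and the rounding matches the two-decimal boxes in the JSON above):

```python
def xyxy_to_xywh(box):
    """Convert a [x1, y1, x2, y2] corner box to COCO-style [x, y, width, height]."""
    x1, y1, x2, y2 = box
    return [round(v, 2) for v in (x1, y1, x2 - x1, y2 - y1)]

# First detection from the sample output above
print(xyxy_to_xywh([142.5, 89.3, 298.1, 412.7]))  # [142.5, 89.3, 155.6, 323.4]
```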
One thing to watch: Grounding DINO loads about 1 GB into GPU memory. If you’re processing thousands of images, the model stays loaded between calls so there’s no repeated loading overhead. But if you’re running multiple pipelines, memory adds up fast.
## Combining with SAM for Segmentation
Grounding DINO gives you boxes. SAM gives you pixel-perfect masks. Together they form one of the most powerful zero-shot detection and segmentation pipelines available. You describe what to find in text, DINO localizes it, and SAM segments the exact region.
```python
import torch
import numpy as np
from transformers import (
    AutoProcessor,
    AutoModelForZeroShotObjectDetection,
    SamModel,
    SamProcessor,
)
from PIL import Image
import requests

# Load Grounding DINO
dino_id = "IDEA-Research/grounding-dino-base"
device = "cuda" if torch.cuda.is_available() else "cpu"
dino_processor = AutoProcessor.from_pretrained(dino_id)
dino_model = AutoModelForZeroShotObjectDetection.from_pretrained(dino_id).to(device)

# Load SAM
sam_id = "facebook/sam-vit-base"
sam_processor = SamProcessor.from_pretrained(sam_id)
sam_model = SamModel.from_pretrained(sam_id).to(device)

# Load image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Step 1: Detect with Grounding DINO
text = "a cat."
dino_inputs = dino_processor(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    dino_outputs = dino_model(**dino_inputs)

dino_results = dino_processor.post_process_grounded_object_detection(
    dino_outputs,
    dino_inputs.input_ids,
    box_threshold=0.3,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
boxes = dino_results[0]["boxes"].cpu().numpy()
labels = dino_results[0]["labels"]
scores = dino_results[0]["scores"].cpu().numpy()
print(f"Grounding DINO found {len(boxes)} objects")

# Step 2: Segment each detected box with SAM
# SAM expects boxes in [x1, y1, x2, y2] format — same as DINO output
input_boxes = [boxes.tolist()]  # SAM expects a batch of box lists
sam_inputs = sam_processor(
    image,
    input_boxes=input_boxes,
    return_tensors="pt",
).to(device)
with torch.no_grad():
    sam_outputs = sam_model(**sam_inputs)

masks = sam_processor.image_processor.post_process_masks(
    sam_outputs.pred_masks.cpu(),
    sam_inputs["original_sizes"].cpu(),
    sam_inputs["reshaped_input_sizes"].cpu(),
)

# masks[0] has shape [num_boxes, 3, H, W] — pick the best mask per box (index 0 or highest IoU)
final_masks = masks[0][:, 0, :, :]  # shape: [num_boxes, H, W]

# Visualize: overlay masks on the image
image_np = np.array(image)
overlay = image_np.copy()
mask_colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]

for i, mask in enumerate(final_masks):
    mask_bool = mask.numpy().astype(bool)
    color = mask_colors[i % len(mask_colors)]
    overlay[mask_bool] = (
        overlay[mask_bool] * 0.5 + np.array(color) * 0.5
    ).astype(np.uint8)
    print(f"Object {i}: {labels[i]} (score: {scores[i]:.2f}), mask pixels: {mask_bool.sum()}")

result_image = Image.fromarray(overlay)
result_image.save("grounded_sam_output.png")
print("Saved segmented output to grounded_sam_output.png")
```
This pipeline is sometimes called “Grounded SAM” in the research community. It’s the go-to approach for zero-shot instance segmentation. Both models together use about 2.5 GB of GPU memory with the base variants.
## Common Errors and Fixes
**CUDA out of memory with large images**
Grounding DINO resizes inputs internally, but very high-resolution images (4K+) can still blow up memory. Resize before passing to the processor:
```python
max_size = 1024
image = Image.open("large_photo.jpg").convert("RGB")
image.thumbnail((max_size, max_size), Image.LANCZOS)
```
This keeps aspect ratio intact while capping the longest side at 1024 pixels. You won’t lose meaningful detection accuracy.
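The resize math here is just a uniform scale on both sides. If you want to predict the post-thumbnail dimensions without touching pixels, a tiny sketch (the function name is ours) mirrors what capping the longest side does:

```python
def capped_size(width, height, max_size=1024):
    """Scale (width, height) uniformly so the longest side is at most max_size."""
    scale = min(1.0, max_size / max(width, height))
    return (round(width * scale), round(height * scale))

print(capped_size(4032, 3024))  # (1024, 768): a 4:3 photo capped at 1024
print(capped_size(800, 600))    # (800, 600): already small enough, unchanged
```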
**“KeyError” or empty results from bad prompt format**
Grounding DINO expects period-separated prompts. If you pass comma-separated text or forget trailing periods, you’ll get poor results or nothing at all:
```python
# Wrong — commas don't work as class separators
text = "a cat, a dog, a person"

# Right — periods separate each class
text = "a cat. a dog. a person."
```
Each phrase between periods is treated as a separate detection query. The trailing period after the last class is optional but recommended for consistency.
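Rather than hand-writing that format, you can assemble the prompt from a list of phrases. A minimal helper (the name is ours) that normalizes stray whitespace and trailing periods:

```python
def build_prompt(phrases):
    """Join phrases into Grounding DINO's period-separated prompt format."""
    return ". ".join(p.strip().rstrip(".").strip() for p in phrases) + "."

print(build_prompt(["a cat", "a dog.", " a person "]))  # a cat. a dog. a person.
```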
**Model loading fails with “Could not find config” or weight mismatch**
This usually happens with an outdated transformers version. Grounding DINO support landed in transformers 4.40.0. Check your version and upgrade (quote the spec so the shell doesn't treat `>=` as a redirect):

```bash
pip install --upgrade "transformers>=4.40.0"
```
If you’re behind a corporate proxy or firewall, the model download might time out. Set the cache directory explicitly and download ahead of time:
```python
import os
os.environ["HF_HOME"] = "/path/to/your/cache"

# Or download from CLI first:
# huggingface-cli download IDEA-Research/grounding-dino-base
```
**Duplicate detections for the same object**
Grounding DINO sometimes returns overlapping boxes for the same object, especially with low thresholds. Apply non-maximum suppression to clean them up:
```python
from torchvision.ops import nms

boxes = results[0]["boxes"]
scores = results[0]["scores"]
keep = nms(boxes, scores, iou_threshold=0.5)

filtered_boxes = boxes[keep]
filtered_scores = scores[keep]
filtered_labels = [results[0]["labels"][i] for i in keep]
```
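Note that `nms` here is class-agnostic: a high-scoring box of one label can suppress an overlapping box of a different label. If you want suppression only within the same label, torchvision's `batched_nms` does this on tensors; as a minimal pure-Python sketch of the same idea (helper names are ours), operating on the detection dicts from the batch pipeline:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms_per_label(detections, iou_threshold=0.5):
    """Greedy NMS applied independently per label.

    detections: list of dicts with "label", "score", "box" ([x1, y1, x2, y2]).
    """
    kept = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if all(k["label"] != det["label"] or iou(k["box"], det["box"]) < iou_threshold
               for k in kept):
            kept.append(det)
    return kept
```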