BLIP (Bootstrapping Language-Image Pre-training) from Salesforce is the strongest open-source option for image captioning right now. It outperforms older encoder-decoder models by a wide margin, handles diverse image types well, and runs on consumer hardware. The Hugging Face Transformers library makes it dead simple to load BLIP and start generating captions in a few lines of Python.
This guide walks you through building a captioning pipeline from a single image to batch processing an entire directory, then upgrading to BLIP-2 for even better results.
## Loading BLIP for Image Captioning
Install the dependencies first:
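A minimal install along these lines covers the imports used in this guide (the exact package list is an assumption, reconstructed from the code that follows):

```shell
pip install transformers torch pillow
```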
Load the model and processor. The base variant is fast and good enough for most tasks. Use the large variant if you need higher quality and have the VRAM for it.
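A sketch of the loading code using the standard Transformers classes (the device-selection line is an assumption, not from the original):

```python
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

# pick the GPU if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id).to(device)
```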
The base model needs about 1 GB of VRAM. The large variant (`Salesforce/blip-image-captioning-large`) uses around 2 GB and produces noticeably better captions on complex scenes. Swap the model ID to upgrade:
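Upgrading is just a different model ID; everything else stays the same:

```python
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# same loading code, larger checkpoint
model_id = "Salesforce/blip-image-captioning-large"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id).to(device)
```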
## Captioning Single Images
BLIP supports two captioning modes: unconditional (the model decides what to say) and conditional (you provide a text prompt to steer the caption).
### Unconditional Captioning
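A sketch, assuming the `processor`, `model`, and `device` from the loading step; `photo.jpg` is a placeholder filename:

```python
from PIL import Image

image = Image.open("photo.jpg").convert("RGB")  # placeholder path

# no text prompt: the model describes the image on its own
inputs = processor(images=image, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=50)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)
```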
The `max_new_tokens` parameter controls caption length. Setting it too low cuts off the description mid-sentence; a value of 50 is a safe default for single-sentence captions.
### Conditional Captioning
Pass a text prompt to guide what the model focuses on. This is useful when you want captions that describe a specific aspect of the image.
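A sketch along the same lines, reusing `image`, `processor`, `model`, and `device` from above; the prompt string is just an example:

```python
prompt = "a photo of a"  # the model continues this phrase

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))
```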
The model completes your prompt based on what it sees. Try prompts like "this image shows", "a photo of a", or "the scene contains" to get different angles on the same image.
### Loading from a Local File
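A small helper is one way to do this (the function name is mine, not from the original):

```python
from PIL import Image

def load_image(path):
    # BLIP expects 3-channel RGB; convert() normalizes RGBA PNGs and grayscale
    return Image.open(path).convert("RGB")
```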
Always call `.convert("RGB")` on the image. BLIP expects 3-channel RGB input, and PNGs with alpha channels or grayscale images will throw shape mismatch errors without this.
## Batch Captioning a Directory of Images
Processing images one at a time is fine for a handful, but for hundreds or thousands you want batched inference. The processor handles padding automatically when you pass a list of images.
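A sketch of the batched loop, assuming the `processor`, `model`, and `device` from the loading step (the helper name and extension list are mine):

```python
from pathlib import Path
from PIL import Image

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def caption_directory(image_dir, batch_size=8):
    paths = sorted(p for p in Path(image_dir).iterdir()
                   if p.suffix.lower() in IMAGE_EXTS)
    results = {}
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        images = [Image.open(p).convert("RGB") for p in batch]
        # passing a list lets the processor batch and pad the inputs itself
        inputs = processor(images=images, return_tensors="pt").to(device)
        out = model.generate(**inputs, max_new_tokens=50)
        captions = processor.batch_decode(out, skip_special_tokens=True)
        results.update(zip((p.name for p in batch), captions))
    return results
```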
A batch size of 8 works well on an 8 GB GPU with the base model. Drop it to 4 if you hit out-of-memory errors, or increase to 16 if you have headroom. On CPU, batch size matters less since computation dominates over memory transfer.
To save results for later use:
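One option is a plain CSV (the format choice and helper name are mine):

```python
import csv

def save_captions(results, out_path="captions.csv"):
    """Write a {filename: caption} dict as a two-column CSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "caption"])
        for name in sorted(results):
            writer.writerow([name, results[name]])
```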
## Using BLIP-2 for Better Captions
BLIP-2 connects a frozen image encoder to a frozen LLM through a lightweight Q-Former bridge. The result is significantly better captions with more detail and fewer hallucinations. The trade-off is higher memory usage.
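A sketch of the BLIP-2 setup, following the standard Transformers API; `image` is assumed to be a PIL image loaded as in the earlier sections:

```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # halve the memory footprint
    device_map="auto",           # let accelerate place the weights
)

# match the input dtype to the float16 model
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))
```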
The `accelerate` package is required for BLIP-2's model loading (it provides the `device_map="auto"` placement):
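```shell
pip install accelerate
```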
BLIP-2 with OPT-2.7B needs about 6 GB of VRAM in `float16`. If that’s too much, use 8-bit quantization:
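One way to request 8-bit weights on current Transformers versions (older releases passed `load_in_8bit=True` directly to `from_pretrained`):

```python
from transformers import BitsAndBytesConfig, Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```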
This drops memory usage to around 3.5 GB with minimal quality loss. You need `bitsandbytes` installed for quantization:
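```shell
pip install bitsandbytes
```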
## BLIP vs BLIP-2 Comparison
| Aspect | BLIP (base) | BLIP (large) | BLIP-2 (OPT-2.7B) |
|---|---|---|---|
| VRAM | ~1 GB | ~2 GB | ~6 GB (fp16) |
| Speed | Fast | Medium | Slower |
| Caption quality | Good | Better | Best |
| Detail level | Basic | Moderate | Rich descriptions |
Pick BLIP base for high-throughput pipelines where speed matters. Pick BLIP-2 when caption quality is the priority and you can afford the latency.
## Common Errors and Fixes
### `RuntimeError: expected scalar type Float but found Half`
This happens when your input tensor dtype doesn’t match the model dtype. If you loaded the model in `float16`, make sure inputs are also `float16`:
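A sketch of the fix, assuming the BLIP-2 setup above; the `BatchFeature` returned by the processor accepts a dtype alongside the device in its `.to()`:

```python
import torch

# cast the floating-point inputs (pixel_values) to float16 to match the model
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=50)
```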
### `OSError: Can't load tokenizer for 'Salesforce/blip2-opt-2.7b'`
This usually means `transformers` is outdated; BLIP-2 support was added in v4.27. Upgrade:
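```shell
pip install --upgrade transformers
```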
### `PIL.UnidentifiedImageError: cannot identify image file`
The image file is corrupted or in an unsupported format. Wrap your image loading in a try/except when processing directories:
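A defensive loader along these lines works (the helper name is mine):

```python
from pathlib import Path
from PIL import Image, UnidentifiedImageError

def load_images_safely(image_dir):
    """Return {filename: RGB image}, skipping files PIL cannot read."""
    images = {}
    for path in sorted(Path(image_dir).iterdir()):
        try:
            images[path.name] = Image.open(path).convert("RGB")
        except (UnidentifiedImageError, OSError):
            print(f"Skipping unreadable file: {path.name}")
    return images
```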
### `torch.cuda.OutOfMemoryError` during batch processing
Reduce the batch size or switch to float16. You can also free GPU memory between batches:
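A sketch of a batch step that cleans up after itself, reusing the `processor`, `model`, and `device` from the loading step:

```python
import gc
import torch

def caption_batch(images):
    inputs = processor(images=images, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=50)
    captions = processor.batch_decode(out, skip_special_tokens=True)
    # drop tensor references, then hand cached blocks back to the allocator
    del inputs, out
    gc.collect()
    torch.cuda.empty_cache()
    return captions
```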
### Captions start with "arafed" or other gibberish
This is a known BLIP artifact that occasionally appears, especially with the base model, when decoding wanders onto a low-probability token path. Switching to beam search with `num_beams=3` usually fixes it:
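The generate call with beam search, reusing `inputs`, `model`, and `processor` from the captioning examples:

```python
# num_beams=3 keeps three candidate sequences instead of one token path
out = model.generate(**inputs, max_new_tokens=50, num_beams=3)
caption = processor.decode(out[0], skip_special_tokens=True)
```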
## Related Guides
- How to Classify Images with Vision Transformers in PyTorch
- How to Build a Visual Grounding Pipeline with Grounding DINO
- How to Build a Video Frame Interpolation Pipeline with RIFE
- How to Build a Document Comparison Pipeline with Vision Models
- How to Build a Lane Detection Pipeline with OpenCV and YOLO
- How to Build a Real-Time Pose Estimation Pipeline with MediaPipe
- How to Build a Vehicle Counting Pipeline with YOLOv8 and OpenCV
- How to Classify Documents with Vision Models and DiT
- How to Build a Scene Text Recognition Pipeline with PaddleOCR
- How to Build Visual Question Answering with BLIP-2 and InstructBLIP