BLIP (Bootstrapping Language-Image Pre-training) from Salesforce is the strongest open-source option for image captioning right now. It outperforms older encoder-decoder models by a wide margin, handles diverse image types well, and runs on consumer hardware. The Hugging Face Transformers library makes it dead simple to load BLIP and start generating captions in a few lines of Python.
This guide walks you through building a captioning pipeline from a single image to batch processing an entire directory, then upgrading to BLIP-2 for even better results.
## Loading BLIP for Image Captioning
Install the dependencies first:
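A minimal install along these lines covers the imports used in this guide (the exact package list is an assumption, reconstructed from the code that follows):

```shell
pip install transformers torch pillow
```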
Load the model and processor. The base variant is fast and good enough for most tasks. Use the large variant if you need higher quality and have the VRAM for it.
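A sketch of the loading code using the standard Transformers classes (the device-selection line is an assumption, not from the original):

```python
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

# pick the GPU if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id).to(device)
```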
The base model needs about 1 GB of VRAM. The large variant (`Salesforce/blip-image-captioning-large`) uses around 2 GB and produces noticeably better captions on complex scenes. Swap the model ID to upgrade:
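Upgrading is just a different model ID; everything else stays the same:

```python
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# same loading code, larger checkpoint
model_id = "Salesforce/blip-image-captioning-large"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id).to(device)
```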
## Captioning Single Images
BLIP supports two captioning modes: unconditional (the model decides what to say) and conditional (you provide a text prompt to steer the caption).
### Unconditional Captioning
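A sketch, assuming the `processor`, `model`, and `device` from the loading step; `photo.jpg` is a placeholder filename:

```python
from PIL import Image

image = Image.open("photo.jpg").convert("RGB")  # placeholder path

# no text prompt: the model describes the image on its own
inputs = processor(images=image, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=50)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)
```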
The `max_new_tokens` parameter controls caption length. Setting it too low cuts off the description mid-sentence; a value of 50 is a safe default for single-sentence captions.
### Conditional Captioning
Pass a text prompt to guide what the model focuses on. This is useful when you want captions that describe a specific aspect of the image.
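A sketch along the same lines, reusing `image`, `processor`, `model`, and `device` from above; the prompt string is just an example:

```python
prompt = "a photo of a"  # the model continues this phrase

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))
```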
The model completes your prompt based on what it sees. Try prompts like "this image shows", "a photo of a", or "the scene contains" to get different angles on the same image.
### Loading from a Local File
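A small helper is one way to do this (the function name is mine, not from the original):

```python
from PIL import Image

def load_image(path):
    # BLIP expects 3-channel RGB; convert() normalizes RGBA PNGs and grayscale
    return Image.open(path).convert("RGB")
```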
Always call `.convert("RGB")` on the image. BLIP expects 3-channel RGB input, and PNGs with alpha channels or grayscale images will throw shape mismatch errors without this.
## Batch Captioning a Directory of Images
Processing images one at a time is fine for a handful, but for hundreds or thousands you want batched inference. The processor handles padding automatically when you pass a list of images.
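A sketch of the batched loop, assuming the `processor`, `model`, and `device` from the loading step (the helper name and extension list are mine):

```python
from pathlib import Path
from PIL import Image

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def caption_directory(image_dir, batch_size=8):
    paths = sorted(p for p in Path(image_dir).iterdir()
                   if p.suffix.lower() in IMAGE_EXTS)
    results = {}
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        images = [Image.open(p).convert("RGB") for p in batch]
        # passing a list lets the processor batch and pad the inputs itself
        inputs = processor(images=images, return_tensors="pt").to(device)
        out = model.generate(**inputs, max_new_tokens=50)
        captions = processor.batch_decode(out, skip_special_tokens=True)
        results.update(zip((p.name for p in batch), captions))
    return results
```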
A batch size of 8 works well on an 8 GB GPU with the base model. Drop it to 4 if you hit out-of-memory errors, or increase to 16 if you have headroom. On CPU, batch size matters less since computation dominates over memory transfer.
To save results for later use:
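One option is a plain CSV (the format choice and helper name are mine):

```python
import csv

def save_captions(results, out_path="captions.csv"):
    """Write a {filename: caption} dict as a two-column CSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "caption"])
        for name in sorted(results):
            writer.writerow([name, results[name]])
```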
## Using BLIP-2 for Better Captions
BLIP-2 connects a frozen image encoder to a frozen LLM through a lightweight Q-Former bridge. The result is significantly better captions with more detail and fewer hallucinations. The trade-off is higher memory usage.
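A sketch of the BLIP-2 setup, following the standard Transformers API; `image` is assumed to be a PIL image loaded as in the earlier sections:

```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # halve the memory footprint
    device_map="auto",           # let accelerate place the weights
)

# match the input dtype to the float16 model
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))
```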
The `accelerate` package is required for BLIP-2's model loading (it provides the `device_map="auto"` placement):
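```shell
pip install accelerate
```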
BLIP-2 with OPT-2.7B needs about 6 GB of VRAM in `float16`. If that’s too much, use 8-bit quantization:
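One way to request 8-bit weights on current Transformers versions (older releases passed `load_in_8bit=True` directly to `from_pretrained`):

```python
from transformers import BitsAndBytesConfig, Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```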
This drops memory usage to around 3.5 GB with minimal quality loss. You need `bitsandbytes` installed for quantization:
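```shell
pip install bitsandbytes
```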
## BLIP vs BLIP-2 Comparison
| Aspect | BLIP (base) | BLIP (large) | BLIP-2 (OPT-2.7B) |
|---|---|---|---|
| VRAM | ~1 GB | ~2 GB | ~6 GB (fp16) |
| Speed | Fast | Medium | Slower |
| Caption quality | Good | Better | Best |
| Detail level | Basic | Moderate | Rich descriptions |
Pick BLIP base for high-throughput pipelines where speed matters. Pick BLIP-2 when caption quality is the priority and you can afford the latency.
## Common Errors and Fixes
### `RuntimeError: expected scalar type Float but found Half`
This happens when your input tensor dtype doesn’t match the model dtype. If you loaded the model in `float16`, make sure inputs are also `float16`:
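A sketch of the fix, assuming the BLIP-2 setup above; the `BatchFeature` returned by the processor accepts a dtype alongside the device in its `.to()`:

```python
import torch

# cast the floating-point inputs (pixel_values) to float16 to match the model
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=50)
```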
### `OSError: Can't load tokenizer for 'Salesforce/blip2-opt-2.7b'`
This usually means `transformers` is outdated; BLIP-2 support was added in v4.27. Upgrade:
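```shell
pip install --upgrade transformers
```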
### `PIL.UnidentifiedImageError: cannot identify image file`
The image file is corrupted or in an unsupported format. Wrap your image loading in a try/except when processing directories:
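A defensive loader along these lines works (the helper name is mine):

```python
from pathlib import Path
from PIL import Image, UnidentifiedImageError

def load_images_safely(image_dir):
    """Return {filename: RGB image}, skipping files PIL cannot read."""
    images = {}
    for path in sorted(Path(image_dir).iterdir()):
        try:
            images[path.name] = Image.open(path).convert("RGB")
        except (UnidentifiedImageError, OSError):
            print(f"Skipping unreadable file: {path.name}")
    return images
```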
### `torch.cuda.OutOfMemoryError` during batch processing
Reduce the batch size or switch to float16. You can also free GPU memory between batches:
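A sketch of a batch step that cleans up after itself, reusing the `processor`, `model`, and `device` from the loading step:

```python
import gc
import torch

def caption_batch(images):
    inputs = processor(images=images, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=50)
    captions = processor.batch_decode(out, skip_special_tokens=True)
    # drop tensor references, then hand cached blocks back to the allocator
    del inputs, out
    gc.collect()
    torch.cuda.empty_cache()
    return captions
```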
### Captions start with "arafed" or other gibberish
This is a known BLIP artifact that occasionally appears, especially with the base model, when decoding wanders onto a low-probability token path. Switching to beam search with `num_beams=3` usually fixes it:
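The generate call with beam search, reusing `inputs`, `model`, and `processor` from the captioning examples:

```python
# num_beams=3 keeps three candidate sequences instead of one token path
out = model.generate(**inputs, max_new_tokens=50, num_beams=3)
caption = processor.decode(out[0], skip_special_tokens=True)
```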
## Related Guides
- How to Classify Images with Vision Transformers in PyTorch
- How to Build a Visual Grounding Pipeline with Grounding DINO
- How to Build a Video Frame Interpolation Pipeline with RIFE
- How to Build a Document Comparison Pipeline with Vision Models
- How to Build a Lane Detection Pipeline with OpenCV and YOLO
- How to Build a Real-Time Pose Estimation Pipeline with MediaPipe
- How to Build a Vehicle Counting Pipeline with YOLOv8 and OpenCV
- How to Classify Documents with Vision Models and DiT
- How to Build a Scene Text Recognition Pipeline with PaddleOCR
- How to Build Visual Question Answering with BLIP-2 and InstructBLIP