Here’s the fastest way to build a visual question answering system with BLIP-2:
```python
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

# Load model and processor
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Ask questions about an image
image = Image.open("product.jpg").convert("RGB")
question = "What color is the product packaging?"

inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(f"Answer: {answer}")
```
This loads a 2.7B-parameter model that can answer questions about arbitrary images. The device_map="auto" setting places model shards across available GPUs automatically, and torch.float16 roughly halves memory usage with negligible accuracy loss.
Why BLIP-2 Beats Older VQA Models
BLIP-2 uses a Querying Transformer (Q-Former) that bridges a frozen vision encoder and a frozen language model. This architecture gives you strong language-model reasoning about images without training a massive multimodal model from scratch, and the frozen components make fine-tuning far faster and cheaper than end-to-end training.
InstructBLIP adds instruction tuning on top of BLIP-2, making it better at following specific instructions like “List all text visible in this image” or “Describe the defects in this product photo.”
For production use cases, stick with InstructBLIP unless you need the absolute lowest latency; the instruction-following capability is usually worth the small performance hit.
Batch Processing Multiple Images
When analyzing hundreds of product images or medical scans, batch processing saves time:
```python
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
from PIL import Image
import torch
from pathlib import Path

# Load InstructBLIP (better for structured queries)
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b",
    torch_dtype=torch.float16,
    device_map="auto"
)

def process_batch(image_paths, questions, batch_size=4):
    """Process images in batches with different questions per image."""
    results = []
    for i in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[i:i + batch_size]
        batch_questions = questions[i:i + batch_size]

        # Load images
        images = [Image.open(p).convert("RGB") for p in batch_paths]

        # Process batch
        inputs = processor(
            images=images,
            text=batch_questions,
            return_tensors="pt",
            padding=True
        ).to("cuda", torch.float16)

        # Generate answers (beam search; note that temperature is ignored
        # unless do_sample=True, so it is omitted here)
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=100,
            num_beams=3  # Better quality than greedy decoding
        )
        answers = processor.batch_decode(generated_ids, skip_special_tokens=True)

        # Collect results
        for path, question, answer in zip(batch_paths, batch_questions, answers):
            results.append({
                "image": str(path),
                "question": question,
                "answer": answer.strip()
            })
    return results

# Example: Analyze product defects
image_paths = list(Path("products/").glob("*.jpg"))
questions = ["Describe any visible defects or damage in this product."] * len(image_paths)
results = process_batch(image_paths, questions, batch_size=8)

# Filter products with issues
defective = [r for r in results
             if "defect" in r["answer"].lower() or "damage" in r["answer"].lower()]
print(f"Found {len(defective)} potentially defective products")
```
The num_beams=3 parameter enables beam search instead of greedy decoding, which produces more accurate answers at the cost of roughly 3x slower inference. For most production use cases, the quality improvement is worth it.
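The slicing pattern inside process_batch can be factored into a small reusable helper. This is plain Python, not part of the transformers API, and the helper name is my own:

```python
def chunk(seq, size):
    """Yield successive slices of seq with at most size items each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Example: six image paths with batch_size=4 produce batches of 4 and 2
batches = list(chunk(["a", "b", "c", "d", "e", "f"], 4))
# -> [["a", "b", "c", "d"], ["e", "f"]]
```

Separating the batching logic also makes it easy to unit-test without loading the model.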
Fine-Tuning on Custom VQA Datasets
Out-of-the-box models work great for general questions, but fine-tuning on domain-specific data dramatically improves accuracy for specialized tasks like medical imaging or technical documentation.
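Before training, it pays to validate the JSONL training file up front rather than crash mid-epoch on a malformed line. A minimal loader sketch (the helper name and error handling are my own, not from any library):

```python
import json

REQUIRED_KEYS = {"image_path", "question", "answer"}

def load_vqa_jsonl(path):
    """Load VQA records from a JSONL file, failing fast on malformed lines."""
    records = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            records.append(record)
    return records
```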
```python
from transformers import Blip2Processor, Blip2ForConditionalGeneration, TrainingArguments, Trainer
from torch.utils.data import Dataset
from pathlib import Path
from PIL import Image
import torch
import json

class VQADataset(Dataset):
    def __init__(self, data, processor, image_root):
        self.data = data
        self.processor = processor
        self.image_root = Path(image_root)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        image = Image.open(self.image_root / item["image_path"]).convert("RGB")

        # Prepare inputs
        encoding = self.processor(
            images=image,
            text=item["question"],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=512
        )

        # Prepare labels (answer)
        labels = self.processor.tokenizer(
            item["answer"],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=128
        ).input_ids

        # Replace padding token id with -100 so it's ignored in the loss
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        encoding["labels"] = labels
        return {k: v.squeeze() for k, v in encoding.items()}

# Load your custom dataset (JSONL format: {"image_path": "...", "question": "...", "answer": "..."})
with open("medical_vqa_train.jsonl") as f:
    train_data = [json.loads(line) for line in f]

# Initialize model and processor
# Load weights in full precision for training; fp16=True below handles
# mixed precision (loading with torch_dtype=torch.float16 breaks gradient unscaling)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Create dataset
train_dataset = VQADataset(train_data, processor, image_root="medical_images/")

# Training arguments
training_args = TrainingArguments(
    output_dir="./blip2-medical-vqa",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 16
    learning_rate=5e-5,
    num_train_epochs=3,
    fp16=True,
    save_strategy="epoch",
    logging_steps=50,
    remove_unused_columns=False,
    dataloader_pin_memory=False  # Prevents OOM on some GPUs
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model("./blip2-medical-vqa-final")
```
The key to successful fine-tuning is proper label preparation. Setting padding positions to -100 makes the cross-entropy loss (whose default ignore_index is -100) skip them, preventing the model from learning to predict padding tokens.
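A toy illustration of that masking step, using plain lists and a hypothetical pad_token_id of 0 (the real value comes from processor.tokenizer.pad_token_id):

```python
PAD_TOKEN_ID = 0  # hypothetical value for illustration

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding token ids with -100 so the loss ignores those positions."""
    return [tok if tok != pad_token_id else -100 for tok in label_ids]

masked = mask_padding([42, 17, 99, 0, 0, 0])
# -> [42, 17, 99, -100, -100, -100]
```

This is exactly what the tensor assignment `labels[labels == pad_token_id] = -100` does, applied element-wise across the whole batch.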
Document Understanding and OCR + Reasoning
BLIP-2 and InstructBLIP can read text in images AND reason about it, making them perfect for document analysis:
```python
# Analyze an invoice or receipt (uses the processor and model loaded earlier)
image = Image.open("invoice.jpg").convert("RGB")

questions = [
    "What is the total amount on this invoice?",
    "Who is the vendor or company name?",
    "What is the invoice date?",
    "List all line items with quantities and prices."
]

for question in questions:
    inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)
    generated_ids = model.generate(**inputs, max_new_tokens=100)
    answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
    print(f"Q: {question}")
    print(f"A: {answer}\n")
```
For complex documents with tables or multi-column layouts, ask multiple targeted questions instead of one generic “extract all information” query. The models handle specific questions much better than broad requests.
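One way to organize those targeted questions is a field-to-question map with a thin wrapper around the model call. Here `ask` is a hypothetical callable of my own devising that takes `(image, question)` and returns the decoded answer string; in practice it would wrap the processor/generate/decode steps shown above:

```python
FIELD_QUESTIONS = {
    "total": "What is the total amount on this invoice?",
    "vendor": "Who is the vendor or company name?",
    "date": "What is the invoice date?",
}

def extract_fields(image, ask, field_questions=FIELD_QUESTIONS):
    """Ask one targeted question per field and collect answers into a dict."""
    return {field: ask(image, question) for field, question in field_questions.items()}
```

Keeping the questions in data rather than code makes it easy to add fields per document type without touching the inference loop.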
Common Errors and Fixes
CUDA out of memory errors: Use torch.float16 and reduce per_device_train_batch_size to 1 or 2. Enable gradient checkpointing with model.gradient_checkpointing_enable() before training.
Poor answer quality on fine-tuned models: You’re probably overfitting. Reduce num_train_epochs to 1-2, or use a smaller learning rate like 1e-5. Also verify your training data quality - garbage in, garbage out.
Slow inference on CPU: Don’t use CPU for BLIP-2. These models require GPU. Even a single RTX 3060 will be 50x faster than CPU inference. For CPU-only environments, use smaller VQA models like ViLT or CLIP with a lightweight LLM.
“Model outputs are empty or repetitive”: Lower the temperature parameter (try 0.3-0.7) and increase num_beams to 3-5. Also check that your input image isn’t corrupted or too low resolution - BLIP-2 works best with images at least 224x224 pixels.
Import errors for transformers: You need transformers >= 4.30.0. Upgrade with pip install --upgrade transformers accelerate.
InstructBLIP ignores instructions: Make sure you’re phrasing questions as clear imperative instructions, not vague prompts. “List all visible defects” works better than “Tell me about problems.”
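For the version requirement above, a quick runtime check is possible with the standard library. This is a naive sketch (it assumes plain numeric dotted versions like "4.30.0"; the helper is my own, not part of transformers):

```python
from importlib.metadata import version  # stdlib, Python 3.8+

def meets_minimum(installed, required):
    """Compare dotted version strings numerically, e.g. '4.9.1' < '4.30.0'."""
    def parse(v):
        return tuple(int(part) for part in v.split(".")[:3])
    return parse(installed) >= parse(required)

# In practice: meets_minimum(version("transformers"), "4.30.0")
ok = meets_minimum("4.9.2", "4.30.0")
# -> False (4.9.2 predates 4.30.0; naive string comparison would wrongly say True)
```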