## The One-Liner Answer
Send a base64-encoded image to any vision LLM with the prompt “Extract all text from this image” and you get back clean, structured text. No Tesseract config files, no preprocessing, no layout analysis pipeline. Vision LLMs handle skewed photos, handwriting, multi-column layouts, and tables in a single API call.
That works for receipts, screenshots, whiteboards, handwritten notes, scanned PDFs – basically anything a human can read.
## Why Vision LLMs Beat Traditional OCR
Tesseract and PaddleOCR are great when you have clean, high-resolution scans of printed text. But they fall apart fast in the real world:
- Skewed or rotated text requires preprocessing with OpenCV before Tesseract even looks at it
- Multi-column layouts produce garbled output without external layout detection
- Handwriting is effectively unusable with Tesseract (it was designed for printed text)
- Tables lose their structure entirely – you get a flat string with no column alignment
Vision LLMs skip all of this. They see the image the way you do: understanding that a heading relates to the paragraph below it, that columns are separate, and that a label belongs to a specific form field. GPT-4o, Claude Sonnet, and Gemini 2.5 Pro all score above 95% character accuracy on standard OCR benchmarks, and they handle messy real-world documents that Tesseract chokes on.
The tradeoff is cost and speed. Tesseract processes pages in milliseconds for free. Vision LLMs take 2-5 seconds per image and cost a few cents per call. For batch processing millions of clean documents, stick with Tesseract. For everything else, vision LLMs win.
## Extracting Structured Data with Claude
Raw text extraction is useful, but the real power is asking the model to return structured data directly. Here is how to pull line items from a receipt using Claude:
This returns a dictionary you can feed directly into a database or spreadsheet. No regex parsing, no post-processing. The model understands that “2x” means quantity 2, that “DISC” means a discount, and that the number at the bottom is the total.
## Using Gemini for Large Documents
Google’s Gemini models accept up to 3,600 image pages in a single request, making them the best option for multi-page document extraction:
For PDFs specifically, Gemini handles them natively – no need to split into individual page images first.
## Common Errors and Fixes
### `openai.BadRequestError: Invalid image`
This usually means the base64 string is malformed or the image format is not supported. GPT-4o accepts PNG, JPEG, GIF, and WebP. Check your encoding:
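One way to catch both problems before wasting an API call — the `detect_media_type` helper below is an illustrative sketch that checks the file's magic bytes against the four accepted formats:

```python
import base64

# Magic bytes for the four formats GPT-4o accepts
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"RIFF": "image/webp",  # a strict check also verifies bytes 8-11 == b"WEBP"
}

def detect_media_type(raw: bytes) -> str:
    """Identify the image format from its leading magic bytes."""
    for magic, mime in SIGNATURES.items():
        if raw.startswith(magic):
            return mime
    raise ValueError("Unsupported format: send PNG, JPEG, GIF, or WebP")

def encode_image(path: str) -> tuple[str, str]:
    """Validate the format first, then return (media_type, base64 string)."""
    with open(path, "rb") as f:
        raw = f.read()
    media_type = detect_media_type(raw)  # fails here, not at the API
    return media_type, base64.b64encode(raw).decode("utf-8")
```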
### `anthropic.BadRequestError: Could not process image`
Claude has a max image size of 5MB per image. Resize before sending:
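A sketch using Pillow that re-encodes the image as JPEG and halves its dimensions until the payload fits — the quality setting and halving strategy are just one reasonable approach:

```python
import io
from PIL import Image  # pip install Pillow

MAX_BYTES = 5 * 1024 * 1024  # Claude's per-image limit

def shrink_to_limit(path: str, max_bytes: int = MAX_BYTES) -> bytes:
    """Re-encode as JPEG, halving dimensions until under the size limit."""
    img = Image.open(path).convert("RGB")
    while True:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=85)
        data = buf.getvalue()
        # Stop shrinking below 64px wide; text is unreadable by then anyway
        if len(data) <= max_bytes or img.width < 64:
            return data
        img = img.resize((img.width // 2, img.height // 2))
```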
### Hallucinated text in low-quality images
Vision LLMs sometimes invent text that is not in the image, especially for blurry or low-resolution photos. Always set temperature to 0 for extraction tasks, and add “If you cannot read a word, write [illegible] instead of guessing” to your prompt.
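Both mitigations as a reusable snippet (the constant name is arbitrary):

```python
# Prompt that discourages guessing on unreadable regions
ILLEGIBLE_PROMPT = (
    "Extract all text from this image. "
    "If you cannot read a word, write [illegible] instead of guessing."
)
# In the API call itself, pass temperature=0 so the model sticks to the
# most likely reading instead of sampling a plausible-sounding one.
```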
### Rate limiting on large batches
If you are processing hundreds of images, you will hit API rate limits. Use asyncio with a semaphore to throttle concurrent requests:
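A sketch of the pattern — `process_image` here is a stand-in for your real vision-LLM call (e.g. an `AsyncOpenAI` request), and the concurrency limit of 5 is an arbitrary starting point to tune against your rate tier:

```python
import asyncio

async def process_image(path: str) -> str:
    """Stand-in for a real async vision-LLM call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"text from {path}"

async def run_batch(paths: list[str], max_concurrent: int = 5) -> list[str]:
    """Process every image, but never more than max_concurrent at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def throttled(path: str) -> str:
        async with sem:  # waits while max_concurrent calls are in flight
            return await process_image(path)

    return await asyncio.gather(*(throttled(p) for p in paths))

results = asyncio.run(run_batch([f"img_{i}.png" for i in range(20)]))
```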
## Which Model to Pick
| Model | Best For | Cost per Image | Speed |
|---|---|---|---|
| GPT-4o | General-purpose, highest accuracy on printed text | ~$0.01-0.03 | 2-4s |
| GPT-4o-mini | High-volume, budget-conscious batches | ~$0.002-0.005 | 1-2s |
| Claude Sonnet | Complex layouts, forms, structured extraction | ~$0.01-0.03 | 2-4s |
| Gemini 2.5 Pro | Multi-page PDFs, large documents | ~$0.01-0.02 | 2-5s |
| Tesseract | Clean scans, offline processing, zero cost | Free | <0.1s |
For most use cases, start with GPT-4o-mini. It handles 90% of OCR tasks at a fraction of the cost. Upgrade to GPT-4o or Claude Sonnet when you need better accuracy on handwriting, complex layouts, or structured JSON output. Use Gemini when you are dealing with multi-page PDFs and want native support without page splitting.
## Related Guides
- How to Build a Document Comparison Pipeline with Vision Models
- How to Build a Scene Text Recognition Pipeline with PaddleOCR
- How to Detect Anomalies in Images with Vision Models
- How to Classify Images with Vision Transformers in PyTorch
- How to Build a Receipt Scanner with OCR and Structured Extraction
- How to Build an OCR Pipeline with PaddleOCR and Tesseract
- How to Classify Documents with Vision Models and DiT
- How to Segment Images with SAM 2 in Python
- How to Detect Objects in Images with YOLOv8
- How to Upscale and Enhance Images with AI Super Resolution