## The One-Liner Answer
Send a base64-encoded image to any vision LLM with the prompt “Extract all text from this image” and you get back clean, structured text. No Tesseract config files, no preprocessing, no layout analysis pipeline. Vision LLMs handle skewed photos, handwriting, multi-column layouts, and tables in a single API call.
That works for receipts, screenshots, whiteboards, handwritten notes, scanned PDFs – basically anything a human can read.
## Why Vision LLMs Beat Traditional OCR
Tesseract and PaddleOCR are great when you have clean, high-resolution scans of printed text. But they fall apart fast in the real world:
- Skewed or rotated text requires preprocessing with OpenCV before Tesseract even looks at it
- Multi-column layouts produce garbled output without external layout detection
- Handwriting is effectively unusable with Tesseract (it was designed for printed text)
- Tables lose their structure entirely – you get a flat string with no column alignment
Vision LLMs skip all of this. They see the image the way you do: understanding that a heading relates to the paragraph below it, that columns are separate, and that a label belongs to a specific form field. GPT-4o, Claude Sonnet, and Gemini 2.5 Pro all score above 95% character accuracy on standard OCR benchmarks, and they handle messy real-world documents that Tesseract chokes on.
The tradeoff is cost and speed. Tesseract processes pages in milliseconds for free. Vision LLMs take 2-5 seconds per image and cost a few cents per call. For batch processing millions of clean documents, stick with Tesseract. For everything else, vision LLMs win.
## Extracting Structured Data with Claude
Raw text extraction is useful, but the real power is asking the model to return structured data directly. Here is how to pull line items from a receipt using Claude:
This returns a dictionary you can feed directly into a database or spreadsheet. No regex parsing, no post-processing. The model understands that “2x” means quantity 2, that “DISC” means a discount, and that the number at the bottom is the total.
## Using Gemini for Large Documents
Google’s Gemini models accept up to 3,600 image pages in a single request, making them the best option for multi-page document extraction:
For PDFs specifically, Gemini handles them natively – no need to split into individual page images first.
## Common Errors and Fixes
### `openai.BadRequestError: Invalid image`
This usually means the base64 string is malformed or the image format is not supported. GPT-4o accepts PNG, JPEG, GIF, and WebP. Check your encoding:
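One way to catch both problems before wasting an API call — the `detect_media_type` helper below is an illustrative sketch that checks the file's magic bytes against the four accepted formats:

```python
import base64

# Magic bytes for the four formats GPT-4o accepts
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"RIFF": "image/webp",  # a strict check also verifies bytes 8-11 == b"WEBP"
}

def detect_media_type(raw: bytes) -> str:
    """Identify the image format from its leading magic bytes."""
    for magic, mime in SIGNATURES.items():
        if raw.startswith(magic):
            return mime
    raise ValueError("Unsupported format: send PNG, JPEG, GIF, or WebP")

def encode_image(path: str) -> tuple[str, str]:
    """Validate the format first, then return (media_type, base64 string)."""
    with open(path, "rb") as f:
        raw = f.read()
    media_type = detect_media_type(raw)  # fails here, not at the API
    return media_type, base64.b64encode(raw).decode("utf-8")
```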
### `anthropic.BadRequestError: Could not process image`
Claude has a max image size of 5MB per image. Resize before sending:
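A sketch using Pillow that re-encodes the image as JPEG and halves its dimensions until the payload fits — the quality setting and halving strategy are just one reasonable approach:

```python
import io
from PIL import Image  # pip install Pillow

MAX_BYTES = 5 * 1024 * 1024  # Claude's per-image limit

def shrink_to_limit(path: str, max_bytes: int = MAX_BYTES) -> bytes:
    """Re-encode as JPEG, halving dimensions until under the size limit."""
    img = Image.open(path).convert("RGB")
    while True:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=85)
        data = buf.getvalue()
        # Stop shrinking below 64px wide; text is unreadable by then anyway
        if len(data) <= max_bytes or img.width < 64:
            return data
        img = img.resize((img.width // 2, img.height // 2))
```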
### Hallucinated text in low-quality images
Vision LLMs sometimes invent text that is not in the image, especially for blurry or low-resolution photos. Always set temperature to 0 for extraction tasks, and add “If you cannot read a word, write [illegible] instead of guessing” to your prompt.
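Both mitigations as a reusable snippet (the constant name is arbitrary):

```python
# Prompt that discourages guessing on unreadable regions
ILLEGIBLE_PROMPT = (
    "Extract all text from this image. "
    "If you cannot read a word, write [illegible] instead of guessing."
)
# In the API call itself, pass temperature=0 so the model sticks to the
# most likely reading instead of sampling a plausible-sounding one.
```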
### Rate limiting on large batches
If you are processing hundreds of images, you will hit API rate limits. Use asyncio with a semaphore to throttle concurrent requests:
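A sketch of the pattern — `process_image` here is a stand-in for your real vision-LLM call (e.g. an `AsyncOpenAI` request), and the concurrency limit of 5 is an arbitrary starting point to tune against your rate tier:

```python
import asyncio

async def process_image(path: str) -> str:
    """Stand-in for a real async vision-LLM call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"text from {path}"

async def run_batch(paths: list[str], max_concurrent: int = 5) -> list[str]:
    """Process every image, but never more than max_concurrent at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def throttled(path: str) -> str:
        async with sem:  # waits while max_concurrent calls are in flight
            return await process_image(path)

    return await asyncio.gather(*(throttled(p) for p in paths))

results = asyncio.run(run_batch([f"img_{i}.png" for i in range(20)]))
```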
## Which Model to Pick
| Model | Best For | Cost per Image | Speed |
|---|---|---|---|
| GPT-4o | General-purpose, highest accuracy on printed text | ~$0.01-0.03 | 2-4s |
| GPT-4o-mini | High-volume, budget-conscious batches | ~$0.002-0.005 | 1-2s |
| Claude Sonnet | Complex layouts, forms, structured extraction | ~$0.01-0.03 | 2-4s |
| Gemini 2.5 Pro | Multi-page PDFs, large documents | ~$0.01-0.02 | 2-5s |
| Tesseract | Clean scans, offline processing, zero cost | Free | <0.1s |
For most use cases, start with GPT-4o-mini. It handles 90% of OCR tasks at a fraction of the cost. Upgrade to GPT-4o or Claude Sonnet when you need better accuracy on handwriting, complex layouts, or structured JSON output. Use Gemini when you are dealing with multi-page PDFs and want native support without page splitting.
## Related Guides
- How to Build a Document Comparison Pipeline with Vision Models
- How to Build a Scene Text Recognition Pipeline with PaddleOCR
- How to Detect Anomalies in Images with Vision Models
- How to Classify Images with Vision Transformers in PyTorch
- How to Build a Receipt Scanner with OCR and Structured Extraction
- How to Build an OCR Pipeline with PaddleOCR and Tesseract
- How to Classify Documents with Vision Models and DiT
- How to Segment Images with SAM 2 in Python
- How to Detect Objects in Images with YOLOv8
- How to Upscale and Enhance Images with AI Super Resolution