Most PDF extraction code is a mess of regex and prayer. You parse the text, hope the layout is consistent, write 40 rules for 40 document formats, and still miss edge cases. LLMs are genuinely better at this. They handle layout variation, interpret context, and map messy text to clean schemas without brittle pattern matching.
The best approach right now: extract raw text with pdfplumber, define your output shape with Pydantic, and use the instructor library to get validated, typed responses from OpenAI or Anthropic models. Here is the full pipeline.
Quick Setup
Install the three core libraries:
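A minimal install, with package names as published on PyPI (instructor needs the OpenAI SDK alongside it):

```shell
pip install pdfplumber pydantic instructor openai
```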
If you want to use Anthropic instead of OpenAI:
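Add the Anthropic SDK:

```shell
pip install anthropic
```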
For scanned PDFs where text extraction fails (image-based documents), you will also want:
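PyMuPDF for rendering pages to images, plus pytesseract and Pillow for OCR. Note that pytesseract is only a wrapper; the Tesseract binary itself must be installed separately (e.g. `apt-get install tesseract-ocr` on Debian/Ubuntu):

```shell
pip install PyMuPDF pytesseract Pillow
```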
End-to-End Invoice Extraction
This is the complete pipeline. It reads a PDF, extracts text, sends it to an LLM with a strict Pydantic schema, and returns validated JSON.
That is it. The response_model=Invoice parameter tells instructor to constrain the LLM output to your exact schema. If the model returns something that does not validate against your Pydantic model, instructor automatically retries with the validation error in the prompt. No parsing, no regex, no post-processing.
Why pdfplumber Over PyMuPDF
Both libraries extract text from PDFs. I recommend pdfplumber for structured document extraction because it preserves layout information better. It understands columns, tables, and spatial positioning – which matters when your invoice has a table of line items.
PyMuPDF (fitz) is faster and handles scanned PDFs better when combined with OCR. Use it when you need raw speed or when pdfplumber returns empty text (usually means the PDF is image-based).
If both return empty strings, you are dealing with a scanned image PDF and need OCR:
Using Anthropic Instead of OpenAI
Instructor works with multiple providers. Switching to Claude takes two lines:
Claude tends to be more careful about not hallucinating values it cannot find in the text. GPT-4o is faster and cheaper for high-volume extraction. Pick based on your accuracy vs. cost tradeoff.
Handling Multi-Page Documents
Long documents can exceed context windows or produce worse results because the model loses track of details buried in pages of text. Process them in chunks and merge.
This two-pass approach works well. The first pass extracts line items page by page (where the model only needs to focus on a few rows at a time), and the second pass grabs vendor info, dates, and totals from the header/footer.
Validating Outputs with Pydantic
Pydantic does the heavy lifting for validation. Add constraints directly to your schema:
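For example, field constraints plus a cross-field check that the totals add up (the rounding tolerance is an assumption to absorb floating-point noise):

```python
from pydantic import BaseModel, Field, model_validator


class Invoice(BaseModel):
    vendor_name: str = Field(min_length=1)
    subtotal: float = Field(ge=0)
    tax: float = Field(ge=0)
    total: float = Field(ge=0)

    @model_validator(mode="after")
    def total_matches(self) -> "Invoice":
        expected = round(self.subtotal + self.tax, 2)
        if abs(self.total - expected) > 0.01:
            raise ValueError(
                f"Total {self.total} does not match subtotal + tax ({expected})"
            )
        return self
```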
When this validator fails, instructor sends the error message back to the LLM and asks it to fix its answer. The model sees "Total 150.0 does not match subtotal + tax (145.50)" and corrects itself. You get self-healing extraction for free.
Control the retry behavior:
Common Errors and Fixes
pdfplumber's extract_text() returns empty text
This means the PDF contains scanned images instead of selectable text. Fall back to OCR:
Fix: use PyMuPDF with pytesseract as shown in the OCR section above. Check if text extraction returns content before sending to the LLM – sending empty strings wastes API calls and returns hallucinated data.
instructor.exceptions.InstructorRetryException: max retries reached
The LLM failed validation on every attempt. This usually means either your schema is too strict for the data, or the PDF text is too garbled for the model to extract reliably.
Fix: increase max_retries to 5, loosen validators that are too aggressive, or improve the text extraction step. Sometimes running the raw text through a quick cleanup prompt first helps.
openai.BadRequestError: maximum context length exceeded
Your PDF text is too long for the model’s context window.
Fix: use the multi-page chunking approach from the section above, or truncate/summarize pages before sending. You can also count tokens beforehand with tiktoken:
ValidationError: 1 validation error for Invoice
Pydantic rejected the LLM’s output. The error message tells you exactly what field failed:
Fix: instructor handles most of these automatically by retrying. If it persists, make the field type more flexible (use str instead of date and parse it yourself), or add a @field_validator that handles multiple date formats.
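For instance, a `@field_validator` that tries several date formats before giving up (the format list is an assumption; extend it for your documents):

```python
from datetime import date, datetime

from pydantic import BaseModel, field_validator


class Invoice(BaseModel):
    invoice_date: date

    @field_validator("invoice_date", mode="before")
    @classmethod
    def parse_date(cls, v):
        if isinstance(v, date):
            return v
        for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y"):
            try:
                return datetime.strptime(v, fmt).date()
            except ValueError:
                continue
        raise ValueError(f"Unrecognized date format: {v!r}")
```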
anthropic.BadRequestError: messages: text content blocks must be non-empty
You sent an empty string as the message content to Anthropic.
Fix: always check that your extracted text is non-empty before making the API call:
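A small guard to fail fast instead of sending an empty message (helper name is mine):

```python
def require_text(text: str, source: str) -> str:
    """Raise before wasting an API call on an empty extraction."""
    if not text.strip():
        raise ValueError(f"{source}: extracted text is empty")
    return text
```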
Batch Processing a Folder of PDFs
Real pipelines process hundreds of documents. Here is a pattern with error handling and progress tracking:
Add tenacity for rate limit handling if you are processing thousands of files against the OpenAI API. Instructor already uses tenacity internally, but you may want backoff on the outer loop too.
When to Skip the LLM
Not every PDF needs a language model. If your documents are highly consistent (same vendor, same template every time), a direct pdfplumber table extraction is faster and cheaper:
Use the LLM approach when you have variable layouts, multiple vendors, or documents where the structure changes between files. That is where pattern matching breaks down and the model’s flexibility pays for itself.
Related Guides
- How to Build a Text-to-SQL Pipeline with LLMs
- How to Summarize Long Documents with LLMs and Map-Reduce
- How to Classify Text with Zero-Shot and Few-Shot LLMs
- How to Build a Text Correction and Grammar Checking Pipeline
- How to Build a Text Style Transfer Pipeline with Transformers
- How to Build a Named Entity Recognition Pipeline with spaCy and Transformers
- How to Build a Spell Checking and Autocorrect Pipeline with Python
- How to Build a Sentiment Analysis API with Transformers and FastAPI
- How to Build a Text Classification Pipeline with SetFit
- How to Build a Text Similarity API with Cross-Encoders