Most PDF extraction code is a mess of regex and prayer. You parse the text, hope the layout is consistent, write 40 rules for 40 document formats, and still miss edge cases. LLMs are genuinely better at this. They handle layout variation, interpret context, and map messy text to clean schemas without brittle pattern matching.
The best approach right now: extract raw text with pdfplumber, define your output shape with Pydantic, and use the instructor library to get validated, typed responses from OpenAI or Anthropic models. Here is the full pipeline.
Quick Setup
Install the three core libraries:
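A minimal install, with package names as published on PyPI (instructor needs the OpenAI SDK alongside it):

```shell
pip install pdfplumber pydantic instructor openai
```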
If you want to use Anthropic instead of OpenAI:
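Add the Anthropic SDK:

```shell
pip install anthropic
```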
For scanned PDFs where text extraction fails (image-based documents), you will also want:
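PyMuPDF for rendering pages to images, plus pytesseract and Pillow for OCR. Note that pytesseract is only a wrapper; the Tesseract binary itself must be installed separately (e.g. `apt-get install tesseract-ocr` on Debian/Ubuntu):

```shell
pip install PyMuPDF pytesseract Pillow
```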
End-to-End Invoice Extraction
This is the complete pipeline. It reads a PDF, extracts text, sends it to an LLM with a strict Pydantic schema, and returns validated JSON.
That is it. The response_model=Invoice parameter tells instructor to constrain the LLM output to your exact schema. If the model returns something that does not validate against your Pydantic model, instructor automatically retries with the validation error in the prompt. No parsing, no regex, no post-processing.
Why pdfplumber Over PyMuPDF
Both libraries extract text from PDFs. I recommend pdfplumber for structured document extraction because it preserves layout information better. It understands columns, tables, and spatial positioning – which matters when your invoice has a table of line items.
PyMuPDF (fitz) is faster and handles scanned PDFs better when combined with OCR. Use it when you need raw speed or when pdfplumber returns empty text (usually means the PDF is image-based).
If both return empty strings, you are dealing with a scanned image PDF and need OCR:
Using Anthropic Instead of OpenAI
Instructor works with multiple providers. Switching to Claude takes two lines:
Claude tends to be more careful about not hallucinating values it cannot find in the text. GPT-4o is faster and cheaper for high-volume extraction. Pick based on your accuracy vs. cost tradeoff.
Handling Multi-Page Documents
Long documents can exceed context windows or produce worse results because the model loses track of details buried in pages of text. Process them in chunks and merge.
This two-pass approach works well. The first pass extracts line items page by page (where the model only needs to focus on a few rows at a time), and the second pass grabs vendor info, dates, and totals from the header/footer.
Validating Outputs with Pydantic
Pydantic does the heavy lifting for validation. Add constraints directly to your schema:
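For example, field constraints plus a cross-field check that the totals add up (the rounding tolerance is an assumption to absorb floating-point noise):

```python
from pydantic import BaseModel, Field, model_validator


class Invoice(BaseModel):
    vendor_name: str = Field(min_length=1)
    subtotal: float = Field(ge=0)
    tax: float = Field(ge=0)
    total: float = Field(ge=0)

    @model_validator(mode="after")
    def total_matches(self) -> "Invoice":
        expected = round(self.subtotal + self.tax, 2)
        if abs(self.total - expected) > 0.01:
            raise ValueError(
                f"Total {self.total} does not match subtotal + tax ({expected})"
            )
        return self
```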
When this validator fails, instructor sends the error message back to the LLM and asks it to fix its answer. The model sees "Total 150.0 does not match subtotal + tax (145.50)" and corrects itself. You get self-healing extraction for free.
Control the retry behavior:
Common Errors and Fixes
pdfplumber's extract_text() returns empty text
This means the PDF contains scanned images instead of selectable text. Fall back to OCR:
Fix: use PyMuPDF with pytesseract as shown in the OCR section above. Check if text extraction returns content before sending to the LLM – sending empty strings wastes API calls and returns hallucinated data.
instructor.exceptions.InstructorRetryException: max retries reached
The LLM failed validation on every attempt. This usually means either your schema is too strict for the data, or the PDF text is too garbled for the model to extract reliably.
Fix: increase max_retries to 5, loosen validators that are too aggressive, or improve the text extraction step. Sometimes running the raw text through a quick cleanup prompt first helps.
openai.BadRequestError: maximum context length exceeded
Your PDF text is too long for the model’s context window.
Fix: use the multi-page chunking approach from the section above, or truncate/summarize pages before sending. You can also count tokens beforehand with tiktoken:
ValidationError: 1 validation error for Invoice
Pydantic rejected the LLM’s output. The error message tells you exactly what field failed:
Fix: instructor handles most of these automatically by retrying. If it persists, make the field type more flexible (use str instead of date and parse it yourself), or add a @field_validator that handles multiple date formats.
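For instance, a `@field_validator` that tries several date formats before giving up (the format list is an assumption; extend it for your documents):

```python
from datetime import date, datetime

from pydantic import BaseModel, field_validator


class Invoice(BaseModel):
    invoice_date: date

    @field_validator("invoice_date", mode="before")
    @classmethod
    def parse_date(cls, v):
        if isinstance(v, date):
            return v
        for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y"):
            try:
                return datetime.strptime(v, fmt).date()
            except ValueError:
                continue
        raise ValueError(f"Unrecognized date format: {v!r}")
```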
anthropic.BadRequestError: messages: text content blocks must be non-empty
You sent an empty string as the message content to Anthropic.
Fix: always check that your extracted text is non-empty before making the API call:
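A small guard to fail fast instead of sending an empty message (helper name is mine):

```python
def require_text(text: str, source: str) -> str:
    """Raise before wasting an API call on an empty extraction."""
    if not text.strip():
        raise ValueError(f"{source}: extracted text is empty")
    return text
```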
Batch Processing a Folder of PDFs
Real pipelines process hundreds of documents. Here is a pattern with error handling and progress tracking:
Add tenacity for rate limit handling if you are processing thousands of files against the OpenAI API. Instructor already uses tenacity internally, but you may want backoff on the outer loop too.
When to Skip the LLM
Not every PDF needs a language model. If your documents are highly consistent (same vendor, same template every time), a direct pdfplumber table extraction is faster and cheaper:
Use the LLM approach when you have variable layouts, multiple vendors, or documents where the structure changes between files. That is where pattern matching breaks down and the model’s flexibility pays for itself.
Related Guides
- How to Build a Text-to-SQL Pipeline with LLMs
- How to Summarize Long Documents with LLMs and Map-Reduce
- How to Classify Text with Zero-Shot and Few-Shot LLMs
- How to Build a Text Correction and Grammar Checking Pipeline
- How to Build a Text Style Transfer Pipeline with Transformers
- How to Build a Named Entity Recognition Pipeline with spaCy and Transformers
- How to Build a Spell Checking and Autocorrect Pipeline with Python
- How to Build a Sentiment Analysis API with Transformers and FastAPI
- How to Build a Text Classification Pipeline with SetFit
- How to Build a Text Similarity API with Cross-Encoders