The Quick Version
LayoutLM models understand documents the way humans do — they look at what the text says, where it sits on the page, and what the page looks like visually. This makes them far better at extracting data from invoices, receipts, forms, and reports than plain OCR followed by regex.
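A minimal sketch of that pipeline, assuming `transformers`, `torch`, `Pillow`, and `pytesseract` are installed and the Tesseract binary is on your PATH. The default checkpoint and image path are placeholders; for meaningful FUNSD-style labels, pass a checkpoint that has been fine-tuned for token classification rather than the base model:

```python
def classify_tokens(image_path, checkpoint="microsoft/layoutlmv3-base"):
    """Run OCR plus layout-aware token classification on one page image."""
    import torch
    from PIL import Image
    from transformers import AutoModelForTokenClassification, AutoProcessor

    # apply_ocr=True tells the processor to run Tesseract on the image,
    # producing both the words and their bounding boxes for us.
    processor = AutoProcessor.from_pretrained(checkpoint, apply_ocr=True)
    model = AutoModelForTokenClassification.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    encoding = processor(image, return_tensors="pt", truncation=True)

    with torch.no_grad():
        logits = model(**encoding).logits

    predictions = logits.argmax(-1).squeeze().tolist()
    tokens = processor.tokenizer.convert_ids_to_tokens(
        encoding["input_ids"].squeeze().tolist()
    )
    # Pair each subword token with its predicted label (e.g. B-QUESTION).
    return [(tok, model.config.id2label[p]) for tok, p in zip(tokens, predictions)]
```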
This runs OCR automatically (via Tesseract), combines text with spatial layout information, and classifies each token as a field type like HEADER, QUESTION, or ANSWER.
Understanding the Pipeline
LayoutLMv3 processes three types of input simultaneously:
- Text tokens — what the words say (from OCR or a text PDF)
- Bounding boxes — where each word sits on the page (x0, y0, x1, y1 corner coordinates)
- Page image — a visual representation of the document
The model learns that “Total” near the bottom-right of an invoice usually precedes a dollar amount, even if the exact format varies across documents.
When to Use OCR vs. Text PDFs
If your documents are scanned images, you need OCR. Set apply_ocr=True in the processor and it runs Tesseract automatically.
For digital PDFs where text is already selectable, extract text and bounding boxes directly with pdfplumber — it’s faster and more accurate than OCR:
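A sketch using pdfplumber's `extract_words()` (its `x0`/`x1`/`top`/`bottom` keys are the library's real output); the scaling step converts pixel-space coordinates into the 0-1000 range LayoutLM expects:

```python
def words_and_boxes(pdf_path, page_number=0):
    """Extract words and 0-1000-normalized boxes from a text PDF page."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        words, boxes = [], []
        for w in page.extract_words():
            words.append(w["text"])
            # x scales by page width, y by page height.
            boxes.append([
                int(1000 * w["x0"] / page.width),
                int(1000 * w["top"] / page.height),
                int(1000 * w["x1"] / page.width),
                int(1000 * w["bottom"] / page.height),
            ])
    return words, boxes
```

Feed the result to the processor with `apply_ocr=False`, e.g. `processor(image, words, boxes=boxes, return_tensors="pt")`, so Tesseract never runs.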
Fine-Tuning on Custom Documents
The pre-trained FUNSD model handles generic forms. For your specific document types (invoices, medical records, contracts), fine-tuning dramatically improves accuracy.
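A fine-tuning sketch with the Trainer API. The dataset schema (`image`, `words`, `boxes`, and `word_labels` columns), the output path, and the hyperparameters are assumptions to adapt, not a recipe:

```python
def fine_tune(dataset, label_list, checkpoint="microsoft/layoutlmv3-base"):
    """Fine-tune LayoutLMv3 for token classification on an annotated dataset.

    `dataset` is assumed to be a datasets.Dataset whose rows hold a page
    'image', its 'words', their 'boxes' (0-1000), and per-word label ids
    in 'word_labels'.
    """
    from transformers import (AutoModelForTokenClassification, AutoProcessor,
                              Trainer, TrainingArguments)

    processor = AutoProcessor.from_pretrained(checkpoint, apply_ocr=False)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(label_list),
        id2label=dict(enumerate(label_list)),
        label2id={label: i for i, label in enumerate(label_list)},
    )

    def encode(example):
        # word_labels (not labels!) lets the processor spread each word's
        # label across the subword tokens it gets split into.
        return processor(
            example["image"], example["words"],
            boxes=example["boxes"], word_labels=example["word_labels"],
            truncation=True, padding="max_length",
        )

    encoded = dataset.map(encode, remove_columns=dataset.column_names)

    args = TrainingArguments(
        output_dir="layoutlmv3-custom",   # placeholder path
        per_device_train_batch_size=2,
        learning_rate=1e-5,
        num_train_epochs=10,
    )
    Trainer(model=model, args=args, train_dataset=encoded).train()
    return model, processor
```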
How Much Training Data Do You Need?
LayoutLMv3 fine-tunes well with surprisingly little data. For a single document type (like a specific vendor’s invoices), 50-100 annotated examples get you to 90%+ accuracy. For diverse document layouts, aim for 200-500 examples.
Use Label Studio for annotation — it has a built-in OCR template that lets you draw bounding boxes and assign labels visually.
Extracting Key-Value Pairs
Once you have token-level predictions, group them into meaningful key-value pairs:
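One simple heuristic, sketched below, assumes BIO-style QUESTION/ANSWER labels and that each answer follows its question in reading order; documents with multi-column layouts may need spatial matching (e.g. nearest bounding box) instead:

```python
def group_key_values(tokens):
    """Group (word, label) predictions into (question, answer) pairs.

    `tokens` is a list of (word, label) tuples with labels like
    B-QUESTION / I-QUESTION / B-ANSWER / I-ANSWER / O.
    """
    pairs, key, value = [], [], []
    for word, label in tokens:
        if label.endswith("QUESTION"):
            if value:
                # A new question closes out the previous key-value pair.
                pairs.append((" ".join(key), " ".join(value)))
                key, value = [], []
            elif label.startswith("B-") and key:
                key = []  # unanswered question: start over
            key.append(word)
        elif label.endswith("ANSWER"):
            value.append(word)
        # O and HEADER tokens are ignored here.
    if key and value:
        pairs.append((" ".join(key), " ".join(value)))
    return pairs
```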
Common Errors and Fixes
TesseractNotFoundError: tesseract is not installed
Install the system package, not just the Python wrapper:
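Tesseract itself is a system binary; `pip install pytesseract` only installs the wrapper. Typical install commands (adjust for your platform):

```shell
# Debian/Ubuntu
sudo apt-get update && sudo apt-get install -y tesseract-ocr

# macOS (Homebrew)
brew install tesseract

# Then the Python wrapper
pip install pytesseract
```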
Poor OCR quality on low-resolution scans
Upscale the image before processing. LayoutLM works best with images at 150-300 DPI:
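A Pillow helper along these lines works; the 1800px default is a rough stand-in for roughly 200 DPI on a letter-size page and should be tuned for your documents:

```python
from PIL import Image

def upscale(image, min_width=1800):
    """Upscale a low-resolution scan before OCR, preserving aspect ratio."""
    if image.width >= min_width:
        return image  # already large enough
    scale = min_width / image.width
    return image.resize(
        (min_width, int(image.height * scale)),
        Image.LANCZOS,  # high-quality resampling filter
    )
```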
Token limit exceeded for long documents
LayoutLMv3 has a 512-token limit. For multi-page documents, process each page separately and merge results. Or use sliding window with overlap on dense single pages.
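A word-level sliding-window sketch. `window` counts words, not subword tokens, so it is kept well under 512 to leave headroom for subword splitting and the special tokens the processor adds (the exact numbers are assumptions to tune). Run each chunk through the model separately, then merge, preferring predictions from the window where a word sits away from the edges:

```python
def sliding_windows(words, boxes, window=400, stride=300):
    """Split a long page into overlapping (words, boxes) chunks."""
    chunks = []
    start = 0
    while start < len(words):
        chunks.append((words[start:start + window], boxes[start:start + window]))
        if start + window >= len(words):
            break  # this chunk already reaches the end of the page
        start += stride
    return chunks
```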
Bounding box coordinates are wrong
LayoutLM expects boxes normalized to a 0-1000 range. If you're providing raw pixel coordinates, scale them, dividing x values by the page width and y values by the page height: box = [int(1000 * x0 / page_width), int(1000 * y0 / page_height), int(1000 * x1 / page_width), int(1000 * y1 / page_height)]. Scaling every coordinate by the width alone distorts the y positions.
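As a small helper, the normalization can be written once and reused; note that x scales by page width and y by page height:

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box into LayoutLM's 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
```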
Low accuracy after fine-tuning
Check your label alignment. The processor’s tokenizer may split words into subword tokens, and each subtoken needs the correct label. Use word_labels parameter (not labels) so the processor handles alignment automatically.
LayoutLM vs. Vision LLMs
GPT-4V and Claude can also extract data from document images, but they’re slower and more expensive per page. LayoutLM wins when you have a consistent document format and need to process thousands of pages cheaply. Vision LLMs win when document formats vary wildly and you can’t invest in fine-tuning.
For high-volume production (10K+ pages/day), fine-tuned LayoutLM is the right choice. For ad-hoc extraction or prototyping, send it to GPT-4V with a structured output schema.
Related Guides
- How to Build an Extractive Question Answering System with Transformers
- How to Build a Text Summarization Pipeline with Sumy and Transformers
- How to Build a Text Entailment and Contradiction Detection Pipeline
- How to Implement Topic Modeling with BERTopic
- How to Build a RAG Pipeline with Hugging Face Transformers v5
- How to Build a Resume Parser with spaCy and Transformers
- How to Build an Emotion Detection Pipeline with GoEmotions and Transformers
- How to Build a Named Entity Linking Pipeline with Wikipedia and Transformers
- How to Build an Abstractive Summarization Pipeline with PEGASUS
- How to Build a Named Entity Recognition Pipeline with spaCy and Transformers