The Quick Version

LayoutLM models understand documents the way humans do — they look at what the text says, where it sits on the page, and what the page looks like visually. This makes them far better at extracting data from invoices, receipts, forms, and reports than plain OCR followed by regex.

pip install transformers torch torchvision pillow pytesseract
# Also install Tesseract OCR: sudo apt install tesseract-ocr
from transformers import AutoProcessor, AutoModelForTokenClassification
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
model = AutoModelForTokenClassification.from_pretrained(
    "nielsr/layoutlmv3-finetuned-funsd"  # pre-trained on form understanding
)

image = Image.open("invoice.png").convert("RGB")
encoding = processor(image, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**encoding)

predictions = outputs.logits.argmax(-1).squeeze().tolist()
tokens = processor.tokenizer.convert_ids_to_tokens(encoding["input_ids"].squeeze().tolist())

# Map predictions to labels
labels = model.config.id2label
for token, pred in zip(tokens, predictions):
    if token not in ["<s>", "</s>", "<pad>"]:
        label = labels[pred]
        if label != "O":  # skip non-entity tokens
            print(f"{token:20s}{label}")

This runs OCR automatically (via Tesseract), combines text with spatial layout information, and classifies each token as a field type like HEADER, QUESTION, or ANSWER.

Understanding the Pipeline

LayoutLMv3 processes three types of input simultaneously:

  1. Text tokens — what the words say (from OCR or a text PDF)
  2. Bounding boxes — where each word sits on the page (corner coordinates: x0, y0, x1, y1)
  3. Page image — a visual representation of the document

The model learns that “Total” near the bottom-right of an invoice usually precedes a dollar amount, even if the exact format varies across documents.
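The bounding-box convention matters: LayoutLM expects corner coordinates normalized to a 0-1000 range, independent of the page's pixel size. A minimal sketch of the convention (the page dimensions and pixel values here are illustrative):

```python
# One box per word: [x0, y0, x1, y1] corners, each scaled to 0-1000
# so positions are comparable across page sizes and resolutions.
page_width, page_height = 1700, 2200  # e.g. a 200 DPI US-letter scan

pixel_box = (1210, 1980, 1340, 2015)  # raw pixel corners for the word "Total"

normalized = [
    int(pixel_box[0] / page_width * 1000),   # x0
    int(pixel_box[1] / page_height * 1000),  # y0
    int(pixel_box[2] / page_width * 1000),   # x1
    int(pixel_box[3] / page_height * 1000),  # y1
]
print(normalized)  # [711, 900, 788, 915]
```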

When to Use OCR vs. Text PDFs

If your documents are scanned images, you need OCR. Set apply_ocr=True in the processor and it runs Tesseract automatically.

For digital PDFs where text is already selectable, extract text and bounding boxes directly with pdfplumber — it’s faster and more accurate than OCR:

import pdfplumber

def extract_words_from_pdf(pdf_path: str) -> tuple[list[str], list[list[int]]]:
    """Extract words and their bounding boxes from a PDF."""
    words = []
    boxes = []

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        width, height = int(page.width), int(page.height)

        for word_info in page.extract_words():
            words.append(word_info["text"])
            # Normalize to 0-1000 range (LayoutLM convention)
            box = [
                int(word_info["x0"] / width * 1000),
                int(word_info["top"] / height * 1000),
                int(word_info["x1"] / width * 1000),
                int(word_info["bottom"] / height * 1000),
            ]
            boxes.append(box)

    return words, boxes

Fine-Tuning on Custom Documents

The pre-trained FUNSD model handles generic forms. For your specific document types (invoices, medical records, contracts), fine-tuning dramatically improves accuracy.

from transformers import AutoProcessor, AutoModelForTokenClassification, TrainingArguments, Trainer
from datasets import Dataset
from PIL import Image
import json

# Your annotated data: list of {image_path, words, boxes, labels}
with open("annotations.json") as f:
    data = json.load(f)

# Define your label set
label_list = ["O", "B-VENDOR", "I-VENDOR", "B-DATE", "I-DATE",
              "B-TOTAL", "I-TOTAL", "B-ITEM", "I-ITEM", "B-AMOUNT", "I-AMOUNT"]
label2id = {l: i for i, l in enumerate(label_list)}
id2label = {i: l for l, i in label2id.items()}

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)

def preprocess(example):
    image = Image.open(example["image_path"]).convert("RGB")
    encoding = processor(
        image,
        text=example["words"],
        boxes=example["boxes"],
        word_labels=example["labels"],
        truncation=True,
        max_length=512,
        padding="max_length",
        return_tensors="pt",
    )
    return {k: v.squeeze() for k, v in encoding.items()}

dataset = Dataset.from_list(data)
dataset = dataset.map(preprocess, remove_columns=dataset.column_names)  # drop raw columns the Trainer can't batch
split = dataset.train_test_split(test_size=0.2)

training_args = TrainingArguments(
    output_dir="./layoutlm-invoices",
    num_train_epochs=20,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
)
trainer.train()

How Much Training Data Do You Need?

LayoutLMv3 fine-tunes well with surprisingly little data. For a single document type (like a specific vendor’s invoices), 50-100 annotated examples can reach 90%+ accuracy. For diverse document layouts, aim for 200-500 examples.

Use Label Studio for annotation — it has a built-in OCR template that lets you draw bounding boxes and assign labels visually.
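Whatever tool you annotate with, the fine-tuning script above expects records of {image_path, words, boxes, labels} with integer label ids. A hedged sketch of the conversion step, assuming you've already exported per-word BIO tags as strings (the field names here are illustrative, not a specific Label Studio export format):

```python
# Label set must match the one the model is configured with.
label_list = ["O", "B-VENDOR", "I-VENDOR", "B-DATE", "I-DATE",
              "B-TOTAL", "I-TOTAL", "B-ITEM", "I-ITEM", "B-AMOUNT", "I-AMOUNT"]
label2id = {l: i for i, l in enumerate(label_list)}

def to_training_record(image_path, words, boxes, tags):
    """Convert string BIO tags to the integer ids the model trains on."""
    return {
        "image_path": image_path,
        "words": words,
        "boxes": boxes,                          # already 0-1000 normalized
        "labels": [label2id[t] for t in tags],   # e.g. "B-VENDOR" -> 1
    }
```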

Extracting Key-Value Pairs

Once you have token-level predictions, group them into meaningful key-value pairs:

def extract_entities(tokens: list[str], predictions: list[int], id2label: dict) -> dict:
    """Group BIO-tagged tokens into entities."""
    entities = {}
    current_entity = None
    current_tokens = []

    def flush():
        nonlocal current_entity, current_tokens
        if current_entity:
            # Re-join subword pieces: LayoutLMv3's BPE tokenizer marks word
            # starts with "Ġ"; WordPiece models mark continuations with "##"
            text = "".join(current_tokens).replace("Ġ", " ").replace("##", "").strip()
            entities[current_entity] = text
        current_entity = None
        current_tokens = []

    for token, pred_id in zip(tokens, predictions):
        label = id2label[pred_id]

        if label.startswith("B-"):
            flush()  # save the previous entity before starting a new one
            current_entity = label[2:]
            current_tokens = [token]
        elif label.startswith("I-") and current_entity:
            current_tokens.append(token)
        else:
            flush()

    flush()  # don't drop an entity that runs to the end of the sequence
    return entities

# Example output:
# {"VENDOR": "Acme Corp", "DATE": "2026-01-15", "TOTAL": "$1,234.56"}

Common Errors and Fixes

TesseractNotFoundError: tesseract is not installed

Install the system package, not just the Python wrapper:

# Ubuntu/Debian
sudo apt install tesseract-ocr

# macOS
brew install tesseract

Poor OCR quality on low-resolution scans

Upscale the image before processing. LayoutLM works best with images at 150-300 DPI:

from PIL import Image
image = Image.open("scan.png")
image = image.resize((image.width * 2, image.height * 2), Image.LANCZOS)

Token limit exceeded for long documents

LayoutLMv3 has a 512-token limit. For multi-page documents, process each page separately and merge results. Or use sliding window with overlap on dense single pages.
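A sliding-window sketch for a dense single page, assuming you already have word-level OCR output. Window sizes here are in words, not tokens, so leave headroom under 512 — the tokenizer will split some words into multiple subtokens:

```python
def sliding_windows(words, boxes, window=300, stride=200):
    """Split a long word sequence into overlapping chunks, each small
    enough to stay under the model's token limit after tokenization."""
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + window, len(words))
        chunks.append((words[start:end], boxes[start:end]))
        if end == len(words):
            break
        start += stride
    return chunks
```

Run the model on each chunk separately, then merge predictions; for words covered by two windows, prefer the prediction from the window where the word sits farther from the edge.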

Bounding box coordinates are wrong

LayoutLM expects boxes normalized to a 0-1000 range. If you’re providing raw pixel coordinates, scale them: box = [int(x / page_width * 1000) for x in raw_box].

Low accuracy after fine-tuning

Check your label alignment. The processor’s tokenizer may split words into subword tokens, and each subtoken needs the correct label. Pass the word_labels argument (not labels) so the processor handles alignment automatically.
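Under the hood, alignment looks roughly like this: only the first subtoken of each word keeps the word's label, and every other position gets -100 so the loss ignores it. A simplified sketch of what the processor does, using the word_ids mapping that fast tokenizers provide:

```python
def align_labels(word_labels, word_ids):
    """word_ids maps each subtoken to its source word index
    (None for special tokens like <s> and </s>)."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)              # special token: ignored by the loss
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subtoken keeps the word's label
        else:
            aligned.append(-100)              # continuation subtokens are masked
        previous = wid
    return aligned

# "Acme" -> 1 subtoken, "Corporation" -> 2 subtokens
print(align_labels([1, 2], [None, 0, 1, 1, None]))  # [-100, 1, 2, -100, -100]
```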

LayoutLM vs. Vision LLMs

GPT-4V and Claude can also extract data from document images, but they’re slower and more expensive per page. LayoutLM wins when you have a consistent document format and need to process thousands of pages cheaply. Vision LLMs win when document formats vary wildly and you can’t invest in fine-tuning.

For high-volume production (10K+ pages/day), fine-tuned LayoutLM is the right choice. For ad-hoc extraction or prototyping, send it to GPT-4V with a structured output schema.