Classify a Document in 10 Lines
The Document Image Transformer (DiT) from Microsoft treats document pages as images and classifies them into types like invoice, receipt, letter, or form. No OCR required – the model learns visual layout patterns directly from the pixel data.
Here is how to classify a scanned document with a pretrained DiT model:
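A minimal sketch of that pipeline, assuming transformers, torch, and Pillow are installed (classify_page and load_rgb are helper names of my choosing):

```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

CHECKPOINT = "microsoft/dit-base-finetuned-rvlcdip"


def load_rgb(path):
    """Open a scan and force 3-channel RGB, which the model expects."""
    return Image.open(path).convert("RGB")


def classify_page(path):
    processor = AutoImageProcessor.from_pretrained(CHECKPOINT)
    model = AutoModelForImageClassification.from_pretrained(CHECKPOINT)
    inputs = processor(images=load_rgb(path), return_tensors="pt")
    logits = model(**inputs).logits
    # id2label maps the argmax index to one of the 16 RVL-CDIP class names
    return model.config.id2label[logits.argmax(-1).item()]


if __name__ == "__main__":
    print(classify_page("scan.png"))
```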
The microsoft/dit-base-finetuned-rvlcdip checkpoint is trained on RVL-CDIP, a dataset of 400,000 grayscale document images across 16 categories: letter, memo, email, file folder, form, handwritten, invoice, advertisement, budget, news article, presentation, scientific publication, questionnaire, resume, scientific report, and specification. It hits around 92% accuracy out of the box.
Why DiT Over OCR-Based Classifiers
Traditional document classification extracts text with OCR, then feeds it to a text classifier. That pipeline is brittle. OCR errors propagate, handwritten documents choke the recognizer, and you need language-specific models for every locale.
DiT skips all of that. It is a vision-only model based on the BEiT architecture. It learns from the visual structure of documents – where headers sit, how tables are laid out, the density of text blocks. This makes it language-agnostic and robust to poor scan quality.
LayoutLMv3 is the other strong option. It combines visual features with text embeddings (from an internal OCR step), so it performs better when text content matters for classification. Use DiT when you want a pure vision approach with no OCR dependency. Use LayoutLMv3 when the text content distinguishes document types that look visually similar.
Preprocessing Document Images
Document scans come in all shapes. Some are skewed, some have borders, some are 300 DPI TIFFs. The processor handles resizing and normalization, but you should clean up the input for best results.
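A small cleanup helper along those lines, using only Pillow (the function name prepare_scan and the exact steps are my own choices, not a fixed recipe):

```python
from PIL import Image, ImageOps

MIN_SIDE = 224  # DiT's training resolution; avoid feeding anything smaller


def prepare_scan(path):
    """Load a scan, fix EXIF rotation, force RGB, and upscale tiny pages."""
    img = Image.open(path)
    img = ImageOps.exif_transpose(img)  # respect scanner/camera rotation tags
    img = img.convert("RGB")            # model expects 3 channels
    if min(img.size) < MIN_SIDE:
        scale = MIN_SIDE / min(img.size)
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    return img
```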
A few tips on input quality:
- Resolution: DiT was trained on 224x224 images. The processor downscales for you, but feeding in extremely low-resolution scans (under 100 DPI) will lose structural details.
- Color mode: Always convert to RGB. The model expects 3 channels, so convert grayscale scans with PIL's convert("RGB"), which duplicates the single channel.
- Multi-page PDFs: Extract pages individually with pdf2image or PyMuPDF and classify each page separately.
Fine-Tune on Custom Document Types
The RVL-CDIP categories probably don’t match your use case. If you need to classify purchase orders vs. packing slips vs. customs declarations, you need to fine-tune.
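A fine-tuning sketch using the Trainer API. The three labels are placeholders for your own classes, and the hyperparameters are starting points, not tuned values; recent transformers versions use eval_strategy where older ones used evaluation_strategy:

```python
from transformers import (AutoImageProcessor, AutoModelForImageClassification,
                          Trainer, TrainingArguments)

LABELS = ["purchase_order", "packing_slip", "customs_declaration"]
label2id = {name: i for i, name in enumerate(LABELS)}
id2label = {i: name for i, name in enumerate(LABELS)}


def build_model():
    # ignore_mismatched_sizes=True discards the 16-way RVL-CDIP head and
    # initializes a fresh head with len(LABELS) outputs.
    return AutoModelForImageClassification.from_pretrained(
        "microsoft/dit-base-finetuned-rvlcdip",
        num_labels=len(LABELS),
        label2id=label2id,
        id2label=id2label,
        ignore_mismatched_sizes=True,
    )


def train(train_ds, eval_ds):
    args = TrainingArguments(
        output_dir="dit-custom",
        per_device_train_batch_size=16,
        learning_rate=5e-5,
        num_train_epochs=5,
        eval_strategy="epoch",
    )
    trainer = Trainer(model=build_model(), args=args,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()
    return trainer
```

The datasets should yield dicts with pixel_values (from the image processor) and integer labels matching label2id.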
A few hundred labeled examples per class is usually enough to get strong accuracy when fine-tuning from the RVL-CDIP checkpoint, especially if your document types are visually similar to the original 16 categories. If your domain is very different (medical forms, architectural plans), expect to need more data. If you have fewer than 50 samples per class, consider augmenting with rotations, slight skew, and brightness jitter.
Batch Processing a Directory of Scans
For production workloads, you need to process thousands of files. Here is a batch pipeline that classifies every document in a directory and writes results to a CSV:
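One way to write that pipeline, processing one image per forward pass (the helper names iter_image_files and classify_directory are my own):

```python
import csv
from pathlib import Path

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def iter_image_files(directory):
    """Yield image paths in a directory, skipping PDFs and other files."""
    for path in sorted(Path(directory).iterdir()):
        if path.suffix.lower() in IMAGE_EXTS:
            yield path


def classify_directory(directory, out_csv="results.csv"):
    checkpoint = "microsoft/dit-base-finetuned-rvlcdip"
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForImageClassification.from_pretrained(checkpoint).eval()
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "label", "confidence"])
        for path in iter_image_files(directory):
            image = Image.open(path).convert("RGB")
            inputs = processor(images=image, return_tensors="pt")
            with torch.no_grad():
                probs = model(**inputs).logits.softmax(-1)[0]
            idx = int(probs.argmax())
            writer.writerow([path.name, model.config.id2label[idx],
                             f"{probs[idx].item():.4f}"])
```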
For large batches, speed this up by batching multiple images into a single forward pass:
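A batched variant along those lines, assuming the images fit in memory 32 at a time (chunked and classify_batched are my own helper names):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

BATCH_SIZE = 32


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def classify_batched(paths):
    checkpoint = "microsoft/dit-base-finetuned-rvlcdip"
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = (AutoModelForImageClassification.from_pretrained(checkpoint)
             .to(device).eval())
    labels = []
    for batch in chunked(paths, BATCH_SIZE):
        images = [Image.open(p).convert("RGB") for p in batch]
        # the processor stacks the batch into one (N, 3, 224, 224) tensor
        inputs = processor(images=images, return_tensors="pt").to(device)
        with torch.no_grad():
            preds = model(**inputs).logits.argmax(-1).tolist()
        labels += [model.config.id2label[i] for i in preds]
    return labels
```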
This runs one forward pass per batch of 32 instead of per image, cutting wall-clock time dramatically on a GPU.
Common Errors and Fixes
RuntimeError: expected scalar type Float but found Byte
You passed raw pixel data without running it through the processor. The processor converts uint8 pixel values to normalized floats. Always use the AutoImageProcessor before feeding images to the model.
ValueError: ignore_mismatched_sizes not recognized
You are on an old version of transformers. Update with pip install --upgrade transformers. The ignore_mismatched_sizes parameter was added in v4.22.
Model predicts the same class for every input
This usually means the classification head is misconfigured. Check that num_labels matches your dataset, that the label2id/id2label mappings encode your labels in the same order your dataset does, and that you passed ignore_mismatched_sizes=True so a fresh head is initialized in place of the 16-way RVL-CDIP one. Check your training logs – if training loss is not decreasing, the head is the likely culprit.
PIL.UnidentifiedImageError: cannot identify image file
The file is corrupted or is not actually an image (sometimes PDFs get mixed in). Wrap your loading in a try/except and log the failures:
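For example (safe_open is my own helper name; it also catches OSError for truncated files):

```python
import logging

from PIL import Image, UnidentifiedImageError

log = logging.getLogger(__name__)


def safe_open(path):
    """Return an RGB image, or None if the file is not a readable image."""
    try:
        return Image.open(path).convert("RGB")
    except (UnidentifiedImageError, OSError) as err:
        log.warning("Skipping %s: %s", path, err)
        return None
```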
Out of memory on GPU
DiT-base is ~86M parameters and fits comfortably on a 4GB GPU for inference. If you are fine-tuning and hitting OOM, reduce per_device_train_batch_size to 8 or 4. If that is still too much, enable gradient checkpointing by adding gradient_checkpointing=True in your TrainingArguments.
Low accuracy on colored or glossy documents
RVL-CDIP is a grayscale dataset. If your documents are colorful (marketing materials, brochures), the pretrained model may struggle. Fine-tuning on your actual document distribution fixes this quickly.
Related Guides
- How to Classify Images with Vision Transformers in PyTorch
- How to Build a Document Comparison Pipeline with Vision Models
- How to Detect Anomalies in Images with Vision Models
- How to Build a Document Table Extraction Pipeline with Vision Models
- How to Build an Image Captioning Pipeline with BLIP and Transformers
- How to Extract Text from Images with Vision LLMs
- How to Estimate Depth from Images with Depth Anything V2
- How to Build a Visual Grounding Pipeline with Grounding DINO
- How to Build Multi-Object Tracking with DeepSORT and YOLOv8
- How to Build Video Action Recognition with SlowFast and PyTorch