Tables trapped inside PDFs and scanned documents are one of the most annoying data engineering problems. You can see the data right there, but getting it into a DataFrame feels like pulling teeth. Microsoft’s Table Transformer fixes this. It is a DETR-based model trained on PubTables-1M that detects tables in documents and recognizes their internal structure – rows, columns, headers – all from a single image.
Here is the fastest path to a working table detection pipeline.
Detect Tables in a Document Image
Install the dependencies first:
```bash
pip install transformers torch torchvision pillow opencv-python pandas pdf2image
```
Now detect tables in a document image using the microsoft/table-transformer-detection model:
```python
import torch
from transformers import AutoImageProcessor, TableTransformerForObjectDetection
from PIL import Image

# Load detection model and processor
detection_processor = AutoImageProcessor.from_pretrained(
    "microsoft/table-transformer-detection"
)
detection_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-detection"
)

# Load your document image
image = Image.open("document_page.png").convert("RGB")

# Run inference
inputs = detection_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detection_model(**inputs)

# Post-process to get bounding boxes in pixel coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = detection_processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {detection_model.config.id2label[label.item()]} "
          f"with confidence {round(score.item(), 3)} at {box}")
```
The output gives you bounding boxes in [xmin, ymin, xmax, ymax] format. Each box wraps a detected table in the document. The threshold of 0.9 keeps only high-confidence detections – lower it to 0.7 if you are missing tables, but expect more false positives.
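Because post-processing is separate from inference, you can also re-filter the cached detections at a different cutoff without rerunning the model. A minimal sketch, where `filter_detections` is a hypothetical helper shown here on synthetic tensors rather than real model output:

```python
import torch

def filter_detections(results: dict, min_score: float) -> dict:
    """Keep only detections scoring at or above min_score (hypothetical helper)."""
    keep = results["scores"] >= min_score
    return {key: value[keep] for key, value in results.items()}

# Synthetic detections standing in for the post-processed `results` dict
results = {
    "scores": torch.tensor([0.95, 0.72]),
    "labels": torch.tensor([0, 0]),
    "boxes": torch.tensor([[10.0, 10.0, 200.0, 100.0],
                           [30.0, 150.0, 220.0, 260.0]]),
}
strict = filter_detections(results, 0.9)   # keeps only the 0.95 detection
relaxed = filter_detections(results, 0.7)  # keeps both
```

This lets you compare what 0.9 versus 0.7 keeps on the same page before committing to a threshold.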
Preprocess Documents with OpenCV
Real-world documents are messy. Scanned pages come in at odd angles, with noise, shadows, and uneven lighting. Preprocessing makes a measurable difference in detection accuracy.
```python
import cv2
import numpy as np

def preprocess_document(image_path: str) -> np.ndarray:
    """Clean up a scanned document image for table detection."""
    img = cv2.imread(image_path)

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Adaptive thresholding handles uneven lighting better than global
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, blockSize=15, C=10
    )

    # Denoise while preserving edges
    denoised = cv2.fastNlMeansDenoising(binary, h=10)

    # Deskew: find the dominant angle of the dark (text) pixels and rotate.
    # minAreaRect needs float32/int32 points, so cast before calling it.
    coords = np.column_stack(np.where(denoised < 128)).astype(np.float32)
    if len(coords) > 100:
        angle = cv2.minAreaRect(coords)[-1]
        # Normalize: older OpenCV returns angles in (-90, 0],
        # OpenCV >= 4.5 returns them in [0, 90)
        if angle < -45:
            angle += 90
        elif angle > 45:
            angle -= 90
        if abs(angle) > 0.5:
            h, w = denoised.shape
            center = (w // 2, h // 2)
            matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
            denoised = cv2.warpAffine(
                denoised, matrix, (w, h),
                flags=cv2.INTER_CUBIC,
                borderMode=cv2.BORDER_REPLICATE
            )
    return denoised
```
The key steps here: adaptive thresholding beats global thresholding on scanned docs because lighting is never uniform. The deskew step corrects rotated scans – even a 2-degree tilt can throw off structure recognition downstream. Feed the cleaned image to the Table Transformer as an RGB PIL image by converting back with Image.fromarray(cv2.cvtColor(denoised, cv2.COLOR_GRAY2RGB)).
Convert PDFs to Images
Most tables live in PDFs, not image files. Use pdf2image to convert pages:
```python
from pdf2image import convert_from_path

# Convert all pages at 300 DPI -- this resolution works well for Table Transformer
pages = convert_from_path("report.pdf", dpi=300)

for i, page in enumerate(pages):
    page.save(f"page_{i}.png", "PNG")
    print(f"Saved page {i}: {page.size[0]}x{page.size[1]} pixels")
```
Use 300 DPI. Going lower hurts detection accuracy. Going higher wastes memory without much benefit. If your PDF has 50+ pages, process them in batches to avoid running out of RAM.
Recognize Table Structure
Detecting the table is step one. Knowing which pixels are rows, columns, and headers is step two. The structure recognition model handles this:
```python
import torch
from transformers import AutoImageProcessor, TableTransformerForObjectDetection
from PIL import Image

# Load structure recognition model
structure_processor = AutoImageProcessor.from_pretrained(
    "microsoft/table-transformer-structure-recognition"
)
structure_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition"
)

def recognize_table_structure(table_image: Image.Image) -> dict:
    """Detect rows, columns, and headers within a cropped table image."""
    inputs = structure_processor(images=table_image, return_tensors="pt")
    with torch.no_grad():
        outputs = structure_model(**inputs)

    target_sizes = torch.tensor([table_image.size[::-1]])
    results = structure_processor.post_process_object_detection(
        outputs, threshold=0.6, target_sizes=target_sizes
    )[0]

    # Group detections by type
    structure = {"rows": [], "columns": [], "headers": []}
    for score, label, box in zip(
        results["scores"], results["labels"], results["boxes"]
    ):
        label_name = structure_model.config.id2label[label.item()]
        bbox = [round(i, 2) for i in box.tolist()]
        if "row" in label_name and "header" not in label_name:
            structure["rows"].append({"bbox": bbox, "score": round(score.item(), 3)})
        elif "column" in label_name and "header" not in label_name:
            structure["columns"].append({"bbox": bbox, "score": round(score.item(), 3)})
        elif "header" in label_name:
            structure["headers"].append({"bbox": bbox, "score": round(score.item(), 3)})

    # Sort rows top-to-bottom, columns left-to-right
    structure["rows"].sort(key=lambda r: r["bbox"][1])
    structure["columns"].sort(key=lambda c: c["bbox"][0])
    return structure
```
The structure model uses these labels: table, table row, table column, table column header, table projected row header, and table spanning cell. The threshold is lower here (0.6) because internal structure elements overlap more and the model is less confident on individual rows/columns than it is on whole tables.
Feed it a cropped table image, not the full document page. Crop the table using the bounding box from the detection step:
```python
from PIL import Image

# Crop detected table from the full page
def crop_table(page_image: Image.Image, table_box: list) -> Image.Image:
    """Crop a table region from the document page with padding."""
    xmin, ymin, xmax, ymax = table_box
    # Add small padding around the table
    pad = 10
    xmin = max(0, xmin - pad)
    ymin = max(0, ymin - pad)
    xmax = min(page_image.width, xmax + pad)
    ymax = min(page_image.height, ymax + pad)
    return page_image.crop((xmin, ymin, xmax, ymax))
```
Build Structured DataFrames from Detected Cells
This is where it all comes together. You have row and column bounding boxes – now intersect them to find individual cells, then read the text with OCR:
```python
import pandas as pd
from PIL import Image
import pytesseract

def cells_to_dataframe(
    table_image: Image.Image,
    structure: dict,
) -> pd.DataFrame:
    """Convert detected rows and columns into a pandas DataFrame using OCR."""
    rows = structure["rows"]
    columns = structure["columns"]
    if not rows or not columns:
        return pd.DataFrame()

    # Build cell grid by intersecting row and column bounding boxes
    data = []
    for row in rows:
        row_data = []
        ry1, ry2 = row["bbox"][1], row["bbox"][3]
        for col in columns:
            cx1, cx2 = col["bbox"][0], col["bbox"][2]
            # Cell region is the intersection of row and column
            cell_box = (int(cx1), int(ry1), int(cx2), int(ry2))
            # Crop cell and OCR it
            cell_image = table_image.crop(cell_box)
            text = pytesseract.image_to_string(cell_image, config="--psm 6").strip()
            row_data.append(text)
        data.append(row_data)

    # Use header row if detected
    if structure["headers"] and len(data) > 0:
        header_y = structure["headers"][0]["bbox"][1]
        # Find which data row aligns with the header
        for i, row in enumerate(rows):
            if abs(row["bbox"][1] - header_y) < 20:
                return pd.DataFrame(data[i + 1:], columns=data[i])

    # Fall back to numeric column names
    return pd.DataFrame(data)
```
The --psm 6 flag tells Tesseract to treat the input as a single block of text, which works well for table cells. For cells with only numbers, --psm 7 (single line) sometimes gives better results.
Full Pipeline: PDF to DataFrames
Here is the complete pipeline wired together:
```python
import torch
from transformers import AutoImageProcessor, TableTransformerForObjectDetection
from PIL import Image
from pdf2image import convert_from_path
import pandas as pd
import pytesseract

# Load both models once
det_processor = AutoImageProcessor.from_pretrained(
    "microsoft/table-transformer-detection"
)
det_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-detection"
)
str_processor = AutoImageProcessor.from_pretrained(
    "microsoft/table-transformer-structure-recognition"
)
str_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition"
)

def extract_tables_from_pdf(pdf_path: str, dpi: int = 300) -> list[pd.DataFrame]:
    """Extract all tables from a PDF as DataFrames."""
    pages = convert_from_path(pdf_path, dpi=dpi)
    all_tables = []
    for page_idx, page_image in enumerate(pages):
        # Step 1: Detect tables on this page
        inputs = det_processor(images=page_image, return_tensors="pt")
        with torch.no_grad():
            outputs = det_model(**inputs)
        target_sizes = torch.tensor([page_image.size[::-1]])
        detections = det_processor.post_process_object_detection(
            outputs, threshold=0.9, target_sizes=target_sizes
        )[0]

        for table_idx, box in enumerate(detections["boxes"]):
            box_coords = box.tolist()
            # Step 2: Crop the table
            table_img = crop_table(page_image, box_coords)
            # Step 3: Recognize structure
            structure = recognize_table_structure(table_img)
            # Step 4: Convert to DataFrame
            df = cells_to_dataframe(table_img, structure)
            if not df.empty:
                print(f"Page {page_idx}, Table {table_idx}: "
                      f"{len(df)} rows x {len(df.columns)} cols")
                all_tables.append(df)
    return all_tables

# Run it
tables = extract_tables_from_pdf("quarterly_report.pdf")
for i, df in enumerate(tables):
    print(f"\nTable {i}:")
    print(df.head())
    df.to_csv(f"table_{i}.csv", index=False)
```
Use GPU if available. Move models to CUDA and the inference time drops from seconds to milliseconds per page:
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
det_model.to(device)
str_model.to(device)

# When running inference, move inputs too
inputs = {k: v.to(device) for k, v in inputs.items()}
```
Batch your pages. The processor accepts a list of images. Processing 4 pages at once is faster than 4 separate calls on GPU.
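The chunking itself is plain list slicing; `batch_pages` below is a hypothetical helper, and each resulting batch can then go through `det_processor(images=batch, return_tensors="pt")` in a single call:

```python
def batch_pages(pages: list, batch_size: int = 4) -> list:
    """Split a list of page images into fixed-size batches."""
    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]

# Integers stand in for page images here
batches = batch_pages(list(range(10)), batch_size=4)
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

When post-processing a batch, pass one (height, width) pair per image in `target_sizes` so each page's boxes come back in its own pixel coordinates.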
Use the v1.1 model for better accuracy. Microsoft released an updated structure recognition model that handles more table types:
```python
# v1.1 model trained on more diverse data
str_processor = AutoImageProcessor.from_pretrained(
    "microsoft/table-transformer-structure-recognition-v1.1-all"
)
str_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition-v1.1-all"
)
```
The v1.1 model is better at handling spanning cells and complex headers. Use it unless you have a specific reason to stick with v1.0.
Common Errors and Fixes
RuntimeError: Expected all tensors to be on the same device – You moved the model to GPU but forgot the inputs. Both must be on the same device. Use inputs = {k: v.to(device) for k, v in inputs.items()} before calling the model.
Empty detection results – Lower the threshold from 0.9 to 0.7. If you still get nothing, check that your image is RGB (not grayscale) and at least 640px on the shorter side. The model was trained on 800px images. Tiny inputs produce bad results.
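One way to guard against undersized inputs is to upscale before detection; `ensure_min_side` is a hypothetical helper, not part of any library:

```python
from PIL import Image

def ensure_min_side(image: Image.Image, min_side: int = 800) -> Image.Image:
    """Upscale so the shorter side is at least min_side pixels."""
    short = min(image.size)
    if short >= min_side:
        return image
    scale = min_side / short
    new_size = (round(image.width * scale), round(image.height * scale))
    return image.resize(new_size, Image.LANCZOS)

page = Image.new("RGB", (400, 600))
upscaled = ensure_min_side(page)  # -> 800 x 1200
```

Upscaling cannot add detail a low-DPI render never had, so prefer re-rendering the PDF at 300 DPI when you can.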
Tesseract returns garbled text – Increase your PDF-to-image DPI from 150 to 300. Low-resolution cell crops give Tesseract too few pixels to work with. Also check that tesseract is installed system-wide: sudo apt install tesseract-ocr on Ubuntu.
pdf2image.exceptions.PDFInfoNotInstalledError – You need poppler-utils installed. On Ubuntu: sudo apt install poppler-utils. On macOS: brew install poppler.
Overlapping or duplicate bounding boxes – The model sometimes predicts multiple overlapping boxes for the same table. Apply non-maximum suppression (NMS) to filter duplicates:
```python
from torchvision.ops import nms

boxes = results["boxes"]
scores = results["scores"]
keep = nms(boxes, scores, iou_threshold=0.5)
filtered_boxes = boxes[keep]
filtered_scores = scores[keep]
```
Structure model misses rows in dense tables – Drop the structure threshold to 0.4. Dense tables with thin row separators are harder to detect. You can also try padding the cropped table image by 20-30 pixels on each side before feeding it to the structure model.
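The padding step can be as simple as adding a white border with `ImageOps.expand`; a minimal sketch:

```python
from PIL import Image, ImageOps

def pad_table_image(table_image: Image.Image, pad: int = 25) -> Image.Image:
    """Add a white border so edge rows and columns aren't clipped."""
    return ImageOps.expand(table_image, border=pad, fill="white")

padded = pad_table_image(Image.new("RGB", (100, 50)))  # -> 150 x 100
```

Remember that structure boxes then come back in the padded image's coordinates, so subtract the pad before mapping cells onto the original crop.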
Out of memory on large PDFs – Process one page at a time instead of loading all pages into memory. Replace convert_from_path(pdf_path, dpi=300) with a loop using convert_from_path(pdf_path, dpi=300, first_page=i, last_page=i) for each page index.