Tables trapped inside PDFs and scanned documents are one of the most annoying data engineering problems. You can see the data right there, but getting it into a DataFrame feels like pulling teeth. Microsoft’s Table Transformer fixes this. It is a DETR-based model trained on PubTables-1M that detects tables in documents and recognizes their internal structure – rows, columns, headers – all from a single image.
Here is the fastest path to a working table detection pipeline.
Detect Tables in a Document Image
Install the dependencies first:
```bash
pip install transformers torch torchvision pillow opencv-python pandas pdf2image
```
Now detect tables in a document image using the microsoft/table-transformer-detection model:
```python
import torch
from transformers import AutoImageProcessor, TableTransformerForObjectDetection
from PIL import Image

# Load detection model and processor
detection_processor = AutoImageProcessor.from_pretrained(
    "microsoft/table-transformer-detection"
)
detection_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-detection"
)

# Load your document image
image = Image.open("document_page.png").convert("RGB")

# Run inference
inputs = detection_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detection_model(**inputs)

# Post-process to get bounding boxes in pixel coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = detection_processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {detection_model.config.id2label[label.item()]} "
          f"with confidence {round(score.item(), 3)} at {box}")
```
The output gives you bounding boxes in [xmin, ymin, xmax, ymax] format. Each box wraps a detected table in the document. The threshold of 0.9 keeps only high-confidence detections – lower it to 0.7 if you are missing tables, but expect more false positives.
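Because post-processing is separate from inference, you can also re-filter the cached detections at a different cutoff without rerunning the model. A minimal sketch, where `filter_detections` is a hypothetical helper shown here on synthetic tensors rather than real model output:

```python
import torch

def filter_detections(results: dict, min_score: float) -> dict:
    """Keep only detections scoring at or above min_score (hypothetical helper)."""
    keep = results["scores"] >= min_score
    return {key: value[keep] for key, value in results.items()}

# Synthetic detections standing in for the post-processed `results` dict
results = {
    "scores": torch.tensor([0.95, 0.72]),
    "labels": torch.tensor([0, 0]),
    "boxes": torch.tensor([[10.0, 10.0, 200.0, 100.0],
                           [30.0, 150.0, 220.0, 260.0]]),
}
strict = filter_detections(results, 0.9)   # keeps only the 0.95 detection
relaxed = filter_detections(results, 0.7)  # keeps both
```

This lets you compare what 0.9 versus 0.7 keeps on the same page before committing to a threshold.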
Preprocess Documents with OpenCV
Real-world documents are messy. Scanned pages come in at odd angles, with noise, shadows, and uneven lighting. Preprocessing makes a measurable difference in detection accuracy.
```python
import cv2
import numpy as np

def preprocess_document(image_path: str) -> np.ndarray:
    """Clean up a scanned document image for table detection."""
    img = cv2.imread(image_path)

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Adaptive thresholding handles uneven lighting better than global
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, blockSize=15, C=10
    )

    # Denoise while preserving edges
    denoised = cv2.fastNlMeansDenoising(binary, h=10)

    # Deskew: find the dominant angle of the dark (text) pixels and rotate.
    # minAreaRect needs float32/int32 points, so cast before calling it.
    coords = np.column_stack(np.where(denoised < 128)).astype(np.float32)
    if len(coords) > 100:
        angle = cv2.minAreaRect(coords)[-1]
        # Normalize: older OpenCV returns angles in (-90, 0],
        # OpenCV >= 4.5 returns them in [0, 90)
        if angle < -45:
            angle += 90
        elif angle > 45:
            angle -= 90
        if abs(angle) > 0.5:
            h, w = denoised.shape
            center = (w // 2, h // 2)
            matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
            denoised = cv2.warpAffine(
                denoised, matrix, (w, h),
                flags=cv2.INTER_CUBIC,
                borderMode=cv2.BORDER_REPLICATE
            )
    return denoised
```
The key steps here: adaptive thresholding beats global thresholding on scanned docs because lighting is never uniform. The deskew step corrects rotated scans – even a 2-degree tilt can throw off structure recognition downstream. Feed the cleaned image to the Table Transformer as an RGB PIL image by converting back with Image.fromarray(cv2.cvtColor(denoised, cv2.COLOR_GRAY2RGB)).
Convert PDFs to Images
Most tables live in PDFs, not image files. Use pdf2image to convert pages:
```python
from pdf2image import convert_from_path

# Convert all pages at 300 DPI -- this resolution works well for Table Transformer
pages = convert_from_path("report.pdf", dpi=300)

for i, page in enumerate(pages):
    page.save(f"page_{i}.png", "PNG")
    print(f"Saved page {i}: {page.size[0]}x{page.size[1]} pixels")
```
Use 300 DPI. Going lower hurts detection accuracy. Going higher wastes memory without much benefit. If your PDF has 50+ pages, process them in batches to avoid running out of RAM.
Recognize Table Structure
Detecting the table is step one. Knowing which pixels are rows, columns, and headers is step two. The structure recognition model handles this:
```python
import torch
from transformers import AutoImageProcessor, TableTransformerForObjectDetection
from PIL import Image

# Load structure recognition model
structure_processor = AutoImageProcessor.from_pretrained(
    "microsoft/table-transformer-structure-recognition"
)
structure_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition"
)

def recognize_table_structure(table_image: Image.Image) -> dict:
    """Detect rows, columns, and headers within a cropped table image."""
    inputs = structure_processor(images=table_image, return_tensors="pt")
    with torch.no_grad():
        outputs = structure_model(**inputs)

    target_sizes = torch.tensor([table_image.size[::-1]])
    results = structure_processor.post_process_object_detection(
        outputs, threshold=0.6, target_sizes=target_sizes
    )[0]

    # Group detections by type
    structure = {"rows": [], "columns": [], "headers": []}
    for score, label, box in zip(
        results["scores"], results["labels"], results["boxes"]
    ):
        label_name = structure_model.config.id2label[label.item()]
        bbox = [round(i, 2) for i in box.tolist()]
        if "row" in label_name and "header" not in label_name:
            structure["rows"].append({"bbox": bbox, "score": round(score.item(), 3)})
        elif "column" in label_name and "header" not in label_name:
            structure["columns"].append({"bbox": bbox, "score": round(score.item(), 3)})
        elif "header" in label_name:
            structure["headers"].append({"bbox": bbox, "score": round(score.item(), 3)})

    # Sort rows top-to-bottom, columns left-to-right
    structure["rows"].sort(key=lambda r: r["bbox"][1])
    structure["columns"].sort(key=lambda c: c["bbox"][0])
    return structure
```
The structure model uses these labels: table, table row, table column, table column header, table projected row header, and table spanning cell. The threshold is lower here (0.6) because internal structure elements overlap more and the model is less confident on individual rows/columns than it is on whole tables.
Feed it a cropped table image, not the full document page. Crop the table using the bounding box from the detection step:
```python
from PIL import Image

# Crop detected table from the full page
def crop_table(page_image: Image.Image, table_box: list) -> Image.Image:
    """Crop a table region from the document page with padding."""
    xmin, ymin, xmax, ymax = table_box
    # Add small padding around the table
    pad = 10
    xmin = max(0, xmin - pad)
    ymin = max(0, ymin - pad)
    xmax = min(page_image.width, xmax + pad)
    ymax = min(page_image.height, ymax + pad)
    return page_image.crop((xmin, ymin, xmax, ymax))
```
Build Structured DataFrames from Detected Cells
This is where it all comes together. You have row and column bounding boxes – now intersect them to find individual cells, then read the text with OCR:
```python
import pandas as pd
from PIL import Image
import pytesseract

def cells_to_dataframe(
    table_image: Image.Image,
    structure: dict,
) -> pd.DataFrame:
    """Convert detected rows and columns into a pandas DataFrame using OCR."""
    rows = structure["rows"]
    columns = structure["columns"]
    if not rows or not columns:
        return pd.DataFrame()

    # Build cell grid by intersecting row and column bounding boxes
    data = []
    for row in rows:
        row_data = []
        ry1, ry2 = row["bbox"][1], row["bbox"][3]
        for col in columns:
            cx1, cx2 = col["bbox"][0], col["bbox"][2]
            # Cell region is the intersection of row and column
            cell_box = (int(cx1), int(ry1), int(cx2), int(ry2))
            # Crop cell and OCR it
            cell_image = table_image.crop(cell_box)
            text = pytesseract.image_to_string(cell_image, config="--psm 6").strip()
            row_data.append(text)
        data.append(row_data)

    # Use header row if detected
    if structure["headers"] and len(data) > 0:
        header_y = structure["headers"][0]["bbox"][1]
        # Find which data row aligns with the header
        for i, row in enumerate(rows):
            if abs(row["bbox"][1] - header_y) < 20:
                return pd.DataFrame(data[i + 1:], columns=data[i])

    # Fall back to numeric column names
    return pd.DataFrame(data)
```
The --psm 6 flag tells Tesseract to treat the input as a single block of text, which works well for table cells. For cells with only numbers, --psm 7 (single line) sometimes gives better results.
Full Pipeline: PDF to DataFrames
Here is the complete pipeline wired together:
```python
import torch
from transformers import AutoImageProcessor, TableTransformerForObjectDetection
from PIL import Image
from pdf2image import convert_from_path
import pandas as pd
import pytesseract

# Load both models once
det_processor = AutoImageProcessor.from_pretrained(
    "microsoft/table-transformer-detection"
)
det_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-detection"
)
str_processor = AutoImageProcessor.from_pretrained(
    "microsoft/table-transformer-structure-recognition"
)
str_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition"
)

def extract_tables_from_pdf(pdf_path: str, dpi: int = 300) -> list[pd.DataFrame]:
    """Extract all tables from a PDF as DataFrames."""
    pages = convert_from_path(pdf_path, dpi=dpi)
    all_tables = []
    for page_idx, page_image in enumerate(pages):
        # Step 1: Detect tables on this page
        inputs = det_processor(images=page_image, return_tensors="pt")
        with torch.no_grad():
            outputs = det_model(**inputs)
        target_sizes = torch.tensor([page_image.size[::-1]])
        detections = det_processor.post_process_object_detection(
            outputs, threshold=0.9, target_sizes=target_sizes
        )[0]

        for table_idx, box in enumerate(detections["boxes"]):
            box_coords = box.tolist()
            # Step 2: Crop the table
            table_img = crop_table(page_image, box_coords)
            # Step 3: Recognize structure
            structure = recognize_table_structure(table_img)
            # Step 4: Convert to DataFrame
            df = cells_to_dataframe(table_img, structure)
            if not df.empty:
                print(f"Page {page_idx}, Table {table_idx}: "
                      f"{len(df)} rows x {len(df.columns)} cols")
                all_tables.append(df)
    return all_tables

# Run it
tables = extract_tables_from_pdf("quarterly_report.pdf")
for i, df in enumerate(tables):
    print(f"\nTable {i}:")
    print(df.head())
    df.to_csv(f"table_{i}.csv", index=False)
```
Use GPU if available. Move models to CUDA and the inference time drops from seconds to milliseconds per page:
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
det_model.to(device)
str_model.to(device)

# When running inference, move inputs too
inputs = {k: v.to(device) for k, v in inputs.items()}
```
Batch your pages. The processor accepts a list of images. Processing 4 pages at once is faster than 4 separate calls on GPU.
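The chunking itself is plain list slicing; `batch_pages` below is a hypothetical helper, and each resulting batch can then go through `det_processor(images=batch, return_tensors="pt")` in a single call:

```python
def batch_pages(pages: list, batch_size: int = 4) -> list:
    """Split a list of page images into fixed-size batches."""
    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]

# Integers stand in for page images here
batches = batch_pages(list(range(10)), batch_size=4)
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

When post-processing a batch, pass one (height, width) pair per image in `target_sizes` so each page's boxes come back in its own pixel coordinates.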
Use the v1.1 model for better accuracy. Microsoft released an updated structure recognition model that handles more table types:
```python
# v1.1 model trained on more diverse data
str_processor = AutoImageProcessor.from_pretrained(
    "microsoft/table-transformer-structure-recognition-v1.1-all"
)
str_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-structure-recognition-v1.1-all"
)
```
The v1.1 model is better at handling spanning cells and complex headers. Use it unless you have a specific reason to stick with v1.0.
Common Errors and Fixes
RuntimeError: Expected all tensors to be on the same device – You moved the model to GPU but forgot the inputs. Both must be on the same device. Use inputs = {k: v.to(device) for k, v in inputs.items()} before calling the model.
Empty detection results – Lower the threshold from 0.9 to 0.7. If you still get nothing, check that your image is RGB (not grayscale) and at least 640px on the shorter side. The model was trained on 800px images. Tiny inputs produce bad results.
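One way to guard against undersized inputs is to upscale before detection; `ensure_min_side` is a hypothetical helper, not part of any library:

```python
from PIL import Image

def ensure_min_side(image: Image.Image, min_side: int = 800) -> Image.Image:
    """Upscale so the shorter side is at least min_side pixels."""
    short = min(image.size)
    if short >= min_side:
        return image
    scale = min_side / short
    new_size = (round(image.width * scale), round(image.height * scale))
    return image.resize(new_size, Image.LANCZOS)

page = Image.new("RGB", (400, 600))
upscaled = ensure_min_side(page)  # -> 800 x 1200
```

Upscaling cannot add detail a low-DPI render never had, so prefer re-rendering the PDF at 300 DPI when you can.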
Tesseract returns garbled text – Increase your PDF-to-image DPI from 150 to 300. Low-resolution cell crops give Tesseract too few pixels to work with. Also check that tesseract is installed system-wide: sudo apt install tesseract-ocr on Ubuntu.
pdf2image.exceptions.PDFInfoNotInstalledError – You need poppler-utils installed. On Ubuntu: sudo apt install poppler-utils. On macOS: brew install poppler.
Overlapping or duplicate bounding boxes – The model sometimes predicts multiple overlapping boxes for the same table. Apply non-maximum suppression (NMS) to filter duplicates:
```python
from torchvision.ops import nms

boxes = results["boxes"]
scores = results["scores"]
keep = nms(boxes, scores, iou_threshold=0.5)
filtered_boxes = boxes[keep]
filtered_scores = scores[keep]
```
Structure model misses rows in dense tables – Drop the structure threshold to 0.4. Dense tables with thin row separators are harder to detect. You can also try padding the cropped table image by 20-30 pixels on each side before feeding it to the structure model.
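The padding step can be as simple as adding a white border with `ImageOps.expand`; a minimal sketch:

```python
from PIL import Image, ImageOps

def pad_table_image(table_image: Image.Image, pad: int = 25) -> Image.Image:
    """Add a white border so edge rows and columns aren't clipped."""
    return ImageOps.expand(table_image, border=pad, fill="white")

padded = pad_table_image(Image.new("RGB", (100, 50)))  # -> 150 x 100
```

Remember that structure boxes then come back in the padded image's coordinates, so subtract the pad before mapping cells onto the original crop.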
Out of memory on large PDFs – Process one page at a time instead of loading all pages into memory. Replace convert_from_path(pdf_path, dpi=300) with a loop using convert_from_path(pdf_path, dpi=300, first_page=i, last_page=i) for each page index.