Comparing two versions of a document visually is one of those problems that sounds simple until you try it. Text diffs work for source code, but documents have layout, fonts, images, and formatting that a plain diff command misses entirely. You need a pipeline that works at both the pixel level and the text level.
Here is the core idea: render each document page as an image, compute structural similarity to find changed regions, extract text from both versions with OCR, and generate a unified diff report. Install the dependencies first:
```bash
pip install opencv-python-headless numpy scikit-image paddlepaddle paddleocr pdf2image
```
You also need poppler-utils for PDF rendering:
```bash
# Ubuntu/Debian
sudo apt install poppler-utils

# macOS
brew install poppler
```
## Render PDF Pages as Images
The first step converts PDF pages to images so OpenCV can process them. pdf2image wraps Poppler’s pdftoppm command and gives you PIL images directly.
```python
from pdf2image import convert_from_path

def pdf_to_images(pdf_path, dpi=200):
    """Convert each page of a PDF to a PIL image."""
    pages = convert_from_path(pdf_path, dpi=dpi)
    return pages

old_pages = pdf_to_images("contract_v1.pdf")
new_pages = pdf_to_images("contract_v2.pdf")

print(f"Old document: {len(old_pages)} pages")
print(f"New document: {len(new_pages)} pages")
```
A DPI of 200 balances quality and speed. Bump it to 300 if you need to catch small font changes. Drop to 150 for faster processing on large documents where you only care about major layout shifts.
If your inputs are already images (scanned documents, screenshots), skip this step and load them directly with cv2.imread.
## Compute SSIM and Pixel-Level Diffs
Structural Similarity Index (SSIM) gives you a score between -1 and 1 measuring how similar two images are. More importantly, it returns a per-pixel difference map that tells you exactly where the changes are.
```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def compute_visual_diff(img_old, img_new):
    """Compute SSIM and return a diff mask highlighting changed regions."""
    # Convert PIL images to OpenCV format
    old_cv = cv2.cvtColor(np.array(img_old), cv2.COLOR_RGB2BGR)
    new_cv = cv2.cvtColor(np.array(img_new), cv2.COLOR_RGB2BGR)

    # Resize to match dimensions if pages differ slightly
    h = max(old_cv.shape[0], new_cv.shape[0])
    w = max(old_cv.shape[1], new_cv.shape[1])
    old_cv = cv2.resize(old_cv, (w, h))
    new_cv = cv2.resize(new_cv, (w, h))

    # Convert to grayscale for SSIM
    old_gray = cv2.cvtColor(old_cv, cv2.COLOR_BGR2GRAY)
    new_gray = cv2.cvtColor(new_cv, cv2.COLOR_BGR2GRAY)

    # Compute SSIM with full difference map
    score, diff_map = ssim(old_gray, new_gray, full=True)
    print(f"SSIM score: {score:.4f}")

    # Convert diff map to uint8 (0-255 range)
    diff_map = (255 - (diff_map * 255)).astype("uint8")

    # Threshold to isolate significant changes
    _, thresh = cv2.threshold(diff_map, 30, 255, cv2.THRESH_BINARY)

    # Dilate to merge nearby change regions
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    thresh = cv2.dilate(thresh, kernel, iterations=2)

    return score, thresh, old_cv, new_cv

score, change_mask, old_cv, new_cv = compute_visual_diff(old_pages[0], new_pages[0])
```
The threshold value of 30 controls sensitivity. Lower it to catch subtle changes like font weight differences. Raise it to ignore compression artifacts and anti-aliasing noise. The dilation step merges nearby changed pixels into coherent regions so you get clean bounding boxes instead of scattered dots.
An SSIM score above 0.95 usually means the pages are visually identical for practical purposes. Below 0.8 means significant changes. Between 0.8 and 0.95 is where the interesting diffs live.
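If you want to triage pages automatically, those bands can be captured in a small helper. The cutoffs below mirror the heuristics above and are worth tuning for your own documents; `classify_page` is a hypothetical name.

```python
def classify_page(score, identical=0.95, major=0.8):
    """Bucket an SSIM score into the bands described above."""
    if score >= identical:
        return "unchanged"
    if score < major:
        return "major_changes"
    return "minor_changes"

classify_page(0.99)  # "unchanged"
classify_page(0.87)  # "minor_changes"
```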
## Highlight Changed Regions on the Document
Once you have the change mask, find contours to draw bounding boxes around every modified area. This gives you a visual diff overlay that makes changes immediately obvious.
```python
def highlight_changes(old_cv, new_cv, change_mask, min_area=100):
    """Draw bounding boxes around changed regions on both document versions."""
    overlay_old = old_cv.copy()
    overlay_new = new_cv.copy()

    contours, _ = cv2.findContours(
        change_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )

    change_regions = []
    for contour in contours:
        area = cv2.contourArea(contour)
        if area < min_area:
            continue
        x, y, w, h = cv2.boundingRect(contour)
        change_regions.append((x, y, w, h))
        # Red box on old version (removed/changed content)
        cv2.rectangle(overlay_old, (x, y), (x + w, y + h), (0, 0, 255), 2)
        # Green box on new version (added/changed content)
        cv2.rectangle(overlay_new, (x, y), (x + w, y + h), (0, 255, 0), 2)

    print(f"Found {len(change_regions)} changed regions")
    cv2.imwrite("diff_old.png", overlay_old)
    cv2.imwrite("diff_new.png", overlay_new)
    return change_regions

regions = highlight_changes(old_cv, new_cv, change_mask)
```
The min_area=100 filter drops tiny change regions that are usually noise – single-pixel shifts from PDF rendering differences, slight anti-aliasing changes, or JPEG compression artifacts. Increase this to 500 if you only care about paragraph-level changes.
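One practical refinement: even after dilation, a single logical edit can surface as several overlapping or adjacent contours. A single-pass sketch that merges boxes lying within a small gap of each other (`merge_regions` is a hypothetical helper, pure Python):

```python
def merge_regions(regions, gap=10):
    """Merge (x, y, w, h) boxes that overlap or sit within `gap` pixels of
    each other, so one edit isn't reported as several fragments."""
    merged = []
    for x, y, w, h in sorted(regions):
        for i, (mx, my, mw, mh) in enumerate(merged):
            # Overlap test with the gap added as slack on every side
            if (x <= mx + mw + gap and mx <= x + w + gap and
                    y <= my + mh + gap and my <= y + h + gap):
                nx, ny = min(x, mx), min(y, my)
                merged[i] = (nx, ny,
                             max(x + w, mx + mw) - nx,
                             max(y + h, my + mh) - ny)
                break
        else:
            merged.append((x, y, w, h))
    return merged

merge_regions([(0, 0, 10, 10), (5, 5, 10, 10)])  # [(0, 0, 15, 15)]
```

A single pass is not fully transitive, but for the handful of boxes a page typically produces it is usually enough.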
## Extract Text and Generate a Diff Report
Visual diffs show you where something changed but not what changed. PaddleOCR extracts the actual text from both versions, and Python’s difflib produces a human-readable diff report.
```python
from paddleocr import PaddleOCR
import difflib

ocr = PaddleOCR(use_angle_cls=True, lang="en", show_log=False)

def extract_text_by_region(image_cv):
    """Extract text from an image, sorted top-to-bottom by position."""
    # PaddleOCR accepts either a file path or a numpy array directly
    results = ocr.ocr(image_cv, cls=True)
    lines = []
    for line in results[0] or []:
        bbox, (text, conf) = line
        if conf < 0.7:
            continue
        # Use top-left y-coordinate for vertical sorting
        y_pos = bbox[0][1]
        lines.append((y_pos, text))
    # Sort by vertical position to reconstruct reading order
    lines.sort(key=lambda x: x[0])
    return [text for _, text in lines]

old_lines = extract_text_by_region(old_cv)
new_lines = extract_text_by_region(new_cv)

# Generate unified diff
diff = difflib.unified_diff(
    old_lines, new_lines,
    fromfile="v1", tofile="v2",
    lineterm=""
)
diff_report = "\n".join(diff)
print(diff_report)
```
PaddleOCR returns text regions with bounding boxes. Sorting by the y-coordinate reconstructs the natural reading order from top to bottom. The confidence threshold of 0.7 filters out garbage detections that would pollute the diff.
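If you only need change counts rather than the full patch, the unified diff can be collapsed into a summary using nothing beyond the standard library (`summarize_diff` is a hypothetical helper):

```python
import difflib

def summarize_diff(old_lines, new_lines):
    """Count added and removed lines between two OCR transcripts."""
    added = removed = 0
    for line in difflib.unified_diff(old_lines, new_lines, lineterm=""):
        # Skip the "---"/"+++" file headers, count only content lines
        if line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return {"added": added, "removed": removed}

summarize_diff(["a", "b"], ["a", "c"])  # {'added': 1, 'removed': 1}
```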
For a more detailed report that maps text changes back to specific regions on the page, combine the OCR bounding boxes with the visual diff regions:
```python
def map_text_changes_to_regions(old_cv, new_cv, change_regions):
    """Extract text from changed regions only and report what changed."""
    report = []
    for i, (x, y, w, h) in enumerate(change_regions):
        # Crop the changed region from both versions
        old_crop = old_cv[y:y+h, x:x+w]
        new_crop = new_cv[y:y+h, x:x+w]

        # OCR each crop
        old_result = ocr.ocr(old_crop, cls=True)
        new_result = ocr.ocr(new_crop, cls=True)

        old_text = " ".join(
            text for line in (old_result[0] or [])
            for _, (text, conf) in [line] if conf > 0.7
        )
        new_text = " ".join(
            text for line in (new_result[0] or [])
            for _, (text, conf) in [line] if conf > 0.7
        )

        if old_text != new_text:
            report.append({
                "region": i + 1,
                "position": {"x": x, "y": y, "width": w, "height": h},
                "old_text": old_text or "(empty)",
                "new_text": new_text or "(empty)",
            })
    return report

changes = map_text_changes_to_regions(old_cv, new_cv, regions)
for change in changes:
    print(f"Region {change['region']} at ({change['position']['x']}, {change['position']['y']}):")
    print(f"  Old: {change['old_text']}")
    print(f"  New: {change['new_text']}")
    print()
```
This approach is more precise than diffing the entire page text because it only runs OCR on the regions that actually changed. On a 10-page document with 3 small edits, you run OCR on 6 small crops instead of 20 full pages.
## Full Multi-Page Pipeline
Putting it all together into a function that processes an entire document:
```python
import json
import os
import cv2
from pdf2image import convert_from_path

def compare_documents(old_pdf, new_pdf, output_dir="diff_output", dpi=200):
    """Compare two PDF documents page by page and generate a full diff report."""
    os.makedirs(output_dir, exist_ok=True)

    old_pages = convert_from_path(old_pdf, dpi=dpi)
    new_pages = convert_from_path(new_pdf, dpi=dpi)
    max_pages = max(len(old_pages), len(new_pages))

    full_report = []
    for page_num in range(max_pages):
        page_result = {"page": page_num + 1, "status": "unknown", "changes": []}

        if page_num >= len(old_pages):
            page_result["status"] = "added"
            full_report.append(page_result)
            continue
        if page_num >= len(new_pages):
            page_result["status"] = "removed"
            full_report.append(page_result)
            continue

        score, mask, old_cv, new_cv = compute_visual_diff(
            old_pages[page_num], new_pages[page_num]
        )
        if score > 0.98:
            page_result["status"] = "unchanged"
            page_result["ssim"] = round(score, 4)
        else:
            page_result["status"] = "modified"
            page_result["ssim"] = round(score, 4)
            regions = highlight_changes(old_cv, new_cv, mask)
            changes = map_text_changes_to_regions(old_cv, new_cv, regions)
            page_result["changes"] = changes
            cv2.imwrite(f"{output_dir}/page_{page_num + 1}_old.png", old_cv)
            cv2.imwrite(f"{output_dir}/page_{page_num + 1}_new.png", new_cv)

        full_report.append(page_result)

    # Save JSON report
    report_path = f"{output_dir}/diff_report.json"
    with open(report_path, "w") as f:
        json.dump(full_report, f, indent=2)

    print(f"Comparison complete. Report saved to {report_path}")
    return full_report

report = compare_documents("contract_v1.pdf", "contract_v2.pdf")
```
The pipeline handles added and removed pages gracefully. If one version has more pages than the other, those extra pages are flagged as “added” or “removed” without crashing.
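Since the report is plain JSON, downstream tooling can consume it directly. For instance, a small hypothetical helper to tally page statuses before deciding whether a human needs to review the diff:

```python
def report_summary(full_report):
    """Tally page statuses ("unchanged", "modified", "added", "removed")
    from the list of per-page dicts produced by compare_documents."""
    counts = {}
    for page in full_report:
        counts[page["status"]] = counts.get(page["status"], 0) + 1
    return counts

report_summary([{"status": "unchanged"}, {"status": "modified"},
                {"status": "unchanged"}])
# {'unchanged': 2, 'modified': 1}
```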
## Common Errors and Fixes

### PDFInfoNotInstalledError: Unable to get page count

pdf2image cannot find the pdfinfo binary from Poppler. Install it:
```bash
# Ubuntu/Debian
sudo apt install poppler-utils

# macOS
brew install poppler
```
On Windows, download Poppler binaries and add the bin/ directory to your PATH, or pass the path explicitly:
```python
pages = convert_from_path("doc.pdf", poppler_path=r"C:\poppler\bin")
```
### ValueError: operands could not be broadcast together with shapes (1200,800) (1190,800)
The two page images have different dimensions. This happens when PDFs have slightly different page sizes or margins. The compute_visual_diff function above handles this with cv2.resize, but if you’re computing SSIM directly, resize first:
```python
old_gray = cv2.resize(old_gray, (w, h))
new_gray = cv2.resize(new_gray, (w, h))
```
### PaddleOCR returns empty results on cropped regions
Small crops (under ~50px in either dimension) don’t have enough context for the text detector. Add padding around the crop before running OCR:
```python
pad = 20
y1 = max(0, y - pad)
x1 = max(0, x - pad)
y2 = min(old_cv.shape[0], y + h + pad)
x2 = min(old_cv.shape[1], x + w + pad)
crop = old_cv[y1:y2, x1:x2]
```
### SSIM score is very low (< 0.5) even though documents look similar
Usually caused by a global shift – one version has a slightly different margin or header offset. Register the images before computing SSIM using feature-based alignment:
```python
# Use ORB feature matching to align before SSIM
orb = cv2.ORB_create(500)
kp1, des1 = orb.detectAndCompute(old_gray, None)
kp2, des2 = orb.detectAndCompute(new_gray, None)

bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(des1, des2)
matches = sorted(matches, key=lambda m: m.distance)[:50]

src_pts = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# Warp the new page into the old page's coordinate frame
matrix, _ = cv2.findHomography(dst_pts, src_pts, cv2.RANSAC, 5.0)
aligned_new = cv2.warpPerspective(new_gray, matrix, (old_gray.shape[1], old_gray.shape[0]))
```
### No module named 'paddle'
PaddleOCR needs PaddlePaddle installed separately. They are different packages:
```bash
pip install paddlepaddle paddleocr
```
For GPU support with CUDA 11.8, use paddlepaddle-gpu instead.