The Quick Version

PaddleOCR gives you a three-stage pipeline out of the box: text detection (DB algorithm), direction classification, and text recognition (SVTR/CRNN). Tesseract is the old workhorse – simpler to set up, good enough for clean documents, but falls apart on complex layouts. Use PaddleOCR when you need accuracy on messy real-world images. Use Tesseract when you want zero Python dependencies beyond a system package.

Here is the fastest path to extracting text from an image with both tools:

# PaddleOCR -- 3 lines to full OCR
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")
results = ocr.ocr("invoice.png", cls=True)

for line in results[0]:
    bbox, (text, confidence) = line
    print(f"[{confidence:.2f}] {text}")
# Tesseract -- equally simple for clean docs
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("invoice.png"))
print(text)

Both work. The difference shows up when your input images get ugly – skewed scans, mixed fonts, tables, or non-English text.

Installation

PaddleOCR

PaddleOCR depends on PaddlePaddle, Baidu’s deep learning framework. Install the CPU build of PaddlePaddle together with PaddleOCR itself:

pip install paddlepaddle paddleocr

For GPU acceleration with CUDA 11.8:

pip install paddlepaddle-gpu paddleocr

PaddleOCR downloads detection, classification, and recognition models on first run (~100MB total). They cache to ~/.paddleocr/.

Tesseract

Tesseract is a C++ binary with a Python wrapper. Install the system package first:

# Ubuntu/Debian
sudo apt install tesseract-ocr tesseract-ocr-eng

# macOS
brew install tesseract

# Then the Python bindings
pip install pytesseract Pillow

For additional languages, install the corresponding data packages. For example, tesseract-ocr-deu for German or tesseract-ocr-chi-sim for Simplified Chinese.

Processing Scanned Documents

Scanned documents are where OCR pipelines earn their keep. Raw scans often have noise, skew, and uneven lighting. A preprocessing step makes a massive difference.

import cv2
import numpy as np
from paddleocr import PaddleOCR

def preprocess_scan(image_path):
    """Clean up a scanned document for better OCR accuracy."""
    img = cv2.imread(image_path)

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Adaptive thresholding handles uneven lighting better than global
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )

    # Denoise without destroying text edges
    denoised = cv2.fastNlMeansDenoising(binary, h=10)

    return denoised

ocr = PaddleOCR(use_angle_cls=True, lang="en")

# Preprocess, save temp file, then OCR
cleaned = preprocess_scan("messy_scan.png")
cv2.imwrite("/tmp/cleaned.png", cleaned)
results = ocr.ocr("/tmp/cleaned.png", cls=True)

for line in results[0]:
    bbox, (text, conf) = line
    if conf > 0.8:  # Only keep high-confidence detections
        print(text)

The use_angle_cls=True flag enables PaddleOCR’s text direction classifier, which handles rotated text (0 and 180 degrees). For documents scanned upside down, this is the difference between garbage output and correct text.

Handling Tables

Tables are the hardest part of document OCR. Neither PaddleOCR nor Tesseract natively understands table structure – they return text regions, not cells – so you need to reconstruct the grid yourself.

PaddleOCR’s PP-Structure module handles this directly:

from paddleocr import PPStructure

table_engine = PPStructure(show_log=False, layout=False)

result = table_engine("table_image.png")

for item in result:
    if item["type"] == "table":
        # Returns HTML table markup with cell contents
        print(item["res"]["html"])

PP-Structure detects table regions and returns structured HTML. For complex tables with merged cells, accuracy drops, but it handles standard grid tables well.
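Since the returned HTML is plain table markup, you can flatten it into Python rows without extra dependencies. A stdlib-only sketch — the sample HTML string here is a made-up stand-in, not real PP-Structure output:

```python
from html.parser import HTMLParser

class TableFlattener(HTMLParser):
    """Collect <td>/<th> cell text from table HTML into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell, self._in_cell = [], [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell, self._cell = True, []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._row.append("".join(self._cell).strip())
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

parser = TableFlattener()
parser.feed("<table><tr><td>Item</td><td>Qty</td></tr>"
            "<tr><td>Widget</td><td>3</td></tr></table>")
print(parser.rows)  # [['Item', 'Qty'], ['Widget', '3']]
```

From there, the row lists drop straight into a CSV writer or a DataFrame. Merged cells (colspan/rowspan) need extra handling this sketch omits.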

With Tesseract, you can use --psm 6 (assume a uniform block of text) and combine it with contour detection to approximate cell boundaries:

import pytesseract
from PIL import Image

# PSM 6: Assume a single uniform block of text
# Useful for individual table cells after you've cropped them
config = "--psm 6 --oem 3"
text = pytesseract.image_to_string(Image.open("cell_crop.png"), config=config)

For production table extraction, I recommend PaddleOCR’s PP-Structure. Tesseract requires too much manual geometry work to reliably parse tables.

Multi-Language OCR

PaddleOCR supports 80+ languages. Pass the language code during initialization:

from paddleocr import PaddleOCR

# Chinese + English (PaddleOCR's strongest language)
ocr_cn = PaddleOCR(use_angle_cls=True, lang="ch")

# Japanese
ocr_ja = PaddleOCR(use_angle_cls=True, lang="japan")

# Arabic (right-to-left handled automatically)
ocr_ar = PaddleOCR(use_angle_cls=True, lang="ar")

results = ocr_cn.ocr("chinese_receipt.png", cls=True)

Each language downloads its own recognition model. PaddleOCR was originally built for Chinese text, so CJK languages get the best accuracy.

For Tesseract, specify language with the -l flag:

import pytesseract
from PIL import Image

# German OCR (requires tesseract-ocr-deu package)
text = pytesseract.image_to_string(
    Image.open("german_doc.png"), lang="deu"
)

# Multiple languages in one document
text = pytesseract.image_to_string(
    Image.open("mixed_doc.png"), lang="eng+deu+fra"
)

Confidence Scores and Filtering

PaddleOCR returns per-line confidence scores. Use them aggressively – low-confidence results are usually noise or misreads.

from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")
results = ocr.ocr("receipt.png", cls=True)

high_confidence = []
low_confidence = []

for line in results[0]:
    bbox, (text, conf) = line
    if conf >= 0.85:
        high_confidence.append(text)
    else:
        low_confidence.append((text, conf))

print("Extracted text:")
print("\n".join(high_confidence))

if low_confidence:
    print("\nLow-confidence (review manually):")
    for text, conf in low_confidence:
        print(f"  [{conf:.2f}] {text}")

Tesseract provides word-level confidence through its image_to_data function:

import pytesseract
from PIL import Image
import pandas as pd

data = pytesseract.image_to_data(
    Image.open("receipt.png"), output_type=pytesseract.Output.DATAFRAME
)

# Keep confident, non-empty words (Tesseract marks non-text rows with conf -1)
words = data[(data["conf"] > 60) & (data["text"].str.strip().astype(bool))]
print(words[["text", "conf"]].to_string(index=False))

Tesseract confidence ranges from 0 to 100. Anything below 60 is suspect.
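The word rows from image_to_data can be stitched back into full lines using the block/paragraph/line indices the frame already carries. A sketch with a hand-made frame standing in for real Tesseract output:

```python
import pandas as pd

def words_to_lines(data: pd.DataFrame, min_conf: int = 60) -> list[str]:
    """Rebuild line text from Tesseract's word-level image_to_data frame."""
    words = data[(data["conf"] > min_conf)
                 & (data["text"].astype(str).str.strip() != "")]
    # block_num/par_num/line_num together identify one physical line
    grouped = words.groupby(["block_num", "par_num", "line_num"])["text"]
    return grouped.agg(" ".join).tolist()

# Toy frame shaped like image_to_data's DATAFRAME output (made-up values)
data = pd.DataFrame({
    "block_num": [1, 1, 1], "par_num": [1, 1, 1], "line_num": [1, 1, 2],
    "conf": [95, 91, 88], "text": ["Total:", "$42.00", "Thanks!"],
})
print(words_to_lines(data))  # ['Total: $42.00', 'Thanks!']
```

This recovers reading order within each line while still dropping low-confidence noise.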

Batch Processing Document Images

When you have hundreds of document pages to process, parallelize and cache results:

import json
from pathlib import Path
from paddleocr import PaddleOCR
from concurrent.futures import ProcessPoolExecutor

_ocr = None  # one engine per worker process, created lazily

def process_single(image_path: str) -> dict:
    """Process one image and return structured results."""
    global _ocr
    if _ocr is None:  # first call in each worker loads the models once
        _ocr = PaddleOCR(use_angle_cls=True, lang="en", show_log=False)
    results = _ocr.ocr(image_path, cls=True)

    extracted = []
    for line in results[0] or []:
        bbox, (text, conf) = line
        extracted.append({
            "text": text,
            "confidence": round(conf, 3),
            "bbox": [coord for point in bbox for coord in point],
        })

    return {"file": image_path, "lines": extracted}

# Gather all images
image_dir = Path("scanned_docs/")
images = sorted(str(p) for p in image_dir.glob("*.png"))

# Process in parallel -- PaddleOCR is CPU-heavy, so workers help
with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_single, images))

# Save structured output
output_path = Path("ocr_results.json")
output_path.write_text(json.dumps(results, indent=2, ensure_ascii=False))
print(f"Processed {len(results)} documents -> {output_path}")

Each worker initializes its own PaddleOCR instance. The model loads once per process, so the overhead is the initial ~2 seconds per worker, then inference runs in parallel.
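The caching half of that advice can be as simple as skipping files that already appear in a previous ocr_results.json. A small sketch, assuming the JSON layout produced above:

```python
import json
from pathlib import Path

def pending_images(all_images: list[str], cache_path: Path) -> list[str]:
    """Drop images that already have an entry in a previous results file."""
    if not cache_path.exists():
        return list(all_images)
    done = {entry["file"] for entry in json.loads(cache_path.read_text())}
    return [p for p in all_images if p not in done]

# On re-runs, only new pages get OCR'd
todo = pending_images(["page1.png", "page2.png"], Path("ocr_results.json"))
```

Pass `todo` instead of `images` to the executor and merge the new results into the existing JSON when saving.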

Post-Processing Extracted Text

Raw OCR output is messy. Common issues: extra whitespace, broken words across lines, misread characters. A post-processing step cleans things up:

import re

def clean_ocr_text(raw_lines: list[str]) -> str:
    """Clean and reassemble OCR output into readable text."""
    text = "\n".join(raw_lines)

    # Fix common OCR misreads
    replacements = {
        "0": "O",   # zero vs capital O -- context-dependent
        "l": "1",   # lowercase L vs one -- context-dependent
        "|": "I",   # pipe vs capital I
    }
    # Only apply these in numeric or alphabetic contexts respectively
    # For general use, skip character-level replacements

    # Remove excessive whitespace and trim each line
    text = re.sub(r"[ \t]+", " ", text)
    text = "\n".join(line.strip() for line in text.split("\n"))

    # Rejoin hyphenated line breaks (e.g., "docu-\nment" -> "document")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Collapse multiple blank lines
    text = re.sub(r"\n{3,}", "\n\n", text)

    return text.strip()

cleaned = clean_ocr_text(["  Invoice   Number: 12345  ", "  Date:  2026-02-15 "])
print(cleaned)
# Output: Invoice Number: 12345
#         Date: 2026-02-15

For domain-specific documents (invoices, medical records, legal contracts), build a custom dictionary of expected terms and use fuzzy matching to correct OCR errors.
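For a first pass at that dictionary-based correction, difflib from the standard library is enough. The vocabulary below is a made-up example; the cutoff is a tuning knob:

```python
import difflib

# Hypothetical domain vocabulary -- replace with terms from your documents
VOCAB = ["Invoice", "Subtotal", "Total", "Quantity", "Amount Due"]

def correct_term(word: str, vocab: list[str] = VOCAB, cutoff: float = 0.8) -> str:
    """Snap an OCR'd word to the closest known term if it is close enough."""
    match = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return match[0] if match else word

print(correct_term("Subtotai"))  # 'Subtotal'
print(correct_term("banana"))    # no close match: returned unchanged
```

A high cutoff keeps the matcher from "correcting" legitimate words that merely resemble vocabulary terms.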

When to Use Which Tool

| Criteria | PaddleOCR | Tesseract |
| --- | --- | --- |
| Accuracy on messy images | Excellent | Fair |
| Setup complexity | pip install (downloads models) | System package + pip |
| Speed (CPU) | Slower (~1-3s per page) | Faster (~0.3-1s per page) |
| GPU support | Yes (PaddlePaddle GPU) | No |
| Table extraction | Built-in (PP-Structure) | Manual work required |
| CJK languages | Best in class | Acceptable |
| Latin-script docs | Great | Great |
| Rotated text | Handles via angle classifier | Needs manual deskewing |
| Dependency footprint | ~500MB (PaddlePaddle + models) | ~30MB (system package) |

My recommendation: start with PaddleOCR for anything beyond clean, English-only printed documents. The accuracy gap is real, especially on receipts, handwritten annotations, or documents with mixed layouts. Fall back to Tesseract for lightweight pipelines where you need minimal dependencies and the input quality is consistently high.

Common Errors and Fixes

PaddleOCR: No module named 'paddle'

You installed paddleocr but not paddlepaddle. They are separate packages:

pip install paddlepaddle paddleocr

PaddleOCR: results[0] returns None

This means the detector found no text regions. Check that your image is not too small (upscale to at least 640px on the shortest side) and that it actually contains text. Preprocessing with binarization helps on low-contrast images.

Tesseract: TesseractNotFoundError

The Python wrapper cannot find the tesseract binary. Either install the system package or set the path manually:

pytesseract.pytesseract.tesseract_cmd = r"/usr/bin/tesseract"

Tesseract: garbled output on clean images

Wrong page segmentation mode. Tesseract defaults to --psm 3 (fully automatic), which sometimes misclassifies the layout. Try --psm 6 for a single block or --psm 4 for a single column:

text = pytesseract.image_to_string(img, config="--psm 6")

PaddleOCR: extremely slow on CPU

PaddleOCR runs the full detection + classification + recognition pipeline. Disable angle classification if you know all text is upright:

ocr = PaddleOCR(use_angle_cls=False, lang="en")

Also consider raising det_db_box_thresh, the detector's box confidence threshold, to drop marginal detections early and reduce the number of recognition calls.

Both tools: text from adjacent columns merges into one line

OCR engines read left-to-right by default. For multi-column layouts, segment the image into columns first, then run OCR on each column separately. PaddleOCR’s PP-Structure layout analysis mode can detect column regions automatically.
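The column segmentation can be sketched with a projection profile: count ink per pixel column of a binarized page and treat wide empty runs as gutters. NumPy only; the threshold and gap width are assumptions to tune per document:

```python
import numpy as np

def split_columns(binary: np.ndarray, min_gap: int = 20) -> list[tuple[int, int]]:
    """Return (x_start, x_end) ranges of text columns in a binarized page.

    `binary` holds text pixels as nonzero values on a zero background."""
    empty = (binary > 0).sum(axis=0) == 0  # True where a pixel column has no ink
    spans, start = [], None
    for x, col_is_empty in enumerate(empty):
        if not col_is_empty and start is None:
            start = x
        elif col_is_empty and start is not None:
            spans.append((start, x))
            start = None
    if start is not None:
        spans.append((start, len(empty)))
    # Merge spans separated by gaps narrower than min_gap
    # (word spacing, not column gutters)
    merged = spans[:1]
    for s, e in spans[1:]:
        if s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

# Two blocks of "ink" separated by a wide gutter
page = np.zeros((50, 300), np.uint8)
page[5:45, 10:100] = 255
page[5:45, 180:280] = 255
print(split_columns(page))  # [(10, 100), (180, 280)]
```

Crop each range with `page[:, start:end]` and run OCR on the crops independently, concatenating the results column by column.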