The Quick Version
PaddleOCR gives you a three-stage pipeline out of the box: text detection (DB algorithm), direction classification, and text recognition (SVTR/CRNN). Tesseract is the old workhorse: simpler to set up, good enough for clean documents, but it falls apart on complex layouts. Use PaddleOCR when you need accuracy on messy real-world images. Use Tesseract when you want a small footprint: a system package plus the thin pytesseract wrapper.
The fastest path to extracting text from an image is a single call with either tool: the ocr method on a PaddleOCR instance, or pytesseract’s image_to_string function.
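A minimal sketch of both paths, assuming the PaddleOCR 2.x Python API (where ocr.ocr returns one list of [box, (text, score)] entries per image) and pytesseract with Pillow. The function names are placeholders, and the heavy imports sit inside the functions so the module loads even if only one engine is installed.

```python
def paddle_lines(result):
    """Flatten a PaddleOCR 2.x result into a list of text lines.

    Assumed shape: result[0] is a list of [box, (text, score)] entries,
    or None when the detector found no text.
    """
    page = result[0] if result else None
    return [text for _box, (text, _score) in (page or [])]


def ocr_with_paddle(image_path):
    from paddleocr import PaddleOCR  # pip install paddlepaddle paddleocr

    # Models are downloaded on the first run.
    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    return "\n".join(paddle_lines(ocr.ocr(image_path, cls=True)))


def ocr_with_tesseract(image_path):
    import pytesseract  # needs the tesseract binary on PATH
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path)).strip()
```

Calling ocr_with_paddle("scan.png") and ocr_with_tesseract("scan.png") on the same file is a quick way to compare the two engines on your own data.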
Both work. The difference shows up when your input images get ugly – skewed scans, mixed fonts, tables, or non-English text.
Installation
PaddleOCR
PaddleOCR depends on PaddlePaddle, Baidu’s deep learning framework. Install the CPU version first with pip install paddlepaddle, then PaddleOCR itself with pip install paddleocr.
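For GPU acceleration with CUDA 11.8, install the paddlepaddle-gpu package instead of paddlepaddle. The exact wheel depends on your CUDA and PaddlePaddle versions, so copy the command from the PaddlePaddle installation page.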
PaddleOCR downloads detection, classification, and recognition models on first run (~100MB total). They cache to ~/.paddleocr/.
Tesseract
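Tesseract is a C++ binary with a Python wrapper. Install the system package first (for example, sudo apt install tesseract-ocr on Debian or Ubuntu, or brew install tesseract on macOS), then the wrapper with pip install pytesseract.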
For additional languages, install the corresponding data packages, for example tesseract-ocr-deu for German or tesseract-ocr-chi-sim for Simplified Chinese.
Processing Scanned Documents
Scanned documents are where OCR pipelines earn their keep. Raw scans often have noise, skew, and uneven lighting. A preprocessing step makes a massive difference.
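As a rough sketch of such a preprocessing step, the function below converts a scan to grayscale, denoises it, and binarizes it with an adaptive threshold using OpenCV. The function name and default parameters are illustrative assumptions, and deskewing is left out to keep the example short.

```python
def preprocess_scan(image_path, denoise_strength=10, block_size=31, offset=15):
    """Grayscale -> denoise -> adaptive threshold. Returns a binarized array."""
    # OpenCV requires an odd neighborhood size for adaptive thresholding.
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be an odd number >= 3")

    import cv2  # pip install opencv-python

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(f"could not read image: {image_path}")

    # Non-local means denoising smooths scanner speckle while keeping edges.
    denoised = cv2.fastNlMeansDenoising(gray, None, denoise_strength)

    # Each pixel is compared with its local neighborhood, which copes with
    # uneven lighting better than a single global cutoff.
    return cv2.adaptiveThreshold(
        denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, block_size, offset,
    )
```

The returned array can be saved with cv2.imwrite and fed to either engine. Tune block_size and offset on a few representative pages rather than trusting the defaults.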
The use_angle_cls=True flag enables PaddleOCR’s text direction classifier, which handles rotated text (0 and 180 degrees). For documents scanned upside down, this is the difference between garbage output and correct text.
Handling Tables
Tables are the hardest part of document OCR. Neither PaddleOCR’s base pipeline nor Tesseract natively understands table structure. Both return text regions, not cells, so with plain OCR you need to reconstruct the grid yourself.
PaddleOCR’s PP-Structure module handles table recognition directly.
PP-Structure detects table regions and returns structured HTML. For complex tables with merged cells, accuracy drops, but it handles standard grid tables well.
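Here is a small sketch of pulling that HTML out, assuming the PaddleOCR 2.x PPStructure class, whose result is a list of region dicts with a "type" key and, for tables, a "res" dict containing "html". The helper names are hypothetical.

```python
def extract_table_html(regions):
    """Collect the HTML of every region that was labeled as a table."""
    tables = []
    for region in regions:
        if region.get("type") == "table":
            html = (region.get("res") or {}).get("html")
            if html:
                tables.append(html)
    return tables


def tables_from_image(image_path):
    import cv2
    from paddleocr import PPStructure

    engine = PPStructure(show_log=False)
    # PPStructure instances are callable on a BGR image array.
    return extract_table_html(engine(cv2.imread(image_path)))
```

Each returned string can be loaded with an HTML table parser such as pandas.read_html if you need rows and columns rather than markup.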
With Tesseract, you can use --psm 6 (assume a uniform block of text) and combine it with contour detection to approximate cell boundaries.
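One way this could look, assuming a table with visible ruling lines: isolate the lines with morphological opening, treat the gaps between them as cells, and OCR each cell. The kernel sizes, tolerances, and function names are guesses to tune per document, not a tested recipe.

```python
def group_into_rows(boxes, row_tolerance=10):
    """Sort (x, y, w, h) boxes into rows: top to bottom, then left to right."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        # Same row if the top edge is close to the row's first box.
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b[0]) for row in rows]


def find_cell_boxes(gray, min_size=15):
    import cv2

    # Invert so ruling lines are white, then keep only long strokes.
    _, inv = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    grid = cv2.add(cv2.morphologyEx(inv, cv2.MORPH_OPEN, h_kernel),
                   cv2.morphologyEx(inv, cv2.MORPH_OPEN, v_kernel))

    # White regions between grid lines approximate the cells.
    # [-2] picks the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(cv2.bitwise_not(grid), cv2.RETR_LIST,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    page_area = gray.shape[0] * gray.shape[1]
    boxes = [cv2.boundingRect(c) for c in contours]
    return [b for b in boxes
            if b[2] >= min_size and b[3] >= min_size and b[2] * b[3] < 0.9 * page_area]


def read_table(image_path):
    import cv2
    import pytesseract

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    rows = group_into_rows(find_cell_boxes(gray))
    return [[pytesseract.image_to_string(gray[y:y + h, x:x + w], config="--psm 6").strip()
             for (x, y, w, h) in row] for row in rows]
```

Borderless tables defeat this approach entirely, which is part of why the geometry work gets expensive.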
For production table extraction, I recommend PaddleOCR’s PP-Structure. Tesseract requires too much manual geometry work to reliably parse tables.
Multi-Language OCR
PaddleOCR supports 80+ languages. Pass the language code as the lang argument during initialization.
Each language downloads its own recognition model. PaddleOCR was originally built for Chinese text, so CJK languages get the best accuracy.
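Because each language loads its own model, it helps to create one engine per language and reuse it. The sketch below assumes the PaddleOCR 2.x lang values shown in the mapping. The mapping from ISO codes is a hypothetical convenience, so check the codes against the language list for your installed version.

```python
# Hypothetical mapping from ISO 639-1 codes to PaddleOCR 2.x `lang` values.
PADDLE_LANG = {
    "en": "en",
    "zh": "ch",
    "de": "german",
    "fr": "fr",
    "ja": "japan",
    "ko": "korean",
}

_engines = {}  # one PaddleOCR instance per language, created lazily


def paddle_lang(iso_code):
    try:
        return PADDLE_LANG[iso_code.lower()]
    except KeyError:
        raise ValueError(f"no PaddleOCR language mapped for {iso_code!r}") from None


def get_engine(iso_code):
    """Return a cached PaddleOCR instance for the given language."""
    lang = paddle_lang(iso_code)
    if lang not in _engines:
        from paddleocr import PaddleOCR

        _engines[lang] = PaddleOCR(use_angle_cls=True, lang=lang)
    return _engines[lang]
```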
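For Tesseract, specify the language with the -l flag on the command line, or the lang argument in pytesseract. Join several codes with a plus sign, as in eng+deu.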
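A short sketch of the pytesseract side, assuming the matching language data packages are installed. The helper names are placeholders.

```python
def tesseract_lang(*codes):
    """Join Tesseract language codes, e.g. ("eng", "deu") -> "eng+deu"."""
    cleaned = [code.strip() for code in codes if code and code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_multilang(image_path, *codes):
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path), lang=tesseract_lang(*codes)
    )
```

For example, ocr_multilang("letter.png", "eng", "deu") reads a page that mixes English and German.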
Confidence Scores and Filtering
PaddleOCR returns per-line confidence scores. Use them aggressively – low-confidence results are usually noise or misreads.
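A filtering step could look like the sketch below, which again assumes the PaddleOCR 2.x result shape of [box, (text, score)] entries with scores between 0 and 1. The 0.8 cutoff is an arbitrary starting point, not a recommended value.

```python
def filter_by_confidence(result, min_score=0.8):
    """Keep only non-empty lines whose recognition score meets the cutoff."""
    page = result[0] if result else None
    kept = []
    for box, (text, score) in (page or []):
        if score >= min_score and text.strip():
            kept.append({"text": text, "score": score, "box": box})
    return kept
```

Logging what gets dropped for a while is a cheap way to find a cutoff that suits your documents.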
Tesseract provides word-level confidence through pytesseract’s image_to_data function.
Tesseract confidence ranges from 0 to 100. Anything below 60 is suspect.
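Here is a sketch of that filter over the dictionary returned by image_to_data with output_type=pytesseract.Output.DICT. It assumes the "text" and "conf" keys and that layout-only entries carry a confidence of -1. The helper names are placeholders.

```python
def confident_words(data, min_conf=60):
    """Return (word, confidence) pairs at or above the cutoff."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # conf can arrive as str, int, or float depending on the version.
        conf = float(conf)
        if conf >= min_conf and text.strip():
            words.append((text, conf))
    return words


def words_from_image(image_path, min_conf=60):
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return confident_words(data, min_conf)
```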
Batch Processing Document Images
When you have hundreds of document pages to process, parallelize and cache results.
Each worker initializes its own PaddleOCR instance. The model loads once per process, so the overhead is the initial ~2 seconds per worker, then inference runs in parallel.
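The pattern described above can be sketched with a process pool whose initializer builds one engine per worker, plus a JSON cache keyed by a hash of the image bytes. This assumes the PaddleOCR 2.x API. The cache layout and function names are illustrative.

```python
import hashlib
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

_ocr = None  # set once per worker process by _init_worker


def _init_worker():
    global _ocr
    from paddleocr import PaddleOCR

    _ocr = PaddleOCR(use_angle_cls=True, lang="en", show_log=False)


def cache_file(image_path, cache_dir):
    """Name the cache entry after the image bytes so edited files are re-read."""
    digest = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
    return Path(cache_dir) / f"{digest}.json"


def _process(job):
    image_path, cache_dir = job
    target = cache_file(image_path, cache_dir)
    if target.exists():
        return json.loads(target.read_text(encoding="utf-8"))
    page = _ocr.ocr(image_path, cls=True)[0] or []
    lines = [{"text": text, "score": float(score)} for _box, (text, score) in page]
    target.write_text(json.dumps(lines), encoding="utf-8")
    return lines


def run_batch(image_paths, cache_dir="ocr_cache", workers=4):
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    jobs = [(str(path), cache_dir) for path in image_paths]
    with ProcessPoolExecutor(max_workers=workers, initializer=_init_worker) as pool:
        return list(pool.map(_process, jobs))
```

On platforms that spawn worker processes (Windows, and macOS by default), call run_batch from under an if __name__ == "__main__": guard.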
Post-Processing Extracted Text
Raw OCR output is messy. Common issues: extra whitespace, broken words across lines, misread characters. A post-processing step cleans things up.
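A minimal cleanup pass using only the standard library might look like this. It rejoins words hyphenated across line breaks, folds single line breaks into spaces, and collapses repeated whitespace. Misread characters are not handled here, and the hyphen rule will also merge genuine compounds that happen to break at the hyphen.

```python
import re


def clean_ocr_text(text):
    """Normalize whitespace and line breaks in raw OCR output."""
    # Rejoin words split across lines: "inter-\nnational" -> "international".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop spaces and tabs that surround line breaks.
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
    # Treat any run of blank lines as a single paragraph break.
    text = re.sub(r"\n{2,}", "\n\n", text)
    # A lone line break inside a paragraph becomes a space.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse repeated spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```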
For domain-specific documents (invoices, medical records, legal contracts), build a custom dictionary of expected terms and use fuzzy matching to correct OCR errors.
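One lightweight way to try this is difflib from the standard library. The sketch below swaps each token for its closest vocabulary term when the similarity clears a cutoff. The vocabulary, the 0.75 cutoff, and the whitespace tokenization are all assumptions to adapt.

```python
import difflib


def correct_terms(text, vocabulary, cutoff=0.75):
    """Replace tokens that closely match a known term with that term."""
    lookup = {term.lower(): term for term in vocabulary}
    candidates = list(lookup)
    corrected = []
    for token in text.split():
        match = difflib.get_close_matches(token.lower(), candidates, n=1, cutoff=cutoff)
        corrected.append(lookup[match[0]] if match else token)
    return " ".join(corrected)
```

With the vocabulary ["Invoice", "Total", "Amount"], the input "lnvoice Tota1 due" comes back as "Invoice Total due". Keep the cutoff high and the vocabulary narrow, because short ordinary words can otherwise be "corrected" into domain terms.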
When to Use Which Tool
| Criteria | PaddleOCR | Tesseract |
|---|---|---|
| Accuracy on messy images | Excellent | Fair |
| Setup complexity | pip install (downloads models) | System package + pip |
| Speed (CPU) | Slower (~1-3s per page) | Faster (~0.3-1s per page) |
| GPU support | Yes (PaddlePaddle GPU) | No |
| Table extraction | Built-in (PP-Structure) | Manual work required |
| CJK languages | Best in class | Acceptable |
| Latin-script docs | Great | Great |
| Rotated text | Handles via angle classifier | Needs manual deskewing |
| Dependency footprint | ~500MB (PaddlePaddle + models) | ~30MB (system package) |
My recommendation: start with PaddleOCR for anything beyond clean, English-only printed documents. The accuracy gap is real, especially on receipts, handwritten annotations, or documents with mixed layouts. Fall back to Tesseract for lightweight pipelines where you need minimal dependencies and the input quality is consistently high.
Common Errors and Fixes
PaddleOCR: No module named 'paddle'
You installed paddleocr but not paddlepaddle. They are separate packages, so install the framework as well with pip install paddlepaddle.
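A quick standard-library check for this situation: the paddlepaddle package provides the paddle module, so both import names need to resolve. The function name is a placeholder.

```python
import importlib.util


def missing_modules(modules=("paddle", "paddleocr")):
    """Return the top-level module names that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]
```

An empty list means both packages are visible to the current interpreter. If paddle is listed, you are either missing paddlepaddle or running a different virtual environment than the one you installed into.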
PaddleOCR: results[0] returns None
This means the detector found no text regions. Check that your image is not too small (upscale to at least 640px on the shortest side) and that it actually contains text. Preprocessing with binarization helps on low-contrast images.
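A defensive sketch for both points: treat a None page as an empty list, and upscale small images before OCR. It assumes Pillow for resizing. The helper names are illustrative, and the 640px floor comes from the guidance above.

```python
def lines_or_empty(result):
    """Return the first page of a PaddleOCR 2.x result, or [] if it is None."""
    page = result[0] if result else None
    return page or []


def upscale_factor(width, height, min_side=640):
    """Scale factor that brings the shortest side up to min_side (never below 1)."""
    shortest = min(width, height)
    if shortest <= 0:
        raise ValueError("image dimensions must be positive")
    return max(1.0, min_side / shortest)


def upscale_image(image_path, min_side=640):
    from PIL import Image

    img = Image.open(image_path)
    factor = upscale_factor(img.width, img.height, min_side)
    if factor > 1.0:
        new_size = (round(img.width * factor), round(img.height * factor))
        img = img.resize(new_size, Image.LANCZOS)
    return img
```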
Tesseract: TesseractNotFoundError
The Python wrapper cannot find the tesseract binary. Either install the system package or set the path manually by assigning the full path of the binary to pytesseract.pytesseract.tesseract_cmd.
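A sketch of setting the path programmatically: look on PATH first, then fall back to a list of candidate locations. The fallback paths are common install locations used here as assumptions, so adjust them for your system.

```python
import os
import shutil


def find_tesseract(binary_name="tesseract", extra_paths=()):
    """Return the path of the tesseract binary, or None if it is not found."""
    found = shutil.which(binary_name)
    if found:
        return found
    for candidate in extra_paths:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def configure_pytesseract(extra_paths=(
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",  # assumed Windows default
    "/opt/homebrew/bin/tesseract",                    # assumed Homebrew (Apple Silicon)
    "/usr/local/bin/tesseract",
)):
    import pytesseract

    path = find_tesseract(extra_paths=extra_paths)
    if path is None:
        raise RuntimeError("tesseract binary not found; install the system package")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```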
Tesseract: garbled output on clean images
Wrong page segmentation mode. Tesseract defaults to --psm 3 (fully automatic), which sometimes misclassifies the layout. Try --psm 6 for a single block or --psm 4 for a single column.
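With pytesseract, the mode goes through the config argument. A small sketch follows, with placeholder helper names:

```python
def psm_config(mode):
    """Build the config string for a Tesseract page segmentation mode."""
    if mode not in range(14):
        raise ValueError("Tesseract page segmentation modes run from 0 to 13")
    return f"--psm {mode}"


def ocr_with_psm(image_path, mode=6):
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path), config=psm_config(mode)
    )
```

Running the same page through modes 3, 4, and 6 and comparing the output is usually faster than reasoning about which mode should work.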
PaddleOCR: extremely slow on CPU
PaddleOCR runs the full detection + classification + recognition pipeline. Disable angle classification if you know all text is upright by passing use_angle_cls=False.
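Also consider raising det_db_box_thresh above its default to skip marginal detections early, reducing recognition calls. The default differs between releases (recent 2.x versions ship 0.6, while 0.3 is the default of the separate det_db_thresh parameter), so check the value your installed version uses before picking a new one.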
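Both settings can be combined as in the sketch below, assuming the PaddleOCR 2.x constructor arguments use_angle_cls and det_db_box_thresh. The threshold is deliberately a required argument because the right value depends on your version and data.

```python
def fast_cpu_options(box_thresh, upright_only=True):
    """Constructor options that trade some recall for CPU speed."""
    if not 0.0 < box_thresh < 1.0:
        raise ValueError("box_thresh must be between 0 and 1")
    return {
        "use_angle_cls": not upright_only,  # skip the classifier for upright text
        "det_db_box_thresh": box_thresh,    # drop low-scoring boxes before recognition
    }


def run_fast(image_path, box_thresh, upright_only=True):
    from paddleocr import PaddleOCR

    options = fast_cpu_options(box_thresh, upright_only)
    ocr = PaddleOCR(lang="en", **options)
    return ocr.ocr(image_path, cls=options["use_angle_cls"])
```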
Both tools: text from adjacent columns merges into one line
OCR engines read left-to-right by default. For multi-column layouts, segment the image into columns first, then run OCR on each column separately. PaddleOCR’s PP-Structure layout analysis mode can detect column regions automatically.
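If you want to segment columns yourself, one simple heuristic is a vertical projection profile: count text pixels at each x-position and split wherever a wide empty run appears. This is a stand-in technique, not something either library provides, and the function name and thresholds are hypothetical.

```python
def column_spans(ink_per_x, min_gap=20, min_width=50):
    """Return (start, end) x-ranges of text columns; end is exclusive.

    ink_per_x holds the number of text pixels at each x-position. With a
    binarized NumPy image where text is nonzero, it can be computed as
    (binary > 0).sum(axis=0).tolist().
    """
    spans = []
    start = None  # x where the current column began
    gap = 0       # length of the current run of empty x-positions
    for x, ink in enumerate(ink_per_x):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                end = x - gap + 1
                if end - start >= min_width:
                    spans.append((start, end))
                start = None
                gap = 0
    if start is not None:
        end = len(ink_per_x) - gap
        if end - start >= min_width:
            spans.append((start, end))
    return spans
```

Crop the image to each span and run OCR per crop. Skewed pages blur the gaps between columns, so deskew before computing the profile.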