Scene Text vs Document OCR
Scene text recognition is a different beast from scanning a clean PDF. You are dealing with text on storefronts, road signs, restaurant menus, product labels, and receipts crumpled in someone’s pocket. The text is at odd angles, partially occluded, warped on curved surfaces, and fighting for attention against cluttered backgrounds.
PaddleOCR handles this well because its detection model (DB++) was trained on scene text datasets like ICDAR and Total-Text, not just scanned documents. The recognition model uses SVTR, which handles variable-length text on irregular backgrounds.
Here is the fastest way to read text from a photo of a street sign:
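A minimal sketch using PaddleOCR's classic Python API. The image path is a placeholder, and `read_sign`/`flatten_page` are small helper names of our own, not part of the library:

```python
def read_sign(path):
    """Run PaddleOCR on one photo; downloads models (~100 MB) on first call."""
    from paddleocr import PaddleOCR  # heavy import kept local to the helper

    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    return ocr.ocr(path, cls=True)

def flatten_page(page):
    """Unpack one page of PaddleOCR output into (text, confidence, box) tuples."""
    return [(text, conf, box) for box, (text, conf) in page]
```

Typical usage would be `for text, conf, box in flatten_page(read_sign("street_sign.jpg")[0]): print(text, conf)`.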
That gets you detected text, bounding box coordinates, and a confidence score for each text region. The rest of this guide covers how to turn that into a real pipeline.
Installation
PaddleOCR needs PaddlePaddle as its backend. Install the CPU version to start:
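Something like the following (pin exact versions in production; release names change over time):

```shell
pip install paddlepaddle paddleocr
```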
For GPU acceleration with CUDA 11.8:
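The GPU wheel ships separately. The index URL below follows PaddlePaddle's documented pattern for CUDA 11.8 builds; verify it against the current install page before relying on it:

```shell
pip install paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
pip install paddleocr
```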
PaddleOCR downloads pretrained models on first run. They cache to ~/.paddleocr/ and total around 100MB for one language. You do not need to manage model files manually.
Verify the installation works:
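A one-liner is enough; note that constructing the class triggers the model download, so the first run takes a minute:

```shell
python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en'); print('PaddleOCR ready')"
```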
Processing Scene Images
Scene images need different handling than documents. You are not dealing with a flat white page. The text might be on a curved bottle, a tilted sign, or painted on a wall at an angle. PaddleOCR’s angle classifier helps, but you also want to think about image resolution and preprocessing.
The det_db_unclip_ratio parameter matters for scene text. Raising it above the default of 1.5 (try 1.8-2.0) expands detected bounding boxes, which catches text that bleeds to the edge of the detection region. For tightly packed signs, lower it to 1.2 to avoid merging adjacent text blocks.
Drawing Results on Images
Visualizing detections is critical for debugging scene text pipelines. You need to see where the model thinks text is and what it read.
PaddleOCR returns four-point polygon bounding boxes, not axis-aligned rectangles. This is important for scene text because text on signs is often rotated or in perspective. Use cv2.polylines instead of cv2.rectangle to draw them accurately.
Extracting Structured Results
Raw OCR output is a nested list. For downstream processing, flatten it into something usable:
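One way to do that, sketched as a standalone helper (`extract_regions` is our name; the dict keys are a choice, not a PaddleOCR convention):

```python
def extract_regions(page):
    """Flatten one page of PaddleOCR output into dicts in rough reading order."""
    regions = []
    for box, (text, conf) in page:
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        regions.append({
            "text": text,
            "confidence": conf,
            "box": box,
            "center": (sum(xs) / 4, sum(ys) / 4),  # centroid of the quadrilateral
        })
    # Sort top-to-bottom, then left-to-right, by polygon center
    regions.sort(key=lambda r: (r["center"][1], r["center"][0]))
    return regions
```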
Sorting by center coordinates gives you a natural reading order for scene text. For documents you would sort by line position, but scene text is scattered across the image, so center-based sorting is more predictable.
Handling Different Languages
Scene text in the real world is multilingual. A photo from a Tokyo street has Japanese, English, and sometimes Chinese all in one frame. PaddleOCR supports 80+ languages, but you need to pick the right one at initialization.
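A few of the language codes (the full list in PaddleOCR's docs is much longer; `LANG_CODES` and `recognizer_for` are our own names):

```python
LANG_CODES = {
    "english": "en",
    "chinese": "ch",
    "japanese": "japan",
    "korean": "korean",
    "french": "french",
    "german": "german",
}

def recognizer_for(script):
    """Build a PaddleOCR instance for one script, e.g. recognizer_for('japanese')."""
    from paddleocr import PaddleOCR

    return PaddleOCR(use_angle_cls=True, lang=LANG_CODES[script])
```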
There is no reliable automatic language detection for scene text. If your images contain multiple scripts, run multiple recognizers and pick the result with the highest average confidence:
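A sketch of that confidence vote. The selection logic is plain Python over the result structure; `ocr_multilang` shows how it would be wired to real recognizers (the image path and language set are placeholders):

```python
def avg_confidence(page):
    """Mean recognition confidence over one page; 0.0 if nothing was detected."""
    if not page:
        return 0.0
    return sum(conf for _box, (_text, conf) in page) / len(page)

def pick_best_language(results_by_lang):
    """Return the language code whose recognizer was most confident on average."""
    return max(results_by_lang, key=lambda code: avg_confidence(results_by_lang[code]))

def ocr_multilang(path, codes=("en", "japan", "ch")):
    """Run the full pipeline once per language and keep the most confident result."""
    from paddleocr import PaddleOCR

    results = {}
    for code in codes:
        ocr = PaddleOCR(use_angle_cls=True, lang=code)
        results[code] = ocr.ocr(path, cls=True)[0] or []
    best = pick_best_language(results)
    return best, results[best]
```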
This is slow because it runs the full pipeline once per language. For production, train a lightweight script classifier or use the detection model alone first, then route to the right recognizer.
Batch Processing Scene Images
Processing a folder of photos from a field survey, a set of product images, or frames extracted from video:
For large batches, initialize PaddleOCR once and reuse it. Creating a new instance per image wastes 2-3 seconds loading models each time. If you need parallelism, use ProcessPoolExecutor with one PaddleOCR instance per worker process – the models are not thread-safe.
Tuning Detection for Scene Text
The default PaddleOCR settings work well for documents but can miss small or low-contrast scene text. Adjust these parameters:
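One way to collect the scene-oriented settings in a single place. The values are starting points, not universal defaults; tune them against your own images:

```python
SCENE_DET_PARAMS = {
    "det_db_thresh": 0.25,       # pixel threshold; lower finds fainter text
    "det_db_box_thresh": 0.4,    # box confidence threshold; lower keeps weaker boxes
    "det_db_unclip_ratio": 1.8,  # expand boxes so edge characters are not clipped
    "det_limit_side_len": 1280,  # resize limit; higher preserves small distant text
}

def scene_detector(lang="en"):
    """Build a PaddleOCR instance with detection tuned for scene photos."""
    from paddleocr import PaddleOCR

    return PaddleOCR(use_angle_cls=True, lang=lang, **SCENE_DET_PARAMS)
```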
The det_limit_side_len parameter controls the maximum dimension the detector sees. PaddleOCR resizes images to this limit before detection. For scene images with small text far away (like street signs in a wide shot), bump it to 1280 or even 1920. This uses more memory but catches text the default 960px limit would miss.
Common Errors and Fixes
results[0] is None or empty list
PaddleOCR found no text in the image. Common causes: image too small (upscale to 640px minimum), text is too faint against the background, or the image is mostly non-text. Lower det_db_thresh to 0.2 and det_db_box_thresh to 0.3 to make detection more aggressive.
Detected boxes are too tight, cutting off characters
Increase det_db_unclip_ratio from the default 1.5 to 2.0 or 2.5. This expands each detected polygon outward. Scene text on signs often has tight spacing between the text and the sign edge.
Wrong text on rotated signs
Make sure use_angle_cls=True is set and pass cls=True to the ocr() call. PaddleOCR’s angle classifier only handles 0 and 180 degree rotation. For text at 90 degrees (vertical signs), rotate the image manually first:
No module named 'paddle'
You installed paddleocr but forgot paddlepaddle. They are separate packages:
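Install both, in either order:

```shell
pip install paddlepaddle   # the deep learning framework (backend)
pip install paddleocr      # the OCR toolkit that runs on top of it
```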
Recognition returns garbled text for non-Latin scripts
You initialized PaddleOCR with the wrong language. The recognition model is language-specific. If you pass a Japanese image to an English recognizer, you get nonsense. Always match the lang parameter to the text in your images.
Out of memory on large images
Lower det_limit_side_len to 960 or 640. Large images eat GPU/CPU memory during detection. Alternatively, tile the image into overlapping crops and merge results:
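A sketch of the tiling approach. `tile_windows` and `shift_page` are our own helpers; the tile size and overlap are starting points, and overlapping tiles can detect the same text twice, so deduplicate by box overlap if that matters downstream:

```python
def tile_windows(width, height, tile=1024, overlap=128):
    """Yield (x, y, w, h) crop windows that cover the image with some overlap."""
    step = tile - overlap
    for y in range(0, max(height - overlap, 1), step):
        for x in range(0, max(width - overlap, 1), step):
            yield x, y, min(tile, width - x), min(tile, height - y)

def shift_page(page, ox, oy):
    """Map one tile's boxes back into full-image coordinates."""
    return [[[[px + ox, py + oy] for px, py in box], rec] for box, rec in page]

def ocr_tiled(path):
    """OCR a large image tile by tile and merge the shifted results."""
    import cv2
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang="en", det_limit_side_len=1024)
    img = cv2.imread(path)
    merged = []
    for x, y, w, h in tile_windows(img.shape[1], img.shape[0]):
        page = ocr.ocr(img[y:y + h, x:x + w], cls=True)[0] or []
        merged.extend(shift_page(page, x, y))
    return merged
```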
Slow inference on CPU
Disable the angle classifier if all your text is upright (use_angle_cls=False). Set rec_batch_num to a lower value like 4 to reduce memory pressure. For production, use the GPU version of PaddlePaddle – scene text inference is 5-10x faster on a GPU.
Related Guides
- How to Build a Document Comparison Pipeline with Vision Models
- How to Build a Lane Detection Pipeline with OpenCV and YOLO
- How to Build a Real-Time Pose Estimation Pipeline with MediaPipe
- How to Build a Vehicle Counting Pipeline with YOLOv8 and OpenCV
- How to Build an OCR Pipeline with PaddleOCR and Tesseract
- How to Build a Video Shot Boundary Detection Pipeline with PySceneDetect
- How to Build a Video Surveillance Analytics Pipeline with YOLOv8
- How to Build Hand Gesture Recognition with MediaPipe and Python
- How to Build Video Analytics Pipelines with OpenCV and Deep Learning
- How to Build a Receipt Scanner with OCR and Structured Extraction