Scene Text vs Document OCR

Scene text recognition is a different beast from scanning a clean PDF. You are dealing with text on storefronts, road signs, restaurant menus, product labels, and receipts crumpled in someone’s pocket. The text is at odd angles, partially occluded, warped on curved surfaces, and fighting for attention against cluttered backgrounds.

PaddleOCR handles this well because its detection model (DB++) was trained on scene text datasets like ICDAR and Total-Text, not just scanned documents. The recognition model uses SVTR, which handles variable-length text and the irregular, distorted shapes common in scene text.

Here is the fastest way to read text from a photo of a street sign:

from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")
result = ocr.ocr("street_sign.jpg", cls=True)

for line in result[0]:
    bbox, (text, confidence) = line
    print(f"{text} ({confidence:.2f})")

That gets you detected text, bounding box coordinates, and a confidence score for each text region. The rest of this guide covers how to turn that into a real pipeline.

Installation

PaddleOCR needs PaddlePaddle as its backend. Install the CPU version to start:

pip install paddlepaddle paddleocr opencv-python numpy

For GPU acceleration with CUDA 11.8:

pip install paddlepaddle-gpu paddleocr opencv-python numpy

PaddleOCR downloads pretrained models on first run. They cache to ~/.paddleocr/ and total around 100MB for one language. You do not need to manage model files manually.

Verify the installation works:

python -c "from paddleocr import PaddleOCR; print('PaddleOCR ready')"

Processing Scene Images

Scene images need different handling than documents. You are not dealing with a flat white page. The text might be on a curved bottle, a tilted sign, or painted on a wall at an angle. PaddleOCR’s angle classifier helps, but you also want to think about image resolution and preprocessing.

import cv2
import numpy as np
from paddleocr import PaddleOCR

ocr = PaddleOCR(
    use_angle_cls=True,
    lang="en",
    det_db_thresh=0.3,       # detection binarization threshold
    det_db_box_thresh=0.5,   # minimum score for a detected box
    det_db_unclip_ratio=1.6, # expand detected regions slightly
    show_log=False,
)

img = cv2.imread("shop_front.jpg")

# Upscale small images -- PaddleOCR struggles below 640px
height, width = img.shape[:2]
if max(height, width) < 640:
    scale = 640 / max(height, width)
    img = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

# ocr() accepts a numpy array directly -- no need to round-trip through disk
result = ocr.ocr(img, cls=True)

for line in result[0]:
    bbox, (text, conf) = line
    if conf > 0.6:
        coords = np.array(bbox).astype(int)
        top_left = tuple(coords[0])
        print(f"Text: '{text}' at {top_left} confidence={conf:.2f}")

The det_db_unclip_ratio parameter matters for scene text. Raising it above the default 1.5 (toward 1.8-2.0) expands detected bounding boxes, which catches text that would otherwise bleed past the edge of the detection region. For tightly packed signs, lower it to around 1.2 to avoid merging adjacent text blocks.
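To build intuition for what the ratio does, here is the arithmetic, assuming DB's standard unclip formula (each polygon is offset outward by area × ratio / perimeter, applied via pyclipper):

```python
def unclip_offset(width, height, unclip_ratio):
    """DB's expansion distance for a box: area * ratio / perimeter."""
    area = width * height
    perimeter = 2 * (width + height)
    return area * unclip_ratio / perimeter

# A 100x20px text box at ratio 1.6 grows by ~13px on every side
print(round(unclip_offset(100, 20, 1.6), 1))  # → 13.3
```

Note the offset scales with box size, so a small ratio change moves large sign text much more than fine print.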

Drawing Results on Images

Visualizing detections is critical for debugging scene text pipelines. You need to see where the model thinks text is and what it read.

import cv2
import numpy as np
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en", show_log=False)
img = cv2.imread("restaurant_menu.jpg")
result = ocr.ocr("restaurant_menu.jpg", cls=True)

for line in result[0]:
    bbox, (text, conf) = line
    if conf < 0.5:
        continue

    # Draw the polygon bounding box
    pts = np.array(bbox, dtype=np.int32)
    cv2.polylines(img, [pts], isClosed=True, color=(0, 255, 0), thickness=2)

    # Put the recognized text above the bounding box
    x, y = int(pts[0][0]), int(pts[0][1]) - 10
    label = f"{text} ({conf:.0%})"
    cv2.putText(img, label, (x, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

cv2.imwrite("menu_annotated.jpg", img)
print("Saved annotated image to menu_annotated.jpg")

PaddleOCR returns four-point polygon bounding boxes, not axis-aligned rectangles. This is important for scene text because text on signs is often rotated or in perspective. Use cv2.polylines instead of cv2.rectangle to draw them accurately.
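If a downstream step wants an axis-aligned rectangle anyway (for cropping a region out of the image, say), take the min/max of the polygon's points — a minimal helper:

```python
def polygon_to_rect(bbox):
    """Convert a 4-point polygon [[x, y], ...] to an axis-aligned (x, y, w, h)."""
    xs = [p[0] for p in bbox]
    ys = [p[1] for p in bbox]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)

# A slightly rotated quadrilateral from the detector
x, y, w, h = polygon_to_rect([[10, 12], [98, 8], [100, 40], [12, 44]])
print(x, y, w, h)  # → 10 8 90 36
# crop = img[y:y + h, x:x + w]  -- axis-aligned crop around the rotated text
```

The rectangle is the polygon's bounding box, so it includes some background for rotated text; that is usually fine for cropping but not for tight visualization.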

Extracting Structured Results

Raw OCR output is a nested list. For downstream processing, flatten it into something usable:

from dataclasses import dataclass
from paddleocr import PaddleOCR

@dataclass
class TextRegion:
    text: str
    confidence: float
    bbox: list  # [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
    center_x: float
    center_y: float

def extract_regions(image_path: str, min_confidence: float = 0.6) -> list[TextRegion]:
    ocr = PaddleOCR(use_angle_cls=True, lang="en", show_log=False)
    result = ocr.ocr(image_path, cls=True)

    regions = []
    for line in result[0] or []:
        bbox, (text, conf) = line
        if conf < min_confidence:
            continue

        cx = sum(p[0] for p in bbox) / 4
        cy = sum(p[1] for p in bbox) / 4
        regions.append(TextRegion(
            text=text.strip(),
            confidence=round(conf, 3),
            bbox=bbox,
            center_x=round(cx, 1),
            center_y=round(cy, 1),
        ))

    # Sort top-to-bottom, then left-to-right
    regions.sort(key=lambda r: (r.center_y, r.center_x))
    return regions

regions = extract_regions("storefront.jpg", min_confidence=0.7)
for r in regions:
    print(f"[{r.confidence:.0%}] {r.text} at ({r.center_x}, {r.center_y})")

Sorting by center coordinates gives you a natural reading order for scene text. For documents you would sort by line position, but scene text is scattered across the image, so center-based sorting is more predictable.
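When a strict center sort interleaves text from the same visual line, one common refinement (not part of PaddleOCR itself) is to bucket regions into rows by vertical proximity before sorting left-to-right. A minimal sketch using (text, center_x, center_y) tuples; the 15px tolerance is an assumption you should tune to your image sizes:

```python
def group_into_rows(regions, y_tolerance=15):
    """Group (text, center_x, center_y) tuples into visual rows."""
    rows = []
    for region in sorted(regions, key=lambda r: r[2]):  # top to bottom
        # Join the previous row if vertically close, else start a new row
        if rows and abs(region[2] - rows[-1][-1][2]) <= y_tolerance:
            rows[-1].append(region)
        else:
            rows.append([region])
    return [sorted(row, key=lambda r: r[1]) for row in rows]  # left to right

regions = [("WORLD", 120, 52), ("HELLO", 30, 50), ("SALE", 40, 110)]
for row in group_into_rows(regions):
    print(" ".join(r[0] for r in row))
# → HELLO WORLD
# → SALE
```

This keeps "HELLO" and "WORLD" on one line even though their center-y values differ slightly, which a plain (y, x) sort can get wrong.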

Handling Different Languages

Scene text in the real world is multilingual. A photo from a Tokyo street has Japanese, English, and sometimes Chinese all in one frame. PaddleOCR supports 80+ languages, but you need to pick the right one at initialization.

from paddleocr import PaddleOCR

# Japanese scene text (signs, menus, product labels)
ocr_ja = PaddleOCR(use_angle_cls=True, lang="japan", show_log=False)
result = ocr_ja.ocr("tokyo_street.jpg", cls=True)

for line in result[0] or []:
    bbox, (text, conf) = line
    print(f"{text} ({conf:.2f})")

# Korean scene text
ocr_ko = PaddleOCR(use_angle_cls=True, lang="korean", show_log=False)

# Chinese scene text (PaddleOCR's strongest language)
ocr_ch = PaddleOCR(use_angle_cls=True, lang="ch", show_log=False)

There is no reliable automatic language detection for scene text. If your images contain multiple scripts, run multiple recognizers and pick the result with the highest average confidence:

from paddleocr import PaddleOCR

def detect_best_language(image_path: str, languages: list[str]) -> tuple[str, list]:
    best_lang = ""
    best_score = 0.0
    best_result = []

    for lang in languages:
        ocr = PaddleOCR(use_angle_cls=True, lang=lang, show_log=False)
        result = ocr.ocr(image_path, cls=True)
        lines = result[0] or []

        if not lines:
            continue

        avg_conf = sum(line[1][1] for line in lines) / len(lines)
        if avg_conf > best_score:
            best_score = avg_conf
            best_lang = lang
            best_result = lines

    return best_lang, best_result

lang, lines = detect_best_language("mystery_sign.jpg", ["en", "ch", "japan", "korean"])
print(f"Detected language: {lang}")
for line in lines:
    print(f"  {line[1][0]}")

This is slow because it runs the full pipeline once per language. For production, train a lightweight script classifier or use the detection model alone first, then route to the right recognizer.

Batch Processing Scene Images

Processing a folder of photos from a field survey, a set of product images, or frames extracted from video:

import json
from pathlib import Path
from paddleocr import PaddleOCR

def process_scene_batch(image_dir: str, output_path: str, lang: str = "en"):
    ocr = PaddleOCR(use_angle_cls=True, lang=lang, show_log=False)
    image_files = sorted(Path(image_dir).glob("*.jpg")) + sorted(Path(image_dir).glob("*.png"))

    all_results = []
    for img_path in image_files:
        result = ocr.ocr(str(img_path), cls=True)
        lines = result[0] or []

        extracted = []
        for line in lines:
            bbox, (text, conf) = line
            if conf < 0.5:
                continue
            extracted.append({
                "text": text.strip(),
                "confidence": round(conf, 3),
                "bbox": [[round(c, 1) for c in pt] for pt in bbox],
            })

        all_results.append({
            "file": img_path.name,
            "text_count": len(extracted),
            "regions": extracted,
        })
        print(f"Processed {img_path.name}: {len(extracted)} text regions")

    Path(output_path).write_text(json.dumps(all_results, indent=2, ensure_ascii=False))
    print(f"Saved {len(all_results)} results to {output_path}")

process_scene_batch("field_photos/", "scene_text_results.json")

For large batches, initialize PaddleOCR once and reuse it. Creating a new instance per image wastes 2-3 seconds loading models each time. If you need parallelism, use ProcessPoolExecutor with one PaddleOCR instance per worker process – the models are not thread-safe.

Tuning Detection for Scene Text

The default PaddleOCR settings work well for documents but can miss small or low-contrast scene text. Adjust these parameters:

from paddleocr import PaddleOCR

ocr = PaddleOCR(
    use_angle_cls=True,
    lang="en",
    det_db_thresh=0.2,         # lower = more sensitive detection (default 0.3)
    det_db_box_thresh=0.4,     # lower = keep weaker detections (default 0.5)
    det_db_unclip_ratio=2.0,   # higher = larger bounding boxes (default 1.5)
    rec_batch_num=16,          # batch size for recognition -- higher uses more RAM
    det_limit_side_len=1280,   # max image dimension for detection (default 960)
    show_log=False,
)

# Process a high-res scene image without downscaling
result = ocr.ocr("billboard_4k.jpg", cls=True)

for line in result[0] or []:
    bbox, (text, conf) = line
    print(f"[{conf:.2f}] {text}")

The det_limit_side_len parameter controls the maximum dimension the detector sees. PaddleOCR resizes images to this limit before detection. For scene images with small text far away (like street signs in a wide shot), bump it to 1280 or even 1920. This uses more memory but catches text the default 960px limit would miss.
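Back-of-envelope arithmetic, with a hypothetical 4K-wide photo, shows why the limit matters:

```python
# How much the detector shrinks a frame at each side limit
image_width = 3840  # hypothetical 4K-wide photo
for limit in (960, 1280, 1920):
    scale = min(1.0, limit / image_width)
    print(f"det_limit_side_len={limit}: image scaled to {scale:.0%}, "
          f"a 20px-tall sign becomes ~{20 * scale:.0f}px")
```

At the default 960 limit, a 20px sign shrinks to roughly 5px before detection even runs, which is below what most detectors can resolve.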

Common Errors and Fixes

result[0] is None or an empty list

PaddleOCR found no text in the image. Common causes: image too small (upscale to 640px minimum), text is too faint against the background, or the image is mostly non-text. Lower det_db_thresh to 0.2 and det_db_box_thresh to 0.3 to make detection more aggressive.

Detected boxes are too tight, cutting off characters

Increase det_db_unclip_ratio from the default 1.5 to 2.0 or 2.5. This expands each detected polygon outward. Scene text on signs often has tight spacing between the text and the sign edge.

Wrong text on rotated signs

Make sure use_angle_cls=True is set and pass cls=True to the ocr() call. PaddleOCR’s angle classifier only handles 0 and 180 degree rotation. For text at 90 degrees (vertical signs), rotate the image manually first:

import cv2

img = cv2.imread("vertical_sign.jpg")
rotated = cv2.rotate(img, cv2.ROTATE_90_COUNTERCLOCKWISE)
cv2.imwrite("/tmp/rotated.jpg", rotated)

No module named 'paddle'

You installed paddleocr but forgot paddlepaddle. They are separate packages:

pip install paddlepaddle paddleocr

Recognition returns garbled text for non-Latin scripts

You initialized PaddleOCR with the wrong language. The recognition model is language-specific. If you pass a Japanese image to an English recognizer, you get nonsense. Always match the lang parameter to the text in your images.

Out of memory on large images

Lower det_limit_side_len to 960 or 640. Large images eat GPU/CPU memory during detection. Alternatively, tile the image into overlapping crops and merge results:

import cv2
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en", show_log=False)

img = cv2.imread("huge_panorama.jpg")
h, w = img.shape[:2]
tile_size = 1280
overlap = 200  # overlap so text cut by a tile border is seen whole in a neighbor
all_texts = []

for y in range(0, h, tile_size - overlap):
    for x in range(0, w, tile_size - overlap):
        tile = img[y:y + tile_size, x:x + tile_size]
        result = ocr.ocr(tile, cls=True)  # ocr() accepts a numpy array directly
        for line in result[0] or []:
            bbox, (text, conf) = line
            if conf > 0.6:
                all_texts.append(text)

# Text in the overlap zones can be read twice -- dedupe downstream if it matters
print(f"Found {len(all_texts)} text regions across tiles")

Slow inference on CPU

Disable the angle classifier if all your text is upright (use_angle_cls=False). Set rec_batch_num to a lower value like 4 to reduce memory pressure. For production, use the GPU version of PaddlePaddle – scene text inference is 5-10x faster on a GPU.
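Putting those CPU tips together, a configuration sketch; enable_mkldnn and cpu_threads are predictor flags in recent PaddleOCR releases, so verify them against your installed version:

```python
from paddleocr import PaddleOCR

# CPU-oriented configuration -- check these flags against your PaddleOCR version
ocr = PaddleOCR(
    use_angle_cls=False,   # skip the classifier when all text is upright
    lang="en",
    enable_mkldnn=True,    # oneDNN acceleration on x86 CPUs
    cpu_threads=8,         # match your physical core count
    rec_batch_num=4,       # smaller recognition batches ease memory pressure
    show_log=False,
)
```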