Receipt photos are messy. They’re crumpled, skewed, faded, and printed on thermal paper that loves to lose contrast. But the data on them – merchant name, line items, prices, tax, total – follows predictable patterns. That makes receipts a great target for OCR plus structured extraction. Here’s the full pipeline: preprocess the image with OpenCV, run OCR with PaddleOCR, then parse the raw text into usable fields with regex and heuristics.
```shell
pip install paddlepaddle paddleocr opencv-python-headless numpy
```
## Preprocessing the Image with OpenCV
Raw receipt photos need cleanup before OCR can do its job. The three operations that matter most are grayscale conversion, adaptive thresholding, and deskewing. Grayscale removes color noise. Thresholding sharpens faded text against the background. Deskewing corrects the tilt from handheld camera shots.
```python
import cv2
import numpy as np


def preprocess_receipt(image_path: str) -> np.ndarray:
    """Load a receipt image and prepare it for OCR."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Adaptive threshold handles uneven lighting across the receipt
    thresh = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, blockSize=15, C=10
    )
    # Deskew: find the dominant text angle and rotate to correct it.
    # THRESH_BINARY leaves the text black, so collect the dark pixels
    # (minAreaRect needs float32 points in recent OpenCV versions).
    coords = np.column_stack(np.where(thresh == 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    h, w = thresh.shape
    center = (w // 2, h // 2)
    rotation_matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
    deskewed = cv2.warpAffine(
        thresh, rotation_matrix, (w, h),
        flags=cv2.INTER_CUBIC,
        borderMode=cv2.BORDER_REPLICATE
    )
    return deskewed
```
The blockSize=15 and C=10 values work well for standard thermal receipts. If your receipts are from inkjet or laser printers, try blockSize=11 and C=8. The deskew step uses minAreaRect on all non-black pixels to find the dominant angle – anything within a few degrees of horizontal gets corrected.
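If you process receipts from mixed sources, it helps to keep the two parameter sets in one place instead of editing the function. A minimal sketch — the profile names and this helper are my own, not part of OpenCV; the values are the ones suggested above:

```python
# Adaptive-threshold presets per printer type. The pairs are the values
# discussed in the text; "thermal" is the default.
THRESHOLD_PROFILES = {
    "thermal": {"blockSize": 15, "C": 10},
    "inkjet_laser": {"blockSize": 11, "C": 8},
}


def threshold_params(profile: str) -> dict:
    """Return adaptive-threshold parameters for a receipt printer type."""
    return THRESHOLD_PROFILES.get(profile, THRESHOLD_PROFILES["thermal"])
```

`preprocess_receipt` could then accept a `profile` argument and unpack the returned dict into the `cv2.adaptiveThreshold` call.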
For heavily wrinkled or curved receipts, you might need morphological operations to close gaps in characters:
```python
def enhance_for_wrinkled_receipt(thresh_img: np.ndarray) -> np.ndarray:
    """Apply morphological closing to reconnect broken characters."""
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    closed = cv2.morphologyEx(thresh_img, cv2.MORPH_CLOSE, kernel)
    return closed
```
## Running OCR with PaddleOCR
PaddleOCR handles the text detection and recognition in one call. It returns bounding boxes with their text content and confidence scores. The use_angle_cls=True flag enables automatic text direction classification, which matters when receipt text occasionally prints upside down (it happens more than you’d think).
```python
from paddleocr import PaddleOCR


def extract_text_from_receipt(image_path: str) -> list[dict]:
    """Run OCR on a preprocessed receipt image and return structured results."""
    ocr = PaddleOCR(use_angle_cls=True, lang="en", show_log=False)
    preprocessed = preprocess_receipt(image_path)
    results = ocr.ocr(preprocessed, cls=True)
    lines = []
    # PaddleOCR returns [None] when it finds no text at all
    if not results or results[0] is None:
        return lines
    for detection in results[0]:
        bbox = detection[0]
        text = detection[1][0]
        confidence = detection[1][1]
        # Vertical midpoint of the box, used to sort lines top-to-bottom
        y_mid = (bbox[0][1] + bbox[2][1]) / 2
        lines.append({
            "text": text,
            "confidence": confidence,
            "y_position": y_mid,
            "x_position": bbox[0][0],
            "bbox": bbox
        })
    # Sort by vertical position to reconstruct reading order
    lines.sort(key=lambda x: x["y_position"])
    return lines
```
The results come back as individual text regions, not full lines. Sorting by y_position reconstructs the top-to-bottom reading order. You’ll also want x_position later when deciding whether two detections belong on the same line (like an item name on the left and its price on the right).
## Parsing OCR Output into Structured Data
This is where the real work happens. Receipt text follows common patterns: the merchant name sits at the top, prices have dollar signs or decimal points, totals appear near the bottom with keywords like “TOTAL” or “AMOUNT DUE”, and dates follow recognizable formats.
```python
import re
from dataclasses import dataclass, field


@dataclass
class ReceiptData:
    merchant: str = ""
    date: str = ""
    items: list[dict] = field(default_factory=list)
    subtotal: float = 0.0
    tax: float = 0.0
    total: float = 0.0


PRICE_PATTERN = re.compile(r"\$?\d+\.\d{2}")
DATE_PATTERN = re.compile(
    r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
    r"|"
    r"\d{4}[/-]\d{1,2}[/-]\d{1,2}"
)
TOTAL_KEYWORDS = ["total", "amount due", "balance due", "grand total"]
TAX_KEYWORDS = ["tax", "gst", "hst", "vat"]
SUBTOTAL_KEYWORDS = ["subtotal", "sub total", "sub-total"]
SKIP_KEYWORDS = [
    "thank you", "visa", "mastercard", "debit", "credit",
    "change", "cash", "card", "approved", "auth"
]


def merge_same_line_detections(
    ocr_lines: list[dict], y_threshold: float = 15.0
) -> list[str]:
    """Merge OCR detections that share the same vertical position into single lines."""
    if not ocr_lines:
        return []
    merged = []
    current_group = [ocr_lines[0]]
    for detection in ocr_lines[1:]:
        if abs(detection["y_position"] - current_group[0]["y_position"]) < y_threshold:
            current_group.append(detection)
        else:
            # Sort left-to-right within the same line
            current_group.sort(key=lambda x: x["x_position"])
            merged_text = " ".join(d["text"] for d in current_group)
            merged.append(merged_text)
            current_group = [detection]
    current_group.sort(key=lambda x: x["x_position"])
    merged.append(" ".join(d["text"] for d in current_group))
    return merged


def parse_receipt(ocr_lines: list[dict]) -> ReceiptData:
    """Parse OCR detections into structured receipt data."""
    receipt = ReceiptData()
    text_lines = merge_same_line_detections(ocr_lines)
    if not text_lines:
        return receipt
    # Merchant name is typically the first non-empty line
    for line in text_lines[:3]:
        cleaned = line.strip()
        if cleaned and not DATE_PATTERN.search(cleaned):
            receipt.merchant = cleaned
            break
    for line in text_lines:
        lower = line.lower().strip()
        # Extract date
        date_match = DATE_PATTERN.search(line)
        if date_match and not receipt.date:
            receipt.date = date_match.group()
            continue
        # Extract total
        if any(kw in lower for kw in TOTAL_KEYWORDS) and not any(
            kw in lower for kw in SUBTOTAL_KEYWORDS
        ):
            price_match = PRICE_PATTERN.search(line)
            if price_match:
                receipt.total = float(price_match.group().replace("$", ""))
            continue
        # Extract tax
        if any(kw in lower for kw in TAX_KEYWORDS):
            price_match = PRICE_PATTERN.search(line)
            if price_match:
                receipt.tax = float(price_match.group().replace("$", ""))
            continue
        # Extract subtotal
        if any(kw in lower for kw in SUBTOTAL_KEYWORDS):
            price_match = PRICE_PATTERN.search(line)
            if price_match:
                receipt.subtotal = float(price_match.group().replace("$", ""))
            continue
        # Skip payment method lines and footers
        if any(kw in lower for kw in SKIP_KEYWORDS):
            continue
        # Remaining lines with prices are likely line items
        price_match = PRICE_PATTERN.search(line)
        if price_match:
            price_str = price_match.group()
            item_name = line[:price_match.start()].strip()
            item_name = re.sub(r"[\$\d\.\s]+$", "", item_name).strip()
            if item_name and len(item_name) > 1:
                receipt.items.append({
                    "name": item_name,
                    "price": float(price_str.replace("$", ""))
                })
    return receipt
```
The merge_same_line_detections function is critical. PaddleOCR often splits a single receipt line into two detections – the item name on the left, the price on the right. Without merging, you’d lose the association between them. The y_threshold of 15 pixels works for most receipt resolutions; bump it up to 20-25 if you’re working with high-DPI scans.
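Rather than hard-coding a pixel threshold at all, one option is to derive it from the median height of the detected boxes, so the same code handles phone photos and high-DPI scans alike. A sketch — the helper and the 0.6 factor are my own starting point, not a tuned value:

```python
import numpy as np


def estimate_y_threshold(ocr_lines: list[dict], factor: float = 0.6) -> float:
    """Derive a same-line threshold from the median detected text height."""
    if not ocr_lines:
        return 15.0  # fall back to the fixed default
    # bbox corners run clockwise from top-left, so [2][1] - [0][1] is the box height
    heights = [abs(d["bbox"][2][1] - d["bbox"][0][1]) for d in ocr_lines]
    return float(np.median(heights)) * factor
```

Pass the result as `merge_same_line_detections(ocr_lines, y_threshold=estimate_y_threshold(ocr_lines))`.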
## Putting It All Together
Here’s the complete pipeline from image file to structured receipt:
```python
def scan_receipt(image_path: str) -> dict:
    """Full pipeline: image -> preprocessed -> OCR -> structured data."""
    ocr_lines = extract_text_from_receipt(image_path)
    receipt = parse_receipt(ocr_lines)
    return {
        "merchant": receipt.merchant,
        "date": receipt.date,
        "items": receipt.items,
        "subtotal": receipt.subtotal,
        "tax": receipt.tax,
        "total": receipt.total,
        "item_count": len(receipt.items),
    }


# Example usage
result = scan_receipt("receipt_photo.jpg")
print(f"Store: {result['merchant']}")
print(f"Date: {result['date']}")
print(f"Items found: {result['item_count']}")
for item in result["items"]:
    print(f"  {item['name']}: ${item['price']:.2f}")
print(f"Subtotal: ${result['subtotal']:.2f}")
print(f"Tax: ${result['tax']:.2f}")
print(f"Total: ${result['total']:.2f}")
```
A sanity check worth adding: compare the sum of extracted item prices against the subtotal. If they’re off by more than a cent or two, you probably missed a line item or double-counted something.
```python
def validate_receipt(receipt_data: dict) -> list[str]:
    """Check for common extraction issues."""
    warnings = []
    item_sum = sum(item["price"] for item in receipt_data["items"])
    if receipt_data["subtotal"] > 0:
        diff = abs(item_sum - receipt_data["subtotal"])
        if diff > 0.02:
            warnings.append(
                f"Item sum (${item_sum:.2f}) doesn't match "
                f"subtotal (${receipt_data['subtotal']:.2f})"
            )
    if receipt_data["total"] > 0 and receipt_data["subtotal"] > 0:
        expected_total = receipt_data["subtotal"] + receipt_data["tax"]
        if abs(expected_total - receipt_data["total"]) > 0.02:
            warnings.append(
                f"Subtotal + tax (${expected_total:.2f}) doesn't match "
                f"total (${receipt_data['total']:.2f})"
            )
    if not receipt_data["merchant"]:
        warnings.append("No merchant name detected")
    if not receipt_data["date"]:
        warnings.append("No date detected")
    return warnings
```
## Common Errors and Fixes
PaddleOCR returns empty results on a valid image. This usually means the preprocessing wiped out the text. Try skipping the threshold step and passing the grayscale image directly. Some receipts with colored backgrounds get destroyed by binary thresholding.
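That fallback can be wired in directly: attempt OCR on the thresholded image, and retry on plain grayscale when nothing comes back. A sketch — the retry logic and the function name are mine; it takes both images so it composes with the functions above:

```python
def ocr_with_fallback(ocr, preprocessed, gray):
    """Run OCR on the thresholded image; retry on grayscale if it comes back empty."""
    results = ocr.ocr(preprocessed, cls=True)
    if not results or results[0] is None or not results[0]:
        # Binarization can erase colored or low-contrast text entirely;
        # the untouched grayscale image often still recognizes fine
        results = ocr.ocr(gray, cls=True)
    return results
```

Call it as `ocr_with_fallback(ocr, preprocess_receipt(path), cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY))`.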
Item prices extracted as part of the item name. The regex expects prices at the end of the line. If OCR puts them in the middle (e.g., “COFFEE 2x $3.50 $7.00”), you’ll get misattributed names. Adjust the PRICE_PATTERN search to find the last match on the line instead of the first:
```python
# Find the last price on the line, not the first
all_prices = list(PRICE_PATTERN.finditer(line))
if all_prices:
    price_match = all_prices[-1]
```
Deskew rotates the image 90 degrees. The minAreaRect angle can be ambiguous for nearly-vertical receipts. Add a bounds check:
```python
if abs(angle) > 10:
    # Large angles likely indicate a misdetection; rotating by zero
    # leaves the image untouched, effectively skipping the deskew
    angle = 0.0
```
OCR confuses “1” and “l”, “0” and “O” in prices. This is a classic OCR issue. Post-process price strings before parsing:
```python
def clean_price_string(raw: str) -> str:
    """Fix common OCR misreads in price strings."""
    cleaned = raw.replace("O", "0").replace("o", "0")
    cleaned = cleaned.replace("l", "1").replace("I", "1")
    cleaned = cleaned.replace(",", ".")  # European decimal comma
    return cleaned
```
Merchant name picks up a decorative header line instead of the store name. Some receipts have divider lines like “========” or “——–” at the top. Add a filter to skip lines that are mostly non-alphanumeric characters:
```python
def is_valid_merchant_name(text: str) -> bool:
    """Check if a text line looks like an actual merchant name."""
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    return alpha_ratio > 0.5 and len(text) > 2
```
Memory usage is high when processing many receipts. PaddleOCR loads the model on every instantiation. Create the PaddleOCR object once and reuse it across calls instead of constructing it inside extract_text_from_receipt. Pass it as a parameter or use a module-level singleton.
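A sketch of that reuse pattern using `functools.lru_cache` as a lazy singleton — a stub stands in for the `PaddleOCR` constructor here so the sketch runs without downloading model weights:

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_ocr():
    """Build the OCR engine on first call; every later call reuses it.

    In the real pipeline, return
    PaddleOCR(use_angle_cls=True, lang="en", show_log=False) here.
    """
    return object()  # stub standing in for the heavyweight PaddleOCR instance


# Every receipt shares one engine; the model loads exactly once
engine_a = get_ocr()
engine_b = get_ocr()
assert engine_a is engine_b
```

`extract_text_from_receipt` would then call `get_ocr()` instead of constructing a new `PaddleOCR` on every invocation.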