Receipt photos are messy. They’re crumpled, skewed, faded, and printed on thermal paper that loves to lose contrast. But the data on them – merchant name, line items, prices, tax, total – follows predictable patterns. That makes receipts a great target for OCR plus structured extraction. Here’s the full pipeline: preprocess the image with OpenCV, run OCR with PaddleOCR, then parse the raw text into usable fields with regex and heuristics.
```shell
pip install paddlepaddle paddleocr opencv-python-headless numpy
```
## Preprocessing the Image with OpenCV
Raw receipt photos need cleanup before OCR can do its job. The three operations that matter most are grayscale conversion, adaptive thresholding, and deskewing. Grayscale removes color noise. Thresholding sharpens faded text against the background. Deskewing corrects the tilt from handheld camera shots.
```python
import cv2
import numpy as np


def preprocess_receipt(image_path: str) -> np.ndarray:
    """Load a receipt image and prepare it for OCR."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Adaptive threshold handles uneven lighting across the receipt
    thresh = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, blockSize=15, C=10
    )
    # Deskew: find the dominant text angle and rotate to correct it.
    # THRESH_BINARY leaves the text black, so collect the dark pixels
    # (minAreaRect needs float32 points in recent OpenCV versions).
    coords = np.column_stack(np.where(thresh == 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    h, w = thresh.shape
    center = (w // 2, h // 2)
    rotation_matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
    deskewed = cv2.warpAffine(
        thresh, rotation_matrix, (w, h),
        flags=cv2.INTER_CUBIC,
        borderMode=cv2.BORDER_REPLICATE
    )
    return deskewed
```
The blockSize=15 and C=10 values work well for standard thermal receipts. If your receipts are from inkjet or laser printers, try blockSize=11 and C=8. The deskew step uses minAreaRect on all non-black pixels to find the dominant angle – anything within a few degrees of horizontal gets corrected.
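If you process receipts from mixed sources, it helps to keep the two parameter sets in one place instead of editing the function. A minimal sketch — the profile names and this helper are my own, not part of OpenCV; the values are the ones suggested above:

```python
# Adaptive-threshold presets per printer type. The pairs are the values
# discussed in the text; "thermal" is the default.
THRESHOLD_PROFILES = {
    "thermal": {"blockSize": 15, "C": 10},
    "inkjet_laser": {"blockSize": 11, "C": 8},
}


def threshold_params(profile: str) -> dict:
    """Return adaptive-threshold parameters for a receipt printer type."""
    return THRESHOLD_PROFILES.get(profile, THRESHOLD_PROFILES["thermal"])
```

`preprocess_receipt` could then accept a `profile` argument and unpack the returned dict into the `cv2.adaptiveThreshold` call.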
For heavily wrinkled or curved receipts, you might need morphological operations to close gaps in characters:
```python
def enhance_for_wrinkled_receipt(thresh_img: np.ndarray) -> np.ndarray:
    """Apply morphological closing to reconnect broken characters."""
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    closed = cv2.morphologyEx(thresh_img, cv2.MORPH_CLOSE, kernel)
    return closed
```
## Running OCR with PaddleOCR
PaddleOCR handles the text detection and recognition in one call. It returns bounding boxes with their text content and confidence scores. The use_angle_cls=True flag enables automatic text direction classification, which matters when receipt text occasionally prints upside down (it happens more than you’d think).
```python
from paddleocr import PaddleOCR


def extract_text_from_receipt(image_path: str) -> list[dict]:
    """Run OCR on a preprocessed receipt image and return structured results."""
    ocr = PaddleOCR(use_angle_cls=True, lang="en", show_log=False)
    preprocessed = preprocess_receipt(image_path)
    results = ocr.ocr(preprocessed, cls=True)
    lines = []
    # PaddleOCR returns [None] when it finds no text at all
    if not results or results[0] is None:
        return lines
    for detection in results[0]:
        bbox = detection[0]
        text = detection[1][0]
        confidence = detection[1][1]
        # Vertical midpoint of the box, used to sort lines top-to-bottom
        y_mid = (bbox[0][1] + bbox[2][1]) / 2
        lines.append({
            "text": text,
            "confidence": confidence,
            "y_position": y_mid,
            "x_position": bbox[0][0],
            "bbox": bbox
        })
    # Sort by vertical position to reconstruct reading order
    lines.sort(key=lambda x: x["y_position"])
    return lines
```
The results come back as individual text regions, not full lines. Sorting by y_position reconstructs the top-to-bottom reading order. You’ll also want x_position later when deciding whether two detections belong on the same line (like an item name on the left and its price on the right).
## Parsing OCR Output into Structured Data
This is where the real work happens. Receipt text follows common patterns: the merchant name sits at the top, prices have dollar signs or decimal points, totals appear near the bottom with keywords like “TOTAL” or “AMOUNT DUE”, and dates follow recognizable formats.
```python
import re
from dataclasses import dataclass, field


@dataclass
class ReceiptData:
    merchant: str = ""
    date: str = ""
    items: list[dict] = field(default_factory=list)
    subtotal: float = 0.0
    tax: float = 0.0
    total: float = 0.0


PRICE_PATTERN = re.compile(r"\$?\d+\.\d{2}")
DATE_PATTERN = re.compile(
    r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
    r"|"
    r"\d{4}[/-]\d{1,2}[/-]\d{1,2}"
)
TOTAL_KEYWORDS = ["total", "amount due", "balance due", "grand total"]
TAX_KEYWORDS = ["tax", "gst", "hst", "vat"]
SUBTOTAL_KEYWORDS = ["subtotal", "sub total", "sub-total"]
SKIP_KEYWORDS = [
    "thank you", "visa", "mastercard", "debit", "credit",
    "change", "cash", "card", "approved", "auth"
]


def merge_same_line_detections(
    ocr_lines: list[dict], y_threshold: float = 15.0
) -> list[str]:
    """Merge OCR detections that share the same vertical position into single lines."""
    if not ocr_lines:
        return []
    merged = []
    current_group = [ocr_lines[0]]
    for detection in ocr_lines[1:]:
        if abs(detection["y_position"] - current_group[0]["y_position"]) < y_threshold:
            current_group.append(detection)
        else:
            # Sort left-to-right within the same line
            current_group.sort(key=lambda x: x["x_position"])
            merged_text = " ".join(d["text"] for d in current_group)
            merged.append(merged_text)
            current_group = [detection]
    current_group.sort(key=lambda x: x["x_position"])
    merged.append(" ".join(d["text"] for d in current_group))
    return merged


def parse_receipt(ocr_lines: list[dict]) -> ReceiptData:
    """Parse OCR detections into structured receipt data."""
    receipt = ReceiptData()
    text_lines = merge_same_line_detections(ocr_lines)
    if not text_lines:
        return receipt
    # Merchant name is typically the first non-empty line
    for line in text_lines[:3]:
        cleaned = line.strip()
        if cleaned and not DATE_PATTERN.search(cleaned):
            receipt.merchant = cleaned
            break
    for line in text_lines:
        lower = line.lower().strip()
        # Extract date
        date_match = DATE_PATTERN.search(line)
        if date_match and not receipt.date:
            receipt.date = date_match.group()
            continue
        # Extract total
        if any(kw in lower for kw in TOTAL_KEYWORDS) and not any(
            kw in lower for kw in SUBTOTAL_KEYWORDS
        ):
            price_match = PRICE_PATTERN.search(line)
            if price_match:
                receipt.total = float(price_match.group().replace("$", ""))
            continue
        # Extract tax
        if any(kw in lower for kw in TAX_KEYWORDS):
            price_match = PRICE_PATTERN.search(line)
            if price_match:
                receipt.tax = float(price_match.group().replace("$", ""))
            continue
        # Extract subtotal
        if any(kw in lower for kw in SUBTOTAL_KEYWORDS):
            price_match = PRICE_PATTERN.search(line)
            if price_match:
                receipt.subtotal = float(price_match.group().replace("$", ""))
            continue
        # Skip payment method lines and footers
        if any(kw in lower for kw in SKIP_KEYWORDS):
            continue
        # Remaining lines with prices are likely line items
        price_match = PRICE_PATTERN.search(line)
        if price_match:
            price_str = price_match.group()
            item_name = line[:price_match.start()].strip()
            item_name = re.sub(r"[\$\d\.\s]+$", "", item_name).strip()
            if item_name and len(item_name) > 1:
                receipt.items.append({
                    "name": item_name,
                    "price": float(price_str.replace("$", ""))
                })
    return receipt
```
The merge_same_line_detections function is critical. PaddleOCR often splits a single receipt line into two detections – the item name on the left, the price on the right. Without merging, you’d lose the association between them. The y_threshold of 15 pixels works for most receipt resolutions; bump it up to 20-25 if you’re working with high-DPI scans.
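Rather than hard-coding a pixel threshold at all, one option is to derive it from the median height of the detected boxes, so the same code handles phone photos and high-DPI scans alike. A sketch — the helper and the 0.6 factor are my own starting point, not a tuned value:

```python
import numpy as np


def estimate_y_threshold(ocr_lines: list[dict], factor: float = 0.6) -> float:
    """Derive a same-line threshold from the median detected text height."""
    if not ocr_lines:
        return 15.0  # fall back to the fixed default
    # bbox corners run clockwise from top-left, so [2][1] - [0][1] is the box height
    heights = [abs(d["bbox"][2][1] - d["bbox"][0][1]) for d in ocr_lines]
    return float(np.median(heights)) * factor
```

Pass the result as `merge_same_line_detections(ocr_lines, y_threshold=estimate_y_threshold(ocr_lines))`.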
## Putting It All Together
Here’s the complete pipeline from image file to structured receipt:
```python
def scan_receipt(image_path: str) -> dict:
    """Full pipeline: image -> preprocessed -> OCR -> structured data."""
    ocr_lines = extract_text_from_receipt(image_path)
    receipt = parse_receipt(ocr_lines)
    return {
        "merchant": receipt.merchant,
        "date": receipt.date,
        "items": receipt.items,
        "subtotal": receipt.subtotal,
        "tax": receipt.tax,
        "total": receipt.total,
        "item_count": len(receipt.items),
    }


# Example usage
result = scan_receipt("receipt_photo.jpg")
print(f"Store: {result['merchant']}")
print(f"Date: {result['date']}")
print(f"Items found: {result['item_count']}")
for item in result["items"]:
    print(f"  {item['name']}: ${item['price']:.2f}")
print(f"Subtotal: ${result['subtotal']:.2f}")
print(f"Tax: ${result['tax']:.2f}")
print(f"Total: ${result['total']:.2f}")
```
A sanity check worth adding: compare the sum of extracted item prices against the subtotal. If they’re off by more than a cent or two, you probably missed a line item or double-counted something.
```python
def validate_receipt(receipt_data: dict) -> list[str]:
    """Check for common extraction issues."""
    warnings = []
    item_sum = sum(item["price"] for item in receipt_data["items"])
    if receipt_data["subtotal"] > 0:
        diff = abs(item_sum - receipt_data["subtotal"])
        if diff > 0.02:
            warnings.append(
                f"Item sum (${item_sum:.2f}) doesn't match "
                f"subtotal (${receipt_data['subtotal']:.2f})"
            )
    if receipt_data["total"] > 0 and receipt_data["subtotal"] > 0:
        expected_total = receipt_data["subtotal"] + receipt_data["tax"]
        if abs(expected_total - receipt_data["total"]) > 0.02:
            warnings.append(
                f"Subtotal + tax (${expected_total:.2f}) doesn't match "
                f"total (${receipt_data['total']:.2f})"
            )
    if not receipt_data["merchant"]:
        warnings.append("No merchant name detected")
    if not receipt_data["date"]:
        warnings.append("No date detected")
    return warnings
```
## Common Errors and Fixes
PaddleOCR returns empty results on a valid image. This usually means the preprocessing wiped out the text. Try skipping the threshold step and passing the grayscale image directly. Some receipts with colored backgrounds get destroyed by binary thresholding.
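That fallback can be wired in directly: attempt OCR on the thresholded image, and retry on plain grayscale when nothing comes back. A sketch — the retry logic and the function name are mine; it takes both images so it composes with the functions above:

```python
def ocr_with_fallback(ocr, preprocessed, gray):
    """Run OCR on the thresholded image; retry on grayscale if it comes back empty."""
    results = ocr.ocr(preprocessed, cls=True)
    if not results or results[0] is None or not results[0]:
        # Binarization can erase colored or low-contrast text entirely;
        # the untouched grayscale image often still recognizes fine
        results = ocr.ocr(gray, cls=True)
    return results
```

Call it as `ocr_with_fallback(ocr, preprocess_receipt(path), cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY))`.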
Item prices extracted as part of the item name. The regex expects prices at the end of the line. If OCR puts them in the middle (e.g., “COFFEE 2x $3.50 $7.00”), you’ll get misattributed names. Adjust the PRICE_PATTERN search to find the last match on the line instead of the first:
```python
# Find the last price on the line, not the first
all_prices = list(PRICE_PATTERN.finditer(line))
if all_prices:
    price_match = all_prices[-1]
```
Deskew rotates the image 90 degrees. The minAreaRect angle can be ambiguous for nearly-vertical receipts. Add a bounds check:
```python
if abs(angle) > 10:
    # Large angles likely indicate a misdetection; rotating by zero
    # leaves the image untouched, effectively skipping the deskew
    angle = 0.0
```
OCR confuses “1” and “l”, “0” and “O” in prices. This is a classic OCR issue. Post-process price strings before parsing:
```python
def clean_price_string(raw: str) -> str:
    """Fix common OCR misreads in price strings."""
    cleaned = raw.replace("O", "0").replace("o", "0")
    cleaned = cleaned.replace("l", "1").replace("I", "1")
    cleaned = cleaned.replace(",", ".")  # European decimal comma
    return cleaned
```
Merchant name picks up a decorative header line instead of the store name. Some receipts have divider lines like “========” or “——–” at the top. Add a filter to skip lines that are mostly non-alphanumeric characters:
```python
def is_valid_merchant_name(text: str) -> bool:
    """Check if a text line looks like an actual merchant name."""
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    return alpha_ratio > 0.5 and len(text) > 2
```
Memory usage is high when processing many receipts. PaddleOCR loads the model on every instantiation. Create the PaddleOCR object once and reuse it across calls instead of constructing it inside extract_text_from_receipt. Pass it as a parameter or use a module-level singleton.
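A sketch of that reuse pattern using `functools.lru_cache` as a lazy singleton — a stub stands in for the `PaddleOCR` constructor here so the sketch runs without downloading model weights:

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_ocr():
    """Build the OCR engine on first call; every later call reuses it.

    In the real pipeline, return
    PaddleOCR(use_angle_cls=True, lang="en", show_log=False) here.
    """
    return object()  # stub standing in for the heavyweight PaddleOCR instance


# Every receipt shares one engine; the model loads exactly once
engine_a = get_ocr()
engine_b = get_ocr()
assert engine_a is engine_b
```

`extract_text_from_receipt` would then call `get_ocr()` instead of constructing a new `PaddleOCR` on every invocation.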