Real-world text is a mess. Social media posts are riddled with slang and broken Unicode. OCR output swaps letters and injects garbage characters. Chat logs mix abbreviations, emojis, and HTML fragments. If you feed any of this directly into an NLP model, you get garbage results.

A text normalization pipeline fixes this before your model ever sees it. Here’s how to build one in Python that’s composable, testable, and handles the most common types of noise.

Install the Dependencies

You need three packages beyond the standard library:

pip install ftfy symspellpy contractions
  • ftfy fixes mojibake and encoding errors (e.g., â€™ back to ’)
  • symspellpy does fast spell correction using the symmetric delete algorithm
  • contractions expands English contractions (don't → do not)

The Building Blocks

Each normalization step is a standalone function. This makes it easy to reorder, skip, or add steps depending on your data source.

Fix Encoding and Mojibake

The ftfy library is the single best tool for fixing text that was decoded with the wrong encoding. It handles double-encoded UTF-8, Windows-1252 artifacts, and HTML entities in one call.

import ftfy

broken = "The Mona Lisa doesnâ€\x99t have eyebrows."
fixed = ftfy.fix_text(broken)
print(fixed)
# Output: The Mona Lisa doesn't have eyebrows.

# It also fixes HTML entities
html_mess = "Price: &euro;50 &mdash; free shipping"
print(ftfy.fix_text(html_mess))
# Output: Price: €50 — free shipping

Unicode Normalization

Unicode has multiple ways to represent the same character. The letter é can be a single codepoint (U+00E9) or a combining sequence (e + ´). This causes duplicate detection and search to break silently. Always normalize to NFC (composed form).

import unicodedata

def normalize_unicode(text: str) -> str:
    return unicodedata.normalize("NFC", text)

# These look identical but are different bytes
a = "caf\u00e9"        # precomposed é
b = "cafe\u0301"       # e + combining accent
print(a == b)           # False
print(normalize_unicode(a) == normalize_unicode(b))  # True
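NFC only merges canonically equivalent sequences. If your data also contains compatibility characters, such as fullwidth letters from CJK text or typographic ligatures, NFKC folds those to plain forms as well. A small sketch (note that NFKC is lossy; it also turns things like ² into 2, so opt in deliberately):

```python
import unicodedata

# "\ufb01" is the fi ligature (U+FB01); "\uff21\uff11" are fullwidth A and 1
print(unicodedata.normalize("NFKC", "\ufb01le \uff21\uff11"))
# Output: file A1
```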

Strip HTML, URLs, and Special Characters

OCR output and scraped data often contain leftover HTML tags, URLs, and control characters. Remove them with regex.

import re
import unicodedata

def remove_html_tags(text: str) -> str:
    return re.sub(r"<[^>]+>", " ", text)

def remove_urls(text: str) -> str:
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def remove_control_chars(text: str) -> str:
    return "".join(ch for ch in text if unicodedata.category(ch)[0] != "C" or ch in "\n\t")

Expand Contractions

Contractions trip up tokenizers and keyword matching. The contractions library handles standard English contractions including edge cases like y'all and won't.

import contractions

text = "I won't say they're wrong, but y'all shouldn't've done that."
expanded = contractions.fix(text)
print(expanded)
# Output: I will not say they are wrong, but you all should not have done that.

Normalize Whitespace and Punctuation

Noisy text has repeated spaces, tabs mixed with spaces, and inconsistent punctuation like smart quotes or em dashes.

def normalize_whitespace(text: str) -> str:
    text = text.replace("\t", " ")
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

def normalize_punctuation(text: str) -> str:
    replacements = {
        "\u2018": "'", "\u2019": "'",   # smart single quotes
        "\u201c": '"', "\u201d": '"',   # smart double quotes
        "\u2013": "-", "\u2014": "-",   # en/em dashes
        "\u2026": "...",                 # ellipsis character
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text
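For the single-character replacements, str.translate with a table built once is faster than chained .replace calls on large inputs (a sketch of the same mapping; the multi-character ellipsis still needs .replace):

```python
# build the translation table once at module level
PUNCT_TABLE = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # smart single quotes
    "\u201c": '"', "\u201d": '"',   # smart double quotes
    "\u2013": "-", "\u2014": "-",   # en/em dashes
})

def normalize_punctuation_fast(text: str) -> str:
    return text.translate(PUNCT_TABLE).replace("\u2026", "...")

print(normalize_punctuation_fast("\u201cHi\u201d \u2014 ok\u2026"))
# Output: "Hi" - ok...
```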

Handle Slang and Abbreviations

For domain-specific slang, a simple lookup dictionary works better than any ML model. Keep it explicit and auditable.

SLANG_MAP = {
    "brb": "be right back",
    "idk": "I do not know",
    "tbh": "to be honest",
    "imo": "in my opinion",
    "smh": "shaking my head",
    "ngl": "not going to lie",
    "w/": "with",
    "w/o": "without",
    "b/c": "because",
    "ppl": "people",
}

def expand_slang(text: str) -> str:
    words = text.split()
    return " ".join(SLANG_MAP.get(w.lower(), w) for w in words)

print(expand_slang("idk why ppl do that tbh"))
# Output: I do not know why people do that to be honest
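One catch with the split-based version: a token like "tbh," keeps its comma and misses the lookup. A tolerant variant (a sketch; expand_slang_tolerant is a hypothetical name, and it takes the mapping as a parameter) checks the token first, then retries with trailing punctuation stripped:

```python
def expand_slang_tolerant(text: str, mapping: dict[str, str]) -> str:
    out = []
    for token in text.split():
        low = token.lower()
        if low in mapping:                  # exact hit, e.g. "w/o"
            out.append(mapping[low])
            continue
        core = low.rstrip(".,!?;:")         # retry without trailing punctuation
        if core in mapping:
            out.append(mapping[core] + token[len(core):])
        else:
            out.append(token)
    return " ".join(out)

demo = {"idk": "I do not know", "tbh": "to be honest"}
print(expand_slang_tolerant("idk, tbh!", demo))
# Output: I do not know, to be honest!
```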

Spell Correction with SymSpellPy

SymSpellPy is orders of magnitude faster than traditional spell checkers. It precomputes all possible edits within a given distance, so lookups are nearly instant even on large vocabularies.

import importlib.resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

# Load the built-in frequency dictionary
dict_path = importlib.resources.files("symspellpy") / "frequency_dictionary_en_82_765.txt"
sym_spell.load_dictionary(str(dict_path), term_index=0, count_index=1)

def correct_spelling(text: str) -> str:
    suggestions = sym_spell.lookup_compound(
        text,
        max_edit_distance=2,
        transfer_casing=True,
    )
    if suggestions:
        return suggestions[0].term
    return text

print(correct_spelling("tha quikc brwon fox jmps"))
# Output: the quick brown fox jumps

The transfer_casing=True flag preserves the original capitalization pattern, which matters for proper nouns.

Composable Pipeline Class

Now wire everything together into a pipeline where you pick which steps to run and in what order.

import ftfy
import re
import unicodedata
import importlib.resources
import contractions
from symspellpy import SymSpell
from typing import Callable


class TextNormalizer:
    """Composable text normalization pipeline."""

    def __init__(self, steps: list[Callable[[str], str]] | None = None):
        if steps is not None:
            self.steps = steps
        else:
            self.steps = [
                ftfy.fix_text,
                normalize_unicode,
                remove_html_tags,
                remove_urls,
                remove_control_chars,
                normalize_punctuation,
                contractions.fix,
                expand_slang,
                correct_spelling,
                normalize_whitespace,
            ]

    def __call__(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text

    def add_step(self, func: Callable[[str], str], position: int = -1):
        if position == -1:
            self.steps.append(func)
        else:
            self.steps.insert(position, func)
        return self


# Default pipeline
normalizer = TextNormalizer()

# Custom pipeline for OCR-only data (skip slang expansion)
ocr_normalizer = TextNormalizer(steps=[
    ftfy.fix_text,
    normalize_unicode,
    remove_control_chars,
    normalize_punctuation,
    correct_spelling,
    normalize_whitespace,
])

# Run it
messy = '<p>Tha   usér  doesnâ€\x99t   knw  w/o <b>checking</b> https://example.com  frst</p>'
clean = normalizer(messy)
print(clean)
# Output: The user does not know without checking first

The key design choice: each step is a plain function with the signature str -> str. This means you can test each one independently, swap the order, or inject custom steps for your specific domain.
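For example, a custom domain step that masks email addresses so later steps cannot mangle them (a hypothetical helper; the regex is deliberately simple):

```python
import re

def redact_emails(text: str) -> str:
    # replace anything that looks like an email address with a placeholder
    return re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", "<email>", text)

print(redact_emails("contact bob@example.com for access"))
# Output: contact <email> for access
```

If you inject a step like this, pick a placeholder that later steps will not strip: remove_html_tags would eat <email> if it ran afterwards, so insert this step after tag removal.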

Processing in Bulk

For large datasets, process text in batches and add basic logging so you can spot problems.

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("normalizer")


def normalize_batch(texts: list[str], normalizer: TextNormalizer) -> list[str]:
    results = []
    failures = 0
    for i, text in enumerate(texts):
        try:
            results.append(normalizer(text))
        except Exception as e:
            logger.warning(f"Failed on text {i}: {e}")
            results.append(text)  # return original on failure
            failures += 1
    logger.info(f"Normalized {len(results)} texts, {failures} failures")
    return results


# Example
raw_texts = [
    "omg idk why  thiss is   brokenn",
    "â€\x9cHello worldâ€\x9d she said",
    "check <a href='#'>this</a> out brb",
]

cleaned = normalize_batch(raw_texts, normalizer)
for raw, clean in zip(raw_texts, cleaned):
    print(f"  IN: {raw!r}")
    print(f" OUT: {clean!r}")
    print()
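normalize_batch holds the whole list in memory. For datasets that do not fit, a small chunking generator keeps memory bounded (stdlib only; batched here is a hand-rolled helper, not itertools.batched, so it works before Python 3.12):

```python
from itertools import islice

def batched(iterable, size: int):
    # yield successive lists of up to `size` items
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# hypothetical usage: stream a large file through the pipeline chunk by chunk
# for chunk in batched(open("corpus.txt", encoding="utf-8"), 1000):
#     write_out(normalize_batch(chunk, normalizer))
```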

Common Errors and Fixes

UnicodeDecodeError when reading files: You’re opening a file with the wrong encoding. Always specify encoding explicitly:

with open("data.txt", encoding="utf-8", errors="replace") as f:
    text = f.read()

The errors="replace" flag substitutes undecodable bytes with the Unicode replacement character � (U+FFFD) instead of crashing. Use errors="ignore" if you’d rather drop them silently.

ftfy.fix_text changes text you didn’t want changed: ftfy is aggressive by default. Use fix_text with specific flags to limit what it does:

import ftfy

# Fix encoding, but keep curly quotes and fullwidth characters as-is
result = ftfy.fix_text(text, uncurl_quotes=False, fix_character_width=False)

SymSpellPy “corrects” proper nouns and technical terms: The default dictionary doesn’t include domain-specific words. Add your own terms:

sym_spell.create_dictionary_entry("kubernetes", 1000000)
sym_spell.create_dictionary_entry("pytorch", 1000000)
sym_spell.create_dictionary_entry("nginx", 1000000)

Set the count high so these terms are preferred over similar common words.

Contraction expansion breaks code snippets: If your text contains code, skip the contraction step or guard it:

def safe_expand_contractions(text: str) -> str:
    # Skip lines that look like code
    lines = text.split("\n")
    result = []
    for line in lines:
        if any(marker in line for marker in ["=", "()", "{", "}", "import", "def "]):
            result.append(line)
        else:
            result.append(contractions.fix(line))
    return "\n".join(result)

Pipeline order matters: Always fix encoding first (ftfy), then Unicode normalization, then everything else. If you spell-correct before fixing mojibake, you’ll “correct” garbled characters into wrong words.