Spell checking sounds like a solved problem until you try to build one that handles domain-specific jargon, noisy user input, and real-time performance requirements all at once. The trick is combining a fast dictionary-based approach for obvious typos with a smarter model for ambiguous cases.
Here’s the quickest way to get a working spell checker running:
Install with pip install symspellpy. The bundled frequency dictionary covers 82,765 English words, and SymSpell processes millions of lookups per second. For most use cases, this is all you need.
Three Approaches Compared
Not every spell checker fits every situation. Here’s how the three main options stack up:
SymSpell is the speed king. It pre-computes all possible edit-distance variations at load time, which means lookups are essentially hash table hits. You get millions of corrections per second with edit distance 2. The downside: it has no awareness of context. “Their” and “there” both look correct to SymSpell regardless of sentence meaning.
TextBlob is the easiest to get running. Two lines of code, no dictionary loading, no configuration. It uses a statistical approach based on Peter Norvig’s spell corrector. Accuracy is decent for common words but worse than SymSpell for unusual terms, and it’s significantly slower.
Install with pip install textblob. For quick scripts or prototypes, TextBlob is hard to beat on simplicity.
Transformer-based correction uses masked language models to pick the right word based on surrounding context. It’s orders of magnitude slower (roughly 100 words per second on CPU) but catches errors the other approaches miss entirely, like wrong-word errors where the misspelling happens to be a valid word.
| Approach | Speed | Context-Aware | Setup Effort |
|---|---|---|---|
| SymSpell | ~5M words/sec | No | Low |
| TextBlob | ~1K words/sec | Partial | Minimal |
| Transformers | ~100 words/sec | Yes | Medium |
Context-Aware Correction with Transformers
When SymSpell suggests a valid word but the context is wrong, a masked language model can resolve the ambiguity. BERT treats this as a fill-in-the-blank problem: mask the suspicious word, let the model predict what fits best.
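A sketch using Hugging Face's fill-mask pipeline; the sentence (a misused "affect" already masked out) is an illustrative assumption:

```python
from transformers import pipeline

# bert-base-uncased downloads ~440 MB on first run.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The writer typed "affect"; mask it and let the surrounding context decide.
sentence = "The new policy will have a significant [MASK] on our budget."
preds = fill_mask(sentence, top_k=3)
for p in preds:
    print(f"{p['token_str']}: {p['score']:.3f}")
```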
The model correctly identifies “effect” (or “impact”) as the right choice. This is something a dictionary-based checker simply cannot do because both “affect” and “effect” are valid English words.
Install with pip install transformers torch. The first run downloads the model (~440 MB), so plan for that in deployment.
Building a Combined Pipeline
The best approach in practice is a two-stage pipeline. SymSpell handles the easy cases fast, and transformers step in only when needed. This keeps throughput high while catching the hard errors.
The add_custom_words method is key for production use. Without it, SymSpell will “correct” domain terms like “kubernetes” into unrelated dictionary words. Set the frequency count high (100,000+) so your custom terms rank above common English words at similar edit distances.
Tuning Edit Distance
Edit distance controls how many character changes SymSpell considers. Higher values catch more errors but increase false positives and memory usage.
Stick with edit distance 2 for most applications. Distance 1 misses too many real typos. Distance 3 starts suggesting wildly unrelated words and uses noticeably more memory at load time. The sweet spot is almost always 2.
Common Errors and Fixes
SymSpell returns no suggestions for a valid word. The bundled dictionary is English-only and doesn’t cover proper nouns, technical terms, or slang. Add them manually with create_dictionary_entry().
TextBlob “corrects” words into the wrong thing. TextBlob picks the most statistically likely word, which sometimes means replacing a rare but correct word with a common one. “niche” might become “nice”. There’s no fix other than switching to SymSpell with a custom dictionary.
Transformer fill-mask returns subword tokens. BERT uses WordPiece tokenization, so predictions sometimes come back as ##ing or other fragments. Filter predictions to only keep results where token_str doesn’t start with ##:
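The filter itself is one comprehension; the sample predictions below are hand-written stand-ins for real fill-mask output (which returns dicts with the same keys):

```python
def filter_subwords(predictions):
    """Drop WordPiece continuation tokens, which start with ##."""
    return [p for p in predictions if not p["token_str"].startswith("##")]


sample = [
    {"token_str": "effect", "score": 0.61},
    {"token_str": "##ing", "score": 0.12},
    {"token_str": "impact", "score": 0.09},
]
print([p["token_str"] for p in filter_subwords(sample)])  # → ['effect', 'impact']
```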
SymSpell load takes too long. The default dictionary loads in about 1-2 seconds. If that’s too slow for serverless functions, serialize the SymSpell object with pickle after the first load and reload from the pickle file on subsequent runs. The pickle loads in under 100ms.
Memory usage is high with max_edit_distance=3. SymSpell pre-computes delete combinations at load time. Distance 3 can push memory to 500MB+. Drop back to distance 2 or reduce prefix_length from 7 to 5 to trade some accuracy for lower memory.
“No module named symspellpy” after install. Make sure you’re installing symspellpy (with a py suffix), not the older symspell package. They’re different libraries with different APIs.
Related Guides
- How to Build a Named Entity Recognition Pipeline with spaCy and Transformers
- How to Build a Text Correction and Grammar Checking Pipeline
- How to Build a Text Style Transfer Pipeline with Transformers
- How to Build a Sentiment Analysis API with Transformers and FastAPI
- How to Build a Text-to-SQL Pipeline with LLMs
- How to Build a Language Detection and Translation Pipeline
- How to Build a Text Classification Pipeline with SetFit
- How to Build a Legal NER Pipeline with Transformers and spaCy
- How to Build a Text Embedding Pipeline with Sentence Transformers and FAISS
- How to Build a Document Chunking Strategy Comparison Pipeline