Spell checking sounds like a solved problem until you try to build one that handles domain-specific jargon, noisy user input, and real-time performance requirements all at once. The trick is combining a fast dictionary-based approach for obvious typos with a smarter model for ambiguous cases.
Here’s the quickest way to get a working spell checker running:
Install with pip install symspellpy. The bundled frequency dictionary covers 82,765 English words, and SymSpell processes millions of lookups per second. For most use cases, this is all you need.
Three Approaches Compared
Not every spell checker fits every situation. Here’s how the three main options stack up:
SymSpell is the speed king. It pre-computes all possible edit-distance variations at load time, which means lookups are essentially hash table hits. You get millions of corrections per second with edit distance 2. The downside: it has no awareness of context. “Their” and “there” both look correct to SymSpell regardless of sentence meaning.
TextBlob is the easiest to get running. Two lines of code, no dictionary loading, no configuration. It uses a statistical approach based on Peter Norvig’s spell corrector. Accuracy is decent for common words but worse than SymSpell for unusual terms, and it’s significantly slower.
Install with pip install textblob. For quick scripts or prototypes, TextBlob is hard to beat on simplicity.
Transformer-based correction uses masked language models to pick the right word based on surrounding context. It’s orders of magnitude slower (roughly 100 words per second on CPU) but catches errors the other approaches miss entirely, like wrong-word errors where the misspelling happens to be a valid word.
| Approach | Speed | Context-Aware | Setup Effort |
|---|---|---|---|
| SymSpell | ~5M words/sec | No | Low |
| TextBlob | ~1K words/sec | Partial | Minimal |
| Transformers | ~100 words/sec | Yes | Medium |
Context-Aware Correction with Transformers
When SymSpell suggests a valid word but the context is wrong, a masked language model can resolve the ambiguity. BERT treats this as a fill-in-the-blank problem: mask the suspicious word, let the model predict what fits best.
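A sketch using Hugging Face's fill-mask pipeline; the sentence (a misused "affect" already masked out) is an illustrative assumption:

```python
from transformers import pipeline

# bert-base-uncased downloads ~440 MB on first run.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The writer typed "affect"; mask it and let the surrounding context decide.
sentence = "The new policy will have a significant [MASK] on our budget."
preds = fill_mask(sentence, top_k=3)
for p in preds:
    print(f"{p['token_str']}: {p['score']:.3f}")
```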
The model correctly identifies “effect” (or “impact”) as the right choice. This is something a dictionary-based checker simply cannot do because both “affect” and “effect” are valid English words.
Install with pip install transformers torch. The first run downloads the model (~440 MB), so plan for that in deployment.
Building a Combined Pipeline
The best approach in practice is a two-stage pipeline. SymSpell handles the easy cases fast, and transformers step in only when needed. This keeps throughput high while catching the hard errors.
The add_custom_words method is key for production use. Without it, SymSpell will “correct” domain terms like “kubernetes” into unrelated dictionary words. Set the frequency count high (100,000+) so your custom terms rank above common English words at similar edit distances.
Tuning Edit Distance
Edit distance controls how many character changes SymSpell considers. Higher values catch more errors but increase false positives and memory usage.
Stick with edit distance 2 for most applications. Distance 1 misses too many real typos. Distance 3 starts suggesting wildly unrelated words and uses noticeably more memory at load time. The sweet spot is almost always 2.
Common Errors and Fixes
SymSpell returns no suggestions for a valid word. The bundled dictionary is English-only and doesn’t cover proper nouns, technical terms, or slang. Add them manually with create_dictionary_entry().
TextBlob “corrects” words into the wrong thing. TextBlob picks the most statistically likely word, which sometimes means replacing a rare but correct word with a common one. “niche” might become “nice”. There’s no fix other than switching to SymSpell with a custom dictionary.
Transformer fill-mask returns subword tokens. BERT uses WordPiece tokenization, so predictions sometimes come back as ##ing or other fragments. Filter predictions to only keep results where token_str doesn’t start with ##:
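The filter itself is one comprehension; the sample predictions below are hand-written stand-ins for real fill-mask output (which returns dicts with the same keys):

```python
def filter_subwords(predictions):
    """Drop WordPiece continuation tokens, which start with ##."""
    return [p for p in predictions if not p["token_str"].startswith("##")]


sample = [
    {"token_str": "effect", "score": 0.61},
    {"token_str": "##ing", "score": 0.12},
    {"token_str": "impact", "score": 0.09},
]
print([p["token_str"] for p in filter_subwords(sample)])  # → ['effect', 'impact']
```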
SymSpell load takes too long. The default dictionary loads in about 1-2 seconds. If that’s too slow for serverless functions, serialize the SymSpell object with pickle after the first load and reload from the pickle file on subsequent runs. The pickle loads in under 100ms.
Memory usage is high with max_edit_distance=3. SymSpell pre-computes delete combinations at load time. Distance 3 can push memory to 500MB+. Drop back to distance 2 or reduce prefix_length from 7 to 5 to trade some accuracy for lower memory.
“No module named symspellpy” after install. Make sure you’re installing symspellpy (with a py suffix), not the older symspell package. They’re different libraries with different APIs.
Related Guides
- How to Build a Named Entity Recognition Pipeline with spaCy and Transformers
- How to Build a Text Correction and Grammar Checking Pipeline
- How to Build a Text Style Transfer Pipeline with Transformers
- How to Build a Sentiment Analysis API with Transformers and FastAPI
- How to Build a Text-to-SQL Pipeline with LLMs
- How to Build a Language Detection and Translation Pipeline
- How to Build a Text Classification Pipeline with SetFit
- How to Build a Legal NER Pipeline with Transformers and spaCy
- How to Build a Text Embedding Pipeline with Sentence Transformers and FAISS
- How to Build a Document Chunking Strategy Comparison Pipeline