Choosing the wrong chunking strategy silently ruins RAG retrieval quality. The fix is not to guess – it is to benchmark every strategy on your actual documents and pick the winner with data. This pipeline runs four chunking methods against the same corpus and scores them on retrieval accuracy.
Install everything you need first:
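A typical install covering everything used below (package names are the current PyPI names; pin versions if you need reproducibility):

```shell
pip install langchain-text-splitters sentence-transformers nltk numpy
```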
And the core setup code you will reuse across all strategies:
Fixed-Size Character Chunking
The simplest approach. Split every N characters with overlap so you do not lose context at boundaries. This ignores sentence and paragraph structure entirely.
Setting separator=" " tells the splitter to break on spaces, so you avoid cutting words in half. But it still chops through sentences and paragraphs without any awareness of meaning. A chunk size of 200 characters with 30 characters of overlap is a reasonable starting point for short documents. For longer texts, bump to 500-1000 characters with 50-100 overlap.
Recursive Character Splitting
This is the default recommendation for most RAG pipelines. RecursiveCharacterTextSplitter tries a hierarchy of separators – paragraph breaks, line breaks, sentences, words – and picks the cleanest split point that stays under your size limit.
The separator list matters. The default ["\n\n", "\n", ". ", " ", ""] handles most English prose. For code, use RecursiveCharacterTextSplitter.from_language() which knows about function and class boundaries.
Recursive splitting almost always beats fixed-size because it respects natural boundaries. The cost is negligible – it is still just string operations, no model inference.
Semantic Chunking with Embeddings
Semantic chunking uses an embedding model to detect where topics shift. It embeds each sentence, computes cosine similarity between consecutive sentences, and splits wherever similarity drops below a threshold.
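A from-scratch sketch of the idea rather than any particular library's API. The `embed` callable is assumed to map a list of sentences to a list of vectors (e.g. `SentenceTransformer("all-MiniLM-L6-v2").encode`), and `semantic_chunks` is my own name:

```python
import numpy as np

def semantic_chunks(sentences, embed, threshold=0.5):
    """Split wherever cosine similarity between consecutive
    sentence embeddings drops below `threshold`."""
    vecs = np.asarray(embed(sentences), dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(vecs[i - 1] @ vecs[i])  # cosine similarity
        if sim < threshold:                  # topic shift detected
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

Passing the whole sentence list to `embed` in one call is what makes batching possible, as recommended below.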
The threshold is the knob you tune. Lower values (0.3-0.4) produce fewer, larger chunks. Higher values (0.7-0.8) split aggressively into many small chunks. Start at 0.5 and adjust based on your retrieval scores.
Semantic chunking is significantly slower than string-based methods because it runs an embedding model on every sentence. For a 10,000-sentence document, that is 10,000 inference calls. Use a fast local model like all-MiniLM-L6-v2 and batch the encoding. Do not send individual API calls to OpenAI for this – the latency and cost are not worth it for a preprocessing step.
Sentence-Based Chunking with NLTK
Sentence-based chunking groups a fixed number of sentences into each chunk. It respects sentence boundaries, which is more principled than splitting on character counts, but does not attempt any semantic awareness.
Three sentences per chunk with one sentence overlap works well for narrative text and articles. For technical docs with dense paragraphs, bump to 4-5 sentences per chunk. NLTK’s sent_tokenize handles edge cases like abbreviations (“Dr.”, “U.S.”) and decimal numbers that naive period-splitting misses.
Benchmarking Retrieval Accuracy
Here is where it all comes together. Run each strategy against the same test queries and measure which one retrieves the correct chunk most often.
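A minimal harness, assuming the keyword-in-top-hit scoring described above and a pluggable `embed` callable (e.g. `SentenceTransformer("all-MiniLM-L6-v2").encode`); the function name `retrieval_accuracy` is mine:

```python
import numpy as np

def retrieval_accuracy(chunks, queries, embed):
    """queries: list of (question, answer_keyword) pairs. A query
    counts as correct when the single top-scoring chunk by cosine
    similarity contains the expected keyword."""
    cvecs = np.asarray(embed(chunks), dtype=float)
    cvecs /= np.linalg.norm(cvecs, axis=1, keepdims=True)

    hits = 0
    for question, keyword in queries:
        q = np.asarray(embed([question]), dtype=float)[0]
        q /= np.linalg.norm(q)
        top_chunk = chunks[int(np.argmax(cvecs @ q))]
        hits += keyword.lower() in top_chunk.lower()
    return hits / len(queries)

# Score every strategy on the same chunk lists, e.g.:
# for name, chunks in strategies.items():
#     print(name, retrieval_accuracy(chunks, TEST_QUERIES, embed))
```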
On structured documents with clear topic boundaries (like the sample above), semantic chunking and recursive splitting consistently beat fixed-size. For unstructured text blobs with no paragraph breaks, sentence-based chunking outperforms the others because it at least guarantees complete sentences.
My recommendation: start with RecursiveCharacterTextSplitter at 512 characters with 50 character overlap. Run this benchmark on a sample of your actual documents with 20-30 test queries. If recursive does not clear 80% accuracy, try semantic chunking with threshold tuning. Sentence-based is a solid fallback when your text has poor paragraph structure.
The tradeoff is always speed versus quality. Recursive splitting processes millions of characters per second. Semantic chunking requires embedding every sentence, which on CPU takes about 50ms per sentence with all-MiniLM-L6-v2. For offline batch processing that is fine. For real-time ingestion, stick with recursive or sentence-based.
Common Errors and Fixes
ModuleNotFoundError: No module named 'langchain_text_splitters'
LangChain restructured its packages in version 0.2. The text splitters moved to their own package:
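Install the standalone package:

```shell
pip install -U langchain-text-splitters
```

and import from it directly: `from langchain_text_splitters import RecursiveCharacterTextSplitter`.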
If you are on LangChain < 0.2, the old import is from langchain.text_splitter import RecursiveCharacterTextSplitter. Upgrade to the new package to get bug fixes and new splitters.
LookupError: Resource punkt_tab not found
NLTK needs to download tokenizer data before you can use sent_tokenize:
This downloads about 35KB to your NLTK data directory. The punkt_tab resource replaced the older punkt resource in recent NLTK versions. If punkt_tab fails on an older NLTK, try nltk.download("punkt") instead.
Semantic chunking produces one giant chunk or all single-sentence chunks
Your threshold is miscalibrated. Print the similarity scores between consecutive sentences to see their distribution:
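A small diagnostic along these lines, again assuming an `embed` callable that maps sentences to vectors (the helper name is mine):

```python
import numpy as np

def similarity_report(sentences, embed):
    """Print the distribution of consecutive-sentence similarities
    and suggest a threshold of median minus one standard deviation."""
    vecs = np.asarray(embed(sentences), dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = np.array(
        [float(vecs[i] @ vecs[i + 1]) for i in range(len(vecs) - 1)]
    )
    print(f"min={sims.min():.2f}  median={np.median(sims):.2f}  "
          f"max={sims.max():.2f}  std={sims.std():.2f}")
    return float(np.median(sims) - sims.std())
```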
Set your threshold to the median similarity minus one standard deviation. That usually gives you reasonable split points.
CharacterTextSplitter returns only one chunk
The separator you specified does not appear in the text. CharacterTextSplitter only splits on a single separator. If you set separator="\n\n" but your text uses single newlines, the entire text becomes one chunk. Switch to RecursiveCharacterTextSplitter which falls through multiple separators automatically.
Overlap causes near-duplicate chunks in vector search results
Keep overlap at 10-15% of chunk size. For 512-character chunks, 50-75 characters is enough. If you are still getting duplicate results, deduplicate at retrieval time by comparing chunk content:
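One simple way to do this, sketched here with a word-overlap heuristic (the function name and 0.5 ratio are my choices, not a standard API):

```python
def dedupe_results(chunks, min_unique_ratio=0.5):
    """Drop a retrieved chunk when most of its words already
    appeared in higher-ranked results. `chunks` is assumed to be
    ordered best-first, as returned by the vector store."""
    kept, seen = [], set()
    for chunk in chunks:
        words = chunk.lower().split()
        new = [w for w in words if w not in seen]
        if words and len(new) / len(words) >= min_unique_ratio:
            kept.append(chunk)
            seen.update(words)
    return kept
```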
Related Guides
- How to Build a Named Entity Recognition Pipeline with spaCy and Transformers
- How to Build a Text-to-SQL Pipeline with LLMs
- How to Build a Spell Checking and Autocorrect Pipeline with Python
- How to Build a Text Correction and Grammar Checking Pipeline
- How to Build a Text Classification Pipeline with SetFit
- How to Build a Text Embedding Pipeline with Sentence Transformers and FAISS
- How to Build a Text Clustering Pipeline with Embeddings and HDBSCAN
- How to Build a Text Readability Scoring Pipeline with Python
- How to Build a Text Style Transfer Pipeline with Transformers
- How to Build a Text Chunking and Splitting Pipeline for RAG