The Quick Version
Bad chunking is the number one reason RAG pipelines return garbage. You can have the best embedding model and the fastest vector database, but if your chunks split a paragraph mid-sentence or cram three unrelated topics into one block, retrieval quality tanks.
Here is a working chunking pipeline using LangChain’s RecursiveCharacterTextSplitter that handles 90% of use cases:
That is your starting point. The RecursiveCharacterTextSplitter tries each separator in order – paragraph breaks first, then line breaks, then words by default; add ". " to the separators list to also break on sentence boundaries. It preserves natural boundaries instead of chopping text at arbitrary positions.
But “starting point” is the key phrase. Different documents, embedding models, and retrieval patterns need different chunking strategies. The rest of this guide covers when and why to use each one.
Why Chunk Size Matters for Retrieval
Embedding models compress a chunk of text into a single vector. That vector needs to represent the meaning of the chunk well enough for similarity search to work.
Too small (under 100 characters) and each chunk lacks context. The embedding captures a sentence fragment that could mean anything. Too large (over 2000 characters) and you dilute the signal. A chunk covering three topics produces a vector that is vaguely similar to all three but strongly similar to none.
The sweet spot depends on your embedding model’s training data and token limit:
| Embedding Model | Max Tokens | Recommended Chunk Size |
|---|---|---|
| text-embedding-3-small (OpenAI) | 8191 | 256-512 tokens |
| all-MiniLM-L6-v2 (Sentence Transformers) | 256 | 128-200 tokens |
| all-mpnet-base-v2 (Sentence Transformers) | 384 | 200-300 tokens |
| text-embedding-ada-002 (OpenAI) | 8191 | 256-512 tokens |
| e5-large-v2 | 512 | 256-400 tokens |
| bge-large-en-v1.5 | 512 | 256-400 tokens |
Models with shorter context windows like all-MiniLM-L6-v2 perform best with chunks that actually fit within their window. Sending 1000 tokens to a 256-token model means the tail gets truncated silently, and your embedding only represents the first part of the chunk.
Fixed-Size Chunking with Overlap
The simplest approach. Split every N characters with some overlap so you do not lose context at the boundaries.
CharacterTextSplitter only splits on the single separator you provide. If your text has no double newlines, you get one giant chunk. That is why RecursiveCharacterTextSplitter is almost always the better default – it falls through multiple separators.
The overlap parameter controls how many characters from the end of one chunk appear at the start of the next. An overlap of 10-20% of chunk size works well. For a 1000-character chunk, 100-200 characters of overlap keeps cross-boundary context intact without bloating your index.
Token-Based Chunking with tiktoken
Character counts are a rough proxy. What you actually care about is token count, because that is what embedding models and LLMs consume. The word “extraordinary” is 13 characters but only one or two tokens depending on the tokenizer.
Use tiktoken to chunk by exact token counts:
LangChain also has a built-in token-based splitter if you prefer:
The from_tiktoken_encoder class method swaps the length function so chunk_size and chunk_overlap are measured in tokens, not characters. It still uses the recursive separator logic to find clean break points.
Use cl100k_base for GPT-4, GPT-3.5, and the text-embedding-3-* models; text-embedding-ada-002 uses the same encoding. GPT-4o and newer chat models use o200k_base instead. If you are using a non-OpenAI embedding model, character-based chunking with size estimates is usually fine since you do not need exact token parity.
Semantic Chunking with Sentence Embeddings
Fixed-size and token-based chunking ignore meaning entirely. Semantic chunking uses embeddings to detect natural topic boundaries – it groups sentences that are about the same thing and splits where the topic shifts.
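There is no single canonical implementation, so here is a minimal sketch of the idea. The semantic_chunks function is hypothetical; it takes any embedding function so you can plug in a local Sentence Transformers model, and the 0.75 threshold is just a starting point:

```python
import numpy as np


def semantic_chunks(sentences: list[str], embed, threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences; start a new chunk whenever cosine
    similarity between adjacent sentence embeddings drops below threshold."""
    if not sentences:
        return []
    embs = np.asarray(embed(sentences), dtype=float)
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)  # normalize for cosine
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(embs[i - 1] @ embs[i]) < threshold:  # topic shift detected
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


# In practice, plug in a local model, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   chunks = semantic_chunks(sentences, model.encode, threshold=0.75)
```

Passing the embedding function in as a parameter keeps the boundary logic testable and lets you batch or swap models without touching the chunking code.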
The threshold parameter controls sensitivity. Lower values (0.5-0.6) produce fewer, larger chunks. Higher values (0.8-0.9) split more aggressively. Start at 0.75 and tune based on your retrieval metrics.
Semantic chunking is slower than fixed-size because you embed every sentence. For a 10,000-sentence document, that is 10,000 embedding calls. Batch them and use a local model like all-MiniLM-L6-v2 to keep latency reasonable. Do not send 10,000 API calls to OpenAI’s embedding endpoint for chunking – that defeats the purpose.
Document-Aware Chunking
Structured documents (Markdown, HTML, code) have explicit boundaries you should respect. LangChain ships specialized splitters for these formats.
Markdown
Each chunk gets metadata with the header hierarchy. When you store these in your vector database, the metadata lets you filter by section. A query about “installation” can prioritize chunks under an “Installation” header.
HTML
Code
For code files, split on function and class boundaries instead of arbitrary character limits:
This knows about Python’s syntax – it splits on class definitions, function definitions, and decorators before falling back to line breaks.
Comparing Strategies: When to Use What
There is no universal best strategy. The right choice depends on your documents and your retrieval pattern.
Fixed-size (CharacterTextSplitter): Use when your text is uniform and unstructured – chat logs, plain-text transcripts, data dumps. Fast and predictable.
Recursive (RecursiveCharacterTextSplitter): Your default for general-purpose RAG. Works well on articles, reports, documentation, and mixed content. Respects natural text boundaries without the overhead of embedding every sentence.
Token-based (tiktoken): Use when you need precise control over token budgets – for example, when your embedding model has a hard 256-token limit and you cannot afford truncation.
Semantic: Use when topic coherence within chunks matters more than speed. Technical documentation with frequent topic switches benefits from this. Blog posts and articles with linear flow usually do not.
Document-aware (Markdown/HTML/Code): Use whenever your source documents have structural markup. You get better chunks and free metadata for filtering.
In practice, combine them. Split a Markdown document by headers first, then apply recursive character splitting to any sections that are still too long:
Common Errors and Fixes
ModuleNotFoundError: No module named 'langchain_text_splitters'
LangChain restructured its packages. Install the text splitters package separately:
If you are on an older LangChain version (< 0.2), the import path is from langchain.text_splitter import RecursiveCharacterTextSplitter. The newer langchain_text_splitters package works with LangChain 0.2+.
Chunks are too small or too large despite setting chunk_size
The chunk_size parameter is an upper bound, not a target. If your text has natural breaks (double newlines) that occur more frequently than your chunk size, you get smaller chunks. If none of the separators match, you get one big chunk. Check your separators list and make sure it includes patterns that actually appear in your text.
tiktoken encoding not found
Make sure you use a valid encoding name. The common ones are:
- cl100k_base – GPT-4, GPT-3.5, text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002
- o200k_base – GPT-4o and newer models
- p50k_base – older GPT-3 / Codex models
Semantic chunking produces single-sentence chunks
Your similarity threshold is too high. Lower it from 0.8 to 0.6. Also check that your sentences are being split correctly – if the text uses semicolons or line breaks instead of periods, the naive . split misses them. Use a proper sentence tokenizer like nltk.sent_tokenize for better results:
Overlap causes duplicate retrieval results
If your overlap is too large relative to chunk size, adjacent chunks become very similar and your vector search returns the same content twice from different chunks. Keep overlap at 10-15% of chunk size. For a 512-token chunk, 50-75 tokens of overlap is plenty.
HTMLHeaderTextSplitter returns empty chunks
The HTML must have actual header tags (<h1>, <h2>, etc.). If your HTML uses <div class="heading"> or other non-standard markup, the splitter does not detect them. Preprocess your HTML to convert custom heading elements to standard tags before splitting.
Related Guides
- How to Build a RAG Pipeline with Hugging Face Transformers v5
- How to Build a Semantic Search Engine with Embeddings
- How to Build a Sentiment-Aware Search Pipeline with Embeddings
- How to Build a Hybrid RAG Pipeline with Qwen3 Embeddings and Qdrant in 2026
- How to Build a Text Summarization Pipeline with Sumy and Transformers
- How to Build a Hybrid Keyword and Semantic Search Pipeline
- How to Build an Abstractive Summarization Pipeline with PEGASUS
- How to Build an Emotion Detection Pipeline with GoEmotions and Transformers
- How to Build an Aspect-Based Sentiment Analysis Pipeline
- How to Build a Keyphrase Generation Pipeline with KeyphraseVectorizers