The Quick Version
Bad chunking is the number one reason RAG pipelines return garbage. You can have the best embedding model and the fastest vector database, but if your chunks split a paragraph mid-sentence or cram three unrelated topics into one block, retrieval quality tanks.
Here is a working chunking pipeline using LangChain’s RecursiveCharacterTextSplitter that handles 90% of use cases:
That is your starting point. The RecursiveCharacterTextSplitter tries each separator in order – paragraph breaks first, then line breaks, then words by default; add ". " to the separators list to also break on sentence boundaries. It preserves natural boundaries instead of chopping text at arbitrary positions.
But “starting point” is the key phrase. Different documents, embedding models, and retrieval patterns need different chunking strategies. The rest of this guide covers when and why to use each one.
Why Chunk Size Matters for Retrieval
Embedding models compress a chunk of text into a single vector. That vector needs to represent the meaning of the chunk well enough for similarity search to work.
Too small (under 100 characters) and each chunk lacks context. The embedding captures a sentence fragment that could mean anything. Too large (over 2000 characters) and you dilute the signal. A chunk covering three topics produces a vector that is vaguely similar to all three but strongly similar to none.
The sweet spot depends on your embedding model’s training data and token limit:
| Embedding Model | Max Tokens | Recommended Chunk Size |
|---|---|---|
| text-embedding-3-small (OpenAI) | 8191 | 256-512 tokens |
| all-MiniLM-L6-v2 (Sentence Transformers) | 256 | 128-200 tokens |
| all-mpnet-base-v2 (Sentence Transformers) | 384 | 200-300 tokens |
| text-embedding-ada-002 (OpenAI) | 8191 | 256-512 tokens |
| e5-large-v2 | 512 | 256-400 tokens |
| bge-large-en-v1.5 | 512 | 256-400 tokens |
Models with shorter context windows like all-MiniLM-L6-v2 perform best with chunks that actually fit within their window. Sending 1000 tokens to a 256-token model means the tail gets truncated silently, and your embedding only represents the first part of the chunk.
Fixed-Size Chunking with Overlap
The simplest approach. Split every N characters with some overlap so you do not lose context at the boundaries.
CharacterTextSplitter only splits on the single separator you provide. If your text has no double newlines, you get one giant chunk. That is why RecursiveCharacterTextSplitter is almost always the better default – it falls through multiple separators.
The overlap parameter controls how many characters from the end of one chunk appear at the start of the next. An overlap of 10-20% of chunk size works well. For a 1000-character chunk, 100-200 characters of overlap keeps cross-boundary context intact without bloating your index.
Token-Based Chunking with tiktoken
Character counts are a rough proxy. What you actually care about is token count, because that is what embedding models and LLMs consume. The word “extraordinary” is 13 characters but only one or two tokens depending on the tokenizer.
Use tiktoken to chunk by exact token counts:
LangChain also has a built-in token-based splitter if you prefer:
The from_tiktoken_encoder class method swaps the length function so chunk_size and chunk_overlap are measured in tokens, not characters. It still uses the recursive separator logic to find clean break points.
Use cl100k_base for GPT-4, GPT-3.5, and the text-embedding-3-* models; text-embedding-ada-002 uses the same encoding. GPT-4o and newer chat models use o200k_base instead. If you are using a non-OpenAI embedding model, character-based chunking with size estimates is usually fine since you do not need exact token parity.
Semantic Chunking with Sentence Embeddings
Fixed-size and token-based chunking ignore meaning entirely. Semantic chunking uses embeddings to detect natural topic boundaries – it groups sentences that are about the same thing and splits where the topic shifts.
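There is no single canonical implementation, so here is a minimal sketch of the idea. The semantic_chunks function is hypothetical; it takes any embedding function so you can plug in a local Sentence Transformers model, and the 0.75 threshold is just a starting point:

```python
import numpy as np


def semantic_chunks(sentences: list[str], embed, threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences; start a new chunk whenever cosine
    similarity between adjacent sentence embeddings drops below threshold."""
    if not sentences:
        return []
    embs = np.asarray(embed(sentences), dtype=float)
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)  # normalize for cosine
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(embs[i - 1] @ embs[i]) < threshold:  # topic shift detected
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


# In practice, plug in a local model, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   chunks = semantic_chunks(sentences, model.encode, threshold=0.75)
```

Passing the embedding function in as a parameter keeps the boundary logic testable and lets you batch or swap models without touching the chunking code.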
The threshold parameter controls sensitivity. Lower values (0.5-0.6) produce fewer, larger chunks. Higher values (0.8-0.9) split more aggressively. Start at 0.75 and tune based on your retrieval metrics.
Semantic chunking is slower than fixed-size because you embed every sentence. For a 10,000-sentence document, that is 10,000 embedding calls. Batch them and use a local model like all-MiniLM-L6-v2 to keep latency reasonable. Do not send 10,000 API calls to OpenAI’s embedding endpoint for chunking – that defeats the purpose.
Document-Aware Chunking
Structured documents (Markdown, HTML, code) have explicit boundaries you should respect. LangChain ships specialized splitters for these formats.
Markdown
Each chunk gets metadata with the header hierarchy. When you store these in your vector database, the metadata lets you filter by section. A query about “installation” can prioritize chunks under an “Installation” header.
HTML
Code
For code files, split on function and class boundaries instead of arbitrary character limits:
This knows about Python’s syntax – it splits on class definitions, function definitions, and decorators before falling back to line breaks.
Comparing Strategies: When to Use What
There is no universal best strategy. The right choice depends on your documents and your retrieval pattern.
Fixed-size (CharacterTextSplitter): Use when your text is uniform and unstructured – chat logs, plain-text transcripts, data dumps. Fast and predictable.
Recursive (RecursiveCharacterTextSplitter): Your default for general-purpose RAG. Works well on articles, reports, documentation, and mixed content. Respects natural text boundaries without the overhead of embedding every sentence.
Token-based (tiktoken): Use when you need precise control over token budgets – for example, when your embedding model has a hard 256-token limit and you cannot afford truncation.
Semantic: Use when topic coherence within chunks matters more than speed. Technical documentation with frequent topic switches benefits from this. Blog posts and articles with linear flow usually do not.
Document-aware (Markdown/HTML/Code): Use whenever your source documents have structural markup. You get better chunks and free metadata for filtering.
In practice, combine them. Split a Markdown document by headers first, then apply recursive character splitting to any sections that are still too long:
Common Errors and Fixes
ModuleNotFoundError: No module named 'langchain_text_splitters'
LangChain restructured its packages. Install the text splitters package separately:
If you are on an older LangChain version (< 0.2), the import path is from langchain.text_splitter import RecursiveCharacterTextSplitter. The newer langchain_text_splitters package works with LangChain 0.2+.
Chunks are too small or too large despite setting chunk_size
The chunk_size parameter is an upper bound, not a target. If your text has natural breaks (double newlines) that occur more frequently than your chunk size, you get smaller chunks. If none of the separators match, you get one big chunk. Check your separators list and make sure it includes patterns that actually appear in your text.
tiktoken encoding not found
Make sure you use a valid encoding name. The common ones are:
- cl100k_base – GPT-4, GPT-3.5, text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002
- o200k_base – GPT-4o and newer models
- p50k_base – older GPT-3 / Codex models
Semantic chunking produces single-sentence chunks
Your similarity threshold is too high. Lower it from 0.8 to 0.6. Also check that your sentences are being split correctly – if the text uses semicolons or line breaks instead of periods, the naive . split misses them. Use a proper sentence tokenizer like nltk.sent_tokenize for better results:
Overlap causes duplicate retrieval results
If your overlap is too large relative to chunk size, adjacent chunks become very similar and your vector search returns the same content twice from different chunks. Keep overlap at 10-15% of chunk size. For a 512-token chunk, 50-75 tokens of overlap is plenty.
HTMLHeaderTextSplitter returns empty chunks
The HTML must have actual header tags (<h1>, <h2>, etc.). If your HTML uses <div class="heading"> or other non-standard markup, the splitter does not detect them. Preprocess your HTML to convert custom heading elements to standard tags before splitting.
Related Guides
- How to Build a RAG Pipeline with Hugging Face Transformers v5
- How to Build a Semantic Search Engine with Embeddings
- How to Build a Sentiment-Aware Search Pipeline with Embeddings
- How to Build a Hybrid RAG Pipeline with Qwen3 Embeddings and Qdrant in 2026
- How to Build a Text Summarization Pipeline with Sumy and Transformers
- How to Build a Hybrid Keyword and Semantic Search Pipeline
- How to Build an Abstractive Summarization Pipeline with PEGASUS
- How to Build an Emotion Detection Pipeline with GoEmotions and Transformers
- How to Build an Aspect-Based Sentiment Analysis Pipeline
- How to Build a Keyphrase Generation Pipeline with KeyphraseVectorizers