Choosing the wrong chunking strategy silently ruins RAG retrieval quality. The fix is not to guess – it is to benchmark every strategy on your actual documents and pick the winner with data. This pipeline runs four chunking methods against the same corpus and scores them on retrieval accuracy.

Install everything you need first:

pip install langchain-text-splitters sentence-transformers nltk numpy scikit-learn

And the core setup code you will reuse across all strategies:

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Sample document -- replace with your actual corpus
document = """
Machine learning models require careful preprocessing of input data.
Tokenization splits raw text into subword units that the model can process.
Different tokenizers use different vocabularies and splitting rules.

Transfer learning allows you to reuse a pretrained model on a new task.
Fine-tuning adjusts the model weights on your specific dataset.
This approach works well when your dataset is small.

Retrieval-augmented generation combines search with language models.
The retriever fetches relevant documents from a vector store.
The generator uses those documents as context to produce answers.
RAG reduces hallucination by grounding responses in real data.
"""

# Embedding model for evaluation
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

# Test queries with known relevant content
test_queries = [
    "How does tokenization work?",
    "What is transfer learning?",
    "How does RAG reduce hallucination?",
]

# Ground truth: which paragraph index is relevant for each query
ground_truth = [0, 1, 2]  # query 0 -> paragraph 0, etc.

Fixed-Size Character Chunking

The simplest approach. Split every N characters with overlap so you do not lose context at boundaries. This ignores sentence and paragraph structure entirely.

from langchain_text_splitters import CharacterTextSplitter

def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 30) -> list[str]:
    splitter = CharacterTextSplitter(
        separator=" ",
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,
    )
    return splitter.split_text(text)

fixed_chunks = fixed_size_chunks(document, chunk_size=200, overlap=30)
for i, chunk in enumerate(fixed_chunks):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk[:60]}...")

The separator=" " tells it to split on spaces, so you avoid cutting words in half. But it still chops through sentences and paragraphs without any awareness of meaning. Chunk size of 200 characters with 30 characters of overlap is a reasonable starting point for short documents. For longer texts, bump to 500-1000 characters with 50-100 overlap.
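A quick sanity check on those numbers is to estimate the chunk count before splitting: the first chunk covers chunk_size characters, and each later chunk contributes chunk_size minus overlap fresh characters. This back-of-envelope helper is illustrative only (the real splitter snaps to space boundaries, so actual counts drift slightly):

```python
import math

def estimate_chunk_count(doc_length: int, chunk_size: int, overlap: int) -> int:
    """Rough chunk count for fixed-size splitting with overlap."""
    if doc_length <= chunk_size:
        return 1
    stride = chunk_size - overlap  # fresh characters contributed by each new chunk
    return 1 + math.ceil((doc_length - chunk_size) / stride)

print(estimate_chunk_count(10_000, 500, 50))    # 23
print(estimate_chunk_count(10_000, 1000, 100))  # 11
```

Useful for predicting vector-store size before you commit to a parameter choice.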

Recursive Character Splitting

This is the default recommendation for most RAG pipelines. RecursiveCharacterTextSplitter tries a hierarchy of separators – paragraph breaks, line breaks, sentences, words – and picks the cleanest split point that stays under your size limit.

from langchain_text_splitters import RecursiveCharacterTextSplitter

def recursive_chunks(text: str, chunk_size: int = 200, overlap: int = 30) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )
    return splitter.split_text(text)

recursive_result = recursive_chunks(document, chunk_size=200, overlap=30)
for i, chunk in enumerate(recursive_result):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk[:60]}...")

The separator list matters. The default ["\n\n", "\n", ". ", " ", ""] handles most English prose. For code, use RecursiveCharacterTextSplitter.from_language() which knows about function and class boundaries.

Recursive splitting almost always beats fixed-size because it respects natural boundaries. The cost is negligible – it is still just string operations, no model inference.

Semantic Chunking with Embeddings

Semantic chunking uses an embedding model to detect where topics shift. It embeds each sentence, computes cosine similarity between consecutive sentences, and splits wherever similarity drops below a threshold.

def semantic_chunks(text: str, threshold: float = 0.5) -> list[str]:
    # Split on newlines -- one sentence per line in this sample document.
    # For real prose, use a proper sentence tokenizer (e.g. nltk's sent_tokenize).
    sentences = [s.strip() for s in text.strip().split("\n") if s.strip()]

    if len(sentences) <= 1:
        return [text]

    # Embed all sentences
    embeddings = embed_model.encode(sentences)

    # Find topic boundaries by comparing consecutive sentence embeddings
    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        sim = cosine_similarity(
            embeddings[i - 1].reshape(1, -1),
            embeddings[i].reshape(1, -1),
        )[0][0]

        if sim < threshold:
            chunks.append("\n".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    if current_chunk:
        chunks.append("\n".join(current_chunk))

    return chunks

semantic_result = semantic_chunks(document, threshold=0.5)
for i, chunk in enumerate(semantic_result):
    print(f"Chunk {i} ({len(chunk)} chars):\n{chunk}\n---")

The threshold is the knob you tune. Lower values (0.3-0.4) produce fewer, larger chunks. Higher values (0.7-0.8) split aggressively into many small chunks. Start at 0.5 and adjust based on your retrieval scores.

Semantic chunking is significantly slower than string-based methods because it runs an embedding model on every sentence. For a 10,000-sentence document, that is 10,000 inference calls. Use a fast local model like all-MiniLM-L6-v2 and batch the encoding. Do not send individual API calls to OpenAI for this – the latency and cost are not worth it for a preprocessing step.

Sentence-Based Chunking with NLTK

Sentence-based chunking groups a fixed number of sentences into each chunk. It respects sentence boundaries, which is more principled than splitting on character counts, but does not attempt any semantic awareness.

import nltk
nltk.download("punkt_tab", quiet=True)
from nltk.tokenize import sent_tokenize

def sentence_chunks(text: str, sentences_per_chunk: int = 3, overlap_sentences: int = 1) -> list[str]:
    if overlap_sentences >= sentences_per_chunk:
        raise ValueError("overlap must be smaller than sentences_per_chunk or the loop never advances")
    sentences = sent_tokenize(text)
    chunks = []
    start = 0

    while start < len(sentences):
        end = min(start + sentences_per_chunk, len(sentences))
        chunk = " ".join(sentences[start:end])
        chunks.append(chunk)
        start += sentences_per_chunk - overlap_sentences

    return chunks

sentence_result = sentence_chunks(document, sentences_per_chunk=3, overlap_sentences=1)
for i, chunk in enumerate(sentence_result):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk[:60]}...")

Three sentences per chunk with one sentence overlap works well for narrative text and articles. For technical docs with dense paragraphs, bump to 4-5 sentences per chunk. NLTK’s sent_tokenize handles edge cases like abbreviations (“Dr.”, “U.S.”) and decimal numbers that naive period-splitting misses.
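To see the failure mode sent_tokenize avoids, try naive period splitting on text with abbreviations (a toy illustration):

```python
text = "Dr. Smith moved to the U.S. last year. The weather was fine."

# Naive splitting treats every ". " as a sentence boundary, abbreviations included
naive = text.split(". ")
print(naive)
# ['Dr', 'Smith moved to the U.S', 'last year', 'The weather was fine.']
```

sent_tokenize returns the two intended sentences for the same input.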

Benchmarking Retrieval Accuracy

Here is where it all comes together. Run each strategy against the same test queries and measure which one retrieves the correct chunk most often.

def evaluate_strategy(chunks: list[str], queries: list[str], truth: list[int]) -> dict:
    """Score a chunking strategy on retrieval accuracy."""
    chunk_embeddings = embed_model.encode(chunks)
    query_embeddings = embed_model.encode(queries)

    # Ground-truth paragraphs: split once instead of re-splitting per query
    paragraphs = [p.strip() for p in document.strip().split("\n\n") if p.strip()]

    hits = 0
    results = []

    for q_idx, (query, true_para) in enumerate(zip(queries, truth)):
        # Find most similar chunk
        sims = cosine_similarity(
            query_embeddings[q_idx].reshape(1, -1),
            chunk_embeddings,
        )[0]
        best_chunk_idx = int(np.argmax(sims))
        best_score = float(sims[best_chunk_idx])

        # Check if the best chunk contains content from the correct paragraph
        target_text = paragraphs[true_para] if true_para < len(paragraphs) else ""
        # A hit if the top chunk overlaps with the ground truth paragraph
        hit = any(
            sentence.strip() in chunks[best_chunk_idx]
            for sentence in target_text.split("\n")
            if sentence.strip()
        )
        if hit:
            hits += 1

        results.append({
            "query": query,
            "best_chunk": best_chunk_idx,
            "score": best_score,
            "hit": hit,
        })

    accuracy = hits / len(queries) if queries else 0
    return {"accuracy": accuracy, "results": results, "num_chunks": len(chunks)}

# Run the benchmark
strategies = {
    "Fixed-size": fixed_chunks,
    "Recursive": recursive_result,
    "Semantic": semantic_result,
    "Sentence-based": sentence_result,
}

print(f"{'Strategy':<20} {'Chunks':<10} {'Accuracy':<10}")
print("-" * 40)
for name, chunks in strategies.items():
    score = evaluate_strategy(chunks, test_queries, ground_truth)
    print(f"{name:<20} {score['num_chunks']:<10} {score['accuracy']:<10.1%}")

On structured documents with clear topic boundaries (like the sample above), semantic chunking and recursive splitting consistently beat fixed-size. For unstructured text blobs with no paragraph breaks, sentence-based chunking outperforms the others because it at least guarantees complete sentences.

My recommendation: start with RecursiveCharacterTextSplitter at 512 characters with 50 character overlap. Run this benchmark on a sample of your actual documents with 20-30 test queries. If recursive does not clear 80% accuracy, try semantic chunking with threshold tuning. Sentence-based is a solid fallback when your text has poor paragraph structure.

The tradeoff is always speed versus quality. Recursive splitting processes millions of characters per second. Semantic chunking requires embedding every sentence, which on CPU takes about 50ms per sentence with all-MiniLM-L6-v2. For offline batch processing that is fine. For real-time ingestion, stick with recursive or sentence-based.

Common Errors and Fixes

ModuleNotFoundError: No module named 'langchain_text_splitters'

LangChain restructured its packages in version 0.2. The text splitters moved to their own package:

pip install langchain-text-splitters

If you are on LangChain < 0.2, the old import is from langchain.text_splitter import RecursiveCharacterTextSplitter. Upgrade to the new package to get bug fixes and new splitters.

LookupError: Resource punkt_tab not found

NLTK needs to download tokenizer data before you can use sent_tokenize:

import nltk
nltk.download("punkt_tab")

This downloads a small tokenizer data package to your NLTK data directory. The punkt_tab resource replaced the older punkt resource in recent NLTK versions. If punkt_tab fails on an older NLTK, try nltk.download("punkt") instead.

Semantic chunking produces one giant chunk or all single-sentence chunks

Your threshold is miscalibrated. Print the similarity scores between consecutive sentences to see their distribution:

embeddings = embed_model.encode(sentences)
for i in range(len(embeddings) - 1):
    sim = cosine_similarity(
        embeddings[i].reshape(1, -1),
        embeddings[i + 1].reshape(1, -1),
    )[0][0]
    print(f"Sentences {i}-{i+1}: similarity = {sim:.3f}")

Set your threshold to the median similarity minus one standard deviation. That usually gives you reasonable split points.
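That heuristic is a one-liner with numpy (a sketch; the sims values here are made up to stand in for the scores printed above):

```python
import numpy as np

# Consecutive-sentence similarities: high within topics, low at the two topic shifts
sims = [0.72, 0.68, 0.15, 0.70, 0.65, 0.12, 0.74]

threshold = float(np.median(sims) - np.std(sims))
print(f"suggested threshold: {threshold:.3f}")  # suggested threshold: 0.424
```

A threshold of about 0.42 splits at both low-similarity points without fragmenting the rest.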

CharacterTextSplitter returns only one chunk

The separator you specified does not appear in the text. CharacterTextSplitter only splits on a single separator. If you set separator="\n\n" but your text uses single newlines, the entire text becomes one chunk. Switch to RecursiveCharacterTextSplitter which falls through multiple separators automatically.

Overlap causes near-duplicate chunks in vector search results

Keep overlap at 10-15% of chunk size. For 512-character chunks, 50-75 characters is enough. If you are still getting duplicate results, deduplicate at retrieval time by comparing chunk content:

seen_content = set()
unique_results = []
for chunk in retrieved_chunks:
    # Use first 100 chars as a fingerprint
    fingerprint = chunk[:100]
    if fingerprint not in seen_content:
        seen_content.add(fingerprint)
        unique_results.append(chunk)
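One caveat about the prefix fingerprint: overlapping chunks share their middles, not their prefixes, so near-duplicates can slip through. A fuzzier drop-in using difflib catches them (a sketch; the 0.9 cutoff is arbitrary, and the pairwise check is O(n²), which is fine for a handful of retrieved results):

```python
from difflib import SequenceMatcher

def dedupe_fuzzy(chunks: list[str], cutoff: float = 0.9) -> list[str]:
    """Keep a chunk only if it is not near-identical to one already kept."""
    unique: list[str] = []
    for chunk in chunks:
        if not any(SequenceMatcher(None, chunk, kept).ratio() >= cutoff for kept in unique):
            unique.append(chunk)
    return unique

retrieved = [
    "RAG reduces hallucination by grounding responses in real data.",
    "RAG reduces hallucination by grounding responses in real data!",  # near-duplicate
    "The retriever fetches relevant documents from a vector store.",
]
print(len(dedupe_fuzzy(retrieved)))  # 2
```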