Standard RAG has a noise problem. You split documents into chunks, embed them, retrieve the top-k matches, and shove them all into the LLM prompt. But most of those chunks contain irrelevant sentences that dilute the useful information. The LLM has to wade through paragraphs of filler to find the three sentences that actually answer the question.
Contextual compression fixes this. Instead of passing raw chunks to the LLM, you run them through a compression step first. Each retrieved document gets filtered, extracted, or condensed so only the relevant parts survive. The result: shorter prompts, lower token costs, and better answers because the LLM focuses on signal instead of noise.
LangChain ships with a ContextualCompressionRetriever that wraps any base retriever and applies compressors before returning results. Here’s how to set it up from scratch.
Setting Up the Vector Store
First, install everything you need:
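A sketch of the install command, assuming the package split used in recent LangChain releases (exact package names may differ for your version):

```shell
pip install -U langchain langchain-openai langchain-community \
    langchain-chroma langchain-text-splitters
```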
Set your OpenAI API key:
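On macOS/Linux this is a one-liner (substitute your real key for the placeholder):

```shell
export OPENAI_API_KEY="sk-..."  # replace with your actual key
```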
Now create a Chroma vector store with some sample documents and build a base retriever:
If you query this retriever for “How do I index JSONB columns?”, you’ll get back full chunks. The first document includes a random sentence about San Francisco weather, and the connection pooling chunk is entirely irrelevant. Both waste tokens and can confuse the LLM.
Adding Contextual Compression
The LLMChainExtractor uses an LLM to read each retrieved document and extract only the sentences relevant to the query. It’s the most accurate compression method because the LLM understands context.
The output now contains only the JSONB indexing information. The weather sentence is gone. The connection pooling document is either removed entirely or stripped down to nothing. You’re left with exactly what the LLM needs to answer the question.
The tradeoff is cost. Every retrieved document gets an LLM call for extraction. With gpt-4o-mini this is cheap, but it adds latency. For high-throughput pipelines, consider the embedding-based filter below.
Using an Embedding Filter for Speed
EmbeddingsFilter skips the LLM entirely. It embeds the query and each document, then drops any document whose embedding similarity to the query falls below a threshold. No LLM calls, no extra latency.
This is fast and cheap. The downside is that it works at the document level – it either keeps or drops an entire chunk. It won’t strip irrelevant sentences from within a document the way LLMChainExtractor does.
Set similarity_threshold based on your data. Start at 0.76 and tune upward. Too high and you lose relevant documents. Too low and you keep noise.
Chaining Multiple Compressors
The real power comes from combining compressors in a pipeline. Use EmbeddingsRedundantFilter to remove near-duplicate chunks, EmbeddingsFilter to drop irrelevant ones, and LLMChainExtractor to extract the good parts from what’s left.
The pipeline processes documents sequentially. The splitter breaks chunks into smaller segments. The redundant filter removes duplicates. The relevance filter drops low-similarity segments. Finally, the extractor pulls out only the relevant sentences from the survivors. Each stage reduces the volume, so the expensive LLM extraction step runs on fewer, more relevant pieces.
This four-stage pipeline is the best balance of quality and cost for production RAG systems. You get deduplication, relevance filtering, and precise extraction without blowing your token budget.
Common Errors and Fixes
ValueError: Missing some input keys: {'context'}
This happens when the LLMChainExtractor receives an empty document. Some chunks get filtered down to empty strings before the extraction step. Fix it by adding a minimum length check or adjusting your similarity threshold downward so fewer documents get removed before extraction:
openai.RateLimitError: Rate limit reached
When you retrieve many documents, LLMChainExtractor fires one LLM call per document. If you’re retrieving 20 chunks and running lots of queries, you’ll hit rate limits fast. Add the embedding filter before the extractor in your pipeline so the LLM only processes the most relevant chunks. Also consider using gpt-4o-mini instead of gpt-4o – it handles the extraction task well at a fraction of the cost.
chromadb.errors.InvalidCollectionException: Collection not found
You’re trying to load a collection that doesn’t exist or the persist directory is wrong. When using Chroma with persistence, make sure you pass the same persist_directory and collection_name when loading:
Note the parameter is embedding_function when loading an existing collection, not embedding as used in from_documents.
TypeError: Expected a Runnable, callable or dict
This shows up when you pass an old-style LLM object where LangChain expects the new interface. Make sure you’re importing from langchain_openai, not the deprecated langchain.llms:
Related Guides
- How to Build RAG Applications with LangChain and ChromaDB
- How to Build Retrieval-Augmented Prompts with Contextual Grounding
- How to Build Agentic RAG with Query Routing and Self-Reflection
- How to Build Context-Aware Prompt Routing with Embeddings
- How to Build Few-Shot Prompt Templates with Dynamic Examples
- How to Fine-Tune Embedding Models for Domain-Specific Search
- How to Build Automatic Prompt Optimization with DSPy
- How to Build Prompt Chains with Tool Results and Structured Outputs
- How to Build Prompt Chains with Async LLM Calls and Batching
- How to Build Prompt Versioning and Regression Testing for LLMs