Standard RAG has a noise problem. You split documents into chunks, embed them, retrieve the top-k matches, and shove them all into the LLM prompt. But most of those chunks contain irrelevant sentences that dilute the useful information. The LLM has to wade through paragraphs of filler to find the three sentences that actually answer the question.
Contextual compression fixes this. Instead of passing raw chunks to the LLM, you run them through a compression step first. Each retrieved document gets filtered, extracted, or condensed so only the relevant parts survive. The result: shorter prompts, lower token costs, and better answers because the LLM focuses on signal instead of noise.
LangChain ships with a ContextualCompressionRetriever that wraps any base retriever and applies compressors before returning results. Here’s how to set it up from scratch.
Setting Up the Vector Store
First, install everything you need:
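A sketch of the install command, assuming the package split used in recent LangChain releases (exact package names may differ for your version):

```shell
pip install -U langchain langchain-openai langchain-community \
    langchain-chroma langchain-text-splitters
```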
Set your OpenAI API key:
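On macOS/Linux this is a one-liner (substitute your real key for the placeholder):

```shell
export OPENAI_API_KEY="sk-..."  # replace with your actual key
```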
Now create a Chroma vector store with some sample documents and build a base retriever:
If you query this retriever for “How do I index JSONB columns?”, you’ll get back full chunks. The first document includes a random sentence about San Francisco weather, and the connection pooling chunk is entirely irrelevant. Both waste tokens and can confuse the LLM.
Adding Contextual Compression
The LLMChainExtractor uses an LLM to read each retrieved document and extract only the sentences relevant to the query. It’s the most accurate compression method because the LLM understands context.
The output now contains only the JSONB indexing information. The weather sentence is gone. The connection pooling document is either removed entirely or stripped down to nothing. You’re left with exactly what the LLM needs to answer the question.
The tradeoff is cost. Every retrieved document gets an LLM call for extraction. With gpt-4o-mini this is cheap, but it adds latency. For high-throughput pipelines, consider the embedding-based filter below.
Using an Embedding Filter for Speed
EmbeddingsFilter skips the LLM entirely. It embeds the query and each document, then drops any document whose embedding similarity to the query falls below a threshold. No LLM calls, no extra latency.
This is fast and cheap. The downside is that it works at the document level – it either keeps or drops an entire chunk. It won’t strip irrelevant sentences from within a document the way LLMChainExtractor does.
Set similarity_threshold based on your data. Start at 0.76 and tune upward. Too high and you lose relevant documents. Too low and you keep noise.
Chaining Multiple Compressors
The real power comes from combining compressors in a pipeline. Use EmbeddingsRedundantFilter to remove near-duplicate chunks, EmbeddingsFilter to drop irrelevant ones, and LLMChainExtractor to extract the good parts from what’s left.
The pipeline processes documents sequentially. The splitter breaks chunks into smaller segments. The redundant filter removes duplicates. The relevance filter drops low-similarity segments. Finally, the extractor pulls out only the relevant sentences from the survivors. Each stage reduces the volume, so the expensive LLM extraction step runs on fewer, more relevant pieces.
This four-stage pipeline is the best balance of quality and cost for production RAG systems. You get deduplication, relevance filtering, and precise extraction without blowing your token budget.
Common Errors and Fixes
ValueError: Missing some input keys: {'context'}
This happens when the LLMChainExtractor receives an empty document. Some chunks get filtered down to empty strings before the extraction step. Fix it by adding a minimum length check or adjusting your similarity threshold downward so fewer documents get removed before extraction:
openai.RateLimitError: Rate limit reached
When you retrieve many documents, LLMChainExtractor fires one LLM call per document. If you’re retrieving 20 chunks and running lots of queries, you’ll hit rate limits fast. Add the embedding filter before the extractor in your pipeline so the LLM only processes the most relevant chunks. Also consider using gpt-4o-mini instead of gpt-4o – it handles the extraction task well at a fraction of the cost.
chromadb.errors.InvalidCollectionException: Collection not found
You’re trying to load a collection that doesn’t exist or the persist directory is wrong. When using Chroma with persistence, make sure you pass the same persist_directory and collection_name when loading:
Note the parameter is embedding_function when loading an existing collection, not embedding as used in from_documents.
TypeError: Expected a Runnable, callable or dict
This shows up when you pass an old-style LLM object where LangChain expects the new interface. Make sure you’re importing from langchain_openai, not the deprecated langchain.llms:
Related Guides
- How to Build RAG Applications with LangChain and ChromaDB
- How to Build Retrieval-Augmented Prompts with Contextual Grounding
- How to Build Agentic RAG with Query Routing and Self-Reflection
- How to Build Context-Aware Prompt Routing with Embeddings
- How to Build Few-Shot Prompt Templates with Dynamic Examples
- How to Fine-Tune Embedding Models for Domain-Specific Search
- How to Build Automatic Prompt Optimization with DSPy
- How to Build Prompt Chains with Tool Results and Structured Outputs
- How to Build Prompt Chains with Async LLM Calls and Batching
- How to Build Prompt Versioning and Regression Testing for LLMs