The Fastest Way to Ground LLM Answers
RAG is the single best way to make LLMs answer questions about your data without fine-tuning. You embed your documents into a vector store, retrieve the relevant chunks at query time, and feed them into the LLM as context. The model answers based on what you gave it, not what it memorized during training.
LangChain plus ChromaDB is the most practical stack for this. LangChain handles the orchestration – splitting documents, managing embeddings, building the retrieval chain. ChromaDB handles the vector storage with zero infrastructure. No Docker containers, no servers. It runs as an embedded database in your Python process.
Here is everything you need to get a working RAG pipeline running.
Install the Dependencies
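A typical install looks like this. Package names have shifted across LangChain releases (the Chroma and OpenAI integrations now live in their own packages), so treat this as a starting point and check the import errors if anything fails:

```shell
pip install -U langchain langchain-openai langchain-chroma langchain-community chromadb
```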
You need an OpenAI API key for embeddings and the chat model. Set it as an environment variable:
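On macOS or Linux:

```shell
export OPENAI_API_KEY="sk-..."   # your key from the OpenAI dashboard
```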
You can swap OpenAI for any other provider later. LangChain abstracts the embedding and LLM layers, so switching to Anthropic, Cohere, or a local model is a one-line change.
Load and Split Your Documents
RAG pipelines choke on large documents. You need to split them into chunks small enough that the embedding captures the meaning of each piece, but large enough that you do not lose context.
RecursiveCharacterTextSplitter is the right default. It tries to split on paragraph boundaries first, then sentences, then words. The chunk_overlap=200 means each chunk shares 200 characters with its neighbors, so you do not lose information at the boundaries.
A chunk_size of 1000 characters works well for most use cases. Go smaller (500) if your documents have dense, varied topics. Go larger (1500-2000) if the content flows as long narratives.
Create the ChromaDB Vector Store
ChromaDB stores your chunks as embeddings and lets you query them by semantic similarity. The simplest setup persists to a local directory:
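Something like the following, continuing from the `chunks` produced in the splitting step (it calls the OpenAI API, so OPENAI_API_KEY must be set):

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,                 # the chunks from the splitting step
    embedding=embeddings,
    persist_directory="./chroma_db",  # written to disk for reuse
)
```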
Use text-embedding-3-small over text-embedding-ada-002. It is cheaper, faster, and scores higher on MTEB benchmarks. The persist_directory argument saves everything to disk so you do not re-embed every time you restart.
To load an existing collection later without re-embedding:
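Point the constructor at the same directory and the same embedding model:

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)
```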
Query the Vector Store Directly
Before wiring up the full chain, test that retrieval works on its own:
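A quick sanity check, assuming the `vectorstore` from the previous step; the query text is only an example:

```python
results = vectorstore.similarity_search(
    "How does the pipeline work?",  # any question about your documents
    k=4,
)

for doc in results:
    print(doc.metadata.get("source"), "->", doc.page_content[:100])
```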
The k=4 parameter returns the 4 most similar chunks. For most RAG applications, retrieving 3-5 chunks hits the sweet spot between giving the model enough context and staying within token limits.
If the results look off, your chunking strategy is probably wrong. Either the chunks are too big (mixing unrelated content) or too small (losing context).
Build the Full RAG Chain
This is where it comes together. You connect the retriever to a chat model with a prompt that tells it to answer based on the provided context:
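A sketch using LangChain's create_stuff_documents_chain and create_retrieval_chain helpers, continuing from the `vectorstore` above. The model name is an example; swap in whichever chat model you use:

```python
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Answer the question using ONLY the context below. "
     "If the context does not contain the answer, say you don't know.\n\n"
     "Context:\n{context}"),
    ("human", "{input}"),
])

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
combine_docs_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)

response = rag_chain.invoke({"input": "What is this document about?"})
print(response["answer"])
```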
Set temperature=0 for RAG. You want deterministic answers grounded in the documents, not creative responses. The prompt template explicitly tells the model to only use the provided context – this is critical for reducing hallucinations.
The create_stuff_documents_chain approach concatenates all retrieved chunks into a single prompt. This works well for 3-5 chunks. If you need to retrieve many more, look at a map-reduce style chain instead (LangChain's legacy MapReduceDocumentsChain), which processes each chunk separately and then combines the answers.
Add Metadata Filtering
Real applications need more than raw similarity search. ChromaDB supports metadata filtering so you can narrow results before the similarity comparison:
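For example, restricting results to one source document. The query and the "source" metadata value are placeholders; the filter keys must match whatever metadata your loader attached to each chunk:

```python
# Filter by metadata first, then rank the survivors by similarity
results = vectorstore.similarity_search(
    "vacation policy",
    k=4,
    filter={"source": "docs/sample.txt"},
)

# Or bake the filter into a retriever for use inside the chain
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 4, "filter": {"source": "docs/sample.txt"}},
)
```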
This is the right approach when you have documents from multiple sources or time periods. Filter first, then search. It is much more effective than trying to capture temporal or categorical distinctions purely through embeddings.
Common Errors
ValueError: Could not import chromadb python package
ChromaDB requires SQLite 3.35+. On older Ubuntu versions, the system SQLite is too old. Fix it:
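Install a prebuilt modern SQLite binding:

```shell
pip install pysqlite3-binary
```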
Then add this before importing ChromaDB:
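This is the widely used workaround: swap the stdlib sqlite3 module for the pysqlite3 build before chromadb is imported.

```python
# Must run before the first `import chromadb` anywhere in the process
__import__("pysqlite3")
import sys
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")

import chromadb  # now picks up the newer SQLite
```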
openai.AuthenticationError: Incorrect API key provided
Your key is not set or is wrong. Double-check:
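Two quick checks. The second uses repr() so stray whitespace in the key becomes visible:

```shell
echo "$OPENAI_API_KEY" | head -c 12   # should print the start of your key
python -c "import os; print(repr(os.getenv('OPENAI_API_KEY')))"
```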
Make sure there are no trailing spaces or newlines. If you are setting it in a .env file, use python-dotenv to load it.
chromadb.errors.DuplicateIDError
You are adding documents that already exist in the collection. Either clear the collection first with vectorstore.delete_collection() or use unique IDs:
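One way to get stable, unique IDs is to derive them from each chunk's content, so re-running the indexing step maps identical chunks to identical IDs instead of minting new colliding entries. A sketch, continuing from the `chunks` above:

```python
import hashlib

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Content-derived IDs: the same chunk always gets the same ID
ids = [
    hashlib.sha256(chunk.page_content.encode()).hexdigest()
    for chunk in chunks
]

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    ids=ids,
    persist_directory="./chroma_db",
)
```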
Retrieved chunks are irrelevant
This usually means your chunk size is wrong. Try reducing chunk_size to 500 with a proportionally smaller chunk_overlap of around 100. Also check that you are using the same embedding model for indexing and querying – mixing models produces garbage results.
RateLimitError when embedding large document sets
OpenAI rate-limits embedding requests. For large corpora, batch your embeddings:
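A small helper sketch (the function name and defaults are ours, not a LangChain API): it adds documents in fixed-size batches and pauses between calls so the embedding requests stay under the rate limit.

```python
import time

def add_in_batches(vectorstore, docs, batch_size=100, pause=1.0):
    """Add documents in fixed-size batches, sleeping between
    batches to stay under the embedding rate limit."""
    for i in range(0, len(docs), batch_size):
        vectorstore.add_documents(docs[i:i + batch_size])
        time.sleep(pause)

# Usage: add_in_batches(vectorstore, chunks)
```

If you still hit limits, shrink batch_size or lengthen the pause; for very large corpora, add retry-with-backoff around the add_documents call.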
Performance Tips
Use search_type="mmr" (Maximal Marginal Relevance) instead of plain similarity when your top results are too similar to each other. MMR balances relevance with diversity:
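Continuing from the `vectorstore` above:

```python
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 4,        # chunks actually returned
        "fetch_k": 20, # candidates considered before diversity re-ranking
    },
)
```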
This fetches 20 candidates, then picks the 4 that are most relevant while being diverse. It makes a real difference when your documents have repetitive content.
For production, switch from ChromaDB’s default in-process mode to client-server mode. This lets multiple application instances share the same vector store. But for development and single-user apps, the embedded mode is simpler and faster to set up.
Related Guides
- How to Build Agentic RAG with Query Routing and Self-Reflection
- How to Build Retrieval-Augmented Generation with Contextual Compression
- How to Build Multi-Turn Chatbots with Conversation Memory
- How to Build Prompt Chains with Async LLM Calls and Batching
- How to Build Context-Aware Prompt Routing with Embeddings
- How to Manage Long Context Windows and Token Limits in LLM Apps
- How to Build Retrieval-Augmented Prompts with Contextual Grounding
- How to Build a Knowledge Graph from Text with LLMs
- How to Build Few-Shot Prompt Templates with Dynamic Examples
- How to Build Prefix Tuning for LLMs with PEFT and PyTorch