The Fastest Way to Ground LLM Answers

RAG (retrieval-augmented generation) is the single best way to make LLMs answer questions about your data without fine-tuning. You embed your documents into a vector store, retrieve the relevant chunks at query time, and feed them to the LLM as context. The model answers based on what you gave it, not what it memorized during training.

LangChain plus ChromaDB is the most practical stack for this. LangChain handles the orchestration – splitting documents, managing embeddings, building the retrieval chain. ChromaDB handles the vector storage with zero infrastructure. No Docker containers, no servers. It runs as an embedded database in your Python process.

Here is everything you need to get a working RAG pipeline running.

Install the Dependencies

pip install langchain langchain-openai langchain-community chromadb pypdf

You need an OpenAI API key for embeddings and the chat model. Set it as an environment variable:

export OPENAI_API_KEY="sk-your-key-here"

You can swap OpenAI for any other provider later. LangChain abstracts the embedding and LLM layers, so switching to Anthropic, Cohere, or a local model is a one-line change.
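For example, swapping providers is mostly a matter of changing one import and one constructor. A sketch, assuming the alternative integration packages are installed separately (the model names below are illustrative):

```python
# Each provider ships as its own package, e.g.:
#   pip install langchain-huggingface langchain-anthropic

# OpenAI (the default used throughout this guide)
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o")

# Or: local embeddings via sentence-transformers, no API key needed
from langchain_huggingface import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Or: Anthropic for the chat model
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-3-5-sonnet-latest")
```

Everything downstream (the vector store, the retrieval chain) takes these objects and does not care which provider backs them.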

Load and Split Your Documents

RAG pipelines choke on large documents. You need to split them into chunks small enough that the embedding captures the meaning of each piece, but large enough that you do not lose context.

from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a PDF
loader = PyPDFLoader("your-document.pdf")
pages = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(pages)
print(f"Split into {len(chunks)} chunks")

RecursiveCharacterTextSplitter is the right default. It tries to split on paragraph boundaries first, then sentences, then words. The chunk_overlap=200 means adjacent chunks share up to 200 characters, so you do not lose information at chunk boundaries.

A chunk_size of 1000 characters works well for most use cases. Go smaller (500) if your documents have dense, varied topics. Go larger (1500-2000) if the content flows as long narratives.
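A quick way to sanity-check whichever chunk_size you pick is to look at the lengths you actually end up with. A small sketch that works on plain strings; with split_documents output, pass [c.page_content for c in chunks]:

```python
def chunk_length_stats(texts):
    """Return (min, max, mean) character lengths for a list of chunk strings."""
    lengths = [len(t) for t in texts]
    return min(lengths), max(lengths), sum(lengths) / len(lengths)

# Stand-in data; in the pipeline above you would call
# chunk_length_stats([c.page_content for c in chunks])
lo, hi, mean = chunk_length_stats(["a" * 400, "b" * 900, "c" * 1100])
print(lo, hi, mean)  # 400 1100 800.0
```

If the mean sits far below your chunk_size, your documents are full of short sections and a smaller chunk_size will do less harm than you might expect.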

Create the ChromaDB Vector Store

ChromaDB stores your chunks as embeddings and lets you query them by semantic similarity. The simplest setup persists to a local directory:

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db",
    collection_name="my_documents"
)

Use text-embedding-3-small over text-embedding-ada-002. It is cheaper, faster, and scores higher on MTEB benchmarks. The persist_directory argument saves everything to disk so you do not re-embed every time you restart.

To load an existing collection later without re-embedding:

vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embedding_model,
    collection_name="my_documents"
)

Query the Vector Store Directly

Before wiring up the full chain, test that retrieval works on its own:

results = vectorstore.similarity_search("What are the key findings?", k=4)
for doc in results:
    print(doc.page_content[:200])
    print(f"Source: {doc.metadata.get('source', 'unknown')}")
    print("---")

The k=4 parameter returns the 4 most similar chunks. For most RAG applications, retrieving 3-5 chunks hits the sweet spot between giving the model enough context and staying within token limits.

If the results look off, your chunking strategy is probably wrong. Either the chunks are too big (mixing unrelated content) or too small (losing context).
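One way to diagnose this is to look at the raw distances. Chroma's similarity_search_with_score returns (document, score) pairs where a lower score means a closer match under the default distance metric. A sketch with stand-in data shaped like that output; the 0.8 cutoff is an arbitrary placeholder you would tune against your own corpus:

```python
def flag_weak_hits(scored_results, max_distance=0.8):
    """Split (doc, distance) pairs into confident and weak matches.

    Intended for the output of vectorstore.similarity_search_with_score(query),
    where lower distance means a closer match. The 0.8 cutoff is a
    placeholder, not a universal threshold.
    """
    good = [(d, s) for d, s in scored_results if s <= max_distance]
    weak = [(d, s) for d, s in scored_results if s > max_distance]
    return good, weak

# Stand-in data for a self-contained demo
good, weak = flag_weak_hits([("chunk about findings", 0.31),
                             ("unrelated chunk", 1.42)])
print(len(good), len(weak))  # 1 1
```

If every result for a reasonable query lands in the weak bucket, re-chunk before you touch anything else in the pipeline.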

Build the Full RAG Chain

This is where it comes together. You connect the retriever to a chat model with a prompt that tells it to answer based on the provided context:

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

llm = ChatOpenAI(model="gpt-4o", temperature=0)

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the provided context.
If the context does not contain enough information, say so.
Do not make up information.

Context: {context}

Question: {input}
""")

combine_docs_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)

response = rag_chain.invoke({"input": "What are the main topics covered?"})
print(response["answer"])

Set temperature=0 for RAG. You want deterministic answers grounded in the documents, not creative responses. The prompt template explicitly tells the model to only use the provided context – this is critical for reducing hallucinations.

The create_stuff_documents_chain approach concatenates all retrieved chunks into a single prompt. This works well for 3-5 chunks. If you need to retrieve many more, look at a map-reduce chain (MapReduceDocumentsChain) instead, which processes each chunk separately and then combines the partial answers.
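The response from create_retrieval_chain also carries the retrieved documents under the "context" key, which makes it easy to show sources alongside the answer. A small helper; the FakeDoc class below exists only to keep the demo self-contained, in practice you pass response["context"] directly:

```python
def unique_sources(docs):
    """Collect distinct 'source' metadata values from retrieved documents,
    preserving retrieval order."""
    seen = []
    for doc in docs:
        src = doc.metadata.get("source", "unknown")
        if src not in seen:
            seen.append(src)
    return seen

# Stand-in Document for the demo; with the real chain, call
# unique_sources(response["context"]) after rag_chain.invoke(...)
class FakeDoc:
    def __init__(self, source):
        self.metadata = {"source": source}

print(unique_sources([FakeDoc("a.pdf"), FakeDoc("b.pdf"), FakeDoc("a.pdf")]))
# ['a.pdf', 'b.pdf']
```

Surfacing sources is cheap insurance: users can verify an answer against the cited document instead of taking the model's word for it.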

Add Metadata Filtering

Real applications need more than raw similarity search. ChromaDB supports metadata filtering so you can narrow results before the similarity comparison:

# Add documents with metadata
from langchain.schema import Document

docs_with_metadata = [
    Document(
        page_content="LangChain supports multiple vector stores...",
        metadata={"source": "docs", "topic": "integration", "year": 2026}
    ),
    Document(
        page_content="ChromaDB uses HNSW for approximate nearest neighbor...",
        metadata={"source": "blog", "topic": "architecture", "year": 2025}
    ),
]

vectorstore.add_documents(docs_with_metadata)

# Query with metadata filter
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"topic": "integration"}
    }
)

This is the right approach when you have documents from multiple sources or time periods. Filter first, then search. It is much more effective than trying to capture temporal or categorical distinctions purely through embeddings.
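Chroma's filter syntax also supports operators like $and, $or, $gte, and $in for compound conditions. The filter is a plain dict, so you can build it programmatically before handing it to the retriever:

```python
# Compound metadata filter: integration docs from 2025 or later.
compound_filter = {
    "$and": [
        {"topic": {"$eq": "integration"}},
        {"year": {"$gte": 2025}},
    ]
}

# Passed the same way as a simple equality filter:
# retriever = vectorstore.as_retriever(
#     search_kwargs={"k": 4, "filter": compound_filter}
# )
print(compound_filter["$and"][1]["year"]["$gte"])  # 2025
```

Note that a bare {"topic": "integration"} is shorthand for the $eq form; once you need more than one condition, you must wrap them in $and or $or.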

Common Errors

ValueError: Could not import chromadb python package

ChromaDB requires SQLite 3.35+. On older Ubuntu versions, the system SQLite is too old. Fix it:

pip install pysqlite3-binary

Then add this before importing ChromaDB:

__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

openai.AuthenticationError: Incorrect API key provided

Your key is not set or is wrong. Double-check:

echo $OPENAI_API_KEY

Make sure there are no trailing spaces or newlines. If you are setting it in a .env file, use python-dotenv to load it.
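With python-dotenv (pip install python-dotenv), loading the key at startup looks like this; the important detail is calling load_dotenv() before constructing any OpenAI client:

```python
# .env file in the project root:
# OPENAI_API_KEY=sk-your-key-here

from dotenv import load_dotenv
import os

load_dotenv()  # reads .env into the process environment
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY not loaded"
```

Keep .env out of version control; a leaked key in a public repo gets scraped and revoked within minutes.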

chromadb.errors.DuplicateIDError

You are adding documents that already exist in the collection. Either clear the collection first with vectorstore.delete_collection() or use unique IDs:

vectorstore.add_documents(new_docs, ids=["doc_1", "doc_2"])

Retrieved chunks are irrelevant

This usually means your chunk size is wrong. Try reducing chunk_size to 500 and increasing chunk_overlap to 100. Also check that you are using the same embedding model for indexing and querying – mixing models produces garbage results.

RateLimitError when embedding large document sets

OpenAI rate-limits embedding requests. For large corpora, batch your embeddings:

# Process in batches of 100 documents
batch_size = 100
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    vectorstore.add_documents(batch)
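If a batch still trips the rate limiter, retrying with exponential backoff usually resolves it. A generic sketch that wraps any callable, demonstrated with a stand-in function rather than a real embedding call; in the loop above you would wrap the insert as with_backoff(vectorstore.add_documents, batch) and pass OpenAI's RateLimitError via retry_on:

```python
import time

def with_backoff(fn, *args, retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn, retrying with exponential backoff on the given exceptions."""
    for attempt in range(retries):
        try:
            return fn(*args)
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Stand-in: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_backoff(flaky, base_delay=0.01, retry_on=(RuntimeError,)))  # ok
```

Catching only the rate-limit exception (rather than bare Exception) matters: you want authentication and validation errors to fail fast, not to burn through five retries.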

Performance Tips

Use search_type="mmr" (Maximal Marginal Relevance) instead of plain similarity when your top results are too similar to each other. MMR balances relevance with diversity:

retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20}
)

This fetches 20 candidates, then picks the 4 that are most relevant while being diverse. It makes a real difference when your documents have repetitive content.

For production, switch from ChromaDB’s default in-process mode to client-server mode. This lets multiple application instances share the same vector store. But for development and single-user apps, the embedded mode is simpler and faster to set up.
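A minimal client-server setup looks like the sketch below, assuming a server started with `chroma run --path ./chroma_db` on the default port 8000:

```python
import chromadb
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Connect to a running Chroma server instead of embedding the DB in-process.
client = chromadb.HttpClient(host="localhost", port=8000)

vectorstore = Chroma(
    client=client,
    collection_name="my_documents",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)
```

The rest of the pipeline (retriever, chain, filters) is unchanged; only the construction of the vector store differs between embedded and client-server mode.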