What RAG Actually Solves
LLMs hallucinate. They confidently make up facts, cite nonexistent papers, and fabricate API methods. Retrieval-augmented generation (RAG) mitigates this by feeding the model relevant documents before it generates a response. Instead of relying on what the model memorized during training, you give it the actual source material and tell it to answer based on that.
Transformers v5 shipped in December 2025 with a cleaner pipeline API, PyTorch as the sole backend (TensorFlow and JAX are gone), and first-class quantization support. These changes make building a local RAG pipeline simpler than it used to be.
Here is the full stack: embed your documents with a sentence-transformer model, index them with FAISS for fast similarity search, retrieve the top matches for a user query, then feed those chunks into an LLM to generate a grounded answer.
Install the Dependencies
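A minimal install, roughly what the stack below needs. The faiss-cpu build is assumed here; swap in faiss-gpu if you have CUDA set up, and pin versions however your project requires.

```bash
pip install torch "transformers>=5" sentence-transformers faiss-cpu accelerate
```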
Transformers v5 requires Python 3.10+. If you are still on 3.9, pip will refuse to install it with a Python version error before you ever import anything.
Upgrade Python first. On Ubuntu: sudo apt install python3.12 python3.12-venv.
Build the Document Index
Start by embedding your documents and storing them in a FAISS index. This is the retrieval half of RAG.
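A sketch of the indexing step. The model name (all-MiniLM-L6-v2) and the tiny sample corpus are placeholders; substitute your own documents and preferred embedding model.

```python
import faiss
from sentence_transformers import SentenceTransformer

documents = [
    "Transformers v5 uses PyTorch as its only backend.",
    "FAISS provides exact and approximate nearest-neighbor search.",
    # ... your real corpus goes here
]

# Small, fast embedding model with 384-dimensional output.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# L2-normalized embeddings so that inner product equals cosine similarity.
embeddings = embedder.encode(documents, normalize_embeddings=True)
embeddings = embeddings.astype("float32")  # FAISS expects float32

# Exact inner-product index.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

def retrieve(query, k=3):
    """Return the top-k (chunk, cosine_score) pairs for a query."""
    query_emb = embedder.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(query_emb, k)
    return [(documents[i], float(s)) for i, s in zip(ids[0], scores[0])]
```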
A few things to note. IndexFlatIP does exact inner-product search. Because the embeddings are L2-normalized (normalize_embeddings=True), inner product equals cosine similarity. For datasets under a million documents, exact search is fast enough. Beyond that, switch to IndexIVFFlat or IndexHNSWFlat for approximate nearest neighbors.
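If you do outgrow exact search, the HNSW variant is close to a drop-in replacement. A sketch, assuming the same normalized embeddings as above; the value 32 is the graph connectivity parameter, which you would tune for your corpus.

```python
# Approximate nearest-neighbor index; same add()/search() interface as IndexFlatIP.
index = faiss.IndexHNSWFlat(embeddings.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
index.add(embeddings)
```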
Retrieve and Generate
Now wire the retriever to a text-generation model. This is where Transformers v5 comes in.
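A sketch of the generation half, reusing the retrieve() helper from the indexing step. The model id microsoft/Phi-3.5-mini-instruct is one reasonable choice matching the Phi-3.5-mini model discussed below; any instruction-tuned chat model works here.

```python
import torch
from transformers import pipeline

def rag_query(question, k=3):
    # Retrieve the top-k chunks from the FAISS index built above.
    hits = retrieve(question, k=k)
    context = "\n\n".join(chunk for chunk, _score in hits)

    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # Fine for a tutorial; in production, load the pipeline once at startup.
    generator = pipeline(
        "text-generation",
        model="microsoft/Phi-3.5-mini-instruct",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    messages = [{"role": "user", "content": prompt}]
    output = generator(messages, max_new_tokens=256, do_sample=False)
    # The chat-format pipeline returns the full conversation; the last message
    # is the assistant's answer.
    return output[0]["generated_text"][-1]["content"]

print(rag_query("Which backends does Transformers v5 support?"))
```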
The device_map="auto" flag lets Accelerate distribute the model across available GPUs, or fall back to CPU if none are present. The torch_dtype=torch.bfloat16 halves memory usage with negligible quality loss on modern hardware.
You should create the generator pipeline once and reuse it across calls. Instantiating it inside the function like this is fine for a tutorial, but in production you would load the model at startup.
Picking the Right Chunk Size
Chunking is the single biggest factor in retrieval quality. Too large and you dilute the relevant signal with noise. Too small and you split key information across chunks, making it impossible to retrieve as a unit.
For most use cases, aim for 256 to 512 tokens per chunk with 50 to 100 tokens of overlap between consecutive chunks. Here is a simple chunker:
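A minimal sketch of a sliding-window chunker. It counts whitespace-separated words rather than model tokens, which is a rough but serviceable approximation; swap in the embedder's tokenizer if you need exact token counts.

```python
def chunk_text(text, chunk_size=384, overlap=64):
    """Split text into overlapping word-based chunks."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + chunk_size])
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(words):
            break
    return chunks
```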
If your retrieval scores are all bunched together (everything above 0.7 or everything below 0.4), your chunks are probably the wrong size. Experiment with different values and check the actual similarity scores.
Errors You Will Actually Hit
RuntimeError: CUDA out of memory when loading the generation model. Phi-3.5-mini needs around 7 GB of VRAM in bfloat16. If you are on a smaller GPU, use 4-bit quantization:
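A sketch of the 4-bit path, assuming bitsandbytes is installed and you are on a CUDA GPU. The NF4 settings shown are a common default, not the only option.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

model_id = "microsoft/Phi-3.5-mini-instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the weights quantized, then hand the model to the pipeline.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
```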
This drops memory usage to around 2.5 GB with minimal quality loss for short-form Q&A tasks.
ValueError: text input must be of type str (single example), List[str] (batch) from sentence-transformers. This happens when you accidentally pass a numpy array or tensor to embedder.encode(). Always pass plain Python strings or a list of strings.
ImportError: cannot import name 'RagTokenizer' from 'transformers' if you are trying to use the old RagTokenizer/RagRetriever classes. Those were designed for the original DPR-based RAG model (facebook/rag-token-nq) and are not what you want for a custom RAG pipeline. Build your own retrieval + generation loop as shown above.
FAISS index returning wrong results. Check that you normalized your embeddings before adding them to IndexFlatIP. If you skip normalization, inner product does not equal cosine similarity, and documents whose embeddings happen to have larger magnitudes get artificially boosted.
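If you are not sure whether your stored vectors were normalized, there are two equivalent fixes, sketched here: normalize at encode time, or normalize the float32 matrix in place before adding it to the index.

```python
import numpy as np
import faiss

# Option 1: normalize at encode time.
embeddings = embedder.encode(documents, normalize_embeddings=True).astype("float32")

# Option 2: normalize an existing float32 matrix in place before index.add().
embeddings = np.asarray(embeddings, dtype="float32")
faiss.normalize_L2(embeddings)
```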
When to Use a Vector Database Instead
FAISS is great for prototyping and for datasets that fit in memory. Once you are past a few million documents, or you need persistence, filtering, or multi-tenancy, switch to a dedicated vector database.
- ChromaDB – easy setup, good for local development and small production loads
- Qdrant – strong filtering support, handles metadata queries well
- Pinecone – fully managed, no infrastructure to maintain
- Weaviate – hybrid search (vector + keyword) out of the box
The retrieval interface stays the same. You swap index.search() for the database client’s query method and everything downstream remains unchanged.
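As an example, a minimal sketch of the same retrieval step against ChromaDB, reusing the documents list from earlier. The path and collection name are placeholders, and client options may differ slightly across versions.

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")

# Index once; ChromaDB embeds the documents with its default embedding model
# unless you pass precomputed embeddings.
collection.add(
    ids=[str(i) for i in range(len(documents))],
    documents=documents,
)

# Query time: plays the same role as index.search(), with optional metadata filters.
results = collection.query(
    query_texts=["Which backend does Transformers v5 use?"],
    n_results=3,
)
retrieved_chunks = results["documents"][0]
```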
Production Considerations
Separate the embedding step from the serving path. Pre-compute and store your document embeddings in a persistent index. At query time, you only embed the user query (one inference call, sub-millisecond for MiniLM) and run the similarity search.
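FAISS indexes serialize to disk with a single call, so the offline indexing job and the serving process can share a file. A sketch; the path is arbitrary.

```python
import faiss

# Offline indexing job: build once, write to disk.
faiss.write_index(index, "docs.faiss")

# Serving process: load at startup, then only embed incoming queries.
index = faiss.read_index("docs.faiss")
```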
Cache your LLM pipeline instance. Loading a model from disk takes seconds; running inference takes milliseconds. Never reload the model per request.
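One simple way to do that is a cached loader, assuming the same Phi-3.5-mini model id used above.

```python
from functools import lru_cache

import torch
from transformers import pipeline

@lru_cache(maxsize=1)
def get_generator():
    # Runs once per process; every later call returns the same pipeline object.
    return pipeline(
        "text-generation",
        model="microsoft/Phi-3.5-mini-instruct",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
```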
Set a similarity threshold. If the top retrieved document scores below 0.3, the knowledge base probably does not contain the answer. Return “I don’t know” instead of forcing the model to generate from irrelevant context – that is when hallucinations sneak back in.
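In code, that is a small guard in front of generation. The 0.3 cutoff is a starting point to tune against your own corpus; this sketch reuses the retrieve() and rag_query() helpers from above.

```python
def answer(question, k=3, min_score=0.3):
    hits = retrieve(question, k=k)
    # hits are (chunk, cosine_score) pairs, best-first.
    if not hits or hits[0][1] < min_score:
        return "I don't know. The knowledge base does not cover this question."
    return rag_query(question, k=k)
```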
Monitor retrieval quality separately from generation quality. A bad answer might be the retriever’s fault (wrong documents) or the generator’s fault (right documents, wrong interpretation). Log the retrieved chunks alongside the final answer so you can debug which component failed.
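A lightweight way to get that visibility is one structured log record per request, with the retrieved chunks and scores next to the final answer. A sketch using only the standard library.

```python
import json
import logging

logger = logging.getLogger("rag")

def answer_and_log(question, k=3):
    hits = retrieve(question, k=k)
    answer = rag_query(question, k=k)
    # One record per request: query, truncated chunks, scores, and the answer.
    logger.info(json.dumps({
        "question": question,
        "retrieved": [{"chunk": c[:200], "score": round(s, 3)} for c, s in hits],
        "answer": answer,
    }))
    return answer
```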