What RAG Actually Solves

LLMs hallucinate. They confidently make up facts, cite nonexistent papers, and fabricate API methods. Retrieval-augmented generation (RAG) mitigates this by feeding the model relevant documents before it generates a response. Instead of relying on what the model memorized during training, you give it the actual source material and tell it to answer from that material alone.

Transformers v5 shipped in December 2025 with a cleaner pipeline API, PyTorch as the sole backend (TensorFlow and JAX are gone), and first-class quantization support. These changes make building a local RAG pipeline simpler than it used to be.

Here is the full stack: embed your documents with a sentence-transformer model, index them with FAISS for fast similarity search, retrieve the top matches for a user query, then feed those chunks into an LLM to generate a grounded answer.

Install the Dependencies

pip install "transformers>=5.0" sentence-transformers faiss-cpu torch accelerate

Transformers v5 requires Python 3.10+. If you are still on 3.9, you will hit this immediately:

ERROR: Package 'transformers' requires a different Python: 3.9.7 not in '>=3.10'

Upgrade Python first. On Ubuntu: sudo apt install python3.12 python3.12-venv.

Build the Document Index

Start by embedding your documents and storing them in a FAISS index. This is the retrieval half of RAG.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Load an embedding model — all-MiniLM-L6-v2 is fast and good enough for most RAG use cases.
# For higher accuracy, try all-mpnet-base-v2 (slower, 768 dims instead of 384).
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Your knowledge base. In production, these come from chunked PDFs, docs, or database rows.
documents = [
    "FAISS supports both CPU and GPU indexes. Use faiss-gpu for NVIDIA hardware.",
    "Transformers v5 removed TensorFlow and JAX backends. Only PyTorch is supported.",
    "The apply_chat_template method in v5 now returns a BatchEncoding with input_ids and attention_mask.",
    "Chunking documents into 256-512 token segments works best for retrieval accuracy.",
    "Use HF_HOME instead of TRANSFORMERS_CACHE — the old environment variable was removed in v5.",
    "Quantization is a first-class feature in Transformers v5, supporting 4-bit and 8-bit formats.",
    "The sentence-transformers library provides pre-trained models optimized for semantic similarity.",
    "RAG reduces hallucination by grounding generation in retrieved source documents.",
]

# Encode all documents into dense vectors
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

# Build a FAISS index using inner product (cosine similarity on normalized vectors)
dimension = doc_embeddings.shape[1]  # 384 for MiniLM
index = faiss.IndexFlatIP(dimension)
index.add(doc_embeddings.astype(np.float32))

print(f"Indexed {index.ntotal} documents with {dimension}-dim embeddings")
# Indexed 8 documents with 384-dim embeddings

A few things to note. IndexFlatIP does exact inner-product search. Because the embeddings are L2-normalized (normalize_embeddings=True), inner product equals cosine similarity. For datasets under a million documents, exact search is fast enough. Beyond that, switch to IndexIVFFlat or IndexHNSWFlat for approximate nearest neighbors.
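
If you do cross that threshold, swapping the index type is a small change. Here is a minimal sketch using HNSW; the construction parameters (32 graph neighbors, efSearch of 64) are illustrative starting points, not tuned values:

# Approximate nearest-neighbor index for larger corpora (sketch).
# Inner-product metric keeps the cosine-similarity setup from the exact index.
ann_index = faiss.IndexHNSWFlat(dimension, 32, faiss.METRIC_INNER_PRODUCT)
ann_index.hnsw.efSearch = 64  # higher values trade query speed for recall
ann_index.add(doc_embeddings.astype(np.float32))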

Retrieve and Generate

Now wire the retriever to a text-generation model. This is where Transformers v5 comes in.

import torch
from transformers import pipeline

def retrieve(query: str, top_k: int = 3) -> list[str]:
    """Find the most relevant documents for a query."""
    query_embedding = embedder.encode([query], normalize_embeddings=True)
    scores, indices = index.search(query_embedding.astype(np.float32), top_k)
    return [documents[i] for i in indices[0]]


def rag_answer(question: str, top_k: int = 3) -> str:
    """Retrieve context, then generate a grounded answer."""
    # Step 1: Retrieve relevant chunks
    context_docs = retrieve(question, top_k=top_k)
    context = "\n".join(f"- {doc}" for doc in context_docs)

    # Step 2: Build the prompt with retrieved context
    prompt = f"""Answer the question based only on the provided context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {question}
Answer:"""

    # Step 3: Generate with a local model
    generator = pipeline(
        "text-generation",
        model="microsoft/Phi-3.5-mini-instruct",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    output = generator(
        prompt,
        max_new_tokens=256,
        do_sample=False,
        return_full_text=False,
    )

    return output[0]["generated_text"].strip()


# Try it
answer = rag_answer("What changed about the cache environment variable in Transformers v5?")
print(answer)
# Use HF_HOME instead of TRANSFORMERS_CACHE. The TRANSFORMERS_CACHE environment
# variable was removed in Transformers v5.

The device_map="auto" flag lets Accelerate distribute the model across available GPUs, or fall back to CPU if none are present. The torch_dtype=torch.bfloat16 halves memory usage with negligible quality loss on modern hardware.

You should create the generator pipeline once and reuse it across calls. Instantiating it inside the function like this is fine for a tutorial, but in production you would load the model at startup.
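
A minimal sketch of that restructuring, with the pipeline created once at startup (rag_answer_cached is a hypothetical name, not part of the code above):

# Load the generation pipeline once and reuse it for every request.
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3.5-mini-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def rag_answer_cached(question: str, top_k: int = 3) -> str:
    """Same flow as rag_answer, but without reloading the model per call."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, top_k=top_k))
    prompt = (
        "Answer the question based only on the provided context.\n"
        "If the context doesn't contain the answer, say \"I don't have enough information.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    output = generator(prompt, max_new_tokens=256, do_sample=False, return_full_text=False)
    return output[0]["generated_text"].strip()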

Picking the Right Chunk Size

Chunking is one of the biggest levers on retrieval quality. Too large and you dilute the relevant signal with noise. Too small and you split key information across chunks, making it impossible to retrieve as a unit.

For most use cases, aim for 256 to 512 tokens per chunk with 50 to 100 tokens of overlap between consecutive chunks. Here is a simple chunker:

def chunk_text(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into overlapping chunks by word count."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap
    return chunks
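
As a usage sketch, here is how chunks feed into the index built earlier (handbook.txt is a placeholder for whatever source you are ingesting):

# Chunk a long document, embed the chunks, and add them to the existing index.
with open("handbook.txt") as f:  # placeholder source file
    chunks = chunk_text(f.read())

chunk_embeddings = embedder.encode(chunks, normalize_embeddings=True)
index.add(chunk_embeddings.astype(np.float32))
documents.extend(chunks)  # keep the position -> text mapping aligned with the index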

If your retrieval scores are all bunched together (everything above 0.7 or everything below 0.4), your chunks are probably the wrong size. Experiment with different values and check the actual similarity scores.
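
To see those scores, return them from the retriever instead of discarding them. A small sketch (retrieve_with_scores is a hypothetical variant of the retrieve function above):

def retrieve_with_scores(query: str, top_k: int = 3) -> list[tuple[float, str]]:
    """Return (similarity, document) pairs so you can inspect the score spread."""
    query_embedding = embedder.encode([query], normalize_embeddings=True)
    scores, indices = index.search(query_embedding.astype(np.float32), top_k)
    return [(float(s), documents[i]) for s, i in zip(scores[0], indices[0])]

for score, doc in retrieve_with_scores("How does RAG reduce hallucination?"):
    print(f"{score:.3f}  {doc[:60]}")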

Errors You Will Actually Hit

RuntimeError: CUDA out of memory when loading the generation model. Phi-3.5-mini needs around 7 GB of VRAM in bfloat16. If you are on a smaller GPU, use 4-bit quantization (this path also requires the bitsandbytes package, which is not in the install line above: pip install bitsandbytes):

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3.5-mini-instruct",
    device_map="auto",
    model_kwargs={"quantization_config": quantization_config},
)

This drops memory usage to around 2.5 GB with minimal quality loss for short-form Q&A tasks.

ValueError: text input must be of type str (single example), List[str] (batch) from sentence-transformers. This happens when you accidentally pass a numpy array or tensor to embedder.encode(). Always pass plain Python strings or a list of strings.

ImportError: cannot import name 'RagTokenizer' from 'transformers' if you are trying to use the old RagTokenizer/RagRetriever classes. Those were designed for the original DPR-based RAG model (facebook/rag-token-nq) and are not what you want for a custom RAG pipeline. Build your own retrieval + generation loop as shown above.

FAISS index returning wrong results. Check that you normalized your embeddings before adding them to IndexFlatIP. If you skip normalization, inner product no longer equals cosine similarity, and documents whose embeddings happen to have larger norms get artificially boosted regardless of relevance.
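
A quick sanity check, as a sketch: normalized embeddings should all have unit length before they go into the index.

norms = np.linalg.norm(doc_embeddings, axis=1)
print(norms.min(), norms.max())  # both should be ~1.0 when normalize_embeddings=True is set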

When to Use a Vector Database Instead

FAISS is great for prototyping and for datasets that fit in memory. Once you are past a few million documents, or you need persistence, filtering, or multi-tenancy, switch to a dedicated vector database.

  • ChromaDB – easy setup, good for local development and small production loads
  • Qdrant – strong filtering support, handles metadata queries well
  • Pinecone – fully managed, no infrastructure to maintain
  • Weaviate – hybrid search (vector + keyword) out of the box

The retrieval interface stays the same. You swap index.search() for the database client’s query method and everything downstream remains unchanged.
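
For example, the ChromaDB equivalent of the FAISS setup above looks roughly like this (a sketch, assuming chromadb is installed; the collection name is arbitrary):

import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist
collection = client.create_collection(name="docs")

# Reuse the embeddings computed earlier so retrieval behaves like the FAISS index
collection.add(
    ids=[str(i) for i in range(len(documents))],
    documents=documents,
    embeddings=doc_embeddings.tolist(),
)

query_embedding = embedder.encode(["What backends does Transformers v5 support?"], normalize_embeddings=True)
results = collection.query(query_embeddings=query_embedding.tolist(), n_results=3)
print(results["documents"][0])  # the retrieved chunks, same role as index.search()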

Production Considerations

Separate the embedding step from the serving path. Pre-compute and store your document embeddings in a persistent index. At query time, you only embed the user query (one inference call, a few milliseconds for MiniLM) and run the similarity search.
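
With FAISS, persisting and reloading the index is a one-liner in each direction. A sketch (file names are arbitrary):

import json

# At index-build time
faiss.write_index(index, "docs.index")
with open("docs.json", "w") as f:
    json.dump(documents, f)

# At serving-process startup
index = faiss.read_index("docs.index")
with open("docs.json") as f:
    documents = json.load(f)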

Cache your LLM pipeline instance. Loading a model from disk takes far longer than a single generation call. Never reload the model per request.

Set a similarity threshold. If the top retrieved document scores below 0.3, the knowledge base probably does not contain the answer. Return “I don’t know” instead of forcing the model to generate from irrelevant context – that is when hallucinations sneak back in.
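
A sketch of that guard (answer_or_refuse is a hypothetical wrapper; 0.3 is the cutoff suggested above and worth tuning on your own data):

SIMILARITY_THRESHOLD = 0.3

def answer_or_refuse(question: str, top_k: int = 3) -> str:
    """Refuse up front when even the best match is too weak to ground an answer."""
    query_embedding = embedder.encode([question], normalize_embeddings=True)
    scores, _ = index.search(query_embedding.astype(np.float32), top_k)
    if scores[0][0] < SIMILARITY_THRESHOLD:
        return "I don't know."
    return rag_answer(question, top_k=top_k)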

Monitor retrieval quality separately from generation quality. A bad answer might be the retriever’s fault (wrong documents) or the generator’s fault (right documents, wrong interpretation). Log the retrieved chunks alongside the final answer so you can debug which component failed.
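
One lightweight way to do that is a structured log line per request, written after generation (a sketch; the helper and its field names are illustrative):

import json
import logging

logging.basicConfig(level=logging.INFO)

def log_rag_call(question: str, context_docs: list[str], answer: str) -> None:
    """Emit one JSON log line per request so retrieval and generation can be debugged separately."""
    logging.info(json.dumps({
        "question": question,
        "retrieved_chunks": context_docs,
        "answer": answer,
    }))

Call it from rag_answer just before returning, passing the retrieved chunks and the generated answer.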