The Core Idea
You have a pile of documents and need to find the ones most similar to a query. Keyword search fails when people phrase things differently. Embedding-based search fixes this by converting text into dense vectors, then finding the closest vectors using FAISS (Facebook AI Similarity Search).
The pipeline is straightforward: encode text with Sentence Transformers, index vectors with FAISS, query with a new embedding. The whole thing runs locally, no external API calls required.
Install Dependencies
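Both libraries install from PyPI:

```shell
pip install sentence-transformers faiss-cpu
```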
Use faiss-gpu instead of faiss-cpu if you have a CUDA-capable GPU. The Python API is the same, though to actually run on the GPU you move an index over explicitly with faiss.index_cpu_to_gpu.
Encode Text with Sentence Transformers
The all-MiniLM-L6-v2 model is the go-to for general-purpose embeddings. It produces 384-dimensional vectors, runs fast, and scores well on semantic similarity benchmarks. Unless you have domain-specific needs, start here.
Setting normalize_embeddings=True ensures all vectors have unit length. This matters because with normalized vectors, L2 distance and cosine similarity produce the same ranking, so you can use FAISS’s L2 index and still get cosine-based results.
Build a FAISS Index
Flat Index (Exact Search)
IndexFlatL2 does brute-force search. It checks every single vector against your query. For datasets under a million vectors, this is perfectly fine and gives you exact results.
Now query it:
search returns two arrays of shape (n_queries, k): the distances to the k nearest vectors and their corresponding integer indices, both ordered from closest to farthest.
The distances are squared L2 distances between normalized vectors (FAISS's L2 indices return squared distances, not true Euclidean distances). Lower means more similar. With unit-length vectors, the squared L2 distance ranges from 0 (identical) to 4 (opposite directions, though rare with real text).
IVF Index (Approximate Search for Larger Datasets)
When you have millions of vectors, brute-force becomes slow. IndexIVFFlat partitions the vector space into clusters (Voronoi cells) and only searches the nearest clusters at query time. This trades a small amount of accuracy for a big speedup.
For production workloads, set nlist to roughly sqrt(n) where n is your dataset size. With 1 million vectors, use nlist=1000. Increase nprobe for better recall at the cost of speed – nprobe=10 is a solid default for most cases.
Quantized Indices for Memory Efficiency
Full float32 vectors eat memory fast. A million 384-dimensional vectors take about 1.5 GB. Product quantization compresses vectors by splitting each one into sub-vectors and encoding each sub-vector with a short code. You lose some accuracy but cut memory usage dramatically.
With m=48 and nbits=8, each vector is compressed from 1536 bytes (384 * 4 bytes) down to 48 bytes. That is a 32x reduction. The distance values will be approximate, but ranking accuracy stays surprisingly good for most use cases.
A practical guideline: use IndexFlatL2 for datasets under 100K vectors, IndexIVFFlat for 100K-10M, and IndexIVFPQ for anything larger.
Save and Load Indices
FAISS makes persistence dead simple:
You still need to store the original documents separately. FAISS only stores and returns integer IDs. A common pattern is to keep a parallel list or a SQLite database that maps IDs back to document text.
Build a Semantic Search API
Here is a minimal FastAPI service that wraps the pipeline into an HTTP endpoint:
Run it with:
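Assuming the service lives in a file named app.py:

```shell
uvicorn app:app --reload
```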
Test it:
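Assuming a POST /search endpoint that accepts a JSON body with query and top_k fields:

```shell
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "how do embeddings work", "top_k": 3}'
```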
The index check for -1 matters. FAISS pads results with -1 when the searched partitions contain fewer than top_k vectors, which can happen with IVF indices on small datasets.
Batch Encoding for Large Corpora
When encoding thousands of documents, batch processing matters. Sentence Transformers handles batching internally, but you should control the batch size to avoid running out of GPU memory:
On CPU, encoding 50K documents with all-MiniLM-L6-v2 takes around 5 minutes. On a T4 GPU, under 30 seconds.
Common Errors and Fixes
RuntimeError: Error in void faiss::IndexIVF::train(...): nlist is too large for the training set
You set nlist higher than the number of training vectors. FAISS needs at least nlist vectors to train. Reduce nlist or add more training data. A safe rule: nlist should never exceed n / 39 where n is your training set size.
ValueError: could not broadcast input array from shape (N,768) into shape (N,384)
Your query embedding dimension does not match the index dimension. This happens when you encode the query with a different model than the one used to build the index. Always use the same model for both encoding and querying.
Index is not trained error when calling index.add()
IVF and PQ indices require .train() before .add(). Flat indices do not need training. Call index.train(training_data) first.
Search returns -1 indices
This happens with IVF indices when nprobe is too low or partitions are nearly empty. Increase nprobe or rebuild the index with fewer clusters.
Out of memory when encoding large datasets
Reduce batch_size in model.encode(). Start with 32 and increase until you hit your memory ceiling. Alternatively, encode in chunks and concatenate the NumPy arrays afterward:
FAISS GPU index cannot be saved directly
Move GPU indices to CPU before saving: cpu_index = faiss.index_gpu_to_cpu(gpu_index), then use faiss.write_index(cpu_index, "path.index").
Related Guides
- How to Build a Text Similarity API with Cross-Encoders
- How to Build a Named Entity Recognition Pipeline with spaCy and Transformers
- How to Build a Text-to-SQL Pipeline with LLMs
- How to Build a Spell Checking and Autocorrect Pipeline with Python
- How to Build a Sentiment Analysis API with Transformers and FastAPI
- How to Build a Text Correction and Grammar Checking Pipeline
- How to Build a Text Classification Pipeline with SetFit
- How to Build a Text Clustering Pipeline with Embeddings and HDBSCAN
- How to Build a Multilingual NLP Pipeline with Sentence Transformers
- How to Build a Document Chunking Strategy Comparison Pipeline