The Core Idea

You have a pile of documents and need to find the ones most similar to a query. Keyword search fails when people phrase things differently. Embedding-based search fixes this by converting text into dense vectors, then finding the closest vectors using FAISS (Facebook AI Similarity Search).

The pipeline is straightforward: encode text with Sentence Transformers, index the vectors with FAISS, and query with a new embedding. The whole thing runs locally; no external API calls required.

Install Dependencies

pip install sentence-transformers faiss-cpu numpy

Use faiss-gpu instead of faiss-cpu if you have a CUDA-capable GPU (note that the officially supported GPU builds are distributed through conda). The API is identical – FAISS handles the backend switch transparently.

Encode Text with Sentence Transformers

The all-MiniLM-L6-v2 model is the go-to for general-purpose embeddings. It produces 384-dimensional vectors, runs fast, and scores well on semantic similarity benchmarks. Unless you have domain-specific needs, start here.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "FAISS is a library for efficient similarity search developed by Meta.",
    "Sentence Transformers produce dense vector representations of text.",
    "PostgreSQL supports full-text search with tsvector and tsquery.",
    "Vector databases store embeddings for fast nearest neighbor lookup.",
    "Redis can be used as a message broker with pub/sub patterns.",
    "Cosine similarity measures the angle between two vectors.",
    "Kubernetes orchestrates containerized applications across clusters.",
    "Retrieval-augmented generation combines search with language models.",
    "The inverted file index partitions vectors into Voronoi cells for faster search.",
    "Batch processing with Apache Spark handles large-scale data transformations.",
]

embeddings = model.encode(documents, convert_to_numpy=True, normalize_embeddings=True)
print(f"Embeddings shape: {embeddings.shape}")
# Embeddings shape: (10, 384)

Setting normalize_embeddings=True ensures all vectors have unit length. This matters because with normalized vectors, L2 distance and cosine similarity produce the same ranking, so you can use FAISS’s L2 index and still get cosine-based results.
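To see the equivalence concretely, here is a quick standalone check (random vectors, not tied to the model above) showing that for unit vectors the squared L2 distance equals 2 minus 2 times the cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 384))
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(a @ b)
squared_l2 = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b),
# so ranking by ascending L2 matches ranking by descending cosine
print(np.isclose(squared_l2, 2 - 2 * cosine))  # True
```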

Build a FAISS Index

IndexFlatL2 does brute-force search. It checks every single vector against your query. For datasets under a million vectors, this is perfectly fine and gives you exact results.

import faiss

dimension = embeddings.shape[1]  # 384

index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

print(f"Vectors in index: {index.ntotal}")
# Vectors in index: 10

Now query it:

query = "How does vector similarity search work?"
query_embedding = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)

k = 3  # number of nearest neighbors
distances, indices = index.search(query_embedding, k)

print("Top results:")
for rank, (idx, dist) in enumerate(zip(indices[0], distances[0])):
    print(f"  {rank + 1}. (dist={dist:.4f}) {documents[idx]}")

Output:

Top results:
  1. (dist=0.5765) Vector databases store embeddings for fast nearest neighbor lookup.
  2. (dist=0.7842) Cosine similarity measures the angle between two vectors.
  3. (dist=0.8901) FAISS is a library for efficient similarity search developed by Meta.

The distances FAISS returns from IndexFlatL2 are squared L2 distances between the normalized vectors. Lower means more similar. Between unit vectors, the squared L2 distance ranges from 0 (identical) to 4 (opposite directions, though rare with real text), and equals 2 minus 2 times the cosine similarity.

IVF Index (Approximate Search for Larger Datasets)

When you have millions of vectors, brute-force becomes slow. IndexIVFFlat partitions the vector space into clusters (Voronoi cells) and only searches the nearest clusters at query time. This trades a small amount of accuracy for a big speedup.

nlist = 4       # number of clusters (use sqrt(n) as a starting point)
nprobe = 2      # number of clusters to search at query time

quantizer = faiss.IndexFlatL2(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# IVF indices need training on representative data
index_ivf.train(embeddings)
index_ivf.add(embeddings)

index_ivf.nprobe = nprobe

distances, indices = index_ivf.search(query_embedding, k)

print("IVF results:")
for rank, (idx, dist) in enumerate(zip(indices[0], distances[0])):
    print(f"  {rank + 1}. (dist={dist:.4f}) {documents[idx]}")

For production workloads, set nlist to roughly sqrt(n) where n is your dataset size. With 1 million vectors, use nlist=1000. Increase nprobe for better recall at the cost of speed – nprobe=10 is a solid default for most cases.

Quantized Indices for Memory Efficiency

Full float32 vectors eat memory fast. A million 384-dimensional vectors take about 1.5 GB. Product quantization compresses vectors by splitting each one into sub-vectors and encoding each sub-vector with a short code. You lose some accuracy but cut memory usage dramatically.

m = 48          # number of sub-quantizers (must divide dimension evenly: 384 / 48 = 8)
nbits = 8       # bits per sub-quantizer code

quantizer = faiss.IndexFlatL2(dimension)
index_pq = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)

# PQ training runs k-means with 2**nbits (= 256) centroids per sub-quantizer,
# so it needs at least 256 training vectors; the 10-document toy corpus
# is too small for this step -- train on your real corpus (or a large sample)
index_pq.train(embeddings)
index_pq.add(embeddings)

index_pq.nprobe = nprobe

distances, indices = index_pq.search(query_embedding, k)

print("PQ results:")
for rank, (idx, dist) in enumerate(zip(indices[0], distances[0])):
    print(f"  {rank + 1}. (dist={dist:.4f}) {documents[idx]}")

With m=48 and nbits=8, each vector is compressed from 1536 bytes (384 * 4 bytes) down to 48 bytes. That is a 32x reduction. Keep in mind that PQ training needs at least 2^nbits = 256 vectors, so this snippet only makes sense on a corpus larger than the 10-document example. The distance values will be approximate, but ranking accuracy stays surprisingly good for most use cases.

A practical guideline: use IndexFlatL2 for datasets under 100K vectors, IndexIVFFlat for 100K-10M, and IndexIVFPQ for anything larger.

Save and Load Indices

FAISS makes persistence dead simple:

# Save
faiss.write_index(index, "documents.index")

# Load
loaded_index = faiss.read_index("documents.index")
distances, indices = loaded_index.search(query_embedding, k)

You still need to store the original documents separately. FAISS only stores and returns integer IDs. A common pattern is to keep a parallel list or a SQLite database that maps IDs back to document text.

import json

# Save document mapping alongside the index
doc_map = {i: doc for i, doc in enumerate(documents)}
with open("doc_map.json", "w") as f:
    json.dump(doc_map, f)

# Load and use together
with open("doc_map.json", "r") as f:
    doc_map = json.load(f)

distances, indices = loaded_index.search(query_embedding, k)
for idx in indices[0]:
    print(doc_map[str(idx)])

Build a Semantic Search API

Here is a minimal FastAPI service that wraps the pipeline into an HTTP endpoint:

from contextlib import asynccontextmanager
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
import faiss
import json

search_state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    search_state["model"] = SentenceTransformer("all-MiniLM-L6-v2")
    search_state["index"] = faiss.read_index("documents.index")
    with open("doc_map.json", "r") as f:
        search_state["doc_map"] = json.load(f)
    yield
    search_state.clear()

app = FastAPI(lifespan=lifespan)

class SearchRequest(BaseModel):
    query: str
    top_k: int = 5

class SearchResult(BaseModel):
    document: str
    score: float

@app.post("/search")
def search(req: SearchRequest) -> list[SearchResult]:
    model = search_state["model"]
    index = search_state["index"]
    doc_map = search_state["doc_map"]

    query_vec = model.encode([req.query], convert_to_numpy=True, normalize_embeddings=True)
    distances, indices = index.search(query_vec, req.top_k)

    results = []
    for idx, dist in zip(indices[0], distances[0]):
        if idx == -1:
            continue
        results.append(SearchResult(document=doc_map[str(idx)], score=float(dist)))
    return results

Run it with:

uvicorn search_api:app --host 0.0.0.0 --port 8000

Test it:

curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "how do vector databases work", "top_k": 3}'

The index check for -1 matters. FAISS pads results with -1 when it finds fewer than top_k neighbors in the partitions it searched, which can happen with IVF indices on small datasets or a low nprobe.

Batch Encoding for Large Corpora

When encoding thousands of documents, batch processing matters. Sentence Transformers handles batching internally, but you should control the batch size to avoid running out of GPU memory:

large_corpus = ["document text..."] * 50000  # your actual documents

# Encode in controlled batches
all_embeddings = model.encode(
    large_corpus,
    batch_size=256,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True,
)

# Build IVF index for large corpus
nlist = int(np.sqrt(len(large_corpus)))
quantizer = faiss.IndexFlatL2(dimension)
index_large = faiss.IndexIVFFlat(quantizer, dimension, nlist)
index_large.train(all_embeddings)
index_large.add(all_embeddings)

On CPU, encoding 50K documents with all-MiniLM-L6-v2 takes around 5 minutes. On a T4 GPU, under 30 seconds.

Common Errors and Fixes

RuntimeError: Error in void faiss::IndexIVF::train(...): nlist is too large for the training set

You set nlist higher than the number of training vectors; FAISS needs at least nlist vectors to run k-means. Reduce nlist or add more training data. FAISS also warns (without failing) when a centroid gets fewer than 39 training points, so a safe rule is to keep nlist at or below n / 39, where n is your training set size.

ValueError: could not broadcast input array from shape (N,768) into shape (N,384)

Your query embedding dimension does not match the index dimension. This happens when you encode the query with a different model than the one used to build the index. Always use the same model for both encoding and querying.

Index is not trained error when calling index.add()

IVF and PQ indices require .train() before .add(). Flat indices do not need training. Call index.train(training_data) first.

Search returns -1 indices

This happens with IVF indices when nprobe is too low or partitions are nearly empty. Increase nprobe or rebuild the index with fewer clusters.

Out of memory when encoding large datasets

Reduce batch_size in model.encode(). Start with 32 and increase until you hit your memory ceiling. Alternatively, encode in chunks and concatenate the NumPy arrays afterward:

chunk_size = 10000
all_chunks = []
for i in range(0, len(large_corpus), chunk_size):
    chunk = model.encode(
        large_corpus[i:i + chunk_size],
        convert_to_numpy=True,
        normalize_embeddings=True,
    )
    all_chunks.append(chunk)

all_embeddings = np.concatenate(all_chunks, axis=0)

FAISS GPU index cannot be saved directly

Move GPU indices to CPU before saving: cpu_index = faiss.index_gpu_to_cpu(gpu_index), then use faiss.write_index(cpu_index, "path.index").