The Core Idea#
CLIP (Contrastive Language-Image Pre-Training) maps images and text into the same embedding space. That means you can compare an image to a text description – or an image to another image – using cosine similarity. Pair it with FAISS for fast nearest-neighbor lookup and you have a production-ready image search engine in under 100 lines of Python.
```shell
pip install transformers torch Pillow faiss-cpu
```
```python
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Encode an image
image = Image.open("photo.jpg")
inputs = processor(images=image, return_tensors="pt")
image_embedding = model.get_image_features(**inputs)

# Encode a text query
text_inputs = processor(text=["a dog playing in the snow"], return_tensors="pt")
text_embedding = model.get_text_features(**text_inputs)

print(image_embedding.shape)  # torch.Size([1, 512])
print(text_embedding.shape)   # torch.Size([1, 512])
```
Both embeddings live in a 512-dimensional space. You compare them with cosine similarity. That is the entire trick behind CLIP-based search.
Choosing the Right CLIP Model#
Not all CLIP checkpoints are equal. Here is what matters:
| Model | Dim | Speed | Quality | Best For |
|---|---|---|---|---|
| clip-vit-base-patch32 | 512 | Fast | Good | Prototyping, moderate datasets |
| clip-vit-base-patch16 | 512 | Medium | Better | Production with GPU |
| clip-vit-large-patch14 | 768 | Slow | Best | Maximum accuracy |
Use clip-vit-base-patch32 while building. Switch to clip-vit-large-patch14 when accuracy matters more than speed. The larger model is noticeably better at fine-grained distinctions – it can tell apart “golden retriever” from “labrador” where the base model sometimes cannot.
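One practical consequence of that table: the embedding dimension changes with the checkpoint (512 vs 768), so a FAISS index built for one model cannot be queried with another. A tiny sketch of keeping the two in sync (the helper and its names are hypothetical; only the model IDs are the real Hugging Face checkpoint names):

```python
# Hypothetical helper mapping a speed/quality preference to a checkpoint.
# The model IDs are the real Hugging Face names; everything else is illustrative.
CLIP_CHECKPOINTS: dict[str, tuple[str, int]] = {
    "fast": ("openai/clip-vit-base-patch32", 512),
    "balanced": ("openai/clip-vit-base-patch16", 512),
    "accurate": ("openai/clip-vit-large-patch14", 768),
}

def pick_checkpoint(preference: str = "fast") -> tuple[str, int]:
    """Return (model_id, embedding_dim); the dim must match your FAISS index."""
    return CLIP_CHECKPOINTS[preference]

model_id, dim = pick_checkpoint("accurate")
print(model_id, dim)  # openai/clip-vit-large-patch14 768
```

If you swap checkpoints, re-encode everything and rebuild the index with the new dimension; embeddings from different CLIP models are not comparable.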
Computing Cosine Similarity#
Cosine similarity measures how close two vectors point in the same direction. CLIP embeddings are not normalized by default, so you need to handle that yourself.
```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("beach.jpg")

# Encode image and multiple text queries
image_inputs = processor(images=image, return_tensors="pt")
text_inputs = processor(
    text=["a sandy beach", "a mountain trail", "a city street"],
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)
    text_features = model.get_text_features(**text_inputs)

# Normalize before computing similarity
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)

# Cosine similarity: dot product of normalized vectors
similarities = (image_features @ text_features.T).squeeze(0)

for text, score in zip(["a sandy beach", "a mountain trail", "a city street"], similarities):
    print(f"{text}: {score:.4f}")

# Example output:
# a sandy beach: 0.3124
# a mountain trail: 0.1856
# a city street: 0.1542
```
The highest score wins. Always normalize first – skipping this step gives you dot products that are not bounded between -1 and 1, making scores hard to interpret.
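To see why, here is a toy NumPy example (made-up 2-D vectors, not real CLIP embeddings): the raw dot product ranks a large but misaligned vector above a perfectly aligned one, while cosine similarity does not.

```python
import numpy as np

# Made-up 2-D vectors standing in for embeddings (not real CLIP output)
a = np.array([3.0, 4.0])    # our "image": magnitude 5
b = np.array([0.6, 0.8])    # same direction as a, magnitude 1
c = np.array([50.0, 0.0])   # different direction, large magnitude

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Raw dot products: the big misaligned vector "wins"
print(a @ b)         # ≈ 5.0
print(a @ c)         # ≈ 150.0
# Cosine similarity: direction wins, and scores stay in [-1, 1]
print(cosine(a, b))  # ≈ 1.0
print(cosine(a, c))  # ≈ 0.6
```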
Building a FAISS Index#
Comparing every query against every image is O(n). FAISS gives you approximate nearest-neighbor search that stays fast even with millions of images.
```python
import os
import numpy as np
import faiss
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def encode_images(image_paths: list[str]) -> np.ndarray:
    """Encode a list of image paths into normalized CLIP embeddings."""
    embeddings = []
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            features = model.get_image_features(**inputs)
        features = features / features.norm(dim=-1, keepdim=True)
        embeddings.append(features.squeeze(0).numpy())
    return np.array(embeddings).astype("float32")

# Gather all images from a directory
image_dir = "images/"
image_paths = [
    os.path.join(image_dir, f)
    for f in os.listdir(image_dir)
    if f.lower().endswith((".jpg", ".png", ".jpeg"))
]

# Build the index
embeddings = encode_images(image_paths)
dimension = embeddings.shape[1]  # 512
index = faiss.IndexFlatIP(dimension)  # Inner product = cosine sim on normalized vectors
index.add(embeddings)

# Save for later
faiss.write_index(index, "image_search.index")
print(f"Indexed {index.ntotal} images")
```
Use IndexFlatIP (inner product) instead of IndexFlatL2 (Euclidean distance). On normalized vectors, inner product equals cosine similarity, so the scores FAISS returns can be read directly as cosine scores. (Normalization does the real work here: on normalized vectors L2 and inner product even agree on ranking, since ||u − v||² = 2 − 2 u·v, but on unnormalized CLIP embeddings magnitude dominates and the ranking degrades.)
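If you want to convince yourself that the two metrics agree, here is a quick NumPy check (no FAISS required; the embeddings are random stand-ins for CLIP output):

```python
import numpy as np

# Random stand-ins for CLIP embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 512)).astype("float32")
query = rng.normal(size=(512,)).astype("float32")

# Cosine similarity computed directly
cos = emb @ query / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query))

# Inner product after normalizing both sides (what IndexFlatIP sees)
emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
ip = emb_n @ query_n

print(np.allclose(cos, ip, atol=1e-5))   # True
print(np.argmax(cos) == np.argmax(ip))   # True: same top result either way
```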
Querying by Text or Image#
Once the index is built, you can search with either a text string or a reference image.
```python
import faiss
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load the saved index and image paths
index = faiss.read_index("image_search.index")
# image_paths should be loaded in the same order used during indexing

def search_by_text(query: str, top_k: int = 5) -> list[tuple[int, float]]:
    """Search images by text description."""
    inputs = processor(text=[query], return_tensors="pt")
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    features = features / features.norm(dim=-1, keepdim=True)
    query_vec = features.squeeze(0).numpy().astype("float32").reshape(1, -1)
    scores, indices = index.search(query_vec, top_k)
    return list(zip(indices[0].tolist(), scores[0].tolist()))

def search_by_image(image_path: str, top_k: int = 5) -> list[tuple[int, float]]:
    """Search images by visual similarity to a reference image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    features = features / features.norm(dim=-1, keepdim=True)
    query_vec = features.squeeze(0).numpy().astype("float32").reshape(1, -1)
    scores, indices = index.search(query_vec, top_k)
    return list(zip(indices[0].tolist(), scores[0].tolist()))

# Text search
results = search_by_text("a red car parked on a street")
for idx, score in results:
    print(f"Image index {idx}: score {score:.4f}")

# Image search
results = search_by_image("query_photo.jpg")
for idx, score in results:
    print(f"Image index {idx}: score {score:.4f}")
```
Text-to-image search is where CLIP truly shines. You describe what you want in plain English and it finds matching images without any labels or metadata. Image-to-image search is useful for “find more like this” features.
Batch Processing for Large Datasets#
Encoding images one at a time is painfully slow. Process them in batches to take advantage of GPU parallelism.
```python
import os
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def encode_images_batched(
    image_paths: list[str], batch_size: int = 32
) -> np.ndarray:
    """Encode images in batches for much faster processing."""
    all_embeddings = []
    for i in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[i : i + batch_size]
        images = []
        for path in batch_paths:
            try:
                img = Image.open(path).convert("RGB")
                images.append(img)
            except Exception as e:
                print(f"Skipping {path}: {e}")
        if not images:
            continue
        inputs = processor(images=images, return_tensors="pt", padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            features = model.get_image_features(**inputs)
        features = features / features.norm(dim=-1, keepdim=True)
        all_embeddings.append(features.cpu().numpy())
        if (i // batch_size) % 10 == 0:
            print(f"Processed {i + len(images)}/{len(image_paths)} images")
    return np.concatenate(all_embeddings, axis=0).astype("float32")

# Process a large directory
image_dir = "dataset/"
image_paths = sorted([
    os.path.join(image_dir, f)
    for f in os.listdir(image_dir)
    if f.lower().endswith((".jpg", ".png", ".jpeg"))
])

embeddings = encode_images_batched(image_paths, batch_size=64)
print(f"Encoded {len(embeddings)} images, shape: {embeddings.shape}")
```
On a T4 GPU, batch size 64 encodes about 200 images per second with the base model. On CPU, expect 5-10 per second. For datasets over 100k images, batch processing is not optional – it is the difference between minutes and days.
Scaling the Index#
IndexFlatIP does exact search. It is fine up to about 100k images. Beyond that, use an approximate index.
```python
import faiss
import numpy as np

dimension = 512
num_images = 1_000_000

# For 100k-1M images: IVF index with clustering
nlist = 100  # Number of clusters
quantizer = faiss.IndexFlatIP(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)

# Must train on a sample of your data first
# (`embeddings` is the normalized float32 array built in the previous section)
training_data = embeddings[:50_000]  # Use a representative sample
index.train(training_data)
index.add(embeddings)

# Control speed vs accuracy tradeoff
index.nprobe = 10  # Search 10 of 100 clusters (default is 1)
```
Set nprobe based on your latency budget. nprobe=1 is fastest but misses results. nprobe=10 catches most relevant results. nprobe=50 is nearly exact but slower. Start at 10 and adjust.
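If you want intuition for what nprobe is actually doing, here is a small NumPy emulation of IVF search (illustrative only: real FAISS uses k-means centroids and optimized C++; here the centroids are just random data points):

```python
import numpy as np

# Toy IVF emulation: assign vectors to clusters, then search only the
# `nprobe` clusters whose centroids are closest to the query.
rng = np.random.default_rng(42)
dim, n, nlist = 32, 2000, 20
data = rng.normal(size=(n, dim)).astype("float32")
data /= np.linalg.norm(data, axis=1, keepdims=True)

# "Training" stand-in: pick random data points as centroids
centroids = data[rng.choice(n, nlist, replace=False)]
assign = np.argmax(data @ centroids.T, axis=1)  # each vector's cluster

query = data[0]                              # query whose true best match (itself) is known
true_top = np.argsort(-(data @ query))[:10]  # exact top-10

def ivf_search(nprobe: int, k: int = 10) -> np.ndarray:
    probe = np.argsort(-(centroids @ query))[:nprobe]  # closest clusters
    cand = np.flatnonzero(np.isin(assign, probe))      # vectors in those clusters
    scores = data[cand] @ query
    return cand[np.argsort(-scores)[:k]]

for nprobe in (1, 5, 20):
    recall = len(set(ivf_search(nprobe)) & set(true_top)) / 10
    print(f"nprobe={nprobe}: recall@10 = {recall:.1f}")
# With nprobe == nlist every cluster is scanned, so the search is exact.
```

Low nprobe only misses neighbors that happen to sit in unprobed clusters, which is exactly the speed/recall dial the prose above describes.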
Common Errors#
RuntimeError: expected scalar type Float but found Half
This happens when running on GPU with mixed precision. FAISS expects float32 arrays. Fix it by casting explicitly:
```python
embeddings = features.cpu().float().numpy().astype("float32")
```
PIL.UnidentifiedImageError: cannot identify image file
Corrupted or non-image files in your dataset. Wrap Image.open() in a try/except block (as shown in the batch processing example) and skip bad files.
index.search() returns -1 indices
This means the index is empty, not trained, or you asked for more results than it contains. For IVF indices, you must call index.train(data) before index.add(data) – check index.is_trained if you are unsure. For IndexFlatIP, make sure you actually added vectors.
CLIP returns identical scores for different queries
You are probably not normalizing embeddings. Without normalization, inner product scores are dominated by vector magnitude, not direction. Always normalize before adding to the index and before querying.
torch.cuda.OutOfMemoryError during batch encoding
Reduce your batch size. Start at 32 and halve it until it fits. Also make sure you are using torch.no_grad() – without it, PyTorch stores intermediate activations for backpropagation you do not need.