The Core Idea

CLIP (Contrastive Language-Image Pre-Training) maps images and text into the same embedding space. That means you can compare an image to a text description – or an image to another image – using cosine similarity. Pair it with FAISS for fast nearest-neighbor lookup and you have a production-ready image search engine in under 100 lines of Python.

pip install transformers torch Pillow faiss-cpu
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Encode an image
image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_embedding = model.get_image_features(**inputs)

# Encode a text query
text_inputs = processor(text=["a dog playing in the snow"], return_tensors="pt")
with torch.no_grad():
    text_embedding = model.get_text_features(**text_inputs)

print(image_embedding.shape)  # torch.Size([1, 512])
print(text_embedding.shape)   # torch.Size([1, 512])

Both embeddings live in a 512-dimensional space. You compare them with cosine similarity. That is the entire trick behind CLIP-based search.

Choosing the Right CLIP Model

Not all CLIP checkpoints are equal. Here is what matters:

Model                   Dim  Speed   Quality  Best For
clip-vit-base-patch32   512  Fast    Good     Prototyping, moderate datasets
clip-vit-base-patch16   512  Medium  Better   Production with GPU
clip-vit-large-patch14  768  Slow    Best     Maximum accuracy

Use clip-vit-base-patch32 while building. Switch to clip-vit-large-patch14 when accuracy matters more than speed. The larger model is noticeably better at fine-grained distinctions – it can distinguish “golden retriever” from “labrador” where the base model sometimes cannot.
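One practical gotcha when swapping checkpoints: the embedding dimension changes (512 for the base models, 768 for the large one), so a FAISS index built with one model cannot serve queries from another. A minimal sanity check – the MODEL_DIMS table below just restates the dimensions listed above:

```python
# Embedding dimension per checkpoint (from the table above)
MODEL_DIMS = {
    "openai/clip-vit-base-patch32": 512,
    "openai/clip-vit-base-patch16": 512,
    "openai/clip-vit-large-patch14": 768,
}

def index_compatible(checkpoint: str, index_dim: int) -> bool:
    """True if embeddings from this checkpoint fit an existing index."""
    return MODEL_DIMS.get(checkpoint) == index_dim

print(index_compatible("openai/clip-vit-base-patch16", 512))   # True
print(index_compatible("openai/clip-vit-large-patch14", 512))  # False: rebuild the index
```

Even when dimensions match (patch32 vs patch16), the two models produce different embedding spaces, so you should still re-encode the whole collection and rebuild the index when switching models.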

Computing Cosine Similarity

Cosine similarity measures how closely two vectors point in the same direction. CLIP embeddings are not normalized by default, so you need to handle that yourself.

import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("beach.jpg")

# Encode image and multiple text queries
image_inputs = processor(images=image, return_tensors="pt")
text_inputs = processor(
    text=["a sandy beach", "a mountain trail", "a city street"],
    return_tensors="pt",
    padding=True
)

with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)
    text_features = model.get_text_features(**text_inputs)

# Normalize before computing similarity
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)

# Cosine similarity: dot product of normalized vectors
similarities = (image_features @ text_features.T).squeeze(0)

for text, score in zip(["a sandy beach", "a mountain trail", "a city street"], similarities.tolist()):
    print(f"{text}: {score:.4f}")

# Example output:
# a sandy beach: 0.3124
# a mountain trail: 0.1856
# a city street: 0.1542

The highest score wins. Always normalize first – skipping this step gives you dot products that are not bounded between -1 and 1, making scores hard to interpret.
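To see why normalization matters, here is a tiny standalone demonstration with plain NumPy and illustrative vectors: a large off-direction vector beats a small well-aligned one on raw dot product, and the ordering flips once you normalize.

```python
import numpy as np

query = np.array([1.0, 0.0])
aligned = np.array([0.9, 0.1])   # small magnitude, nearly the same direction
big_off = np.array([5.0, 5.0])   # large magnitude, 45 degrees away

# Raw dot products: magnitude dominates
print(query @ aligned)   # 0.9
print(query @ big_off)   # 5.0  <- wrongly "wins"

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Cosine similarity: direction decides
print(normalize(query) @ normalize(aligned))  # ~0.994
print(normalize(query) @ normalize(big_off))  # ~0.707
```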

Building a FAISS Index

Comparing a query against every stored embedding is O(n) per search. FAISS gives you fast exact search for moderate collections and approximate nearest-neighbor search that stays fast even with millions of images.

import os
import numpy as np
import faiss
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def encode_images(image_paths: list[str]) -> np.ndarray:
    """Encode a list of image paths into normalized CLIP embeddings."""
    embeddings = []
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            features = model.get_image_features(**inputs)
        features = features / features.norm(dim=-1, keepdim=True)
        embeddings.append(features.squeeze(0).numpy())
    return np.array(embeddings).astype("float32")

# Gather all images from a directory
image_dir = "images/"
image_paths = [
    os.path.join(image_dir, f)
    for f in os.listdir(image_dir)
    if f.lower().endswith((".jpg", ".png", ".jpeg"))
]

# Build the index
embeddings = encode_images(image_paths)
dimension = embeddings.shape[1]  # 512

index = faiss.IndexFlatIP(dimension)  # Inner product = cosine sim on normalized vectors
index.add(embeddings)

# Save for later
faiss.write_index(index, "image_search.index")
print(f"Indexed {index.ntotal} images")

Use IndexFlatIP (inner product) rather than IndexFlatL2 (Euclidean distance). On normalized vectors, inner product equals cosine similarity. Note that on normalized vectors L2 actually produces the same ranking, since ||a - b||² = 2 - 2(a·b) decreases exactly as the inner product increases; the practical advantage of IP is that its scores read directly as cosine similarities, which makes thresholding and debugging much easier.
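You can check numerically how inner product and L2 distance relate on unit vectors; this sketch uses random NumPy vectors as stand-ins for CLIP embeddings:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=512)
b = rng.normal(size=512)
a /= np.linalg.norm(a)  # make both unit-length
b /= np.linalg.norm(b)

l2_squared = np.sum((a - b) ** 2)
inner = a @ b

# On unit vectors: ||a - b||^2 = 2 - 2 * (a . b)
print(abs(l2_squared - (2 - 2 * inner)))  # ~0, up to floating-point noise
```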

Querying by Text or Image

Once the index is built, you can search with either a text string or a reference image.

import numpy as np
import faiss
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load the saved index and image paths
index = faiss.read_index("image_search.index")
# image_paths should be loaded in the same order used during indexing

def search_by_text(query: str, top_k: int = 5) -> list[tuple[int, float]]:
    """Search images by text description."""
    inputs = processor(text=[query], return_tensors="pt")
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    features = features / features.norm(dim=-1, keepdim=True)
    query_vec = features.squeeze(0).numpy().astype("float32").reshape(1, -1)

    scores, indices = index.search(query_vec, top_k)
    return list(zip(indices[0].tolist(), scores[0].tolist()))

def search_by_image(image_path: str, top_k: int = 5) -> list[tuple[int, float]]:
    """Search images by visual similarity to a reference image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    features = features / features.norm(dim=-1, keepdim=True)
    query_vec = features.squeeze(0).numpy().astype("float32").reshape(1, -1)

    scores, indices = index.search(query_vec, top_k)
    return list(zip(indices[0].tolist(), scores[0].tolist()))

# Text search
results = search_by_text("a red car parked on a street")
for idx, score in results:
    print(f"Image index {idx}: score {score:.4f}")

# Image search
results = search_by_image("query_photo.jpg")
for idx, score in results:
    print(f"Image index {idx}: score {score:.4f}")

Text-to-image search is where CLIP truly shines. You describe what you want in plain English and it finds matching images without any labels or metadata. Image-to-image search is useful for “find more like this” features.
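The search functions return row indices into the index, so you need the image-path list persisted in the same order it was indexed. A minimal sketch using JSON – the filenames and result pairs here are illustrative:

```python
import json

# At indexing time: save the path list next to the FAISS index file
image_paths = ["images/beach.jpg", "images/car.jpg", "images/dog.jpg"]
with open("image_paths.json", "w") as f:
    json.dump(image_paths, f)

# At query time: load it and map FAISS row indices back to filenames
with open("image_paths.json") as f:
    paths = json.load(f)

results = [(1, 0.3124), (2, 0.1856)]  # (index, score) pairs as returned by search_by_text
for idx, score in results:
    print(f"{paths[idx]}: {score:.4f}")
```

If the path list and the index ever get out of sync (say, after re-indexing a changed directory), every result will point at the wrong file, so always write both artifacts together.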

Batch Processing for Large Datasets

Encoding images one at a time is painfully slow. Process them in batches to take advantage of GPU parallelism.

import os
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def encode_images_batched(
    image_paths: list[str], batch_size: int = 32
) -> np.ndarray:
    """Encode images in batches for much faster processing."""
    all_embeddings = []

    for i in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[i : i + batch_size]
        images = []
        for path in batch_paths:
            try:
                img = Image.open(path).convert("RGB")
                images.append(img)
            except Exception as e:
                print(f"Skipping {path}: {e}")
                continue

        if not images:
            continue

        inputs = processor(images=images, return_tensors="pt", padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            features = model.get_image_features(**inputs)

        features = features / features.norm(dim=-1, keepdim=True)
        all_embeddings.append(features.cpu().numpy())

        if (i // batch_size) % 10 == 0:
            print(f"Processed {i + len(images)}/{len(image_paths)} images")

    return np.concatenate(all_embeddings, axis=0).astype("float32")

# Process a large directory
image_dir = "dataset/"
image_paths = sorted([
    os.path.join(image_dir, f)
    for f in os.listdir(image_dir)
    if f.lower().endswith((".jpg", ".png", ".jpeg"))
])

embeddings = encode_images_batched(image_paths, batch_size=64)
print(f"Encoded {len(embeddings)} images, shape: {embeddings.shape}")

On a T4 GPU, batch size 64 encodes about 200 images per second with the base model. On CPU, expect 5-10 per second. For datasets over 100k images, batch processing is not optional – it is the difference between minutes and days.

Scaling the Index

IndexFlatIP does exact search. It is fine up to about 100k images. Beyond that, use an approximate index.

import faiss
import numpy as np

dimension = 512
num_images = 1_000_000

# For 100k-1M images: IVF index with clustering
nlist = 1024  # Number of clusters; a common rule of thumb is ~sqrt(num_images)
quantizer = faiss.IndexFlatIP(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)

# Must train on a sample of your data first
training_data = embeddings[:50_000]  # Use a representative sample
index.train(training_data)
index.add(embeddings)

# Control speed vs accuracy tradeoff
index.nprobe = 10  # Search 10 of the 1024 clusters (default is 1)

Set nprobe based on your latency budget. nprobe=1 is fastest but misses results. nprobe=10 catches most relevant results. nprobe=50 is nearly exact but slower. Start at 10 and adjust.
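To choose nprobe empirically, compare IVF results against exact search from an IndexFlatIP built over the same vectors and measure recall@k. The helper below is a sketch; the id lists are illustrative stand-ins for real search output:

```python
def recall_at_k(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

exact = [12, 7, 44, 3, 91]    # top-5 from IndexFlatIP (ground truth)
approx = [12, 7, 3, 88, 91]   # top-5 from IndexIVFFlat at some nprobe
print(recall_at_k(approx, exact))  # 0.8
```

Raise nprobe until recall@k on a held-out set of queries stops improving meaningfully; that is your latency/accuracy sweet spot.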

Common Errors

RuntimeError: expected scalar type Float but found Half

This happens when running on GPU with mixed precision. FAISS expects float32 arrays. Fix it by casting explicitly:

embeddings = features.cpu().float().numpy().astype("float32")

PIL.UnidentifiedImageError: cannot identify image file

Corrupted or non-image files in your dataset. Wrap Image.open() in a try/except block (as shown in the batch processing example) and skip bad files.

index.search() returns -1 indices

This means the index is empty or not trained. For IVF indices, you must call index.train(data) before index.add(data). For IndexFlatIP, make sure you actually added vectors.

CLIP returns identical scores for different queries

You are probably not normalizing embeddings. Without normalization, inner product scores are dominated by vector magnitude rather than direction, so the largest vectors top every result list. Always normalize before adding to the index and before querying.

torch.cuda.OutOfMemoryError during batch encoding

Reduce your batch size. Start at 32 and halve it until it fits. Also make sure you are using torch.no_grad() – without it, PyTorch stores intermediate activations for backpropagation you do not need.
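The halving can be made automatic by wrapping the batched encoder in a retry loop. This sketch is deliberately generic – encode_fn and the exception type are parameters, and fake_encoder is a stand-in used only to demonstrate the behavior:

```python
def encode_with_backoff(encode_fn, paths, batch_size=64, oom_error=MemoryError):
    """Call encode_fn(paths, batch_size=...), halving batch_size on OOM.

    Pass oom_error=torch.cuda.OutOfMemoryError when encoding on GPU.
    """
    while batch_size >= 1:
        try:
            return encode_fn(paths, batch_size=batch_size)
        except oom_error:
            batch_size //= 2
    raise RuntimeError("Even batch_size=1 does not fit in memory")

# Illustrative stand-in: "fails" above batch size 16, succeeds at or below it
def fake_encoder(paths, batch_size):
    if batch_size > 16:
        raise MemoryError("simulated OOM")
    return f"encoded {len(paths)} paths at batch_size={batch_size}"

print(encode_with_backoff(fake_encoder, ["a.jpg", "b.jpg"]))
# encoded 2 paths at batch_size=16
```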