Why Off-the-Shelf Embeddings Fall Short

General-purpose embedding models like all-MiniLM-L6-v2 work well on generic text, but they struggle with domain-specific jargon. If you’re building search for legal documents, medical records, or internal engineering wikis, the base model has never seen your terminology in the right context. “Plaintiff filed a motion to compel” and “Party requested forced disclosure” mean the same thing, but a general model might not place them close together in vector space.

Fine-tuning fixes this. You train the model on query-document pairs from your domain, and it learns which texts should be close together. The result: 10-30% better retrieval accuracy with the same model architecture and inference cost.

Here’s the full workflow using the sentence-transformers library.

Build Your Training Data

You need pairs of (query, relevant_document). The easiest way to start is with your existing search logs – queries users typed and the documents they clicked. If you don’t have logs, use an LLM to generate synthetic queries from your documents.

import json

# Each example: {"query": "...", "positive": "..."}
# The positive is a passage that answers the query
training_data = [
    {
        "query": "statute of limitations for breach of contract",
        "positive": "Under UCC Section 2-725, an action for breach of a sales contract must be commenced within four years after the cause of action has accrued."
    },
    {
        "query": "can landlord enter without notice",
        "positive": "A landlord must provide at least 24 hours written notice before entering a tenant's dwelling unit, except in cases of emergency as defined in Civil Code Section 1954."
    },
    {
        "query": "requirements for valid will",
        "positive": "A valid will requires the testator to be at least 18 years of age, of sound mind, and the document must be signed by the testator and witnessed by at least two competent witnesses."
    },
]

# Save as JSONL for loading later
with open("train_pairs.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")

You want at least 500 pairs for noticeable improvement, and 5,000+ for strong results. Quality matters more than quantity – noisy pairs where the “positive” document doesn’t actually answer the query will hurt you.
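A cheap first-pass filter catches the worst offenders before training. This is a minimal sketch, not part of the library – the length thresholds and the duplicate check are illustrative assumptions you should tune for your domain:

```python
def filter_pairs(pairs, min_query_len=3, min_positive_len=20):
    """Drop pairs that are too short to be meaningful, plus duplicate queries."""
    seen_queries = set()
    kept = []
    for p in pairs:
        query = p["query"].strip()
        positive = p["positive"].strip()
        if len(query.split()) < min_query_len:
            continue  # one- or two-word queries are usually too ambiguous
        if len(positive.split()) < min_positive_len:
            continue  # very short passages rarely answer anything
        if query.lower() in seen_queries:
            continue  # duplicate queries bias the batch sampler
        seen_queries.add(query.lower())
        kept.append({"query": query, "positive": positive})
    return kept

pairs = [
    {"query": "requirements for valid will",
     "positive": "A valid will requires the testator to be at least 18 years of age, "
                 "of sound mind, and the document must be signed by the testator and "
                 "witnessed by at least two competent witnesses."},
    {"query": "will", "positive": "A valid will requires a testator."},  # dropped: too short
]
print(len(filter_pairs(pairs)))  # 1
```

Manual review of a random sample is still the gold standard; a filter like this just keeps obvious junk out of the batch.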

Fine-Tune with MultipleNegativesRankingLoss

MultipleNegativesRankingLoss (MNRL) is the go-to loss function for retrieval fine-tuning. For each (query, positive) pair in a batch, every other positive in that batch acts as a negative. With a batch size of 64, each query sees 1 positive and 63 negatives. No need to mine hard negatives yourself – the batch does it for you.
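To make the mechanics concrete, here is a minimal numpy sketch of what MNRL computes for one batch. This is illustrative only – the real implementation lives in losses.MultipleNegativesRankingLoss, which by default uses cosine similarity scaled by 20:

```python
import numpy as np

def mnrl_loss(query_emb, positive_emb, scale=20.0):
    """In-batch MNRL: for row i, positive i is the target class;
    every other positive in the batch serves as a negative."""
    # Normalize so that dot product = cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = positive_emb / np.linalg.norm(positive_emb, axis=1, keepdims=True)
    scores = scale * (q @ p.T)  # (batch, batch) similarity matrix
    # Cross-entropy with the diagonal as the correct class
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
# Each positive is a noisy copy of its query; the other 3 act as negatives
positives = queries + 0.1 * rng.normal(size=(4, 8))
print(mnrl_loss(queries, positives))           # low: diagonal dominates
print(mnrl_loss(queries, positives[::-1].copy()))  # high: pairs mismatched
```

The loss drives each query toward its own positive and away from everything else in the batch, which is why batch size matters so much below.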

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.training_args import BatchSamplers
from datasets import Dataset
import json

# Load training pairs
records = []
with open("train_pairs.jsonl") as f:
    for line in f:
        records.append(json.loads(line))

train_dataset = Dataset.from_list([
    {"anchor": r["query"], "positive": r["positive"]}
    for r in records
])

# Load base model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# MNRL: in-batch negatives, no need for explicit negative examples
loss = losses.MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="./fine-tuned-embeddings",
    num_train_epochs=3,
    per_device_train_batch_size=64,   # larger batch = more negatives = better
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate passages in a batch
    save_strategy="epoch",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)

trainer.train()
model.save_pretrained("./fine-tuned-embeddings/final")

Batch size is critical here. MNRL performance scales directly with batch size because more items in the batch means more negatives per query. If your GPU can handle 128 or 256, use it. On a 16GB GPU, batch size 64 with bge-base-en-v1.5 (768 dimensions) fits comfortably.

Add Matryoshka Representation Learning

Matryoshka embeddings let you truncate vectors to smaller dimensions at inference time without retraining. A 768-dim model can be used at 256 or 128 dimensions with graceful quality degradation. This is useful when you need to trade off storage/speed vs. accuracy – store 256-dim vectors in your vector database instead of 768-dim ones.

Wrap your loss function with MatryoshkaLoss:

from sentence_transformers import losses

base_loss = losses.MultipleNegativesRankingLoss(model)

# Train at multiple dimensionalities simultaneously
matryoshka_loss = losses.MatryoshkaLoss(
    model=model,
    loss=base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
)

# Use matryoshka_loss instead of base_loss in the trainer
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=matryoshka_loss,
)

trainer.train()

At inference time, you truncate and normalize:

import numpy as np

# Encode WITHOUT normalizing, truncate, then re-normalize the truncated vectors
embeddings = model.encode(["search query here"])
embeddings_256d = embeddings[:, :256]  # use first 256 dims
embeddings_256d = embeddings_256d / np.linalg.norm(embeddings_256d, axis=1, keepdims=True)

# Or set truncate_dim on the model and let the library handle the order
model.truncate_dim = 256
embeddings_256d = model.encode(["search query here"], normalize_embeddings=True)

The first 256 dimensions capture ~95% of the information in typical Matryoshka-trained models. At 128 dimensions you’re usually at ~90%. Below 64 quality drops off fast.

Evaluate with InformationRetrievalEvaluator

Don’t ship a fine-tuned model without measuring it. InformationRetrievalEvaluator computes standard IR metrics: NDCG@k, MRR@k, MAP@k, and Recall@k.

You need a test set with queries, a corpus, and relevance judgments (which query maps to which corpus documents).

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Corpus: doc_id -> text
corpus = {
    "doc_0": "Under UCC Section 2-725, an action for breach...",
    "doc_1": "A landlord must provide at least 24 hours notice...",
    "doc_2": "A valid will requires the testator to be at least 18...",
    # ... hundreds more
}

# Queries: query_id -> text
queries = {
    "q_0": "how long to sue for broken contract",
    "q_1": "landlord entering apartment rules",
    "q_2": "what makes a will legally valid",
}

# Relevant docs per query: query_id -> set of doc_ids
relevant_docs = {
    "q_0": {"doc_0"},
    "q_1": {"doc_1"},
    "q_2": {"doc_2"},
}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="legal-domain-eval",
    show_progress_bar=True,
)

# Run evaluation
results = evaluator(model)
print(f"NDCG@10: {results['legal-domain-eval_cosine_ndcg@10']:.4f}")
print(f"MRR@10:  {results['legal-domain-eval_cosine_mrr@10']:.4f}")
print(f"Recall@10: {results['legal-domain-eval_cosine_recall@10']:.4f}")

Run this evaluator on both the base model and your fine-tuned model to measure the improvement. If NDCG@10 doesn’t improve by at least 2-3 points, your training data might be too noisy or too small.
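If you want to sanity-check the evaluator's numbers, the core metrics are easy to compute by hand. A minimal sketch (the ranked results here are hypothetical, and this is not the evaluator's internal implementation):

```python
def recall_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the relevant docs that appear in the top k results."""
    hits = len(set(ranked_doc_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def mrr_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant doc in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# One query's ranking: the relevant doc comes back second
ranked = ["doc_7", "doc_0", "doc_3"]
relevant = {"doc_0"}
print(recall_at_k(ranked, relevant))  # 1.0 (doc_0 is in the top 10)
print(mrr_at_k(ranked, relevant))     # 0.5 (first relevant hit at rank 2)
```

The evaluator averages these per-query scores over the whole test set, which is why a handful of noisy relevance judgments can visibly move the headline numbers.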

You can also pass the evaluator directly into the trainer to log metrics during training:

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=matryoshka_loss,
    evaluator=evaluator,
)

Use the Fine-Tuned Model

Load the saved model and use it like any sentence-transformers model:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("./fine-tuned-embeddings/final")

# Embed your corpus (do this once, store the vectors)
corpus_texts = ["doc text 1...", "doc text 2...", "doc text 3..."]
corpus_embeddings = model.encode(corpus_texts, normalize_embeddings=True)

# Search
query = "what's the deadline for filing a contract dispute"
query_embedding = model.encode([query], normalize_embeddings=True)

# Cosine similarity (since vectors are normalized, dot product = cosine sim)
scores = np.dot(corpus_embeddings, query_embedding.T).flatten()
top_indices = np.argsort(scores)[::-1][:10]

for idx in top_indices:
    print(f"Score: {scores[idx]:.4f} | {corpus_texts[idx][:100]}")

Always set normalize_embeddings=True when encoding. For unit-length vectors the dot product equals cosine similarity, so you get the same ranking with a cheaper operation – and it's what most vector databases expect.

Common Errors and Fixes

RuntimeError: CUDA out of memory during training

Reduce per_device_train_batch_size. MNRL still works at batch size 32 or even 16, just less effectively. You can also enable gradient checkpointing or use fp16=True if you haven't already. Note that gradient_accumulation_steps won't compensate: in-batch negatives come from each per-device batch, so accumulation grows the effective batch for gradients but not the number of negatives per query.

Evaluation metrics are worse after fine-tuning

This usually means your training pairs are noisy. Check a random sample of 50 pairs manually. If more than 10% have mismatched query-document pairs, clean your data before retraining. Another cause: training too long. Try 1 epoch instead of 3 – embedding models overfit fast on small datasets.

ValueError: Columns ['anchor', 'positive'] not found

The SentenceTransformerTrainer expects specific column names depending on the loss function. For MNRL with pairs, use anchor and positive. If you’re using triplets with explicit negatives, the columns should be anchor, positive, and negative.

Matryoshka truncated embeddings give bad results

Make sure you’re normalizing after truncation, not before. If you truncate pre-normalized vectors, they’re no longer unit-length and dot product scores will be inconsistent. Use model.truncate_dim = 256 and normalize_embeddings=True together – the library handles the order correctly.
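A quick numpy check of why the order matters, sketched with a random vector in place of a real embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=(1, 768))
v = v / np.linalg.norm(v)            # unit length at 768 dims

truncated = v[:, :256]
print(np.linalg.norm(truncated))     # < 1.0: no longer a unit vector

# Re-normalizing after truncation restores valid cosine scoring
fixed = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
print(np.linalg.norm(fixed))         # 1.0
```

Dot products between vectors of inconsistent lengths are not comparable across documents, which is exactly the symptom described above.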

Fine-tuned model doesn’t load with SentenceTransformer()

The output directory needs the sentence_transformers config files (modules.json, config_sentence_transformers.json) alongside the model weights, or SentenceTransformer() can't reconstruct the pooling setup. Recent library versions write these from both model.save() and model.save_pretrained(); on older versions where save_pretrained() saved only the transformer weights, use model.save("./path") instead.