Why Off-the-Shelf Embeddings Fall Short
General-purpose embedding models like all-MiniLM-L6-v2 work well on generic text, but they struggle with domain-specific jargon. If you’re building search for legal documents, medical records, or internal engineering wikis, the base model has never seen your terminology in the right context. “Plaintiff filed a motion to compel” and “Party requested forced disclosure” mean the same thing, but a general model might not place them close together in vector space.
Fine-tuning fixes this. You train the model on query-document pairs from your domain, and it learns which texts should be close together. The result: 10-30% better retrieval accuracy with the same model architecture and inference cost.
Here’s the full workflow using the sentence-transformers library.
Build Your Training Data
You need pairs of (query, relevant_document). The easiest way to start is with your existing search logs – queries users typed and the documents they clicked. If you don’t have logs, use an LLM to generate synthetic queries from your documents.
You want at least 500 pairs for noticeable improvement, and 5,000+ for strong results. Quality matters more than quantity – noisy pairs where the “positive” document doesn’t actually answer the query will hurt you.
Fine-Tune with MultipleNegativesRankingLoss
MultipleNegativesRankingLoss (MNRL) is the go-to loss function for retrieval fine-tuning. For each (query, positive) pair in a batch, every other positive in that batch acts as a negative. With a batch size of 64, each query sees 1 positive and 63 negatives. No need to mine hard negatives yourself – the batch does it for you.
Batch size is critical here. MNRL performance scales directly with batch size because more items in the batch means more negatives per query. If your GPU can handle 128 or 256, use it. On a 16GB GPU, batch size 64 with bge-base-en-v1.5 (768 dimensions) fits comfortably.
Add Matryoshka Representation Learning
Matryoshka embeddings let you truncate vectors to smaller dimensions at inference time without retraining. A 768-dim model can be used at 256 or 128 dimensions with graceful quality degradation. This is useful when you need to trade off storage/speed vs. accuracy – store 256-dim vectors in your vector database instead of 768-dim ones.
Wrap your loss function with MatryoshkaLoss:
At inference time, you truncate and normalize:
The first 256 dimensions capture ~95% of the information in typical Matryoshka-trained models. At 128 dimensions you’re usually at ~90%. Below 64 dimensions, quality drops off fast.
Evaluate with InformationRetrievalEvaluator
Don’t ship a fine-tuned model without measuring it. InformationRetrievalEvaluator computes standard IR metrics: NDCG@k, MRR@k, MAP@k, and Recall@k.
You need a test set with queries, a corpus, and relevance judgments (which query maps to which corpus documents).
Run this evaluator on both the base model and your fine-tuned model to measure the improvement. If NDCG@10 doesn’t improve by at least 2-3 points, your training data might be too noisy or too small.
You can also pass the evaluator directly into the trainer to log metrics during training:
Use Your Fine-Tuned Model for Search
Load the saved model and use it like any sentence-transformers model:
Always set normalize_embeddings=True when encoding. Normalized vectors let you use dot product instead of cosine similarity, which is faster and what most vector databases expect.
Common Errors and Fixes
RuntimeError: CUDA out of memory during training
Reduce per_device_train_batch_size. MNRL still works at batch size 32 or even 16, it’s just less effective. You can also enable gradient checkpointing or use fp16=True if you haven’t already.
Evaluation metrics are worse after fine-tuning
This usually means your training pairs are noisy. Check a random sample of 50 pairs manually. If more than 10% have mismatched query-document pairs, clean your data before retraining. Another cause: training too long. Try 1 epoch instead of 3 – embedding models overfit fast on small datasets.
ValueError: Columns ['anchor', 'positive'] not found
The SentenceTransformerTrainer expects specific column names depending on the loss function. For MNRL with pairs, use anchor and positive. If you’re using triplets with explicit negatives, the columns should be anchor, positive, and negative.
Matryoshka truncated embeddings give bad results
Make sure you’re normalizing after truncation, not before. If you truncate pre-normalized vectors, they’re no longer unit-length and dot product scores will be inconsistent. Use model.truncate_dim = 256 and normalize_embeddings=True together – the library handles the order correctly.
Fine-tuned model doesn’t load with SentenceTransformer()
If you saved with model.save_pretrained(), the output directory needs both the model weights and the sentence_transformers config files. Use model.save("./path") instead, which saves everything needed for SentenceTransformer("./path") to work.
Related Guides
- How to Fine-Tune LLMs with LoRA and Unsloth
- How to Fine-Tune LLMs on Custom Datasets with Axolotl
- How to Fine-Tune LLMs with DPO and RLHF
- How to Build Context-Aware Prompt Routing with Embeddings
- How to Distill Large LLMs into Smaller, Cheaper Models
- How to Build Retrieval-Augmented Generation with Contextual Compression
- How to Route Prompts to the Best LLM with a Semantic Router
- How to Build Few-Shot Prompt Templates with Dynamic Examples
- How to Build RAG Applications with LangChain and ChromaDB
- How to Build Prefix Tuning for LLMs with PEFT and PyTorch