Cross-encoders score text pairs directly instead of comparing separate embeddings. That matters because they see both texts at the same time, letting the attention mechanism find fine-grained interactions that bi-encoders miss entirely. The tradeoff is speed — cross-encoders are slower for large-scale retrieval — but for pairwise scoring, reranking, and similarity checks on small candidate sets, they crush cosine similarity on accuracy.
Here’s the full setup: load a cross-encoder from sentence-transformers, score text pairs, and wrap the whole thing in a FastAPI service you can deploy today.
Why Cross-Encoders Beat Bi-Encoders for Pairwise Similarity
Bi-encoders encode each text independently into a fixed vector, then compare with cosine similarity. That’s fast for millions of documents but lossy — the model never sees both texts together, so it misses token-level interactions.
Cross-encoders concatenate both texts and pass them through a single transformer. The model attends across both inputs simultaneously, producing a single relevance score. On the STS Benchmark, cross-encoder/stsb-roberta-large scores around 92.7 Spearman correlation compared to roughly 86-88 for the best bi-encoder models.
Use bi-encoders when you need to search millions of documents. Use cross-encoders when you already have a shortlist of 10-100 candidates and need accurate pairwise scores.
Scoring Text Pairs with a Cross-Encoder
Install the dependencies first:
```bash
pip install sentence-transformers fastapi "uvicorn[standard]"
```
The sentence-transformers library ships with a CrossEncoder class that handles tokenization and inference. Here’s a standalone script that scores a few pairs:
```python
from sentence_transformers import CrossEncoder

# Load the reranking model (downloads on first use)
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

# Each pair is (query, candidate) — the example texts are illustrative
pairs = [
    ("How do I reset my password?", "Click 'Forgot password' on the login page to reset it."),
    ("How do I reset my password?", "Our office is closed on public holidays."),
    ("What is the capital of France?", "Paris is the capital and largest city of France."),
]

scores = model.predict(pairs)
for (text_a, text_b), score in zip(pairs, scores):
    print(f"{score:7.2f}  {text_a!r} <-> {text_b!r}")
```
Output looks something like this (exact scores vary with library version and hardware):

```text
   9.43  'How do I reset my password?' <-> "Click 'Forgot password' on the login page to reset it."
 -10.18  'How do I reset my password?' <-> 'Our office is closed on public holidays.'
   8.77  'What is the capital of France?' <-> 'Paris is the capital and largest city of France.'
```
Higher scores mean higher relevance. The scale depends on the model — ms-marco-MiniLM-L-12-v2 outputs raw logits roughly in the range of -10 to 10. If you want 0-1 scores, use an STS-trained model like cross-encoder/stsb-roberta-large which outputs calibrated similarity scores, or apply a sigmoid yourself.
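If you go the sigmoid route, here's a minimal sketch (the logit values in the prints are illustrative):

```python
import math

def logit_to_probability(logit: float) -> float:
    """Map a raw relevance logit to a 0-1 score with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit))

print(logit_to_probability(8.5))   # close to 1: strong relevance
print(logit_to_probability(-9.2))  # close to 0: irrelevant
```

Note that a sigmoid makes scores monotonic on 0-1, but they are not calibrated similarities the way STS-model outputs are.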
Choosing the Right Model
Pick the model based on your task:
- `cross-encoder/ms-marco-MiniLM-L-12-v2` — Trained on MS MARCO passage ranking. Best for query-document relevance and search reranking. Fast (12 layers).
- `cross-encoder/stsb-roberta-large` — Trained on the STS Benchmark. Outputs 0-1 similarity scores. Better for pure semantic similarity between two sentences.
- `cross-encoder/ms-marco-MiniLM-L-6-v2` — Smaller, faster variant. Good when latency matters more than the last bit of accuracy.
For this API, we’ll use ms-marco-MiniLM-L-12-v2 since it handles both similarity and relevance tasks well.
Wrapping It in FastAPI
Here’s the full API with the lifespan context manager for model loading, single-pair and batch endpoints, and proper Pydantic validation. Save it as `main.py`:

```python
from contextlib import asynccontextmanager
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel, Field
from sentence_transformers import CrossEncoder

MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-12-v2"
models = {}


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model once at startup instead of per-request
    models["cross_encoder"] = CrossEncoder(MODEL_NAME, max_length=512)
    yield
    models.clear()


app = FastAPI(title="Cross-Encoder Scoring API", lifespan=lifespan)


class PairRequest(BaseModel):
    text_a: str = Field(..., min_length=1)
    text_b: str = Field(..., min_length=1)


class PairScore(BaseModel):
    text_a: str
    text_b: str
    score: float


class BatchRequest(BaseModel):
    pairs: List[PairRequest] = Field(..., min_length=1, max_length=256)


class BatchResponse(BaseModel):
    scores: List[PairScore]


@app.post("/score", response_model=PairScore)
def score_pair(req: PairRequest):
    score = models["cross_encoder"].predict([(req.text_a, req.text_b)])[0]
    return PairScore(text_a=req.text_a, text_b=req.text_b, score=float(score))


@app.post("/score/batch", response_model=BatchResponse)
def score_batch(req: BatchRequest):
    pairs = [(p.text_a, p.text_b) for p in req.pairs]
    scores = models["cross_encoder"].predict(pairs)
    return BatchResponse(
        scores=[
            PairScore(text_a=p.text_a, text_b=p.text_b, score=float(s))
            for p, s in zip(req.pairs, scores)
        ]
    )
```
Run it with uvicorn (assuming the app lives in `main.py`):

```bash
uvicorn main:app --host 0.0.0.0 --port 8000
```
Test the single-pair endpoint:
```bash
curl -X POST http://localhost:8000/score \
  -H "Content-Type: application/json" \
  -d '{"text_a": "How do I reset my password?", "text_b": "Use the Forgot password link on the login page."}'
```
Test the batch endpoint:
```bash
curl -X POST http://localhost:8000/score/batch \
  -H "Content-Type: application/json" \
  -d '{"pairs": [{"text_a": "How do I reset my password?", "text_b": "Use the Forgot password link."}, {"text_a": "How do I reset my password?", "text_b": "Our office is closed on public holidays."}]}'
```
Batch Scoring Performance
Cross-encoder inference benefits from batching. The predict method already handles this internally — pass all your pairs at once rather than calling predict in a loop. For large batches, you can control the internal batch size:
```python
# Pass all pairs in one call; tune the internal batch size for your hardware
scores = model.predict(pairs, batch_size=64)  # default is 32
```
On a CPU, expect roughly 50-100 pairs per second with the MiniLM-L-12 model. On a GPU, you’ll get 500-2000 pairs per second depending on text lengths. That’s why cross-encoders work best as a reranking stage after a fast bi-encoder retrieval step.
Comparing Cross-Encoder vs Bi-Encoder
Here’s a direct comparison to see the accuracy difference yourself:
```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

pairs = [
    ("A man is eating food.", "A man is eating a meal."),       # related
    ("A man is eating food.", "The stock market fell today."),  # unrelated
]

bi = SentenceTransformer("all-MiniLM-L6-v2")
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

for a, b in pairs:
    emb = bi.encode([a, b])
    cos = util.cos_sim(emb[0], emb[1]).item()
    ce = cross.predict([(a, b)])[0]
    print(f"bi-encoder cosine: {cos:.3f} | cross-encoder: {ce:.3f} | {a!r} vs {b!r}")
```
You’ll notice the cross-encoder assigns a much wider score gap between related and unrelated pairs. That sharper discrimination is exactly why cross-encoders win for tasks where precision matters — duplicate detection, answer selection, and reranking retrieved passages.
Common Errors and Fixes
RuntimeError: CUDA out of memory — Cross-encoders load the full transformer model. If you’re on a GPU with limited VRAM, either use a smaller model (ms-marco-MiniLM-L-6-v2) or force CPU inference:
```python
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cpu")
```
ValueError: Input exceeds max_length — The default max token length varies by model. Set max_length explicitly when loading the model to truncate long inputs rather than crashing:
```python
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", max_length=512)
```
Scores look wrong or inverted — Make sure you’re using the right model for your task. MS MARCO models output relevance logits (higher = more relevant). STS models output 0-1 similarity scores. Mixing them up leads to confusing results.
Slow inference on batch endpoint — If your batch endpoint feels sluggish, check that you’re passing all pairs to predict() at once instead of looping. Also verify you’re not accidentally running on CPU when a GPU is available. Check with:
```python
import torch
print(torch.cuda.is_available())  # True means PyTorch can see a GPU
```
OSError: Can't load tokenizer for 'cross-encoder/...' — This usually means the model wasn’t downloaded yet and your environment has no internet access. Pre-download models before deployment:
```python
from sentence_transformers import CrossEncoder

# Instantiating the model downloads and caches it (under ~/.cache/huggingface by default)
CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
```
FastAPI returns 422 on valid-looking requests — Check that your JSON body matches the Pydantic model exactly. The field names are text_a and text_b, not textA or text1. For batch requests, wrap pairs in a pairs array.
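For reference, a minimal batch body the service accepts (field names match the Pydantic models; the endpoint path naming follows this guide's examples):

```json
{
  "pairs": [
    {"text_a": "how to reset password", "text_b": "Use the Forgot password link."}
  ]
}
```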
Related Guides
- How to Build a Sentiment Analysis API with Transformers and FastAPI
- How to Build a Text Embedding Pipeline with Sentence Transformers and FAISS
- How to Build a Text Readability Scoring Pipeline with Python
- How to Build a Text-to-SQL Pipeline with LLMs
- How to Build a Text Correction and Grammar Checking Pipeline
- How to Build a Text Classification Pipeline with SetFit
- How to Build a Text Clustering Pipeline with Embeddings and HDBSCAN
- How to Build a Text Style Transfer Pipeline with Transformers
- How to Build a Text Paraphrase Pipeline with T5 and PEGASUS
- How to Build a Named Entity Recognition Pipeline with spaCy and Transformers