Cross-encoders score text pairs directly instead of comparing separate embeddings. Because the model sees both texts at once, its attention mechanism can pick up fine-grained interactions that bi-encoders miss entirely. The tradeoff is speed: cross-encoders are too slow for large-scale retrieval, but for pairwise scoring, reranking, and similarity checks on small candidate sets, they beat cosine similarity on accuracy by a wide margin.

Here’s the full setup: load a cross-encoder from sentence-transformers, score text pairs, and wrap the whole thing in a FastAPI service you can deploy today.

Why Cross-Encoders Beat Bi-Encoders for Pairwise Similarity

Bi-encoders encode each text independently into a fixed vector, then compare with cosine similarity. That’s fast for millions of documents but lossy — the model never sees both texts together, so it misses token-level interactions.

Cross-encoders concatenate both texts and pass them through a single transformer. The model attends across both inputs simultaneously, producing a single relevance score. On the STS Benchmark, cross-encoder/stsb-roberta-large scores around 92.7 Spearman correlation compared to roughly 86-88 for the best bi-encoder models.

Use bi-encoders when you need to search millions of documents. Use cross-encoders when you already have a shortlist of 10-100 candidates and need accurate pairwise scores.

Scoring Text Pairs with a Cross-Encoder

Install the dependencies first:

pip install sentence-transformers fastapi uvicorn pydantic

The sentence-transformers library ships with a CrossEncoder class that handles tokenization and inference. Here’s a standalone script that scores a few pairs:

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", max_length=512)

pairs = [
    ("How do I reset my password?", "Click 'Forgot Password' on the login page to reset your credentials."),
    ("How do I reset my password?", "Our restaurant serves Italian and Japanese cuisine."),
    ("What is the return policy?", "You can return any item within 30 days for a full refund."),
]

scores = model.predict(pairs)

for pair, score in zip(pairs, scores):
    print(f"Score: {score:.4f}")
    print(f"  Text A: {pair[0]}")
    print(f"  Text B: {pair[1]}")
    print()

Output looks something like this:

Score: 8.8341
  Text A: How do I reset my password?
  Text B: Click 'Forgot Password' on the login page to reset your credentials.

Score: -5.2107
  Text A: How do I reset my password?
  Text B: Our restaurant serves Italian and Japanese cuisine.

Score: 7.1523
  Text A: What is the return policy?
  Text B: You can return any item within 30 days for a full refund.

Higher scores mean higher relevance. The scale depends on the model — ms-marco-MiniLM-L-12-v2 outputs raw logits roughly in the range of -10 to 10. If you want 0-1 scores, use an STS-trained model like cross-encoder/stsb-roberta-large, which outputs calibrated similarity scores, or apply a sigmoid yourself.
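Applying the sigmoid yourself is a one-liner. Here is a minimal sketch using the example logits printed above (the exact values you see will depend on the model version):

```python
import math

def to_probability(logit: float) -> float:
    """Squash a raw relevance logit into the 0-1 range with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit))

# Logits in the style of ms-marco-MiniLM-L-12-v2 output
for logit in (8.8341, -5.2107, 7.1523):
    print(f"{logit:>8.4f} -> {to_probability(logit):.4f}")
```

Note that a sigmoid makes scores comparable across requests, but it does not make them calibrated probabilities; for that, use a model trained to output similarity scores directly.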

Choosing the Right Model

Pick the model based on your task:

  • cross-encoder/ms-marco-MiniLM-L-12-v2 — Trained on MS MARCO passage ranking. Best for query-document relevance and search reranking. Fast (12 layers).
  • cross-encoder/stsb-roberta-large — Trained on STS Benchmark. Outputs 0-1 similarity scores. Better for pure semantic similarity between two sentences.
  • cross-encoder/ms-marco-MiniLM-L-6-v2 — Smaller, faster variant. Good when latency matters more than the last bit of accuracy.

For this API, we’ll use ms-marco-MiniLM-L-12-v2 since it handles both similarity and relevance tasks well.

Wrapping It in FastAPI

Here’s the full API with the lifespan context manager for model loading, single-pair and batch endpoints, and proper Pydantic validation:

from contextlib import asynccontextmanager
from fastapi import FastAPI
from pydantic import BaseModel, Field
from sentence_transformers import CrossEncoder

model_store: dict = {}


@asynccontextmanager
async def lifespan(app: FastAPI):
    model_store["cross_encoder"] = CrossEncoder(
        "cross-encoder/ms-marco-MiniLM-L-12-v2", max_length=512
    )
    yield
    model_store.clear()


app = FastAPI(title="Text Similarity API", lifespan=lifespan)


class TextPair(BaseModel):
    text_a: str = Field(..., min_length=1, max_length=2000)
    text_b: str = Field(..., min_length=1, max_length=2000)


class BatchRequest(BaseModel):
    pairs: list[TextPair] = Field(..., min_length=1, max_length=100)


class SimilarityResult(BaseModel):
    text_a: str
    text_b: str
    score: float


@app.post("/score", response_model=SimilarityResult)
async def score_pair(request: TextPair):
    model = model_store["cross_encoder"]
    score = model.predict([(request.text_a, request.text_b)])[0]
    return SimilarityResult(
        text_a=request.text_a,
        text_b=request.text_b,
        score=float(score),
    )


@app.post("/score/batch", response_model=list[SimilarityResult])
async def score_batch(request: BatchRequest):
    model = model_store["cross_encoder"]
    input_pairs = [(p.text_a, p.text_b) for p in request.pairs]
    scores = model.predict(input_pairs)
    return [
        SimilarityResult(
            text_a=pair.text_a,
            text_b=pair.text_b,
            score=float(score),
        )
        for pair, score in zip(request.pairs, scores)
    ]


@app.get("/health")
async def health():
    return {"status": "ok", "model_loaded": "cross_encoder" in model_store}

Run it:

uvicorn main:app --host 0.0.0.0 --port 8000

Test the single-pair endpoint:

curl -X POST http://localhost:8000/score \
  -H "Content-Type: application/json" \
  -d '{"text_a": "How to train a neural network", "text_b": "Guide to training deep learning models"}'

Test the batch endpoint:

curl -X POST http://localhost:8000/score/batch \
  -H "Content-Type: application/json" \
  -d '{
    "pairs": [
      {"text_a": "machine learning basics", "text_b": "intro to ML algorithms"},
      {"text_a": "machine learning basics", "text_b": "best pizza in New York"}
    ]
  }'

Batch Scoring Performance

Cross-encoder inference benefits from batching. The predict method already handles this internally — pass all your pairs at once rather than calling predict in a loop. For large batches, you can control the internal batch size:

scores = model.predict(pairs, batch_size=32)

On a CPU, expect roughly 50-100 pairs per second with the MiniLM-L-12 model. On a GPU, you’ll get 500-2000 pairs per second depending on text lengths. That’s why cross-encoders work best as a reranking stage after a fast bi-encoder retrieval step.
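Under the hood, batch_size simply controls how many pairs go through the model per forward pass. The chunking logic looks roughly like this (illustrative only — the library already does this for you inside predict):

```python
from typing import Iterable, TypeVar

T = TypeVar("T")

def chunked(items: list[T], batch_size: int) -> Iterable[list[T]]:
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# With 100 pairs and batch_size=32, the model runs 4 forward passes:
# three full batches of 32 and one final batch of 4.
batches = list(chunked(list(range(100)), 32))
print([len(b) for b in batches])  # [32, 32, 32, 4]
```

Larger batches amortize per-call overhead but use more memory; if you hit out-of-memory errors on GPU, lowering batch_size is the first knob to turn.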

Comparing Cross-Encoder vs Bi-Encoder

Here’s a direct comparison to see the accuracy difference yourself:

from sentence_transformers import CrossEncoder, SentenceTransformer, util

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("The cat sat on the mat", "A feline rested on a rug"),
    ("Python is great for ML", "Java is used in enterprise software"),
    ("How to fix a flat tire", "Steps to repair a punctured bicycle tire"),
]

print("Cross-Encoder Scores:")
ce_scores = cross_encoder.predict(pairs)
for pair, score in zip(pairs, ce_scores):
    print(f"  {score:>8.4f}  |  {pair[0]}  <->  {pair[1]}")

print("\nBi-Encoder Cosine Similarity:")
for text_a, text_b in pairs:
    emb_a = bi_encoder.encode(text_a)
    emb_b = bi_encoder.encode(text_b)
    cos_sim = util.cos_sim(emb_a, emb_b).item()
    print(f"  {cos_sim:>8.4f}  |  {text_a}  <->  {text_b}")

You’ll notice the cross-encoder assigns a much wider score gap between related and unrelated pairs. That sharper discrimination is exactly why cross-encoders win for tasks where precision matters — duplicate detection, answer selection, and reranking retrieved passages.
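That two-stage pattern (retrieve a shortlist cheaply, then rerank it accurately) can be sketched as a small helper. The score_fn parameter here is a stand-in for cross_encoder.predict from the scripts above; the overlap_scorer below is a dummy scorer used only so the sketch runs without downloading a model:

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[list[tuple[str, str]]], list[float]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Score (query, candidate) pairs and return the top_k candidates by score."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# Dummy scorer for illustration: counts words shared by query and document.
def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]

docs = [
    "reset your password on the login page",
    "our restaurant serves pizza",
    "password reset steps",
]
print(rerank("how to reset my password", docs, overlap_scorer, top_k=2))
```

In production you would pass cross_encoder.predict as score_fn and feed in the top 10-100 hits from a bi-encoder or BM25 retrieval stage.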

Common Errors and Fixes

RuntimeError: CUDA out of memory — Cross-encoders load the full transformer model. If you’re on a GPU with limited VRAM, either use a smaller model (ms-marco-MiniLM-L-6-v2) or force CPU inference:

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", device="cpu")

ValueError: Input exceeds max_length — The default max token length varies by model. Set max_length explicitly when loading the model to truncate long inputs rather than crashing:

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", max_length=512)

Scores look wrong or inverted — Make sure you’re using the right model for your task. MS MARCO models output relevance logits (higher = more relevant). STS models output 0-1 similarity scores. Mixing them up leads to confusing results.

Slow inference on batch endpoint — If your batch endpoint feels sluggish, check that you’re passing all pairs to predict() at once instead of looping. Also verify you’re not accidentally running on CPU when a GPU is available. Check with:

import torch
print(torch.cuda.is_available())

OSError: Can't load tokenizer for 'cross-encoder/...' — This usually means the model wasn’t downloaded yet and your environment has no internet access. Pre-download models before deployment:

python -c "from sentence_transformers import CrossEncoder; CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')"

FastAPI returns 422 on valid-looking requests — Check that your JSON body matches the Pydantic model exactly. The field names are text_a and text_b, not textA or text1. For batch requests, wrap pairs in a pairs array.