Quick Start: Multilingual Embeddings in 5 Lines
The core idea is simple: encode text from any language into the same vector space. A sentence in English and its translation in Japanese land near each other. That unlocks cross-lingual search, classification, and clustering without training separate models per language.
The first three sentences will cluster together despite being in different languages. The fourth sits far away. That’s the whole trick – one model, one vector space, all languages.
paraphrase-multilingual-MiniLM-L12-v2 supports 50+ languages and produces 384-dimensional vectors. It’s the sweet spot between speed and quality for most production use cases. If you need higher accuracy and can afford the latency, paraphrase-multilingual-mpnet-base-v2 gives 768-dimensional vectors with better performance on benchmarks.
Cross-Lingual Semantic Search with FAISS
The real payoff comes when you build a search index. You can index documents in French, German, Spanish, and Chinese, then query in English. Or any combination.
A few things to notice. We pass normalize_embeddings=True so the vectors are unit-length, which makes the inner product that IndexFlatIP computes equal to cosine similarity. The Japanese, German, and French “cat on mat” sentences all score above 0.88 against the English query, while the machine learning sentences don’t appear in the top 3.
For production with millions of documents, swap IndexFlatIP for IndexIVFFlat or IndexHNSWFlat to get sublinear search time.
Cross-Lingual Similarity Scoring
Sometimes you don’t need a full index. You just want to know how similar two texts are across languages – for deduplication, translation quality checks, or matching support tickets.
Translation pairs score above 0.85. Unrelated pairs drop below 0.2. That gap is wide enough to set a threshold for automated matching.
Cross-Lingual Text Classification
You can train a classifier on English data and apply it to other languages. The multilingual embeddings act as a language-agnostic feature extractor.
Train in one language, infer in any. The sklearn classifier doesn’t know anything about language – it just sees 384-dimensional vectors. This works surprisingly well for sentiment, intent classification, and topic routing.
Bilingual Text Mining
Need to find translation pairs in a mixed-language corpus? Sentence Transformers has a built-in mining utility.
This is useful for building parallel corpora from messy multilingual data, aligning documents across languages, or verifying translation quality at scale.
Choosing the Right Model
Not all multilingual models are equal. Here’s what actually matters:
| Model | Dimensions | Languages | Speed | Quality |
|---|---|---|---|---|
| paraphrase-multilingual-MiniLM-L12-v2 | 384 | 50+ | Fast | Good |
| paraphrase-multilingual-mpnet-base-v2 | 768 | 50+ | Medium | Better |
| distiluse-base-multilingual-cased-v2 | 512 | 50+ | Fast | Decent |
Start with paraphrase-multilingual-MiniLM-L12-v2. It’s half the dimensions of mpnet, which means your FAISS index uses half the memory and search is faster. Switch to mpnet only if you measure a meaningful accuracy gap on your actual data.
For domain-specific tasks, fine-tuning on parallel sentence pairs in your target languages gives a significant boost. Even 1,000 high-quality pairs can move the needle.
Common Errors
RuntimeError: CUDA out of memory
Encoding large batches on GPU eats memory fast. Reduce the batch size:
The default batch size is 32. Drop to 8 or 16 for long sentences or limited GPU memory.
ValueError: expected a non-empty list of sentences
Happens when you pass an empty list or None to model.encode(). Always validate input:
FAISS index returns wrong results after updates
FAISS IndexFlatIP expects normalized vectors when you use inner product as cosine similarity. If you forget normalize_embeddings=True on some batches, scores become meaningless. Always normalize consistently:
Slow encoding on CPU
If you’re stuck on CPU and encoding is painfully slow, try the ONNX or OpenVINO backends:
This typically gives a 2-4x speedup on CPU with little to no accuracy loss.
Garbled results for CJK languages
Some tokenizers struggle with Chinese, Japanese, or Korean if you’re running an older version of sentence-transformers or transformers. Upgrade both:
Also make sure you’re passing actual Unicode strings, not byte strings or escaped sequences.
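If your pipeline hands you bytes, decode them first:

```python
raw = "こんにちは、世界".encode("utf-8")  # bytes, e.g. from a file or socket

# Decode to str before passing anything to model.encode
text = raw.decode("utf-8")
print(type(text).__name__, text)
```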
Related Guides
- How to Build a Sentiment-Aware Search Pipeline with Embeddings
- How to Build a RAG Pipeline with Hugging Face Transformers v5
- How to Build a Text Embedding Pipeline with Sentence Transformers and FAISS
- How to Build a Hybrid Keyword and Semantic Search Pipeline
- How to Build a Text Clustering Pipeline with Embeddings and HDBSCAN
- How to Build a Text Chunking and Splitting Pipeline for RAG
- How to Build a Text-to-Knowledge-Graph Pipeline with SpaCy and NetworkX
- How to Build a Text Entailment and Contradiction Detection Pipeline
- How to Build an Extractive Question Answering System with Transformers
- How to Build a Text Summarization Pipeline with Sumy and Transformers