Quick Start: Multilingual Embeddings in 5 Lines
The core idea is simple: encode text from any language into the same vector space. A sentence in English and its translation in Japanese land near each other. That unlocks cross-lingual search, classification, and clustering without training separate models per language.
The first three sentences will cluster together despite being in different languages. The fourth sits far away. That’s the whole trick – one model, one vector space, all languages.
paraphrase-multilingual-MiniLM-L12-v2 supports 50+ languages and produces 384-dimensional vectors. It’s the sweet spot between speed and quality for most production use cases. If you need higher accuracy and can afford the latency, paraphrase-multilingual-mpnet-base-v2 gives 768-dimensional vectors with better performance on benchmarks.
Cross-Lingual Semantic Search with FAISS
The real payoff comes when you build a search index. You can index documents in French, German, Spanish, and Chinese, then query in English. Or any combination.
A few things to notice. We pass normalize_embeddings=True so the vectors are unit-length, which makes the inner product that IndexFlatIP computes equal to cosine similarity. The Japanese, German, and French “cat on mat” sentences all score above 0.88 against the English query, while the machine learning sentences don’t appear in the top 3.
For production with millions of documents, swap IndexFlatIP for IndexIVFFlat or IndexHNSWFlat to get sublinear search time.
Cross-Lingual Similarity Scoring
Sometimes you don’t need a full index. You just want to know how similar two texts are across languages – for deduplication, translation quality checks, or matching support tickets.
Translation pairs score above 0.85. Unrelated pairs drop below 0.2. That gap is wide enough to set a threshold for automated matching.
Cross-Lingual Text Classification
You can train a classifier on English data and apply it to other languages. The multilingual embeddings act as a language-agnostic feature extractor.
Train in one language, infer in any. The sklearn classifier doesn’t know anything about language – it just sees 384-dimensional vectors. This works surprisingly well for sentiment, intent classification, and topic routing.
Bilingual Text Mining
Need to find translation pairs in a mixed-language corpus? Sentence Transformers has a built-in mining utility.
This is useful for building parallel corpora from messy multilingual data, aligning documents across languages, or verifying translation quality at scale.
Choosing the Right Model
Not all multilingual models are equal. Here’s what actually matters:
| Model | Dimensions | Languages | Speed | Quality |
|---|---|---|---|---|
| paraphrase-multilingual-MiniLM-L12-v2 | 384 | 50+ | Fast | Good |
| paraphrase-multilingual-mpnet-base-v2 | 768 | 50+ | Medium | Better |
| distiluse-base-multilingual-cased-v2 | 512 | 50+ | Fast | Decent |
Start with paraphrase-multilingual-MiniLM-L12-v2. It’s half the dimensions of mpnet, which means your FAISS index uses half the memory and search is faster. Switch to mpnet only if you measure a meaningful accuracy gap on your actual data.
For domain-specific tasks, fine-tuning on parallel sentence pairs in your target languages gives a significant boost. Even 1,000 high-quality pairs can move the needle.
Common Errors
RuntimeError: CUDA out of memory
Encoding large batches on GPU eats memory fast. Reduce the batch size:
The default batch size is 32. Drop to 8 or 16 for long sentences or limited GPU memory.
ValueError: expected a non-empty list of sentences
Happens when you pass an empty list or None to model.encode(). Always validate input:
FAISS index returns wrong results after updates
FAISS IndexFlatIP expects normalized vectors when you use inner product as cosine similarity. If you forget normalize_embeddings=True on some batches, scores become meaningless. Always normalize consistently:
Slow encoding on CPU
If you’re stuck on CPU and encoding is painfully slow, try the ONNX or OpenVINO backends:
This typically gives a 2-4x speedup on CPU with little to no accuracy loss.
Garbled results for CJK languages
Some tokenizers struggle with Chinese, Japanese, or Korean if you’re running an older version of sentence-transformers or transformers. Upgrade both:
Also make sure you’re passing actual Unicode strings, not byte strings or escaped sequences.
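If your pipeline hands you bytes, decode them first:

```python
raw = "こんにちは、世界".encode("utf-8")  # bytes, e.g. from a file or socket

# Decode to str before passing anything to model.encode
text = raw.decode("utf-8")
print(type(text).__name__, text)
```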
Related Guides
- How to Build a Sentiment-Aware Search Pipeline with Embeddings
- How to Build a RAG Pipeline with Hugging Face Transformers v5
- How to Build a Text Embedding Pipeline with Sentence Transformers and FAISS
- How to Build a Hybrid Keyword and Semantic Search Pipeline
- How to Build a Text Clustering Pipeline with Embeddings and HDBSCAN
- How to Build a Text Chunking and Splitting Pipeline for RAG
- How to Build a Text-to-Knowledge-Graph Pipeline with SpaCy and NetworkX
- How to Build a Text Entailment and Contradiction Detection Pipeline
- How to Build an Extractive Question Answering System with Transformers
- How to Build a Text Summarization Pipeline with Sumy and Transformers