You have text in an unknown language and you need it in English (or any other target). The pipeline is two steps: figure out what language the text is in, then pick the right translation model and run it. No paid APIs, no rate limits, everything runs locally.
We will use lingua for language detection (it is fast, accurate, and works offline) and Helsinki-NLP MarianMT models from Hugging Face for translation. MarianMT has pre-trained models for hundreds of language pairs, and they are small enough to run on CPU.
Install Dependencies
```bash
pip install lingua-language-detector transformers sentencepiece torch fastapi uvicorn
```
sentencepiece is required by MarianMT tokenizers. Skip torch if you already have it installed with CUDA support.
Detect the Language
lingua ships pre-built detectors with configurable language sets. Loading all 75 languages takes about 1 second and uses ~200 MB of RAM. If you know your inputs are limited to a handful of languages, restrict the detector to those for faster and more accurate results.
```python
from lingua import Language, LanguageDetectorBuilder

# Full detector - supports 75 languages
detector = LanguageDetectorBuilder.from_all_languages().build()

# Or restrict to languages you actually expect
detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.SPANISH, Language.FRENCH,
    Language.GERMAN, Language.CHINESE, Language.ARABIC,
    Language.JAPANESE, Language.KOREAN, Language.RUSSIAN,
    Language.PORTUGUESE,
).build()

text = "La inteligencia artificial está transformando el mundo"
language = detector.detect_language_of(text)
print(language)  # Language.SPANISH
print(language.iso_code_639_1.name.lower())  # "es"
```
lingua also gives you confidence scores when you need them:
```python
confidences = detector.compute_language_confidence_values(text)
for confidence in confidences[:3]:
    print(f"{confidence.language.name}: {confidence.value:.4f}")
# SPANISH: 0.9487
# PORTUGUESE: 0.0312
# ITALIAN: 0.0089
```
This is useful for short texts where the detector is less certain. If the top confidence is below 0.5, you probably want to flag the input for manual review rather than guessing.
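The review-or-translate decision can be sketched as a tiny helper. The 0.5 cutoff and the `route_detection` name are illustrative choices, not anything lingua prescribes:

```python
def route_detection(language, score, threshold=0.5):
    """Return the detected language when the detector is confident,
    or None so the caller can queue the text for manual review."""
    return language if score >= threshold else None

print(route_detection("SPANISH", 0.9487))  # SPANISH
print(route_detection("SPANISH", 0.41))    # None
```

Tune the threshold against your own traffic; short, noisy inputs usually need a higher bar than full sentences.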
Load the Right Translation Model
Helsinki-NLP models follow a naming convention: Helsinki-NLP/opus-mt-{src}-{tgt}. So Spanish to English is Helsinki-NLP/opus-mt-es-en, French to German is Helsinki-NLP/opus-mt-fr-de.
```python
from transformers import MarianMTModel, MarianTokenizer

def get_translator(src_lang: str, tgt_lang: str):
    """Load a MarianMT model for a given language pair."""
    model_name = f"Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    return tokenizer, model

def translate(text: str, tokenizer, model) -> str:
    """Translate a single string."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    translated = model.generate(**inputs)
    return tokenizer.decode(translated[0], skip_special_tokens=True)

tokenizer, model = get_translator("es", "en")
result = translate("La inteligencia artificial está transformando el mundo", tokenizer, model)
print(result)  # "Artificial intelligence is transforming the world"
```
The first call downloads the model (~300 MB per pair). After that, loading from cache is fast.
Handle Unsupported Pairs with Pivot Translation
Not every language pair has a direct model; coverage between two non-English languages is especially patchy. The workaround is pivot translation: translate to English first, then from English to your target.
```python
from huggingface_hub import model_info
from huggingface_hub.utils import RepositoryNotFoundError

# Cache loaded models to avoid reloading
_model_cache = {}

def has_direct_model(src: str, tgt: str) -> bool:
    """Check if a direct translation model exists on Hugging Face."""
    try:
        model_info(f"Helsinki-NLP/opus-mt-{src}-{tgt}")
        return True
    except RepositoryNotFoundError:
        return False

def get_cached_translator(src: str, tgt: str):
    """Load and cache a translator for a language pair."""
    key = f"{src}-{tgt}"
    if key not in _model_cache:
        _model_cache[key] = get_translator(src, tgt)
    return _model_cache[key]

def translate_with_pivot(text: str, src_lang: str, tgt_lang: str) -> str:
    """Translate text, using English as a pivot if no direct model exists."""
    if src_lang == tgt_lang:
        return text
    if has_direct_model(src_lang, tgt_lang):
        tokenizer, model = get_cached_translator(src_lang, tgt_lang)
        return translate(text, tokenizer, model)
    # Pivot through English
    tokenizer_to_en, model_to_en = get_cached_translator(src_lang, "en")
    english_text = translate(text, tokenizer_to_en, model_to_en)
    if tgt_lang == "en":
        return english_text
    tokenizer_from_en, model_from_en = get_cached_translator("en", tgt_lang)
    return translate(english_text, tokenizer_from_en, model_from_en)
```
Pivot translation adds latency and can lose nuance, but it works for the vast majority of practical cases. English is the best pivot because Helsinki-NLP has models from almost every language to English and vice versa.
Batch Translation
MarianMT handles batches natively. If you have a list of texts in the same language, batch them for a significant speedup over translating one at a time.
```python
def translate_batch(texts: list[str], tokenizer, model) -> list[str]:
    """Translate a batch of texts in a single forward pass."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    translated = model.generate(**inputs)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

# Translate 4 Spanish texts in one shot
spanish_texts = [
    "Buenos dias",
    "Como estas?",
    "El gato esta en la mesa",
    "Necesito ayuda con mi codigo",
]
tokenizer, model = get_cached_translator("es", "en")
results = translate_batch(spanish_texts, tokenizer, model)
for src, tgt in zip(spanish_texts, results):
    print(f"{src} -> {tgt}")
```
For mixed-language inputs, group texts by detected language first, then batch-translate each group. This avoids loading and unloading models repeatedly.
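The grouping step can be sketched with a plain dictionary. Here `detect` stands in for any callable that maps a text to an ISO code (in practice, a wrapper around the lingua detector); the toy detector below exists only to make the example runnable:

```python
from collections import defaultdict

def group_by_language(texts, detect):
    """Bucket texts by detected language so each bucket can be
    batch-translated with a single model."""
    groups = defaultdict(list)
    for text in texts:
        groups[detect(text)].append(text)
    return dict(groups)

# Toy detector for illustration only - use lingua in practice
fake_detect = lambda t: "es" if "hola" in t else "en"
print(group_by_language(["hola mundo", "hello", "hola amigo"], fake_detect))
# {'es': ['hola mundo', 'hola amigo'], 'en': ['hello']}
```

Each bucket then goes through `translate_batch` with the model for that source language, so every model is loaded at most once per request.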
The Full Pipeline as a FastAPI Service
Here is a complete endpoint that accepts text, detects its language, and translates it to a requested target language.
```python
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from lingua import Language, LanguageDetectorBuilder
from transformers import MarianMTModel, MarianTokenizer

state = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    state["detector"] = LanguageDetectorBuilder.from_all_languages().build()
    state["models"] = {}
    yield
    state.clear()

app = FastAPI(title="Translation Pipeline", lifespan=lifespan)

LANG_TO_ISO = {lang: lang.iso_code_639_1.name.lower() for lang in Language}

def load_model(src: str, tgt: str):
    key = f"{src}-{tgt}"
    if key not in state["models"]:
        name = f"Helsinki-NLP/opus-mt-{src}-{tgt}"
        tokenizer = MarianTokenizer.from_pretrained(name)
        model = MarianMTModel.from_pretrained(name)
        state["models"][key] = (tokenizer, model)
    return state["models"][key]

def do_translate(text: str, src: str, tgt: str) -> str:
    try:
        tokenizer, model = load_model(src, tgt)
    except Exception:
        # No direct model - pivot through English
        tok_en, mod_en = load_model(src, "en")
        inputs = tok_en(text, return_tensors="pt", padding=True, truncation=True)
        en_text = tok_en.decode(mod_en.generate(**inputs)[0], skip_special_tokens=True)
        if tgt == "en":
            return en_text
        tok_tgt, mod_tgt = load_model("en", tgt)
        inputs = tok_tgt(en_text, return_tensors="pt", padding=True, truncation=True)
        return tok_tgt.decode(mod_tgt.generate(**inputs)[0], skip_special_tokens=True)
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    translated = model.generate(**inputs)
    return tokenizer.decode(translated[0], skip_special_tokens=True)

class TranslateRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000)
    target_lang: str = Field(default="en", pattern="^[a-z]{2}$")

class TranslateResponse(BaseModel):
    detected_language: str
    confidence: float
    source_lang: str
    target_lang: str
    translated_text: str

@app.post("/translate", response_model=TranslateResponse)
def translate_endpoint(req: TranslateRequest):
    confidences = state["detector"].compute_language_confidence_values(req.text)
    if not confidences or confidences[0].value < 0.25:
        raise HTTPException(400, "Could not detect language with sufficient confidence")
    detected_lang = confidences[0].language
    src_iso = LANG_TO_ISO.get(detected_lang)
    if src_iso is None:
        raise HTTPException(400, f"Unsupported language: {detected_lang.name}")
    if src_iso == req.target_lang:
        translated = req.text  # nothing to do
    else:
        translated = do_translate(req.text, src_iso, req.target_lang)
    return TranslateResponse(
        detected_language=detected_lang.name,
        confidence=round(confidences[0].value, 4),
        source_lang=src_iso,
        target_lang=req.target_lang,
        translated_text=translated,
    )
```
Run it with:
```bash
uvicorn main:app --host 0.0.0.0 --port 8000
```
Test it:
```bash
curl -X POST http://localhost:8000/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "La inteligencia artificial esta cambiando todo", "target_lang": "en"}'
```
Common Errors and Fixes
OSError: Helsinki-NLP/opus-mt-xx-yy is not a local folder and is not a valid model identifier
The model for that language pair does not exist. This is the most common issue. Check the Helsinki-NLP model list for available pairs. The pivot approach described above handles this automatically.
lingua returns None for short texts
Detection on texts shorter than 10 characters is unreliable. lingua may return None if it cannot determine the language. Always check for None before proceeding and consider requiring a minimum text length in your API.
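A guard along these lines keeps the None case out of the translation path. The `MIN_DETECT_LENGTH` constant and `detect_or_none` name are illustrative, not part of lingua:

```python
MIN_DETECT_LENGTH = 10  # below this, detection is unreliable

def detect_or_none(detector, text):
    """Return an ISO 639-1 code, or None when detection cannot be trusted."""
    if len(text.strip()) < MIN_DETECT_LENGTH:
        return None  # too short to bother the detector
    language = detector.detect_language_of(text)
    if language is None:  # lingua could not decide
        return None
    return language.iso_code_639_1.name.lower()
```

Callers then treat None as "reject or ask for more text" rather than feeding a guess into the translator.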
RuntimeError: The size of tensor a must match the size of tensor b
This happens when you pass texts of wildly different lengths in a batch without proper padding. Make sure you pass padding=True and truncation=True to the tokenizer. If inputs exceed the model’s max length (512 tokens for MarianMT), they will be silently truncated.
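If silent truncation is a concern, you can check the token count before translating. This sketch assumes only the standard Hugging Face tokenizer call signature; `exceeds_model_limit` is a hypothetical helper, and the stub in the example stands in for a real tokenizer:

```python
MAX_TOKENS = 512  # MarianMT's maximum input length

def exceeds_model_limit(tokenizer, text, limit=MAX_TOKENS):
    """Return True when the tokenized text would be truncated.
    Works with any HF tokenizer: calling it returns a dict with input_ids."""
    return len(tokenizer(text)["input_ids"]) > limit

# Stub tokenizer (one token per word) just to show the check
stub = lambda t: {"input_ids": [0] * len(t.split())}
print(exceeds_model_limit(stub, "one two three", limit=2))  # True
```

For long documents, split on sentence or paragraph boundaries and translate the pieces rather than letting the model drop the tail.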
Model downloads are slow or fail
MarianMT models are ~300 MB each. If you are deploying to a server, pre-download models during your Docker build step. Use MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-es-en") in a build script so the cache is baked into the image.
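A pre-download script for the build step could look like this. The `PAIRS` list and the `model_names`/`predownload` helpers are illustrative; list the pairs your service actually serves:

```python
# predownload.py - run during the Docker image build so the
# Hugging Face cache is baked into the image.
PAIRS = [("es", "en"), ("en", "es"), ("fr", "en"), ("en", "fr")]

def model_names(pairs):
    """Expand (src, tgt) pairs into Helsinki-NLP model identifiers."""
    return [f"Helsinki-NLP/opus-mt-{src}-{tgt}" for src, tgt in pairs]

def predownload(pairs):
    """Fetch each tokenizer and model once so later loads hit the cache."""
    from transformers import MarianMTModel, MarianTokenizer
    for name in model_names(pairs):
        MarianTokenizer.from_pretrained(name)
        MarianMTModel.from_pretrained(name)
        print("cached", name)

# Call predownload(PAIRS) from a RUN step in your Dockerfile
```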
Memory usage grows with many language pairs
Each loaded model takes ~300 MB of RAM. If you support 20 language pairs, that is 6 GB. Implement an LRU cache that evicts least-recently-used models, or limit the number of models loaded at once. For the FastAPI service above, swap state["models"] with functools.lru_cache or a bounded dict.
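A minimal bounded-LRU version of that idea, built on `collections.OrderedDict` (the `ModelCache` name and the `max_models=4` default are illustrative; `loader` is any callable like the `get_translator` function above, faked here to keep the example self-contained):

```python
from collections import OrderedDict

class ModelCache:
    """Bounded LRU cache for (tokenizer, model) pairs."""

    def __init__(self, loader, max_models=4):
        self.loader = loader          # e.g. get_translator
        self.max_models = max_models  # cap on simultaneously loaded models
        self._cache = OrderedDict()

    def get(self, src, tgt):
        key = (src, tgt)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as recently used
        else:
            self._cache[key] = self.loader(src, tgt)
            if len(self._cache) > self.max_models:
                self._cache.popitem(last=False)  # evict least recently used
        return self._cache[key]

# Fake loader so the example runs without downloading anything
cache = ModelCache(loader=lambda s, t: (s, t), max_models=2)
cache.get("es", "en")
cache.get("fr", "en")
cache.get("de", "en")  # evicts ("es", "en")
```

Evicted models are garbage-collected by Python, so memory stays bounded at roughly `max_models` × 300 MB.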