Most sentiment models only handle English. The moment you hit Spanish reviews, Japanese tweets, or Arabic comments, they fall apart. XLM-RoBERTa fixes this. A single model, trained across 100+ languages, gives you sentiment labels without needing separate models per language.
The cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual model is fine-tuned on multilingual tweet data and outputs three labels: positive, negative, and neutral. It works out of the box with the Hugging Face Transformers library. You don’t need to detect the language first or do any translation preprocessing.
## Quick Start with the Pipeline API
Install the dependencies:
```bash
pip install transformers torch
```
The fastest way to get predictions is the pipeline API. Three lines and you have multilingual sentiment:
```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual",
    device=-1,  # CPU; use 0 for GPU
)

texts = [
    "I love this product!",            # English
    "Este producto es terrible",       # Spanish
    "この映画は最高だった",               # Japanese
    "Ce restaurant est vraiment nul",  # French
    "Dieses Buch ist fantastisch",     # German
]

results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{text[:40]:40s} -> {result['label']:8s} ({result['score']:.3f})")
```
Output looks like this:
```text
I love this product!                     -> positive (0.874)
Este producto es terrible                -> negative (0.891)
この映画は最高だった                        -> positive (0.762)
Ce restaurant est vraiment nul           -> negative (0.843)
Dieses Buch ist fantastisch              -> positive (0.809)
```
The model handles all of these with zero configuration changes, and the label set is the same three labels (negative, neutral, positive) regardless of input language.
## Manual Tokenizer and Model Usage
The pipeline API is convenient, but sometimes you need more control. Maybe you want raw logits, custom postprocessing, or to run the model inside a larger system. Here is how to load the tokenizer and model directly:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Order matches the model's id2label mapping: 0=negative, 1=neutral, 2=positive
LABELS = ["negative", "neutral", "positive"]

def predict_sentiment(text: str) -> dict:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze()
    top_idx = probs.argmax().item()
    return {
        "label": LABELS[top_idx],
        "score": probs[top_idx].item(),
        "all_scores": {LABELS[i]: probs[i].item() for i in range(len(LABELS))},
    }

result = predict_sentiment("Мне очень нравится эта книга")  # Russian: "I really like this book"
print(result)
# {'label': 'positive', 'score': 0.812, 'all_scores': {'negative': 0.043, 'neutral': 0.145, 'positive': 0.812}}
```
The `all_scores` dict gives you the full probability distribution. This is useful when you need to flag borderline cases where the top label has low confidence, say below 0.5.
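As a sketch of that idea (the function and the 0.5 cutoff are illustrative choices, operating on the result shape returned by `predict_sentiment` above):

```python
def flag_borderline(result: dict, threshold: float = 0.5) -> dict:
    """Return a copy of a prediction dict with a 'borderline' flag set
    when the top label's score falls below the threshold."""
    flagged = dict(result)
    flagged["borderline"] = flagged["score"] < threshold
    return flagged

confident = flag_borderline({"label": "positive", "score": 0.812,
                             "all_scores": {"negative": 0.043, "neutral": 0.145, "positive": 0.812}})
shaky = flag_borderline({"label": "neutral", "score": 0.41,
                         "all_scores": {"negative": 0.31, "neutral": 0.41, "positive": 0.28}})
print(confident["borderline"], shaky["borderline"])  # False True
```

Borderline items can then be routed to a human reviewer or simply dropped, depending on how costly a wrong label is in your application.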
## Batch Processing with DataLoader
When you have thousands or millions of texts, feeding them one at a time is slow. Use a DataLoader to batch inputs and push them through the model efficiently:
```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual"
LABELS = ["negative", "neutral", "positive"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

class TextDataset(Dataset):
    def __init__(self, texts: list[str]):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]

def collate_fn(batch: list[str]):
    return tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt")

# Sample data -- replace with your actual dataset
texts = [
    "Great service, highly recommend!",
    "El peor restaurante de la ciudad",
    "普通の味でした",
    "Produit de mauvaise qualité",
    "Toller Kundenservice",
] * 200  # 1000 texts for demonstration

dataset = TextDataset(texts)
loader = DataLoader(dataset, batch_size=32, collate_fn=collate_fn)

all_results = []
with torch.no_grad():
    for batch in loader:
        logits = model(**batch).logits
        probs = torch.softmax(logits, dim=-1)
        top_indices = probs.argmax(dim=-1)
        top_scores = probs.max(dim=-1).values
        for idx, score in zip(top_indices, top_scores):
            all_results.append({
                "label": LABELS[idx.item()],
                "score": score.item(),
            })

print(f"Processed {len(all_results)} texts")
print(f"Positive: {sum(1 for r in all_results if r['label'] == 'positive')}")
print(f"Negative: {sum(1 for r in all_results if r['label'] == 'negative')}")
print(f"Neutral:  {sum(1 for r in all_results if r['label'] == 'neutral')}")
```
A batch size of 32 works well on CPU. If you have a GPU, bump it to 64 or 128. Each batch is padded to its longest text, so the model wastes compute on padding tokens when lengths vary widely; keeping texts of roughly the same length within a batch improves throughput.
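One simple way to group similar-length texts (a sketch independent of the model code above; the helper name is mine) is to sort by length before batching, keeping the original indices so results can be restored to input order afterward:

```python
def length_sorted_batches(texts: list[str], batch_size: int):
    """Yield (original_indices, batch_texts) pairs with similar-length
    texts grouped together, so each batch pads to a similar width."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    for start in range(0, len(order), batch_size):
        idxs = order[start:start + batch_size]
        yield idxs, [texts[i] for i in idxs]

texts = ["hi", "a much longer review about the product", "ok", "medium length text here"]
batches = list(length_sorted_batches(texts, batch_size=2))
print(batches[0][1])  # ['hi', 'ok'] -- the two shortest texts share a batch
```

After scoring each batch, write results back into a list indexed by the yielded original positions so the output lines up with the input.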
## Adding Language Detection
Sometimes you want to know what language a text is before you score sentiment. Maybe you want to filter by language or route different languages to specialized models. The langdetect library (`pip install langdetect`) handles this:
```python
from langdetect import detect, DetectorFactory
from transformers import pipeline

# Make language detection deterministic
DetectorFactory.seed = 0

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual",
    device=-1,
)

def analyze_with_language(text: str) -> dict:
    try:
        lang = detect(text)
    except Exception:
        lang = "unknown"
    result = classifier(text)[0]
    return {
        "text": text,
        "language": lang,
        "sentiment": result["label"],
        "confidence": round(result["score"], 4),
    }

samples = [
    "The weather today is absolutely beautiful",
    "La comida estaba fría y sin sabor",
    "素晴らしいサービスでした",
    "Отличный продукт, рекомендую",
    "المنتج سيء جداً",
]

for text in samples:
    result = analyze_with_language(text)
    print(f"[{result['language']}] {result['sentiment']:8s} ({result['confidence']}) {text[:50]}")
```
Output:
```text
[en] positive (0.8731) The weather today is absolutely beautiful
[es] negative (0.8654) La comida estaba fría y sin sabor
[ja] positive (0.7814) 素晴らしいサービスでした
[ru] positive (0.7923) Отличный продукт, рекомендую
[ar] negative (0.8102) المنتج سيء جداً
```
Note: langdetect struggles with very short texts (under 10 characters). For short inputs, skip the detection step or use a fallback.
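A minimal fallback wrapper might look like this (a sketch; the 10-character cutoff mirrors the rule of thumb above, and the detector is passed in as a callable so you can plug in `langdetect.detect` or any other backend):

```python
def detect_with_fallback(text: str, detector, min_chars: int = 10,
                         fallback: str = "unknown") -> str:
    """Skip detection for very short texts and never let a detector
    error propagate; return the fallback code instead."""
    if len(text.strip()) < min_chars:
        return fallback
    try:
        return detector(text)
    except Exception:
        return fallback

# Stub detector for illustration; in practice pass langdetect.detect
print(detect_with_fallback("hi", lambda t: "en"))                          # unknown
print(detect_with_fallback("a reasonably long sentence", lambda t: "en"))  # en
```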
## FastAPI Endpoint
Here is a small API that wraps the pipeline behind a `/predict` endpoint. It uses FastAPI's lifespan context manager so the model loads once at startup:
```python
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from transformers import pipeline

models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    models["classifier"] = pipeline(
        "sentiment-analysis",
        model="cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual",
        device=-1,
        truncation=True,  # inputs up to 5000 chars can exceed the 512-token limit
    )
    # Warmup run so the first real request is not slow
    models["classifier"]("warmup")
    yield
    models.clear()

app = FastAPI(title="Multilingual Sentiment API", lifespan=lifespan)

class SentimentRequest(BaseModel):
    texts: list[str] = Field(..., min_length=1, max_length=64, description="List of texts to analyze")

class SentimentResult(BaseModel):
    text: str
    label: str
    score: float

class SentimentResponse(BaseModel):
    results: list[SentimentResult]

@app.post("/predict", response_model=SentimentResponse)
async def predict(request: SentimentRequest):
    for text in request.texts:
        if len(text.strip()) == 0:
            raise HTTPException(status_code=422, detail="Empty text is not allowed")
        if len(text) > 5000:
            raise HTTPException(status_code=422, detail="Text exceeds 5000 character limit")
    predictions = models["classifier"](request.texts)
    results = [
        SentimentResult(text=text, label=pred["label"], score=round(pred["score"], 4))
        for text, pred in zip(request.texts, predictions)
    ]
    return SentimentResponse(results=results)
```
Save this as app.py and run it:
```bash
pip install fastapi uvicorn transformers torch
uvicorn app:app --host 0.0.0.0 --port 8000
```
Test it with curl:
```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"texts": ["I love this!", "Das ist schrecklich", "まあまあです"]}'
```
The response contains the sentiment label and confidence for each text:
```json
{
  "results": [
    {"text": "I love this!", "label": "positive", "score": 0.8912},
    {"text": "Das ist schrecklich", "label": "negative", "score": 0.8345},
    {"text": "まあまあです", "label": "neutral", "score": 0.6721}
  ]
}
```
## Common Errors and Fixes
**OSError: Can't load tokenizer for 'cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual'**
You are probably behind a firewall or the model is not cached. Download it explicitly first:
```bash
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual')"
```
Or point the cache at a writable directory via the `HF_HOME` environment variable (`TRANSFORMERS_CACHE` in older transformers versions, where it is now deprecated).
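For example (the path here is a placeholder; any writable directory works):

```bash
export HF_HOME=/data/hf-cache   # older transformers versions read TRANSFORMERS_CACHE instead
```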
**RuntimeError: The size of tensor a (514) must match the size of tensor b (512)**
Your input text is too long. The model has a 512-token limit. Always set truncation=True in the tokenizer or pipeline call:
```python
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual",
    truncation=True,
    max_length=512,
)
```
**ValueError: too many values to unpack when using pipeline on a list**
This happens when you pass a list but expect a single dict back. The pipeline returns a list of dicts when the input is a list. Handle it like this:
```python
results = classifier(["text1", "text2"])  # Returns [{"label": ..., "score": ...}, ...]
# NOT: result = classifier(["text1", "text2"])["label"]  # Fails: the return value is a list
```
**Low confidence scores on short texts (under 5 words)**
XLM-RoBERTa needs some context to make good predictions. Single-word or very short inputs often get neutral labels with low confidence. If your data is mostly short texts, consider padding with context or filtering out texts shorter than 3 words.
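A quick filter along those lines (a sketch; the 3-word threshold is the heuristic from above, and whitespace word counts do not apply to unsegmented languages like Japanese):

```python
def filter_short_texts(texts: list[str], min_words: int = 3) -> list[str]:
    """Drop texts too short for the model to score reliably.
    Note: whitespace splitting undercounts for unsegmented scripts."""
    return [t for t in texts if len(t.split()) >= min_words]

texts = ["ok", "not great", "I really enjoyed this film", "absolutely wonderful experience overall"]
print(filter_short_texts(texts))
# ['I really enjoyed this film', 'absolutely wonderful experience overall']
```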
**torch.cuda.OutOfMemoryError during batch processing**
Reduce your batch size. Start with 8 and increase until you hit the memory limit. You can also use torch.cuda.empty_cache() between batches, though this is usually a sign you need a smaller batch size or float16 inference:
```python
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")
```