You want a REST endpoint that takes text and returns a sentiment label with a confidence score. No training, no labeling, no GPU cluster. Just a pre-trained model behind a FastAPI server that handles real traffic.

The cardiffnlp/twitter-roberta-base-sentiment-latest model is the best off-the-shelf option for general English sentiment. It is a RoBERTa-base model fine-tuned on ~124 million tweets, outputs three labels (negative, neutral, positive), and runs fast on CPU. We will wrap it in a FastAPI app with proper startup loading, input validation, batch support, and the preprocessing the model actually needs.

The Minimal Working API

Install dependencies first:

pip install fastapi uvicorn transformers torch pydantic

Here is the full API in a single file. Save this as main.py:

from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from transformers import pipeline

model = {}

def preprocess(text: str) -> str:
    """Normalize mentions and URLs the way the model expects."""
    tokens = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"
        elif token.startswith("http"):
            token = "http"
        tokens.append(token)
    return " ".join(tokens)

@asynccontextmanager
async def lifespan(app: FastAPI):
    model["pipe"] = pipeline(
        "sentiment-analysis",
        model="cardiffnlp/twitter-roberta-base-sentiment-latest",
        tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
        device=-1,  # CPU; use 0 for first GPU
    )
    # Warmup: run a throwaway prediction so the first real request is fast
    model["pipe"]("warmup")
    yield
    model.clear()

app = FastAPI(title="Sentiment Analysis API", lifespan=lifespan)

class SentimentRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=512)

class SentimentResponse(BaseModel):
    label: str
    score: float
    text: str

@app.post("/predict", response_model=SentimentResponse)
def predict(req: SentimentRequest):
    # Plain def on purpose: FastAPI runs sync endpoints in a threadpool,
    # so the CPU-bound model call does not block the event loop.
    cleaned = preprocess(req.text)
    result = model["pipe"](cleaned)[0]
    return SentimentResponse(
        label=result["label"],
        score=round(result["score"], 4),
        text=req.text,
    )

Start the server:

uvicorn main:app --host 0.0.0.0 --port 8000

Test it:

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "This product is absolutely terrible, worst purchase ever"}'

Response:

{"label": "negative", "score": 0.9312, "text": "This product is absolutely terrible, worst purchase ever"}

That is a working sentiment API. The model downloads on first startup (about 500MB), then stays in memory. Subsequent restarts use the cached model from ~/.cache/huggingface/.

Why Preprocessing Matters

The cardiffnlp model was trained on tweets where usernames are replaced with @user and URLs with http. If you skip this step, the model still produces output – but the scores drift. A tweet like @elonmusk check https://example.com great product! gives different confidence scores than @user check http great product!. The difference is typically 5-15% on the confidence score, which is enough to flip borderline predictions.

The preprocess function above handles this. Always run it before inference.
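A quick sanity check of that transformation, runnable without the model (the function body is copied from main.py):

```python
def preprocess(text: str) -> str:
    """Normalize mentions and URLs the way the model expects."""
    tokens = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"
        elif token.startswith("http"):
            token = "http"
        tokens.append(token)
    return " ".join(tokens)

print(preprocess("@elonmusk check https://example.com great product!"))
# → @user check http great product!
```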

Adding Batch Predictions

Single-text predictions are fine for interactive use, but if you need to score hundreds of reviews, sending them one by one wastes time. The Transformers pipeline natively supports batching – you just pass a list.

from pydantic import conlist

class BatchRequest(BaseModel):
    texts: conlist(str, min_length=1, max_length=64)

class BatchResponse(BaseModel):
    results: list[SentimentResponse]

@app.post("/predict/batch", response_model=BatchResponse)
def predict_batch(req: BatchRequest):
    # Sync def keeps the blocking batched inference off the event loop.
    cleaned = [preprocess(t) for t in req.texts]
    results = model["pipe"](cleaned, batch_size=16)
    return BatchResponse(
        results=[
            SentimentResponse(
                label=r["label"],
                score=round(r["score"], 4),
                text=original,
            )
            for r, original in zip(results, req.texts)
        ]
    )

The batch_size=16 parameter controls how many texts the model processes in a single forward pass. On CPU, 16 is a reasonable default. On GPU, you can push this to 32 or 64 depending on VRAM. The pipeline handles padding and truncation internally.

Test it:

curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{"texts": ["I love this!", "Worst experience ever.", "It was okay I guess."]}'
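Note that the batch endpoint caps a request at 64 texts (the conlist bound above). To score thousands of reviews, chunk the list client-side. A minimal sketch; the HTTP call is left as a comment because it assumes the server above is running on localhost:8000:

```python
def chunked(items, size=64):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Usage sketch against the batch endpoint (assumes the API above is running):
# import requests
# results = []
# for batch in chunked(all_reviews, 64):
#     resp = requests.post("http://localhost:8000/predict/batch", json={"texts": batch})
#     results.extend(resp.json()["results"])

print([len(c) for c in chunked(list(range(150)), 64)])
# → [64, 64, 22]
```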

Handling Long Texts

RoBERTa has a 512-token limit. Pass a 2000-word product review and, by default, the tokenizer does not truncate it; it emits a warning, and the model can fail with an index error on the oversized sequence. Set truncation=True when creating the pipeline so the model sees only the first 512 tokens (roughly the first ~400 words). For most sentiment tasks this is fine, since sentiment is usually expressed early:

model["pipe"] = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
    truncation=True,
    max_length=512,
)

This suppresses the warning you would otherwise see:

Token indices sequence length is longer than the specified maximum sequence length
for this model (723 > 512). Running this sequence through the model will result
in indexing errors.
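If truncation throws away too much signal for your documents, an alternative worth sketching is to split long text into overlapping word-level windows, score each window, and aggregate. This is not part of the API above, and word counts only approximate token counts, so treat it as a starting point:

```python
def word_windows(text: str, size: int = 350, overlap: int = 50):
    """Split text into overlapping word windows sized to stay under 512 tokens."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# Scoring sketch (assumes the pipeline from main.py; with top_k=None the
# pipeline returns all three label scores per input, which can be averaged):
# windows = word_windows(long_review)
# per_window = model["pipe"](windows, top_k=None)

print(len(word_windows(" ".join(["word"] * 1000))))
# → 4
```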

Common Errors and Fixes

OSError: Can't load tokenizer for 'cardiffnlp/twitter-roberta-base-sentiment-latest'

This usually means you are behind a corporate proxy or firewall blocking huggingface.co. Fix it by downloading the model locally first:

python -c "from transformers import pipeline; pipeline('sentiment-analysis', model='cardiffnlp/twitter-roberta-base-sentiment-latest')"

Or set the proxy:

export HTTPS_PROXY=http://your-proxy:8080

torch.cuda.OutOfMemoryError: CUDA out of memory

You loaded the model on GPU but the batch is too large. Reduce batch_size or switch to CPU with device=-1. For a base-sized model (125M parameters), CPU inference is fast enough for most APIs – about 50-100ms per prediction.

Predictions always return Neutral

You are probably passing empty strings or whitespace-only input after preprocessing. The Pydantic min_length=1 validator catches empty strings, but a text like @john http preprocesses to @user http which is effectively content-free. Add a length check after preprocessing:

cleaned = preprocess(req.text)
if len(cleaned.strip()) < 3:
    raise HTTPException(status_code=422, detail="Text too short after preprocessing")

Running with Multiple Workers

For production traffic, run multiple Uvicorn workers behind Gunicorn. Each worker loads its own copy of the model, so watch your memory – the model uses about 500MB per worker.

gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000

Four workers means ~2GB of RAM for the models alone. If memory is tight, Gunicorn's --preload flag imports the app once before forking, so anything created at import time is shared across workers via copy-on-write. One caveat: with the lifespan pattern above, the model loads inside each worker after the fork, so to actually benefit from --preload you must move the pipeline() call to module level.

gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker --preload --bind 0.0.0.0:8000

This works because the model weights are read-only after loading. The OS shares the physical memory pages between processes until a worker tries to write to them.

Switching Models

The entire API is model-agnostic. Swap the model string to use a different sentiment model without changing any other code:

Model                                               Labels                      Best For
cardiffnlp/twitter-roberta-base-sentiment-latest    negative/neutral/positive   Social media, general text
nlptown/bert-base-multilingual-uncased-sentiment    1-5 stars                   Product reviews, multilingual
distilbert-base-uncased-finetuned-sst-2-english     POSITIVE/NEGATIVE           Binary sentiment, fast inference

The pipeline API normalizes the output shape: every model returns {"label", "score"} dicts, so your FastAPI endpoints work with any of these without modification, although the label strings themselves differ per model. Just change the model name in the lifespan function and redeploy.
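If downstream code expects the three cardiffnlp labels, a small mapping layer keeps the API contract stable across models. A sketch; the star and SST-2 label strings below are taken from each model's published label set, so verify them against the model card before relying on this:

```python
STAR_MAP = {
    "1 star": "negative", "2 stars": "negative",
    "3 stars": "neutral",
    "4 stars": "positive", "5 stars": "positive",
}

def normalize_label(label: str) -> str:
    """Map model-specific label strings onto negative/neutral/positive."""
    low = label.lower()
    # cardiffnlp and distilbert SST-2 labels lowercase cleanly;
    # nlptown's star ratings go through the map above.
    return STAR_MAP.get(low, low)

print(normalize_label("POSITIVE"))  # → positive
print(normalize_label("2 stars"))   # → negative
```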