Hate speech detection in production needs more than a keyword blocklist. You need a layered pipeline: fast keyword filters catch the obvious stuff, a transformer classifier handles the nuanced cases, and an LLM review step picks up edge cases that classifiers miss.

Here’s the fastest path to a working detector using the facebook/roberta-hate-speech-dynabench-r4-target model from Hugging Face:

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",
)

texts = [
    "I love spending time with my family.",
    "All members of that group should be eliminated.",
    "The weather is nice today.",
]

for text in texts:
    result = classifier(text)[0]
    print(f"[{result['label']}] score={result['score']:.4f} | {text[:60]}")

The model outputs hate or nothate labels with confidence scores. That gives you the building block for everything else.

Build the Multi-Layer Detection Pipeline

A single classifier isn’t enough. You want three layers: a fast keyword pre-filter, the transformer classifier, and an LLM fallback for ambiguous cases. Each layer catches what the previous one misses.

import re
from dataclasses import dataclass
from transformers import pipeline

@dataclass
class DetectionResult:
    flagged: bool
    action: str  # "allow", "block", "review"
    layer: str
    confidence: float
    reason: str

# Layer 1: Keyword filter (microseconds)
HATE_PATTERNS = [
    r"\b(?:kill|murder|eliminate)\s+(?:all|every)\s+\w+",
    r"\b(?:exterminate|eradicate)\b.*\b(?:race|group|people)\b",
    r"\bdeath\s+to\b",
]

def keyword_filter(text: str) -> DetectionResult | None:
    text_lower = text.lower()
    for pattern in HATE_PATTERNS:
        if re.search(pattern, text_lower):
            return DetectionResult(
                flagged=True,
                action="block",
                layer="keyword",
                confidence=0.99,
                reason=f"Matched pattern: {pattern}",
            )
    return None

# Layer 2: Transformer classifier (~50ms on GPU)
hate_classifier = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",
)

BLOCK_THRESHOLD = 0.85
REVIEW_THRESHOLD = 0.50

def classifier_filter(text: str) -> DetectionResult:
    result = hate_classifier(text)[0]
    label = result["label"]
    score = result["score"]

    if label == "hate" and score >= BLOCK_THRESHOLD:
        return DetectionResult(True, "block", "classifier", score, "High-confidence hate speech")
    elif label == "hate" and score >= REVIEW_THRESHOLD:
        return DetectionResult(True, "review", "classifier", score, "Possible hate speech, needs review")
    else:
        return DetectionResult(False, "allow", "classifier", score, "No hate speech detected")

# Full pipeline
def detect_hate_speech(text: str) -> DetectionResult:
    # Layer 1: keyword check
    keyword_result = keyword_filter(text)
    if keyword_result:
        return keyword_result

    # Layer 2: classifier
    return classifier_filter(text)

The thresholds matter. Setting BLOCK_THRESHOLD at 0.85 means you auto-block only when the model is very confident. Anything between 0.50 and 0.85 gets flagged for human review. Tune these based on your false positive tolerance – start conservative and loosen up as you collect labeled data.
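Layer 3, the LLM fallback for ambiguous cases, can be sketched as below. The `call_llm` argument is a placeholder for whatever provider SDK you use, and the one-word HATE/CLEAN reply format is an assumption for illustration, not a fixed API.

```python
# Layer 3 sketch: ask an LLM to adjudicate "review" cases the classifier
# is unsure about. call_llm is a placeholder for your provider's SDK;
# the one-word HATE/CLEAN protocol is an assumption for illustration.
def llm_review(text: str, call_llm) -> str:
    prompt = (
        "You are a content moderator. Reply with exactly one word, "
        "HATE or CLEAN, for this message:\n" + text
    )
    verdict = call_llm(prompt).strip().upper()
    return "block" if verdict == "HATE" else "allow"

# Stub LLM so the sketch runs end to end; swap in a real API call.
def stub_llm(prompt: str) -> str:
    return "HATE" if "eliminated" in prompt else "CLEAN"

print(llm_review("Those people should all be eliminated.", stub_llm))  # block
print(llm_review("I disagree with that policy.", stub_llm))            # allow
```

Call it from detect_hate_speech only when classifier_filter returns action == "review", and fall back to the classifier verdict if the LLM call fails, so an outage never fails open.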

Integrate as FastAPI Middleware

Wrap the detection pipeline as middleware so every request gets screened before hitting your LLM endpoint. Use FastAPI’s lifespan context manager to load the model once at startup instead of on every request.

from contextlib import asynccontextmanager
from dataclasses import dataclass
from datetime import datetime, timezone
import json
import logging

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import pipeline as hf_pipeline

# Same dataclass as in the detection pipeline section
@dataclass
class DetectionResult:
    flagged: bool
    action: str  # "allow", "block", "review"
    layer: str
    confidence: float
    reason: str

# Audit logger
audit_logger = logging.getLogger("hate_speech_audit")
audit_logger.setLevel(logging.INFO)
handler = logging.FileHandler("hate_speech_audit.jsonl")
audit_logger.addHandler(handler)

ml_models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load classifier at startup
    ml_models["hate_classifier"] = hf_pipeline(
        "text-classification",
        model="facebook/roberta-hate-speech-dynabench-r4-target",
    )
    yield
    ml_models.clear()

app = FastAPI(lifespan=lifespan)

class ChatRequest(BaseModel):
    message: str

class ChatResponse(BaseModel):
    reply: str
    flagged: bool
    action: str

def log_detection(request_text: str, result: DetectionResult):
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": result.action,
        "layer": result.layer,
        "confidence": result.confidence,
        "reason": result.reason,
        "input_preview": request_text[:200],
    }
    audit_logger.info(json.dumps(event))

@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest):
    classifier = ml_models["hate_classifier"]
    prediction = classifier(req.message)[0]

    label = prediction["label"]
    score = prediction["score"]

    if label == "hate" and score >= 0.85:
        log_detection(req.message, DetectionResult(True, "block", "classifier", score, "Auto-blocked"))
        raise HTTPException(status_code=400, detail="Message blocked by content policy.")

    if label == "hate" and score >= 0.50:
        log_detection(req.message, DetectionResult(True, "review", "classifier", score, "Flagged for review"))
        return ChatResponse(
            reply="Your message has been flagged for review.",
            flagged=True,
            action="review",
        )

    # Normal path: pass to your LLM here
    return ChatResponse(
        reply="This is where your LLM response goes.",
        flagged=False,
        action="allow",
    )

Run it with uvicorn main:app --reload. Every request goes through the classifier before your LLM processes it. Blocked messages get a 400 response. Flagged messages get a soft warning. Clean messages pass through.
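For a quick smoke test from Python, a stdlib-only client helper works; the URL assumes uvicorn's default host and port, so adjust it for your deployment:

```python
import json
import urllib.request

# Build a POST request for the /chat endpoint. The URL assumes the
# default uvicorn host/port.
def build_chat_request(message: str, url: str = "http://localhost:8000/chat"):
    return urllib.request.Request(
        url,
        data=json.dumps({"message": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the server running:
# with urllib.request.urlopen(build_chat_request("The weather is nice today.")) as resp:
#     print(json.load(resp))
```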

Handle Multilingual Content

The facebook/roberta-hate-speech-dynabench-r4-target model only works well on English text. For multilingual support, swap in an XLM-RoBERTa-based model that was trained on multiple languages.

from transformers import pipeline

# English-only classifier
en_classifier = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",
)

# Multilingual classifier (English, French, Spanish, Italian, Portuguese, Turkish, Russian)
multilingual_classifier = pipeline(
    "text-classification",
    model="unitary/multilingual-toxic-xlm-roberta",
)

def detect_multilingual(text: str, lang: str = "en") -> dict:
    if lang == "en":
        result = en_classifier(text)[0]
        is_hate = result["label"] == "hate"
    else:
        result = multilingual_classifier(text)[0]
        # This model outputs "toxic" / "not toxic" labels
        is_hate = result["label"] == "toxic"

    return {
        "text": text[:80],
        "label": result["label"],
        "score": round(result["score"], 4),
        "flagged": is_hate and result["score"] >= 0.50,
        "model": "multilingual" if lang != "en" else "english",
    }

# Test both paths
print(detect_multilingual("You are wonderful", lang="en"))
print(detect_multilingual("Je te déteste, espèce de...", lang="fr"))  # French: "I hate you, you..."

The unitary/multilingual-toxic-xlm-roberta model handles seven languages out of the box. It uses toxic / not toxic labels instead of hate / nothate, so adjust your label mapping accordingly. For languages outside its training set, consider a translate-then-classify approach using a translation API before the English classifier.
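The translate-then-classify fallback can be sketched as follows. Both `translate` and `classify_en` are injected so the sketch stands alone: in practice, `translate` would wrap your translation API and `classify_en` the English pipeline above. The stubs exist only so the example runs.

```python
# Translate-then-classify sketch for languages the multilingual model
# doesn't cover. translate and classify_en are injected dependencies:
# wire them to a real translation API and the English classifier above.
def detect_via_translation(text: str, translate, classify_en) -> dict:
    english = translate(text)
    result = classify_en(english)
    return {
        "original": text[:80],
        "translated": english[:80],
        "label": result["label"],
        "score": result["score"],
        "flagged": result["label"] == "hate" and result["score"] >= 0.50,
    }

# Stubs for demonstration only.
def stub_translate(text: str) -> str:
    return "translated: " + text

def stub_classify(text: str) -> dict:
    return {"label": "nothate", "score": 0.97}

print(detect_via_translation("God dag", stub_translate, stub_classify))
```

Note that translation can dilute slurs and coded language, so expect lower recall on this path than on natively supported languages.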

Common Errors and Fixes

OSError: facebook/roberta-hate-speech-dynabench-r4-target is not a local folder and is not a valid model identifier

You’re behind a firewall or proxy that blocks Hugging Face Hub. Either download the model first with huggingface-cli download facebook/roberta-hate-speech-dynabench-r4-target and point to the local path, or set HF_HUB_OFFLINE=1 after downloading.

torch.cuda.OutOfMemoryError: CUDA out of memory

The RoBERTa model is small (~355MB), but if you’re running multiple models, GPU memory adds up. Force CPU inference with:

classifier = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",
    device=-1,  # Force CPU
)

RuntimeError: The size of tensor a (512) must match the size of tensor b (514)

Your input text exceeds the model’s max token length (512 tokens). Truncate inputs before classification:

classifier = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",
    truncation=True,
    max_length=512,
)

High false positive rate on slang or quoted text

The classifier doesn’t understand context well – a news article quoting hate speech will get flagged. Use the two-threshold approach (block vs. review) and route ambiguous cases to human reviewers. You can also fine-tune the model on your domain-specific data to reduce false positives.
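Routing ambiguous cases to reviewers can start as simply as appending to a JSONL queue that your moderation tooling polls; the file name here is an assumption.

```python
import json
from datetime import datetime, timezone

# Append flagged messages to a JSONL review queue that a moderation UI
# or cron job can consume. The default path is an assumption.
def enqueue_for_review(text: str, score: float, path: str = "review_queue.jsonl") -> None:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "text": text[:500],
        "score": score,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```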

ImportError: cannot import name 'pipeline' from 'transformers'

Your transformers library is too old. Update with pip install --upgrade "transformers>=4.30.0" (quote the version spec so your shell doesn’t treat > as a redirect).