No single model catches everything. Sarcasm slips past keyword filters. Coded language defeats classifiers trained on obvious slurs. Multilingual abuse evades English-only models. The answer is stacking multiple detectors and combining their signals.
Here’s the fast path – a Hugging Face toxicity classifier in five lines:
```python
from transformers import pipeline

toxicity = pipeline("text-classification", model="martin-ha/toxic-comment-model")
result = toxicity("You're an absolute moron and nobody likes you")
print(result)
# [{'label': 'toxic', 'score': 0.9987}]
```
That gets you surprisingly far. But for production, you want redundancy. This guide walks through three detection approaches and then combines them into an ensemble that’s harder to fool.
## Hugging Face Toxicity Classifiers
The martin-ha/toxic-comment-model is a fine-tuned DistilBERT that classifies text as toxic or non-toxic. It’s fast, free, and runs locally – no API keys needed.
For more granular labels, use unitary/toxic-bert, which outputs scores across six toxicity categories: toxic, severe toxic, obscene, threat, insult, and identity hate.
```python
from transformers import pipeline

# Multi-label toxicity with category breakdown
multi_toxic = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    top_k=None,
)

text = "I will find you and make you regret everything"
scores = multi_toxic(text)
print(scores)
# [[{'label': 'toxic', 'score': 0.98}, {'label': 'threat', 'score': 0.91},
#   {'label': 'insult', 'score': 0.32}, {'label': 'obscene', 'score': 0.12},
#   {'label': 'severe_toxic', 'score': 0.08}, {'label': 'identity_hate', 'score': 0.03}]]

# Flag if any category exceeds its threshold
THRESHOLDS = {
    "toxic": 0.7,
    "severe_toxic": 0.5,
    "threat": 0.5,
    "insult": 0.7,
    "obscene": 0.7,
    "identity_hate": 0.5,
}

flagged_categories = []
for item in scores[0]:
    threshold = THRESHOLDS.get(item["label"], 0.7)
    if item["score"] >= threshold:
        flagged_categories.append((item["label"], item["score"]))

if flagged_categories:
    print(f"FLAGGED: {flagged_categories}")
else:
    print("CLEAN")
```
Lower thresholds for severe categories (threats, identity hate) and higher for general toxicity. You’ll tune these based on your false positive tolerance.
The trade-off: local models are free and fast but have a fixed vocabulary. They struggle with novel slang, code-switching, and adversarial misspellings like “k1ll” or “sh!t.” That’s where API-based detectors help.
## OpenAI Moderation Endpoint
OpenAI’s moderation endpoint is free, fast, and covers categories beyond toxicity: self-harm, sexual content, violence, and hate speech. It’s the quickest way to add broad content safety.
```python
from openai import OpenAI

client = OpenAI()

def check_openai_moderation(text: str) -> dict:
    """Check text against OpenAI's moderation categories."""
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    result = response.results[0]
    return {
        "flagged": result.flagged,
        "categories": {
            cat: flagged
            for cat, flagged in result.categories.model_dump().items()
            if flagged
        },
        "scores": {
            cat: round(score, 4)
            for cat, score in result.category_scores.model_dump().items()
            if score > 0.1
        },
    }

output = check_openai_moderation("I'm going to destroy you in this game")
print(output)
# {'flagged': False, 'categories': {}, 'scores': {'violence': 0.1243}}
```
The moderation endpoint returns both boolean flags and continuous scores. The boolean flags use OpenAI's default thresholds, but you should take the raw scores and apply your own. OpenAI's defaults are conservative – they'll miss borderline content that your community standards would prohibit.
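Applying your own thresholds is a small post-processing step over the `scores` dict that `check_openai_moderation` returns. A minimal sketch – the category names and threshold values here are illustrative, not recommendations:

```python
# Illustrative per-category thresholds -- tune these against your own data
CUSTOM_THRESHOLDS = {
    "harassment": 0.4,
    "hate": 0.3,
    "violence": 0.5,
    "self_harm": 0.2,
}

def apply_custom_thresholds(scores: dict[str, float]) -> list[str]:
    """Return the categories whose raw score crosses our own threshold."""
    return [
        cat
        for cat, threshold in CUSTOM_THRESHOLDS.items()
        if scores.get(cat, 0.0) >= threshold
    ]

# Raw scores, as returned in the "scores" field above
print(apply_custom_thresholds({"violence": 0.55, "hate": 0.1}))
# ['violence']
```

The dict comparison happens locally, so tightening or loosening a threshold costs nothing – no extra API calls, just a redeploy of the config.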
One limitation: the moderation endpoint doesn’t explain why something was flagged. You get a category and a score, not a rationale. For appeals or human review workflows, you’ll want to pair it with a model that gives explanations.
## Google Perspective API
The Perspective API from Jigsaw (a unit within Google) is purpose-built for comment moderation. It scores text across attributes like toxicity, severe toxicity, identity attack, insult, profanity, and threat. It also supports multiple languages out of the box – a major advantage over English-only models.
You’ll need a Perspective API key. Get one at perspectiveapi.com.
```python
import httpx
import os

PERSPECTIVE_API_KEY = os.environ["PERSPECTIVE_API_KEY"]
PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def check_perspective(text: str, languages: list[str] | None = None) -> dict:
    """Score text using Google's Perspective API."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {
            "TOXICITY": {},
            "SEVERE_TOXICITY": {},
            "IDENTITY_ATTACK": {},
            "INSULT": {},
            "PROFANITY": {},
            "THREAT": {},
        },
    }
    if languages:
        body["languages"] = languages

    response = httpx.post(
        PERSPECTIVE_URL,
        params={"key": PERSPECTIVE_API_KEY},
        json=body,
    )
    response.raise_for_status()
    data = response.json()

    scores = {}
    for attr, value in data["attributeScores"].items():
        scores[attr] = round(value["summaryScore"]["value"], 4)
    return scores

result = check_perspective("You people are the worst and should be banned")
print(result)
# {'TOXICITY': 0.9312, 'SEVERE_TOXICITY': 0.4521, 'IDENTITY_ATTACK': 0.6234,
#  'INSULT': 0.8877, 'PROFANITY': 0.2103, 'THREAT': 0.1045}
```
Perspective handles multilingual content better than most alternatives. Pass `languages=["es"]` or `languages=["fr"]` to hint the language, or leave it out and let the API auto-detect.
The rate limit is 1 QPS by default. You can request a higher quota, but for high-volume use cases, batch your requests or use local models as a first pass and only send borderline content to Perspective.
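The "local first pass" idea amounts to a gate on the local model's score: decide clearly clean and clearly toxic content locally, and spend Perspective quota only on the ambiguous middle band. A minimal sketch – the band boundaries are illustrative and the function name is mine, not part of any API:

```python
def should_send_to_perspective(local_score: float,
                               low: float = 0.3, high: float = 0.85) -> bool:
    """Gate on the local model's score to conserve Perspective quota.

    Scores below `low` are treated as clean and scores above `high` as toxic
    without a second opinion; only the ambiguous band hits the API.
    """
    return low <= local_score <= high

print(should_send_to_perspective(0.05))  # False -- clearly clean, decide locally
print(should_send_to_perspective(0.50))  # True  -- borderline, ask Perspective
print(should_send_to_perspective(0.95))  # False -- clearly toxic, decide locally
```

With a typical comment-score distribution, most traffic falls outside the band, which keeps a 1 QPS quota workable for surprisingly high volumes.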
## Building a Multi-Model Ensemble
Each detector has blind spots. Local classifiers miss creative misspellings. OpenAI moderation is conservative on edge cases. Perspective struggles with very short texts. Combining them fills the gaps.
The strategy: run all three, normalize scores to 0-1, and use a weighted average. Flag content if the ensemble score exceeds your threshold or if any single detector is highly confident.
```python
from dataclasses import dataclass

@dataclass
class ToxicityResult:
    is_toxic: bool
    ensemble_score: float
    details: dict[str, float]
    flagged_by: list[str]

def detect_toxicity(text: str, threshold: float = 0.6) -> ToxicityResult:
    """
    Run text through multiple toxicity detectors and combine signals.

    Returns an ensemble result with individual detector scores.
    """
    details = {}
    flagged_by = []

    # 1. Local Hugging Face model
    hf_result = multi_toxic(text)
    hf_toxic_score = next(
        item["score"] for item in hf_result[0] if item["label"] == "toxic"
    )
    details["huggingface"] = round(hf_toxic_score, 4)
    if hf_toxic_score > 0.7:
        flagged_by.append("huggingface")

    # 2. OpenAI moderation
    oai_result = check_openai_moderation(text)
    # Use the max category score as the overall signal
    oai_scores = oai_result["scores"]
    oai_max_score = max(oai_scores.values()) if oai_scores else 0.0
    details["openai"] = round(oai_max_score, 4)
    if oai_result["flagged"]:
        flagged_by.append("openai")

    # 3. Perspective API
    persp_result = check_perspective(text)
    persp_toxic_score = persp_result.get("TOXICITY", 0.0)
    details["perspective"] = round(persp_toxic_score, 4)
    if persp_toxic_score > 0.7:
        flagged_by.append("perspective")

    # Weighted ensemble -- Perspective gets higher weight for general toxicity
    weights = {"huggingface": 0.25, "openai": 0.35, "perspective": 0.40}
    ensemble_score = sum(details[k] * weights[k] for k in weights)

    # Flag if ensemble exceeds threshold OR any single detector is very confident
    high_confidence_flag = any(details[k] > 0.9 for k in details)
    is_toxic = ensemble_score >= threshold or high_confidence_flag

    return ToxicityResult(
        is_toxic=is_toxic,
        ensemble_score=round(ensemble_score, 4),
        details=details,
        flagged_by=flagged_by,
    )

# Test it
result = detect_toxicity("You're a terrible person and everyone hates you")
print(f"Toxic: {result.is_toxic}")
print(f"Ensemble score: {result.ensemble_score}")
print(f"Details: {result.details}")
print(f"Flagged by: {result.flagged_by}")
```
The weights matter. Perspective tends to be the most calibrated for general toxicity scoring, so it gets the highest weight. OpenAI’s moderation is strong on clearly harmful content. The local model provides a fast baseline that works offline.
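One practical wrinkle with a fixed weighted sum: if an API call fails or is skipped, the score silently deflates. A sketch of one way to handle that – renormalize over whichever detectors actually returned a score (the function and weight values are illustrative, not from the code above):

```python
def weighted_ensemble(details: dict[str, float],
                      weights: dict[str, float]) -> float:
    """Weighted average over only the detectors that returned a score.

    Dividing by the available weight keeps the result on the same 0-1 scale
    when one detector is down or was deliberately skipped.
    """
    available = {k: w for k, w in weights.items() if k in details}
    total = sum(available.values())
    if total == 0:
        return 0.0
    return sum(details[k] * available[k] for k in available) / total

WEIGHTS = {"huggingface": 0.25, "openai": 0.35, "perspective": 0.40}

# Perspective timed out: average over the remaining 0.65 of weight
print(round(weighted_ensemble({"huggingface": 0.8, "openai": 0.6}, WEIGHTS), 4))
```

Without the renormalization, a missing detector would read as a score of zero and drag genuinely toxic content under the threshold.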
## Handling Borderline Cases
Content scoring in the middle band – roughly 0.5 to 0.8 on the ensemble – is the gray zone. Don't auto-reject it. Instead, route it to human review:
```python
def triage_content(text: str) -> str:
    """Route content based on toxicity score."""
    result = detect_toxicity(text)
    if result.ensemble_score >= 0.8:
        return "BLOCK"
    elif result.ensemble_score >= 0.5:
        return "REVIEW"
    else:
        return "ALLOW"
```
For the review queue, include the per-model breakdown so moderators can see which detectors flagged it and why. That context speeds up their decisions significantly.
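A review-queue entry can be as simple as a JSON payload bundling the text with the per-detector scores. A sketch – the field names and helper are illustrative, not a fixed schema:

```python
import json

def build_review_item(text: str, ensemble_score: float,
                      details: dict[str, float],
                      flagged_by: list[str]) -> str:
    """Serialize a review-queue entry carrying the per-detector breakdown."""
    return json.dumps({
        "text": text,
        "ensemble_score": ensemble_score,
        "per_model_scores": details,   # what each detector scored, 0-1
        "flagged_by": flagged_by,      # which detectors crossed their threshold
    })

item = build_review_item(
    "borderline comment",
    0.55,
    {"huggingface": 0.62, "openai": 0.41, "perspective": 0.58},
    ["huggingface"],
)
```

A moderator seeing that only the local model flagged it, while both APIs scored it low, can often resolve the case in seconds.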
## Edge Cases That Break Simple Detectors
Sarcasm and irony. “Oh, what a brilliant contribution to humanity” reads as toxic to humans but often scores low on classifiers. There’s no great automated solution here – sarcasm detection is an open research problem. Your best bet is to flag content where sentiment is strongly negative but toxicity scores are moderate, then route those to human review.
Coded language and dog whistles. Terms like “1488,” “triple parentheses,” or rapidly evolving slang bypass keyword lists and classifiers trained on older data. Maintain a regularly updated blocklist and combine it with your ML models:
```python
CODED_TERMS = {
    "1488", "14/88", "(((", ")))",
    "day of the rope", "power level",
}

def check_coded_language(text: str) -> bool:
    """Check for known coded hate speech terms."""
    text_lower = text.lower()
    return any(term in text_lower for term in CODED_TERMS)
```
This list needs regular updating. Subscribe to hate speech monitoring organizations like the ADL or SPLC for new terms.
Multilingual abuse. Users switch languages mid-sentence or use non-English slurs. Perspective API handles multiple languages, but your local HF model probably doesn’t. For multilingual pipelines, either use a multilingual model like unitary/multilingual-toxic-xlm-roberta or translate to English first with a translation API, then classify.
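The translate-first pipeline is easiest to keep testable if the language detector, translator, and classifier are injected as callables rather than hard-wired. A sketch with stub implementations – the function and the stubs are illustrative, standing in for whatever detection/translation services you actually use:

```python
from typing import Callable

def classify_any_language(
    text: str,
    detect_lang: Callable[[str], str],
    translate_to_en: Callable[[str], str],
    classify_en: Callable[[str], float],
) -> float:
    """Translate non-English text to English before running the classifier."""
    if detect_lang(text) != "en":
        text = translate_to_en(text)
    return classify_en(text)

# Stubs for illustration -- real code would call detection/translation APIs
detect = lambda t: "es" if "eres" in t else "en"
translate = lambda t: "you are an idiot"  # pretend translation
classify = lambda t: 0.9 if "idiot" in t else 0.1

print(classify_any_language("eres un idiota", detect, translate, classify))  # 0.9
print(classify_any_language("have a nice day", detect, translate, classify))  # 0.1
```

Translation adds latency and can blunt slurs that don't translate cleanly, which is why a genuinely multilingual model is usually the better first choice when one exists for your languages.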
Adversarial misspellings. “K!ll y0urs3lf” bypasses most classifiers. Normalize text before classification:
```python
import re

def normalize_leet(text: str) -> str:
    """Convert common adversarial character substitutions."""
    replacements = {
        "0": "o", "1": "i", "3": "e", "4": "a",
        "5": "s", "7": "t", "@": "a", "$": "s",
        "!": "i", "+": "t",
    }
    normalized = text
    for char, replacement in replacements.items():
        normalized = normalized.replace(char, replacement)
    # Collapse runs of 3+ repeated characters down to two: "fuuuuck" -> "fuuck"
    normalized = re.sub(r"(.)\1{2,}", r"\1\1", normalized)
    return normalized

print(normalize_leet("y0u're s0 stup!d"))
# "you're so stupid"
```
Run both the original and normalized text through your detectors. Flag if either version triggers.
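The dual-pass check is a one-liner once the detector and normalizer are parameters. A sketch with stub implementations – `flag_either_version` and the stubs are illustrative, standing in for your real detector and `normalize_leet`:

```python
def flag_either_version(text: str, score_fn, normalize_fn,
                        threshold: float = 0.7) -> bool:
    """Flag if either the raw or the normalized text trips the detector."""
    return max(score_fn(text), score_fn(normalize_fn(text))) >= threshold

# Stubs for illustration: a detector that only knows the plain spelling,
# and a normalizer covering two substitutions
score = lambda t: 0.95 if "stupid" in t else 0.05
norm = lambda t: t.replace("!", "i").replace("0", "o")

print(flag_either_version("y0u're s0 stup!d", score, norm))  # True
print(flag_either_version("have a nice day", score, norm))   # False
```

Running two passes doubles inference cost for the local model, which is usually acceptable; for the rate-limited APIs, send only whichever version scored higher locally.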
## Common Errors and Fixes
ConnectionError or timeout from Perspective API. The default rate limit is 1 QPS. If you’re hitting it in bursts, add retry logic with exponential backoff:
```python
import time

import httpx

def check_perspective_with_retry(text: str, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        try:
            return check_perspective(text)
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429:
                wait = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
                time.sleep(wait)
            else:
                raise
    raise RuntimeError("Perspective API failed after retries")
```
Hugging Face model returns unexpected label format. Some toxicity models use LABEL_0/LABEL_1 instead of toxic/non-toxic. Check the model card and map labels explicitly:
```python
LABEL_MAP = {"LABEL_0": "non-toxic", "LABEL_1": "toxic"}

result = toxicity("some text")
label = LABEL_MAP.get(result[0]["label"], result[0]["label"])
```
OpenAI moderation returns low scores for obviously toxic content. This usually means the content is in a language the model handles poorly, or uses Unicode tricks. Pre-process text to strip zero-width characters and normalize Unicode before sending:
```python
import unicodedata

def clean_unicode(text: str) -> str:
    """Remove zero-width chars and normalize Unicode."""
    cleaned = "".join(
        c for c in text
        if unicodedata.category(c) != "Cf"  # Drop format chars (zero-width, etc.)
    )
    return unicodedata.normalize("NFKC", cleaned)
```
Memory issues loading multiple models. Loading toxic-bert alongside other models eats GPU memory. Force CPU inference with device=-1 – toxicity models are small enough that CPU inference stays fast (under 50ms for short texts):
```python
toxicity = pipeline(
    "text-classification",
    model="martin-ha/toxic-comment-model",
    device=-1,  # Force CPU
)
```