Manual data labeling is a bottleneck. You need thousands of labeled examples to train a classifier, but labeling each one by hand takes time you don’t have. Snorkel solves this with weak supervision: you write labeling functions (LFs) that encode domain knowledge, and Snorkel combines their noisy outputs into probabilistic training labels.
Here’s the core workflow: write multiple imperfect labeling functions, use a label model to denoise and combine them, then train your downstream classifier on the generated labels. You’ll hit acceptable accuracy in hours, not weeks.
Install Snorkel and Dependencies
```shell
pip install snorkel pandas scikit-learn transformers torch
```
Snorkel works with pandas DataFrames, so you’ll integrate it into existing pipelines without rewriting data loading code.
Write Labeling Functions for Text Classification
Labeling functions return -1 (abstain), 0 (negative class), or 1 (positive class). Each LF captures a different heuristic or pattern. The label model learns to weight and combine them.
```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis
from snorkel.labeling.model import LabelModel

# Sample dataset: spam detection
data = pd.DataFrame({
    'text': [
        "Buy cheap meds now!",
        "Meeting at 3pm tomorrow",
        "CLICK HERE FOR FREE MONEY",
        "Can you review my pull request?",
        "You've won a prize! Claim it now",
        "The quarterly report is ready",
        "Lose weight fast with this trick",
        "Standup notes from today's sprint"
    ]
})

# Define labeling functions
SPAM = 1
NOT_SPAM = 0
ABSTAIN = -1

@labeling_function()
def lf_contains_free(x):
    return SPAM if "free" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_all_caps_words(x):
    words = x.text.split()
    caps_count = sum(1 for w in words if w.isupper() and len(w) > 3)
    return SPAM if caps_count >= 2 else ABSTAIN

@labeling_function()
def lf_exclamation_marks(x):
    return SPAM if x.text.count('!') >= 2 else ABSTAIN

@labeling_function()
def lf_business_keywords(x):
    keywords = ['meeting', 'report', 'review', 'standup', 'sprint']
    return NOT_SPAM if any(k in x.text.lower() for k in keywords) else ABSTAIN

@labeling_function()
def lf_short_professional(x):
    # Professional emails tend to be more concise and structured
    if len(x.text.split()) < 10 and not any(c in x.text for c in ['!', 'FREE', 'CLICK']):
        return NOT_SPAM
    return ABSTAIN

# Apply labeling functions to the dataset
lfs = [lf_contains_free, lf_all_caps_words, lf_exclamation_marks,
       lf_business_keywords, lf_short_professional]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=data)

# Analyze labeling function performance
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())
```
The output shows coverage (what % of examples each LF labels), overlap (how often LFs agree), and conflicts (disagreements). Good LF sets have high coverage and some disagreement—the label model learns from diverse signals.
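The same statistics can be computed by hand from the label matrix, which helps when debugging a single LF. A minimal sketch using NumPy (the toy matrix below is illustrative, with rows as examples and columns as LFs):

```python
import numpy as np

# Toy label matrix: rows = examples, columns = labeling functions.
# -1 = abstain, 0 = NOT_SPAM, 1 = SPAM.
L = np.array([
    [ 1, -1,  1],
    [-1,  0, -1],
    [ 1,  1, -1],
    [-1, -1, -1],
])

# Coverage: fraction of examples on which each LF does not abstain.
coverage = (L != -1).mean(axis=0)

# Overlap: fraction of examples where an LF fires alongside at least one other LF.
fired = L != -1
overlaps = (fired & (fired.sum(axis=1, keepdims=True) > 1)).mean(axis=0)

print("coverage:", coverage)  # [0.5  0.5  0.25]
print("overlaps:", overlaps)  # [0.5  0.25 0.25]
```

Reading these per-LF numbers side by side quickly shows which LFs rarely fire and which only ever fire alone.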
Train a Label Model to Denoise Weak Labels
The label model is a generative model that learns the accuracies and correlations of your labeling functions without seeing ground truth labels. It outputs probabilistic labels for each example.
```python
# Train the label model
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, lr=0.001, log_freq=100, seed=42)

# Get probabilistic training labels
probs_train = label_model.predict_proba(L=L_train)

# Get hard labels for the downstream classifier (picks the higher-probability class)
labels_train = label_model.predict(L=L_train)

# Show results
results = pd.DataFrame({
    'text': data['text'],
    'spam_prob': probs_train[:, 1],
    'predicted_label': labels_train
})
print(results)
```
The label model outputs probabilities for each class. Use predict() for hard labels or predict_proba() if your downstream model supports weighted training.
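One common way to use the soft labels without a model that consumes probabilities directly is to weight each training example by the label model's confidence. A sketch, assuming the probs_train array from the snippet above (the toy values here are illustrative):

```python
import numpy as np

# Toy probabilistic labels: columns = [P(not spam), P(spam)].
probs_train = np.array([
    [0.05, 0.95],
    [0.90, 0.10],
    [0.55, 0.45],  # low-confidence example
])

# Hard labels: pick the higher-probability class.
hard_labels = probs_train.argmax(axis=1)

# Confidence weights: how sure the label model is about each example.
weights = probs_train.max(axis=1)

# Most scikit-learn classifiers accept these via sample_weight, e.g.:
#   classifier.fit(X_train_vec, hard_labels, sample_weight=weights)
print(hard_labels)  # [1 0 0]
print(weights)      # [0.95 0.9  0.55]
```

This keeps low-confidence examples in the training set but reduces their influence, a middle ground between hard labels and full probabilistic training.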
Train a Downstream Classifier on Generated Labels
Now train your real classifier (logistic regression, neural network, etc.) on the labels from the label model. This is where Snorkel’s value shows up—you get a trained classifier without manually labeling anything.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Filter out abstained examples (where the label model couldn't decide)
labeled_indices = labels_train != -1
X_train_labeled = data[labeled_indices]['text']
y_train_labeled = labels_train[labeled_indices]

# Vectorize text and train the classifier
vectorizer = TfidfVectorizer(max_features=100)
X_train_vec = vectorizer.fit_transform(X_train_labeled)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train_vec, y_train_labeled)

# Predict on new data
new_texts = ["WIN BIG MONEY NOW", "Code review needed for PR #42"]
X_new = vectorizer.transform(new_texts)
predictions = classifier.predict(X_new)
for text, pred in zip(new_texts, predictions):
    label = "SPAM" if pred == 1 else "NOT SPAM"
    print(f"{text} -> {label}")
```
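Even with weak supervision, keep a small hand-labeled dev set to sanity-check the pipeline before trusting it. The comparison itself is trivial; a sketch with hypothetical predictions and dev labels (a few dozen hand-labeled examples are usually enough to catch gross errors):

```python
import numpy as np

# Hypothetical: classifier predictions vs. a small hand-labeled dev set.
predicted = np.array([1, 0, 1, 0, 1, 0])
dev_labels = np.array([1, 0, 1, 0, 0, 0])

accuracy = (predicted == dev_labels).mean()
print(f"dev accuracy: {accuracy:.2f}")  # dev accuracy: 0.83

# A per-class breakdown helps spot inverted or weak LFs.
for cls, name in [(1, "SPAM"), (0, "NOT_SPAM")]:
    mask = dev_labels == cls
    print(name, (predicted[mask] == cls).mean())
```

If the dev accuracy is far below what LFAnalysis suggested, suspect the label model's combination step rather than the downstream classifier.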
Use Snorkel for Named Entity Recognition
Snorkel isn’t just for classification. You can label sequences for NER tasks by writing LFs that tag spans.
```python
from snorkel.labeling import labeling_function
import re

# Sample NER data
ner_data = pd.DataFrame({
    'text': [
        "John Smith works at Google in California",
        "The meeting is at 123 Main Street",
        "Contact Sarah Johnson at [email protected]",
        "Visit our office in New York City"
    ]
})

# Entity types: PERSON, ORG, LOC
PERSON = 0
ORG = 1
LOC = 2
ABSTAIN = -1

@labeling_function()
def lf_capitalized_words(x):
    # Simple heuristic: consecutive capitalized words might be entities
    pattern = r'\b[A-Z][a-z]+ [A-Z][a-z]+\b'
    if re.search(pattern, x.text):
        return PERSON  # Assume person names for now
    return ABSTAIN

@labeling_function()
def lf_common_orgs(x):
    orgs = ['Google', 'Microsoft', 'Amazon', 'Meta', 'Apple']
    return ORG if any(org in x.text for org in orgs) else ABSTAIN

@labeling_function()
def lf_location_keywords(x):
    locations = ['California', 'New York', 'Street', 'City']
    return LOC if any(loc in x.text for loc in locations) else ABSTAIN

@labeling_function()
def lf_email_pattern(x):
    # If an email address exists, a nearby capitalized pair is likely a person
    if re.search(r'\b[A-Z][a-z]+\s+[A-Z][a-z]+\b.*@', x.text):
        return PERSON
    return ABSTAIN

# Apply and train a label model for NER
ner_lfs = [lf_capitalized_words, lf_common_orgs, lf_location_keywords, lf_email_pattern]
ner_applier = PandasLFApplier(lfs=ner_lfs)
L_ner = ner_applier.apply(df=ner_data)
ner_label_model = LabelModel(cardinality=3, verbose=True)
ner_label_model.fit(L_train=L_ner, n_epochs=500, lr=0.001, seed=42)
ner_labels = ner_label_model.predict(L=L_ner)

entity_types = {0: 'PERSON', 1: 'ORG', 2: 'LOC', -1: 'UNKNOWN'}
ner_results = pd.DataFrame({
    'text': ner_data['text'],
    'predicted_entity': [entity_types[label] for label in ner_labels]
})
print(ner_results)
```
For production NER, you’d label individual spans within each sentence rather than assigning one entity type per sentence; this sentence-level approach just illustrates the concept.
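The span-level idea needs no special API to prototype: extract candidate spans first, then run LFs over each span instead of each sentence. A rough sketch in plain Python (the span extractor and LFs below are illustrative, not Snorkel code):

```python
import re

def candidate_spans(text):
    """Extract runs of capitalized words as entity candidates."""
    return [m.group() for m in re.finditer(r'\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b', text)]

PERSON, ORG, LOC, ABSTAIN = 0, 1, 2, -1

def lf_known_org(span):
    return ORG if span in {"Google", "Microsoft", "Amazon"} else ABSTAIN

def lf_known_loc(span):
    return LOC if span in {"California", "New York City"} else ABSTAIN

text = "John Smith works at Google in California"
for span in candidate_spans(text):
    votes = [lf(span) for lf in (lf_known_org, lf_known_loc)]
    print(span, votes)
# John Smith [-1, -1]
# Google [1, -1]
# California [-1, 2]
```

Each candidate span then becomes a row in the label matrix, and the label model combines the per-span votes exactly as it did for sentences.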
Common Errors and Fixes
“Label model predicts all -1 (abstains)”: Your labeling functions abstain on too many examples, so most rows of the label matrix are empty. Add more diverse LFs that fire on different subsets of the data. Check the LFAnalysis output and aim for solid per-LF coverage (40%+ is a reasonable target).
“Label model accuracy is worse than random”: Your LFs might be negatively correlated with the true labels. Use a small dev set to check LF accuracy. Fix inverted logic (e.g., returning SPAM when you meant NOT_SPAM).
“Downstream classifier performs poorly”: The label model’s probabilistic labels might be noisy. Filter examples where max(probs_train[i]) is below a threshold (e.g., 0.7). Train only on high-confidence labels.
“LFs are too slow on large datasets”: Vectorize your LF logic, and for regex-heavy LFs precompile patterns outside the function. PandasLFApplier itself runs in a single process; for large datasets, Snorkel provides a Dask-backed PandasParallelLFApplier that takes an n_parallel argument.
“ImportError: cannot import name ‘LabelModel’”: You installed the wrong package. Run pip uninstall snorkel then pip install snorkel==0.9.9 (or latest stable version). Don’t confuse it with the older snorkel-metal package.
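The confidence filter suggested for noisy downstream labels takes only a few lines. A sketch, assuming a probs_train array from the label model and a 0.7 threshold (the toy values are illustrative):

```python
import numpy as np

# Hypothetical probabilistic labels: columns = [P(class 0), P(class 1)].
probs_train = np.array([
    [0.95, 0.05],
    [0.60, 0.40],  # below threshold: dropped
    [0.15, 0.85],
    [0.50, 0.50],  # below threshold: dropped
])

threshold = 0.7
confident = probs_train.max(axis=1) >= threshold

# Keep only high-confidence examples and their hard labels.
kept_labels = probs_train[confident].argmax(axis=1)
print(confident)    # [ True False  True False]
print(kept_labels)  # [0 1]
```

The same confident mask should also be applied to the feature matrix so examples and labels stay aligned.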
When to Use Snorkel vs Manual Labeling
Use Snorkel when you have domain knowledge that translates into heuristics (keywords, patterns, rules) but not enough time to label thousands of examples. It works best for tasks where 80-90% accuracy is acceptable and you can iterate on labeling functions quickly.
Stick with manual labeling for tasks requiring near-perfect accuracy, highly subjective judgments, or domains where heuristics don’t generalize. Combine both: use Snorkel to bootstrap labels, then manually label a subset for validation and fine-tuning.