Manual data labeling is a bottleneck. You need thousands of labeled examples to train a classifier, but labeling each one by hand takes time you don’t have. Snorkel solves this with weak supervision: you write labeling functions (LFs) that encode domain knowledge, and Snorkel combines their noisy outputs into probabilistic training labels.
Here’s the core workflow: write multiple imperfect labeling functions, use a label model to denoise and combine them, then train your downstream classifier on the generated labels. You’ll hit acceptable accuracy in hours, not weeks.
Install Snorkel and Dependencies
```shell
pip install snorkel pandas scikit-learn transformers torch
```
Snorkel works with pandas DataFrames, so you’ll integrate it into existing pipelines without rewriting data loading code.
Write Labeling Functions for Text Classification
Labeling functions return -1 (abstain), 0 (negative class), or 1 (positive class). Each LF captures a different heuristic or pattern. The label model learns to weight and combine them.
```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis
from snorkel.labeling.model import LabelModel

# Sample dataset: spam detection
data = pd.DataFrame({
    'text': [
        "Buy cheap meds now!",
        "Meeting at 3pm tomorrow",
        "CLICK HERE FOR FREE MONEY",
        "Can you review my pull request?",
        "You've won a prize! Claim it now",
        "The quarterly report is ready",
        "Lose weight fast with this trick",
        "Standup notes from today's sprint"
    ]
})

# Define labeling functions
SPAM = 1
NOT_SPAM = 0
ABSTAIN = -1

@labeling_function()
def lf_contains_free(x):
    return SPAM if "free" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_all_caps_words(x):
    words = x.text.split()
    caps_count = sum(1 for w in words if w.isupper() and len(w) > 3)
    return SPAM if caps_count >= 2 else ABSTAIN

@labeling_function()
def lf_exclamation_marks(x):
    return SPAM if x.text.count('!') >= 2 else ABSTAIN

@labeling_function()
def lf_business_keywords(x):
    keywords = ['meeting', 'report', 'review', 'standup', 'sprint']
    return NOT_SPAM if any(k in x.text.lower() for k in keywords) else ABSTAIN

@labeling_function()
def lf_short_professional(x):
    # Professional emails tend to be more concise and structured
    if len(x.text.split()) < 10 and not any(c in x.text for c in ['!', 'FREE', 'CLICK']):
        return NOT_SPAM
    return ABSTAIN

# Apply labeling functions to the dataset
lfs = [lf_contains_free, lf_all_caps_words, lf_exclamation_marks,
       lf_business_keywords, lf_short_professional]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=data)

# Analyze labeling function performance
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())
```
The output shows coverage (what % of examples each LF labels), overlap (how often LFs agree), and conflicts (disagreements). Good LF sets have high coverage and some disagreement—the label model learns from diverse signals.
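The same statistics can be computed by hand from the label matrix, which helps when debugging a single LF. A minimal sketch using NumPy (the toy matrix below is illustrative, with rows as examples and columns as LFs):

```python
import numpy as np

# Toy label matrix: rows = examples, columns = labeling functions.
# -1 = abstain, 0 = NOT_SPAM, 1 = SPAM.
L = np.array([
    [ 1, -1,  1],
    [-1,  0, -1],
    [ 1,  1, -1],
    [-1, -1, -1],
])

# Coverage: fraction of examples on which each LF does not abstain.
coverage = (L != -1).mean(axis=0)

# Overlap: fraction of examples where an LF fires alongside at least one other LF.
fired = L != -1
overlaps = (fired & (fired.sum(axis=1, keepdims=True) > 1)).mean(axis=0)

print("coverage:", coverage)  # [0.5  0.5  0.25]
print("overlaps:", overlaps)  # [0.5  0.25 0.25]
```

Reading these per-LF numbers side by side quickly shows which LFs rarely fire and which only ever fire alone.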
Train a Label Model to Denoise Weak Labels
The label model is a generative model that learns the accuracies and correlations of your labeling functions without seeing ground truth labels. It outputs probabilistic labels for each example.
```python
# Train the label model
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, lr=0.001, log_freq=100, seed=42)

# Get probabilistic training labels
probs_train = label_model.predict_proba(L=L_train)

# Get hard labels for the downstream classifier (picks the higher-probability class)
labels_train = label_model.predict(L=L_train)

# Show results
results = pd.DataFrame({
    'text': data['text'],
    'spam_prob': probs_train[:, 1],
    'predicted_label': labels_train
})
print(results)
```
The label model outputs probabilities for each class. Use predict() for hard labels or predict_proba() if your downstream model supports weighted training.
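One common way to use the soft labels without a model that consumes probabilities directly is to weight each training example by the label model's confidence. A sketch, assuming the probs_train array from the snippet above (the toy values here are illustrative):

```python
import numpy as np

# Toy probabilistic labels: columns = [P(not spam), P(spam)].
probs_train = np.array([
    [0.05, 0.95],
    [0.90, 0.10],
    [0.55, 0.45],  # low-confidence example
])

# Hard labels: pick the higher-probability class.
hard_labels = probs_train.argmax(axis=1)

# Confidence weights: how sure the label model is about each example.
weights = probs_train.max(axis=1)

# Most scikit-learn classifiers accept these via sample_weight, e.g.:
#   classifier.fit(X_train_vec, hard_labels, sample_weight=weights)
print(hard_labels)  # [1 0 0]
print(weights)      # [0.95 0.9  0.55]
```

This keeps low-confidence examples in the training set but reduces their influence, a middle ground between hard labels and full probabilistic training.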
Train a Downstream Classifier on Generated Labels
Now train your real classifier (logistic regression, neural network, etc.) on the labels from the label model. This is where Snorkel’s value shows up—you get a trained classifier without manually labeling anything.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Filter out abstained examples (where the label model couldn't decide)
labeled_indices = labels_train != -1
X_train_labeled = data[labeled_indices]['text']
y_train_labeled = labels_train[labeled_indices]

# Vectorize text and train the classifier
vectorizer = TfidfVectorizer(max_features=100)
X_train_vec = vectorizer.fit_transform(X_train_labeled)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train_vec, y_train_labeled)

# Predict on new data
new_texts = ["WIN BIG MONEY NOW", "Code review needed for PR #42"]
X_new = vectorizer.transform(new_texts)
predictions = classifier.predict(X_new)
for text, pred in zip(new_texts, predictions):
    label = "SPAM" if pred == 1 else "NOT SPAM"
    print(f"{text} -> {label}")
```
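Even with weak supervision, keep a small hand-labeled dev set to sanity-check the pipeline before trusting it. The comparison itself is trivial; a sketch with hypothetical predictions and dev labels (a few dozen hand-labeled examples are usually enough to catch gross errors):

```python
import numpy as np

# Hypothetical: classifier predictions vs. a small hand-labeled dev set.
predicted = np.array([1, 0, 1, 0, 1, 0])
dev_labels = np.array([1, 0, 1, 0, 0, 0])

accuracy = (predicted == dev_labels).mean()
print(f"dev accuracy: {accuracy:.2f}")  # dev accuracy: 0.83

# A per-class breakdown helps spot inverted or weak LFs.
for cls, name in [(1, "SPAM"), (0, "NOT_SPAM")]:
    mask = dev_labels == cls
    print(name, (predicted[mask] == cls).mean())
```

If the dev accuracy is far below what LFAnalysis suggested, suspect the label model's combination step rather than the downstream classifier.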
Use Snorkel for Named Entity Recognition
Snorkel isn’t just for classification. You can label sequences for NER tasks by writing LFs that tag spans.
```python
from snorkel.labeling import labeling_function
import re

# Sample NER data
ner_data = pd.DataFrame({
    'text': [
        "John Smith works at Google in California",
        "The meeting is at 123 Main Street",
        "Contact Sarah Johnson at [email protected]",
        "Visit our office in New York City"
    ]
})

# Entity types: PERSON, ORG, LOC
PERSON = 0
ORG = 1
LOC = 2
ABSTAIN = -1

@labeling_function()
def lf_capitalized_words(x):
    # Simple heuristic: consecutive capitalized words might be entities
    pattern = r'\b[A-Z][a-z]+ [A-Z][a-z]+\b'
    if re.search(pattern, x.text):
        return PERSON  # Assume person names for now
    return ABSTAIN

@labeling_function()
def lf_common_orgs(x):
    orgs = ['Google', 'Microsoft', 'Amazon', 'Meta', 'Apple']
    return ORG if any(org in x.text for org in orgs) else ABSTAIN

@labeling_function()
def lf_location_keywords(x):
    locations = ['California', 'New York', 'Street', 'City']
    return LOC if any(loc in x.text for loc in locations) else ABSTAIN

@labeling_function()
def lf_email_pattern(x):
    # If an email address exists, a nearby capitalized pair is likely a person
    if re.search(r'\b[A-Z][a-z]+\s+[A-Z][a-z]+\b.*@', x.text):
        return PERSON
    return ABSTAIN

# Apply and train a label model for NER
ner_lfs = [lf_capitalized_words, lf_common_orgs, lf_location_keywords, lf_email_pattern]
ner_applier = PandasLFApplier(lfs=ner_lfs)
L_ner = ner_applier.apply(df=ner_data)
ner_label_model = LabelModel(cardinality=3, verbose=True)
ner_label_model.fit(L_train=L_ner, n_epochs=500, lr=0.001, seed=42)
ner_labels = ner_label_model.predict(L=L_ner)

entity_types = {0: 'PERSON', 1: 'ORG', 2: 'LOC', -1: 'UNKNOWN'}
ner_results = pd.DataFrame({
    'text': ner_data['text'],
    'predicted_entity': [entity_types[label] for label in ner_labels]
})
print(ner_results)
```
For production NER, you’d label individual spans within each sentence rather than assigning one entity type per sentence; this sentence-level approach just illustrates the concept.
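The span-level idea needs no special API to prototype: extract candidate spans first, then run LFs over each span instead of each sentence. A rough sketch in plain Python (the span extractor and LFs below are illustrative, not Snorkel code):

```python
import re

def candidate_spans(text):
    """Extract runs of capitalized words as entity candidates."""
    return [m.group() for m in re.finditer(r'\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b', text)]

PERSON, ORG, LOC, ABSTAIN = 0, 1, 2, -1

def lf_known_org(span):
    return ORG if span in {"Google", "Microsoft", "Amazon"} else ABSTAIN

def lf_known_loc(span):
    return LOC if span in {"California", "New York City"} else ABSTAIN

text = "John Smith works at Google in California"
for span in candidate_spans(text):
    votes = [lf(span) for lf in (lf_known_org, lf_known_loc)]
    print(span, votes)
# John Smith [-1, -1]
# Google [1, -1]
# California [-1, 2]
```

Each candidate span then becomes a row in the label matrix, and the label model combines the per-span votes exactly as it did for sentences.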
Common Errors and Fixes
“Label model predicts all -1 (abstains)”: Your labeling functions abstain on too many examples, so most rows of the label matrix are empty. Add more diverse LFs that fire on different subsets of the data. Check the LFAnalysis output and aim for solid per-LF coverage (40%+ is a reasonable target).
“Label model accuracy is worse than random”: Your LFs might be negatively correlated with the true labels. Use a small dev set to check LF accuracy. Fix inverted logic (e.g., returning SPAM when you meant NOT_SPAM).
“Downstream classifier performs poorly”: The label model’s probabilistic labels might be noisy. Filter examples where max(probs_train[i]) is below a threshold (e.g., 0.7). Train only on high-confidence labels.
“LFs are too slow on large datasets”: Vectorize your LF logic, and for regex-heavy LFs precompile patterns outside the function. PandasLFApplier itself runs in a single process; for large datasets, Snorkel provides a Dask-backed PandasParallelLFApplier that takes an n_parallel argument.
“ImportError: cannot import name ‘LabelModel’”: You installed the wrong package. Run pip uninstall snorkel then pip install snorkel==0.9.9 (or latest stable version). Don’t confuse it with the older snorkel-metal package.
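The confidence filter suggested for noisy downstream labels takes only a few lines. A sketch, assuming a probs_train array from the label model and a 0.7 threshold (the toy values are illustrative):

```python
import numpy as np

# Hypothetical probabilistic labels: columns = [P(class 0), P(class 1)].
probs_train = np.array([
    [0.95, 0.05],
    [0.60, 0.40],  # below threshold: dropped
    [0.15, 0.85],
    [0.50, 0.50],  # below threshold: dropped
])

threshold = 0.7
confident = probs_train.max(axis=1) >= threshold

# Keep only high-confidence examples and their hard labels.
kept_labels = probs_train[confident].argmax(axis=1)
print(confident)    # [ True False  True False]
print(kept_labels)  # [0 1]
```

The same confident mask should also be applied to the feature matrix so examples and labels stay aligned.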
When to Use Snorkel vs Manual Labeling
Use Snorkel when you have domain knowledge that translates into heuristics (keywords, patterns, rules) but not enough time to label thousands of examples. It works best for tasks where 80-90% accuracy is acceptable and you can iterate on labeling functions quickly.
Stick with manual labeling for tasks requiring near-perfect accuracy, highly subjective judgments, or domains where heuristics don’t generalize. Combine both: use Snorkel to bootstrap labels, then manually label a subset for validation and fine-tuning.