Mislabeled training data silently tanks model performance. Cleanlab finds those bad labels automatically using your model’s own predictions. Here’s how to build a full data quality pipeline around it.

Find Label Issues in 30 Seconds

Install cleanlab and run an audit on any classification dataset. This works with scikit-learn, PyTorch, TensorFlow, or any model that outputs predicted probabilities.

pip install cleanlab
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab import Datalab

# Load a dataset (swap this with your own)
iris = load_iris()
X, y = iris.data, iris.target

# Inject 10% label noise so we have something to find
np.random.seed(42)
noisy_indices = np.random.choice(len(y), size=15, replace=False)
y_noisy = y.copy()
y_noisy[noisy_indices] = (y_noisy[noisy_indices] + 1) % 3

# Get out-of-sample predicted probabilities via cross-validation
model = LogisticRegression(max_iter=200)
pred_probs = cross_val_predict(model, X, y_noisy, cv=5, method="predict_proba")

# Run the audit
lab = Datalab(data={"features": X, "labels": y_noisy}, label_name="labels")
lab.find_issues(pred_probs=pred_probs, features=X)
lab.report()

That lab.report() call prints a summary of every issue type detected: label errors, outliers, near-duplicates, and class imbalance. The most actionable output is the label issues – those are your mislabeled examples.

How Confident Learning Works

Cleanlab uses an algorithm called Confident Learning to find label errors. The core idea: if your model consistently predicts class A for an example labeled class B (across cross-validation folds), that label is probably wrong.

The algorithm builds a confident joint – a matrix estimating how often each true class gets mislabeled as another class. It does not need a perfect model. Even a mediocre classifier provides enough signal to catch obvious labeling mistakes.

This is why you use cross_val_predict with method="predict_proba": each example is scored by a model that never trained on it, so a memorized noisy label can't inflate its own confidence estimate.
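The thresholding idea behind the confident joint can be sketched in a few lines of NumPy. This is a simplified illustration, not Cleanlab's full implementation (which also handles ties and calibration):

```python
import numpy as np

# Toy setup: 5 examples, 2 classes; example 2 is labeled 0 but looks like class 1
labels = np.array([0, 0, 0, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.8],  # suspicious: labeled 0, model says 1
    [0.1, 0.9],
    [0.3, 0.7],
])

# Per-class threshold: average self-confidence among examples given that label
n_classes = pred_probs.shape[1]
thresholds = np.array([
    pred_probs[labels == k, k].mean() for k in range(n_classes)
])

# Confident joint: an example labeled `given` counts toward entry (given, j)
# when its probability for class j clears class j's threshold
confident_joint = np.zeros((n_classes, n_classes), dtype=int)
for i, given in enumerate(labels):
    confident = np.where(pred_probs[i] >= thresholds)[0]
    if len(confident) > 0:
        j = confident[np.argmax(pred_probs[i, confident])]
        confident_joint[given, j] += 1

# Off-diagonal entries estimate mislabeling: confident_joint[0, 1] counts
# examples labeled 0 that the model confidently assigns to class 1
print(confident_joint)
```

Here the off-diagonal entry at (0, 1) flags exactly the injected error, even though the "model" is just a hand-written probability table.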

Drill Into Specific Issues

After find_issues, pull out the detailed results for each issue type:

# Get label issue details
label_issues = lab.get_issues("label")
print(label_issues.columns.tolist())
# ['is_label_issue', 'label_score', 'given_label', 'predicted_label']

# Filter to flagged examples, sorted by confidence
flagged = label_issues.query("is_label_issue").sort_values("label_score")
print(f"Found {len(flagged)} suspected label errors")
print(flagged.head(10))

# Check for outliers and near-duplicates too
outliers = lab.get_issues("outlier").query("is_outlier_issue")
duplicates = lab.get_issues("near_duplicate").query("is_near_duplicate_issue")
print(f"Outliers: {len(outliers)}, Near-duplicates: {len(duplicates)}")

The label_score column ranges from 0 to 1. Lower scores mean the label is more likely wrong. Sort by this score and review the worst offenders first.

Use CleanLearning for Automatic Cleanup

If you want to skip manual review and just train on the clean subset, CleanLearning wraps any scikit-learn classifier and automatically drops suspicious examples during training:

from cleanlab.classification import CleanLearning

cl = CleanLearning(LogisticRegression(max_iter=200))

# This finds label issues, drops them, and trains on the clean data
cl.fit(X, y_noisy)

# Predict as if your training data was clean
predictions = cl.predict(X)

# Access the identified label issues
issue_df = cl.get_label_issues()
print(issue_df[issue_df["is_label_issue"]].head())

CleanLearning handles cross-validation internally, so you don’t need to compute pred_probs yourself. It’s the fastest path from noisy labels to a trained model.

Use find_label_issues Directly

For more control, use cleanlab.filter.find_label_issues with pre-computed probabilities. This is useful when your model isn’t sklearn-compatible or when you want to tune filtering behavior:

from cleanlab.filter import find_label_issues

# Get ranked indices of likely label errors
issue_indices = find_label_issues(
    labels=y_noisy,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)

print(f"Top 10 most likely mislabeled examples: {issue_indices[:10]}")

# Remove them and retrain
clean_mask = np.ones(len(y_noisy), dtype=bool)
clean_mask[issue_indices] = False
X_clean, y_clean = X[clean_mask], y_noisy[clean_mask]

Set return_indices_ranked_by="self_confidence" to get indices sorted by how confident the model is that the given label is wrong. Without this parameter, the function returns a boolean mask instead.
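Under "self_confidence" ranking, the score for each example is simply the model's predicted probability for its given label. A minimal NumPy equivalent of that scoring (an illustration, not Cleanlab's internals):

```python
import numpy as np

labels = np.array([0, 1, 1, 0])
pred_probs = np.array([
    [0.95, 0.05],
    [0.30, 0.70],
    [0.85, 0.15],  # labeled 1 but the model leans 0: lowest self-confidence
    [0.60, 0.40],
])

# Self-confidence: the probability assigned to each example's own label
self_confidence = pred_probs[np.arange(len(labels)), labels]

# Sorting ascending puts the most suspicious labels first
ranked = np.argsort(self_confidence)
print(ranked)  # example 2 leads the ranking
```

Example 2 ends up first because the model gives its given label only 0.15 probability, which is exactly why low label_score values deserve review first.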

Clean Text Classification Datasets

Text datasets collect label errors fast, especially from crowd-sourced annotation. Cleanlab works the same way – you just need embeddings and predicted probabilities:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab import Datalab

# Example text classification data
texts = [
    "The stock price surged after earnings beat expectations",
    "New treatment shows promise in clinical trials",
    "Team wins championship in overtime thriller",
    "Governor signs new education funding bill",
    "Researchers discover high-temperature superconductor",
    "Market crashes amid fears of recession",
    "Hospital opens new cardiac care wing",
    "Player traded to rival team for record sum",
    "Senate passes infrastructure spending package",
    "Study finds high caffeine intake linked to sleep disruption",
]
labels = np.array([0, 1, 2, 3, 1, 0, 1, 2, 3, 1])  # 0=business, 1=health, 2=sports, 3=politics
# labels[4] is intentionally wrong: science story labeled as "health"

# Vectorize and get predicted probabilities
vectorizer = TfidfVectorizer(max_features=500)
X_text = vectorizer.fit_transform(texts).toarray()

model = LogisticRegression(max_iter=300)
pred_probs = cross_val_predict(model, X_text, labels, cv=2, method="predict_proba")  # cv=2: smallest class has only 2 examples

# Audit
lab = Datalab(data={"text": texts, "label": labels}, label_name="label")
lab.find_issues(pred_probs=pred_probs, features=X_text)

label_results = lab.get_issues("label")
print(label_results[["is_label_issue", "label_score", "given_label", "predicted_label"]])

For larger text datasets, swap TF-IDF with sentence embeddings from sentence-transformers for better detection accuracy. The embeddings feed into both the label issue detection and the outlier/duplicate checks.

Use Cleanlab with Deep Learning Models

Cleanlab doesn’t train your deep learning model for you – it just needs the pred_probs output. Train your PyTorch or TensorFlow model with k-fold cross-validation, collect the out-of-sample probabilities, and pass them in:

import torch
import numpy as np
from sklearn.model_selection import StratifiedKFold
from cleanlab.filter import find_label_issues

def get_pred_probs_pytorch(model_class, X_tensor, y_array, n_splits=5):
    """Collect out-of-sample predicted probabilities from a PyTorch model."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    pred_probs = np.zeros((len(y_array), len(np.unique(y_array))))

    for train_idx, val_idx in skf.split(X_tensor, y_array):
        model = model_class()  # fresh model each fold
        # ... your training loop here ...
        model.eval()
        with torch.no_grad():
            logits = model(X_tensor[val_idx])
            probs = torch.softmax(logits, dim=1).numpy()
        pred_probs[val_idx] = probs

    return pred_probs

# After collecting pred_probs from your model:
# issue_indices = find_label_issues(labels=y, pred_probs=pred_probs,
#                                    return_indices_ranked_by="self_confidence")

The key requirement: every example must have a predicted probability from a fold where it was not in the training set. If you use the same model that trained on an example to predict its probability, the model memorizes noisy labels and Cleanlab can’t catch the errors.
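You can see the memorization problem directly with a 1-nearest-neighbor classifier, which memorizes its training set perfectly. This toy comparison (my own construction, not from the Cleanlab docs) flips one label in a well-separated dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict

# Two well-separated clusters, 10 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(10, 0.5, (10, 2))])
y = np.array([0] * 10 + [1] * 10)
y[0] = 1  # inject one label error

model = KNeighborsClassifier(n_neighbors=1)

# In-sample: each point is its own nearest neighbor, so the model
# "confirms" the wrong label with probability 1.0 and hides the error
in_sample = model.fit(X, y).predict_proba(X)
print("in-sample confidence in bad label:", in_sample[0, y[0]])

# Out-of-sample: the held-out point's nearest neighbor is correctly
# labeled, so the bad label gets probability 0.0 and is trivial to flag
out_of_sample = cross_val_predict(model, X, y, cv=5, method="predict_proba")
print("out-of-sample confidence in bad label:", out_of_sample[0, y[0]])
```

The same effect applies, less extremely, to any deep network trained long enough to fit its training labels: in-sample probabilities endorse the noise, out-of-sample probabilities expose it.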

Common Errors and Fixes

ValueError: pred_probs is not a valid matrix of predicted probabilities

Your probability rows don’t sum to 1. This happens with raw logits or when you forget the softmax step. Fix it:

# For PyTorch logits
probs = torch.softmax(logits, dim=1).numpy()

# For numpy arrays that are close but not exact
pred_probs = pred_probs / pred_probs.sum(axis=1, keepdims=True)

ValueError: labels and pred_probs must have the same number of examples

You filtered your dataset after generating probabilities, or your cross-validation dropped some examples. Make sure len(labels) == len(pred_probs) and the indices align.

pred_probs columns don’t match number of classes

pred_probs must have shape (n_samples, n_classes). If you have 5 classes, each row needs 5 probability values. Check with pred_probs.shape[1] == len(np.unique(labels)).
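A quick sanity check (a hypothetical helper, not part of Cleanlab) that catches all three shape and normalization problems before you hand pred_probs to Cleanlab:

```python
import numpy as np

def check_pred_probs(labels, pred_probs, atol=1e-6):
    """Raise a descriptive error for the common pred_probs mistakes."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs)
    if len(labels) != len(pred_probs):
        raise ValueError(
            f"{len(labels)} labels but {len(pred_probs)} probability rows"
        )
    n_classes = len(np.unique(labels))
    if pred_probs.shape[1] != n_classes:
        raise ValueError(
            f"pred_probs has {pred_probs.shape[1]} columns but labels "
            f"contain {n_classes} classes"
        )
    row_sums = pred_probs.sum(axis=1)
    if not np.allclose(row_sums, 1.0, atol=atol):
        raise ValueError("probability rows do not sum to 1 (raw logits?)")

# Valid input passes silently
check_pred_probs([0, 1, 0], np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]))
```

Running this once right after cross-validation turns a cryptic downstream ValueError into an immediate, specific one.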

CleanLearning fails with your custom model

CleanLearning requires sklearn-compatible estimators (must implement fit, predict, and predict_proba). For non-sklearn models, use find_label_issues directly with pre-computed probabilities instead.
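Another option is a thin sklearn-style adapter around your model. A sketch (the class name and centroid stand-in are made up for illustration; swap in your own training and inference calls):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MyModelWrapper(BaseEstimator, ClassifierMixin):
    """Adapts a custom model to the fit/predict/predict_proba contract
    that CleanLearning expects."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Stand-in for your model's training: store per-class centroids
        self.centroids_ = np.stack(
            [X[y == c].mean(axis=0) for c in self.classes_]
        )
        return self

    def predict_proba(self, X):
        # Stand-in for your model's inference: softmax over negative distances
        dists = np.linalg.norm(
            X[:, None, :] - self.centroids_[None, :, :], axis=2
        )
        logits = -dists
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

# Sanity check on separable toy data
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])
wrapper = MyModelWrapper().fit(X, y)
print(wrapper.predict(X))
```

Once the wrapper passes checks like these, it should slot into CleanLearning the same way a plain sklearn classifier does.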

Cross-validation gives different number of classes per fold

This happens with rare classes in small datasets. Use StratifiedKFold instead of regular KFold to ensure each fold contains examples from every class:

from sklearn.model_selection import cross_val_predict, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pred_probs = cross_val_predict(model, X, y, cv=cv, method="predict_proba")