Mislabeled training data silently tanks model performance. Cleanlab finds those bad labels automatically using your model’s own predictions. Here’s how to build a full data quality pipeline around it.
Find Label Issues in 30 Seconds
Install cleanlab and run an audit on any classification dataset. This works with scikit-learn, PyTorch, TensorFlow, or any model that outputs predicted probabilities.
That lab.report() call prints a summary of every issue type detected: label errors, outliers, near-duplicates, and class imbalance. The most actionable output is the label issues – those are your mislabeled examples.
How Confident Learning Works
Cleanlab uses an algorithm called Confident Learning to find label errors. The core idea: if your model consistently predicts class A for an example labeled class B (across cross-validation folds), that label is probably wrong.
The algorithm builds a confident joint – a matrix estimating how often each true class gets mislabeled as another class. It does not need a perfect model. Even a mediocre classifier provides enough signal to catch obvious labeling mistakes.
This is why you use cross_val_predict with method="predict_proba". Out-of-sample probabilities prevent the model from memorizing noisy labels, which gives you honest confidence estimates.
Drill Into Specific Issues
After find_issues, pull out the detailed results for each issue type:
The label_score column ranges from 0 to 1. Lower scores mean the label is more likely wrong. Sort by this score and review the worst offenders first.
Use CleanLearning for Automatic Cleanup
If you want to skip manual review and just train on the clean subset, CleanLearning wraps any scikit-learn classifier and automatically drops suspicious examples during training:
CleanLearning handles cross-validation internally, so you don’t need to compute pred_probs yourself. It’s the fastest path from noisy labels to a trained model.
Use find_label_issues Directly
For more control, use cleanlab.filter.find_label_issues with pre-computed probabilities. This is useful when your model isn’t sklearn-compatible or when you want to tune filtering behavior:
Set return_indices_ranked_by="self_confidence" to get the indices of suspected errors sorted worst-first: self-confidence is the model's predicted probability for the given label, so examples whose labels receive the least support come first. Without this parameter, the function returns a boolean mask over the full dataset instead.
Clean Text Classification Datasets
Text datasets collect label errors fast, especially from crowd-sourced annotation. Cleanlab works the same way – you just need embeddings and predicted probabilities:
For larger text datasets, swap the TF-IDF features for sentence embeddings from sentence-transformers for better detection accuracy. The embeddings feed into both the label issue detection and the outlier/duplicate checks.
Use Cleanlab with Deep Learning Models
Cleanlab doesn’t train your deep learning model for you – it just needs the pred_probs output. Train your PyTorch or TensorFlow model with k-fold cross-validation, collect the out-of-sample probabilities, and pass them in:
The key requirement: every example must have a predicted probability from a fold where it was not in the training set. If you use the same model that trained on an example to predict its probability, the model memorizes noisy labels and Cleanlab can’t catch the errors.
Common Errors and Fixes
ValueError: pred_probs is not a valid matrix of predicted probabilities
Your probability rows don’t sum to 1. This happens with raw logits or when you forget the softmax step. Fix it:
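A minimal fix, assuming your model emitted raw logits (the values here are illustrative):

```python
import numpy as np
from scipy.special import softmax

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.3, 1.2]])   # raw model outputs, rows don't sum to 1

pred_probs = softmax(logits, axis=1)   # convert to valid probabilities
assert np.allclose(pred_probs.sum(axis=1), 1.0)
```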
ValueError: labels and pred_probs must have the same number of examples
You filtered your dataset after generating probabilities, or your cross-validation dropped some examples. Make sure len(labels) == len(pred_probs) and the indices align.
pred_probs columns don’t match number of classes
pred_probs must have shape (n_samples, n_classes). If you have 5 classes, each row needs 5 probability values. Check with pred_probs.shape[1] == len(np.unique(labels)).
CleanLearning fails with your custom model
CleanLearning requires sklearn-compatible estimators (must implement fit, predict, and predict_proba). For non-sklearn models, use find_label_issues directly with pre-computed probabilities instead.
Cross-validation gives different number of classes per fold
This happens with rare classes in small datasets. Use StratifiedKFold instead of regular KFold to ensure each fold contains examples from every class:
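A sketch on a deliberately imbalanced synthetic dataset with one rare class:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Imbalanced 3-class dataset: roughly 80% / 15% / 5%
X, y = make_classification(n_samples=200, n_classes=3, n_informative=4,
                           weights=[0.8, 0.15, 0.05], random_state=0)

# StratifiedKFold keeps every class represented in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pred_probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                               cv=cv, method="predict_proba")
assert pred_probs.shape[1] == 3   # every fold saw all 3 classes
```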
Related Guides
- How to Build a Data Outlier Detection Pipeline with PyOD
- How to Build a Dataset Bias Detection Pipeline with Python
- How to Build a Data Labeling Pipeline with Label Studio
- How to Build a Data Slicing and Stratification Pipeline for ML
- How to Build a Data Reconciliation Pipeline for ML Training Sets
- How to Build a Data Contamination Detection Pipeline for LLM Training
- How to Build a Feature Importance and Selection Pipeline with Scikit-Learn
- How to Build a Data Freshness Monitoring Pipeline with Python
- How to Build a Data Profiling and Auto-Cleaning Pipeline with Python
- How to Build a Data Annotation Pipeline with Argilla