Bad data points wreck models. A single cluster of outliers in your training set can shift decision boundaries, inflate loss, and produce predictions that look fine on average but fail hard on real inputs. PyOD gives you 40+ outlier detection algorithms behind a consistent scikit-learn-style API, so you can run multiple detectors and combine their votes without writing glue code.
Install PyOD and Generate Test Data
First, create a synthetic dataset with known outliers. This lets you measure whether your pipeline actually catches them.
generate_data produces Gaussian inliers and uniform outliers. The y_train and y_test arrays contain ground truth labels: 0 for inlier, 1 for outlier. You will use these later to evaluate detection quality.
Run Three Detectors: Isolation Forest, LOF, ECOD
Each algorithm catches different kinds of anomalies. Isolation Forest works well on global outliers. LOF (Local Outlier Factor) catches local density deviations. ECOD (Empirical Cumulative Distribution-based Outlier Detection) is fast and nonparametric – it flags points in the tails of each feature’s distribution.
After calling .fit(), every PyOD model exposes two attributes: labels_ (binary predictions on training data) and decision_scores_ (continuous anomaly scores where higher means more anomalous). This consistent interface is what makes PyOD worth using over raw scikit-learn estimators.
Combine Multiple Detectors
A single detector can miss outliers that another catches. PyOD’s combination utilities let you aggregate scores from multiple models. The two most useful strategies are average scoring and majority vote.
Majority vote is more conservative – it only flags a point when most detectors agree. Average scoring gives you a continuous score you can threshold however you want. For cleaning training data, majority vote is usually the safer choice because it reduces false positives.
Visualize Outlier Scores and Decisions
Plotting helps you verify the pipeline is flagging the right points, especially when you have ground truth.
If the red points in the prediction plot roughly match the ground truth plot, your pipeline is working. Look for false negatives near cluster edges – those are the hardest cases for any detector.
Build the Automated Cleaning Pipeline
Wrap everything into a reusable function that takes raw data and returns a cleaned version with outliers removed.
The min_votes parameter controls strictness. Set it to 1 to be aggressive (any detector flags it), or to 3 to only remove points that all three detectors agree on.
Evaluate Detection Quality
When you have ground truth labels, measure how well the pipeline performs using precision and recall.
You can also check per-detector performance to see which algorithm works best for your data distribution:
The combined score usually beats the weaker individual detectors, though a single well-suited detector can occasionally win outright. If one detector consistently underperforms, drop it from the ensemble – carrying dead weight lowers the combined AUC.
Tuning the Contamination Rate
The contamination parameter tells each detector what fraction of the data to treat as outliers. If you set it too low, you miss real outliers. Too high, and you throw away good data.
A practical approach when you do not have labels:
- Start with contamination=0.05 (5%)
- Inspect the flagged points manually or with visualizations
- Increase to 0.1 or 0.15 if you see obvious outliers surviving
- Check downstream model performance with and without the flagged points removed
If your dataset has a known noise rate from a labeling process, use that as your starting contamination estimate.
Common Errors and Fixes
ValueError: contamination must be in (0, 0.5]
PyOD caps contamination at 50%. If your data is more than half outliers, you have a bigger problem than outlier detection. Filter or resample first.
ImportError: cannot import name 'IForest' from 'pyod.models.iforest'
You probably have an old PyOD version. Run pip install --upgrade pyod to get the latest. The IForest wrapper has been stable since PyOD 0.9+.
LOF runs extremely slowly on large datasets
LOF has O(n^2) complexity because it computes pairwise distances. For datasets over 50k rows, switch to ECOD or IForest, both of which scale linearly. Alternatively, subsample your data before fitting LOF.
Scores from different detectors are on wildly different scales
This is expected – raw decision_scores_ are not comparable across algorithms. Standardize the score columns before combining, for example with pyod.utils.utility.standardizer or scipy.stats.zscore, then pass the normalized scores to pyod.models.combination.average.
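A toy illustration with scipy.stats.zscore on two hypothetical score vectors:

```python
import numpy as np
from scipy.stats import zscore

# Hypothetical raw scores from two detectors on very different scales
iforest_scores = np.array([0.10, 0.20, 0.15, 0.90])
lof_scores = np.array([120.0, 130.0, 125.0, 400.0])

# After z-scoring, each detector contributes equally to the average
combined = (zscore(iforest_scores) + zscore(lof_scores)) / 2
print(int(combined.argmax()))  # 3: the outlier in both detectors
```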
decision_scores_ has NaN values
Usually caused by features containing NaN. PyOD detectors do not handle missing values. Impute or drop NaN rows before fitting:
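For example, with scikit-learn's SimpleImputer, or a plain row drop:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])

# Option 1: fill missing values with the per-column median
X_imputed = SimpleImputer(strategy="median").fit_transform(X)

# Option 2: drop any row that contains a NaN
X_dropped = X[~np.isnan(X).any(axis=1)]
print(X_dropped.shape)  # (2, 2)
```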
Majority vote flags zero outliers
This happens when detectors disagree completely. Lower min_votes to 1 or 2. Also check that all detectors use the same contamination value – mismatched rates cause inconsistent thresholds.
Related Guides
- How to Build a Data Profiling and Auto-Cleaning Pipeline with Python
- How to Build a Data Quality Pipeline with Cleanlab
- How to Build a Dataset Bias Detection Pipeline with Python
- How to Build a Data Labeling Pipeline with Label Studio
- How to Build a Data Slicing and Stratification Pipeline for ML
- How to Build a Data Reconciliation Pipeline for ML Training Sets
- How to Build a Data Contamination Detection Pipeline for LLM Training
- How to Build a Feature Importance and Selection Pipeline with Scikit-Learn
- How to Build a Data Freshness Monitoring Pipeline with Python
- How to Build a Data Annotation Pipeline with Argilla