Bad data points wreck models. A single cluster of outliers in your training set can shift decision boundaries, inflate loss, and produce predictions that look fine on average but fail hard on real inputs. PyOD gives you 40+ outlier detection algorithms behind a consistent scikit-learn-style API, so you can run multiple detectors and combine their votes without writing glue code.

Install PyOD and Generate Test Data

pip install pyod scikit-learn numpy matplotlib

First, create a synthetic dataset with known outliers. This lets you measure whether your pipeline actually catches them.

import numpy as np
from pyod.utils.data import generate_data

# 500 samples, 2 features, 10% contamination (50 outliers)
X_train, X_test, y_train, y_test = generate_data(
    n_train=500,
    n_test=200,
    n_features=2,
    contamination=0.1,
    random_state=42,
)

print(f"Training set: {X_train.shape[0]} samples, {int(y_train.sum())} outliers")
print(f"Test set: {X_test.shape[0]} samples, {int(y_test.sum())} outliers")

generate_data produces Gaussian inliers and uniform outliers. The y_train and y_test arrays contain ground truth labels: 0 for inlier, 1 for outlier. You will use these later to evaluate detection quality.

Run Three Detectors: Isolation Forest, LOF, ECOD

Each algorithm catches different kinds of anomalies. Isolation Forest works well on global outliers. LOF (Local Outlier Factor) catches local density deviations. ECOD (Empirical Cumulative Distribution-based Outlier Detection) is fast and nonparametric – it flags points in the tails of each feature’s distribution.

from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.models.ecod import ECOD

contamination = 0.1

# Initialize detectors
detectors = {
    "Isolation Forest": IForest(contamination=contamination, random_state=42),
    "LOF": LOF(contamination=contamination, n_neighbors=20),
    "ECOD": ECOD(contamination=contamination),
}

# Fit each detector on training data
for name, model in detectors.items():
    model.fit(X_train)
    train_preds = model.labels_       # binary labels: 0=inlier, 1=outlier
    train_scores = model.decision_scores_  # raw anomaly scores
    n_detected = int(train_preds.sum())
    print(f"{name}: detected {n_detected} outliers in training set")

After calling .fit(), every PyOD model exposes two attributes: labels_ (binary predictions on training data) and decision_scores_ (continuous anomaly scores where higher means more anomalous). This consistent interface is what makes PyOD worth using over raw scikit-learn estimators.

Combine Multiple Detectors

A single detector can miss outliers that another catches. PyOD’s combination utilities let you aggregate scores from multiple models. The two most useful strategies are average scoring and majority vote.

from pyod.models.combination import average, majority_vote
from pyod.utils.utility import standardizer

# Collect raw scores and binary predictions from all detectors
scores_matrix = np.column_stack([
    detectors["Isolation Forest"].decision_scores_,
    detectors["LOF"].decision_scores_,
    detectors["ECOD"].decision_scores_,
])

labels_matrix = np.column_stack([
    detectors["Isolation Forest"].labels_,
    detectors["LOF"].labels_,
    detectors["ECOD"].labels_,
])

# Standardize scores (zero mean, unit variance per detector) so no
# single detector's scale dominates, then average
scores_norm = standardizer(scores_matrix)
combined_scores = average(scores_norm)

# Majority vote: flagged as outlier if 2 out of 3 detectors agree
combined_labels = majority_vote(labels_matrix)

n_avg = int((combined_scores > np.percentile(combined_scores, 90)).sum())
n_vote = int(combined_labels.sum())
print(f"Average scoring flags: {n_avg} outliers (top 10% of scores)")
print(f"Majority vote flags: {n_vote} outliers")

Majority vote is more conservative – it only flags a point when most detectors agree. Average scoring gives you a continuous score you can threshold however you want. For cleaning training data, majority vote is usually the safer choice because it reduces false positives.

Visualize Outlier Scores and Decisions

Plotting helps you verify the pipeline is flagging the right points, especially when you have ground truth.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Ground truth
axes[0].scatter(
    X_train[:, 0], X_train[:, 1],
    c=y_train, cmap="coolwarm", s=15, alpha=0.7,
)
axes[0].set_title("Ground Truth (red = outlier)")

# Combined scores heatmap
scatter = axes[1].scatter(
    X_train[:, 0], X_train[:, 1],
    c=combined_scores, cmap="YlOrRd", s=15, alpha=0.7,
)
axes[1].set_title("Combined Anomaly Scores")
plt.colorbar(scatter, ax=axes[1])

# Majority vote predictions
axes[2].scatter(
    X_train[:, 0], X_train[:, 1],
    c=combined_labels, cmap="coolwarm", s=15, alpha=0.7,
)
axes[2].set_title("Majority Vote Predictions")

plt.tight_layout()
plt.savefig("outlier_detection_results.png", dpi=150)
plt.show()

If the red points in the prediction plot roughly match the ground truth plot, your pipeline is working. Look for false negatives near cluster edges – those are the hardest cases for any detector.
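With ground truth available, you can pull out the false negatives directly instead of eyeballing them. A numpy-only sketch (the two arrays are small stand-ins for y_train and the ensemble's predictions):

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])  # stand-in for y_train
y_pred = np.array([0, 0, 0, 1, 0, 1, 1, 1])  # stand-in for combined_labels

fn_mask = (y_true == 1) & (y_pred == 0)  # real outliers the ensemble missed
fp_mask = (y_true == 0) & (y_pred == 1)  # inliers wrongly flagged

print(f"false negatives: {int(fn_mask.sum())} at indices {np.flatnonzero(fn_mask)}")
print(f"false positives: {int(fp_mask.sum())} at indices {np.flatnonzero(fp_mask)}")
```

Indexing X_train with fn_mask gives you the missed points to plot or inspect by hand.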

Build the Automated Cleaning Pipeline

Wrap everything into a reusable function that takes raw data and returns a cleaned version with outliers removed.

from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.models.ecod import ECOD
import numpy as np


def remove_outliers(X, contamination=0.1, min_votes=2):
    """
    Fit multiple outlier detectors and remove points flagged
    by at least `min_votes` detectors.

    Returns:
        X_clean: array with outliers removed
        outlier_mask: boolean array (True = outlier)
    """
    models = [
        IForest(contamination=contamination, random_state=42),
        LOF(contamination=contamination, n_neighbors=20),
        ECOD(contamination=contamination),
    ]

    all_labels = []
    for model in models:
        model.fit(X)
        all_labels.append(model.labels_)

    labels_matrix = np.column_stack(all_labels)
    vote_counts = labels_matrix.sum(axis=1)
    outlier_mask = vote_counts >= min_votes

    X_clean = X[~outlier_mask]
    n_removed = int(outlier_mask.sum())
    print(f"Removed {n_removed} outliers ({n_removed / len(X) * 100:.1f}%) from {len(X)} samples")
    return X_clean, outlier_mask


# Run the pipeline
X_clean, mask = remove_outliers(X_train, contamination=0.1, min_votes=2)

The min_votes parameter controls strictness. Set it to 1 to be aggressive (any detector flags it), or to 3 to only remove points that all three detectors agree on.
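The effect of min_votes is easy to see on the vote counts alone. A numpy-only sketch with a hand-made labels matrix (three detector columns, eight points – made-up values for illustration):

```python
import numpy as np

# stand-in binary votes: rows = points, columns = detectors
labels_matrix = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
])
vote_counts = labels_matrix.sum(axis=1)

for min_votes in (1, 2, 3):
    n_removed = int((vote_counts >= min_votes).sum())
    print(f"min_votes={min_votes}: removes {n_removed} of {len(vote_counts)} points")
```

Raising min_votes monotonically shrinks the removed set, which is why 3-of-3 is the most conservative setting.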

Evaluate Detection Quality

When you have ground truth labels, measure how well the pipeline performs using precision and recall.

from sklearn.metrics import classification_report

# Use majority vote labels from the combined pipeline
print(classification_report(y_train, combined_labels, target_names=["inlier", "outlier"]))

You can also check per-detector performance to see which algorithm works best for your data distribution:

from sklearn.metrics import roc_auc_score

for name, model in detectors.items():
    auc = roc_auc_score(y_train, model.decision_scores_)
    print(f"{name} ROC AUC: {auc:.3f}")

# Combined score AUC
combined_auc = roc_auc_score(y_train, combined_scores)
print(f"Combined (average) ROC AUC: {combined_auc:.3f}")

The combined score usually matches or beats the best individual detector. If one detector consistently underperforms, drop it from the ensemble – carrying dead weight lowers the combined AUC.
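A quick way to spot dead weight is a leave-one-out comparison: recompute the combined AUC with each detector removed. This sketch uses synthetic score columns – two informative detectors plus one pure-noise column standing in for a weak detector:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=300)  # stand-in ground truth labels

scores = {
    "det_a": y + rng.normal(scale=0.5, size=300),  # informative
    "det_b": y + rng.normal(scale=0.5, size=300),  # informative
    "det_c": rng.normal(size=300),                 # pure noise
}

full_auc = roc_auc_score(y, np.mean(list(scores.values()), axis=0))
print(f"all detectors: {full_auc:.3f}")

for name in scores:
    rest = [v for k, v in scores.items() if k != name]
    auc = roc_auc_score(y, np.mean(rest, axis=0))
    print(f"without {name}: {auc:.3f}")
```

Dropping the noise column should raise the combined AUC, while dropping either informative detector should lower it.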

Tuning the Contamination Rate

The contamination parameter tells each detector what fraction of the data to treat as outliers. If you set it too low, you miss real outliers. Too high, and you throw away good data.

A practical approach when you do not have labels:

  • Start with contamination=0.05 (5%)
  • Inspect the flagged points manually or with visualizations
  • Increase to 0.1 or 0.15 if you see obvious outliers surviving
  • Check downstream model performance with and without the flagged points removed

If your dataset has a known noise rate from a labeling process, use that as your starting contamination estimate.
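Sweeping contamination does not require refitting: the detector's raw scores stay the same and only the cutoff moves, since PyOD places the threshold at the (1 - contamination) percentile of the training scores. A numpy-only sketch with stand-in scores:

```python
import numpy as np

rng = np.random.default_rng(7)
scores = rng.normal(size=1000)  # stand-in for a detector's decision_scores_

for contamination in (0.05, 0.10, 0.15):
    # cut at the (1 - contamination) percentile of the score distribution
    threshold = np.percentile(scores, 100 * (1 - contamination))
    n_flagged = int((scores > threshold).sum())
    print(f"contamination={contamination:.2f}: {n_flagged} points flagged")
```

Inspect the points that enter the flagged set at each step; if the new additions look like ordinary data, you have gone too high.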

Common Errors and Fixes

ValueError: contamination must be in (0, 0.5] PyOD caps contamination at 50%. If your data is more than half outliers, you have a bigger problem than outlier detection. Filter or resample first.

ImportError: cannot import name 'IForest' from 'pyod.models.iforest' You probably have an old PyOD version. Run pip install --upgrade pyod to get the latest. The IForest wrapper has been stable since PyOD 0.9+.

LOF runs extremely slowly on large datasets LOF runs a k-nearest-neighbor query for every point, which degrades toward O(n^2) in high dimensions or when the tree index is ineffective. For datasets over 50k rows, switch to ECOD or IForest, both of which scale roughly linearly. Alternatively, subsample your data before fitting LOF.

Scores from different detectors are on wildly different scales This is expected. average itself does not rescale anything, so standardize first: pass the score matrix through pyod.utils.utility.standardizer (zero mean, unit variance per detector) or scipy.stats.zscore, then average the standardized columns.
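A minimal sketch of the standardize-then-average step with two made-up score columns on very different scales (pure numpy/scipy, no detectors needed):

```python
import numpy as np
from scipy.stats import zscore

s1 = np.array([0.10, 0.20, 0.90, 0.15])      # e.g. probabilities in [0, 1]
s2 = np.array([120.0, 130.0, 400.0, 125.0])  # e.g. raw distances

raw_avg = np.mean([s1, s2], axis=0)                  # s2 swamps s1
std_avg = np.mean([zscore(s1), zscore(s2)], axis=0)  # equal footing

print(raw_avg.round(2))
print(std_avg.round(2))
```

After zscore, each column has zero mean and unit variance, so both detectors contribute equally to the combined ranking.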

decision_scores_ has NaN values Usually caused by features containing NaN. PyOD detectors do not handle missing values. Impute or drop NaN rows before fitting:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X_train)

Majority vote flags zero outliers This happens when detectors disagree completely. Lower min_votes to 1 or 2. Also check that all detectors use the same contamination value – mismatched rates cause inconsistent thresholds.