Most real-world datasets carry dead weight. Redundant columns, noisy signals, features that correlate with nothing useful. Training a model on all of them wastes compute, inflates overfitting risk, and makes your pipeline harder to debug. Feature selection fixes this by keeping only the columns that actually help your model predict.

This guide walks through four approaches to feature importance and selection using scikit-learn: tree-based importance, permutation importance, recursive feature elimination, and an automated pipeline that chains them together. Every example uses real sklearn datasets so you can run the code directly.

Tree-Based Feature Importance

Random forests and gradient boosted trees track how much each feature reduces impurity (Gini or entropy) across all splits. Scikit-learn exposes this as the feature_importances_ attribute after fitting.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a real dataset with 30 features
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a random forest and extract importances
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

# Plot top 15 features
top_n = 15
plt.figure(figsize=(10, 6))
plt.barh(
    range(top_n),
    importances[indices[:top_n]][::-1],
    color="#34d399",
    edgecolor="#064e3b",
)
plt.yticks(range(top_n), feature_names[indices[:top_n]][::-1])
plt.xlabel("Mean Decrease in Impurity")
plt.title("Top 15 Features by Tree-Based Importance")
plt.tight_layout()
plt.savefig("feature_importance.png", dpi=150)
plt.show()

print(f"Test accuracy with all {X.shape[1]} features: {rf.score(X_test, y_test):.4f}")

This gives you a fast first look at which features the forest relies on. But there is a catch: impurity-based importance is biased toward high-cardinality and continuous features. A random ID column with many unique values can score high even though it has zero predictive value. That is where permutation importance comes in.

Permutation Importance

Permutation importance measures how much your model’s score drops when you shuffle a single feature’s values. If shuffling a column tanks accuracy, that feature matters. If shuffling does nothing, the feature is expendable.

This approach is model-agnostic and does not suffer from the cardinality bias of tree-based importance.

from sklearn.inspection import permutation_importance

# Compute permutation importance on the test set
perm_result = permutation_importance(
    rf, X_test, y_test, n_repeats=30, random_state=42, n_jobs=-1
)

# Sort by mean importance
perm_sorted_idx = perm_result.importances_mean.argsort()[::-1]

plt.figure(figsize=(10, 6))
plt.boxplot(
    perm_result.importances[perm_sorted_idx[:top_n]].T,
    vert=False,
    labels=feature_names[perm_sorted_idx[:top_n]],
)
plt.xlabel("Decrease in Accuracy")
plt.title("Top 15 Features by Permutation Importance")
plt.tight_layout()
plt.savefig("permutation_importance.png", dpi=150)
plt.show()

# Print features with near-zero importance
low_importance = feature_names[perm_result.importances_mean < 0.001]
print(f"Features with negligible permutation importance: {list(low_importance)}")

The boxplot shows variance across repeats. Features with wide boxes have unstable importance, which often means they interact with other features or carry noise. Compute permutation importance on a held-out test set: estimates computed on the training data can be inflated by patterns the model has memorized rather than learned.

One practical tip: always compare tree-based and permutation importance side by side. Features that rank high in both methods are your safest bets. Features that rank high only in tree-based importance are suspect.
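A minimal way to run that side-by-side check is to rank every feature under both methods and look at the disagreement. This is a sketch, not a standard sklearn utility; the DataFrame column names are just illustrative:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

perm = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)

# Rank 1 = most important under each method
comparison = pd.DataFrame({
    "feature": data.feature_names,
    "tree_rank": (-rf.feature_importances_).argsort().argsort() + 1,
    "perm_rank": (-perm.importances_mean).argsort().argsort() + 1,
})
# Large gaps flag features the two methods disagree on
comparison["rank_gap"] = (comparison["tree_rank"] - comparison["perm_rank"]).abs()
print(comparison.sort_values("rank_gap", ascending=False).head(10))
```

Features near the top of this table, especially those ranked high only by `tree_rank`, are the ones worth double-checking before you trust them.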

Recursive Feature Elimination (RFE)

RFE takes a different approach. Instead of scoring features independently, it fits the model, drops the least important feature, refits, drops the next, and repeats until you reach the desired number. RFECV wraps this with cross-validation to automatically find the optimal count.

from sklearn.feature_selection import RFECV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

gbc = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rfecv = RFECV(
    estimator=gbc,
    step=1,
    cv=cv,
    scoring="accuracy",
    min_features_to_select=5,
    n_jobs=-1,
)
rfecv.fit(X_train, y_train)

print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected features: {list(feature_names[rfecv.support_])}")
print(f"Test accuracy with selected features: {rfecv.score(X_test, y_test):.4f}")

# Plot number of features vs. cross-validation score
n_features_range = range(rfecv.min_features_to_select, len(feature_names) + 1)
plt.figure(figsize=(8, 5))
plt.plot(n_features_range, rfecv.cv_results_["mean_test_score"], marker="o", markersize=3)
plt.fill_between(
    n_features_range,
    rfecv.cv_results_["mean_test_score"] - rfecv.cv_results_["std_test_score"],
    rfecv.cv_results_["mean_test_score"] + rfecv.cv_results_["std_test_score"],
    alpha=0.2,
)
plt.xlabel("Number of Features")
plt.ylabel("Cross-Validation Accuracy")
plt.title("RFECV: Accuracy vs. Number of Features")
plt.tight_layout()
plt.savefig("rfecv_scores.png", dpi=150)
plt.show()

RFECV is slower than the other methods because it fits the model many times. For datasets with hundreds of features, set step to something larger than 1 (like 5 or 10) to drop multiple features per round and cut runtime significantly.

Building an Automated Selection Pipeline

You can chain feature selection directly into a scikit-learn Pipeline so the selection step runs during fit() and transforms new data automatically during predict(). This prevents data leakage and keeps your preprocessing reproducible.

Here is a pipeline that uses SelectKBest with mutual information, followed by a model-based selector, then a final classifier:

from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif, SelectFromModel
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pipeline: scale -> filter by mutual info -> model-based selection -> classify
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("mutual_info", SelectKBest(score_func=mutual_info_classif, k=20)),
    ("model_selection", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold="median",
    )),
    ("classifier", GradientBoostingClassifier(
        n_estimators=100, max_depth=3, random_state=42
    )),
])

# Cross-validate the entire pipeline
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Fit and evaluate on test set
pipeline.fit(X_train, y_train)
test_score = pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_score:.4f}")

# Inspect which features survived both selection steps
mi_selector = pipeline.named_steps["mutual_info"]
model_selector = pipeline.named_steps["model_selection"]

mi_mask = mi_selector.get_support()
mi_features = data.feature_names[mi_mask]

model_mask = model_selector.get_support()
final_features = mi_features[model_mask]

print(f"Features after mutual info filter: {len(mi_features)}")
print(f"Features after model-based selection: {len(final_features)}")
print(f"Final selected features: {list(final_features)}")

The key advantage of wrapping selection in a pipeline is that feature selection happens inside cross-validation folds. If you select features on the full dataset and then cross-validate, you leak information from the validation set into the selection step. The pipeline approach avoids this entirely.
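To make the leakage concrete, here is a sketch on pure-noise data (the sizes are arbitrary). With random labels, true accuracy is about 50%, yet selecting features before cross-validation inflates the estimate, while the pipeline version stays near chance:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))   # pure-noise features
y = rng.integers(0, 2, size=100)   # random labels: no real signal

# Wrong: select on the full dataset, then cross-validate
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_selected, y, cv=5)

# Right: selection happens inside each fold
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=5)

print(f"Leaky CV accuracy:  {leaky.mean():.3f}")   # typically well above chance
print(f"Honest CV accuracy: {honest.mean():.3f}")  # typically near 0.5
```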

Common Errors and Fixes

ValueError: Input contains NaN – Feature selection methods do not handle missing values. Impute or drop NaN rows before fitting. Add SimpleImputer as the first step in your pipeline.
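A minimal sketch of that fix, with an imputer as the first pipeline step (the step names and k value here are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fills NaN before selection
    ("select", SelectKBest(score_func=mutual_info_classif, k=10)),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Demo: inject 5% missing values, then fit without a ValueError
data = load_breast_cancer()
X = data.data.copy()
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan

pipeline.fit(X, data.target)
print(f"Selected {pipeline.named_steps['select'].get_support().sum()} features")
```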

SelectKBest with k larger than the number of features – If you set k=50 but your dataset has 30 features, scikit-learn raises an error. Either set k="all" or make sure k does not exceed X.shape[1].
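One simple guard is to clamp k to the column count before constructing the selector:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Clamp k so it never exceeds the number of available columns
k = min(50, X.shape[1])
selector = SelectKBest(score_func=f_classif, k=k)
X_new = selector.fit_transform(X, y)
print(f"Kept {X_new.shape[1]} of {X.shape[1]} features")
```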

Permutation importance is slow on large datasets – Reduce n_repeats from 30 to 10. Subsample the evaluation set with max_samples=0.5 to cut the computation in half:

perm_result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1, max_samples=0.5
)

Tree-based importance shows random features as important – This is the cardinality bias mentioned earlier. Always validate with permutation importance. Add a random noise column to your data as a baseline. Any real feature that scores below the noise column should be dropped.
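One way to set up that noise-column baseline (the keep-or-drop rule below is a heuristic sketch, not an sklearn API):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rng = np.random.default_rng(42)

# Append one pure-noise column as a baseline
noise = rng.normal(size=(data.data.shape[0], 1))
X_aug = np.hstack([data.data, noise])

rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_aug, data.target)

# Any real feature scoring below the noise column is suspect
noise_importance = rf.feature_importances_[-1]
keep = [
    name
    for name, imp in zip(data.feature_names, rf.feature_importances_[:-1])
    if imp > noise_importance
]
print(f"Noise baseline importance: {noise_importance:.4f}")
print(f"Features beating the noise baseline: {len(keep)} of {len(data.feature_names)}")
```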

RFECV takes forever – Increase the step parameter. Setting step=5 removes 5 features per iteration instead of 1. For datasets with thousands of features, start with a filter method like SelectKBest to narrow down to 100-200 candidates, then run RFECV on the reduced set.
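A sketch of that two-stage approach on a synthetic wide dataset (the sizes and estimator are placeholder choices; in a real workflow both stages would go inside a pipeline, as in the previous section, to avoid leakage):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic wide dataset: 1,000 features, only a few informative
X, y = make_classification(
    n_samples=500, n_features=1000, n_informative=10, random_state=42
)

# Stage 1: cheap univariate filter down to 100 candidates
filt = SelectKBest(score_func=f_classif, k=100).fit(X, y)
X_reduced = filt.transform(X)

# Stage 2: RFECV on the reduced set, dropping 5 features per round
rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=5,
    cv=3,
    scoring="accuracy",
    min_features_to_select=5,
)
rfecv.fit(X_reduced, y)
print(f"Features kept after both stages: {rfecv.n_features_}")
```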

SelectFromModel with threshold="median" keeps too many or too few features – Try setting an explicit threshold like threshold="1.25*mean" or a float value. You can also use max_features to set a hard cap on the number of selected features.