Most real-world datasets carry dead weight. Redundant columns, noisy signals, features that correlate with nothing useful. Training a model on all of them wastes compute, inflates overfitting risk, and makes your pipeline harder to debug. Feature selection fixes this by keeping only the columns that actually help your model predict.
This guide walks through four approaches to feature importance and selection using scikit-learn: tree-based importance, permutation importance, recursive feature elimination, and an automated pipeline that chains them together. Every example uses real sklearn datasets so you can run the code directly.
Tree-Based Feature Importance
Random forests and gradient-boosted trees track how much each feature reduces impurity (Gini or entropy) across all splits. Scikit-learn exposes this as the feature_importances_ attribute after fitting; the values are normalized so they sum to 1.
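A minimal example on the breast cancer dataset (the dataset choice and hyperparameters here are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# 569 samples, 30 numeric features, binary target
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```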
This gives you a fast first look at which features the forest relies on. But there is a catch: impurity-based importance is biased toward high-cardinality and continuous features. A random ID column with many unique values can score high even though it has zero predictive value. That is where permutation importance comes in.
Permutation Importance
Permutation importance measures how much your model’s score drops when you shuffle a single feature’s values. If shuffling a column tanks accuracy, that feature matters. If shuffling does nothing, the feature is expendable.
This approach is model-agnostic and does not suffer from the cardinality bias of tree-based importance.
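A sketch on the breast cancer dataset, ending with a boxplot of the importance distribution across repeats (the model, split, and n_repeats=30 are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to show the plot
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature 30 times on the held-out set and record the score drop
result = permutation_importance(model, X_test, y_test,
                                n_repeats=30, random_state=42, n_jobs=-1)

# Boxplot of the 10 most important features, one box per feature
top = result.importances_mean.argsort()[::-1][:10]
plt.boxplot(result.importances[top].T, vert=False)
plt.yticks(range(1, 11), X.columns[top])
plt.xlabel("Drop in accuracy when shuffled")
plt.tight_layout()
```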
The boxplot shows variance across repeats. Features with wide boxes have unstable importance, which often means they interact with other features or carry noise. Use permutation importance on the test set to avoid overfitting the importance estimates themselves.
One practical tip: always compare tree-based and permutation importance side by side. Features that rank high in both methods are your safest bets. Features that rank high only in tree-based importance are suspect.
Recursive Feature Elimination (RFE)
RFE takes a different approach. Instead of scoring features independently, it fits the model, drops the least important feature, refits, drops the next, and repeats until you reach the desired number. RFECV wraps this with cross-validation to automatically find the optimal count.
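A sketch with RFECV wrapped around a logistic regression (the estimator, scoring metric, and CV scheme are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # helps logistic regression converge

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,                    # drop one feature per iteration
    cv=StratifiedKFold(5),
    scoring="accuracy",
    n_jobs=-1,
)
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)
print("Kept:", selector.support_.sum(), "of", X.shape[1])
```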
RFECV is slower than the other methods because it fits the model many times. For datasets with hundreds of features, set step to something larger than 1 (like 5 or 10) to drop multiple features per round and cut runtime significantly.
Building an Automated Selection Pipeline
You can chain feature selection directly into a scikit-learn Pipeline so the selection step runs during fit() and transforms new data automatically during predict(). This prevents data leakage and keeps your preprocessing reproducible.
Here is a pipeline that uses SelectKBest with mutual information, followed by a model-based selector, then a final classifier:
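A sketch of such a pipeline (k=15, the forest size, and the median threshold are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # Filter step: keep the 15 features with the highest mutual information
    ("filter", SelectKBest(mutual_info_classif, k=15)),
    # Model-based step: keep features above the forest's median importance
    ("model_select", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold="median")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Selection re-runs inside every fold, so nothing leaks from the validation data
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```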
The key advantage of wrapping selection in a pipeline is that feature selection happens inside cross-validation folds. If you select features on the full dataset and then cross-validate, you leak information from the validation set into the selection step. The pipeline approach avoids this entirely.
Common Errors and Fixes
ValueError: Input contains NaN – Feature selection methods do not handle missing values. Impute or drop NaN rows before fitting. Add SimpleImputer as the first step in your pipeline.
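For example (the imputation strategy and selector settings here are placeholder choices):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X[::10, 0] = np.nan  # simulate missing values

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaNs first
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", RandomForestClassifier(random_state=42)),
])
pipe.fit(X, y)  # no ValueError: the selector never sees NaNs
```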
SelectKBest with k larger than the number of features – If you set k=50 but your dataset has 30 features, scikit-learn raises an error. Either set k="all" or make sure k does not exceed X.shape[1].
Permutation importance is slow on large datasets – Reduce n_repeats from 30 to 10. Subsample the evaluation set with max_samples=0.5 to cut the computation in half:
Tree-based importance shows random features as important – This is the cardinality bias mentioned earlier. Always validate with permutation importance. Add a random noise column to your data as a baseline. Any real feature that scores below the noise column should be dropped.
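A sketch of the noise-column baseline (the column name and model are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
rng = np.random.default_rng(42)
X["random_noise"] = rng.normal(size=len(X))  # pure-noise baseline column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=42)
imp = pd.Series(result.importances_mean, index=X.columns)

# Any real feature scoring at or below the noise column is a drop candidate
noise_score = imp["random_noise"]
print("At or below the noise baseline:", imp[imp <= noise_score].index.tolist())
```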
RFECV takes forever – Increase the step parameter. Setting step=5 removes 5 features per iteration instead of 1. For datasets with thousands of features, start with a filter method like SelectKBest to narrow down to 100-200 candidates, then run RFECV on the reduced set.
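A sketch of the two-stage approach on a synthetic wide dataset (the feature counts and estimators are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a high-dimensional problem
X, y = make_classification(n_samples=500, n_features=300, n_informative=15,
                           random_state=42)

pipe = Pipeline([
    # Cheap filter first: 300 -> 100 candidate features
    ("filter", SelectKBest(f_classif, k=100)),
    # RFECV on the reduced set, dropping 5 features per round
    ("rfecv", RFECV(LogisticRegression(max_iter=1000), step=5, cv=3, n_jobs=-1)),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

print("Features kept by RFECV:", pipe.named_steps["rfecv"].n_features_)
```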
SelectFromModel with threshold="median" keeps too many or too few features – Try setting an explicit threshold like threshold="1.25*mean" or a float value. You can also use max_features to set a hard cap on the number of selected features.
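For example (the estimator and the cap of 10 are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# Scaled-mean threshold: keep features whose importance exceeds 1.25x the mean,
# with max_features as a hard cap on top of the threshold
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="1.25*mean",
    max_features=10,
).fit(X, y)

X_reduced = selector.transform(X)
print(X_reduced.shape)
```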
Related Guides
- How to Build a Data Contamination Detection Pipeline for LLM Training
- How to Build a Data Outlier Detection Pipeline with PyOD
- How to Build a Data Lake Ingestion Pipeline with MinIO and PyArrow
- How to Build a Data Freshness Monitoring Pipeline with Python
- How to Build a Data Versioning Pipeline with Delta Lake for ML
- How to Build a Data Labeling Pipeline with Label Studio
- How to Build a Dataset Export Pipeline with Multiple Format Support
- How to Build a Synthetic Tabular Data Pipeline with CTGAN
- How to Build a Feature Engineering Pipeline with Featuretools
- How to Build a Data Quality Pipeline with Cleanlab