scikit-learn random-forest rfe

RFECV is starting over with the original number of features


I'm trying to use RFECV to suggest the optimal number of features I should keep in X to predict y. My X is a DataFrame with 121 variables (a mix of dtypes, some continuous, some categorical) and my y is a single-column DataFrame containing 0s and 1s. I put this code together with help from some other great Stack Overflow answers:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Base estimator whose feature importances drive the elimination
rfe = RandomForestClassifier(random_state=32)
# Drop 5 features per step; score each subset with 2-fold stratified CV
rfecv = RFECV(estimator=rfe, step=5, cv=StratifiedKFold(2), scoring="accuracy", verbose=3)
fit = rfecv.fit(X, y.values.ravel())
print("Optimal number of features : %d" % rfecv.n_features_)

I intentionally did not set a minimum number of features (min_features_to_select) because I figured I would settle for whatever RFECV delivered (if it's one feature, I accept that). I chose step=5 because eliminating one feature at a time was taking forever (I have 121 variables that all truly seem important to keep at this stage).

The whole operation looked like it was going fine... until RFECV started over with 121 features. Here's what the output looks like. No final suggested number of features is emerging:

Fitting estimator with 121 features.
Fitting estimator with 116 features.
Fitting estimator with 111 features.
Fitting estimator with 106 features.
Fitting estimator with 101 features.
Fitting estimator with 96 features.
Fitting estimator with 91 features.
Fitting estimator with 86 features.
Fitting estimator with 81 features.
Fitting estimator with 76 features.
Fitting estimator with 71 features.
Fitting estimator with 66 features.
Fitting estimator with 61 features.
Fitting estimator with 56 features.
Fitting estimator with 51 features.
Fitting estimator with 46 features.
Fitting estimator with 41 features.
Fitting estimator with 36 features.
Fitting estimator with 31 features.
Fitting estimator with 26 features.
Fitting estimator with 21 features.
Fitting estimator with 16 features.
Fitting estimator with 11 features.
Fitting estimator with 6 features.
Fitting estimator with 121 features.
Fitting estimator with 116 features.

This is my first time doing this, and I have no idea why this is happening or how to fix my RFECV so that it gives me a suggested number of features and I can move forward. Should I bite the bullet and reduce to step=1 (even though it's going to take a lifetime)? Any advice?


Solution

  • Nothing is actually starting over. RFECV runs a complete RFE on each training fold of your cv, so the first block of lines is printed by the RFE on the first training fold and the next block by the RFE on the second training fold. There will be one more block printed for an RFE on the entire training set, but that one will end early (at the selected number of features).

    See also https://stackoverflow.com/a/65557483/10495893
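
To make the log pattern described above concrete, here is a minimal pure-Python sketch that imitates the verbose output instead of calling scikit-learn. It assumes the behaviour described in the answer: one full elimination pass per cv fold (down to the default minimum of one feature), followed by a final pass on all the data that stops at the selected number of features. The function names are hypothetical, and selected=46 is a made-up value chosen only for illustration, not a result from the question's data.

```python
def rfe_verbose_counts(n_features, step, stop_at=1):
    """Feature counts that a single RFE pass would log before reaching stop_at."""
    counts = []
    remaining = n_features
    while remaining > stop_at:
        counts.append(remaining)  # one "Fitting estimator with N features." line
        remaining -= min(step, remaining - stop_at)  # never drop below stop_at
    return counts


def rfecv_verbose_counts(n_features, step, n_folds, selected):
    """Per-fold passes followed by the shorter final pass on all the data."""
    log = []
    for _ in range(n_folds):
        # Each fold runs the full elimination, so the counts restart at n_features
        log += rfe_verbose_counts(n_features, step)
    # Final refit on the whole training set stops at the selected count
    log += rfe_verbose_counts(n_features, step, stop_at=selected)
    return log


# selected=46 is hypothetical; the real value is whatever RFECV ends up choosing
log = rfecv_verbose_counts(n_features=121, step=5, n_folds=2, selected=46)
for n in log:
    print(f"Fitting estimator with {n} features.")
```

With the question's settings (121 features, step=5, 2 folds), each fold contributes the 121-down-to-6 block seen in the output, so a second block starting at 121 is expected. Under these assumptions the fit only needs to be left running: the third block is the last one, and rfecv.n_features_ is available once it finishes.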