I am using sklearn.feature_selection.RFECV to reduce the number of features in my final model. With non-cross-validated RFE, you can choose exactly how many features to select. With RFECV, however, you can only specify min_features_to_select, which acts more like a lower bound.
So how does RFECV drop features in each iteration? I understand normal RFE, but how does cross-validation come into play?
Here is my setup:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV

clf = GradientBoostingClassifier(loss='deviance', learning_rate=0.03, n_estimators=500,
                                 subsample=1.0, criterion='friedman_mse', min_samples_leaf=100,
                                 max_depth=7, max_features='sqrt', random_state=123)
rfe = RFECV(estimator=clf, step=1, min_features_to_select=35, cv=5, scoring='roc_auc',
            verbose=1, n_jobs=-1)
rfe.fit(X_train, y_train)
I could not find anything more specific in the documentation or user guide.
Your guess (now edited out) describes an algorithm that cross-validates the elimination step itself, but that is not how RFECV works. (Such an algorithm might indeed stabilize RFE itself, but it would not tell you anything about the optimal number of features, and that is the goal of RFECV.)
Instead, RFECV runs a separate RFE on each of the training folds, down to min_features_to_select. These runs are very likely to produce different elimination orders and different final feature sets, but none of that is taken into consideration: only the scores of the resulting models on the corresponding test fold, for each number of features, are retained. (Note that RFECV has a scoring parameter that RFE lacks.) Those scores are then averaged across folds, and the number of features with the best average score becomes the chosen n_features_. Finally, one last RFE is run on the entire dataset, down to that target number of features.
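To make that concrete, here is a minimal sketch of the logic described above, written with plain RFE and an explicit loop. The toy data from make_classification, the reduced n_estimators, and the variable names are my own choices for illustration; it is not RFECV's exact internal code (RFECV scores the estimator that each fold's RFE fits at every elimination step, whereas the sketch reconstructs each intermediate feature subset from ranking_ and refits, which yields the same models here).

import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
estimator = GradientBoostingClassifier(n_estimators=50, random_state=0)
min_features_to_select = 5
candidate_sizes = list(range(min_features_to_select, X.shape[1] + 1))

cv = StratifiedKFold(n_splits=5)
# fold_scores[i, j]: test-fold ROC AUC on fold i when keeping candidate_sizes[j] features
fold_scores = np.zeros((cv.get_n_splits(), len(candidate_sizes)))

for i, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    # 1) Run a full RFE on this training fold only, down to the minimum size.
    fold_rfe = RFE(estimator=estimator, step=1,
                   n_features_to_select=min_features_to_select)
    fold_rfe.fit(X[train_idx], y[train_idx])
    # 2) ranking_ encodes this fold's elimination order, so the feature subset
    #    RFE kept at every intermediate size can be reconstructed from it.
    for j, m in enumerate(candidate_sizes):
        keep = fold_rfe.ranking_ <= (m - min_features_to_select + 1)
        model = clone(estimator).fit(X[train_idx][:, keep], y[train_idx])
        proba = model.predict_proba(X[test_idx][:, keep])[:, 1]
        fold_scores[i, j] = roc_auc_score(y[test_idx], proba)

# 3) Average across folds; the best average score picks the number of features.
mean_scores = fold_scores.mean(axis=0)
best_size = candidate_sizes[int(np.argmax(mean_scores))]

# 4) One last RFE on the entire dataset, down to that target number of features.
final_rfe = RFE(estimator=estimator, step=1, n_features_to_select=best_size)
final_rfe.fit(X, y)
print(best_size, final_rfe.support_)

On the fitted RFECV itself you can inspect the averaged curve it optimized over: grid_scores_ in older scikit-learn versions, cv_results_ in newer ones.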