Tags: python, scikit-learn, rfe

How does cross-validated recursive feature elimination drop features in each iteration (sklearn RFECV)?


I am using sklearn.feature_selection.RFECV to reduce the number of features in my final model. With non-cross-validated RFE, you can choose exactly how many features to select. With RFECV, however, you can only specify min_features_to_select, which acts as a lower bound.
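For concreteness, here is a minimal sketch of the contrast between the two APIs (LogisticRegression just stands in for any estimator exposing coef_ or feature_importances_; the numbers are illustrative):

from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()  # placeholder estimator for illustration

# RFE: the final number of features is fixed exactly.
rfe = RFE(estimator=clf, n_features_to_select=35, step=1)
# RFECV: 35 is only a floor; cross-validation picks the final count (>= 35).
rfecv = RFECV(estimator=clf, min_features_to_select=35, step=1, cv=5)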

So how does RFECV drop features in each iteration? I understand normal RFE, but how does cross-validation come into play?

Here are my estimator and selector instances:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV

clf = GradientBoostingClassifier(loss='deviance', learning_rate=0.03, n_estimators=500,
                                 subsample=1.0, criterion='friedman_mse', min_samples_leaf=100,
                                 max_depth=7, max_features='sqrt', random_state=123)
rfe = RFECV(estimator=clf, step=1, min_features_to_select=35, cv=5, scoring='roc_auc',
            verbose=1, n_jobs=-1)
rfe.fit(X_train, y_train)

I could not find anything more specific in the documentation or user guide.


Solution

  • Your guess (since edited out) describes an algorithm that cross-validates the elimination step itself, but that is not how RFECV works. (Such an algorithm might indeed stabilize RFE itself, but it wouldn't tell you the optimal number of features, and finding that number is the goal of RFECV.)

    Instead, RFECV runs a separate RFE on each of the training folds, down to min_features_to_select. These runs are very likely to produce different elimination orders and different final feature sets, but none of that is taken into consideration: only the scores of the resulting models on the test fold, for each number of features, are retained. (Note that RFECV has a scoring parameter that RFE lacks.) Those scores are then averaged across folds, and the best average score determines the chosen n_features_. Finally, one last RFE is run on the entire dataset with that target number of features. A sketch of this procedure follows below.

    source code
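
    Here is a minimal sketch of that internal logic (illustrative names and toy data, not the actual scikit-learn source). Per training fold, it runs one RFE-style elimination pass, scoring every intermediate feature count on the held-out fold; the average scores pick n_features_, and a final RFE on all the data picks which features:

    import numpy as np
    from sklearn.base import clone
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.feature_selection import RFE
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold

    X, y = make_classification(n_samples=500, n_features=20, random_state=123)
    clf = GradientBoostingClassifier(n_estimators=25, random_state=123)  # small, for speed
    min_features_to_select = 5

    cv = StratifiedKFold(n_splits=5)
    fold_scores = {n: 0.0 for n in range(min_features_to_select, X.shape[1] + 1)}

    for train_idx, test_idx in cv.split(X, y):
        features = list(range(X.shape[1]))  # each fold starts from the full feature set
        while True:
            est = clone(clf).fit(X[np.ix_(train_idx, features)], y[train_idx])
            # Score the current subset on this fold's held-out data.
            proba = est.predict_proba(X[np.ix_(test_idx, features)])[:, 1]
            fold_scores[len(features)] += roc_auc_score(y[test_idx], proba)
            if len(features) <= min_features_to_select:
                break
            # RFE's elimination rule: drop the least important remaining feature.
            features.pop(int(np.argmin(est.feature_importances_)))

    # Average over folds; the best average score fixes n_features_ ...
    avg_scores = {n: s / cv.get_n_splits() for n, s in fold_scores.items()}
    best_n = max(avg_scores, key=avg_scores.get)
    # ... and one last RFE on the *entire* dataset selects the features themselves.
    final = RFE(clone(clf), n_features_to_select=best_n, step=1).fit(X, y)
    print(best_n, final.support_)

    Note that the per-fold elimination orders never have to agree; only the (feature count, test-fold score) pairs survive into the averaging, which is why the final refit on the whole dataset is needed to decide which features are actually kept.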