pythonmachine-learningscikit-learncross-validation

How does scikit's RFECV class compute cv_results_?


My understanding of Recursive Feature Elimination Cross Validation: (sklearn.feature_selection.RFECV) You provide an algorithm which is trained on the entire dataset and creates a feature importance ranking using attributes coef_ or feature_importances_. Now with all features included, this algorithm is evaluated by cross validation. Then the feature ranked at the bottom is removed and the model is retrained on the dataset and creates a new ranking, once again assessed by cross validation. This continues until all but one feature remain (or as specified by min_features_to_select), and the final number of features chosen depends on what yielded the highest CV score. (Source)

Question: The CV score for each number of features is stored in rfecv.cv_results_["mean_test_score"], and I've been facing trouble trying to replicate these scores without using scikit's built in method.

This is what I have tried to obtain the score for n-1 features, where n is the total number of features.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.feature_selection import RFECV

alg = DecisionTreeClassifier(random_state = 0)
cv_split = StratifiedKFold(5)
# train is a pandas dataframe, x_var and y_var are both lists containing variable strings
X = train[x_var]
y = np.ravel(train[y_var])

alg.fit(X, y)
lowest_ranked_feature = np.argmin(alg.feature_importances_)
x_var.pop(lowest_ranked_feature)

one_removed_feature = train[x_var]
alg.fit(one_removed_feature, y)
cv_score = cross_validate(alg, one_removed_feature, y, cv=cv_split, scoring="accuracy")
np.mean(cv_score["test_score"])

And this is the inbuilt method that provides a different score:

rfecv = RFECV(
    estimator=alg,
    step=1,
    cv=cv_split,
    scoring="accuracy",
)

rfecv.fit(X, y)
rfecv.cv_results_["mean_test_score"][-2]

How do I get the exact scores as calculated in the inbuilt method?

I would also like to mention that I did try this first with all n features, and my method matched with rfecv.cv_results_["mean_test_score"][-1].


Solution

  • As pointed out by Ben, the reason you are finding different answers is because your understanding of RFECV is fundamentally flawed. Cross validation is not implemented at each step within RFE, but rather RFE is implemented within each fold of cross validation.

    Your method currently removes a feature from the data and then performs CV to score it, and essentially performs a different CV to score each subset of features. However, RFECV performs CV only a single time at the start, and then cycles between removing a feature and scoring the model within a singular fold.

    Note: From experimentation, I have found that sklearn RFECV computes the score on a singular fold directly by using whichever evaluation metric you specify, and not another layer of cross validation or anything. In your case, you have chosen scoring="accuracy". So within a fold, the model will score the test split based on accuracy, and repeat this for each subset of features.

    You can implement RFECV from scratch as follows:

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import accuracy_score
    
    def score(alg, x_train, y_train, x_test, y_test):
        """Calculate the accuracy score of the algorithm."""
        alg.fit(x_train, y_train)
        y_pred = alg.predict(x_test)
        return accuracy_score(y_test, y_pred)
    
    def rfecv_function(alg, x_var, X, y, cv):
        """Perform RFECV and return a dictionary to store an array of test scores
        for each cv, where the array contains test scores for each number of 
        features selected 1,2,...,n (n is total number of features).
        """
        dic = {}
        # Iterate through folds
        for fold_index, (train_index, test_index) in enumerate(cv.split(X, y)):
            x_train_fold, y_train_fold = X.iloc[train_index], y.iloc[train_index]
            x_test_fold, y_test_fold = X.iloc[test_index], y.iloc[test_index]
            features = x_var.copy()
            
            # Array to store test scores for each feature subset
            scores_array = np.empty(len(x_var))
            
            # Iterate through the feature subsets
            for i in range(len(x_var)):
                # Calculate and store the scores in the array
                scores = score(alg, x_train_fold[features], y_train_fold, 
                               x_test_fold[features], y_test_fold)
                scores_array[-i-1] = scores 
                # Find and remove the lowest ranked feature
                alg.fit(x_train_fold[features], y_train_fold)
                lowest_rank = features[np.argmin(alg.feature_importances_)]
                features.remove(lowest_rank)
                
            dic[f"split{fold_index}_test_score"] = scores_array
        
        return dic
    
    dtree = DecisionTreeClassifier(random_state = 0)
    cv_split = StratifiedKFold(5, shuffle=True, random_state=0)
    
    # Assume train is a pandas dataframe
    x_var = ["var1", "var2", "var3"]
    y_var = ["target_var"]
    rfecv_scores = rfecv_function(dtree, x_var, train[x_var], train[y_var], cv_split)
    

    The rfecv_scores provide identical values to those computed in the inbuilt class RFECV.cv_results_