python · scikit-learn · rfe

Sklearn RFE, pipeline and cross validation


I'm trying to figure out how to use RFE for regression problems, and I was reading some tutorials.

I found an example of how to use RFECV to automatically select the ideal number of features, and it goes something like this:

import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV

# X is the feature matrix, target the labels
rfecv = RFECV(estimator=RandomForestClassifier(random_state=101), step=1, cv=StratifiedKFold(10), scoring='accuracy')
rfecv.fit(X, target)
# indices of the discarded features
print(np.where(rfecv.support_ == False)[0])

which I find pretty straightforward.

However, I was checking how to do the same thing using an RFE object, but in order to include cross-validation I only found solutions involving the use of pipelines, like:

from numpy import mean

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# create pipeline: RFE feature selection followed by the final model
rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=5)
model = DecisionTreeRegressor()
pipeline = Pipeline(steps=[('s', rfe), ('m', model)])
# evaluate model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')

# report performance
print(f'MAE: {mean(n_scores):.3f}')

I'm not sure precisely what is happening here. The pipeline is used to chain the RFE algorithm and the second DecisionTreeRegressor (the model). If I'm not mistaken, the idea is that for every iteration of the cross-validation, the RFE is executed, the desired number of best features is selected, and then the second model is run using only those features. But how/when did the RFE pass the information about which features have been selected to the DecisionTreeRegressor? Did that even happen, or is the code missing this part?


Solution

  • Well, first, let's point out that RFECV and RFE are doing two separate jobs in your script: the former selects the optimal number of features, while the latter selects the five most important features (or rather, the best combination of 5 features, given their importance for the DecisionTreeRegressor).
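
    To make the difference concrete, here is a small self-contained sketch (separate from the example further below) contrasting the two objects; the dataset and the random_state are just placeholders:

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE, RFECV
    from sklearn.tree import DecisionTreeClassifier
    
    X, y = load_breast_cancer(return_X_y=True)
    
    # RFECV searches for the optimal number of features via cross-validation
    rfecv = RFECV(estimator=DecisionTreeClassifier(random_state=0), cv=5).fit(X, y)
    print(rfecv.n_features_)  # number of features chosen by RFECV itself
    
    # RFE keeps exactly the number of features you ask for
    rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5).fit(X, y)
    print(rfe.n_features_)    # always 5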

    Back to your question: "When did the RFE pass the information about which features have been selected to the Decision Tree?" It is worth noting that the RFE does not explicitly tell the Decision Tree which features are selected. It simply takes a matrix as input (the training set) and transforms it into a matrix of N columns, based on the n_features_to_select=N parameter. That matrix (i.e., the transformed training set) is then passed as input to the Decision Tree, along with the target variable, which returns a fitted model that can be used to predict unseen instances.

    Let's dive into an example for classification:

    """ Import dependencies and load data """
    import numpy as np
    import pandas as pd
    
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE
    from sklearn.metrics import precision_score
    from sklearn.tree import DecisionTreeClassifier
    
    X, y = load_breast_cancer(return_X_y=True)
    rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=2)
    

    We have now loaded the breast_cancer dataset and instantiated an RFE object (I used a DecisionTreeClassifier, but other algorithms can be used as well).

    To see how the training data is handled within a pipeline, let's start with a manual example that shows how a pipeline would work if decomposed into its "basic steps":

    from sklearn.model_selection import train_test_split
    
    def test_and_train(X, y, random_state):
        # For simplicity, let's use 80%-20% splitting
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
    
        # Fit and transform the training data by applying Recursive Feature Elimination
        X_train_transformed = rfe.fit_transform(X_train, y_train)
        # Transform the testing data to select the same features
        X_test_transformed = rfe.transform(X_test)  
    
        print(X_train[0:3])
        print(X_train_transformed[0:3])
        print(X_test_transformed[0:3])
    
        # Train on the transformed trained data
        fitted_model = DecisionTreeClassifier().fit(X_train_transformed, y_train)
    
        # Predict on the transformed testing data
        y_pred = fitted_model.predict(X_test_transformed)
    
        print('True labels: ', y_test)
        print('Predicted labels:', y_pred)
    
        return y_test, y_pred
    
    precisions = list() # to store the precision scores (can be replaced by any other evaluation measure)
    
    y_test, y_pred = test_and_train(X, y, 42)
    precisions.append(precision_score(y_test, y_pred))
    
    y_test, y_pred = test_and_train(X, y, 84)
    precisions.append(precision_score(y_test, y_pred))
    
    y_test, y_pred = test_and_train(X, y, 168)
    precisions.append(precision_score(y_test, y_pred))
    
    print('Average precision:', np.mean(precisions))
    """
    Average precision: 0.92
    """
    

    In the above script, we created a function that, given a dataset X and a target variable y:

    1. Creates a training and testing set following the 80%-20% splitting rule.
    2. Transforms them using RFE (i.e., selects the best 2 features, as specified in the former code snippet). Calling fit_transform on the RFE runs the Recursive Feature Elimination and saves information about the selected features in the object's state; to know which features were selected, call rfe.support_. Note: on the testing set only transform is executed, so that the features flagged in rfe.support_ are kept and all others are filtered out of the testing set (see the snippet after this list).
    3. Fits a model and returns a tuple (y_test, y_pred).
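
    As a quick illustration of the note in step 2, the boolean mask in rfe.support_ is exactly what transform applies to any matrix it receives, so selecting the columns by hand gives the same result. A small sketch, assuming rfe has already been fitted and X_test comes from a split like the one inside the function:

    import numpy as np
    
    print(rfe.support_)  # boolean mask, True for the 2 selected features
    print(rfe.ranking_)  # rank 1 marks the selected features
    
    # transform simply applies the mask column-wise, so these two are identical
    print(np.array_equal(rfe.transform(X_test), X_test[:, rfe.support_]))  # True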

    The y_test and y_pred can be used to analyze the performance of the model, e.g., its precision. The precision is saved in a list, and the procedure is repeated 3 times. Finally, we print the average precision.

    We simulated a cross-validation procedure by splitting the original data 3 times into their respective training and testing sets, fitting a model, and computing and averaging its performance (i.e., precision) across the three splits. This process can be simplified using a RepeatedKFold validation:

    from sklearn.model_selection import RepeatedKFold
    
    precisions = list()
    rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=1)
    
    for train_index, test_index in rkf.split(X, y):
        # print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        X_train_transformed = rfe.fit_transform(X_train, y_train)
        X_test_transformed = rfe.transform(X_test)
        
        fitted_model = DecisionTreeClassifier().fit(X_train_transformed, y_train)
        y_pred = fitted_model.predict(X_test_transformed)
    
        precisions.append(precision_score(y_test, y_pred))
    
    print('Average precision:', np.mean(precisions))
    """
    Average precision: 0.93
    """
    

    and even further with Pipeline:

    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import cross_val_score
    
    rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=1)
    pipeline = Pipeline(steps=[('s',rfe),('m',DecisionTreeClassifier())])
    precisions = cross_val_score(pipeline, X, y, scoring='precision', cv=rkf)
    
    print('Average precision:', np.mean(precisions))
    """
    Average precision: 0.93
    """
    
    

    In summary, when the original data is passed to cross_val_score together with the Pipeline, the procedure:

    1. splits it into training and testing data;
    2. calls RFE.fit_transform() on the training data;
    3. applies RFE.transform() on the testing data so that it consists of the same features;
    4. calls estimator.fit() on the training data to fit (i.e., train) a model;
    5. calls estimator.predict() on the testing data to predict it.
    6. compares the predictions with the actual values and saves the performance results (the metric you passed to the scoring parameter) internally;
    7. repeats steps 1-6 for every split in the cross-validation procedure.
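
    Roughly speaking, the loop below sketches what cross_val_score does with the pipeline (not the actual sklearn internals): pipeline.fit calls fit_transform on the RFE step and then fit on the final estimator, while pipeline.predict calls transform and then predict.

    # a rough sketch of steps 1-7 above, reusing X, y, rkf and pipeline
    precisions = list()
    for train_index, test_index in rkf.split(X, y):
        # steps 2 and 4: fit_transform the RFE step, then fit the classifier
        pipeline.fit(X[train_index], y[train_index])
        # steps 3 and 5: transform the testing data, then predict it
        y_pred = pipeline.predict(X[test_index])
        # step 6: score the predictions
        precisions.append(precision_score(y[test_index], y_pred))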

    At the end of the procedure, one can access the performance results and average them across the folds.
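
    For instance, if you want more than a single score per split, cross_validate (as opposed to cross_val_score) returns a dictionary with one array per metric, plus fitting and scoring times. A small sketch, reusing the pipeline and rkf from above:

    from sklearn.model_selection import cross_validate
    
    cv_results = cross_validate(pipeline, X, y, scoring=['precision', 'recall'], cv=rkf)
    print(cv_results['test_precision'].mean())
    print(cv_results['test_recall'].mean())
    print(cv_results['fit_time'])  # seconds spent fitting on each split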