python scikit-learn pipeline rfe

How do I get the support_ values from an RFE step in a pipeline?


I created a Pipeline containing RFE and RandomForestClassifier, then applied RandomizedSearchCV to find the best hyperparameter values for both. This is what my code looks like:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV

steps = [
         ("rfe", RFE(estimator = RandomForestClassifier(random_state = 42))),
         ("est", RandomForestClassifier())
]
rf_clf_pl = Pipeline(steps = steps)

params = {
    "rfe__n_features_to_select" : range(2, smote_X_train.shape[1] + 1),
    "est__random_state" : np.linspace(0, 42, 5).astype(int),
    "est__n_estimators" : range(50, 201, 10),
    "est__max_depth" : [None] + list(range(5, max_depth, 3)),
    "est__max_leaf_nodes" : [None] + list(range(100, max_leaf_nodes, 20))
}

rs = RandomizedSearchCV(estimator = rf_clf_pl, cv = 4, param_distributions = params, n_jobs = -1, n_iter = 100, random_state = 42)
rs.fit(smote_X_train, smote_y_train)

I tried the code below to read the retained features, but got an error:

rf_clf_pl.named_steps["rfe"].support_

Error -

AttributeError                            Traceback (most recent call last)
<ipython-input-53-c73290f0e090> in <module>()
----> 1 rf_clf_pl.named_steps["rfe"].support_

AttributeError: 'RFE' object has no attribute 'support_'

How can I get the names of the retained features?


Solution

  • You can access the retained features of the best estimator as follows:

    rs.best_estimator_.named_steps['rfe'].support_
    

    That is, access the best_estimator_ attribute of the fitted RandomizedSearchCV instance. It holds the pipeline refitted with the best hyperparameters found, and it is available because RandomizedSearchCV uses refit=True by default.

    Accessing support_ on rf_clf_pl does not work because that pipeline object is never fitted. You did not call fit on it directly, and RandomizedSearchCV fits clones of the estimator you pass in rather than the instance itself. The only fitted pipeline the search exposes is best_estimator_.
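To see this cloning behavior in isolation, here is a minimal sketch. The names pipe and search are illustrative, and DecisionTreeClassifier is swapped in for the random forest only to keep the run fast; neither choice comes from the question.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative pipeline: a decision tree stands in for the random forest.
pipe = Pipeline([
    ("rfe", RFE(estimator=DecisionTreeClassifier(random_state=0))),
    ("est", DecisionTreeClassifier(random_state=0)),
])

search = RandomizedSearchCV(
    pipe,
    param_distributions={"rfe__n_features_to_select": [2, 3, 4]},
    n_iter=3,
    cv=4,
    random_state=0,
)
search.fit(X, y)

# The pipeline passed to the search was only cloned, so it is still unfitted.
original_is_fitted = hasattr(pipe.named_steps["rfe"], "support_")
# The refitted best pipeline does carry the fitted RFE attributes.
best_is_fitted = hasattr(search.best_estimator_.named_steps["rfe"], "support_")

print(original_is_fitted, best_is_fitted)
```

If you need the original pipeline object itself to be fitted, you would have to call fit on it yourself, for example after applying search.best_params_ with set_params.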

    Here's an example:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import RandomizedSearchCV, train_test_split
    
    iris = load_iris(as_frame=True)
    X, y = iris.data, iris.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    
    steps = [
         ("rfe", RFE(estimator = RandomForestClassifier(random_state = 42))),
         ("est", RandomForestClassifier())
    ]
    rf_clf_pl = Pipeline(steps = steps)
    
    params = {
        "rfe__n_features_to_select" : range(2, X_train.shape[1] + 1),
        "est__random_state" : np.linspace(0, 42, 5).astype(int),
        "est__n_estimators" : range(50, 201, 10),
        "est__max_depth" : [None] + list(range(5, 16, 3)),
        "est__max_leaf_nodes" : [None] + list(range(100, 201, 20))
    }
    
    rs = RandomizedSearchCV(estimator = rf_clf_pl, cv = 4, param_distributions = params, n_jobs = -1, n_iter = 100, random_state = 42)
    
    rs.fit(X_train, y_train)
    
    rs.best_estimator_.named_steps['rfe'].support_
    

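    Finally, if you want the names of the retained features, you can retrieve them via rs.feature_names_in_[np.where(rs.best_estimator_.named_steps['rfe'].support_)[0]]. This works here because X_train is a pandas DataFrame (as_frame=True), so the column names are recorded when the search is fitted.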