python scikit-learn cross-validation feature-selection eli5

Right way to use RFECV and Permutation Importance - Sklearn


There is a proposal to implement this in Sklearn (#15075), but in the meantime eli5 is suggested as a solution. However, I'm not sure whether I'm using it the right way. This is my code:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
import eli5
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
perm = eli5.sklearn.PermutationImportance(estimator, scoring='r2', n_iter=10, random_state=42, cv=3)
selector = RFECV(perm, step=1, min_features_to_select=1, scoring='r2', cv=3)
selector = selector.fit(X, y)
selector.ranking_

There are a few issues:

  1. I am not sure if I am using cross-validation the right way. Does PermutationImportance use cv to compute the importances on a validation set, or should cross-validation be done only in RFECV? (In the example I used cv=3 in both places, but I'm not sure that's the right thing to do.)

  2. If I run eli5.show_weights(perm), I get: AttributeError: 'PermutationImportance' object has no attribute 'feature_importances_'. Is this because I fit using RFECV? What I'm doing is similar to the last snippet here: https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html

  3. As a less important issue, I get this warning when I set cv in eli5.sklearn.PermutationImportance:

.../lib/python3.8/site-packages/sklearn/utils/validation.py:68: FutureWarning: Pass classifier=False as keyword args. From version 0.25 passing these as positional arguments will result in an error
  warnings.warn("Pass {} as keyword args. From version 0.25 "

The whole process is a bit vague. Is there a way to do it directly in Sklearn, e.g. by adding a feature_importances_ attribute?


Solution

  • Since the objective is to select the optimal number of features with permutation importance and recursive feature elimination, I suggest using RFECV and PermutationImportance in conjunction with a CV splitter like KFold. The code could then look like this:

    import warnings
    from eli5 import show_weights
    from eli5.sklearn import PermutationImportance
    from sklearn.datasets import make_friedman1
    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import KFold
    from sklearn.svm import SVR
    
    
    # Silence the FutureWarning raised when cv is passed to PermutationImportance (issue 3).
    warnings.filterwarnings("ignore", category=FutureWarning)
    
    X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
    
    splitter = KFold(n_splits=3)  # 3 folds as in the example
    
    estimator = SVR(kernel="linear")
    selector = RFECV(
        PermutationImportance(estimator, scoring='r2', n_iter=10, random_state=42, cv=splitter),
        cv=splitter,
        scoring='r2',
        step=1
    )
    selector = selector.fit(X, y)
    selector.ranking_
    
    # selector.estimator_ is the fitted PermutationImportance object (issue 2).
    show_weights(selector.estimator_)
    

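    Regarding your issues:

    1. PermutationImportance and RFECV both use the splitting strategy provided by KFold, but the splits are nested: RFECV splits the full data to compute the r2 score for each number of features, and within each RFECV training fold PermutationImportance splits that fold again to compute the feature importances on held-out data. Setting cv in both places is therefore fine.

    2. You called show_weights on the unfitted PermutationImportance object. RFECV fits clones of the estimator it is given, so the perm object you created is never fitted itself. That is why you got the error. You should access the fitted object with the estimator_ attribute of the selector instead.

    3. Can be ignored. It is a deprecation notice rather than an error; as the message itself says, the positional call only starts failing from version 0.25.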
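
Point 2 can be checked without eli5. The sketch below is a simplification of the code above: it assumes plain scikit-learn and passes the SVR straight to RFECV, with no PermutationImportance wrapper, so it does not reproduce the permutation-based ranking. It only illustrates that the object handed to RFECV stays unfitted, while the refit copy is available as selector.estimator_. The variable names original_is_fitted and refit_is_fitted are made up for this illustration.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Assumption: a linear SVR used directly, since it exposes coef_ for RFECV to rank on.
estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, scoring="r2", cv=KFold(n_splits=3))
selector.fit(X, y)

# support_vectors_ is only set by SVR.fit, so it works as a "was this fitted?" probe.
original_is_fitted = hasattr(estimator, "support_vectors_")
refit_is_fitted = hasattr(selector.estimator_, "support_vectors_")

print("object passed to RFECV fitted:", original_is_fitted)
print("selector.estimator_ fitted:   ", refit_is_fitted)
```

The same reasoning applies when the estimator is a PermutationImportance object: anything that needs fitted attributes, such as show_weights, has to be given selector.estimator_ rather than the object created before the fit.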