python, machine-learning, scikit-learn, feature-extraction, sequentialfeatureselector

How to extract the best estimator from a SequentialFeatureSelector


I have trained a SequentialFeatureSelector from sklearn and am now interested in the best model (based on the given scoring method) it produced. Is there a way to extract its parameters and use them to generate that model?

I have seen that there exists a get_params() function for the SequentialFeatureSelector, but I don't understand how to interpret the output and retrieve the best estimator.


Solution

  • The main result of this model is which features it decided to select. You can access that information in various ways. Suppose you have fitted a selector=SequentialFeatureSelector(...).fit(...).

    selector.support_ is a boolean vector, where True means it selected that feature. If you started off with 5 features, and told it to select 2, then the vector will be [True, False, False, False, True] if it selected the first and last feature.

    You can get the same output as above using selector.get_support(). If you want the indices rather than a boolean vector, you can use selector.get_support(indices=True) - it'll return [0, 4] in this case, indicating feature number 0 and feature number 4.
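
    For example, a minimal sketch (the toy dataset and parameters here are assumptions, chosen to mirror the 5-feature example above):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    # Toy data: 5 features, of which the selector keeps 2
    X, y = make_classification(n_samples=100, n_features=5, random_state=0)
    selector = SequentialFeatureSelector(
        LogisticRegression(), n_features_to_select=2
    ).fit(X, y)

    print(selector.support_)                    # boolean mask, e.g. [ True False False False  True]
    print(selector.get_support(indices=True))   # integer indices, e.g. [1 4]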

    To get the feature names (only applies if you fed the model a dataframe):

    selector.feature_names_in_[selector.support_]
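
    Continuing the sketch above with a DataFrame (the column names are made up for illustration):

    import pandas as pd

    # feature_names_in_ is only set when the selector was fitted on a DataFrame
    X_df = pd.DataFrame(X, columns=["age", "height", "weight", "income", "score"])
    selector_df = SequentialFeatureSelector(
        LogisticRegression(), n_features_to_select=2
    ).fit(X_df, y)

    print(selector_df.feature_names_in_[selector_df.support_])  # e.g. ['height' 'score']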

    After fitting the selector, you can strip out the unselected features with selector.transform(X_test); this applies the already-fitted selector to the supplied data. In this example, if X_test is 100 x 5, it will return a 100 x 2 version keeping only the features determined by the initial .fit().
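
    For example (assuming a hypothetical X_test with the same 5 features the selector was fitted on):

    # Keep only the selected columns of the test matrix
    X_test_reduced = selector.transform(X_test)
    print(X_test.shape, "->", X_test_reduced.shape)   # e.g. (100, 5) -> (100, 2)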

    SequentialFeatureSelector doesn't keep any of the models fitted during cross-validation. So I think you'd need to fit a new model using the selected features:

    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    # Fit the selector
    selector = SequentialFeatureSelector(
        LogisticRegression(), n_features_to_select=2
    ).fit(X, y)

    print('Selected feature numbers are', selector.get_support(indices=True))

    # Use the fitted selector to reduce X to the selected features
    X_reduced = selector.transform(X)

    # Fit a fresh logistic regression on the selected features only
    logreg_fitted = LogisticRegression().fit(X_reduced, y)
    

    Alternatively, you can clone the selector's estimator: clone copies all of the original parameters for you, which keeps the new model consistent with the original estimator without you having to specify them manually:

    from sklearn.base import clone

    # clone() returns an unfitted copy with the same parameters,
    # so there is no need to pass get_params() back in manually
    best_model = clone(selector.estimator).fit(selector.transform(X), y)
    

    If you want identical models (down to the random seed), it will also be necessary to set the estimator's random_state and to set up the cross-validation (the selector's cv argument) appropriately.
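
    For example, a sketch of pinning the randomness down (which random_state parameters actually matter depends on the estimator and solver):

    from sklearn.model_selection import KFold

    # Fix both the CV splits and the estimator's own seed
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    selector = SequentialFeatureSelector(
        LogisticRegression(random_state=0),
        n_features_to_select=2,
        cv=cv,
    ).fit(X, y)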