I have trained a SequentialFeatureSelector from sklearn and am now interested in the best model (based on the given scoring method) it produced. Is there a way to extract the parameters and use them to generate the model that was used?
I have seen that there exists a get_params() function for the SequentialFeatureSelector, but I don't understand how to interpret the output and retrieve the best estimator.
The main result of this model is which features it decided to select. You can access that information in various ways. Suppose you have a fitted selector = SequentialFeatureSelector(...).fit(...).
selector.support_ is a boolean vector, where True means that feature was selected. If you started off with 5 features and told it to select 2, then the vector will be [True, False, False, False, True] if it selected the first and last features.
You can get the same output as above using selector.get_support(). If you want the indices rather than a boolean vector, you can use selector.get_support(indices=True) - it'll return [0, 4] in this case, indicating feature number 0 and feature number 4.
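For instance, a minimal sketch on the iris data (LogisticRegression as the estimator is just an assumption for illustration; which features get picked depends on your data and scoring):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # 150 samples x 4 features
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=2
).fit(X, y)

print(selector.support_)                    # e.g. [False False  True  True]
print(selector.get_support(indices=True))   # e.g. [2 3]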
To get the feature names (only applies if you fed the model a dataframe):
selector.feature_names_in_[selector.support_]
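A short sketch of that, fitting on a DataFrame so that feature_names_in_ gets populated (again assuming iris and LogisticRegression purely for illustration):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True, as_frame=True)  # X is a DataFrame
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=2
).fit(X, y)

# Boolean-mask the stored column names to get the selected ones
print(selector.feature_names_in_[selector.support_])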
After fitting the selector, if you want it to strip out the unselected features, you can use selector.transform(X_test). The .transform(X_test) call applies the already-fitted selector to the supplied data: in this example, if X_test is 100 x 5, it'll return a 100 x 2 version where it has only kept the features determined from the initial .fit().
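To illustrate the shape change with synthetic data (make_classification and the 100/100 split here are assumptions for the sketch):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
selector = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=2
).fit(X[:100], y[:100])

X_test = X[100:]                          # 100 x 5
print(selector.transform(X_test).shape)   # (100, 2)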
SequentialFeatureSelector doesn't keep any of the models fitted during cross-validation, so I think you'd need to fit a new model using the selected features:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Fit selector
selector = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=2
).fit(X, y)
print('Selected feature numbers are', selector.get_support(indices=True))

# Use the fitted selector to reduce X to the selected features
X_reduced = selector.transform(X)

# Fit a logistic regression model on the selected features
logreg_fitted = LogisticRegression().fit(X_reduced, y)
Alternatively, cloning the selector's estimator keeps consistency with the original estimator and saves you from needing to manually specify all the original parameters (clone copies the hyperparameters but not any fitted state):

from sklearn.base import clone

best_model = clone(selector.estimator).fit(selector.transform(X), y)
If you want identical models (down to the random seed), it'll also be necessary to set up the cross-validation appropriately.
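For example, a reproducibility sketch (StratifiedKFold with a fixed random_state is just one way to pin the folds, and note that LogisticRegression only consumes its seed with certain solvers):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Pin the fold structure and the estimator's seed so reruns match exactly
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
selector = SequentialFeatureSelector(
    LogisticRegression(random_state=0), n_features_to_select=2, cv=cv
).fit(X, y)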