python · machine-learning · scikit-learn

What's the best way to use a sklearn feature selector in a grid search, to evaluate the usefulness of all features?


I am training a sklearn classifier and have inserted a feature-selection step into its pipeline. Via grid search, I would like to determine the number of features that maximizes performance. At the same time, I'd like the grid search to explore the possibility that no feature selection at all, just a "passthrough" step, is the optimal choice for maximizing performance.

Here's a reproducible example:

import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Select features and target
features = ['age', 'fare', 'sex']
X = titanic[features]
y = titanic['survived']

# Preprocessing pipelines for numeric and categorical features
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('scaler', StandardScaler())
])

categorical_features = ['sex']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(drop='first'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Initialize classifier and feature selector
clf = LogisticRegression(max_iter=1000, solver='liblinear')
sfs = SequentialFeatureSelector(clf, direction='forward')

# Create a pipeline that includes preprocessing, feature selection, and classification
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', sfs),
    ('classifier', clf)
])

# Define the parameter grid to search over
param_grid = {
    'feature_selection__n_features_to_select': [2],
    'classifier__C': [0.1, 1.0, 10.0],  # Regularization strength
}

# Create and run the grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X, y)

# Output the best parameters and score
print("Best parameters found:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

X here has three features (even after the preprocessor step), but the grid search code above doesn't allow exploring models in which all 3 features are used, as setting

 'feature_selection__n_features_to_select': [2, 3]

will raise ValueError: n_features_to_select must be < n_features.

The obstacle here is that SequentialFeatureSelector doesn't treat keeping all features (i.e. a passthrough selector) as a valid feature selection.
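Here's a minimal reproduction of that error, independent of the preprocessing above (X_demo and y_demo are hypothetical stand-ins for any dataset with three features):

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Stand-in data: 20 samples, 3 features, binary target
X_demo = np.random.rand(20, 3)
y_demo = np.random.randint(0, 2, size=20)

# Asking the selector to keep all 3 of the 3 features is rejected
sfs_all = SequentialFeatureSelector(LogisticRegression(), n_features_to_select=3)
sfs_all.fit(X_demo, y_demo)  # ValueError: n_features_to_select must be < n_features.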

In other words, I would like to run a grid search that also considers the setting

('feature_selection', 'passthrough')

in the space of possible pipeline configurations. Is there an idiomatic/nice way to do that?


Solution

  • The parameter n_features_to_select can be an integer (an absolute number of features) or a float in (0, 1] (a proportion of the features). Only the integer path triggers the ValueError above, so instead of [1, 2, 3] the pipeline can run with [1/3, 2/3, 1.0], where 1.0 keeps all features; see the sketch below.
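    Only param_grid needs to change relative to the question's code (a minimal sketch; every other part of the pipeline stays as it is):

    param_grid = {
        # floats in (0, 1] are read as a fraction of the available features;
        # with 3 features, 1.0 keeps all of them
        'feature_selection__n_features_to_select': [1/3, 2/3, 1.0],
        'classifier__C': [0.1, 1.0, 10.0],  # regularization strength
    }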

    To get the scores for each combination of parameters in the grid search, you can run (in a notebook; note the pandas import):

    import pandas as pd
    display(pd.DataFrame(grid_search.cv_results_))
    

    The results for n_features_to_select=1.0 and those for a pipeline without the SequentialFeatureSelector (e.g. with that step set to 'passthrough') should be the same.
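    If you want a single grid search to compare both configurations directly, a pipeline step is itself a searchable parameter, so param_grid can be a list of sub-grids in which one entry replaces the whole step with 'passthrough'. A minimal sketch, reusing pipeline from the question:

    # Two sub-grids: one tunes the selector, the other removes it entirely
    param_grid = [
        {
            'feature_selection__n_features_to_select': [1/3, 2/3, 1.0],
            'classifier__C': [0.1, 1.0, 10.0],
        },
        {
            'feature_selection': ['passthrough'],  # skip feature selection
            'classifier__C': [0.1, 1.0, 10.0],
        },
    ]
    grid_search = GridSearchCV(pipeline, param_grid, cv=5)
    grid_search.fit(X, y)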