I have a training dataset with six features, and I am using SequentialFeatureSelector to find an "optimal" subset of the features for a linear regression model. The following code returns three features, which I will call X1, X2, X3.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto',
                                tol=0.05, direction='forward',
                                scoring='neg_root_mean_squared_error', cv=8)
sfs.fit_transform(X_train, y_train)
To check the results, I decided to run the same code using the subset of features X1, X2, X3 instead of X_train. I was expecting to see the features X1, X2, X3 returned again, but instead only X1, X2 were returned. Similarly, using these two features in the same code returned only X1. It seems that sfs always returns a proper subset of the input features with at most n_features_in_ - 1 columns, but I cannot find this stated in the scikit-learn docs. Is this correct, and if so, what is the reasoning for not allowing sfs to return the full set of features?
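(For concreteness, the re-run looked roughly like this, reusing the columns selected in the first pass:)
X_sub = sfs.transform(X_train)  # the three selected columns X1, X2, X3
sfs2 = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto',
                                 tol=0.05, direction='forward',
                                 scoring='neg_root_mean_squared_error', cv=8)
sfs2.fit(X_sub, y_train)
print(sfs2.get_support())       # only two of the three columns are kept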
I also checked to see if using backward selection would return the full feature set.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto',
                                tol=1000, direction='backward',
                                scoring='neg_root_mean_squared_error', cv=8)
sfs.fit_transform(X_train, y_train)
I set the threshold tol to a large value in the hope that there would be no satisfactory improvement over the full set of features of X_train. But instead of returning the six original features, it returned only five. The docs simply state:
If the score is not incremented by at least tol between two consecutive feature additions or removals, stop adding or removing.
So it seems that the full feature set is never considered during cross-validation, and that sfs behaves differently at the very end of a forward selection and at the very beginning of a backward selection. If the full set of features outperforms every proper subset, don't we want sfs to be able to return it? Is there a standard method for comparing a selected proper subset of the features against the full set using cross-validation?
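(By hand, I suppose the two can be compared with cross_val_score, as in the sketch below, but I'd like to know whether there is a standard way:)
from sklearn.model_selection import cross_val_score

# compare the SFS-selected subset against the full feature set directly
# (sketch; sfs is the selector fitted on X_train above)
subset_score = cross_val_score(LinearRegression(), sfs.transform(X_train), y_train,
                               scoring='neg_root_mean_squared_error', cv=8).mean()
full_score = cross_val_score(LinearRegression(), X_train, y_train,
                             scoring='neg_root_mean_squared_error', cv=8).mean()
print(subset_score, full_score)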
Check the source code, lines 240-246, inside the method fit():
if self.n_features_to_select == "auto":
    if self.tol is not None:
        # With auto feature selection, `n_features_to_select_` will be updated
        # to `support_.sum()` after features are selected.
        self.n_features_to_select_ = n_features - 1
    else:
        self.n_features_to_select_ = n_features // 2
As can be seen, even in auto selection mode with a given tol, the maximum number of features that can be selected is capped at n_features - 1 for some reason (maybe worth reporting as an issue on GitHub).
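To see the cap in action, here is a quick sketch: with a tiny positive tol every addition should pass the improvement test, yet at most n_features - 1 features end up selected.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_features=5, random_state=0)
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto',
                                tol=1e-12, direction='forward',
                                scoring='neg_root_mean_squared_error', cv=8)
sfs.fit(X, y)
print(sfs.support_.sum())  # 4 (= n_features - 1): the fifth feature is never attempted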
We can work around this by defining a function get_best_new_feature_score() (modeled on the method _get_best_new_feature_score() from the source code), as shown below:
import numpy as np
from sklearn.model_selection import cross_val_score

def get_best_new_feature_score(estimator, X, y, cv, current_mask, direction, scoring):
    # score every not-yet-considered feature: in 'forward' mode each candidate
    # is added to the current selection; in 'backward' mode the mask is flipped,
    # so each candidate is removed from the remaining features instead
    candidate_feature_indices = np.flatnonzero(~current_mask)
    scores = {}
    for feature_idx in candidate_feature_indices:
        candidate_mask = current_mask.copy()
        candidate_mask[feature_idx] = True
        if direction == "backward":
            candidate_mask = ~candidate_mask
        X_new = X[:, candidate_mask]
        scores[feature_idx] = cross_val_score(
            estimator,
            X_new,
            y,
            cv=cv,
            scoring=scoring
        ).mean()
    # return the candidate with the best mean CV score
    new_feature_idx = max(scores, key=lambda feature_idx: scores[feature_idx])
    return new_feature_idx, scores[new_feature_idx]
Now let's implement the auto (forward) selection ourselves, using a regression dataset with 5 features: we add the features one by one, reporting the improvement in score and stopping by comparing the improvement with the provided tol:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_features=5)  # data to be used
X.shape
# (100, 5)
lm = LinearRegression()  # model to be used

# now implement 'auto' feature selection (forward selection)
cur_mask = np.zeros(X.shape[1]).astype(bool)  # no feature selected initially
cv, direction, scoring = 8, 'forward', 'neg_root_mean_squared_error'
tol = 1  # if score improvement > tol, the feature will be added
old_score = -np.inf
ids, scores = [], []
for i in range(X.shape[1]):
    idx, new_score = get_best_new_feature_score(lm, X, y, current_mask=cur_mask,
                                                cv=cv, direction=direction,
                                                scoring=scoring)
    if (new_score - old_score) <= tol:
        break  # no remaining feature improves the score by more than tol
    cur_mask[idx] = True
    ids.append(idx)
    scores.append(new_score)
    old_score = new_score
    print(f'feature {idx} added, CV score {new_score}, mask {cur_mask}')
# feature 3 added, CV score -90.66899644023539, mask [False False False True False]
# feature 1 added, CV score -59.21188041830155, mask [False True False True False]
# feature 2 added, CV score -16.709218665372905, mask [False True True True False]
# feature 4 added, CV score -3.1862116620446166, mask [False True True True True]
# feature 0 added, CV score -1.4011801838814216e-13, mask [ True True True True True]
If tol is set to 10 instead, then only 4 features will be added in forward selection. Similarly, with tol=20 only 3 features are added, as expected.
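The same helper also covers the backward case. As noted in the question, the built-in backward selection performs the first removal unconditionally; the sketch below instead uses the full-set CV score as the baseline, so with a huge tol all 5 features are kept:
# backward selection with the full feature set as the baseline;
# in 'backward' mode the helper flips the mask, so True means 'removed'
cur_mask = np.zeros(X.shape[1]).astype(bool)  # nothing removed initially
old_score = cross_val_score(lm, X, y, cv=cv, scoring=scoring).mean()  # full-set score
tol = 1000
for i in range(X.shape[1] - 1):
    idx, new_score = get_best_new_feature_score(lm, X, y, current_mask=cur_mask,
                                                cv=cv, direction='backward',
                                                scoring=scoring)
    if (new_score - old_score) <= tol:
        break  # no single removal improves the score by more than tol
    cur_mask[idx] = True  # feature idx is dropped
    old_score = new_score
print('kept features:', np.flatnonzero(~cur_mask))
# kept features: [0 1 2 3 4]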