Tags: python, scikit-learn, linear-discriminant

How to use sklearn RFECV to select the optimal features to pass to a dimensionality reduction step before fitting my estimator


How can I use sklearn's RFECV to select the optimal features to pass to a LinearDiscriminantAnalysis(n_components=2) step for dimensionality reduction, before fitting my estimator with a KNeighborsClassifier?

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import RFECV
import matplotlib.pyplot as plt

pipeline = make_pipeline(Normalizer(),
                         LinearDiscriminantAnalysis(n_components=2),
                         KNeighborsClassifier(n_neighbors=10))

X = self.dataset
y = self.postures

min_features_to_select = 1  # Minimum number of features to consider
rfecv = RFECV(pipeline, step=1, cv=None, scoring='f1_weighted',
              min_features_to_select=min_features_to_select)

rfecv.fit(X, y)

print(rfecv.support_)
print(rfecv.ranking_)
print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(min_features_to_select,
               len(rfecv.grid_scores_) + min_features_to_select),
         rfecv.grid_scores_)
plt.show()

I get the following error from this code. If I run it without the LinearDiscriminantAnalysis() step it works, but that step is an important part of my processing.

*** ValueError: when `importance_getter=='auto'`, the underlying estimator Pipeline should have `coef_` or `feature_importances_` attribute. Either pass a fitted estimator to feature selector or call fit before calling transform.

Solution

  • Your approach has an overall problem: KNeighborsClassifier has no intrinsic measure of feature importance, so it is not compatible with RFECV. As the RFECV documentation states, the estimator must be:

    A supervised learning estimator with a fit method that provides information about feature importance either through a coef_ attribute or through a feature_importances_ attribute.

    This will always fail with KNeighborsClassifier. You need a classifier that exposes feature importances, such as RandomForestClassifier (which provides feature_importances_) or a linear-kernel SVC (which provides coef_).
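    You can verify which classifiers qualify with a quick check (a minimal sketch on synthetic data from make_classification; the estimator names here are illustrative, not from your code):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=100, n_features=5, random_state=0)

    rf = RandomForestClassifier(random_state=0).fit(X, y)
    svc = SVC(kernel='linear').fit(X, y)
    knn = KNeighborsClassifier().fit(X, y)

    # RFECV's importance_getter='auto' looks for exactly these attributes
    print(hasattr(rf, 'feature_importances_'))                            # True
    print(hasattr(svc, 'coef_'))                                          # True
    print(hasattr(knn, 'coef_') or hasattr(knn, 'feature_importances_'))  # False
    ```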

    Once you have chosen a suitable classifier, your pipeline still needs to expose the feature importance of its final estimator. For this you can refer to this answer here, which defines a custom pipeline for the purpose:

    class MyPipeline(Pipeline):
        # Forward the final estimator's importance attributes so that
        # RFECV's importance_getter='auto' can find them on the pipeline
        @property
        def coef_(self):
            return self._final_estimator.coef_

        @property
        def feature_importances_(self):
            return self._final_estimator.feature_importances_
    

    Define your pipeline like:

    pipeline = MyPipeline([
        ('normalizer', Normalizer()),
        ('ldm', LinearDiscriminantAnalysis(n_components=2)),
        ('rf', RandomForestClassifier())
    ])
    

    and it should work.
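    Putting the pieces together, here is a minimal runnable sketch of the custom-pipeline approach, using synthetic data from make_classification as a stand-in for your dataset. Note the LDA step is left out of this sketch: RFECV ranks input features by the importances the final estimator reports, and those must have one entry per remaining input feature, which an intermediate reduction to 2 components would break.

    ```python
    from sklearn.datasets import make_classification
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Normalizer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFECV

    class MyPipeline(Pipeline):
        @property
        def coef_(self):
            return self._final_estimator.coef_

        @property
        def feature_importances_(self):
            return self._final_estimator.feature_importances_

    # Synthetic stand-in: 10 features, 4 informative, 3 classes
    X, y = make_classification(n_samples=200, n_features=10,
                               n_informative=4, n_classes=3,
                               random_state=0)

    pipeline = MyPipeline([
        ('normalizer', Normalizer()),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ])

    # RFECV now finds feature_importances_ on the pipeline itself
    rfecv = RFECV(pipeline, step=1, cv=3, scoring='f1_weighted')
    rfecv.fit(X, y)

    print(rfecv.n_features_)  # optimal number of selected features
    print(rfecv.support_)     # boolean mask over the 10 input features
    ```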