scikit-learn pipeline gridsearchcv imblearn mlxtend

What does "AttributeError: 'ColumnSelector' object has no attribute 'n_features_in_'" mean?


I am running a grid search to tune the hyperparameters of a stacking estimator (a StackingClassifier object from the sklearn.ensemble library). I am using scikit-learn for ML, with the RandomizedSearchCV function. In addition, the base estimators of the stack are pipelines (Pipeline objects from the imblearn.pipeline library), and the first step of each pipeline is a ColumnSelector object from the mlxtend library. The search is intended to cover a long list of combinations of variables, so the parameter distribution only covers the "cols" parameter of the ColumnSelector objects. The first time I ran this code, everything worked well. I then set the project aside and came back after a few days to find it no longer worked. Everything in the code is the same as I left it, but when I call the fit method on the RandomizedSearchCV object, I get the following error:

AttributeError: 'ColumnSelector' object has no attribute 'n_features_in_'

I don't understand what's wrong. I have tried many things, even uninstalling Anaconda, mlxtend, and imblearn and reinstalling the most recent versions, but it keeps raising the same error. I have searched on Google, but there seems to be no information about this.

Can you help me with this issue?

Thanks in advance.


Addendum: the scikit-learn version is 0.23.1, the mlxtend version is 0.17.3, and the imbalanced-learn version is 0.7.0.

The full traceback is below. The object gr2 is the RandomizedSearchCV object intended to tune the stacking classifier. Note that if I use the StackingClassifier object from mlxtend instead, everything works fine, but that object does not have the cv parameter. The StackingClassifier from sklearn.ensemble does have it, and I need it to get better performance (which I had before, when everything was working fine).

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-94-9d8f412d45a3> in <module>
----> 1 gr2.fit(x_train,y_train)

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75 

~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    763             refit_start_time = time.time()
    764             if y is not None:
--> 765                 self.best_estimator_.fit(X, y, **fit_params)
    766             else:
    767                 self.best_estimator_.fit(X, **fit_params)

~\anaconda3\lib\site-packages\sklearn\ensemble\_stacking.py in fit(self, X, y, sample_weight)
    423         self._le = LabelEncoder().fit(y)
    424         self.classes_ = self._le.classes_
--> 425         return super().fit(X, self._le.transform(y), sample_weight)
    426 
    427     @if_delegate_has_method(delegate='final_estimator_')

~\anaconda3\lib\site-packages\sklearn\ensemble\_stacking.py in fit(self, X, y, sample_weight)
    147             for est in all_estimators if est != 'drop'
    148         )
--> 149         self.n_features_in_ = self.estimators_[0].n_features_in_
    150 
    151         self.named_estimators_ = Bunch()

~\anaconda3\lib\site-packages\sklearn\pipeline.py in n_features_in_(self)
    623     def n_features_in_(self):
    624         # delegate to first step (which will call _check_is_fitted)
--> 625         return self.steps[0][1].n_features_in_
    626 
    627     def _sk_visual_block_(self):

AttributeError: 'ColumnSelector' object has no attribute 'n_features_in_'

Solution

  • sklearn has been adding checks on the number of features, using the attribute n_features_in_. It appears mlxtend has not yet added that attribute to its ColumnSelector, hence the error. Note that sklearn's Pipeline doesn't store its own n_features_in_; it delegates to the first step, as you can see in the comment in the code at the end of the traceback.

    Ideally, submit an issue with mlxtend asking to add n_features_in_ (and perhaps the relevant checks) to ColumnSelector. In the meantime, a few workarounds come to mind:

    1. mlxtend has a StackingCVClassifier, which is probably preferable to the ordinary StackingClassifier anyway, and which has the cv parameter you want. It might never look up the n_features_in_ attribute, which would resolve things (as long as the Pipeline never tries to call its getter...)
    2. Using sklearn's ColumnTransformer may be preferable to using mlxtend's ColumnSelector. Then you don't need mlxtend at all, it seems.
    3. Downgrading your sklearn may be enough, to avoid the n_features_in_ checks altogether.
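
To make the delegation described above concrete, here is a minimal pure-Python sketch. All three classes are hypothetical stand-ins written for illustration; they are not the real sklearn or mlxtend classes. It reproduces the same kind of AttributeError and shows one further possible stopgap, which is an assumption on my part rather than something tested against mlxtend: a subclass of the selector that records n_features_in_ during fit.

```python
class ColumnSelectorStandIn:
    """Hypothetical stand-in for a selector that never sets n_features_in_."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Keep only the requested column indices from each row.
        return [[row[c] for c in self.cols] for row in X]


class PatchedSelector(ColumnSelectorStandIn):
    """Workaround sketch: record the input width when fitting."""

    def fit(self, X, y=None):
        self.n_features_in_ = len(X[0])
        return super().fit(X, y)


class PipelineStandIn:
    """Hypothetical stand-in mimicking the delegation shown in the traceback."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        for _, step in self.steps:
            X = step.fit(X, y).transform(X)
        return self

    @property
    def n_features_in_(self):
        # Delegate to the first step; fails if that step lacks the attribute.
        return self.steps[0][1].n_features_in_


X = [[1, 2, 3], [4, 5, 6]]

# The unpatched selector triggers the same kind of error as in the question.
broken = PipelineStandIn([("sel", ColumnSelectorStandIn(cols=[0, 2]))]).fit(X)
try:
    broken.n_features_in_
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# The patched selector exposes the attribute, so the delegation succeeds.
patched = PipelineStandIn([("sel", PatchedSelector(cols=[0, 2]))]).fit(X)
n_features = patched.n_features_in_
```

With the real libraries, the analogous step would be to subclass mlxtend's ColumnSelector and set n_features_in_ in its fit method. Whether that alone satisfies every check sklearn performs is untested here.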