python scikit-learn pipeline eli5

How to get feature names from ELI5 when transformer includes an embedded pipeline


The ELI5 library provides the function transform_feature_names to retrieve the feature names for the output of an sklearn transformer. The documentation says that the function works out of the box when the transformer includes nested Pipelines.

I'm trying to get the function to work on a simplified version of the example in the answer to SO 57528350. My simplified example doesn't need Pipeline, but in real life I will need it in order to add steps to categorical_transformer, and I will also want to add transformers to the ColumnTransformer.

import eli5
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X_train = pd.DataFrame({'age': [23, 12, 12, 18],
                        'gender': ['M', 'F', 'F', 'F'],
                        'income': ['high', 'low', 'low', 'medium'],
                        'y': [0, 1, 1, 1]})

categorical_features = ['gender', 'income']
categorical_transformer = Pipeline(
    steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

transformers=[('categorical', categorical_transformer, categorical_features)]
preprocessor = ColumnTransformer(transformers)
X_train_transformed = preprocessor.fit_transform(X_train)

eli5.transform_feature_names(preprocessor, list(X_train.columns))

This dies with the message

AttributeError: Transformer categorical (type Pipeline) does not provide get_feature_names.

Since the Pipeline is nested in the ColumnTransformer, I understood from the ELI5 documentation that it would be handled.

Do I need to create a modified version of Pipeline with a get_feature_names method or make other custom modifications in order to take advantage of the ELI5 function?

I'm using python 3.7.6, eli5 0.10.1, pandas 0.25.3, and sklearn 0.22.1.


Solution

  • I think the problem is that eli5 relies on ColumnTransformer's get_feature_names method, which in turn calls get_feature_names on the nested Pipeline, and Pipeline does not implement that method in sklearn yet.
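
To make that delegation chain concrete, here is a minimal sketch that uses hypothetical stand-in classes (FakeEncoder, FakePipeline, FakeColumnTransformer) rather than the real scikit-learn ones. It only mimics the behaviour described above, under the assumption that the container asks each child for get_feature_names and gives up when a child lacks it.

```python
# Hypothetical stand-ins, NOT the real scikit-learn classes. They only mimic
# the delegation that leads to the AttributeError quoted in the question.

class FakeEncoder:
    """Leaf transformer that knows its own output names."""

    def get_feature_names(self):
        return ["x0_F", "x0_M"]


class FakePipeline:
    """Wraps steps but, like the Pipeline in the question, has no get_feature_names."""

    def __init__(self, steps):
        self.steps = steps


class FakeColumnTransformer:
    """Asks every child for get_feature_names and fails if one lacks it."""

    def __init__(self, transformers):
        self.transformers = transformers  # list of (name, transformer)

    def get_feature_names(self):
        names = []
        for name, trans in self.transformers:
            if not hasattr(trans, "get_feature_names"):
                raise AttributeError(
                    "Transformer %s (type %s) does not provide "
                    "get_feature_names." % (name, type(trans).__name__))
            names.extend(name + "__" + f for f in trans.get_feature_names())
        return names


# Encoder used directly: the container can collect the names.
direct = FakeColumnTransformer([("categorical", FakeEncoder())])
# Same encoder wrapped in a pipeline: the container hits the missing method.
nested = FakeColumnTransformer(
    [("categorical", FakePipeline([("onehot", FakeEncoder())]))])

print(direct.get_feature_names())
try:
    nested.get_feature_names()
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

The wrapped case fails even though the encoder inside the pipeline could supply the names, which is why the fix has to teach the outer container how to look through the wrapper.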

    I've opened an Issue with eli5 with your example.

    One possible fix is to register a transform_feature_names dispatch for ColumnTransformer. It can be a modified copy of ColumnTransformer's existing get_feature_names that calls eli5's transform_feature_names on each component transformer instead of sklearn's own get_feature_names. The code below seems to work, although I'm not sure how to handle the case where in_names differs from the training dataframe's columns, which the ColumnTransformer stores as _df_columns.

    import numpy as np
    from eli5 import transform_feature_names
    from sklearn.compose import ColumnTransformer
    
    @transform_feature_names.register(ColumnTransformer)
    def col_tfm_names(transformer, in_names=None):
        if in_names is None:
            from eli5.sklearn.utils import get_feature_names
            # generate default feature names
            in_names = get_feature_names(transformer, num_features=transformer._n_features)
        # return a list of strings derived from in_names
        feature_names = []
        for name, trans, column, _ in transformer._iter(fitted=True):
            if hasattr(transformer, '_df_columns'):
                if ((not isinstance(column, slice))
                        and all(isinstance(col, str) for col in column)):
                    names = column
                else:
                    names = transformer._df_columns[column]
            else:
                indices = np.arange(transformer._n_features)
                names = ['x%d' % i for i in indices[column]]
            # TODO: in_names is not used below; it should probably be able to
            # override the names derived from the fitted columns.
    
            if trans == 'drop' or (
                    hasattr(column, '__len__') and not len(column)):
                continue
            if trans == 'passthrough':
                feature_names.extend(names)
                continue
            feature_names.extend([name + "__" + f for f in
                                  transform_feature_names(trans, in_names=names)])
        return feature_names
    

    I also needed to create a dispatch for OneHotEncoder, because its get_feature_names needs the parameter input_features:

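    from sklearn.preprocessing import OneHotEncoder

    @transform_feature_names.register(OneHotEncoder)
    def _ohe_names(est, in_names=None):
        # Pass the incoming column names so the output reads e.g. gender_F, not x0_F
        return est.get_feature_names(input_features=in_names)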
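
The .register calls above look like the functools.singledispatch pattern, so the idea can be illustrated without eli5 or sklearn installed. The following is only a sketch under that assumption: feature_names and the Fake* classes are hypothetical stand-ins, not eli5's or scikit-learn's API. It shows how one handler per type lets the names flow through a pipeline nested inside a column transformer.

```python
from functools import singledispatch

# Hypothetical stand-ins for OneHotEncoder, Pipeline and ColumnTransformer.


class FakeEncoder:
    def __init__(self, categories):
        self.categories = categories  # one list of category values per input column


class FakePipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, transformer)


class FakeColumnTransformer:
    def __init__(self, transformers):
        self.transformers = transformers  # list of (name, transformer, columns)


@singledispatch
def feature_names(transformer, in_names=None):
    # Fallback for types that have no registered handler.
    raise NotImplementedError(
        "no handler for %s" % type(transformer).__name__)


@feature_names.register(FakeEncoder)
def _encoder_names(transformer, in_names=None):
    # One output name per (input column, category) pair.
    return ["%s_%s" % (col, cat)
            for col, cats in zip(in_names, transformer.categories)
            for cat in cats]


@feature_names.register(FakePipeline)
def _pipeline_names(transformer, in_names=None):
    # Thread the names through each step in order.
    names = in_names
    for _, step in transformer.steps:
        names = feature_names(step, in_names=names)
    return names


@feature_names.register(FakeColumnTransformer)
def _column_transformer_names(transformer, in_names=None):
    # Recurse through the dispatcher, so nested pipelines are handled too.
    out = []
    for name, trans, columns in transformer.transformers:
        out.extend(name + "__" + f
                   for f in feature_names(trans, in_names=columns))
    return out


encoder = FakeEncoder([["F", "M"], ["high", "low", "medium"]])
nested = FakeColumnTransformer(
    [("categorical", FakePipeline([("onehot", encoder)]),
      ["gender", "income"])])
result = feature_names(nested)
print(result)
```

The key design point is that the container's handler calls the dispatcher again for each child instead of calling a method on the child, so supporting a new wrapper type only requires registering one more handler.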
    

    Relevant links:
    https://eli5.readthedocs.io/en/latest/autodocs/eli5.html#eli5.transform_feature_names
    https://github.com/TeamHG-Memex/eli5/blob/4839d1927c4a68aeff051935d1d4d8a4fb69b46d/eli5/sklearn/transform.py