Tags: python, pandas, scikit-learn, mlxtend

ColumnTransformer(s) in various parts of a pipeline do not play well together


I am using sklearn and mlxtend.regressor.StackingRegressor to build a stacked regression model. For example, say I want the following small pipeline:

  1. A Stacking Regressor with two regressors:
    • A pipeline which:
      • Performs data imputation
      • 1-hot encodes categorical features
      • Performs linear regression
    • A pipeline which:
      • Performs data imputation
      • Performs regression using a Decision Tree

Unfortunately this is not possible, because StackingRegressor does not accept NaN in its input data. This holds even if its regressors know how to handle NaN, as in my case, where the regressors are pipelines which perform data imputation themselves.

However, this is not a problem: I can just move data imputation outside the stacked regressor. Now my pipeline looks like this:

  1. Perform data imputation
  2. Apply a Stacking Regressor with two regressors:
    • A pipeline which:
      • 1-hot encodes categorical features
      • Standardises numerical features
      • Performs linear regression
    • An sklearn.tree.DecisionTreeRegressor.

One might try to implement it as follows (the entire minimal working example, with comments, is in this gist):

sr_linear = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('categorical',
             make_pipeline(OneHotEncoder(), StandardScaler()),
             make_column_selector(dtype_include='category')),
        ('numerical',
             StandardScaler(),
             make_column_selector(dtype_include=np.number))
    ])),
    ('model', LinearRegression())
])

sr_tree = DecisionTreeRegressor()

ct_imputation = ColumnTransformer(transformers=[
    ('categorical',
        SimpleImputer(strategy='constant', fill_value='None'),
        make_column_selector(dtype_include='category')),
    ('numerical',
        SimpleImputer(strategy='median'),
        make_column_selector(dtype_include=np.number))
])

stacked_regressor = Pipeline(steps=[
    ('imputation', ct_imputation),
    ('back_to_pandas', FunctionTransformer(
        func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out())
    )),
    ('model', StackingRegressor(
        regressors=[sr_linear, sr_tree],
        meta_regressor=DecisionTreeRegressor(),
        use_features_in_secondary=True
    ))
])

Note that the "outer" ColumnTransformer (in stacked_regressor) returns a NumPy array. But the "inner" ColumnTransformer (in sr_linear) expects a pandas.DataFrame, so I had to convert the array back to a data frame using the back_to_pandas step. (To use get_feature_names_out I had to use the nightly version of sklearn, because the current stable version, 1.0.2, does not support it yet. Fortunately it can be installed with one simple command.)
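A side effect of this round trip is worth checking in isolation. The sketch below uses a hypothetical three-row toy frame (the `colour` and `size` columns are invented for illustration) and hard-codes the column names instead of calling get_feature_names_out. It suggests that when a ColumnTransformer mixes string and numerical columns, its output is a single object-dtype array, so a data frame rebuilt from it no longer has any 'category' or numerical columns for a dtype-based selector to find.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one categorical and one numerical column, each with a gap.
toy = pd.DataFrame({
    'colour': pd.Categorical(['red', None, 'blue']),
    'size': [1.0, np.nan, 3.0],
})

imputer = ColumnTransformer(transformers=[
    ('categorical',
        SimpleImputer(strategy='constant', fill_value='None'),
        make_column_selector(dtype_include='category')),
    ('numerical',
        SimpleImputer(strategy='median'),
        make_column_selector(dtype_include=np.number)),
])

# With default settings the transformer returns a NumPy array, not a DataFrame.
values = imputer.fit_transform(toy)

# Rebuild a DataFrame the same way the back_to_pandas step does
# (column names are written out by hand here).
rebuilt = pd.DataFrame(values, columns=['colour', 'size'])

# Ask the dtype-based selectors what they can still find.
categorical_after = make_column_selector(dtype_include='category')(rebuilt)
numerical_after = make_column_selector(dtype_include=np.number)(rebuilt)

print(type(values).__name__, values.dtype)
print(rebuilt.dtypes)
print(categorical_after, numerical_after)
```

If this matches what happens in the real pipeline, the rebuilt frame would need its dtypes restored (for example with astype) before any dtype-based selection further down.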

The above code fails when calling stacked_regressor.fit(), with the following error (the entire stacktrace is again in the gist):

ValueError: make_column_selector can only be applied to pandas dataframes

However, because I added the back_to_pandas step to my outer pipeline, the inner pipelines should be getting a pandas data frame! In fact, if I only fit_transform() my ct_imputation object, I clearly obtain a pandas data frame. I cannot understand where and when exactly the data which gets passed around ceases to be a data frame. Why is my code failing?
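To narrow this down, it helps to see when exactly that message is raised. The minimal sketch below, on a hypothetical two-column frame, calls a selector first on a data frame and then on the same data converted to a plain array. It only shows that the error appears whenever the selector receives something that is not a data frame; the idea that some step between back_to_pandas and the inner ColumnTransformer performs such a conversion is an assumption, not something this snippet verifies.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame, invented for illustration.
frame = pd.DataFrame({
    'size': [1.0, 2.0],
    'colour': pd.Categorical(['red', 'blue']),
})

selector = make_column_selector(dtype_include=np.number)

# On a DataFrame the selector returns the matching column names.
selected = selector(frame)

# On the same data as a plain NumPy array it raises the error from the question.
try:
    selector(frame.to_numpy())
    message = None
except ValueError as error:
    message = str(error)

print(selected)
print(message)
```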


Solution

  • The correct thing to do was:

    1. Move from mlxtend's to sklearn's StackingRegressor. I believe the former was created when sklearn still didn't have a stacking regressor. Now there is no need to use more 'obscure' solutions. sklearn's stacking regressor works pretty well.
    2. Move the 1-hot-encoding step to the outer pipeline, because (surprisingly!) sklearn's DecisionTreeRegressor cannot handle categorical data among the features.
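The second point can be checked with a small sketch on made-up data (the `colour` column and the target values are hypothetical). It tries to fit a DecisionTreeRegressor directly on a string feature, which is expected to raise a ValueError, and then fits it on the 1-hot encoded version of the same feature.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: a single string-valued feature.
X = pd.DataFrame({'colour': ['red', 'blue', 'red', 'green']})
y = [1.0, 2.0, 1.5, 3.0]

# Fitting directly on strings fails, because the tree needs numerical input.
try:
    DecisionTreeRegressor().fit(X, y)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True

# After 1-hot encoding the same feature, the tree fits without complaint.
encoded = OneHotEncoder(handle_unknown='ignore').fit_transform(X).toarray()
tree = DecisionTreeRegressor(random_state=0).fit(encoded, y)
predictions = tree.predict(encoded)

print(raw_fit_failed, encoded.shape, predictions.shape)
```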

    A working version of the code is given below:

    from sklearn.datasets import fetch_openml
    from sklearn.pipeline import Pipeline, make_pipeline
    from sklearn.compose import ColumnTransformer, make_column_selector
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import StackingRegressor
    
    import numpy as np
    import pandas as pd
    
    def set_correct_categories(df: pd.DataFrame) -> pd.DataFrame:
        for column in df.columns:
            if df[column].dtype == object or 'MSSubClass' in column:
                df[column] = pd.Categorical(df[column])
    
        return df
    
    d = fetch_openml('house_prices', as_frame=True).frame
    d = set_correct_categories(d).drop(columns='Id')
    
    sr_linear = Pipeline(steps=[
        ('preprocessing', StandardScaler()),
        ('model', LinearRegression())
    ])
    
    ct_preprocessing = ColumnTransformer(transformers=[
        ('categorical',
            make_pipeline(
                SimpleImputer(strategy='constant', fill_value='None'),
                # Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
                OneHotEncoder(sparse=False, handle_unknown='ignore')
            ),
            make_column_selector(dtype_include='category')),
        ('numerical',
            SimpleImputer(strategy='median'),
            make_column_selector(dtype_include=np.number))
    ])
    
    stacking_regressor = Pipeline(steps=[
        ('preprocessing', ct_preprocessing),
        ('model', StackingRegressor(
            estimators=[('linear_regression', sr_linear), ('regression_tree', DecisionTreeRegressor())],
            final_estimator=DecisionTreeRegressor(),
            passthrough=True
        ))
    ])
    
    label = 'SalePrice'
    features = [col for col in d.columns if col != label]
    X, y = d[features], d[label]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
    
    stacking_regressor.fit(X_train, y_train)
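To see what passthrough=True changes, here is a minimal sketch on synthetic data rather than the house prices dataset (the random features and coefficients are invented for illustration). It fits the same kind of stack and inspects what the final estimator receives: one column of predictions per base estimator, followed by the original features.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: 60 samples, 3 numerical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

stack = StackingRegressor(
    estimators=[
        ('linear_regression', LinearRegression()),
        ('regression_tree', DecisionTreeRegressor(random_state=0)),
    ],
    final_estimator=DecisionTreeRegressor(random_state=0),
    passthrough=True,
)
stack.fit(X, y)

# transform() returns the input of the final estimator:
# 2 prediction columns plus the 3 original features.
meta_features = stack.transform(X)
predictions = stack.predict(X)

print(meta_features.shape, predictions.shape)
```

With passthrough=False the same call would return only the two prediction columns, which is the closest equivalent of leaving out mlxtend's use_features_in_secondary.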
    

    Thanks to user amiola for his answer putting me on the right track.