python-3.x dataframe scikit-learn encoder

How to use ColumnTransformer() to return a DataFrame?


I have a dataframe like this:

department      review  projects salary satisfaction bonus  avg_hrs_month   left
0   operations  0.577569    3   low         0.626759    0   180.866070      0
1   operations  0.751900    3   medium      0.443679    0   182.708149      0
2   support     0.722548    3   medium      0.446823    0   184.416084      0
3   logistics   0.675158    4   high        0.440139    0   188.707545      0
4   sales       0.676203    3   high        0.577607    1   179.821083      0

I want to try ColumnTransformer() and return a transformed dataframe.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

ord_features = ["salary"]
ordinal_transformer = OrdinalEncoder()

cat_features = ["department"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

ct = ColumnTransformer(
    transformers=[
        ("ord", ordinal_transformer, ord_features),
        ("cat", categorical_transformer, cat_features),
    ]
)

df_new = ct.fit_transform(df)
df_new

This gives me a sparse matrix of type '<class 'numpy.float64'>'.

If I use pd.DataFrame(ct.fit_transform(df)), I get a single column:

                            0
0   (0, 0)\t1.0\n (0, 7)\t1.0
1   (0, 0)\t2.0\n (0, 7)\t1.0
2   (0, 0)\t2.0\n (0, 10)\t1.0
3   (0, 5)\t1.0
4   (0, 9)\t1.0

However, I was expecting a transformed dataframe like this:

    review  projects salary satisfaction bonus  avg_hrs_month   operations support ...
0   0.577569    3    1      0.626759     0      180.866070      1           0
1   0.751900    3    2      0.443679     0      182.708149      1           0  
2   0.722548    3    2      0.446823     0      184.416084      0           1
3   0.675158    4    3      0.440139     0      188.707545      0           0
4   0.676203    3    3      0.577607     1      179.821083      0           0

Is it possible with ColumnTransformer()?


Solution

  • As briefly sketched in the comments, there are a couple of things to consider about your example:

    The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.

    I would suggest reading "Preserve column order after applying sklearn.compose.ColumnTransformer" on this topic.

    Update 10/2022 - sklearn version 1.2.dev0

    With sklearn version 1.2.0 it will be much easier to get a DataFrame back when transforming with a ColumnTransformer instance. That version has not been released yet, but you can test the following in dev (version 1.2.dev0) by installing the nightly build:

    pip install --pre --extra-index https://pypi.anaconda.org/scipy-wheels-nightly/simple scikit-learn -U
    

    The ColumnTransformer (like other transformers) now exposes a .set_output() method, which configures the transformer to output pandas DataFrames when you pass the keyword argument transform='pandas' to it.

    Therefore, the example becomes:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
    from sklearn.compose import ColumnTransformer
    
    df = pd.DataFrame({
        'department': ['operations', 'operations', 'support', 'logistics', 'sales'],
        'review': [0.577569, 0.751900, 0.722548, 0.675158, 0.676203],
        'projects': [3, 3, 3, 4, 3],
        'salary': ['low', 'medium', 'medium', 'low', 'high'],
        'satisfaction': [0.626759, 0.751900, 0.722548, 0.675158, 0.676203],
        'bonus': [0, 0, 0, 0, 1],
        'avg_hrs_month': [180.866070, 182.708149, 184.416084, 188.707545, 179.821083],
        'left': [0, 0, 1, 0, 0]
    })
    
    ord_features = ["salary"]
    ordinal_transformer = OrdinalEncoder()
    
    cat_features = ["department"]
    categorical_transformer = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
    
    ct = ColumnTransformer(transformers=[
        ("ord", ordinal_transformer, ord_features),
        ("cat", categorical_transformer, cat_features )],
        remainder='passthrough'
    )
    
    # transform is a keyword-only argument of set_output()
    ct.set_output(transform='pandas')
    df_pandas = ct.fit_transform(df)
    df_pandas
    


    The output is also much easier to read because it has proper column names. At each step, the transformers that make up the ColumnTransformer have the attribute feature_names_in_, so you no longer lose the column names while transforming the input.

    One last note: the example now requires sparse_output=False to be passed to the OneHotEncoder instance in order to work, because pandas output does not support sparse data.