python pandas scikit-learn encoder

ColumnTransformer output column order


I am having trouble with the column order after applying ColumnTransformer. If you run the following code:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder


df = pd.DataFrame({
    'FeatureA': [1.05, 0.5, 2.5],
    'FeatureB': [0, -5, -15],
    'CatFeatureA': ['feat1', 'feat2', 'feat3'],
    'CatFeatureB': ['cat1', 'cat2', 'cat3'],
    'FeatureC': [250, 125.5, 300]
})

transformer = ColumnTransformer(
    [("drop", "drop", ["FeatureC"]),
     ("ordinal", OrdinalEncoder(), ["CatFeatureA", "CatFeatureB"])],
    remainder="passthrough"
)

features = pd.DataFrame(columns=df.drop("FeatureC", axis=1).columns, index=df.index, data=transformer.fit_transform(df))

You will notice that the output is:

Out[70]: 
   FeatureA  FeatureB  CatFeatureA  CatFeatureB
0       0.0       0.0         1.05          0.0
1       1.0       1.0         0.50         -5.0
2       2.0       2.0         2.50        -15.0

The values are not aligned with the column names: the values under FeatureA and FeatureB are the ones that belong under CatFeatureA and CatFeatureB, and vice versa.

How can I make sure that the values are correctly aligned? It seems that the features encoded with OrdinalEncoder always come first, but I would like a more robust approach, as the transformer could be expanded in the future.


Solution

  • Once the transformer is fitted, you can get the column names in output order with:

    transformer.get_feature_names_out()
    
    array(['ordinal__CatFeatureA', 'ordinal__CatFeatureB',
           'remainder__FeatureA', 'remainder__FeatureB'], dtype=object)
    

    You could thus use:

    features = pd.DataFrame(data=transformer.fit_transform(df),
                            index=df.index,
                            columns=transformer.get_feature_names_out(),
                           )
    

    Or, better, use the set_output API (available since scikit-learn 1.2) to request a DataFrame as output:

    transformer.set_output(transform='pandas')
    features = transformer.fit_transform(df)
    

    Output:

       ordinal__CatFeatureA  ordinal__CatFeatureB  remainder__FeatureA  remainder__FeatureB
    0                   0.0                   0.0                 1.05                  0.0
    1                   1.0                   1.0                 0.50                 -5.0
    2                   2.0                   2.0                 2.50                -15.0
    

    And if you don't want the transformer-name prefix (ordinal__, remainder__):

    features = features.rename(columns=lambda x: x.split('__', 1)[-1])
    

    Output:

       CatFeatureA  CatFeatureB  FeatureA  FeatureB
    0          0.0          0.0      1.05         0
    1          1.0          1.0      0.50        -5
    2          2.0          2.0      2.50       -15