I am experiencing an issue with the column order after applying ColumnTransformer. If you run the following code:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({
    'FeatureA': [1.05, 0.5, 2.5],
    'FeatureB': [0, -5, -15],
    'CatFeatureA': ['feat1', 'feat2', 'feat3'],
    'CatFeatureB': ['cat1', 'cat2', 'cat3'],
    'FeatureC': [250, 125.5, 300]
})
transformer = ColumnTransformer(
    [("drop", "drop", ["FeatureC"]),
     ("ordinal", OrdinalEncoder(), ["CatFeatureA", "CatFeatureB"])],
    remainder="passthrough"
)
features = pd.DataFrame(
    columns=df.drop("FeatureC", axis=1).columns,
    index=df.index,
    data=transformer.fit_transform(df),
)
You will notice that the output is:
Out[70]:
   FeatureA  FeatureB  CatFeatureA  CatFeatureB
0       0.0       0.0         1.05          0.0
1       1.0       1.0         0.50         -5.0
2       2.0       2.0         2.50        -15.0
Basically, the values are not aligned with the columns: the values under FeatureA and FeatureB should be under CatFeatureA and CatFeatureB, and vice versa.
How can I make sure that the values are correctly aligned? It seems that the features encoded with OrdinalEncoder always come first, but I would like a more robust approach, as the transformer could be expanded in the future.
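ColumnTransformer orders its output by the order of the transformers list, with the remainder columns appended last, not by the column order of the input. You can access the column names in the order of the output with: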
transformer.get_feature_names_out()
array(['ordinal__CatFeatureA', 'ordinal__CatFeatureB',
'remainder__FeatureA', 'remainder__FeatureB'], dtype=object)
You could thus use:
features = pd.DataFrame(data=transformer.fit_transform(df),
index=df.index,
columns=transformer.get_feature_names_out(),
)
Or, better, use the set_output API (scikit-learn 1.2 or later) to request a DataFrame as output:
transformer.set_output(transform='pandas')
features = transformer.fit_transform(df)
Output:
   ordinal__CatFeatureA  ordinal__CatFeatureB  remainder__FeatureA  remainder__FeatureB
0                   0.0                   0.0                 1.05                    0
1                   1.0                   1.0                 0.50                   -5
2                   2.0                   2.0                 2.50                  -15
And if you don't want the transformer-name prefix (ordinal__, remainder__):
features = features.rename(columns=lambda x: x.split('__', 1)[-1])
Output:
   CatFeatureA  CatFeatureB  FeatureA  FeatureB
0          0.0          0.0      1.05         0
1          1.0          1.0      0.50        -5
2          2.0          2.0      2.50       -15