I have a dataframe with categorical values, such as city names.
For ML algorithms, I then need to encode the data into numerical values.
I do it like this:
df[cat_columns] = df[cat_columns].apply(preprocessing.LabelEncoder().fit_transform)
My question is: how can I later find out which city a given encoded value, say 2, corresponds to?
For instance, 2 could be "Paris".
For the moment, before encoding, I do this so that I can recover that information:
encoders = {c: preprocessing.LabelEncoder().fit(df[c]) for c in cat_columns}
Is that useless? How do you proceed? Thanks.
LabelEncoder should only be used to encode your labels, i.e. your target y.
To transform categorical columns in the same way you should use OrdinalEncoder (however, ordinal encoding might not always be desired - you should look up OneHotEncoder and decide if that's a better fit for your problem).
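For comparison, here is a minimal sketch of what one-hot encoding looks like on a single categorical column. The small cities dataframe and the variable names are made up for illustration; the point is only that each category gets its own 0/1 column instead of an integer code, and that the encoder can still map rows back to the original values:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column dataset, used only for this illustration.
cities = pd.DataFrame({"city": ["Paris", "Lyon", "Tokyo", "Amsterdam"]})

ohe = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array to inspect it.
one_hot = ohe.fit_transform(cities).toarray()

# One column per category, in the (sorted) order stored in ohe.categories_.
print(ohe.categories_[0])
print(one_hot)

# The fitted encoder can recover the original values from the encoded rows.
decoded = ohe.inverse_transform(one_hot)
print(decoded[:, 0])
```

Unlike ordinal encoding, this does not impose an artificial order on the cities, at the cost of one extra column per category.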
Let's use an example dataset to explore the correct transformations:
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["France", "France", "Japan", "Netherlands"],
        "city": ["Paris", "Lyon", "Tokyo", "Amsterdam"],
        "population": [13024518, 2323221, 37468000, 2480394],
    }
)
Applying OrdinalEncoder directly to our full dataset will result in encoding numerical columns as well:
>>> from sklearn.preprocessing import OrdinalEncoder
>>> enc = OrdinalEncoder()
>>> enc.fit_transform(df)
array([[0., 2., 2.],
       [0., 1., 0.],
       [1., 3., 3.],
       [2., 0., 1.]])
The expected way to perform this transformation is through the use of ColumnTransformer to specify the columns we'd like to perform the transformation on:
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import OrdinalEncoder
>>> ct = ColumnTransformer(
...     [("enc", OrdinalEncoder(), ["country", "city"])],
...     remainder="passthrough"
... )
>>> ct.fit_transform(df)
array([[0.0000000e+00, 2.0000000e+00, 1.3024518e+07],
       [0.0000000e+00, 1.0000000e+00, 2.3232210e+06],
       [1.0000000e+00, 3.0000000e+00, 3.7468000e+07],
       [2.0000000e+00, 0.0000000e+00, 2.4803940e+06]])
We can access the original categories like so (note the indexes in the following arrays: each category's index is its encoded value, so index 2 in the city array is 'Paris'):
>>> ct.named_transformers_["enc"].categories_
[array(['France', 'Japan', 'Netherlands'], dtype=object),
 array(['Amsterdam', 'Lyon', 'Paris', 'Tokyo'], dtype=object)]
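To answer the original question directly (which city is encoded as 2?), here is a rough sketch that rebuilds the example above and looks codes up in categories_. The names city_categories, code_to_city and decoded are just illustrative; inverse_transform is the encoder's own method for going back from codes to categories:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(
    {
        "country": ["France", "France", "Japan", "Netherlands"],
        "city": ["Paris", "Lyon", "Tokyo", "Amsterdam"],
        "population": [13024518, 2323221, 37468000, 2480394],
    }
)

ct = ColumnTransformer(
    [("enc", OrdinalEncoder(), ["country", "city"])],
    remainder="passthrough",
)
encoded = ct.fit_transform(df)

# The fitted OrdinalEncoder keeps one array of categories per encoded column,
# in the same order as the columns passed to it (country, then city).
enc = ct.named_transformers_["enc"]
country_categories, city_categories = enc.categories_

# A category's position in its array is its encoded value.
city_for_code_2 = city_categories[2]
print(city_for_code_2)

# Optional lookup table from code to city name.
code_to_city = dict(enumerate(city_categories))
print(code_to_city)

# Decode whole rows at once: the encoded columns come first in the output,
# followed by the passthrough "population" column.
decoded = enc.inverse_transform(encoded[:, :2])
print(decoded)
```

So keeping a separate dict of fitted encoders is not needed here: the fitted transformer already stores the mapping, and it stays consistent with what was actually used to encode the data.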