scikit-learn label-encoding

What is the right way to use sklearn's LabelEncoder so that the original values can be recovered?


I have a dataframe with categorical values, for instance city names.

For ML algorithms, I need to encode this data as numerical values.

I do it like this:

from sklearn import preprocessing
df[cat_columns] = df[cat_columns].apply(preprocessing.LabelEncoder().fit_transform)

My question is: how can I later find out which city a given encoded value, for instance 2, corresponds to?

2 could be "Paris", for instance.

For the moment, before encoding, I do this so that I can get the information back:

encoders = {c: preprocessing.LabelEncoder().fit(df[c]) for c in cat_columns}

Is this unnecessary? How would you proceed? Thanks


Solution

  • LabelEncoder should only be used to encode your labels, i.e. your target y.

    To transform categorical feature columns in the same way, you should use OrdinalEncoder. Note, however, that ordinal encoding is not always desirable, because it implies an order among the categories that many models will treat as meaningful; you should look up OneHotEncoder and decide whether it is a better fit for your problem.

    Let's use an example dataset to explore the correct transformations:

    import pandas as pd
    
    df = pd.DataFrame(
        {
            "country": ["France", "France", "Japan", "Netherlands"],
            "city": ["Paris", "Lyon", "Tokyo", "Amsterdam"],
            "population": [13024518, 2323221, 37468000, 2480394]
        }
    )
    

    Applying OrdinalEncoder directly to the full dataset encodes the numerical column as well (note that the population values in the third column have been replaced by category indexes):

    >>> from sklearn.preprocessing import OrdinalEncoder
    >>> enc = OrdinalEncoder()
    >>> enc.fit_transform(df)
    array([[0., 2., 2.],
           [0., 1., 0.],
           [1., 3., 3.],
           [2., 0., 1.]])
    

    The expected way to perform this transformation is through the use of ColumnTransformer to specify the columns we'd like to perform the transformation on:

    >>> from sklearn.compose import ColumnTransformer
    >>> from sklearn.preprocessing import OrdinalEncoder
    >>> ct = ColumnTransformer(
    ...     [("enc", OrdinalEncoder(), ["country", "city"])],
    ...     remainder="passthrough"
    ... )
    >>> ct.fit_transform(df)
    array([[0.0000000e+00, 2.0000000e+00, 1.3024518e+07],
           [0.0000000e+00, 1.0000000e+00, 2.3232210e+06],
           [1.0000000e+00, 3.0000000e+00, 3.7468000e+07],
           [2.0000000e+00, 0.0000000e+00, 2.4803940e+06]])
    

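    We can access the original categories like so. There is one array per encoded column, and each category's index in its array is its encoded value, so 2 in the city column corresponds to 'Paris':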

    >>> ct.named_transformers_["enc"].categories_
    [array(['France', 'Japan', 'Netherlands'], dtype=object), array(['Amsterdam', 'Lyon', 'Paris', 'Tokyo'], dtype=object)]
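
To answer the "what is 2?" part of the question directly, here is a minimal sketch that builds on the ColumnTransformer above. It assumes the same example dataframe; the names `cat_columns`, `code_to_name` and `decoded` are only illustrative. It shows two ways to go back from codes to names: a lookup table built from `categories_`, and the encoder's own `inverse_transform`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(
    {
        "country": ["France", "France", "Japan", "Netherlands"],
        "city": ["Paris", "Lyon", "Tokyo", "Amsterdam"],
        "population": [13024518, 2323221, 37468000, 2480394],
    }
)

cat_columns = ["country", "city"]  # illustrative name for the categorical columns
ct = ColumnTransformer(
    [("enc", OrdinalEncoder(), cat_columns)],
    remainder="passthrough",
)
encoded = ct.fit_transform(df)

# The fitted encoder keeps one array of categories per encoded column.
enc = ct.named_transformers_["enc"]

# Option 1: build a {code: name} lookup table for each categorical column.
code_to_name = {
    col: dict(enumerate(cats))
    for col, cats in zip(cat_columns, enc.categories_)
}
print(code_to_name["city"][2])  # Paris

# Option 2: decode whole columns at once. The encoded columns come first
# in the output because "enc" is the first (and only) transformer listed;
# the passthrough column follows them.
decoded = enc.inverse_transform(encoded[:, : len(cat_columns)])
print(decoded[0])  # the first row decoded back to country and city
```

Since the fitted encoder already stores this mapping, there is no need to fit a second set of encoders just to remember it; keeping a reference to the fitted transformer is enough.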