Tags: python, machine-learning, scikit-learn, one-hot-encoding

Return the labels and their encoded values in sklearn LabelEncoder


I'm using LabelEncoder and OneHotEncoder from sklearn in a machine learning project to encode the labels (country names) in the dataset. Everything works well and my model runs fine. The project is to classify whether a bank customer will stay with or leave the bank based on a number of features, including the customer's country.

My issue arises when I want to predict (classify) a single new customer. The data for the new customer has not been pre-processed yet (i.e., country names are not encoded). Something like the following:

new_customer = np.array([['France', 600, 'Male', 40, 3, 60000, 2, 1,1, 50000]])

In the online course where I am learning machine learning, the instructor opened the pre-processed dataset that included the encoded data, manually looked up the code for France, and updated it in new_customer, as follows:

new_customer = np.array([[0, 0, 600, 'Male', 40, 3, 60000, 2, 1,1, 50000]])

I believe that this is not practical. There must be a way to automatically encode France to the same code used in the original dataset, or at least a way to return a list of the countries and their encoded values. Manually encoding a label seems tedious and error-prone. So how can I automate this process, or generate the codes for the labels? Thanks in advance.


Solution

  • It seems like you may be looking for the .transform() method of your estimator.

    >>> from sklearn.preprocessing import LabelEncoder
    
    >>> c = ['France', 'UK', 'US', 'US', 'UK', 'China', 'France']
    >>> enc = LabelEncoder().fit(c)
    >>> encoded = enc.transform(c)
    >>> encoded
    array([1, 2, 3, 3, 2, 0, 1])
    
    >>> enc.transform(['France'])
    array([1])
    

    This takes the "mapping" that was learned when you called fit(c) and applies it to new data (in this case, a new label). You can see this mapping in reverse:

    >>> enc.inverse_transform(encoded)
    array(['France', 'UK', 'US', 'US', 'UK', 'China', 'France'], dtype='<U6')
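
    If what you want is the full list of labels and their codes, one option is to read it off the fitted encoder's classes_ attribute. The following is a minimal sketch, assuming the same toy country list as above; the names countries and label_to_code are only illustrative. LabelEncoder stores the unique labels in sorted order, so each label's position in classes_ is its code.

```python
from sklearn.preprocessing import LabelEncoder

# Same toy data as in the example above (illustrative only).
countries = ['France', 'UK', 'US', 'US', 'UK', 'China', 'France']
enc = LabelEncoder().fit(countries)

# classes_ holds the unique labels in sorted order;
# the index of a label in classes_ is the code transform() assigns to it.
label_to_code = {str(label): code for code, label in enumerate(enc.classes_)}

print(label_to_code)
# {'China': 0, 'France': 1, 'UK': 2, 'US': 3}
```

    Note that transform() raises a ValueError for a label the encoder did not see during fit, so a new customer from an unseen country would need separate handling.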
    

    As another answer notes, if you want to reuse the encoder across Python sessions, you can serialize the fitted estimator to disk like this:

    import pickle
    
    with open('enc.pickle', 'wb') as file:
        pickle.dump(enc, file, pickle.HIGHEST_PROTOCOL)
    

    Then load this in a new session and transform incoming data with it.
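
    A rough sketch of the loading side might look like the following. So that it runs on its own, it repeats the fit and dump steps and writes to a temporary directory; the path and the names loaded_enc and new_country_code are assumptions made for illustration. In a real project you would only run the second half in the new session, pointing at the file you saved earlier.

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import LabelEncoder

# --- Training session: fit and save the encoder (repeated here so the sketch is self-contained) ---
enc = LabelEncoder().fit(['France', 'UK', 'US', 'US', 'UK', 'China', 'France'])
path = os.path.join(tempfile.mkdtemp(), 'enc.pickle')  # illustrative location
with open(path, 'wb') as file:
    pickle.dump(enc, file, pickle.HIGHEST_PROTOCOL)

# --- Later session: load the fitted encoder and encode the new customer's country ---
with open(path, 'rb') as file:
    loaded_enc = pickle.load(file)

# The loaded encoder applies the same mapping that was learned during fit.
new_country_code = loaded_enc.transform(['France'])
print(new_country_code)
```

    Only unpickle files you trust, and keep in mind that a pickled estimator is not guaranteed to load correctly under a different scikit-learn version than the one that saved it.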