python machine-learning scikit-learn one-hot-encoding

return the labels and their encoded values in sklearn LabelEncoder

I'm using LabelEncoder and OneHotEncoder from sklearn in a Machine Learning project to encode the labels (country names) in the dataset. Everything works good and my model runs perfectly. The project is to classify whether a bank customer will continue with or leave the bank based on a number of features(data), including the customer's country.

My issue arises when I want to predict (classify) a new customer (one only). The data for the new customer is still not pre-processed (i.e., country names are not encoded). Something like the following:

new_customer = np.array([['France', 600, 'Male', 40, 3, 60000, 2, 1,1, 50000]])

In the online course, where I learn machine learning, the instructor opened the pre-processed dataset that included the encoded data and manually checked the code for France and updated it in the new_customer, as the following:

new_customer = np.array([[0, 0, 600, 'Male', 40, 3, 60000, 2, 1,1, 50000]])

I believe that this is not practical, there must be a way to automatically encode France to the same code used in the original dataset, or at least a way to return a list of the countries and their encoded values. Manually encoding a label seems tedious and error-prone. So how can I automate this process, or generate the codes for the labels? Thanks in advance.

Solution

It seems like you may be looking for the .transform() method of your estimator.

>>> from sklearn.preprocessing import LabelEncoder

>>> c = ['France', 'UK', 'US', 'US', 'UK', 'China', 'France']
>>> enc = LabelEncoder().fit(c)
>>> encoded = enc.transform(c)
>>> encoded
array([1, 2, 3, 3, 2, 0, 1])

>>> encoded.transform(['France'])
array([1])

This takes the "mapping" that was learned when you called fit(c) and applies it to new data (in this case, a new label). You can see this mapping in reverse:

>>> enc.inverse_transform(encoded)
array(['France', 'UK', 'US', 'US', 'UK', 'China', 'France'], dtype='<U6')

As mentioned by the answer here, if you want to do this between Python sessions, you could serialize the estimator to disk like this:

import pickle

with open('enc.pickle', 'wb') as file:
    pickle.dump(enc, file, pickle.HIGHEST_PROTOCOL)

Then load this in a new session and transform incoming data with it.