Suppose I have a dataframe like the following:
import pandas as pd

df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
                   'color': ['Black', 'Blue', 'Brown', 'Black'],
                   'age': [1, 10, 3, 6],
                   'pet': [1, 0, 1, 1],
                   'sex': ['m', 'm', 'f', 'f'],
                   'name': ['Rex', 'Gizmo', 'Suzy', 'Boo']})
I want to use LabelEncoder to encode the "animal", "color", "sex" and "name" columns, but I don't need to encode the other two. I also want to be able to inverse_transform the columns afterwards.
I have tried the following, and although encoding works as I'd expect it to, reversing does not.
from sklearn.preprocessing import LabelEncoder

to_encode = ["animal", "color", "sex", "name"]
le = LabelEncoder()
for col in to_encode:
    df[col] = le.fit_transform(df[col])

# to inverse:
for col in to_encode:
    df[col] = le.inverse_transform(df[col])
The inverse_transform function results in the following dataframe:
animal | color | age | pet | sex | name |
---|---|---|---|---|---|
Rex | Boo | 1 | 1 | Gizmo | Rex |
Boo | Gizmo | 10 | 0 | Gizmo | Gizmo |
Rex | Rex | 3 | 1 | Boo | Suzy |
Gizmo | Boo | 6 | 1 | Boo | Boo |
It's obviously not right, but I'm not sure how else I'd accomplish this?
Any advice would be appreciated!
As you can see in your output, when you call inverse_transform, the code only uses the information the encoder learned from the last column it was fitted on, "name". Every call to fit_transform refits the same encoder and discards what it learned before, which is why every encoded column now contains names. You need one LabelEncoder() for each column.
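To make the overwriting visible, here is a minimal sketch that fits a single encoder on two of the question's columns in turn and records its classes_ attribute after each fit. The variable names animal_classes and name_classes are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
                   'name': ['Rex', 'Gizmo', 'Suzy', 'Boo']})

le = LabelEncoder()

le.fit(df['animal'])
animal_classes = list(le.classes_)  # classes learned from "animal"

le.fit(df['name'])                  # refitting replaces the "animal" classes
name_classes = list(le.classes_)

print(animal_classes)
print(name_classes)

# Decoding animal codes with the refitted encoder yields names, not animals.
decoded = list(le.inverse_transform([2, 0, 2, 1]))
print(decoded)
```

After the second fit, the encoder only knows the values of "name", so the animal codes 2, 0, 2, 1 are decoded into names, which matches the first column of the wrong output in the question.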
The key here is to have one LabelEncoder fitted for each column. To do this, I recommend saving them in a dictionary:
from sklearn import preprocessing

to_encode = ["animal", "color", "sex", "name"]
d = {}
for col in to_encode:
    # One encoder instance per column, stored under the column name.
    # Note that we are only fitting here, not transforming.
    d[col] = preprocessing.LabelEncoder().fit(df[col])
If we print the dictionary now, we will obtain something like this:
{'animal': LabelEncoder(),
'color': LabelEncoder(),
'sex': LabelEncoder(),
'name': LabelEncoder()}
As we can see, each column we want to transform now has its own fitted LabelEncoder(). This means, for example, that the animal LabelEncoder stores that 0 corresponds to Bird, 1 to Cat and 2 to Dog, and likewise for every other column.
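If you want to inspect what each encoder has stored, one option is to read its classes_ attribute, where the position of a value is its integer code. The sketch below rebuilds the dictionary d from the question's data; the mappings name is just an illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
                   'color': ['Black', 'Blue', 'Brown', 'Black'],
                   'age': [1, 10, 3, 6],
                   'pet': [1, 0, 1, 1],
                   'sex': ['m', 'm', 'f', 'f'],
                   'name': ['Rex', 'Gizmo', 'Suzy', 'Boo']})

to_encode = ["animal", "color", "sex", "name"]

# One fitted encoder per column, keyed by column name.
d = {col: LabelEncoder().fit(df[col]) for col in to_encode}

# classes_ is sorted; the index of each value is the code it is encoded to.
mappings = {col: dict(enumerate(enc.classes_)) for col, enc in d.items()}

for col, mapping in mappings.items():
    print(col, mapping)
```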
Once every column is fitted, we can transform and, later, inverse_transform if we want to. The only thing to be aware of is that every transform/inverse_transform call has to use the LabelEncoder that belongs to that column.
Here we transform:
for col in to_encode:
    # Look up this column's own encoder in the dictionary.
    df[col] = d[col].transform(df[col])
df
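| | animal | color | age | pet | sex | name |
|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 1 | 1 | 2 |
| 1 | 0 | 1 | 10 | 0 | 1 | 1 |
| 2 | 2 | 2 | 3 | 1 | 0 | 3 |
| 3 | 1 | 0 | 6 | 1 | 0 | 0 |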
And, once the df is transformed, we can inverse_transform:
for col in to_encode:
    df[col] = d[col].inverse_transform(df[col])
df
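| | animal | color | age | pet | sex | name |
|---|---|---|---|---|---|---|
| 0 | Dog | Black | 1 | 1 | m | Rex |
| 1 | Bird | Blue | 10 | 0 | m | Gizmo |
| 2 | Dog | Brown | 3 | 1 | f | Suzy |
| 3 | Cat | Black | 6 | 1 | f | Boo |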
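The steps above can be wrapped into two small helpers so the encoders always travel with the data. This is only a sketch: encode_columns and decode_columns are hypothetical names, not part of scikit-learn, and both return a copy rather than modifying the input.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_columns(frame, columns):
    """Return an encoded copy of frame plus one fitted encoder per column."""
    encoded = frame.copy()
    encoders = {}
    for col in columns:
        encoders[col] = LabelEncoder().fit(frame[col])
        encoded[col] = encoders[col].transform(frame[col])
    return encoded, encoders


def decode_columns(frame, encoders):
    """Return a copy of frame with every encoded column mapped back."""
    decoded = frame.copy()
    for col, enc in encoders.items():
        decoded[col] = enc.inverse_transform(frame[col])
    return decoded


df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
                   'color': ['Black', 'Blue', 'Brown', 'Black'],
                   'age': [1, 10, 3, 6],
                   'pet': [1, 0, 1, 1],
                   'sex': ['m', 'm', 'f', 'f'],
                   'name': ['Rex', 'Gizmo', 'Suzy', 'Boo']})

to_encode = ["animal", "color", "sex", "name"]
encoded, encoders = encode_columns(df, to_encode)
restored = decode_columns(encoded, encoders)

# The round trip should give back the original values in every column.
round_trip_ok = all(restored[c].tolist() == df[c].tolist() for c in df.columns)
print(round_trip_ok)
```

Keeping the original df untouched makes it easy to compare the restored frame against it.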
One interesting idea could be using ColumnTransformer, but unfortunately it doesn't support inverse_transform().
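A different route, not the dictionary approach described above, is scikit-learn's OrdinalEncoder, which is designed for feature columns, handles several columns at once and does provide inverse_transform. The sketch below assumes integer codes are wanted, hence dtype=int (the default output is float); the names encoded and restored are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'animal': ['Dog', 'Bird', 'Dog', 'Cat'],
                   'color': ['Black', 'Blue', 'Brown', 'Black'],
                   'age': [1, 10, 3, 6],
                   'pet': [1, 0, 1, 1],
                   'sex': ['m', 'm', 'f', 'f'],
                   'name': ['Rex', 'Gizmo', 'Suzy', 'Boo']})

to_encode = ["animal", "color", "sex", "name"]

# A single encoder keeps a separate, sorted category list per column.
enc = OrdinalEncoder(dtype=int)

encoded = df.copy()
encoded[to_encode] = enc.fit_transform(df[to_encode])

restored = encoded.copy()
restored[to_encode] = enc.inverse_transform(encoded[to_encode])

print(encoded)
print(restored)
```

The per-column categories are available afterwards in enc.categories_, in the same order as to_encode.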