I would like to LabelEncode a column in pandas where each row contains a list of strings. Since a similar string/text carries a same meaning across rows, encoding should respect that, and ideally encode it with a unique number. Imagine:
import pandas as pd
df =pd.DataFrame({
'A':[['OK', 'NG', 'Repair', 'Peace'],['Sky', 'NG', 'Fixed', 'Conflict'],['Crossed', 'OK', 'Engine', 'Peace'],['OK', 'Beats', 'RPi', 'Country']]
})
# df
A
0 [OK, NG, Repair, Peace]
1 [Sky, NG, Fixed, Conflict]
2 [Crossed, OK, Engine, Peace]
3 [OK, Beats, RPi, Country]
when I do the the followings:
le = LabelEncoder()
df['LabelEncodedA'] = df['A'].apply(le.fit_transform)
it returns:
A LabelEncodedA
0 [OK, NG, Repair, Peace] [1, 0, 3, 2]
1 [Sky, NG, Fixed, Conflict] [1, 3, 2, 0]
2 [Crossed, OK, Engine, Peace] [0, 2, 1, 3]
3 [OK, Beats, RPi, Country] [2, 0, 3, 1]
Which is not the intended result. Here each row is LabelEncoded in isolation. And a string e.g. 'OK' in the first row is not encoded to as the one in third or fourth row. Ideally I would like to have them encoded globally across rows. Perhaps one way may be to create a corpus out of that column, and using Tokenization or LabelEncoding obtain a mapping to encode manually the lists? How to convert then in pandas column containing list of strings to a corpus text? Or are there any better approaches?
Expected result (hypothetical):
A LabelEncodedA
0 [OK, NG, Repair, Peace] [0, 1, 2, 3]
1 [Sky, NG, Fixed, Conflict] [4, 1, 5, 6]
2 [Crossed, OK, Engine, Peace] [7, 0, 8, 9]
3 [OK, Beats, RPi, Country] [0, 10, 11, 12]
One approach would be to explode
the column, then factorize
to encode the column as categorical variable, then group the encoded column and aggregate using list
a = df['A'].explode()
a[:] = a.factorize()[0]
df['Encoded'] = a.groupby(level=0).agg(list)
Result
A Encoded
0 [OK, NG, Repair, Peace] [0, 1, 2, 3]
1 [Sky, NG, Fixed, Conflict] [4, 1, 5, 6]
2 [Crossed, OK, Engine, Peace] [7, 0, 8, 3]
3 [OK, Beats, RPi, Country] [0, 9, 10, 11]