This encoding process will generate a mapping between each categorical value and its corresponding numeric value:
import pandas as pd
import category_encoders as ce
cols_a = ['group1', 'group2']
dfa = pd.DataFrame([['A1', 'A2', 1], ['B1', 'B2', 4], ['A1', 'C2', 3], ['B1', 'B2', 5]], columns=['group1', 'group2', 'label'])
enc = ce.TargetEncoder(cols=cols_a)
enc.fit(dfa[cols_a], dfa['label'])
enc.mapping
You can ignore the encoding process itself; only the resulting mapping matters here.
How can I flatten this mapping into the expected dataframe below?
Follow-up: I eventually want to replace 'cat_val' with the original categorical values from the mapping enc.ordinal_encoder.mapping. Is there an easy way to achieve this?
My solution is to group by 'group' -> find the corresponding dictionary -> replace it with the value from the dictionary.
Here's one approach:
Step 1: convert enc.mapping to df
Using pd.concat with names + Series.reset_index with name:
names = ['group', 'cat_val']
enc_mapping = (pd.concat(enc.mapping, names=names)
.reset_index(name='value')
)
Output:
    group  cat_val     value
0  group1        1  3.072686
1  group1        2  3.427314
2  group1       -1  3.250000
3  group1       -2  3.250000
4  group2        1  2.957256
5  group2        2  3.427314
6  group2        3  3.217473
7  group2       -1  3.250000
8  group2       -2  3.250000
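To see step 1 in isolation, here is a minimal sketch that runs without category_encoders. The dict `mapping` is a hypothetical stand-in for enc.mapping (one Series of encoded values per column, indexed by ordinal code), and its numbers are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for enc.mapping (values are illustrative only):
# one Series per encoded column, indexed by ordinal code.
mapping = {
    "group1": pd.Series([3.0, 3.5, 3.25], index=[1, 2, -1]),
    "group2": pd.Series([2.9, 3.4, 3.25], index=[1, 2, -1]),
}

# pd.concat on a dict uses the keys as the outer index level;
# `names` labels both levels, and reset_index(name=...) names the value column.
flat = (pd.concat(mapping, names=["group", "cat_val"])
          .reset_index(name="value"))
print(flat)
```

The dict keys end up in the 'group' column and the original Series index in 'cat_val', which is the same shape as the output above.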
Step 2: map based on enc.ordinal_encoder.mapping
m = (pd.concat({item['col']: pd.Series(item['mapping'].index, item['mapping'])
for item in enc.ordinal_encoder.mapping})
)
enc_mapping['cat_val'] = enc_mapping.set_index(names).index.map(m)
# alternative:
# enc_mapping['cat_val'] = enc_mapping[names].apply(tuple, axis=1).map(m)
Output:
    group cat_val     value
0  group1      A1  3.072686
1  group1      B1  3.427314
2  group1     NaN  3.250000
3  group1     NaN  3.250000
4  group2      A2  2.957256
5  group2      B2  3.427314
6  group2      C2  3.217473
7  group2     NaN  3.250000
8  group2     NaN  3.250000
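As a possible alternative to index.map, the same lookup can be written as a left merge on ('group', 'cat_val'). This is only a sketch: `ordinal_mapping` is a hypothetical stand-in for enc.ordinal_encoder.mapping (assumed to be a list of dicts with 'col' and 'mapping' keys, where 'mapping' is a Series from category to code), and the 'category' column name is introduced here for illustration:

```python
import pandas as pd

# Stand-in for the step 1 result (values are illustrative only).
enc_mapping = pd.DataFrame({
    "group": ["group1", "group1", "group1", "group2", "group2"],
    "cat_val": [1, 2, -1, 1, -1],
    "value": [3.0, 3.5, 3.25, 2.9, 3.25],
})

# Hypothetical stand-in for enc.ordinal_encoder.mapping.
ordinal_mapping = [
    {"col": "group1", "mapping": pd.Series([1, 2], index=["A1", "B1"])},
    {"col": "group2", "mapping": pd.Series([1], index=["A2"])},
]

# Flatten to one row per (group, category, code).
lookup = (pd.concat({item["col"]: item["mapping"] for item in ordinal_mapping},
                    names=["group", "category"])
            .reset_index(name="cat_val"))

# Left merge keeps every row of enc_mapping; codes without a category get NaN.
decoded = enc_mapping.merge(lookup, on=["group", "cat_val"], how="left")
print(decoded)
```

The merge keeps the numeric code next to the decoded category, which can help when checking which codes (such as -1) have no original value.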
Explanation / intermediates
Build a dict from the 'col' key in each dict in enc.ordinal_encoder.mapping and the values from 'mapping', but with index and values swapped. Pass this to pd.concat:
m
group1   1     A1
         2     B1
        -2    NaN
group2   1     A2
         2     B2
         3     C2
        -2    NaN
dtype: object
Set the index of enc_mapping to names with df.set_index, apply index.map with m, and assign.
(On step 2: there may be an easier way to get the code mappings; going via enc.ordinal_encoder.transform(dfa[cols_a]) could be promising.)