
Flatten a dictionary with DataFrame values into a DataFrame

This encoding process will generate a mapping between each categorical value and its corresponding numeric value:

import pandas as pd
import category_encoders as ce

cols_a = ['group1', 'group2']
dfa = pd.DataFrame([['A1', 'A2', 1], ['B1', 'B2', 4], ['A1', 'C2', 3], ['B1', 'B2', 5]], columns=['group1', 'group2', 'label'])
enc = ce.TargetEncoder(cols=cols_a)
enc.fit(dfa[cols_a], dfa['label'])

enc.mapping

You can ignore the encoding process itself and focus on the output mapping.

How can I flatten this mapping into the expected dataframe below?

Follow-up: I eventually want to replace the 'cat_val' with its original categorical values from the mapping enc.ordinal_encoder.mapping. Is there any easy way to achieve this?

My solution is to group by 'group' -> find the corresponding dictionary -> replace it with the value from the dictionary.

Solution

Here's one approach:

Step 1: convert enc.mapping to df

Use pd.concat with names, then Series.reset_index with name:

names = ['group', 'cat_val']

enc_mapping = (pd.concat(enc.mapping, names=names)
               .reset_index(name='value')
               )

Output:

    group  cat_val     value
0  group1        1  3.072686
1  group1        2  3.427314
2  group1       -1  3.250000
3  group1       -2  3.250000
4  group2        1  2.957256
5  group2        2  3.427314
6  group2        3  3.217473
7  group2       -1  3.250000
8  group2       -2  3.250000
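To see what step 1 does without installing category_encoders, here is a minimal sketch. It assumes enc.mapping behaves like a dict of Series keyed by column name; toy_mapping and its numbers are made up for illustration and are not the encoder's real output.

```python
import pandas as pd

# Hypothetical stand-in for enc.mapping: column name -> Series (code -> value).
toy_mapping = {
    "group1": pd.Series([3.0, 3.5], index=[1, 2]),
    "group2": pd.Series([2.5, 3.5, 3.25], index=[1, 2, 3]),
}

names = ["group", "cat_val"]

# The dict keys become the outer index level; `names` labels both levels.
stacked = pd.concat(toy_mapping, names=names)

# Turn both index levels into columns and name the values column.
flat = stacked.reset_index(name="value")
print(flat)
```

The name argument is needed here because the concatenated Series has no name of its own, so reset_index would otherwise label the values column 0.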

Step 2: map based on enc.ordinal_encoder.mapping

m = (pd.concat({item['col']: pd.Series(item['mapping'].index, item['mapping']) 
                for item in enc.ordinal_encoder.mapping})
     )

enc_mapping['cat_val'] = enc_mapping.set_index(names).index.map(m)

# alternative:
# enc_mapping['cat_val'] = enc_mapping[names].apply(tuple, axis=1).map(m)

Output:

    group cat_val     value
0  group1      A1  3.072686
1  group1      B1  3.427314
2  group1     NaN  3.250000
3  group1     NaN  3.250000
4  group2      A2  2.957256
5  group2      B2  3.427314
6  group2      C2  3.217473
7  group2     NaN  3.250000
8  group2     NaN  3.250000

Explanation / intermediates

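A dict comprehension takes its keys from the 'col' entry of each dict in enc.ordinal_encoder.mapping and its values from the 'mapping' entry, with index and values swapped so that the numeric codes become the index and the original categories become the values. Pass the result to pd.concat: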

m

group1   1     A1
         2     B1
        -2    NaN
group2   1     A2
         2     B2
         3     C2
        -2    NaN
dtype: object
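The index/value swap can be tried in isolation. This is a small sketch that assumes each item['mapping'] is a Series from category to integer code; cat_to_code below is a hand-built stand-in, not taken from the encoder.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one item['mapping']: category -> integer code,
# with NaN mapped to -2 as in the output shown above.
cat_to_code = pd.Series([1, 2, -2], index=["A1", "B1", np.nan])

# Swap roles: the codes become the index, the categories become the values.
code_to_cat = pd.Series(cat_to_code.index, index=cat_to_code)
print(code_to_cat)
```

After the swap, code_to_cat.loc[1] returns 'A1', which is the lookup direction needed to decode 'cat_val'.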

Now set the index of enc_mapping to names with df.set_index, apply index.map with m, and assign the result to 'cat_val'.
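The lookup itself can be sketched on toy data. The frame and the lookup Series below are invented stand-ins shaped like enc_mapping and m; only the set_index / index.map pattern is taken from the answer.

```python
import numpy as np
import pandas as pd

names = ["group", "cat_val"]

# Toy table shaped like enc_mapping after step 1 (values invented).
toy = pd.DataFrame({
    "group": ["group1", "group1", "group1", "group2"],
    "cat_val": [1, 2, -1, 1],
    "value": [3.0, 3.5, 3.25, 2.5],
})

# Toy lookup keyed by (group, code), shaped like m.
lookup = pd.concat({
    "group1": pd.Series(["A1", "B1", np.nan], index=[1, 2, -2]),
    "group2": pd.Series(["A2"], index=[1]),
})

# Each (group, cat_val) pair is looked up in the MultiIndex of `lookup`;
# pairs that are absent, such as ('group1', -1), come back as NaN.
toy["cat_val"] = toy.set_index(names).index.map(lookup)
print(toy)
```

This is why the rows with codes -1 and -2 end up as NaN in the final output: -1 has no entry in the lookup, and -2 maps to NaN.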

(On step 2: I can imagine that there is an easier way to get the code mappings. Using enc.ordinal_encoder.transform(dfa[cols_a]) could be promising.)