python pandas dataframe dictionary mapping

Flatten a dictionary with DataFrame values into a DataFrame


The encoding process below generates a mapping between each categorical value and its corresponding numeric value:

import pandas as pd
import category_encoders as ce

cols_a = ['group1', 'group2']
dfa = pd.DataFrame([['A1', 'A2', 1], ['B1', 'B2', 4], ['A1', 'C2', 3], ['B1', 'B2', 5]], columns=['group1', 'group2', 'label'])
enc = ce.TargetEncoder(cols=cols_a)
enc.fit(dfa[cols_a], dfa['label'])

enc.mapping

[image: output of enc.mapping]

You can ignore the encoding process itself; only the resulting mapping matters here.

How can I flatten this mapping into the expected DataFrame below?

[image: expected DataFrame]

Follow-up: I eventually want to replace each 'cat_val' code with its original categorical value, using the mapping in enc.ordinal_encoder.mapping. Is there an easy way to do this?

My current solution is to group by 'group', look up the dictionary that belongs to each group, and replace each code with the value from that dictionary.
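
The group-by idea described above could look roughly like the sketch below. It does not use category_encoders: `enc_mapping` is a hand-built stand-in for the flattened mapping, and `code_to_cat` is a hypothetical set of per-column dictionaries (code to original category) that would have to be derived from enc.ordinal_encoder.mapping.

```python
import pandas as pd

# Stand-in for the flattened mapping (values copied from the output in this post).
enc_mapping = pd.DataFrame({
    "group": ["group1", "group1", "group1", "group2", "group2", "group2", "group2"],
    "cat_val": [1, 2, -1, 1, 2, 3, -1],
    "value": [3.072686, 3.427314, 3.25, 2.957256, 3.427314, 3.217473, 3.25],
})

# Hypothetical per-column lookups: ordinal code -> original category.
code_to_cat = {
    "group1": {1: "A1", 2: "B1"},
    "group2": {1: "A2", 2: "B2", 3: "C2"},
}

parts = []
for group, sub in enc_mapping.groupby("group", sort=False):
    # Pick this group's dictionary and map the codes through it;
    # codes without an entry (here -1) become NaN.
    parts.append(sub.assign(cat_val=sub["cat_val"].map(code_to_cat[group])))

result = pd.concat(parts)
print(result)
```

This works, but it needs one pass per group, which is why a single vectorized lookup on (group, cat_val) pairs is usually tidier.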



Solution

  • Here's one approach:

    Step 1: convert enc.mapping to df

    Use pd.concat with the names parameter, then Series.reset_index with the name parameter:

    names = ['group', 'cat_val']
    
    enc_mapping = (pd.concat(enc.mapping, names=names)
                   .reset_index(name='value')
                   )
    

    Output:

        group  cat_val     value
    0  group1        1  3.072686
    1  group1        2  3.427314
    2  group1       -1  3.250000
    3  group1       -2  3.250000
    4  group2        1  2.957256
    5  group2        2  3.427314
    6  group2        3  3.217473
    7  group2       -1  3.250000
    8  group2       -2  3.250000
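
Step 1 can be tried without category_encoders installed. The sketch below assumes enc.mapping is a dict with one Series per encoded column, indexed by ordinal code; the `mapping` dict here is a hand-built stand-in whose values are copied from the output above.

```python
import pandas as pd

# Stand-in for enc.mapping (assumption: dict of column name -> Series indexed by code).
mapping = {
    "group1": pd.Series([3.072686, 3.427314, 3.25, 3.25], index=[1, 2, -1, -2]),
    "group2": pd.Series([2.957256, 3.427314, 3.217473, 3.25, 3.25],
                        index=[1, 2, 3, -1, -2]),
}

names = ["group", "cat_val"]

# The dict keys become the outer index level; `names` labels both levels.
stacked = pd.concat(mapping, names=names)

# Series.reset_index moves both levels into columns and names the values column.
enc_mapping = stacked.reset_index(name="value")
print(enc_mapping)
```

The intermediate `stacked` is a Series with a two-level MultiIndex, which is why reset_index accepts a `name` argument here.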
    

    Step 2: map based on enc.ordinal_encoder.mapping

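    # Invert each column's ordinal mapping: index = integer code, value = original category
    m = (pd.concat({item['col']: pd.Series(item['mapping'].index, item['mapping']) 
                    for item in enc.ordinal_encoder.mapping})
         )
    
    # Look up each (group, cat_val) pair in m; pairs not found become NaN
    enc_mapping['cat_val'] = enc_mapping.set_index(names).index.map(m)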
    
    # alternative:
    # enc_mapping['cat_val'] = enc_mapping[names].apply(tuple, axis=1).map(m)
    

    Output:

        group cat_val     value
    0  group1      A1  3.072686
    1  group1      B1  3.427314
    2  group1     NaN  3.250000
    3  group1     NaN  3.250000
    4  group2      A2  2.957256
    5  group2      B2  3.427314
    6  group2      C2  3.217473
    7  group2     NaN  3.250000
    8  group2     NaN  3.250000
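
Step 2 can also be sketched on stand-in data. The `ordinal_mapping` list below is an assumption about the shape of enc.ordinal_encoder.mapping (one dict per column, with a 'col' name and a 'mapping' Series from category to code), inferred from how the code above uses it.

```python
import numpy as np
import pandas as pd

# Stand-in for enc.ordinal_encoder.mapping (assumed shape, see lead-in).
ordinal_mapping = [
    {"col": "group1",
     "mapping": pd.Series([1, 2, -2], index=["A1", "B1", np.nan])},
    {"col": "group2",
     "mapping": pd.Series([1, 2, 3, -2], index=["A2", "B2", "C2", np.nan])},
]

# Stand-in for the step 1 result (a subset of the rows shown above).
names = ["group", "cat_val"]
enc_mapping = pd.DataFrame({
    "group": ["group1", "group1", "group1", "group1", "group2", "group2", "group2"],
    "cat_val": [1, 2, -1, -2, 1, 2, 3],
    "value": [3.072686, 3.427314, 3.25, 3.25, 2.957256, 3.427314, 3.217473],
})

# Invert each mapping (code -> category) and stack them under their column name.
m = pd.concat({
    item["col"]: pd.Series(item["mapping"].index, index=item["mapping"].to_numpy())
    for item in ordinal_mapping
})

# Look up every (group, cat_val) pair in m; unknown pairs become NaN.
enc_mapping["cat_val"] = enc_mapping.set_index(names).index.map(m)
print(enc_mapping)
```

Both -1 (no entry in m) and -2 (mapped to NaN) end up as NaN, matching the output above.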
    

    Explanation / intermediates

    m
    
    group1   1     A1
             2     B1
            -2    NaN
    group2   1     A2
             2     B2
             3     C2
            -2    NaN
    dtype: object
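
As an alternative to index.map, the intermediate `m` can be turned into a small lookup table and joined with a left merge. This is a swapped-in technique, not what the code above does; `m` and `enc_mapping` are rebuilt by hand here, and the column name 'category' is chosen only for illustration.

```python
import numpy as np
import pandas as pd

# m as shown above: (group, code) -> original category.
m = pd.Series(
    ["A1", "B1", np.nan, "A2", "B2", "C2", np.nan],
    index=pd.MultiIndex.from_tuples(
        [("group1", 1), ("group1", 2), ("group1", -2),
         ("group2", 1), ("group2", 2), ("group2", 3), ("group2", -2)],
        names=["group", "cat_val"],
    ),
    name="category",
)

# A few rows of the step 1 result, for illustration.
enc_mapping = pd.DataFrame({
    "group": ["group1", "group1", "group1", "group2"],
    "cat_val": [1, 2, -1, 3],
    "value": [3.072686, 3.427314, 3.25, 3.217473],
})

# Left-merge on both key columns; codes absent from m (such as -1) come back as NaN.
merged = enc_mapping.merge(m.reset_index(), on=["group", "cat_val"], how="left")
print(merged)
```

The merge keeps the numeric code and adds the category as a separate column, which can be handy when both are needed.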
    

    (On step 2: there may be a simpler way to get the code mappings. Using enc.ordinal_encoder.transform(dfa[cols_a]) looks promising.)
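
A rough sketch of that transform-based idea follows. It assumes enc.ordinal_encoder.transform(dfa[cols_a]) returns a DataFrame with the same columns holding integer codes; to stay runnable without category_encoders, `codes` below is a hand-built stand-in consistent with the mapping shown above.

```python
import pandas as pd

cols_a = ["group1", "group2"]
dfa = pd.DataFrame(
    [["A1", "A2", 1], ["B1", "B2", 4], ["A1", "C2", 3], ["B1", "B2", 5]],
    columns=["group1", "group2", "label"],
)

# Stand-in for enc.ordinal_encoder.transform(dfa[cols_a]) (assumed output shape);
# codes follow the mapping above: A1 -> 1, B1 -> 2, A2 -> 1, B2 -> 2, C2 -> 3.
codes = pd.DataFrame({"group1": [1, 2, 1, 2], "group2": [1, 2, 3, 2]})

# Pair each code with its original category, column by column, then keep unique pairs.
lookup = pd.concat(
    [
        pd.DataFrame({"group": col, "cat_val": codes[col], "category": dfa[col]})
        for col in cols_a
    ],
    ignore_index=True,
).drop_duplicates(ignore_index=True)
print(lookup)
```

A lookup built this way only covers categories present in the data passed to transform, so the -1 and -2 sentinel codes have no row and would stay NaN after a left merge on ['group', 'cat_val'].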