Tags: python, pandas, dataframe, exploratory-data-analysis

Pandas Groupby Operation For Condition Based Feature Creation


I'm having difficulties creating a feature based on a groupby plus some conditions.

The data I have looks similar to this:

    ir_id   pli pli_missing err_type
0   name1   1.0          no  UNKNOWN
1   name1   2.0          no      NaN
2   name1   3.0          no      NaN
3   name1   NaN         yes  UNKNOWN
4   name2   4.0          no      NaN
5   name2   5.0          no      NaN
6   name2   NaN         yes  UNKNOWN
7   name3   6.0          no      NaN
8   name3   7.0          no      NaN
9   name3   8.0          no      NaN
10  name3   9.0          no  UNKNOWN
11  name4  10.0          no      NaN
12  name4  11.0          no      NaN
13  name4  12.0          no      NaN
14  name5   NaN         yes  UNKNOWN
15  name5   NaN         yes  UNKNOWN
16  name5   NaN         yes  UNKNOWN
17  name5   NaN         yes  UNKNOWN

I want to group by ir_id so that I can create an err_flag column, like this (shown per group):

   ir_id  pli pli_missing err_type err_flag
4  name2  4.0          no      NaN    type1
5  name2  5.0          no      NaN    type1
6  name2  NaN         yes  UNKNOWN    type1

    ir_id  pli pli_missing err_type err_flag
14  name5  NaN         yes  UNKNOWN    type1
15  name5  NaN         yes  UNKNOWN    type1
16  name5  NaN         yes  UNKNOWN    type1
17  name5  NaN         yes  UNKNOWN    type1

    ir_id  pli pli_missing err_type err_flag
7   name3  6.0          no      NaN    type2
8   name3  7.0          no      NaN    type2
9   name3  8.0          no      NaN    type2
10  name3  9.0          no  UNKNOWN    type2

    ir_id   pli pli_missing err_type err_flag
11  name4  10.0          no      NaN    type3
12  name4  11.0          no      NaN    type3
13  name4  12.0          no      NaN    type3

   ir_id  pli pli_missing err_type  err_flag
0  name1  1.0          no  UNKNOWN both_type
1  name1  2.0          no      NaN both_type
2  name1  3.0          no      NaN both_type
3  name1  NaN         yes  UNKNOWN both_type

Which results in the final output:

    ir_id   pli pli_missing err_type  err_flag
0   name1   1.0          no  UNKNOWN both_type
1   name1   2.0          no      NaN both_type
2   name1   3.0          no      NaN both_type
3   name1   NaN         yes  UNKNOWN both_type
4   name2   4.0          no      NaN     type1
5   name2   5.0          no      NaN     type1
6   name2   NaN         yes  UNKNOWN     type1
7   name3   6.0          no      NaN     type2
8   name3   7.0          no      NaN     type2
9   name3   8.0          no      NaN     type2
10  name3   9.0          no  UNKNOWN     type2
11  name4  10.0          no      NaN     type3
12  name4  11.0          no      NaN     type3
13  name4  12.0          no      NaN     type3
14  name5   NaN         yes  UNKNOWN     type1
15  name5   NaN         yes  UNKNOWN     type1
16  name5   NaN         yes  UNKNOWN     type1
17  name5   NaN         yes  UNKNOWN     type1

dataset used:

import numpy as np
import pandas as pd

custom_df = pd.DataFrame.from_dict({
    'ir_id':['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name3', 'name3', 'name3', 'name3', 'name4', 'name4', 'name4', 'name5', 'name5', 'name5', 'name5']
    , 'pli': [1,       2,       3,       np.nan,  4,       5,       np.nan,  6,       7,       8,       9,       10,      11,      12,      np.nan,  np.nan,  np.nan,  np.nan]
    , 'pli_missing': ["no","no","no","yes","no","no","yes","no","no","no","no","no","no","no","yes","yes","yes","yes"]
    , 'err_type': ["UNKNOWN",np.nan,np.nan,"UNKNOWN",np.nan,np.nan,"UNKNOWN",np.nan,np.nan,np.nan,"UNKNOWN",np.nan,np.nan,np.nan,"UNKNOWN","UNKNOWN","UNKNOWN","UNKNOWN"]
    , 'err_flag': ["both_type", "both_type", "both_type", "both_type", "type1", "type1", "type1", "type2", "type2", "type2", "type2", "type3", "type3", "type3", "type1", "type1", "type1", "type1"]
})

custom_df

PS: An earlier solution couldn't handle the case ir_id = name5.


Solution

  • I think we could apply a sort of categorical logic here. Here's what I mean.

    Let's say ['type3','type1','type2','both_types'] are verbal representations of the codes [0, 1, 2, 3]. Why this order? As far as I can see, type3 is a sort of default value. type1 and type2 are equal in nature, but we somehow have to distinguish them, so let's follow their names and say that type1 is 1 and type2 is 2. both_types then gets code 3 as the sum of the previous two codes. Now we can separately determine whether a record can be of type 1 or of type 2, and take the sum as the final output. If a record is neither a type 1 nor a type 2 error, the sum is 0. If it is only one of them, the sum keeps that code as is. If a record can be of both types, the sum is 3.

    Let's see how it looks in code:

    err = custom_df['err_type'] == 'UNKNOWN'
    pli = custom_df['pli_missing'] == 'yes'
    grouper = custom_df['ir_id']

    # transform broadcasts the per-group result back to the shape of custom_df
    type1 = (err & pli).groupby(grouper).transform('any')
    type2 = (err & ~pli).groupby(grouper).transform('any')

    # binary encoding: type1 contributes 1, type2 contributes 2
    codes = type1 + 2*type2
    categories = ['type3','type1','type2','both_types']
    custom_df['err_flag_new'] = pd.Categorical.from_codes(codes, categories)
    

    Here's what I got in the end:

    [screenshot of the resulting DataFrame with the new err_flag_new column]
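A self-contained sketch of the same approach, using a trimmed-down version of custom_df (one or two representative rows per group, values taken from the question) to check that each group gets the expected flag, including name5:

```python
import numpy as np
import pandas as pd

# trimmed-down version of custom_df: one row pattern per case
df = pd.DataFrame({
    'ir_id':       ['name1', 'name1', 'name2', 'name2', 'name3', 'name4', 'name5'],
    'pli':         [1.0,     np.nan,  4.0,     np.nan,  9.0,     10.0,    np.nan],
    'pli_missing': ['no',    'yes',   'no',    'yes',   'no',    'no',    'yes'],
    'err_type':    ['UNKNOWN', 'UNKNOWN', np.nan, 'UNKNOWN', 'UNKNOWN', np.nan, 'UNKNOWN'],
})

err = df['err_type'] == 'UNKNOWN'
pli = df['pli_missing'] == 'yes'

# per-group flags broadcast back to row level
type1 = (err & pli).groupby(df['ir_id']).transform('any')
type2 = (err & ~pli).groupby(df['ir_id']).transform('any')

codes = type1 + 2 * type2
df['err_flag_new'] = pd.Categorical.from_codes(
    codes, ['type3', 'type1', 'type2', 'both_types'])

print(df[['ir_id', 'err_flag_new']].drop_duplicates().to_string(index=False))
```

name5 works here because (err & pli) is true for its rows, so the whole group is flagged type1 even though pli is NaN everywhere in the group.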


    Update

    We can look at it this way. Suppose there's a system with some number of independent states, and the system can be described as having any combination of them. Mathematically this can be expressed as a binary code: for each independent state we assign a unique place in the code, where the value 1 or 0 is the logical answer to whether the system has the corresponding state. Binary means a sum like this one:

    state[0]*2^0 + state[1]*2^1 + state[2]*2^2 + state[3]*2^3 + ...
    

    In our case we have only 2 independent states, type1 and type2. The other two are their combinations: type3 means neither of them, and both_types speaks for itself. So we have only the first two terms of the sum above, where state[0] is the logical value for type1 and state[1] is the logical value for type2. That's why I used codes = type1 + 2*type2, which is equal to codes = type1 * 2**0 + type2 * 2**1.
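A quick way to see the encoding is to enumerate the four possible (type1, type2) combinations and look at the resulting code and label:

```python
categories = ['type3', 'type1', 'type2', 'both_types']

# truth table for codes = type1 * 2**0 + type2 * 2**1
for t1 in (False, True):
    for t2 in (False, True):
        code = t1 * 2**0 + t2 * 2**1
        print(f"type1={t1!s:<5} type2={t2!s:<5} -> code {code} -> {categories[code]}")
```

The four codes 0..3 fall out automatically, which is exactly why the categories list has to be ordered ['type3','type1','type2','both_types'].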

    As for the order in ['type3','type1','type2','both_types']: in this list the index of each value matches its code, i.e. each type has the index equal to its binary code. The binary code for type3 in this model is 0b00, which is 0; for type1 it is 0b01, which is 1; for type2 it is 0b10, which is 2; and for both_types it is 0b11, which is 3. These codes are automatically resolved when creating a Categorical sequence from codes, i.e. pandas uses the codes as indexes to pick the corresponding values from the list and places them instead of the codes.
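In isolation, pd.Categorical.from_codes just maps each integer code to the category sitting at that index:

```python
import pandas as pd

# codes are positions into the categories list
cat = pd.Categorical.from_codes(
    [3, 1, 2, 0, 1],
    categories=['type3', 'type1', 'type2', 'both_types'])

print(list(cat))  # ['both_types', 'type1', 'type2', 'type3', 'type1']
```
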

    See also enum.IntFlag as an abstract realization of this idea, and the flags in the re (regular expression) module as an example of how it can be used.
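For comparison, here is the same pair of independent states modelled with enum.IntFlag (the ErrFlag class and its member names are illustrative, not part of the original solution):

```python
from enum import IntFlag

class ErrFlag(IntFlag):
    NONE  = 0  # plays the role of type3: neither state present
    TYPE1 = 1  # 2**0
    TYPE2 = 2  # 2**1

# combining both independent states gives the "both_types" value
both = ErrFlag.TYPE1 | ErrFlag.TYPE2
print(int(both))                   # 3
print(bool(both & ErrFlag.TYPE1))  # True: TYPE1 bit is set
print(bool(ErrFlag.NONE & ErrFlag.TYPE1))  # False: no bits set
```

Bitwise OR plays the same role here as the sum type1 + 2*type2 in the pandas solution, because the two flag bits never overlap.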