I'm having difficulty creating a feature based on a groupby + some conditions. The data I have looks similar to this:
 | ir_id | pli | pli_missing | err_type |
---|---|---|---|---|
0 | name1 | 1.0 | no | UNKNOWN |
1 | name1 | 2.0 | no | NaN |
2 | name1 | 3.0 | no | NaN |
3 | name1 | NaN | yes | UNKNOWN |
4 | name2 | 4.0 | no | NaN |
5 | name2 | 5.0 | no | NaN |
6 | name2 | NaN | yes | UNKNOWN |
7 | name3 | 6.0 | no | NaN |
8 | name3 | 7.0 | no | NaN |
9 | name3 | 8.0 | no | NaN |
10 | name3 | 9.0 | no | UNKNOWN |
11 | name4 | 10.0 | no | NaN |
12 | name4 | 11.0 | no | NaN |
13 | name4 | 12.0 | no | NaN |
14 | name5 | NaN | yes | UNKNOWN |
15 | name5 | NaN | yes | UNKNOWN |
16 | name5 | NaN | yes | UNKNOWN |
17 | name5 | NaN | yes | UNKNOWN |
I want to group by `ir_id` so that I can create an `err_flag` column, as follows:

**type1** — the group has "UNKNOWN" in the `err_type` column, and also "yes" in `pli_missing`:
 | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
4 | name2 | 4.0 | no | NaN | type1 |
5 | name2 | 5.0 | no | NaN | type1 |
6 | name2 | NaN | yes | UNKNOWN | type1 |
Another `type1` group:

 | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
14 | name5 | NaN | yes | UNKNOWN | type1 |
15 | name5 | NaN | yes | UNKNOWN | type1 |
16 | name5 | NaN | yes | UNKNOWN | type1 |
17 | name5 | NaN | yes | UNKNOWN | type1 |
**type2** — the group has "UNKNOWN" in the `err_type` column, and also "no" in `pli_missing`:
 | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
7 | name3 | 6.0 | no | NaN | type2 |
8 | name3 | 7.0 | no | NaN | type2 |
9 | name3 | 8.0 | no | NaN | type2 |
10 | name3 | 9.0 | no | UNKNOWN | type2 |
**type3** — the group has no "UNKNOWN" in the `err_type` column, and only "no" in `pli_missing`:
 | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
11 | name4 | 10.0 | no | NaN | type3 |
12 | name4 | 11.0 | no | NaN | type3 |
13 | name4 | 12.0 | no | NaN | type3 |
**both_type** — the group satisfies both the type1 and type2 conditions:

 | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
0 | name1 | 1.0 | no | UNKNOWN | both_type |
1 | name1 | 2.0 | no | NaN | both_type |
2 | name1 | 3.0 | no | NaN | both_type |
3 | name1 | NaN | yes | UNKNOWN | both_type |
Which results in the final output:
 | ir_id | pli | pli_missing | err_type | err_flag |
---|---|---|---|---|---|
0 | name1 | 1.0 | no | UNKNOWN | both_type |
1 | name1 | 2.0 | no | NaN | both_type |
2 | name1 | 3.0 | no | NaN | both_type |
3 | name1 | NaN | yes | UNKNOWN | both_type |
4 | name2 | 4.0 | no | NaN | type1 |
5 | name2 | 5.0 | no | NaN | type1 |
6 | name2 | NaN | yes | UNKNOWN | type1 |
7 | name3 | 6.0 | no | NaN | type2 |
8 | name3 | 7.0 | no | NaN | type2 |
9 | name3 | 8.0 | no | NaN | type2 |
10 | name3 | 9.0 | no | UNKNOWN | type2 |
11 | name4 | 10.0 | no | NaN | type3 |
12 | name4 | 11.0 | no | NaN | type3 |
13 | name4 | 12.0 | no | NaN | type3 |
14 | name5 | NaN | yes | UNKNOWN | type1 |
15 | name5 | NaN | yes | UNKNOWN | type1 |
16 | name5 | NaN | yes | UNKNOWN | type1 |
17 | name5 | NaN | yes | UNKNOWN | type1 |
Dataset used:

```python
import numpy as np
import pandas as pd

custom_df = pd.DataFrame.from_dict({
    'ir_id': ['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2',
              'name3', 'name3', 'name3', 'name3', 'name4', 'name4', 'name4',
              'name5', 'name5', 'name5', 'name5'],
    'pli': [1, 2, 3, np.nan, 4, 5, np.nan, 6, 7, 8, 9, 10, 11, 12,
            np.nan, np.nan, np.nan, np.nan],
    'pli_missing': ["no", "no", "no", "yes", "no", "no", "yes", "no", "no",
                    "no", "no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    'err_type': ["UNKNOWN", np.nan, np.nan, "UNKNOWN", np.nan, np.nan,
                 "UNKNOWN", np.nan, np.nan, np.nan, "UNKNOWN", np.nan,
                 np.nan, np.nan, "UNKNOWN", "UNKNOWN", "UNKNOWN", "UNKNOWN"],
    'err_flag': ["both_type", "both_type", "both_type", "both_type", "type1",
                 "type1", "type1", "type2", "type2", "type2", "type2",
                 "type3", "type3", "type3", "type1", "type1", "type1", "type1"],
})
custom_df
```
P.S. An earlier solution couldn't handle cases like `ir_id = name5`.
I think we could apply a sort of categorical logic here. Here's what I mean. Let's say `['type3','type1','type2','both_types']` are the verbal representatives of the codes `[0, 1, 2, 3]`. Why this order? As far as I can see, `type3` is sort of a default value. `type1` and `type2` are equal by nature, but we have to distinguish them somehow, so let's follow their names and say that `type1` is 1 and `type2` is 2. And `both_types` gets index 3 as the sum of the previous two codes. Now we can separately identify whether a record can be of type 1 or type 2, and take their sum as the final output. If a record is neither a type 1 nor a type 2 kind of error, the sum will be 0. If there's only one of them, the sum keeps it as is. If a record can be of both types, the sum will be 3.
Let's see how it looks in code:

```python
err = custom_df['err_type'] == 'UNKNOWN'
pli = custom_df['pli_missing'] == 'yes'
grouper = custom_df['ir_id']

# with transform we get the same dimension as custom_df
type1 = (err & pli).groupby(grouper).transform('any')
type2 = (err & ~pli).groupby(grouper).transform('any')

codes = type1 + 2*type2
categories = ['type3', 'type1', 'type2', 'both_types']
custom_df['err_flag_new'] = pd.Categorical.from_codes(codes, categories)
```
Here's what I've got in the end:
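The original output screenshot isn't reproduced here, so here's a self-contained sketch that reruns the answer's code on the question's `custom_df` (trimmed to the columns the logic needs) and shows one flag per group; for every group, including `name5`, the new flag matches the expected one (I spell the combined flag `both_types` rather than the question's `both_type`):

```python
import numpy as np
import pandas as pd

custom_df = pd.DataFrame({
    'ir_id': ['name1']*4 + ['name2']*3 + ['name3']*4 + ['name4']*3 + ['name5']*4,
    'pli_missing': ['no', 'no', 'no', 'yes', 'no', 'no', 'yes', 'no', 'no',
                    'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes'],
    'err_type': ['UNKNOWN', np.nan, np.nan, 'UNKNOWN', np.nan, np.nan,
                 'UNKNOWN', np.nan, np.nan, np.nan, 'UNKNOWN', np.nan,
                 np.nan, np.nan, 'UNKNOWN', 'UNKNOWN', 'UNKNOWN', 'UNKNOWN'],
})

err = custom_df['err_type'] == 'UNKNOWN'
pli = custom_df['pli_missing'] == 'yes'
grouper = custom_df['ir_id']

type1 = (err & pli).groupby(grouper).transform('any')
type2 = (err & ~pli).groupby(grouper).transform('any')
codes = type1 + 2*type2
custom_df['err_flag_new'] = pd.Categorical.from_codes(
    codes, ['type3', 'type1', 'type2', 'both_types'])

# one row per group: name1 -> both_types, name2 -> type1,
# name3 -> type2, name4 -> type3, name5 -> type1
print(custom_df.drop_duplicates('ir_id')[['ir_id', 'err_flag_new']])
```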
**Update**
We can look at it this way. Suppose there's a system with some number of independent states, and the system can be described as having any combination of them. Mathematically this can be expressed as a binary code: for each independent state we assign a unique place in the code, where the value 1 or 0 is the logical answer to whether the system has the corresponding state. Binary means a sum like this one:

```
state[0]*2^0 + state[1]*2^1 + state[2]*2^2 + state[3]*2^3 + ...
```
In our case we have only 2 independent states, `type1` and `type2`. The two others are their combinations: `type3` means neither of them, and `both_types` speaks for itself. So we have only the first two terms of the sum above, where `state[0]` is the logical value for `type1` and `state[1]` is the logical value for `type2`. That's why I used `codes = type1 + 2*type2`, which is equal to `codes = type1 * 2**0 + type2 * 2**1`.
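A quick truth table makes the mapping concrete; this is just the arithmetic from the paragraph above spelled out for all four combinations:

```python
categories = ['type3', 'type1', 'type2', 'both_types']

# enumerate all (type1, type2) combinations and the resulting code/category
table = []
for t2 in (False, True):
    for t1 in (False, True):
        code = t1 * 2**0 + t2 * 2**1
        table.append((t1, t2, code, categories[code]))
        print(f"type1={t1!s:<5} type2={t2!s:<5} -> code {code} -> {categories[code]}")
```

The codes come out as 0, 1, 2, 3, matching the positions of the names in `categories`.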
As for the order in `['type3','type1','type2','both_types']`: in this list the index of each value equals its binary code. The binary code for `type3` in this model is `0b00`, which is zero; for `type1` it is `0b01`, which is one; for `type2` it is `0b10`, which is two; and for `both_types` it is `0b11`, which equals three. These codes are automatically assigned when creating a `Categorical` sequence with `from_codes`, i.e. pandas uses the codes as indexes to get the corresponding values from the list and places them instead of the codes.
See also `enum.IntFlag` as an abstract realization of this idea, and the flags in the `re` (regular expression) module as an example of how it can be used.
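A minimal sketch of that `enum.IntFlag` analogy (the class and member names are my own illustration, not part of the answer's solution):

```python
import enum

class ErrFlag(enum.IntFlag):
    TYPE3 = 0                   # neither kind of error (the default)
    TYPE1 = 1                   # "UNKNOWN" where pli_missing == "yes"
    TYPE2 = 2                   # "UNKNOWN" where pli_missing == "no"
    BOTH_TYPES = TYPE1 | TYPE2  # combination of both, value 3

# OR-ing the two independent flags yields the combined member,
# exactly like codes = type1 + 2*type2 above
combined = ErrFlag.TYPE1 | ErrFlag.TYPE2
print(int(combined))  # 3, the same code as 'both_types'
```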