pythonpandasgroup-bymultiple-conditions

Group by id and look at previous row value to determine next row value based on multiple conditions


I hope someone can help me out with this! I haven't found anything online that comes close enough.

Sample data:

import pandas as pd

sample_data = {
'id': [1,1,1,1,1,2,2,2,2,2],
'date_rank': [1,2,3,4,5,1,2,3,4,5],
'candidates': [1,0,0,3,0,0,0,0,2,0],
'desired_output':['New_filled','New_open','Double_open','Double_filled','New_open','New_open','Double_open','Double_open','Double_filled','New_open']
}

df = pd.DataFrame(sample_data, columns=['id', 'date_rank','candidates', 'desired_output'])
df

In the sample_data output below the "desired_output" column shows the desired result:

    id  date_rank   candidates  desired_output
0   1   1           1           New_filled
1   1   2           0           New_open
2   1   3           0           Double_open
3   1   4           3           Double_filled
4   1   5           0           New_open
5   2   1           0           New_open
6   2   2           0           Double_open
7   2   3           0           Double_open
8   2   4           2           Double_filled
9   2   5           0           New_open

The date_rank column isn't that important except for the first entry.

The first entry will always be "new" but could be either "filled" or "open". It's open when 0 candidates were hired and closed if one or more candidates were hired. This applies to the rest of the entries as well.

If an entry is filled, the next row will always be new.
If an entry is open because there were no candidates, the next entry will always be double.

If you look at the fourth row you'll see that an entry can be double and filled as long as the previous row was open.

There are four possible values/conditions in the desired_ouptut column. I can make this work with less conditions but not with four, especially when the value depends on the previous row value.


Solution

  • You can use two simple conditionals with numpy.where on the current row, and the previous one (with groupby.shift):

    m = df['candidates'].eq(0)
    
    df['output'] = pd.Series(np.where(m.groupby(df['id']).shift(fill_value=False),
                                      'Double_', 'New_'), index=df.index
                             ).add(np.where(m, 'open', 'filled'))
    

    In :

    m = df['candidates'].eq(0)
    
    a1 = np.where(m.groupby(df['id']).shift(fill_value=False), 'Double_', 'New_')
    a2 = np.where(m, 'open', 'filled')
    df['output'] = np.core.defchararray.add(a1, a2)
    

    Output:

       id  date_rank  candidates desired_output         output
    0   1          1           1     New_filled     New_filled
    1   1          2           0       New_open       New_open
    2   1          3           0    Double_open    Double_open
    3   1          4           3  Double_filled  Double_filled
    4   1          5           0       New_open       New_open
    5   2          1           0       New_open       New_open
    6   2          2           0    Double_open    Double_open
    7   2          3           0    Double_open    Double_open
    8   2          4           2  Double_filled  Double_filled
    9   2          5           0       New_open       New_open