python pandas

Get maximum previous nonmissing value within group in pandas dataframe


I have a pandas dataframe with a group structure where the value of interest, val, is guaranteed to be sorted within the group. However, there are missing values in val which I need to bound. The data I have looks like this:

group_id    id_within_group     val
1           1                   3.2  
1           2                   4.8
1           3                   5.2
1           4                   NaN
1           5                   7.5
2           1                   1.8
2           2                   2.8
2           3                   NaN
2           4                   5.4
2           5                   6.2
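
To make the example reproducible, the sample above can be built roughly as follows. This is a minimal sketch: the name df is simply the one the rest of this page assumes, and the values are copied from the table.

```python
import numpy as np
import pandas as pd

# Sample data from the table above; NaN marks the missing values in val
df = pd.DataFrame({
    "group_id": [1] * 5 + [2] * 5,
    "id_within_group": [1, 2, 3, 4, 5] * 2,
    "val": [3.2, 4.8, 5.2, np.nan, 7.5, 1.8, 2.8, np.nan, 5.4, 6.2],
})

print(df)
```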

I now want to create two bounds for each row: max_prev, the maximum value within the group over the rows before the current row, and min_next, the minimum value within the group over the rows after the current row. Looking just one row back or ahead is not enough, because there can be clusters of NaN. I don't need to handle the edge cases of the first and last row within a group. The desired output is therefore:

group_id    id_within_group     val     max_prev    min_next
1           1                   3.2     NaN         4.8
1           2                   4.8     3.2         5.2
1           3                   5.2     4.8         7.5
1           4                   NaN     5.2         7.5
1           5                   7.5     5.2         NaN
2           1                   1.8     NaN         2.8
2           2                   2.8     1.8         5.4
2           3                   NaN     2.8         5.4
2           4                   5.4     2.8         6.2
2           5                   6.2     5.4         NaN

How can I achieve this in a reasonably fast way?


Solution

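  • Because val is sorted within each group, the maximum of the previous rows is just the last non-missing value before the current row, and the minimum of the following rows is the next non-missing value after it. You could therefore use a custom groupby.transform with ffill/bfill+shift: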

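    g = df.groupby('group_id')['val']
    
    # last non-missing value before each row: forward-fill, then shift down by one
    df['max_prev'] = g.transform(lambda x: x.ffill().shift())
    # next non-missing value after each row: reverse, forward-fill, shift
    df['min_next'] = g.transform(lambda x: x[::-1].ffill().shift())
    
    # or, equivalently, without reversing: back-fill, then shift up by one
    df['min_next'] = g.transform(lambda x: x.bfill().shift(-1))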
    

    If your values are not sorted, add a cummax/cummin:

    g = df.groupby('group_id')['val']
    
    df['max_prev'] = g.transform(lambda x: x.ffill().cummax().shift())
    df['min_next'] = g.transform(lambda x: x[::-1].ffill().cummin().shift())
    

    Output:

       group_id  id_within_group  val  max_prev  min_next
    0         1                1  3.2       NaN       4.8
    1         1                2  4.8       3.2       5.2
    2         1                3  5.2       4.8       7.5
    3         1                4  NaN       5.2       7.5
    4         1                5  7.5       5.2       NaN
    5         2                1  1.8       NaN       2.8
    6         2                2  2.8       1.8       5.4
    7         2                3  NaN       2.8       5.4
    8         2                4  5.4       2.8       6.2
    9         2                5  6.2       5.4       NaN
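
As a possible variant, the same fill-then-shift idea can be written with the built-in groupby methods instead of a Python lambda. This is only a sketch under the question's assumption that val is sorted within each group. The names key, prev_filled and next_filled are illustrative, and no benchmark was run, so any speed gain from avoiding the per-group lambda is an expectation rather than a measured result.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "group_id": [1] * 5 + [2] * 5,
    "id_within_group": [1, 2, 3, 4, 5] * 2,
    "val": [3.2, 4.8, 5.2, np.nan, 7.5, 1.8, 2.8, np.nan, 5.4, 6.2],
})

key = df["group_id"]  # grouping key, reused for every groupby pass

# Fill the gaps within each group first ...
prev_filled = df.groupby(key)["val"].ffill()
next_filled = df.groupby(key)["val"].bfill()

# ... then shift within each group, so a row never sees its own value
# or a value that belongs to a neighbouring group.
df["max_prev"] = prev_filled.groupby(key).shift(1)
df["min_next"] = next_filled.groupby(key).shift(-1)

print(df)
```

On the sample data this should reproduce the max_prev and min_next columns shown above. The shift has to be grouped as well: a plain shift on the filled column would leak the last value of group 1 into the first row of group 2.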