I have a pandas DataFrame with a group structure where the value of interest, val, is guaranteed to be sorted within the group. However, there are missing values in val which I need to bound. The data I have looks like this:
group_id id_within_group val
1 1 3.2
1 2 4.8
1 3 5.2
1 4 NaN
1 5 7.5
2 1 1.8
2 2 2.8
2 3 NaN
2 4 5.4
2 5 6.2
I now want to create a lower bound, max_prev, which is the maximum value within the group over the rows before the current row, and an upper bound, min_next, which is the minimum value within the group over the rows after the current row. It is not possible to just look one row back and one row ahead, because there could be clusters of NaN. I don't need to handle the edge cases of the first and last row within a group. The desired output would hence be:
group_id id_within_group val max_prev min_next
1 1 3.2 NaN 4.8
1 2 4.8 3.2 5.2
1 3 5.2 4.8 7.5
1 4 NaN 5.2 7.5
1 5 7.5 5.2 NaN
2 1 1.8 NaN 2.8
2 2 2.8 1.8 5.4
2 3 NaN 2.8 5.4
2 4 5.4 2.8 6.2
2 5 6.2 5.4 NaN
How can I achieve this in a reasonably fast way?
You could use a custom groupby.transform with ffill/bfill + shift:
g = df.groupby('group_id')['val']
df['max_prev'] = g.transform(lambda x: x.ffill().shift())
df['min_next'] = g.transform(lambda x: x[::-1].ffill().shift())
# or, equivalently, back-fill and shift up by one row
df['min_next'] = g.transform(lambda x: x.bfill().shift(-1))
If your values are not sorted, add a cummax/cummin:
g = df.groupby('group_id')['val']
df['max_prev'] = g.transform(lambda x: x.ffill().cummax().shift())
df['min_next'] = g.transform(lambda x: x[::-1].ffill().cummin().shift())
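To see why the cummax/cummin step matters, here is a small sketch on made-up unsorted data (the four values and the single group are assumptions chosen for illustration, not taken from the question). Without cummax, the forward fill only carries the last seen value, not the largest one:

```python
import numpy as np
import pandas as pd

# Hypothetical unsorted data: one group, one missing value.
demo = pd.DataFrame({"group_id": [1, 1, 1, 1],
                     "val": [3.0, 1.0, np.nan, 2.0]})
g = demo.groupby("group_id")["val"]

# Without cummax: just the last non-missing value before each row.
naive_prev = g.transform(lambda x: x.ffill().shift())

# With cummax/cummin: running maximum before / running minimum after each row.
max_prev = g.transform(lambda x: x.ffill().cummax().shift())
min_next = g.transform(lambda x: x[::-1].ffill().cummin().shift())

print(pd.DataFrame({"val": demo["val"], "naive_prev": naive_prev,
                    "max_prev": max_prev, "min_next": min_next}))
```

For the row with the missing value, naive_prev is 1.0 (the last value seen) while max_prev is 3.0 (the largest value seen so far), which is the bound that was asked for.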
Output on the example data (identical for both variants, since val is sorted within each group):
group_id id_within_group val max_prev min_next
0 1 1 3.2 NaN 4.8
1 1 2 4.8 3.2 5.2
2 1 3 5.2 4.8 7.5
3 1 4 NaN 5.2 7.5
4 1 5 7.5 5.2 NaN
5 2 1 1.8 NaN 2.8
6 2 2 2.8 1.8 5.4
7 2 3 NaN 2.8 5.4
8 2 4 5.4 2.8 6.2
9 2 5 6.2 5.4 NaN
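Since the question asks for speed, one possible variation is to avoid the Python-level lambda and chain the built-in groupby methods instead. This is only a sketch for the sorted case; it rebuilds the example data from the question so that it runs on its own, and whether it is actually faster on your data is something you would need to time:

```python
import numpy as np
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "group_id": [1] * 5 + [2] * 5,
    "id_within_group": [1, 2, 3, 4, 5] * 2,
    "val": [3.2, 4.8, 5.2, np.nan, 7.5, 1.8, 2.8, np.nan, 5.4, 6.2],
})

keys = df["group_id"]

# Fill within each group, then shift within each group so that
# values never leak across group boundaries.
df["max_prev"] = df.groupby(keys)["val"].ffill().groupby(keys).shift()
df["min_next"] = df.groupby(keys)["val"].bfill().groupby(keys).shift(-1)

print(df)
```

The second groupby is needed because the result of ffill/bfill is a plain Series again; a bare shift on it would pull the last value of group 1 into the first row of group 2.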