So, for example, I have a dataframe like this:

```
    Value Placement
0      12      high
1      15      high
2      18      high
3      14      high
4       4       low
5       5       low
6       9      high
7      11      high
8       2       low
9       1       low
10      3       low
11      2       low
```
I want to create a second dataframe that contains the highest value in the "Value" column for each set of consecutive rows with "high" placement, and the lowest value in the "Value" column for each set of consecutive rows with "low" placement. So something like:

```
   Value Placement
0     18      high
1      4       low
2     11      high
3      1       low
```
I also don't want to change the order of the rows, as the order of the "highs" and "lows" is critical to the functionality of the project.
I could just iterate through the original dataframe and keep track of the numbers in "Value" until a change in "Placement" is detected, but I've heard dataframe iteration is very slow and should be avoided if possible. Is there some way to do this without iteration? TIA
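For concreteness, the iterative approach described above might look like the sketch below (shown only as the baseline to avoid; the sample dataframe is rebuilt inline):

```python
import pandas as pd

df = pd.DataFrame({
    'Value': [12, 15, 18, 14, 4, 5, 9, 11, 2, 1, 3, 2],
    'Placement': ['high'] * 4 + ['low'] * 2 + ['high'] * 2 + ['low'] * 4,
})

# Naive loop: track the extreme of the current run, flush it on each change
rows = []
cur_place, cur_val = None, None
for val, place in zip(df['Value'], df['Placement']):
    if place != cur_place:
        if cur_place is not None:
            rows.append({'Value': cur_val, 'Placement': cur_place})
        cur_place, cur_val = place, val
    else:
        cur_val = max(cur_val, val) if place == 'high' else min(cur_val, val)
rows.append({'Value': cur_val, 'Placement': cur_place})  # flush the final run

result = pd.DataFrame(rows)
print(result)
```

This keeps the runs in their original order, but it does a Python-level pass over every row.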
Group by consecutive values, swap the sign of "Value" for rows whose "Placement" is "low", get the `idxmax` per group, then keep the selected rows with `loc`:
```python
# group consecutive rows
group = df['Placement'].ne(df['Placement'].shift()).cumsum()

# invert the low values, get idxmax per group
keep = (df['Value']
        .mul(df['Placement'].map({'low': -1, 'high': 1}))
        .groupby(group, sort=False).idxmax()
        )

out = df.loc[keep]
```
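As a quick self-contained check, running this on the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'Value': [12, 15, 18, 14, 4, 5, 9, 11, 2, 1, 3, 2],
    'Placement': ['high'] * 4 + ['low'] * 2 + ['high'] * 2 + ['low'] * 4,
})

# New group id every time Placement differs from the previous row
group = df['Placement'].ne(df['Placement'].shift()).cumsum()

# Negating the "low" values lets a single idxmax find the minimum there
keep = (df['Value']
        .mul(df['Placement'].map({'low': -1, 'high': 1}))
        .groupby(group, sort=False).idxmax()
        )

out = df.loc[keep]
print(out)
```

Note that `out` keeps the original index labels (2, 4, 7, 9 here); reset the index if you want the 0..n-1 numbering shown in the question.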
If efficiency is a concern, and since `groupby` is based on a Python loop, another approach (potentially faster for many groups) is to stable-sort the rows by value within each group (using `numpy.lexsort`) and keep the highest per group (after the sign swap for "low") with `drop_duplicates`:
```python
import numpy as np

group = df['Placement'].ne(df['Placement'].shift()).cumsum()
s = df['Value'].mul(df['Placement'].map({'low': -1, 'high': 1}))

keep = (group
        .iloc[np.lexsort([s, group])]
        .drop_duplicates(keep='last')
        .index
        )

out = df.loc[keep]
```
Note that despite the sorting step, this strategy maintains the original relative order of the rows: exactly one row is kept per group, and the groups themselves appear in their original order.
Output:

```
   Value Placement
2     18      high
4      4       low
7     11      high
9      1       low
```
Comparison of timing:
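A minimal harness to reproduce such a comparison with `timeit` is sketched below; the synthetic data (size, value range, group lengths) is an assumption, and actual numbers will depend heavily on how many groups the data contains:

```python
import timeit
import numpy as np
import pandas as pd

# Synthetic test data: random values and random high/low placements
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    'Value': rng.integers(0, 100, n),
    'Placement': rng.choice(['high', 'low'], n),
})

def with_groupby(df):
    group = df['Placement'].ne(df['Placement'].shift()).cumsum()
    keep = (df['Value']
            .mul(df['Placement'].map({'low': -1, 'high': 1}))
            .groupby(group, sort=False).idxmax())
    return df.loc[keep]

def with_lexsort(df):
    group = df['Placement'].ne(df['Placement'].shift()).cumsum()
    s = df['Value'].mul(df['Placement'].map({'low': -1, 'high': 1}))
    keep = (group
            .iloc[np.lexsort([s, group])]
            .drop_duplicates(keep='last')
            .index)
    return df.loc[keep]

print('groupby:', timeit.timeit(lambda: with_groupby(df), number=10))
print('lexsort:', timeit.timeit(lambda: with_lexsort(df), number=10))
```

One caveat when comparing the two: on ties within a group, `idxmax` keeps the first occurrence of the extreme while the lexsort variant keeps the last, so the selected index labels can differ even though the kept values are the same.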