python pandas dataframe

Finding the max or min of values in local sets of rows of a Pandas Dataframe


For example, I have a dataframe like this:

   Value   Placement
0    12      high
1    15      high
2    18      high
3    14      high
4    4       low
5    5       low
6    9       high
7    11      high
8    2       low
9    1       low
10   3       low
11   2       low

I want to create a second dataframe that contains the highest value in the "Value" column for each set of consecutive rows with "high" placement, and the lowest value in the "Value" column for each set of consecutive rows with "low" placement. So something like:

   Value   Placement
0    18      high
1    4       low
2    11      high
3    1       low

I also don't want to change the order of the rows, as the order of the "highs" and "lows" is critical to the functionality of the project.

I could just iterate through the original dataframe and keep track of the numbers in "Value" until a change in "Placement" is detected, but I've heard dataframe iteration is very slow and should be avoided if possible. Is there some way to do this without iteration? TIA


Solution

  • Group consecutive rows that share the same Placement, flip the sign of Value for the "low" rows so that each minimum becomes a maximum, take the idxmax per group, then select those rows with loc:

    # group consecutive rows
    group = df['Placement'].ne(df['Placement'].shift()).cumsum()
    
    # invert the low values, get idxmax per group
    keep = (df['Value']
            .mul(df['Placement'].map({'low': -1, 'high': 1}))
            .groupby(group, sort=False).idxmax()
            )
    
    out = df.loc[keep]
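
To make the grouping step easier to follow, here is a minimal sketch that rebuilds the sample data from the question and prints the run labels. The `changed` variable is introduced only for illustration; the snippet above computes the same thing in one expression.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Value": [12, 15, 18, 14, 4, 5, 9, 11, 2, 1, 3, 2],
    "Placement": ["high"] * 4 + ["low"] * 2 + ["high"] * 2 + ["low"] * 4,
})

# True wherever Placement differs from the previous row. The first row is
# compared against NaN, so it always starts a new run.
changed = df["Placement"].ne(df["Placement"].shift())

# The running count of changes gives every consecutive run its own label.
group = changed.cumsum()

print(group.tolist())
# [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4]
```

Because the labels increase with each change, the two separate "high" runs get different labels (1 and 3) and are never merged, which a plain groupby on "Placement" would do.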
    

    If efficiency is a concern (groupby can add per-group overhead), another approach that is potentially faster when there are many groups is to stable-sort the rows by group and then by value (using numpy.lexsort) and keep the last row of each group, which is the highest after the sign swap for "low", using drop_duplicates:

    import numpy as np
    
    group = df['Placement'].ne(df['Placement'].shift()).cumsum()
    s = df['Value'].mul(df['Placement'].map({'low': -1, 'high': 1}))
    
    keep = (group
            .iloc[np.lexsort([s, group])]
            .drop_duplicates(keep='last')
            .index
            )
    
    out = df.loc[keep]
    

    Note that despite the sorting step, this strategy will maintain the relative original order of the rows.
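
One difference between the two approaches that may matter is how ties are handled. The sketch below uses a small hypothetical frame (not the data from the question) in which a run contains its extreme value twice. It assumes the documented behaviour that idxmax returns the first occurrence and that numpy.lexsort is a stable sort.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "high" run contains the value 5 twice.
df = pd.DataFrame({"Value": [5, 5, 1], "Placement": ["high", "high", "low"]})

group = df["Placement"].ne(df["Placement"].shift()).cumsum()
signed = df["Value"].mul(df["Placement"].map({"low": -1, "high": 1}))

# groupby + idxmax picks the first row of a tie.
first_on_tie = signed.groupby(group, sort=False).idxmax().tolist()

# lexsort keeps tied rows in their original order, so keep='last'
# picks the last row of a tie.
order = np.lexsort([signed, group])
last_on_tie = group.iloc[order].drop_duplicates(keep="last").index.tolist()

print(first_on_tie)  # [0, 2]
print(last_on_tie)   # [1, 2]
```

The selected values are the same either way; only the row label chosen within a tie differs, which is relevant if other columns of the selected row are used later.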

    Output:

       Value Placement
    2     18      high
    4      4       low
    7     11      high
    9      1       low
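
The output above can be reproduced with a self-contained script. This is a sketch that rebuilds the sample data from the question, runs both approaches, and checks that they agree; the names `signed`, `out_groupby` and `out_lexsort` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Value": [12, 15, 18, 14, 4, 5, 9, 11, 2, 1, 3, 2],
    "Placement": ["high"] * 4 + ["low"] * 2 + ["high"] * 2 + ["low"] * 4,
})

# Label consecutive runs and flip the sign of the "low" values.
group = df["Placement"].ne(df["Placement"].shift()).cumsum()
signed = df["Value"].mul(df["Placement"].map({"low": -1, "high": 1}))

# Approach 1: idxmax per run.
keep_groupby = signed.groupby(group, sort=False).idxmax()
out_groupby = df.loc[keep_groupby]

# Approach 2: sort by run, then by signed value, and keep the last row per run.
order = np.lexsort([signed, group])
keep_lexsort = group.iloc[order].drop_duplicates(keep="last").index
out_lexsort = df.loc[keep_lexsort]

print(out_groupby)
print(out_groupby.equals(out_lexsort))  # True
```

Both results keep the original row labels (2, 4, 7, 9). If a fresh 0..3 index is wanted, as in the question's expected output, `reset_index(drop=True)` can be applied afterwards.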
    

    Comparison of timing: [figure: timing comparison of the two approaches; image not available]