python pandas dataframe group-by

Group by index and keep the row with the max value in a single column


Scenario: I have a dataframe with duplicated index values, and I want to group by the index while keeping the max value. I found a solution in "Drop duplicates by index, keeping max for each column across duplicates"; however, it takes the max of each column independently, so the result mixes values from different rows.

Question: Instead of mixing values from different rows, I want to keep a single row per index value: the row where column "C" is highest among the rows sharing that index. All values of that row should be kept as they are, not combined with higher values of other columns from other rows. How should the groupby be performed?

What I tried: From the question linked, I got

df.groupby(df.index).max()

and tried to modify it to:

df.groupby(df.index)['C'].max()

but this returns only column "C" and drops the other columns of the dataframe.
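To make the difference concrete, here is a minimal sketch on a small made-up frame. The data and the column "B" are hypothetical, not taken from the original post. It shows that the first attempt combines values from different rows, while the second returns only column "C":

```python
import pandas as pd

# Hypothetical sample data: index labels "a" and "b" each appear twice.
df = pd.DataFrame(
    {"B": [10, 1, 5, 7], "C": [1, 9, 4, 2]},
    index=["a", "a", "b", "b"],
)

# Column-wise max: for "a" this pairs B=10 (first row) with C=9 (second row),
# a combination that never existed as a single row in the data.
mixed = df.groupby(df.index).max()

# Selecting "C" first gives a Series, so column "B" is gone.
only_c = df.groupby(df.index)["C"].max()

print(mixed)
print(only_c)
```

The desired result for "a" would be the whole second row (B=1, C=9), which neither attempt produces.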


Solution

  • You didn't provide a sample of your data, so I'll go with a general approach.

    That said, you can sort the dataframe by C in descending order, then group by the index and take the first row of each group with head:

    # this assumes that index has only one level
    df.sort_values('C', ascending=False).groupby(level=0).head(1)
    

    Or sort in ascending order and take the last row of each group with tail:

    df.sort_values('C').groupby(level=0).tail(1)
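
As a minimal sketch of both variants on made-up data (the frame and the column "B" are hypothetical, since the question has no sample), the following keeps one whole row per index label, chosen by the highest "C":

```python
import pandas as pd

# Hypothetical sample data: index labels "a" and "b" each appear twice.
df = pd.DataFrame(
    {"B": [10, 1, 5, 7], "C": [1, 9, 4, 2]},
    index=["a", "a", "b", "b"],
)

# Variant 1: largest C first, then keep the first row of each index group.
by_head = df.sort_values("C", ascending=False).groupby(level=0).head(1)

# Variant 2: smallest C first, then keep the last row of each index group.
by_tail = df.sort_values("C").groupby(level=0).tail(1)

# head/tail keep the sorted row order, so sort by index for a stable layout.
by_head = by_head.sort_index()
by_tail = by_tail.sort_index()

print(by_head)
```

Both variants select the rows (B=1, C=9) for "a" and (B=5, C=4) for "b", so the values of "B" come from the same row as the maximum "C". If several rows in a group tie on "C", only one of them is kept, and which one depends on the sort order.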
    

    Also take a look at this related question (not by the index, but a column):