
Python: remove rows with max value in each group


I have a pandas DataFrame df like this:

In [1]: df
Out[1]:
      country     count
0       Japan        78
1       Japan        80
2         USA        45
3      France        34
4      France        90
5          UK        45
6          UK        34
7       China        32
8       China        87
9      Russia        20
10      Russia        67

I want to remove rows with the maximum value in each group. So the result should look like:

      country     count
0       Japan        78
3      France        34
6          UK        34
7       China        32
9      Russia        20

My first attempt:

idx = df.groupby(['country'], sort=False).max()['count'].index
df_new = df.drop(list(idx))

My second attempt:

idx = df.groupby(['country'])['count'].transform(max).index
df_new = df.drop(list(idx))

Neither attempt worked. Any ideas?


Solution

  • groupby / transform('max')

    You can first calculate a series of group maximums, then filter out the rows where count equals that series. Note that this removes every row tied for the maximum in its group, not just one.

    g = df.groupby(['country'])['count'].transform('max')
    df = df[~(df['count'] == g)]
    

    The series g holds, for each row, the maximum of that row's group. Where g equals df['count'] (aligned by index), the row holds the maximum for its group. The ~ operator then negates the condition, so only the non-maximum rows are kept.

    print(df.groupby(['country'])['count'].transform('max'))
    
    0     80
    1     80
    2     45
    3     90
    4     90
    5     45
    6     45
    7     87
    8     87
    9     67
    10    67
    Name: count, dtype: int64
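
To see why the two attempts in the question fail, it can help to look at what each `.index` actually contains. The following is a minimal sketch: the `pd.DataFrame(...)` construction is an assumption (the question only shows the printed frame), and the variable names are illustrative.

```python
import pandas as pd

# Sample frame rebuilt from the printed output in the question (assumed construction).
df = pd.DataFrame({
    "country": ["Japan", "Japan", "USA", "France", "France", "UK",
                "UK", "China", "China", "Russia", "Russia"],
    "count": [78, 80, 45, 34, 90, 45, 34, 32, 87, 20, 67],
})

# Attempt 1: an aggregate is indexed by the group keys (country names),
# so these are not row labels that df.drop can use.
agg_index = df.groupby("country", sort=False)["count"].max().index

# Attempt 2: transform returns one value per row, so its index is
# every row label and dropping it would empty the frame.
transform_index = df.groupby("country")["count"].transform("max").index

# Working version: compare each row's count with its group maximum.
group_max = df.groupby("country")["count"].transform("max")
result = df[df["count"] != group_max]

print(list(agg_index))
print(list(transform_index))
print(result)
```

The comparison `df["count"] != group_max` is equivalent to `~(df["count"] == group_max)`; either form yields the same boolean mask.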
    

  • sort + drop

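    Alternatively, you can sort by count and drop the last row of each group. Unlike the first approach, this removes exactly one row per group, even when the maximum is tied: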

    res = df.sort_values('count')
    res = res.drop(res.groupby('country').tail(1).index)
    
    print(res)
    
      country  count
    9  Russia     20
    7   China     32
    3  France     34
    6      UK     34
    0   Japan     78
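
A further option, not part of the answer above, is `idxmax`, which returns the row label of each group's maximum and is probably what the first attempt in the question was reaching for. This is a hedged sketch: the frame construction is again an assumption based on the printed data, and `max_labels` is an illustrative name.

```python
import pandas as pd

# Sample frame rebuilt from the printed output in the question (assumed construction).
df = pd.DataFrame({
    "country": ["Japan", "Japan", "USA", "France", "France", "UK",
                "UK", "China", "China", "Russia", "Russia"],
    "count": [78, 80, 45, 34, 90, 45, 34, 32, 87, 20, 67],
})

# For each country, the row label where count is largest
# (the first such label if the maximum is tied).
max_labels = df.groupby("country")["count"].idxmax()

# Drop those rows by label; the remaining rows keep their original order.
result = df.drop(index=max_labels.to_list())
print(result)
```

This keeps the original row order without a sort. Because `idxmax` reports only the first occurrence of a tied maximum, it removes one row per group.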