pythonpandasdataframedata-science

How to get all the rows corresponding to maximum values of a column using groupby


For the given dataframe df as:

   Election Yr.  Party   States Votes
0     2000           A       a    50  
1     2000           A       b    30
2     2000           B       a    40
3     2000           B       b    50  
4     2000           C       a    30
5     2000           C       b    40
6     2005           A       a    50  
7     2005           A       b    30
8     2005           B       a    40
9     2005           B       b    50  
10    2005           C       a    30
11    2005           C       b    40

I want to get the Party that got the maximum Votes for a corresponding year. I have used the following code to groupby "Election Year" and "Party" and then .sum() to get the total votes for each party in every year.

df = df.groupby(['Election Yr.', 'Party']).sum()

Now how to get the party with maximum Votes each year? I am unable to get this.

Any support is highly appreciated.


Solution

  • 1. Using inner joins

    You can start off with df before doing your first groupby. Then you get the maximum number of votes each year and merge on the year-votes combination to get the party that got the most votes per year.

    # Original data
    df = pd.DataFrame({'Election Yr.':[2000,2000,2000,2000,2000,2000,2005,2005,2005,2005,2005,2005],
                       'Party':['A','A','B','B','C','C','A','A','B','B','C','C',],
                       'Votes':[50,30,40,50,30,40,50,30,40,50,30,40]})
    
    # Get number of votes per year-party
    df = df.groupby(['Election Yr.','Party'])['Votes'].sum().reset_index()
    
    # Get max number of votes per year
    max_ = df.groupby('Election Yr.')['Votes'].max().reset_index()
    
    # Merge on key
    max_ = max_.merge(df, on=['Election Yr.','Votes'])
    
    # Results
    print(max_)
    
    >    Election Yr.  Votes Party
    > 0          2000     90     B
    > 1          2005     90     B
    

    2. Sorting and keeping first observation

    Alternatively, you can sort by votes per year:

    df = df.groupby(['Election Yr.','Party'])['Votes'].sum().reset_index()
    df = df.sort_values(['Election Yr.','Votes'], ascending=False)
    print(df.groupby('Election Yr.').first().reset_index())
    
    print(df)
    
    >    Election Yr. Party  Votes
    > 0          2000     B     90
    > 1          2005     B     90