pythonpandasdataframepandas-groupbyaggregate

Max Value in a Data Frame with Groupby using idmax()


I have a dataframe that has 10 columns.

I used this code to filter to the rows I want: basically, the rows where the Revision Date is less than the cutoff date (declared variable) and the Job Title is in the provided list.

aggregate = df.loc[(df['RevisionDate']<= cutoff_date) & (df['JobTitle'].isin(['Production Control Clerk','Customer Service Representative III, Data Entry Operator I','Accounting Clerk II','General Clerk III','Technical Instructor']))]

Then, I need to group them by the column WD (there are multiple of these), and then by Job Title (again, multiple of these). So I did that by:

aggregate1 = aggregate.groupby(['WD','JobTitle'])

This produces a dataframe object that has the required rows, and still all 10 columns.

Then, from this smaller dataframe, I need to pull out only the rows with the highest (max) Revision Number.

aggregate1 = aggregate.max('RevisionNumber')

However, this last step produces a dataframe, but with only 3 of the columns: WD, Job Title and Revision Number. I need ALL 10 of the columns.

Based on other questions I've seen posted here, I have tried to use idmax():

df2 = aggregate.loc[aggregate.groupby(['WD','JobTitle'])['RevisionNumber'].idmax()]

but I get this error:

AttributeError: 'SeriesGroupBy' object has no attribute 'idmax'

What am I doing wrong?


Solution

  • If you sort first, you can take the top row of each group

    aggregate.sort_values(by='RevisionNumber', ascending=False).groupby(['WD','JobTitle']).head(1)