I have a dataframe that has 10 columns.
I used this code to filter to the rows I want: basically, the rows where the Revision Date is less than the cutoff date (declared variable) and the Job Title is in the provided list.
aggregate = df.loc[(df['RevisionDate']<= cutoff_date) & (df['JobTitle'].isin(['Production Control Clerk','Customer Service Representative III, Data Entry Operator I','Accounting Clerk II','General Clerk III','Technical Instructor']))]
Then, I need to group them by the column WD (there are multiple of these), and then by Job Title (again, multiple of these). So I did that by:
aggregate1 = aggregate.groupby(['WD','JobTitle'])
This produces a dataframe object that has the required rows, and still all 10 columns.
Then, from this smaller dataframe, I need to pull out only the rows with the highest (max) Revision Number.
aggregate1 = aggregate.max('RevisionNumber')
However, this last step produces a dataframe, but with only 3 of the columns: WD, Job Title and Revision Number. I need ALL 10 of the columns.
Based on other questions I've seen posted here, I have tried to use idmax():
df2 = aggregate.loc[aggregate.groupby(['WD','JobTitle'])['RevisionNumber'].idmax()]
but I get this error:
AttributeError: 'SeriesGroupBy' object has no attribute 'idmax'
What am I doing wrong?
If you sort first, you can take the top row of each group
aggregate.sort_values(by='RevisionNumber', ascending=False).groupby(['WD','JobTitle']).head(1)