I'm trying to find the name of the person who submitted the most applications in any given year over a series of years.
Each application is its own row in the dataframe. It comes with the year
it was submitted, and the applicant's name.
As an example, given a dataset as follows:
app_year_start name
2012 John Smith
2012 John Smith
2012 John Smith
2012 Jane Doe
2013 Jane Doe
2012 John Snow
2015 John Snow
2014 John Smith
2015 John Snow
2012 John Snow
2012 John Smith
2012 John Smith
2012 John Smith
2012 John Smith
2012 Jane Doe
2013 Jane Doe
2012 John Snow
2015 John Snow
2014 John Smith
2015 John Snow
2012 John Snow
2012 John Smith
I want to get the following output:
app_year_start name total_apps
2012 John Smith 8
2013 Jane Doe 2
2014 John Smith 2
2015 John Snow 4
I tried using groupby
to organize the data by year and name, then a variety of methods such as value_counts()
, count()
, max()
, etc... This is the closest I've gotten:
df3.groupby(['app_year_start'])['name'].value_counts().sort_values(ascending=False)
but doesn't quite produce the expected output.
I've consulted the following SO posts: 1, 2, 3 but none of them work in my case.
The straightforward GroupBy.mode
doesn't preserve the counts, so one way to keep the counts is to cross tabulate year by names to get each applicant's number of applications and keep the names that appear the most in each year.
counts = (
pd.crosstab(df['app_year_start'], df['name'])
.agg(['idxmax', 'max'], axis=1)
.set_axis(['name','total_apps'], axis=1)
.reset_index()
)
Another solution is to row-wise count the values of year-name pairs (value_counts
sorts in descending order by default) and remove duplicate years. This works because the rows are sorted by the counts for each year so for any year, any row after the first will have a count that is not greater.
counts = (
df.value_counts(sort=True, ascending=False)
.reset_index(name="total_apps")
.drop_duplicates(subset="app_year_start")
)