python, pandas, dataframe, group-by, pivot-table

Find the most frequent value and its count for each group


I'm trying to find, for each year in a series of years, the name of the person who submitted the most applications in that year.

Each application is its own row in the dataframe. It comes with the year it was submitted, and the applicant's name.

As an example, given a dataset as follows:

app_year_start        name
2012                  John Smith
2012                  John Smith
2012                  John Smith
2012                  Jane Doe
2013                  Jane Doe
2012                  John Snow
2015                  John Snow
2014                  John Smith
2015                  John Snow
2012                  John Snow
2012                  John Smith
2012                  John Smith
2012                  John Smith
2012                  John Smith
2012                  Jane Doe
2013                  Jane Doe
2012                  John Snow
2015                  John Snow
2014                  John Smith
2015                  John Snow
2012                  John Snow
2012                  John Smith
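
For reproducibility, the sample above can be built directly as a DataFrame (I'm using the name df3, the same variable as in my attempt below):

import pandas as pd

rows = [
    (2012, 'John Smith'), (2012, 'John Smith'), (2012, 'John Smith'),
    (2012, 'Jane Doe'),   (2013, 'Jane Doe'),   (2012, 'John Snow'),
    (2015, 'John Snow'),  (2014, 'John Smith'), (2015, 'John Snow'),
    (2012, 'John Snow'),  (2012, 'John Smith'), (2012, 'John Smith'),
    (2012, 'John Smith'), (2012, 'John Smith'), (2012, 'Jane Doe'),
    (2013, 'Jane Doe'),   (2012, 'John Snow'),  (2015, 'John Snow'),
    (2014, 'John Smith'), (2015, 'John Snow'),  (2012, 'John Snow'),
    (2012, 'John Smith'),
]
df3 = pd.DataFrame(rows, columns=['app_year_start', 'name'])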

I want to get the following output:

app_year_start        name  total_apps
          2012  John Smith           8
          2013    Jane Doe           2
          2014  John Smith           2
          2015   John Snow           4

I tried using groupby to organize the data by year and name, then a variety of methods such as value_counts(), count(), max(), etc. This is the closest I've gotten:

df3.groupby(['app_year_start'])['name'].value_counts().sort_values(ascending=False)

but it doesn't quite produce the expected output: it gives a count for every (year, name) pair rather than only the top name per year.

I've consulted the following SO posts: 1, 2, 3, but none of them work in my case.


Solution

  • 1. Cross-tabulate and find max values

    The straightforward groupby-and-mode approach doesn't preserve the counts, so one way to keep them is to cross-tabulate year by name: this gives each applicant's number of applications per year, from which we keep the name that appears most often in each year.

    counts = (
        pd.crosstab(df['app_year_start'], df['name'])   # years x names -> number of applications
        .agg(['idxmax', 'max'], axis=1)                 # per year: the top name and its count
        .set_axis(['name', 'total_apps'], axis=1)       # rename the two aggregated columns
        .reset_index()                                  # turn app_year_start back into a column
    )
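
    To see what the intermediate table looks like before it is reduced, the cross-tab can be printed on its own. This is just a quick sketch: it assumes import pandas as pd and that the sample data from the question is loaded in a DataFrame named df.

    xtab = pd.crosstab(df['app_year_start'], df['name'])
    print(xtab)
    # Roughly:
    # name            Jane Doe  John Smith  John Snow
    # app_year_start
    # 2012                   2           8          4
    # 2013                   2           0          0
    # 2014                   0           2          0
    # 2015                   0           0          4
    # idxmax then picks the name with the largest count in each row,
    # and max keeps that largest count itself.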
    

  • 2. Row-wise count values of the two columns and remove the duplicate years

    Another solution is to count the (year, name) pairs row-wise with value_counts (which sorts in descending order by default) and then drop duplicate years. This works because the rows are sorted by count, so for each year the first row kept is the one with the highest count; every later row for that year has a count that is no greater.

    counts = (
        df.value_counts(sort=True, ascending=False)     # count each (year, name) pair, largest first
        .reset_index(name="total_apps")                 # back to a DataFrame with a count column
        .drop_duplicates(subset="app_year_start")       # keep only the first (top) pair per year
    )
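
    Two small follow-ups to the snippet above, both optional: drop_duplicates keeps the first row per year, so ties are broken by whichever pair value_counts happens to list first, and the result is ordered by count rather than by year. Sorting by year afterwards reproduces the layout of the expected output:

    # Optional: order the rows by year to match the expected output
    counts = counts.sort_values('app_year_start').reset_index(drop=True)
    print(counts)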
    

    Result (this matches the expected output above; the second approach yields the same rows, ordered by count rather than by year):

    app_year_start        name  total_apps
              2012  John Smith           8
              2013    Jane Doe           2
              2014  John Smith           2
              2015   John Snow           4