Tags: python, pandas

How do I create a new dataframe that includes columns that are aggregates of columns in the original?


I'm new to Python and have a wide data file that I'd like to aggregate by different variables to create a new file for use with a dashboard. I can use groupby() to get the aggregation I want for one column at a time and then append the results to a CSV file, but I'm wondering if there's a more efficient way to do this with a loop of some sort.
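Roughly, that one-column-at-a-time approach looks something like this (using the sample df defined below; counts.csv is just a placeholder file name):

size_counts = df.groupby(['Cohort', 'Program Size'], as_index=False).agg(
    Frequency=('ID', 'count')
)
size_counts.to_csv('counts.csv', mode='a', index=False)  # append this block of counts to the CSV
# ...then repeat the same few lines for 'Rating' and any other column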

Here's a sample of what my original data file looks like:

import pandas as pd

data = {'ID': [105, 106, 107, 108, 109, 110, 111, 112],
        'Name': ['Bill', 'Jane', 'Mary', 'Rich', 'Tomas', 'Kiki', 'Martin', 'Larry'],
        'Cohort': ['Cohort A', 'Cohort A', 'Cohort A', 'Cohort A', 'Cohort B', 'Cohort B', 'Cohort B', 'Cohort B'],
        'Program Size': ['small', 'large', 'medium', 'medium', 'large', 'small', 'large', 'medium'],
        'Rating': ['excellent', 'good', 'needs improvement', 'needs improvement', 'good', 'excellent', 'good', 'excellent']}

df = pd.DataFrame(data)

Looks like this:

    ID    Name    Cohort Program Size             Rating
0  105    Bill  Cohort A        small          excellent
1  106    Jane  Cohort A        large               good
2  107    Mary  Cohort A       medium  needs improvement
3  108    Rich  Cohort A       medium  needs improvement
4  109   Tomas  Cohort B        large               good
5  110    Kiki  Cohort B        small          excellent
6  111  Martin  Cohort B        large               good
7  112   Larry  Cohort B       medium          excellent

I always want to group by the Cohort column plus one of the other columns and get counts for each combination. In my new data file, I'll have a Cohort column, a "Variable" column (the name of the other column used in the groupby), a "Description" column (the values that appear in that column), and then the Frequency. Here's what my desired result would look like:

desired = {'Cohort': ['Cohort A', 'Cohort A', 'Cohort A', 'Cohort B', 'Cohort B', 'Cohort B', 'Cohort A', 'Cohort A', 'Cohort A', 'Cohort B', 'Cohort B'],
        'Variable': ['size', 'size', 'size', 'size', 'size', 'size', 'rating', 'rating', 'rating', 'rating', 'rating'],
        'Description': ['small', 'medium', 'large', 'small', 'medium', 'large', 'excellent', 'good', 'needs improvement', 'excellent', 'good'],
        'Frequency': [1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2]}

desired_df = pd.DataFrame(desired)

      Cohort Variable        Description  Frequency
0   Cohort A     size              small          1
1   Cohort A     size             medium          2
2   Cohort A     size              large          1
3   Cohort B     size              small          1
4   Cohort B     size             medium          1
5   Cohort B     size              large          2
6   Cohort A   rating          excellent          1
7   Cohort A   rating               good          1
8   Cohort A   rating  needs improvement          2
9   Cohort B   rating          excellent          2
10  Cohort B   rating               good          2

I've been able to write a loop that groups by Cohort and each of my other columns and gets frequencies, but I'm not sure how to put it all together (append the dataframes?) to get my desired result. I'd appreciate any guidance on next steps! Here is what I wrote to loop through my columns and get frequencies:

cols = ['Program Size', 'Rating']

for col in cols:
    grouped_df = df.groupby(['Cohort', col], as_index=False).agg(
        frequency=('ID', 'count')
    )
    print(f"Grouped by Cohort and {col}:\n{grouped_df}\n")

Solution

  • You can build one grouped frame per column in a list comprehension and stack them with pd.concat():

    col_mapping = {'Program Size': 'size', 'Rating': 'rating'}

    # Build one frequency table per column, then stack them into a single frame
    final_df = pd.concat([
        df.groupby(['Cohort', col], as_index=False)
        .agg(Frequency=('ID', 'count'))          # count rows per (Cohort, value) pair
        .rename(columns={col: 'Description'})    # the grouped values become "Description"
        .assign(Variable=col_mapping[col])       # label which original column they came from
        [['Cohort', 'Variable', 'Description', 'Frequency']]
        for col in col_mapping
    ], ignore_index=True).sort_values(['Cohort', 'Variable'])  # order rows by Cohort, then Variable
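    Since the goal is a new file for a dashboard, you can then write the combined frame out once rather than appending to a CSV column by column (the file name here is just a placeholder):

    final_df.to_csv('dashboard_data.csv', index=False)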