I'm new to learning Python and have a wide data file that I would like to aggregate by different variables and create a new file for use with a dashboard. I am able to use groupby() to get the aggregation I want for one column at a time and then append the results to a csv file. However, I'm wondering if there's a more efficient way to do this using a loop of some sort.
Here's a sample of what my original data file looks like:
data = {'ID': [105, 106, 107, 108, 109, 110, 111, 112],
'Name': ['Bill', 'Jane', 'Mary', 'Rich', 'Tomas', 'Kiki', 'Martin', 'Larry'],
'Cohort': ['Cohort A', 'Cohort A', 'Cohort A', 'Cohort A', 'Cohort B', 'Cohort B', 'Cohort B', 'Cohort B'],
'Program Size': ['small', 'large', 'medium', 'medium', 'large', 'small', 'large', 'medium'],
'Rating': ['excellent', 'good', 'needs improvement', 'needs improvement', 'good', 'excellent', 'good', 'excellent']}
df = pd.DataFrame(data)
Looks like this:
ID Name Cohort Program Size Rating
0 105 Bill Cohort A small excellent
1 106 Jane Cohort A large good
2 107 Mary Cohort A medium needs improvement
3 108 Rich Cohort A medium needs improvement
4 109 Tomas Cohort B large good
5 110 Kiki Cohort B small excellent
6 111 Martin Cohort B large good
7 112 Larry Cohort B medium excellent
I always want to group by the Cohort column, plus one of the other columns and get counts for each combination. In my new data file, I'll have a Cohort column, a "Variable" column (that will be the other column in the group by), the options present the "Description", and then the Frequency. Here's what my desired result would look like:
desired = {'Cohort': ['Cohort A', 'Cohort A', 'Cohort A', 'Cohort B', 'Cohort B', 'Cohort B', 'Cohort A', 'Cohort A', 'Cohort A', 'Cohort B', 'Cohort B'],
'Variable': ['size', 'size', 'size', 'size', 'size', 'size', 'rating', 'rating', 'rating', 'rating', 'rating'],
'Description': ['small', 'medium', 'large', 'small', 'medium', 'large', 'excellent', 'good', 'needs improvement', 'excellent', 'good'],
'Frequency': [1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2]}
desired_df = pd.DataFrame(desired)
Cohort Variable Description Frequency
0 Cohort A size small 1
1 Cohort A size medium 2
2 Cohort A size large 1
3 Cohort B size small 1
4 Cohort B size medium 1
5 Cohort B size large 2
6 Cohort A rating excellent 1
7 Cohort A rating good 1
8 Cohort A rating needs improvement 2
9 Cohort B rating excellent 2
10 Cohort B rating good 2
I've been able to write a loop to group by Cohort and my other columns and get frequencies, but I'm not sure how to put it all together (append dataframes?) to get to my desired result. I appreciate any guidance on next steps! (Here is what I wrote to loop through my columns and get frequencies:
cols = ['Program Size', 'Rating']
for i in cols:
grouped_df = df.groupby(['Cohort', (i)], as_index=False).agg(
frequency=('ID', 'count')
)
print(f"Grouped by Cohort and {i}:\n{grouped_df}\n")
You can use a list comprehension.
['Cohort', col]
.pd.concat()
to combine dataframes.col_mapping = {'Program Size': 'size', 'Rating': 'rating'}
final_df = pd.concat([
df.groupby(['Cohort', col], as_index=False)
.agg(Frequency=('ID', 'count'))
.rename(columns={col: 'Description'})
.assign(Variable=col_mapping[col])
[['Cohort', 'Variable', 'Description', 'Frequency']]
for col in col_mapping.keys()
], ignore_index=True).sort_values(['Cohort', 'Variable'])