I have a pandas dataframe like the one below:
import pandas as pd

data = {
    'cust_id': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc'],
    'product_id': [12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
    'purchase_country': ['India', 'India', 'India', 'Australia', 'Australia', 'Australia', 'Australia', 'Australia', 'Australia', 'Australia']
}
df = pd.DataFrame(data)
My objective is to do the following for each group of cust_id and product_id:
a) Create two output columns: 'pct_region_split' and 'num_region_split'.
b) For 'pct_region_split', store the percentage split by country. For example, for the group shown in the sample data, Australia is 70% (7 out of 10) and India is 30% (3 out of 10).
c) For 'num_region_split', store the number of rows for each country value. For example, for the group shown in the sample data, Australia has 7 rows out of 10 and India has 3 out of 10.
d) Store the values in list format, in descending order. That is, Australia should appear first because its value (70%) is higher than India's.
I tried the following, but it is going nowhere:
df['total_purchases'] = df.groupby(['cust_id', 'product_id'])['purchase_country'].transform('size')
df['unique_country'] = df.groupby(['cust_id', 'product_id'])['purchase_country'].transform('nunique')
Please note that my real data has more than 1000 customers and 200 product combinations.
I expect the output in a new dataframe, with one row for each cust_id and product_id combination.
Use a custom function and groupby.apply:
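def f(g):
    # Rows per country within the group; value_counts sorts in descending order
    s = g['purchase_country'].value_counts()
    return pd.Series({
        # e.g. "Australia:7, India:3"
        'num_region_split': ', '.join(s.index + ':' + s.astype('str')),
        # Fraction of the group's rows per country, e.g. "Australia:0.7, India:0.3"
        'pct_region_split': ', '.join(s.index + ':' + s.div(s.sum()).astype('str')),
    })

df.groupby(['cust_id', 'product_id'], as_index=False).apply(f)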
Output:
  cust_id  product_id      num_region_split          pct_region_split
0     abc          12  Australia:7, India:3  Australia:0.7, India:0.3
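The output above stores joined strings and fractions, while the question asks for lists and percentages. A minimal sketch of one way to get that without groupby.apply is given below, which may also be quicker with many groups. The helper columns n, pct, num_item and pct_item are names made up for this illustration, and the "country:value" item format and rounding to one decimal are assumptions rather than anything specified in the question.

```python
import pandas as pd

data = {
    'cust_id': ['abc'] * 10,
    'product_id': [12] * 10,
    'purchase_country': ['India'] * 3 + ['Australia'] * 7,
}
df = pd.DataFrame(data)

keys = ['cust_id', 'product_id']

# Number of rows per (cust_id, product_id, purchase_country)
counts = (
    df.groupby(keys + ['purchase_country'])
      .size()
      .reset_index(name='n')
)

# Percentage of each country within its (cust_id, product_id) group
group_total = counts.groupby(keys)['n'].transform('sum')
counts['pct'] = (100 * counts['n'] / group_total).round(1)

# Put the largest country first within each group
counts = counts.sort_values(keys + ['n'], ascending=[True, True, False])

# Build the "country:value" items (this format is an assumption)
counts['num_item'] = counts['purchase_country'] + ':' + counts['n'].astype(str)
counts['pct_item'] = counts['purchase_country'] + ':' + counts['pct'].astype(str) + '%'

# Collect the items into one list per group; row order within a group is preserved
result = (
    counts.groupby(keys, as_index=False)
          .agg(num_region_split=('num_item', list),
               pct_region_split=('pct_item', list))
)
print(result)
```

For the sample data this gives a single row, with Australia listed before India in both columns. If plain numbers are preferred over "country:value" strings, the n and pct columns can be aggregated into lists directly instead.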