I have a Pandas dataframe with thousands of rows that I want to combine to smaller number of rows.
In the new dataframe, I want:
my search shows I might use agg function to achieve the above piece (not too sure). But hardest part is to write the criteria with Pandas that select the number of rows needed to merged. Here is an example:
df = pd.DataFrame({'col1': [1, 1, 2, 2, 3, 4, 2],
'col2': [10, 20, 30, 40, 50, 60, 70],
'col3': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
'col4': [2, 1, 4, 5, 8, 10, 3]})
My criteria for the number of rows to be merged is so that summation of column 1 values for them to be above a certain threshold. Assume it's 3 here:
Columns 1 and 2 are sum of grouped rows. Column 3 is average of grouped rows. Column 3 is weighted average of grouped rows with weight factor being column 1.
newdf = pd.DataFrame({'col1': [4, 5, 4, 2],
'col2': [60, 90, 60, 70],
'col3': [0.2, 0.45, 0.6, 0.7],
'col4': [2.75, 6.8, 10, 3]})
Can someone help me?
I looked into something like this but two issues remained: the grouping criteria doesn't work, and not sure how the weighted average
agg_funcs = {'col1': 'sum', 'col2': 'sum','col3': 'mean', 'col4': lambda x: (x['col1'] * x['col4']).sum() / x['col1'].sum()}
grouped_df = df.groupby('col1').agg(agg_funcs)
The code failed and complained about the weighted average piece, but the grouping criteria is not correct too!
You need to change two things:
agg
. agg
only works by column/Series without being aware of the other columns. You can however pre-compute the weights, sum
, then divide by the sum of weights.def threshold_grouper(s, thresh=3):
group = []
i = 0
total = 0
for val in s:
total += val
group.append(i)
if total>=thresh:
i+=1
total=0
return group
(df.eval('col4 = col4*col1')
.groupby(threshold_grouper(df['col1']))
.agg({'col1': 'sum', 'col2': 'sum', 'col3': 'mean', 'col4': 'sum'})
.eval('col4 = col4/col1')
)
You can also cheat and use a side effect in agg
, accessing "col1" externally:
(df.groupby(threshold_grouper(df['col1']))
.agg({'col1': 'sum', 'col2': 'sum', 'col3': 'mean',
'col4': lambda g: np.average(g, weights=df['col1'].reindex_like(g))})
)
Output:
col1 col2 col3 col4
0 4 60 0.20 2.75
1 5 90 0.45 6.80
2 4 60 0.60 10.00
3 2 70 0.70 3.00