I'm trying to perform a groupby sum on a specific column in a pandas DataFrame, but I only want rows above a certain threshold to contribute to the sum. For this example, the threshold is B > 2.
The groupby is on A and the sum is on C. The correct output should be:
a = 3
b = 9
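import numpy as np
import pandas as pd
df = pd.DataFrame(dict(A=list('ababaa'), B=[1, 1, 3, 4, 5, 6], C=[9, 9, 0, 9, 1, 2]))
# My attempt: the groupby here still sums every row of C in each group
df.loc[(df['B'] > 2), 'Count'] = df.groupby('A')['C'].transform('sum')
df['Count'] = df['Count'].replace(np.nan, 0).astype(int)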
Out (the values after * are what I expect instead):
   A  B  C  Count
0  a  1  9      0
1  b  1  9      0
2  a  3  0     12  *3
3  b  4  9     18  *9
4  a  5  1     12  *3
5  a  6  2     12  *3
Use the mask on both sides of the assignment, so that only the rows with B > 2 are grouped and summed:
m = df['B'] > 2
df['Count'] = 0
df.loc[m, 'Count'] = df[m].groupby('A')['C'].transform('sum')
print(df)
   A  B  C  Count
0  a  1  9      0
1  b  1  9      0
2  a  3  0      3
3  b  4  9      9
4  a  5  1      3
5  a  6  2      3
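To see why the mask is needed on the right-hand side as well, here is a small sketch that compares the two groupby calls on the same example data. The variable names unmasked and masked are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(dict(A=list('ababaa'), B=[1, 1, 3, 4, 5, 6], C=[9, 9, 0, 9, 1, 2]))
m = df['B'] > 2

# Grouping the full frame sums every row of C, including rows with B <= 2.
unmasked = df.groupby('A')['C'].transform('sum')

# Grouping only the filtered rows sums just the rows that pass the threshold.
masked = df[m].groupby('A')['C'].transform('sum')

print(unmasked.tolist())  # [12, 18, 12, 18, 12, 12]
print(masked.to_dict())   # {2: 3, 3: 9, 4: 3, 5: 3}
```

The masked result keeps the original index labels (2 to 5), so assigning it with df.loc[m, 'Count'] aligns each value with its own row.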
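Another idea is to use Series.where, zeroing C outside the mask before grouping and zeroing the result outside the mask afterwards:
m = df['B'] > 2
df['Count'] = df['C'].where(m, 0).groupby(df['A']).transform('sum').where(m, 0)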
Or numpy.where:
m = df['B'] > 2
df['Count'] = np.where(m, df['C'].where(m, 0).groupby(df['A']).transform('sum'), 0)
Or multiply by the mask:
m = df['B'] > 2
df['Count'] = df['C'].where(m, 0).groupby(df['A']).transform('sum').mul(m)
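As a quick sanity check, the sketch below compares the three alternatives against the masked assignment on the example data. It assumes the goal is the per-group sum of C over rows with B > 2 (as in the expected output), so C is zeroed outside the mask before grouping. The names expected, group_sum and variants are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=list('ababaa'), B=[1, 1, 3, 4, 5, 6], C=[9, 9, 0, 9, 1, 2]))
m = df['B'] > 2

# Reference result: mask on both sides of the assignment.
expected = pd.Series(0, index=df.index)
expected.loc[m] = df[m].groupby('A')['C'].transform('sum')

# Per-group sum of C with the rows failing the threshold zeroed out first.
group_sum = df['C'].where(m, 0).groupby(df['A']).transform('sum')

variants = {
    'Series.where': group_sum.where(m, 0),
    'numpy.where': pd.Series(np.where(m, group_sum, 0), index=df.index),
    'mul': group_sum.mul(m),
}
for name, result in variants.items():
    assert result.tolist() == expected.tolist(), name

print(expected.tolist())  # [0, 0, 3, 9, 3, 3]
```

All variants avoid filtering the frame, so the result always has the same length and index as df and can be assigned directly as a column.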