I have a dataframe that has user ID, code, and value.
user  code  value
0001  P     10
0001  P     20
0001  N     10
0002  N     40
0002  N     30
0003  P     10
I am trying to add a new column that groups by User ID, filters for code = P
and sums the value. However I want this value to be applied to every row.
So for the example above, the output I'm looking for would be:
user  code  value  Sum_of_P
0001  P     10     30
0001  P     20     30
0001  N     10     30
0002  N     40     0
0002  N     30     0
0003  P     10     10
I tried doing df['Sum_of_P'] = df.loc[df['code'] == 'P', 'value'].groupby(df['user']).transform('sum'), but this applies only to the rows with code = P. Is there a way to have this apply to all rows?
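Use a mask and where rather than loc. Unlike loc, where keeps every row (it only replaces the non-P values), so transform returns a sum for all rows: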
df['Sum_of_P'] = (df['value'].where(df['code'].eq('P'), 0)
                  .groupby(df['user']).transform('sum')
                 )
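For a version that can be run as is, here is a minimal sketch of the where approach on the sample data from the question. The DataFrame construction and the intermediate name masked are assumptions added for illustration; user IDs are stored as strings here so the leading zeros survive.

```python
import pandas as pd

# Sample data from the question (user IDs as strings to keep leading zeros).
df = pd.DataFrame({
    "user": ["0001", "0001", "0001", "0002", "0002", "0003"],
    "code": ["P", "P", "N", "N", "N", "P"],
    "value": [10, 20, 10, 40, 30, 10],
})

# Keep `value` where code is P and replace every other value with 0.
# The result has the same index as df, so no rows are lost.
masked = df["value"].where(df["code"].eq("P"), 0)

# Sum per user and broadcast that sum back onto every row of the group.
df["Sum_of_P"] = masked.groupby(df["user"]).transform("sum")

print(df)
```

Because the masked Series is grouped by df['user'] rather than by a filtered subset, users with no P rows still get a result (0) instead of NaN.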
Variant with NaNs as masked values:
df['Sum_of_P'] = (df['value'].where(df['code'].eq('P'))
                  .groupby(df['user']).transform('sum')
                  .convert_dtypes()
                 )
If you want to use loc, you should aggregate rather than transform, then map the values from the group:
s = df.loc[df['code'] == 'P'].groupby('user')['value'].sum()
df['Sum_of_P'] = df['user'].map(s).fillna(0).convert_dtypes()
Output:
   user code  value  Sum_of_P
0     1    P     10        30
1     1    P     20        30
2     1    N     10        30
3     2    N     40         0
4     2    N     30         0
5     3    P     10        10
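The loc-based aggregate-then-map variant can also be sketched end to end. The DataFrame construction and the name p_sums are assumptions for illustration, not part of the original answer; user IDs are strings here, so the printed user column will show 0001 rather than 1.

```python
import pandas as pd

# Sample data from the question (user IDs as strings to keep leading zeros).
df = pd.DataFrame({
    "user": ["0001", "0001", "0001", "0002", "0002", "0003"],
    "code": ["P", "P", "N", "N", "N", "P"],
    "value": [10, 20, 10, 40, 30, 10],
})

# Aggregate only the P rows: one sum per user that has at least one P row.
p_sums = df.loc[df["code"] == "P"].groupby("user")["value"].sum()

# Look up each row's user in p_sums. Users without any P row are absent
# from p_sums, so map yields NaN for them; fillna(0) turns that into 0,
# and convert_dtypes() restores an integer dtype.
df["Sum_of_P"] = df["user"].map(p_sums).fillna(0).convert_dtypes()

print(df)
```

The aggregation produces one value per user rather than one per row, which is why the explicit map step is needed to spread it back over the full frame.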