python pandas

Pandas Filter and Sum but Apply to All Rows


I have a dataframe that has user ID, code, and value.

user    code    value
0001       P       10
0001       P       20
0001       N       10
0002       N       40
0002       N       30
0003       P       10

I am trying to add a new column that, for each user ID, sums value over the rows where code = P. However, I want that sum to appear on every row for the user, including rows with a different code.

So for the example above, the output I'm looking for would be:

user    code    value    Sum_of_P
0001       P       10          30
0001       P       20          30
0001       N       10          30
0002       N       40           0
0002       N       30           0
0003       P       10          10

I tried df['Sum_of_P'] = df.loc[df['code'] == 'P', 'value'].groupby(df['user']).transform('sum'), but this fills in only the rows with code = P. Is there a way to have this apply to all rows?
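
For reference, a minimal sketch that reproduces the behaviour described above. The DataFrame construction is an assumption based on the sample table (user IDs stored as strings to keep the leading zeros), and the names partial and attempt are only illustrative. The filtered result keeps just the index labels of the P rows, so assigning it back leaves NaN on every other row.

```python
import pandas as pd

# Assumed reconstruction of the sample data; user IDs kept as strings.
df = pd.DataFrame({
    "user": ["0001", "0001", "0001", "0002", "0002", "0003"],
    "code": ["P", "P", "N", "N", "N", "P"],
    "value": [10, 20, 10, 40, 30, 10],
})

# Filtering with loc first drops the non-P rows, so the transform result
# only carries index labels 0, 1 and 5.
partial = df.loc[df["code"] == "P", "value"].groupby(df["user"]).transform("sum")

# Column assignment aligns on the index: rows 2, 3 and 4 have no matching
# label in `partial` and therefore end up as NaN.
attempt = df.assign(Sum_of_P=partial)
print(attempt)
```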


Solution

  • Mask the values with where rather than filtering with loc, so the result keeps every row of the original index:

    df['Sum_of_P'] = (df['value'].where(df['code'].eq('P'), 0)
                      .groupby(df['user']).transform('sum')
                     )
    

    Variant that masks with NaN instead of 0 (the sum then comes back as float, so convert_dtypes restores an integer dtype):

    df['Sum_of_P'] = (df['value'].where(df['code'].eq('P'))
                      .groupby(df['user']).transform('sum')
                      .convert_dtypes()
                     )
    

    If you want to use loc, aggregate rather than transform, then map the per-user sums back onto the user column:

    s = df.loc[df['code'] == 'P'].groupby('user')['value'].sum()
    df['Sum_of_P'] = df['user'].map(s).fillna(0).convert_dtypes()
    

    Output:

       user code  value  Sum_of_P
    0     1    P     10        30
    1     1    P     20        30
    2     1    N     10        30
    3     2    N     40         0
    4     2    N     30         0
    5     3    P     10        10
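
The three approaches above can be checked side by side with a self-contained sketch. The DataFrame construction is an assumption based on the question's sample table (user IDs as strings), and the variable names (is_p, masked_zero, masked_nan, per_user, mapped) are illustrative rather than part of the original answer.

```python
import pandas as pd

# Assumed reconstruction of the sample data from the question.
df = pd.DataFrame({
    "user": ["0001", "0001", "0001", "0002", "0002", "0003"],
    "code": ["P", "P", "N", "N", "N", "P"],
    "value": [10, 20, 10, 40, 30, 10],
})

is_p = df["code"].eq("P")  # boolean mask, same length as df

# 1) Replace non-P values with 0, then broadcast each user's sum to all rows.
masked_zero = df["value"].where(is_p, 0).groupby(df["user"]).transform("sum")

# 2) Replace non-P values with NaN; sum skips NaN and an all-NaN group
#    sums to 0. convert_dtypes turns the float result into nullable Int64.
masked_nan = (
    df["value"].where(is_p)
    .groupby(df["user"]).transform("sum")
    .convert_dtypes()
)

# 3) Aggregate only the P rows, then map the per-user sums onto `user`.
#    Users without any P row are missing from `per_user`, hence fillna(0).
per_user = df.loc[is_p].groupby("user")["value"].sum()
mapped = df["user"].map(per_user).fillna(0).convert_dtypes()

df["Sum_of_P"] = masked_zero
print(df)
```

All three produce the same values on this data. The where-based versions avoid a separate lookup step, while the map version may read more naturally if the per-user sums are also needed on their own.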