python pandas

How do you summarize a data frame in pandas based on values?


I have a data frame like this:

df

Node_Name size  count
Abc1       10     2
Abc1       20     2
Zxd        30     3
Zxd        40     3
Zxd        80     3
Ddd        10     4
Ddd        40     4
Ddd        80     4
Ddd       100     4

I need to summarize this data frame as follows.

If the count value per Node_Name is 2 or less, take the minimum size per Node_Name. If count is 3 or more, remove the maximum and sum the remaining size values, grouped by Node_Name.

For example final_df should look like this:

Node_Name size  count
Abc1       10     2
Zxd        70     3
Ddd       130     4

Solution

  • original question: aggregation that depends on a group value

    If I understand the logic correctly: for each group, use groupby.apply to take the sum of the two smallest values if count > 2, otherwise the minimum:

    out = (df.groupby(['Node_Name', 'count'], sort=False)['size']
             .apply(lambda x: x.nsmallest(2).sum() if x.name[1] > 2 else x.min())
             .reset_index()
          )
    

    NB. x.name holds the key of the group (for the first group this is ('Abc1', 2)), so x.name[1] is the count.

    Output:

      Node_Name  count  size
    0      Abc1      2    10
    1       Zxd      3    70
    2       Ddd      4    50
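
    To make the role of x.name concrete, here is a minimal sketch (not part of the original answer) that records what each call to the applied function receives. It rebuilds the sample frame from the question; the names seen and record_and_min are hypothetical helpers introduced only for this illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Node_Name": ["Abc1", "Abc1", "Zxd", "Zxd", "Zxd", "Ddd", "Ddd", "Ddd", "Ddd"],
    "size": [10, 20, 30, 40, 80, 10, 40, 80, 100],
    "count": [2, 2, 3, 3, 3, 4, 4, 4, 4],
})

seen = {}  # hypothetical helper: group key -> list of sizes in that group

def record_and_min(x):
    # x is the 'size' Series of one group; x.name is that group's key,
    # a (Node_Name, count) tuple because we group by two columns
    seen[x.name] = x.tolist()
    return x.min()

df.groupby(["Node_Name", "count"], sort=False)["size"].apply(record_and_min)

for key, values in seen.items():
    print(key, values)
```

    Each key is a two-element tuple, so indexing it with [1] gives the count for that group.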
    

    variant: sum all but the largest

    out = (df.groupby(['Node_Name', 'count'], sort=False)['size']
             .agg(lambda x: x.sort_values().iloc[:-1].sum())
             .reset_index()
          )
    
    # or
    out = (df.groupby(['Node_Name', 'count'], sort=False)['size']
             .agg(lambda x: x.nsmallest(len(x)-1).sum())
             .reset_index()
          )
    

    Output:

      Node_Name  count  size
    0      Abc1      2    10
    1       Zxd      3    70
    2       Ddd      4   130
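
    The rule as stated in the question (minimum when count is 2 or less, otherwise the sum of everything except the maximum) can also be written explicitly without a lambda. The following is a sketch under assumptions, not part of the original answer: it rebuilds the sample frame, and the names stats and final_df are illustrative. It subtracts the group maximum from the group sum, which drops a single occurrence of the maximum.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Node_Name": ["Abc1", "Abc1", "Zxd", "Zxd", "Zxd", "Ddd", "Ddd", "Ddd", "Ddd"],
    "size": [10, 20, 30, 40, 80, 10, 40, 80, 100],
    "count": [2, 2, 3, 3, 3, 4, 4, 4, 4],
})

# Per-group min, max and sum of 'size', keeping the original group order
stats = (df.groupby(["Node_Name", "count"], sort=False)["size"]
           .agg(["min", "max", "sum"])
           .reset_index())

# count >= 3: sum of all values except one maximum; otherwise: the minimum
stats["size"] = (stats["sum"] - stats["max"]).where(stats["count"] >= 3, stats["min"])

final_df = stats[["Node_Name", "size", "count"]]
print(final_df)
```

    With this form, a group whose count is 1 keeps its single value, whereas summing all but the largest would give 0 for such a group.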