pythonpandaspandas-groupbysplit-apply-combine

How to add a condition in split apply combine and repeat solution on each row?


I have the following pandas dataframe df:

cluster   tag   amount   name
1         0     200      Michael        
2         1     1200     John        
2         1     900      Daniel        
2         0     3000     David        
2         0     600      Jonny        
3         0     900      Denisse        
3         1     900      Mike        
3         1     3000     Kely        
3         0     2000     Devon  

What I need to do is add another column in df that writes for each row, the name (from the name column) that has the highest amount where the tag is 1. In other words, the solution looks like this:

cluster   tag   amount   name     highest_amount
1         0     200      Michael  NaN      
2         1     1200     John     John   
2         1     900      Daniel   John     
2         0     3000     David    John    
2         0     600      Jonny    John    
3         0     900      Denisse  Kely      
3         1     900      Mike     Kely   
3         1     3000     Kely     Kely   
3         0     2000     Devon    Kely

I've tried something like this:

df.group('clusters')['name','amount'].transform('max')[df['tag']==1]

but the problem with this is that the name does note repeat on every row. It will look like this:

cluster   tag   amount   name     highest_amount
1         0     200      Michael  NaN      
2         1     1200     John     John   
2         1     900      Daniel   John     
2         0     3000     David    NaN    
2         0     600      Jonny    NaN    
3         0     900      Denisse  NaN      
3         1     900      Mike     Kely   
3         1     3000     Kely     Kely   
3         0     2000     Devon    NaN

Can someone please let me know how to add a condition with split apply combine, and have the solution repeated on each row?


Solution

  • You can do this as a two-stage process. First calculate a mapping series, then map by cluster:

    s = df.query('tag == 1')\
          .sort_values('amount', ascending=False)\
          .drop_duplicates('cluster')\
          .set_index('cluster')['name']
    
    df['highest_name'] = df['cluster'].map(s)
    
    print(df)
    
       cluster  tag  amount     name highest_name
    0        1    0     200  Michael          NaN
    1        2    1    1200     John         John
    2        2    1     900   Daniel         John
    3        2    0    3000    David         John
    4        2    0     600    Jonny         John
    5        3    0     900  Denisse         Kely
    6        3    1     900     Mike         Kely
    7        3    1    3000     Kely         Kely
    8        3    0    2000    Devon         Kely
    

    If you want to use groupby, here's one way:

    def func(x):
        names = x.query('tag == 1').sort_values('amount', ascending=False)['name']
        return names.iloc[0] if not names.empty else np.nan
    
    df['highest_name'] = df['cluster'].map(df.groupby('cluster').apply(func))