
Get Maximum Intersection as An Aggregate Function in Python


I have a dataframe like the one below (the favorite_food column is available either as arrays or unnested, with one row per food):

team  | player     | favorite_food
  A   | A_player1  | [pizza, sushi]
  A   | A_player2  | [salad, sushi]
  B   | B_player1  | [pizza, pasta, salad, taco]
  B   | B_player2  | [taco, salad, sushi]
  B   | B_player3  | [taco]

I want to get, for each team, the number of foods that all of its players have in common, and that number as a fraction of all the distinct foods named on the team. Something like this:

team  | #_food_common | percent_food_common
  A   | 1             |  0.33
  B   | 1             |  0.2

What is a good way to do this in Python, preferably with pandas?


Solution

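  • You can convert each list to a set and combine the sets per team with groupby.agg. The size of the intersection is the number of common foods, and dividing it by the size of the union gives the fraction: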

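    # Convert each list to a set, then aggregate the sets per team
    (df['favorite_food'].apply(set)
     .groupby(df['team'])
     .agg(**{# foods shared by every player on the team
             '#_food_common': lambda x: len(set.intersection(*x)),
             # shared foods as a fraction of all foods named on the team
             'percent_food_common': lambda x: len(set.intersection(*x))/len(set.union(*x)),
            })
     .reset_index()
    )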
    

    Output:

      team  #_food_common  percent_food_common
    0    A              1             0.333333
    1    B              1             0.200000
    

    Used input:

    import pandas as pd

    df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B'],
                       'player': ['A_player1', 'A_player2', 'B_player1', 'B_player2', 'B_player3'],
                       'favorite_food': [['pizza', 'sushi'],
                                         ['salad', 'sushi'],
                                         ['pizza', 'pasta', 'salad', 'taco'],
                                         ['taco', 'salad', 'sushi'],
                                         ['taco']]})
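
The question mentions that the data is also available unnested. Below is a minimal sketch for that case, assuming a long format with one row per (team, player, food). The names `long_df`, `team_size`, `counts`, `n_players_with_food` and `is_common` are made up for illustration, and the long table is built here with `explode` only so the example is self-contained. A food counts as common when the number of distinct players naming it equals the team size, and the mean of that flag over a team's distinct foods is the same intersection-over-union fraction as above.

```python
import pandas as pd

# Hypothetical long-format input: one row per (team, player, food).
long_df = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B', 'B'],
    'player': ['A_player1', 'A_player2', 'B_player1', 'B_player2', 'B_player3'],
    'favorite_food': [['pizza', 'sushi'],
                      ['salad', 'sushi'],
                      ['pizza', 'pasta', 'salad', 'taco'],
                      ['taco', 'salad', 'sushi'],
                      ['taco']],
}).explode('favorite_food')

# Number of distinct players on each team.
team_size = long_df.groupby('team')['player'].nunique()

# Number of distinct players naming each food, per team.
counts = (long_df.groupby(['team', 'favorite_food'])['player']
          .nunique()
          .rename('n_players_with_food')
          .reset_index())

# A food is common to the team if every player on the team names it.
counts['is_common'] = counts['n_players_with_food'] == counts['team'].map(team_size)

# Sum of the flag = size of the intersection;
# mean of the flag = intersection size / number of distinct foods (the union).
result = (counts.groupby('team')['is_common']
          .agg(**{'#_food_common': 'sum', 'percent_food_common': 'mean'})
          .reset_index())

print(result)
```

One caveat with this sketch: a player who has no rows at all in the long table is invisible to `team_size`, so the result would differ from the set-based version, where an empty list makes the intersection empty.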