pythonpandasvectorizationapplydistance-matrix

What is the best way to apply a function to subsets of a dataframe?


I am currently working with an NFL dataset. The dataset is very granular, but very straightforward, each row represents the position of a player (x,y coordinates relative to the field) for a given frame, for a given play, in a given game. You can think of a frame as snapshot in time, where at that point, we record the coordinates of all the players on the field, and input them as one row per player in the dataframe. Each play has ~70 frames, and each game has ~80 plays, and we have 250+ games

What I want, is to identify, for certain players on offense (specifically wide receivers), who their closest defender is, and how far away they are. So ideally, I would apply a function that takes in a frame, and outputs two columns, populated only for the wide receivers - the two columns being closest defenders and their distance to the WR. I would use that function over all frames, within all plays, within all games.

I am struggling to come up with an efficient solution. I use the distance_matrix function from scipy to calculate the two metrics, but what's most difficult right now is iterating through the 1MM+ combinations of game, play and frames.

I was thinking of maybe using the apply function to get to a result, but it would still involve iterating through the various combinations of games, play and frame. I'm thinking maybe there's even a vectorized solution but i can't come up with anything.

Any advice here would be immensly helpful - I've pasted my current working code below, which just uses for loops and takes a really long time

temp = pd.DataFrame()
# For each game, and each play within the games, and frames within the play
for game_id in test.gameId.unique():
    for play_id in test[test.gameId==game_id].playId.unique():
        for frame_id in test[(test.playId==play_id)&(test.gameId==game_id)].frameId.unique():
            print("Game: {} | Play: {} | Frame: {}".format(game_id,play_id,frame_id))
            
            # Filter the dataframe on a given frame, within a given play, within a given game
            df = test[(test.gameId==game_id)&
                    (test.playId==play_id)&
                    (test.frameId==frame_id)
                   ]
            
            # Isolate the wide receivers
            df_wr = df[(df["inPoss"]==1)&(df['position']=="WR")]
            # Isolate the defenders
            df_d =  df[df["inPoss"]==0]
            # Calculate the distance matrix between each WR and defenders
            dm = distance_matrix(df_wr[['x','y']].values, 
                     df_d[['x','y']].values)
            
            # use argmin and min to record the closest defender, and their distance
            closest_defender = dm.argmin(axis=1)
            closest_defender_distance = dm.min(axis=1)
            
            # Create a dataframe to record  the information
            for i,j in enumerate(closest_defender):
                temp_df = pd.DataFrame({
                    'gameId':[game_id]
                    ,'playId':[play_id]
                    ,'frameId':[frame_id]
                    ,'displayName':[df_wr.displayName.iloc[i]]
                    ,'closestDefender':[df_d.displayName.iloc[j]]
                    ,'closestDefenderDistance':[closest_defender_distance[i]]
                })

                temp = pd.concat([temp, temp_df])

Solution

  • Obviously I don't have any data so I can't test my code robustly. But there are some guiding principles that I can elucidate on.

    You do not want to be doing so much subsetting. To avoid this, you can groupby game, play, and frame:

    for g, grouped_df in test.groupby(['gameId', 'playId', 'frameId']):
        ...  # do your isolation stuff here
    

    This also means you don't need to do your subsetting while continuing to use the code you're already using right now. If you were to do this, you shouldn't constantly be concatenating to your existing data frame. Instead, create a list of data frames and concatenate at the end. Ie:

    temp = []
    for ... in ...:
        result_df = ...  # how you produce the result
        temp.append(result_df)
    
    final = pd.concat(temp, axis='rows')
    

    You also could reduce your entire thing into a function which you then apply over the groupby. The function would have signature:

    def complex_function(df):
        ...  # it can return multiple columns and rows as well
    
    result = test.groupby(['gameId', 'playId', 'frameId']).apply(complex_function)
    

    Returning a data frame in your groupby.apply here is somewhat tricky. The index of the returned data frame is broadcast to your result index and may require resetting or flattening. The columns are broadcasted properly, however.