I am currently working with an NFL dataset. The dataset is very granular but straightforward: each row represents the position of a player (x, y coordinates relative to the field) for a given frame, for a given play, in a given game. You can think of a frame as a snapshot in time: at that point we record the coordinates of all the players on the field and enter them as one row per player in the dataframe. Each play has ~70 frames, each game has ~80 plays, and we have 250+ games.
What I want is to identify, for certain players on offense (specifically wide receivers), who their closest defender is and how far away they are. So ideally, I would apply a function that takes in a frame and outputs two columns, populated only for the wide receivers: the closest defender and their distance to the WR. I would use that function over all frames, within all plays, within all games.
I am struggling to come up with an efficient solution. I use the distance_matrix function from scipy to calculate the two metrics, but what's most difficult right now is iterating through the 1MM+ combinations of game, play, and frame.
I was thinking of maybe using the apply function to get to a result, but that would still involve iterating through the various combinations of game, play, and frame. I'm thinking there may even be a vectorized solution, but I can't come up with anything.
Any advice here would be immensely helpful. I've pasted my current working code below, which just uses for loops and takes a really long time.
import pandas as pd
from scipy.spatial import distance_matrix

temp = pd.DataFrame()
# For each game, each play within the game, and each frame within the play
for game_id in test.gameId.unique():
    for play_id in test[test.gameId == game_id].playId.unique():
        for frame_id in test[(test.playId == play_id) & (test.gameId == game_id)].frameId.unique():
            print("Game: {} | Play: {} | Frame: {}".format(game_id, play_id, frame_id))
            # Filter the dataframe on a given frame, within a given play, within a given game
            df = test[(test.gameId == game_id) &
                      (test.playId == play_id) &
                      (test.frameId == frame_id)]
            # Isolate the wide receivers
            df_wr = df[(df["inPoss"] == 1) & (df['position'] == "WR")]
            # Isolate the defenders
            df_d = df[df["inPoss"] == 0]
            # Calculate the distance matrix between each WR and the defenders
            dm = distance_matrix(df_wr[['x', 'y']].values,
                                 df_d[['x', 'y']].values)
            # Use argmin and min to record the closest defender and their distance
            closest_defender = dm.argmin(axis=1)
            closest_defender_distance = dm.min(axis=1)
            # Create a dataframe to record the information
            for i, j in enumerate(closest_defender):
                temp_df = pd.DataFrame({
                    'gameId': [game_id],
                    'playId': [play_id],
                    'frameId': [frame_id],
                    'displayName': [df_wr.displayName.iloc[i]],
                    'closestDefender': [df_d.displayName.iloc[j]],
                    'closestDefenderDistance': [closest_defender_distance[i]]
                })
                temp = pd.concat([temp, temp_df])
Obviously I don't have any data, so I can't test my code robustly. But there are some guiding principles I can lay out.
You do not want to be doing so much subsetting. To avoid this, you can groupby game, play, and frame:
for g, grouped_df in test.groupby(['gameId', 'playId', 'frameId']):
    ...  # do your isolation stuff here
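One small note: because you're grouping on three keys, the group label g is a tuple of (gameId, playId, frameId), so you can unpack it directly in the loop header, e.g.:

for (game_id, play_id, frame_id), frame_df in test.groupby(['gameId', 'playId', 'frameId']):
    ...  # frame_df already holds only this frame's rows, so no boolean masks are needed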
Grouping like this gets rid of the repeated subsetting while letting you keep the rest of the code you're already using. Second, you shouldn't constantly be concatenating onto your existing data frame. Instead, create a list of data frames and concatenate once at the end, i.e.:
temp = []
for ... in ...:
    result_df = ...  # how you produce the result
    temp.append(result_df)
final = pd.concat(temp, axis='rows')
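Putting both ideas together, a rough sketch for your case might look like this (using the same dataframe name and columns as in your code, and keeping your distance_matrix logic unchanged):

import pandas as pd
from scipy.spatial import distance_matrix

results = []
for (game_id, play_id, frame_id), frame_df in test.groupby(['gameId', 'playId', 'frameId']):
    # Isolate the wide receivers and the defenders within this frame
    df_wr = frame_df[(frame_df['inPoss'] == 1) & (frame_df['position'] == 'WR')]
    df_d = frame_df[frame_df['inPoss'] == 0]
    if df_wr.empty or df_d.empty:
        continue  # nothing to match up in this frame

    # Pairwise WR-to-defender distances, then the closest defender per WR
    dm = distance_matrix(df_wr[['x', 'y']].values, df_d[['x', 'y']].values)
    closest_idx = dm.argmin(axis=1)

    # One small dataframe per frame, appended to a list rather than concatenated in place
    results.append(pd.DataFrame({
        'gameId': game_id,
        'playId': play_id,
        'frameId': frame_id,
        'displayName': df_wr['displayName'].values,
        'closestDefender': df_d['displayName'].values[closest_idx],
        'closestDefenderDistance': dm.min(axis=1),
    }))

final = pd.concat(results, ignore_index=True)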
You could also reduce the whole thing to a function, which you then apply over the groupby. The function would have the signature:
def complex_function(df):
    ...  # it can return multiple columns and rows as well

result = test.groupby(['gameId', 'playId', 'frameId']).apply(complex_function)
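As a sketch of what that function could contain for your case (same column assumptions as above), it receives all rows for one (gameId, playId, frameId) group and returns one row per wide receiver; the group keys come back as the index of the result, so they don't need to be added as columns:

import pandas as pd
from scipy.spatial import distance_matrix

def complex_function(frame_df):
    # frame_df is every row belonging to one (gameId, playId, frameId) group
    df_wr = frame_df[(frame_df['inPoss'] == 1) & (frame_df['position'] == 'WR')]
    df_d = frame_df[frame_df['inPoss'] == 0]
    if df_wr.empty or df_d.empty:
        # Return an empty frame so groups without WRs or defenders contribute no rows
        return pd.DataFrame(columns=['displayName', 'closestDefender', 'closestDefenderDistance'])

    dm = distance_matrix(df_wr[['x', 'y']].values, df_d[['x', 'y']].values)
    closest_idx = dm.argmin(axis=1)
    return pd.DataFrame({
        'displayName': df_wr['displayName'].values,
        'closestDefender': df_d['displayName'].values[closest_idx],
        'closestDefenderDistance': dm.min(axis=1),
    })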
Returning a data frame from groupby.apply here is somewhat tricky. The group keys become the outer levels of the result's index, stacked on top of each returned data frame's index, so you may need to reset or flatten the index afterwards. The columns come through properly, however.
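If you do end up with that stacked index, something along these lines (assuming you don't need the inner per-group index) flattens it back into ordinary columns:

# Drop the inner index contributed by the returned frames, then move the
# group keys (gameId, playId, frameId) back out into regular columns
flat = result.droplevel(-1).reset_index()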