I've got the following DataFrame :
df = pd.DataFrame({'A': ['Nadal', 'Federer', 'Djokovic', 'Nadal', 'Nadal', 'Murray', 'Nadal'],
'B': ['Djokovic', 'Nadal', 'Murray', 'Murray', 'Djokovic', 'Federer', 'Murray'],
'Winner': ['Nadal', 'Federer', 'Djokovic', 'Murray', 'Nadal', 'Federer', 'Murray'],
'Loser': ['Djokovic', 'Nadal', 'Murray', 'Nadal', 'Djokovic', 'Murray', 'Nadal']})
And I'd like to create new features based on these results. For example, the number of times the player currently in column A has won and lost games up to this row.
The results would look like that :
A B Winner Loser count_of_wins_A count_of_wins_B
0 Nadal Djokovic Nadal Djokovic 0 0
1 Federer Nadal Federer Nadal 0 1
2 Djokovic Murray Djokovic Murray 0 0
3 Nadal Murray Murray Nadal 1 0
4 Nadal Djokovic Nadal Djokovic 1 1
5 Murray Federer Federer Murray 1 1
6 Nadal Murray Murray Nadal 2 1
On row 6, we would read the results as Player A : Nadal has won 3 matches before the current one and Player B : Murray has won 1 match before this one.
I want to pass this first step in order to calculate other characteristics like :
I tried a lot of things and never got it right. I know I can do it easily by keeping track of the number of wins and losses of each player in a dictionary and iterating over all the lines, but it's not really efficient. And this solution will become difficult to maintain when I want to calculate win percentages depending on the tournament and the player in question for more complex features.
I'm able to get the cumulative number of wins of the current player in the winner column with :
df['Winner_wins'] = df_atp.groupby('Winner').cumcount()
But I don't know how to access efficiently the number of times the current player in column A appeared in the column Winner or Loser before.
I'm aware that I can get the dummies of a specific column like that :
counts = pd.get_dummies(df['Winner']).cumsum()
To calculate the number of times the Player A appeared in the column Winner before the current row, I think I have two choices:
OR
counts
at the right index based on the current name in column ABut I'm really struggling to put all the pieces of the puzzle together. Not to mention when I would want to group my data by player and by court type...
Does anyone have the solution?
Use get_dummies
+shift
+cumsum
and indexing lookup:
import numpy as np
tmp = pd.get_dummies(df['Winner']).shift(fill_value=0).cumsum()
idxA, colsA = pd.factorize(df['A'])
df['count_of_wins_A'] = tmp.reindex(colsA, axis=1).to_numpy()[np.arange(len(df)), idxA]
idxB, colsB = pd.factorize(df['B'])
df['count_of_wins_B'] = tmp.reindex(colsB, axis=1).to_numpy()[np.arange(len(df)), idxB]
Using a helper function:
import numpy as np
tmp = pd.get_dummies(df['Winner']).shift(fill_value=0).cumsum()
def lookup(s, tmp):
idx, cols = pd.factorize(s)
return tmp.reindex(cols, axis=1).to_numpy()[np.arange(len(s)), idx]
df['count_of_wins_A'] = lookup(df['A'], tmp)
df['count_of_wins_B'] = lookup(df['B'], tmp)
Output:
A B Winner Loser count_of_wins_A count_of_wins_B
0 Nadal Djokovic Nadal Djokovic 0 0
1 Federer Nadal Federer Nadal 0 1
2 Djokovic Murray Djokovic Murray 0 0
3 Nadal Murray Murray Nadal 1 0
4 Nadal Djokovic Nadal Djokovic 1 1
5 Murray Federer Federer Murray 1 1
6 Nadal Murray Murray Nadal 2 1