python pandas group-by cumulative-frequency running-count

Pandas cumulative count across different groups

I've got the following DataFrame :

df = pd.DataFrame({'A': ['Nadal', 'Federer', 'Djokovic', 'Nadal', 'Nadal', 'Murray', 'Nadal'], 
                   'B': ['Djokovic', 'Nadal', 'Murray', 'Murray', 'Djokovic', 'Federer', 'Murray'],
                   'Winner': ['Nadal', 'Federer', 'Djokovic', 'Murray', 'Nadal', 'Federer', 'Murray'],
                   'Loser': ['Djokovic', 'Nadal', 'Murray', 'Nadal', 'Djokovic', 'Murray', 'Nadal']})

And I'd like to create new features based on these results. For example, the number of times the player currently in column A has won and lost games up to this row.

The results would look like that :

          A         B    Winner     Loser  count_of_wins_A  count_of_wins_B
0     Nadal  Djokovic     Nadal  Djokovic                0                0
1   Federer     Nadal   Federer     Nadal                0                1
2  Djokovic    Murray  Djokovic    Murray                0                0
3     Nadal    Murray    Murray     Nadal                1                0
4     Nadal  Djokovic     Nadal  Djokovic                1                1
5    Murray   Federer   Federer    Murray                1                1
6     Nadal    Murray    Murray     Nadal                2                1

On row 6, we would read the results as Player A : Nadal has won 3 matches before the current one and Player B : Murray has won 1 match before this one.

I want to pass this first step in order to calculate other characteristics like :

the winning percentage of each player so far,
the winning percentage of each player in the last N games, etc.

I tried a lot of things and never got it right. I know I can do it easily by keeping track of the number of wins and losses of each player in a dictionary and iterating over all the lines, but it's not really efficient. And this solution will become difficult to maintain when I want to calculate win percentages depending on the tournament and the player in question for more complex features.

I'm able to get the cumulative number of wins of the current player in the winner column with :

df['Winner_wins'] = df_atp.groupby('Winner').cumcount() But I don't know how to access efficiently the number of times the current player in column A appeared in the column Winner or Loser before.

I'm aware that I can get the dummies of a specific column like that :

counts = pd.get_dummies(df['Winner']).cumsum()

To calculate the number of times the Player A appeared in the column Winner before the current row, I think I have two choices:

perform a sort of groupby the current player in column A and then count cumulatively the number of times this group appeared in the Winner column for example.

retrieve the corresponding corresponding dummies in counts at the right index based on the current name in column A

But I'm really struggling to put all the pieces of the puzzle together. Not to mention when I would want to group my data by player and by court type...

Does anyone have the solution?

Solution

Use get_dummies+shift+cumsum and indexing lookup:

import numpy as np

tmp = pd.get_dummies(df['Winner']).shift(fill_value=0).cumsum()

idxA, colsA = pd.factorize(df['A'])
df['count_of_wins_A'] = tmp.reindex(colsA, axis=1).to_numpy()[np.arange(len(df)), idxA]

idxB, colsB = pd.factorize(df['B'])
df['count_of_wins_B'] = tmp.reindex(colsB, axis=1).to_numpy()[np.arange(len(df)), idxB]

Using a helper function:

import numpy as np

tmp = pd.get_dummies(df['Winner']).shift(fill_value=0).cumsum()

def lookup(s, tmp):
    idx, cols = pd.factorize(s)
    return tmp.reindex(cols, axis=1).to_numpy()[np.arange(len(s)), idx]

df['count_of_wins_A'] = lookup(df['A'], tmp)
df['count_of_wins_B'] = lookup(df['B'], tmp)

Output:

          A         B    Winner     Loser  count_of_wins_A  count_of_wins_B
0     Nadal  Djokovic     Nadal  Djokovic                0                0
1   Federer     Nadal   Federer     Nadal                0                1
2  Djokovic    Murray  Djokovic    Murray                0                0
3     Nadal    Murray    Murray     Nadal                1                0
4     Nadal  Djokovic     Nadal  Djokovic                1                1
5    Murray   Federer   Federer    Murray                1                1
6     Nadal    Murray    Murray     Nadal                2                1