I have players A and B, who both played against various opponents:
player | opponent | days ago |
---|---|---|
A | C | 1 |
A | C | 2 |
A | D | 10 |
A | F | 100 |
A | F | 101 |
A | F | 102 |
A | G | 1 |
B | C | 1 |
B | C | 2 |
B | D | 10 |
B | F | 100 |
B | F | 101 |
B | F | 102 |
B | G | 1 |
B | G | 2 |
B | G | 3 |
B | G | 4 |
B | G | 5 |
B | G | 6 |
B | G | 7 |
B | G | 8 |
First, I want to find the opponent that is the most common one. My definition of "most common" is not the total number of matches but more like the balanced number of matches.
For example, if players 1 and 2 played 99 times and 1 time respectively against player 3, I would prefer opponent 4, against whom players 1 and 2 each played 49 times.
To measure the "balanceness" I wrote the following function:
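import numpy as np
from collections import Counter


def balanceness(array: np.ndarray) -> float:
    # Count how often each class (here: each player) occurs.
    classes = list(Counter(array).items())
    m = len(classes)  # number of distinct classes
    n = len(array)    # total number of matches
    # Shannon entropy of the class distribution.
    H = -sum((cnt / n) * np.log(cnt / n) for _, cnt in classes)
    # Normalize by the maximum entropy log(m); undefined when m == 1.
    return H / np.log(m)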
This function works as expected:
>>> balanceness(array=np.array([0, 0, 0, 1, 1, 1]))
1.0
If I run the function on the different opponents I see the following results:
opponent | balanceness | n_matches |
---|---|---|
C | 1 | 4 |
D | 1 | 2 |
F | 1 | 6 |
G | 0.5032583347756457 | 9 |
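The table above can be reproduced with a short sketch along these lines. It assumes the match list from the question is loaded into a pandas DataFrame; the names `rows`, `matches` and `summary` are illustrative and not part of the original post.

```python
from collections import Counter

import numpy as np
import pandas as pd


def balanceness(array) -> float:
    """Normalized Shannon entropy of the class counts in `array`."""
    counts = np.array(list(Counter(array).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(counts)))


# Match list from the question: (player, opponent, days_ago).
rows = (
    [("A", "C", d) for d in (1, 2)]
    + [("A", "D", 10)]
    + [("A", "F", d) for d in (100, 101, 102)]
    + [("A", "G", 1)]
    + [("B", "C", d) for d in (1, 2)]
    + [("B", "D", 10)]
    + [("B", "F", d) for d in (100, 101, 102)]
    + [("B", "G", d) for d in range(1, 9)]
)
matches = pd.DataFrame(rows, columns=["player", "opponent", "days_ago"])

# One row per opponent: balance between A and B, and total match count.
summary = matches.groupby("opponent")["player"].agg(
    balanceness=balanceness, n_matches="count"
)
print(summary)
```

Grouping by opponent and applying the function to the `player` column treats each player as one class, so an opponent faced equally often by A and B scores 1.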
Clearly, opponent F is the most common one. However, the matches of A and B against F are relatively old.
How should I incorporate a recency-factor into my calculation to find the "most recent common opponent"?
Edit
After thinking more about it, I decided to weight each match using the following function:
def weight(days_ago: int, epsilon: float = 0.005) -> float:
    # Exponential decay: a match played today has weight 1.
    return np.exp(-1 * days_ago * epsilon)
I then sum the weights of all the matches against each opponent:
opponent | balanceness | n_matches | weighted_n_matches |
---|---|---|---|
C | 1 | 4 | 3.9701246258837 |
D | 1 | 2 | 1.90245884900143 |
F | 1 | 6 | 3.62106362790388 |
G | 0.5032583347756457 | 9 | 8.81753570603108 |
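As a rough sketch, the `weighted_n_matches` column can be computed by applying the weight to every match and summing per opponent. The DataFrame layout and the names `rows`, `matches` and `weighted_n_matches` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def weight(days_ago, epsilon: float = 0.005):
    """Exponential decay weight; 1.0 for a match played today."""
    return np.exp(-epsilon * days_ago)


# Match list from the question: (player, opponent, days_ago).
rows = (
    [("A", "C", d) for d in (1, 2)]
    + [("A", "D", 10)]
    + [("A", "F", d) for d in (100, 101, 102)]
    + [("A", "G", 1)]
    + [("B", "C", d) for d in (1, 2)]
    + [("B", "D", 10)]
    + [("B", "F", d) for d in (100, 101, 102)]
    + [("B", "G", d) for d in range(1, 9)]
)
matches = pd.DataFrame(rows, columns=["player", "opponent", "days_ago"])

# Weight every match, then sum the weights per opponent.
matches["weight"] = weight(matches["days_ago"])
weighted_n_matches = matches.groupby("opponent")["weight"].sum()
print(weighted_n_matches)
```

With `epsilon = 0.005` a match loses half its weight roughly every 139 days (`ln(2) / 0.005`), so the choice of `epsilon` controls how quickly old matches stop counting.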
Now, opponent C is the "most-recent balanced opponent".
Nevertheless, this method ignores recency at the player level because we sum the values. There could be a scenario where player 1 recently played a lot of matches against player 3, whereas player 2 faced player 3 in the distant past.
How can we find the opponent that is
First, I think "balanceness" needs to consider how many days ago the matches were played. For example, suppose A and B each played 1 match against C, both 100 days ago. Now suppose A and B also each played 1 match against E, 1 day and 199 days ago respectively. Although the number of matches is the same in both cases, their recency differs, so they shouldn't get the same balanceness score.
Using the weight(days_ago) function defined in the question, it is as if A and B both played 0.61 matches against C, while they played 0.995 and 0.37 matches against E respectively. These two scenarios should have different balanceness.
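The C versus E comparison can be worked through numerically. This is a minimal sketch that assumes the question's decay rate of 0.005; `weighted_balance`, `vs_c` and `vs_e` are hypothetical names used only for illustration.

```python
import numpy as np


def weight(days_ago, epsilon: float = 0.005):
    """Exponential decay weight from the question."""
    return np.exp(-epsilon * np.asarray(days_ago, dtype=float))


def weighted_balance(effective_counts) -> float:
    """Normalized entropy of the per-player effective match counts."""
    p = np.asarray(effective_counts, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(p)))


vs_c = weight([100, 100])  # A and B each played C 100 days ago
vs_e = weight([1, 199])    # A played E 1 day ago, B 199 days ago

print(vs_c.round(3), weighted_balance(vs_c))
print(vs_e.round(3), weighted_balance(vs_e))
```

Against C both players have the same effective count, so the score stays at 1. Against E the effective counts are unequal, and the score drops to about 0.84 even though each player played exactly one match.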
Second, balanceness alone is clearly not enough. If A and B each played 1 match against D, both 100 years ago, and 1 match each against E, both 200 years ago, the two scenarios are equally "balanced". You also need a "recency" score (between 0 and 1); I think the average weight might work. You can then combine the two metrics in some way, e.g. B * R, (B * R)/(B + R), or alpha * B + (1 - alpha) * R.
import numpy as np
import pandas as pd

# (player, opponent, days_ago).
# Note: the question's table also lists an A-vs-C match 1 day ago; it is not
# included here, and the output shown after the code reflects that.
data = [
    ["A", "C", 2],
    ["A", "D", 10],
    ["A", "F", 100],
    ["A", "F", 101],
    ["A", "F", 102],
    ["A", "G", 1],
    ["B", "C", 1],
    ["B", "C", 2],
    ["B", "D", 10],
    ["B", "F", 100],
    ["B", "F", 101],
    ["B", "F", 102],
    ["B", "G", 1],
    ["B", "G", 2],
    ["B", "G", 3],
    ["B", "G", 4],
    ["B", "G", 5],
    ["B", "G", 6],
    ["B", "G", 7],
    ["B", "G", 8],
]


def weight(days_ago: int, epsilon: float = 0.005) -> float:
    # Exponential decay: a match played today has weight 1.
    return np.exp(-1 * days_ago * epsilon)


def weighted_balanceness(array: np.ndarray, weights: np.ndarray):
    classes = np.unique(array)
    # Effective (weighted) number of matches per class.
    cnt = np.array([weights[array == c].sum() for c in classes])
    m = len(classes)
    n = weights.sum()
    # Normalized Shannon entropy; undefined when only one class is present.
    H = -(cnt / n * np.log(cnt / n)).sum()
    return H / np.log(m)


df = pd.DataFrame(data=data, columns=["player", "opponent", "days_ago"])
df["effective_count"] = weight(df["days_ago"])

scores = []
for opponent in df["opponent"].unique():
    df_o = df.loc[df["opponent"] == opponent]
    # Encode the players as classes: A -> 0, B -> 1.
    player = np.where(df_o["player"].values == "A", 0, 1)
    balanceness = weighted_balanceness(array=player, weights=df_o["effective_count"])
    # Recency score: average weight of the matches against this opponent.
    recency = df_o["effective_count"].mean()
    scores.append([opponent, balanceness, recency])

df_out = pd.DataFrame(scores, columns=["opponent", "balanceness", "recency"])
df_out["br"] = df_out["balanceness"] * df_out["recency"]
df_out["mean_br"] = 0.5 * df_out["balanceness"] + 0.5 * df_out["recency"]
# B * R / (B + R) is half the harmonic mean; the factor 2 does not change the ranking.
df_out["harmonic_mean_br"] = (
    df_out["balanceness"] * df_out["recency"] / (df_out["balanceness"] + df_out["recency"])
)
print(df_out)
This gives me the following:
  opponent  balanceness   recency        br   mean_br  harmonic_mean_br
0        C     0.917739  0.991704  0.910125  0.954721          0.476644
1        D     1.000000  0.951229  0.951229  0.975615          0.487503
2        F     1.000000  0.603511  0.603511  0.801755          0.376368
3        G     0.508437  0.979726  0.498129  0.744082          0.334728
Note that D and F have perfect balanceness: A and B each played them the same number of times, the same number of days ago. However, the matches against F were played a while back (100-102 days ago), so F has a lower recency score, which hurts its combined scores.
Depending on how you combine B and R, D or C would most likely be the best choice (C may win if you give more weight to recency).
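To see how the choice depends on the weighting, here is a small sketch that applies alpha * B + (1 - alpha) * R to the balanceness and recency values printed above. The helper `best_opponent` and the alpha values tried are illustrative assumptions.

```python
import pandas as pd

# Balanceness (B) and recency (R) copied from the printed output above.
scores = pd.DataFrame(
    {
        "balanceness": [0.917739, 1.0, 1.0, 0.508437],
        "recency": [0.991704, 0.951229, 0.603511, 0.979726],
    },
    index=["C", "D", "F", "G"],
)


def best_opponent(alpha: float) -> str:
    """Opponent with the highest alpha * B + (1 - alpha) * R."""
    combined = alpha * scores["balanceness"] + (1 - alpha) * scores["recency"]
    return combined.idxmax()


# Small alpha favors recency, large alpha favors balanceness.
for alpha in (0.2, 0.5, 0.8):
    print(alpha, best_opponent(alpha))
```

With these numbers C comes out on top only when alpha is below roughly 0.33, that is, when recency gets about two thirds of the weight or more; otherwise D wins.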