python numpy counter

Measure balanceness of a weighted numpy array


I have player A and B who both played against different opponents.

player opponent days ago
A C 1
A C 2
A D 10
A F 100
A F 101
A F 102
A G 1
B C 1
B C 2
B D 10
B F 100
B F 101
B F 102
B G 1
B G 2
B G 3
B G 4
B G 5
B G 6
B G 7
B G 8

First, I want to find the opponent that is the most common one. My definition of "most common" is not the total number of matches but rather how evenly the matches are split between the two players. For example, if players 1 and 2 played 99 times and 1 time, respectively, against player 3, I would prefer player 4, against whom both played 49 times.

To measure this "balanceness" I wrote the following function:

import numpy as np
from collections import Counter


def balanceness(array: np.ndarray):
    classes = [(c, cnt) for c, cnt in Counter(array).items()]
    m = len(classes)
    n = len(array)

    H = -sum([(cnt / n) * np.log((cnt / n)) for c, cnt in classes])

    return H / np.log(m)

This function works as expected:

>>> balanceness(array=np.array([0, 0, 0, 1, 1, 1]))
1.0
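
One edge case worth noting: if the array contains only one class, `m` is 1 and `np.log(m)` is 0, so the function above computes 0 / 0 and returns `nan`. Below is a minimal sketch of a guarded variant. The name `safe_balanceness` and the convention that a single-class (or empty) array scores 0 are assumptions for illustration, not part of the original function.

```python
import numpy as np
from collections import Counter


def safe_balanceness(array: np.ndarray) -> float:
    """Normalized Shannon entropy of the class counts, in [0, 1]."""
    counts = np.array(list(Counter(array.tolist()).values()), dtype=float)
    m = len(counts)
    if m < 2:
        # Assumption: a single class (or no data) counts as fully unbalanced.
        return 0.0
    p = counts / counts.sum()
    H = -(p * np.log(p)).sum()
    return float(H / np.log(m))


print(safe_balanceness(np.array([0, 0, 0, 1, 1, 1])))  # evenly split classes
print(safe_balanceness(np.array([0, 0, 0, 0])))        # only one class present
```

Whether 0 is the right score for a one-sided opponent depends on the use case; returning `nan` and filtering such opponents out would be an equally reasonable choice.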

If I run the function on the different opponents I see the following results:

opponent balanceness n_matches
C 1 4
D 1 2
F 1 6
G 0.5032583347756457 9
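
For reference, the table above can be reproduced roughly as follows. This is a sketch: the `matches` list is typed in from the data table at the top of the question, and the grouping loop is just one of several ways to do it.

```python
import numpy as np
from collections import Counter

# (player, opponent, days_ago) rows copied from the table in the question
matches = [
    ("A", "C", 1), ("A", "C", 2), ("A", "D", 10),
    ("A", "F", 100), ("A", "F", 101), ("A", "F", 102), ("A", "G", 1),
    ("B", "C", 1), ("B", "C", 2), ("B", "D", 10),
    ("B", "F", 100), ("B", "F", 101), ("B", "F", 102),
] + [("B", "G", d) for d in range(1, 9)]


def balanceness(array: np.ndarray) -> float:
    # Normalized entropy of how the matches are split between the players
    counts = Counter(array.tolist())
    n = len(array)
    H = -sum((cnt / n) * np.log(cnt / n) for cnt in counts.values())
    return H / np.log(len(counts))


results = {}
for opponent in sorted({opp for _, opp, _ in matches}):
    # Which player took part in each match against this opponent
    players = np.array([p for p, opp, _ in matches if opp == opponent])
    results[opponent] = (balanceness(players), len(players))

for opponent, (score, n_matches) in results.items():
    print(opponent, score, n_matches)
```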

Clearly, opponent F is the most common one. However, the matches of A and B against F are relatively old.

How should I incorporate a recency-factor into my calculation to find the "most recent common opponent"?

Edit

After thinking more about it, I decided to weight each match using the following function:

def weight(days_ago: int, epilson: float=0.005) -> float:
    return np.exp(-1 * days_ago * epilson)

I then sum the weights of all matches against each opponent:

opponent balanceness n_matches weighted_n_matches
C 1 4 3.9701246258837
D 1 2 1.90245884900143
F 1 6 3.62106362790388
G 0.5032583347756457 9 8.81753570603108
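
The `weighted_n_matches` column can be recomputed with a short sketch like the one below. The dictionary `days_ago_by_opponent` is a hypothetical helper that simply regroups the `days ago` values from the data table by opponent (A's and B's matches together).

```python
import numpy as np


def weight(days_ago, epilson=0.005):
    # Same exponential decay as in the question (parameter name kept as spelled there)
    return np.exp(-1 * days_ago * epilson)


# days_ago of every match against each opponent, for A and B combined
days_ago_by_opponent = {
    "C": [1, 2, 1, 2],
    "D": [10, 10],
    "F": [100, 101, 102, 100, 101, 102],
    "G": [1, 1, 2, 3, 4, 5, 6, 7, 8],
}

weighted_n_matches = {
    opponent: float(weight(np.array(days)).sum())
    for opponent, days in days_ago_by_opponent.items()
}

for opponent, total in weighted_n_matches.items():
    print(opponent, total)
```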

Now, opponent C is the "most-recent balanced opponent".

Nevertheless, this method ignores recency at the player level because the weights are summed per opponent. There could be a scenario where player 1 recently played many matches against player 3, whereas player 2 faced player 3 only in the distant past.

How can we find the opponent that

  1. is the most balanced / equally distributed between the two players
  2. has the most recent matches against the two players
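
To make the problematic scenario concrete, here is a small sketch with made-up numbers (the `days_ago` values are hypothetical and not taken from the data above). Both players have the same raw match count against one opponent, so the count-based balanceness is 1, yet their weighted counts differ a lot.

```python
import numpy as np


def weight(days_ago, epilson=0.005):
    return np.exp(-1 * days_ago * epilson)


# Hypothetical matches against a single opponent: A played recently, B long ago
days_ago = {"A": np.array([1, 2, 3]), "B": np.array([300, 301, 302])}

raw_counts = {player: len(days) for player, days in days_ago.items()}
weighted_counts = {
    player: float(weight(days).sum()) for player, days in days_ago.items()
}

print(raw_counts)       # identical raw counts for A and B
print(weighted_counts)  # A's weighted count is several times larger than B's
```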

Solution

  • First, I think "balanceness" needs to take into account how many days ago the matches were played. For example, suppose A and B each played 1 match against C, both 100 days ago, and each played 1 match against E, 1 day and 199 days ago respectively. Although the number of matches is the same, the recency differs, so the two cases shouldn't get the same balanceness score.

    Using the weight(days_ago) function defined in the question, it is as if A and B each played about 0.61 matches against C, while they played about 0.995 and 0.37 matches against E respectively. These two scenarios should have different balanceness.
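
The arithmetic behind this example can be checked with a short sketch. `entropy_balance` is a hypothetical two-player helper written only for this illustration; it applies the same normalized entropy to the effective (weighted) match counts.

```python
import numpy as np


def weight(days_ago, epilson=0.005):
    return np.exp(-1 * days_ago * epilson)


def entropy_balance(w_a: float, w_b: float) -> float:
    """Normalized two-class entropy of the effective match counts."""
    p = np.array([w_a, w_b]) / (w_a + w_b)
    return float(-(p * np.log(p)).sum() / np.log(2))


vs_c = (weight(100), weight(100))  # A and B both played C 100 days ago
vs_e = (weight(1), weight(199))    # A played E 1 day ago, B 199 days ago

print("effective counts vs C:", vs_c, "balanceness:", entropy_balance(*vs_c))
print("effective counts vs E:", vs_e, "balanceness:", entropy_balance(*vs_e))
```

The score against C stays at 1 because both effective counts are equal, while the score against E drops below 1 because A's recent match carries more weight than B's old one.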

    Second, balanceness alone is not enough. If A and B played 1 match each against D, both 100 years ago, and 1 match each against E, both 200 years ago, both scenarios are equally "balanced". You need to define a "recency" score (between 0 and 1); I think the average weight might work. You can then combine the two metrics in some way, e.g. B * R, (B * R) / (B + R), or alpha * B + (1 - alpha) * R.

    import numpy as np
    import pandas as pd
    
    data = [
        ["A", "C", 2],
        ["A", "D", 10],
        ["A", "F", 100],
        ["A", "F", 101],
        ["A", "F", 102],
        ["A", "G", 1],
        ["B", "C", 1],
        ["B", "C", 2],
        ["B", "D", 10],
        ["B", "F", 100],
        ["B", "F", 101],
        ["B", "F", 102],
        ["B", "G", 1],
        ["B", "G", 2],
        ["B", "G", 3],
        ["B", "G", 4],
        ["B", "G", 5],
        ["B", "G", 6],
        ["B", "G", 7],
        ["B", "G", 8]
    ]
    
    def weight(days_ago: int, epilson: float=0.005) -> float:
        return np.exp(-1 * days_ago * epilson)
    
    def weighted_balanceness(array: np.ndarray, weights: np.ndarray):
        classes = np.unique(array)
        cnt = np.array([weights[array == c].sum() for c in classes])
        m = len(classes)
        n = weights.sum()
    
        H = -(cnt / n * np.log(cnt / n)).sum() 
        return H / np.log(m)
    
    
    df = pd.DataFrame(data=data, columns=["player", "opponent", "days_ago"])
    df["effective_count"] = weight(df["days_ago"])
    
    scores = []
    for opponent in df["opponent"].unique():
        df_o = df.loc[df["opponent"] == opponent]
        player = np.where(df_o["player"].values == "A", 0, 1)
        balanceness = weighted_balanceness(array=player, weights=df_o["effective_count"])
    
        recency = df_o["effective_count"].mean()
        scores.append([opponent, balanceness, recency])
    
    
    df_out = pd.DataFrame(scores, columns=["opponent", "balanceness", "recency"])
    df_out["br"] = df_out["balanceness"] * df_out["recency"]
    df_out["mean_br"] = 0.5 * df_out["balanceness"] + 0.5 * df_out["recency"]
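    # Note: B * R / (B + R) is half the harmonic mean of B and R; the constant factor does not affect the ranking
    df_out["harmonic_mean_br"] = df_out["balanceness"] * df_out["recency"] / (df_out["balanceness"] + df_out["recency"])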
    
    print(df_out)
    

    This gives me the following:

      opponent  balanceness   recency        br   mean_br  harmonic_mean_br
    0        C     0.917739  0.991704  0.910125  0.954721          0.476644
    1        D     1.000000  0.951229  0.951229  0.975615          0.487503
    2        F     1.000000  0.603511  0.603511  0.801755          0.376368
    3        G     0.508437  0.979726  0.498129  0.744082          0.334728
    

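    Note that D and F have perfect balanceness: A and B each played them the same number of times, the same number of days ago. However, the matches against F were played a while back (100-102 days ago), so F has a lower recency score, which hurts its combined scores. C scores below 1 here only because the data list above omits the A vs C match from 1 day ago that appears in the question; with that row included, A and B would have identical records against C.

    Depending on how you combine B and R, D or C will most likely be the best choice (C may win if you give more weight to recency).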
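
To see how the choice shifts with the weighting, here is a rough sketch that sweeps alpha in alpha * B + (1 - alpha) * R. The `scores` values are copied from the printed table, and `best_opponent` is a hypothetical helper introduced just for this illustration.

```python
# (balanceness, recency) per opponent, copied from the table printed above
scores = {
    "C": (0.917739, 0.991704),
    "D": (1.000000, 0.951229),
    "F": (1.000000, 0.603511),
    "G": (0.508437, 0.979726),
}


def best_opponent(alpha: float) -> str:
    """Opponent with the highest alpha * B + (1 - alpha) * R."""
    return max(
        scores,
        key=lambda o: alpha * scores[o][0] + (1 - alpha) * scores[o][1],
    )


# Small alpha favours recency, large alpha favours balanceness
for alpha in (0.1, 0.3, 0.5, 0.9):
    print(alpha, best_opponent(alpha))
```

With these numbers C comes out on top only when alpha is below roughly one third; otherwise D wins.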