pythonpandasdataframenumpyrecord-linkage

Resolving conflicts in Pandas dataframe


I am performing record linkage on a dataframe such as:

ID_1     ID_2    Predicted Link     Probability
   1        0                 1             0.9
   1        1                 1             0.5
   1        2                 0               0
   2        1                 1             0.8
   2        5                 1             0.8
   3        1                 0               0
   3        2                 1             0.5

When my model overpredicts and links the same ID_1 to more than one ID_2 (indicated by a 1 in Predicted Link) I want to resolve the conflicts based on the Probability-value. If one predicted link has a higher probability than the other I want to keep a 1 for that, but reverse the other prediction link values for that ID_1 to 0. If the (highest) probabilities are of equal value I want to reverse all the predicted link values to 0. If only one predicted link then the predicted values should be left as they are.

The resulting dataframe would look like this:

ID_1     ID_2    Predicted Link     Probability
   1        0                 1             0.9
   1        1                 0             0.5
   1        2                 0               0
   2        1                 0             0.8
   2        5                 0             0.8
   3        1                 0               0
   3        2                 1             0.5

I am grouping via pandas.groupby, and tried some variations with numpy.select and numpy.where, but without luck. Any help much appreciated!


Solution

  • For each ID_1, you want to keep one and only one row. Thus, grouping is a good start.

    First let's construct our data :

    import pandas as pd
    from io import StringIO
    
    csvfile = StringIO(
    """ID_1\tID_2\tPredicted Link\tProbability
    1\t0\t1\t0.9
    1\t1\t1\t0.5
    1\t2\t0\t0
    2\t1\t1\t0.8
    2\t5\t1\t0.8
    3\t1\t0\t0
    3\t2\t1\t0.5""")
    
    df = pd.read_csv(csvfile, sep = '\t', engine='python')
    

    We want to a group for each value of ID_1 and then looking for the row holding the max value of Probability for that said value of ID_1. Let's create a mask :

    
    max_proba = df.groupby("ID_1")["Probability"].transform(lambda x : x.eq(x.max()))
    
    max_proba
    Out[196]: 
    0     True
    1    False
    2    False
    3     True
    4     True
    5    False
    6     True
    Name: Probability, dtype: bool
    

    Considering your rules, rows 0, 1, 2 and rows 5, 6 are valid (only one max for that ID_1 value), but not the 3 and 4 rows. Let's build a mask that consider these two conditions, True if max value and if only one max value.

    To be more accurate, for each ID_1, if a Probablity value is duplicated then it can't be a candidate for the said max. We will then build a max that exclude duplicates Probability value for each ID_1 value

    mask_unique = df.groupby(["ID_1", "Probability"])["Probability"].transform(lambda x : len(x) == 1)
    
    mask_unique
    Out[284]: 
    0     True
    1     True
    2     True
    3    False
    4    False
    5     True
    6     True
    Name: Probability, dtype: bool
    

    Finally, let's combine our two masks :

    df.loc[:, "Predicted Link"] = 1 * (mask_max_proba & mask_unique)
    
    df
    Out[285]: 
       ID_1  ID_2  Predicted Link  Probability
    0     1     0               1          0.9
    1     1     1               0          0.5
    2     1     2               0          0.0
    3     2     1               0          0.8
    4     2     5               0          0.8
    5     3     1               0          0.0
    6     3     2               1          0.5