python pandas dataframe heatmap cohen-kappa

Pairwise Cohen's kappa of values in two dataframes


I have two dataframes that look like the toy examples below:

import pandas as pd

data1 = {'subject': ['A', 'B', 'C', 'D'],
         'group': ['red', 'red', 'blue', 'blue'],
         'lists': [[0, 1, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]]}

data2 = {'subject': ['a', 'b', 'c', 'd'],
         'group': ['red', 'red', 'blue', 'blue'],
         'lists': [[0, 1, 0], [1, 1, 0], [1, 0, 1], [1, 1, 0]]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

I would like to calculate the Cohen's kappa score for each pair of subjects. For example, I would like to score subject "A" in df1 against subjects "a", "b", and "c" in df2, and so on. Like this:

from sklearn.metrics import cohen_kappa_score
cohen_kappa_score(df1['lists'][0], df2['lists'][0])
cohen_kappa_score(df1['lists'][0], df2['lists'][1])
cohen_kappa_score(df1['lists'][0], df2['lists'][2])
...

Importantly, I would like to represent these pairwise Cohen's kappa scores in a new dataframe where both the columns and rows are all the subjects ("A", "B", "C", "D", "a", "b", "c", "d"), so that I can see whether the scores are more consistent between dataframes or within dataframes. I will eventually convert this dataframe into a heatmap organized by "group".

This post for a similar R problem looks promising, but I don't know how to implement it in Python. Similarly, I have not yet figured out how to adapt this Python solution, which appears similar enough.


Solution

  • Use pd.concat to stack the two dataframes, then scipy's pdist with cohen_kappa_score as the pairwise metric:

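    import numpy as np
    import pandas as pd
    from scipy.spatial.distance import pdist, squareform
    from sklearn.metrics import cohen_kappa_score
    
    # Stack both dataframes and index each list by (subject, group)
    s = (pd.concat([df1, df2])
           .set_index(['subject', 'group'])['lists']
         )
    
    # pdist calls cohen_kappa_score once per pair of rows;
    # squareform expands the condensed result into a symmetric matrix
    out = pd.DataFrame(squareform(pdist(np.vstack(s.to_list()),
                                        cohen_kappa_score)),
                       index=s.index, columns=s.index)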
    
    print(out)
    

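    Output (the diagonal is 0.0 only because squareform fills it with zeros; pdist never compares a row with itself):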

    subject          A    B    C    D    a    b    c    d
    group          red  red blue blue  red  red blue blue
    subject group                                        
    A       red    0.0  0.0  0.0  0.4  0.4 -0.5 -0.5 -0.5
    B       red    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
    C       blue   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
    D       blue   0.4  0.0  0.0  0.0  1.0  0.4 -0.8  0.4
    a       red    0.4  0.0  0.0  1.0  0.0  0.4 -0.8  0.4
    b       red   -0.5  0.0  0.0  0.4  0.4  0.0 -0.5  1.0
    c       blue  -0.5  0.0  0.0 -0.8 -0.8 -0.5  0.0 -0.5
    d       blue  -0.5  0.0  0.0  0.4  0.4  1.0 -0.5  0.0
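
Two follow-up steps from the question are not covered by the output above: the zero diagonal is easy to misread as "no agreement", and the matrix is not yet organized by group. The sketch below is one possible way to handle both, not part of the original answer. It rebuilds the toy data so it runs on its own, marks self-comparisons as NaN (an assumption: kappa of a list with itself is 1 when the list uses more than one label, but undefined for constant lists such as B and C, so NaN avoids picking a value), reorders rows and columns by group, and slices out the df1-versus-df2 block. The names `kappa`, `order`, `ordered`, and `between` are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import cohen_kappa_score

# Toy data copied from the question
df1 = pd.DataFrame({'subject': ['A', 'B', 'C', 'D'],
                    'group': ['red', 'red', 'blue', 'blue'],
                    'lists': [[0, 1, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]]})
df2 = pd.DataFrame({'subject': ['a', 'b', 'c', 'd'],
                    'group': ['red', 'red', 'blue', 'blue'],
                    'lists': [[0, 1, 0], [1, 1, 0], [1, 0, 1], [1, 1, 0]]})

s = pd.concat([df1, df2]).set_index(['subject', 'group'])['lists']
kappa = squareform(pdist(np.vstack(s.to_list()), cohen_kappa_score))

# Assumption: treat self-comparisons as missing instead of the 0.0
# that squareform writes on the diagonal
np.fill_diagonal(kappa, np.nan)
out = pd.DataFrame(kappa, index=s.index, columns=s.index)

# Reorder rows and columns by group; a stable sort keeps the
# original subject order inside each group
groups = out.index.get_level_values('group').to_numpy()
order = np.argsort(groups, kind='stable')
ordered = out.iloc[order, order]

# Between-dataframe block: df1 subjects as rows, df2 subjects as columns
n1 = len(df1)
between = out.iloc[:n1, n1:]

print(ordered)
print(between)
```

`ordered` should then be usable as input to a plotting function such as `seaborn.heatmap`, and comparing `between` with the two within-dataframe blocks (`out.iloc[:n1, :n1]` and `out.iloc[n1:, n1:]`) addresses the between-versus-within question.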