Tags: python, scikit-learn, cohen-kappa

Is that Cohen Kappa score correct?


Is it correct that cohen_kappa_score outputs 0.0 when only 2% of the labels are not in agreement?

from sklearn.metrics import cohen_kappa_score

y1 = 100 * [1]   # rater 1: all labels are 1
y2 = 100 * [1]   # rater 2: starts out identical
y2[0] = 0        # flip two labels, i.e. 2% disagreement
y2[1] = 0

cohen_kappa_score(y1, y2)
# 0.0

Or did I miss something?


Solution

  • The calculation is correct. This is an unfortunate downside of this agreement metric: whenever one rater assigns a single class 100% of the time, the expected chance agreement equals the observed agreement, so kappa comes out to zero no matter how small the disagreement is. If you have a few minutes, I encourage you to work through the calculation yourself using the example on Wikipedia (see also the sketch after this answer).

    As this paper's abstract puts it,

    A limitation of kappa is that it is affected by the prevalence of the finding under observation.

    The full text describes the problem more fully with an example and concludes that

    ...kappa may not be reliable for rare observations. Kappa is affected by prevalence of the finding under consideration much like predictive values are affected by the prevalence of the disease under consideration. For rare findings, very low values of kappa may not necessarily reflect low rates of overall agreement.

    Another useful reference is "Interrater reliability: the kappa statistic", which advocates using both percent agreement and Cohen's kappa for a fuller picture of agreement.
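
    To make the arithmetic concrete, here is a minimal sketch that reproduces the 0.0 by hand. It uses only the y1/y2 from the question; the variable names p_observed and p_expected are mine.

    from sklearn.metrics import cohen_kappa_score

    y1 = 100 * [1]
    y2 = 100 * [1]
    y2[0] = 0
    y2[1] = 0

    # Observed agreement: fraction of labels on which the two raters agree.
    p_observed = sum(a == b for a, b in zip(y1, y2)) / len(y1)   # 0.98

    # Expected (chance) agreement: for each class, the product of the
    # marginal probabilities with which each rater assigns that class.
    classes = set(y1) | set(y2)
    p_expected = sum(
        (y1.count(c) / len(y1)) * (y2.count(c) / len(y2)) for c in classes
    )   # 1.0 * 0.98 + 0.0 * 0.02 = 0.98

    kappa = (p_observed - p_expected) / (1 - p_expected)   # (0.98 - 0.98) / 0.02

    print(kappa)                       # 0.0
    print(cohen_kappa_score(y1, y2))   # 0.0

    Because the rater that always says 1 makes the chance-agreement term equal the observed agreement, the numerator is zero regardless of how few labels actually differ.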