I have a multiclass problem, where 0 is my negative class and 1 and 2 are positive. Check the following code:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# Ground truth and predicted labels
y_true = np.array((1, 2, 2, 0, 1, 0))
y_pred = np.array((1, 0, 0, 0, 0, 1))

# Averaged metrics
precision_macro = precision_score(y_true, y_pred, average='macro')
precision_weighted = precision_score(y_true, y_pred, average='weighted')
recall_macro = recall_score(y_true, y_pred, average='macro')
recall_weighted = recall_score(y_true, y_pred, average='weighted')
f1_macro = f1_score(y_true, y_pred, average='macro')
f1_weighted = f1_score(y_true, y_pred, average='weighted')

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
The metrics calculated with scikit-learn in this case are the following:
precision_macro = 0.25
precision_weighted = 0.25
recall_macro = 0.33333
recall_weighted = 0.33333
f1_macro = 0.27778
f1_weighted = 0.27778
And this is the confusion matrix (rows are the true classes 0, 1, 2; columns the predicted classes):

[[1 1 0]
 [1 1 0]
 [2 0 0]]
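For reference, the per-class numbers behind those averages can also be printed directly; precision_recall_fscore_support with average=None returns one value per class (zero_division=0 just silences the warning for class 2, which is never predicted):

from sklearn.metrics import precision_recall_fscore_support

# Per-class precision, recall, F1, and support, in label order 0, 1, 2
p, r, f, s = precision_recall_fscore_support(y_true, y_pred, average=None, zero_division=0)
print(p)  # [0.25 0.5  0.  ]
print(r)  # [0.5  0.5  0.  ]
print(f)  # [0.33333333 0.5        0.        ]
print(s)  # [2 2 2]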
Are the macro and weighted averages the same because I have the same number of samples for each class? This is what I did manually:
1 - Precision = TP/(TP+FP). So for classes 1 and 2, we get:
Precision1 = TP1/(TP1+FP1) = 1/(1+1) = 0.5
Precision2 = TP2/(TP2+FP2) = 0/(0+0) = 0 (this returns 0 according to the scikit-learn documentation)
Precision_Macro = (Precision1 + Precision2)/2 = 0.25
Precision_Weighted = (2*Precision1 + 2*Precision2)/4 = 0.25
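These by-hand precisions can be double-checked from the confusion matrix: TP for each class sits on the diagonal, and TP+FP is the column sum. A small numpy sketch, reusing cm from the code above and mimicking scikit-learn's behaviour of returning 0 for the 0/0 case of class 2:

tp = np.diag(cm)            # [1 1 0]
predicted = cm.sum(axis=0)  # TP + FP per class: [4 2 0]
# Guard the 0/0 case for class 2, which is never predicted
precision_per_class = np.divide(tp, predicted, out=np.zeros(3), where=predicted > 0)
print(precision_per_class)  # [0.25 0.5  0.  ]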
2 - Recall = TP/(TP+FN). So for classes 1 and 2, we get:
Recall1 = TP1/(TP1+FN1) = 1/(1+1) = 0.5
Recall2 = TP2/(TP2+FN2) = 0/(0+2) = 0
Recall_Macro = (Recall1 + Recall2)/2 = (0.5+0)/2 = 0.25
Recall_Weighted = (2*Recall1 + 2*Recall2)/4 = (2*0.5+2*0)/4 = 0.25
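The analogous check for recall uses row sums instead (TP+FN is the number of true samples per class), again assuming the cm computed above:

tp = np.diag(cm)         # [1 1 0]
actual = cm.sum(axis=1)  # TP + FN per class: [2 2 2]
recall_per_class = tp / actual  # no 0/0 here: every class has true samples
print(recall_per_class)  # [0.5 0.5 0. ]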
3 - F1 = 2*(Precision*Recall)/(Precision+Recall)
F1_Macro = 2*(Precision_Macro*Recall_Macro)/(Precision_Macro+Recall_Macro) = 0.25
F1_Weighted = 2*(Precision_Weighted*Recall_Weighted)/(Precision_Weighted+Recall_Weighted) = 0.25
So, the precision score is the same as scikit-learn's, but recall and F1 are different. What did I do wrong here? Even if you use the values of precision and recall from scikit-learn (i.e., 0.25 and 0.33333), you can't get the 0.27778 F1 score.
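To make the mismatch concrete: the harmonic mean of the macro-averaged precision and recall reported above comes out at about 0.28571, not 0.27778 (a quick check reusing the variables from the code in the question):

harmonic_mean = 2 * precision_macro * recall_macro / (precision_macro + recall_macro)
print(harmonic_mean)  # 0.2857..., i.e. 2/7
print(f1_macro)       # 0.2777..., i.e. 5/18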
For the averaged scores, you also need the score for class 0. The precision of class 0 is 1/4 (so the average doesn't change). The recall of class 0 is 1/2, so the average recall is (1/2 + 1/2 + 0)/3 = 1/3.
The average F1 score is not the harmonic mean of the average precision and recall; rather, it is the average of the F1 scores for each class. Here, the F1 for class 0 is 1/3, for class 1 it is 1/2, and for class 2 it is undefined but taken to be 0, for an average of 5/18 ≈ 0.27778.
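This is easy to confirm in code: ask for the unaveraged per-class F1 scores and take their mean yourself (same y_true and y_pred as in the question; zero_division=0 silences the warning for class 2):

import numpy as np
from sklearn.metrics import f1_score

y_true = np.array((1, 2, 2, 0, 1, 0))
y_pred = np.array((1, 0, 0, 0, 0, 1))

# Per-class F1 for classes 0, 1, 2: [1/3, 1/2, 0]
per_class_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
print(per_class_f1)         # [0.33333333 0.5        0.        ]
print(per_class_f1.mean())  # 0.2777..., i.e. 5/18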