python · machine-learning · scikit-learn · classification · evaluation

Why do the sensitivity (recall) values differ between classification_report and precision_recall_fscore_support in a loop?


I am working with a synthetic dataset generated using make_classification from sklearn.datasets with 5 classes. I have trained a RandomForestClassifier on this data and am evaluating its performance using two different methods. However, I am observing differences in the sensitivity (recall) values between these two methods.

Here is the code I am using:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, precision_recall_fscore_support
import numpy as np
import pandas as pd

# Generate a synthetic dataset with 5 classes
X, y = make_classification(n_samples=1000, n_classes=5, n_informative=10, n_clusters_per_class=1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Method 1: classification_report
print("Classification Report")
print(classification_report(y_test, y_pred))

# Method 2: Loop with precision_recall_fscore_support
res = []

for l in range(5):
    prec, recall, _, _ = precision_recall_fscore_support(np.array(y_test) == l,
                                                         np.array(y_pred) == l,
                                                         pos_label=True, average=None)
    res.append([l, recall[0], recall[1]])

df = pd.DataFrame(res, columns=['class', 'sensitivity', 'specificity'])
print("\nSensitivity and Specificity")
print(df)

Outputs:

Classification Report
              precision    recall  f1-score   support

           0       0.76      0.71      0.74        35
           1       0.72      0.93      0.81        30
           2       0.72      0.81      0.76        32
           3       0.85      0.86      0.86        59
           4       0.88      0.64      0.74        44

    accuracy                           0.79       200
   macro avg       0.78      0.79      0.78       200
weighted avg       0.80      0.79      0.79       200


Sensitivity and Specificity
   class  sensitivity  specificity
0      0     0.951515     0.714286
1      1     0.935294     0.933333
2      2     0.940476     0.812500
3      3     0.936170     0.864407
4      4     0.974359     0.636364

Question:

Why do the sensitivity (recall) values differ between classification_report and the loop using precision_recall_fscore_support? Specifically, why don't the recall values reported by classification_report match the sensitivity values calculated in the loop? If possible, could you show it with a simple example (solved manually)?

What did you try and what were you expecting?

I used two methods to evaluate the performance of my RandomForestClassifier. First, I used classification_report to get precision, recall, and F1-score for each class. Then, I calculated sensitivity and specificity for each class using a loop with precision_recall_fscore_support.

I expected the sensitivity values calculated in the loop method to match the recall values from the classification_report, as sensitivity and recall are often considered synonymous in classification tasks. However, I observed discrepancies between the two sets of values.

What actually resulted?

The recall values from classification_report are different from the sensitivity values calculated in the loop. classification_report gives the per-class recall in the multiclass setting, while the loop treats each class as a one-vs-rest binary problem, and the sensitivity and specificity values it produces do not line up with the report.
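
As a quick probe of what the loop's call actually returns for one class (this reuses y_test, y_pred and the imports from the code above), with average=None the metrics come back as one array entry per label of the binarized problem, in sorted label order:

# Quick check of the loop's call for a single class.
# (Reuses y_test, y_pred and the imports from the snippet above.)
l = 0
prec, recall, _, support = precision_recall_fscore_support(np.array(y_test) == l,
                                                           np.array(y_pred) == l,
                                                           average=None)
# With average=None there is one entry per label, in sorted order,
# i.e. [False, True] for this binarized problem.
print("label order:      ", np.unique(np.array(y_test) == l))
print("recall per label:  ", recall)
print("support per label: ", support)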


Solution

  • You're unpacking the results in the loop incorrectly. With average=None, precision_recall_fscore_support returns one value per label of the binarized problem, in sorted label order ([False, True]), so your recall is a pair:

    1. recall[0] is the recall of the "negative" class of this iteration's one-vs-rest split, i.e. of all samples that are not class l. That is the specificity for class l; it has no counterpart among the recall values in the classification report.

    2. recall[1] is the recall of the positive class, which really is the recall (sensitivity) for class l in the multiclass sense.

    You've then attached the wrong labels in your dataframe: the column you called specificity is really the sensitivity/recall (and indeed those values match the classification report), and the column you called sensitivity is really the specificity. The two columns are simply swapped.
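
    To make this concrete (as requested), here is a small, hand-checkable sketch — a hypothetical three-class toy problem, not taken from the question — where the counts are easy to verify by hand, followed by one way to rewrite the loop so the values land in the intended columns:

    import numpy as np
    from sklearn.metrics import classification_report, precision_recall_fscore_support

    # Toy data: 6 samples, 3 classes.  For class 0, counted by hand:
    #   actual class-0 samples: indices 0 and 1 -> one predicted 0, one predicted 2
    #     sensitivity/recall for class 0 = 1/2 = 0.50
    #   actual non-class-0 samples: indices 2-5 -> only index 5 is (wrongly) predicted 0
    #     specificity for class 0 = 3/4 = 0.75
    y_true = np.array([0, 0, 1, 1, 2, 2])
    y_pred = np.array([0, 2, 1, 1, 2, 0])

    print(classification_report(y_true, y_pred))  # recall for class 0 is 0.50

    l = 0
    _, recall, _, _ = precision_recall_fscore_support(y_true == l, y_pred == l,
                                                      average=None)
    # average=None gives one value per label in sorted order, i.e. [False, True]:
    print(recall)  # [0.75 0.5 ] -> recall[0] = specificity, recall[1] = sensitivity

    # One way to fix the original loop: same call, but assign the values to the
    # intended columns (recall[1] is the sensitivity, recall[0] the specificity).
    res = []
    for l in range(3):
        _, recall, _, _ = precision_recall_fscore_support(y_true == l, y_pred == l,
                                                          average=None)
        res.append([l, recall[1], recall[0]])  # [class, sensitivity, specificity]
    print(res)

    With the same swap applied to your original loop (res.append([l, recall[1], recall[0]])), the sensitivity column will match the recall column of classification_report exactly.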