I am working with a synthetic dataset generated using make_classification from sklearn.datasets, with 5 classes. I have trained a RandomForestClassifier on this data and am evaluating its performance using two different methods. However, I am observing differences in the sensitivity (recall) values between the two methods.
Here is the code I am using:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, precision_recall_fscore_support
import numpy as np
import pandas as pd
# Generate a synthetic dataset with 5 classes
X, y = make_classification(n_samples=1000, n_classes=5, n_informative=10, n_clusters_per_class=1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Method 1: classification_report
print("Classification Report")
print(classification_report(y_test, y_pred))
# Method 2: Loop with precision_recall_fscore_support
res = []
for l in range(5):
    prec, recall, _, _ = precision_recall_fscore_support(np.array(y_test) == l,
                                                         np.array(y_pred) == l,
                                                         pos_label=True, average=None)
    res.append([l, recall[0], recall[1]])
df = pd.DataFrame(res, columns=['class', 'sensitivity', 'specificity'])
print("\nSensitivity and Specificity")
print(df)
Running this prints:

Classification Report
              precision    recall  f1-score   support

           0       0.76      0.71      0.74        35
           1       0.72      0.93      0.81        30
           2       0.72      0.81      0.76        32
           3       0.85      0.86      0.86        59
           4       0.88      0.64      0.74        44

    accuracy                           0.79       200
   macro avg       0.78      0.79      0.78       200
weighted avg       0.80      0.79      0.79       200
Sensitivity and Specificity
   class  sensitivity  specificity
0      0     0.951515     0.714286
1      1     0.935294     0.933333
2      2     0.940476     0.812500
3      3     0.936170     0.864407
4      4     0.974359     0.636364
Why do the sensitivity (recall) values differ between classification_report and the loop using precision_recall_fscore_support? Specifically, why is there a discrepancy between the recall values reported by classification_report and the sensitivity values calculated in the loop? If possible, could you show it with a simple example (solved manually)?
I used two methods to evaluate the performance of my RandomForestClassifier. First, I used classification_report to get precision, recall, and F1-score for each class. Then, I calculated sensitivity and specificity for each class using a loop with precision_recall_fscore_support.

I expected the sensitivity values calculated in the loop to match the recall values from classification_report, as sensitivity and recall are synonymous in classification tasks. However, I observed discrepancies between the two sets of values.
The recall values from classification_report differ from the sensitivity values calculated in the loop. classification_report provides the recall of each class in the multiclass sense, while the loop treats each class as a binary one-vs-rest problem, which is what I assumed was producing the different sensitivity and specificity values.
You're incorrectly unpacking the results in the loop. With average=None, the recall you get back is a pair of values, one per label of the binarized problem, reported in sorted order [False, True]:

- recall[0] is the recall of the "negative" class of this loop iteration's one-vs-rest problem, i.e. of all the samples that are not class l, pooled together. This is not the recall of any single class; it is in fact the specificity for class l, TN / (TN + FP).
- recall[1] is the recall of the positive class, which really is the recall for class l in the multiclass sense.
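Here is the simple example you asked for, solved manually. The toy labels below are made up purely for illustration:

import numpy as np
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Toy data, made up for illustration: 3 classes, 6 samples
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# By hand, one-vs-rest for class 0:
#   TP = 1 (sample 1),  FN = 1 (sample 2)        -> recall/sensitivity = 1/2 = 0.50
#   FP = 1 (sample 6),  TN = 3 (samples 3, 4, 5) -> specificity        = 3/4 = 0.75
print(classification_report(y_true, y_pred))  # recall for class 0: 0.50

# Binarized for class 0; average=None reports the labels in sorted order [False, True]
_, rec, _, _ = precision_recall_fscore_support(y_true == 0, y_pred == 0, average=None)
print(rec)  # [0.75 0.5] -> rec[0] is the specificity, rec[1] is the recall

So rec[1], not rec[0], is the value that matches classification_report.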
But then you've labeled them backwards in your dataframe: the column you called specificity really holds the sensitivity/recall (and indeed those values match the classification report), while the column you called sensitivity holds the specificity.
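One way to fix it is to unpack the pair in the right order; note that pos_label has no effect when average=None, so it is dropped here:

res = []
for l in range(5):
    # average=None returns one value per label, in sorted order [False, True]:
    # index 1 is class l itself, index 0 is the pooled "rest"
    _, recall, _, _ = precision_recall_fscore_support(np.array(y_test) == l,
                                                      np.array(y_pred) == l,
                                                      average=None)
    res.append([l, recall[1], recall[0]])  # sensitivity first, then specificity
df = pd.DataFrame(res, columns=['class', 'sensitivity', 'specificity'])

With that change, the sensitivity column lines up with the recall column of classification_report.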