scikit-learn classification metrics average-precision

Scikit Learn: Skewed Average Precision Report


I'm using scikit-learn to perform binary classification, but the labels are not evenly distributed across the dataset. When the class I want to predict is the minority class, I have some concerns about the average precision metric computed by metrics.average_precision_score. When I run the experiments and print a classification report, the overall precision looks good, but that is clearly because the model does well on the majority class. The report looks something like this:

                     precision    recall    f1-score    support
label of interest    0.24         0.67      0.35        30
non-label            0.97         0.81      0.88        300

The average precision is then reported as roughly 0.9752. That score is clearly being computed with respect to the majority class, which isn't the class I'm interested in identifying. Is there a way to make metrics.average_precision_score report the metric with respect to the minority class of interest? Any insight would be greatly appreciated. Thanks for reading.


Solution

  • Figured out a solution after much tinkering. I'd been using the preprocessing tool LabelEncoder() to automatically encode the labels for the training and test sets. I'm performing binary classification, so the labels just need an encoding of 0 or 1. However, LabelEncoder assigns codes by the sorted order of the label values, not by which class I care about, and in my case that encoded the majority class as 1 and the minority class as 0. Since average_precision_score treats the label 1 as the positive class by default, the reported average precision described the majority class rather than the minority class I was actually trying to predict.

    This led me to ask another question here about "flipping" the 0 and 1 values in my label array and, lo and behold, it works. The bottom line: be intentional about the encoding, so that the class I'm interested in predicting is always encoded as 1 and the other class as 0.
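For anyone who wants to see the effect, here is a minimal sketch of the fix. The label names ("interest", "other") and the scores are made up for illustration and are not from my actual data. In practice the scores would be the model's predicted probability for the class encoded as 1.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels: "interest" is the minority class, "other" the majority.
labels = np.array(["other"] * 6 + ["interest"] * 2)

# LabelEncoder assigns codes by sorted label order, so here
# "interest" -> 0 and "other" -> 1 (the majority class ends up as 1).
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# The fix: build the target explicitly so the class of interest is 1.
y_true = (labels == "interest").astype(int)

# Hypothetical model scores for the class of interest (higher = more likely).
p_interest = np.array([0.1, 0.2, 0.3, 0.6, 0.2, 0.1, 0.7, 0.4])

# Average precision with respect to the minority class.
ap_minority = average_precision_score(y_true, p_interest)

# What I was accidentally measuring: majority class encoded as 1,
# scored with the probability of the majority class.
ap_majority = average_precision_score(encoded, 1 - p_interest)

# Alternative that keeps the LabelEncoder output: tell the metric
# which encoded value is the positive class via pos_label.
ap_pos_label = average_precision_score(encoded, p_interest, pos_label=0)

print(f"minority AP: {ap_minority:.4f}")
print(f"majority AP: {ap_majority:.4f}")
print(f"minority AP via pos_label: {ap_pos_label:.4f}")
```

On this toy data the majority-class number comes out noticeably higher than the minority-class one, which is the same pattern I saw in my report. Passing pos_label is an option if re-encoding the labels is inconvenient, assuming a scikit-learn version whose average_precision_score accepts that parameter.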