r, machine-learning, statistics, roc, tidymodels

How do I calculate the Area Under the ROC Curve in practice (using the yardstick package in R)?


I am really struggling to understand why I can't get this working, as it must surely be very simple. I'd really appreciate some insight.

I am using R 4.2.2 on an M1 MacBook Air.

I am a physician and have trained a Gaussian Naive Bayes model that (imperfectly) predicts the likelihood of a patient dying given their response to a certain method of physically positioning them when their lungs are not working properly.

NB: all of this research is ethically approved, it won't be used in patient treatment, and I don't have a statistician on this project.

I will focus on the training dataset, to which a column of predictions has been added. This df is called train_pre_post. There are 28 columns, and the ones of interest are outcome, which contains the true outcome for the patient, and prediction, which contains the predicted outcome. Both of these are factors with two levels: rip, meaning the patient died, and dc, meaning they survived until they were discharged. The other columns contain continuous physiological variables (e.g. oxygen levels, blood pressure, etc.).
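
To make the structure concrete, here is a toy stand-in for just those two columns (made-up values; the real frame has 26 further columns):

library(tibble)

# Made-up rows with the same two factor columns and levels as train_pre_post
toy_structure <- tibble(
  outcome    = factor(c("rip", "dc", "dc", "rip", "dc"), levels = c("rip", "dc")),
  prediction = factor(c("rip", "dc", "rip", "dc", "dc"), levels = c("rip", "dc"))
)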

I am trying to calculate the area under the receiver operating characteristic curve using the yardstick package. The command is roc_auc(). I am struggling with part of this command:

roc_auc(data = train_pre_post, truth = outcome, ?????)

Where I have typed ?????, I do not understand what to put. For simple measures of accuracy (such as f_meas()), part of the call involves designating the column containing the truth and the column containing the estimate. When I try to do this in roc_auc(), I get the following error message:

Error in `dplyr::summarise()`:
ℹ In argument: `.estimate = metric_fn(...)`.
Caused by error in `validate_class()`:
! `estimate` should be a numeric but a factor was supplied

The yardstick documentation states: "Binary metrics using class probabilities take a factor truth column, and a single class probability column containing the probabilities of the event of interest." Despite playing around, I can't make this work, and in all honesty I'm not sure I understand what that means.
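
My best guess is that it wants a numeric column of predicted probabilities for one class rather than my factor prediction column, i.e. a call shaped something like the following, where .pred_rip is a guess at the name of such a column (my data frame does not currently contain one):

# .pred_rip = hypothetical column of predicted probabilities of death
roc_auc(data = train_pre_post, truth = outcome, .pred_rip)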

Can anyone shed light on this or point me in the right direction?


Solution

  • Here's some intuition for what AUCROC is, why it is useful, why you don't really need a library to compute it for you, and how it can be applied to a classifier that predicts a binary outcome.

    Here's how I would calculate the AUCROC for a binary classifier. First you make a two-dimensional plot with the true positive rate from 0% to 100% on the vertical axis and the false positive rate from 0% to 100% on the horizontal axis. You can draw a diagonal line from the lower left to the upper right as a visual aid for what comes next.

    Take your model's predicted probability of the event (here, the probability that a patient dies) for every row of the labelled training data. Sweep a decision threshold from 1 down to 0. At each threshold, call every row whose predicted probability is at or above the threshold positive and every other row negative, compare those calls with the correct answers, and that produces a true positive rate and a false positive rate for that threshold. Place a dot on your plot accordingly; for a model that does better than guessing, the dots sit above the diagonal line. Do this for every threshold and the dots trace a curve from (0%, 0%) to (100%, 100%). Connect the dots with straight lines and add up the area of the trapezoid under each segment: that sum is "the area under the curve". Models with larger AUCROCs are better overall at ranking true positives above false positives, across every threshold; a hand-rolled sketch follows this paragraph.
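
    Everything below is made up purely for illustration: toy_outcome plays the role of the question's outcome column, and toy_prob is a vector of predicted death probabilities.

        # Hand-rolled ROC AUC, purely for intuition (not production code).
        # prob:  predicted probability of the event for each row.
        # truth: factor of true outcomes; `event` names the level of interest.
        manual_roc_auc <- function(prob, truth, event = "rip") {
          is_event <- truth == event
          # Sweep the decision threshold over every observed probability,
          # plus end points so the curve runs from (0, 0) to (1, 1).
          thresholds <- sort(unique(c(-Inf, prob, Inf)), decreasing = TRUE)
          tpr <- fpr <- numeric(length(thresholds))
          for (i in seq_along(thresholds)) {
            call_event <- prob >= thresholds[i]
            tpr[i] <- sum(call_event & is_event)  / sum(is_event)   # true positive rate
            fpr[i] <- sum(call_event & !is_event) / sum(!is_event)  # false positive rate
          }
          # Trapezoidal rule over the (FPR, TPR) points = area under the curve.
          sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
        }

        # Made-up data: dying patients tend to get higher predicted probabilities.
        set.seed(1)
        toy_outcome <- factor(sample(c("rip", "dc"), 200, replace = TRUE),
                              levels = c("rip", "dc"))
        toy_prob    <- ifelse(toy_outcome == "rip", rbeta(200, 4, 2), rbeta(200, 2, 4))
        manual_roc_auc(toy_prob, toy_outcome)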

    The reason we do all this extra work, rather than simply calculating accuracy = (correct answers / total answers), or more robustly balanced accuracy = (true positive rate + true negative rate) / 2, is that those are single numbers computed at one fixed decision threshold, and you want the model to perform at its stated level everywhere, not just "100% accuracy for even-numbered years but only 40% accuracy for odd-numbered years" because the information gain your model exploits is only present in even-numbered years. The AUCROC value and the shape of the curve give you a way to know whether your model needs some specific characteristic of the data to hold in order to perform at its stated accuracy. Both headline numbers, incidentally, are one-liners in yardstick, shown below.
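
    (These use the two factor column names given in the question; the actual numbers will depend on the real data.)

        library(yardstick)

        # Plain accuracy: correct predictions / total predictions.
        accuracy(train_pre_post, truth = outcome, estimate = prediction)

        # Balanced accuracy: (true positive rate + true negative rate) / 2.
        bal_accuracy(train_pre_post, truth = outcome, estimate = prediction)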

    So every word in "area under the curve for the receiver operating characteristic" earns its keep. It tells you whether there is some characteristic in the data that your model needs in order to get all of its accuracy, while it is just guessing on the rest. Generally, a model that performs at 90% accuracy across all examples is superior to one that performs at 100% accuracy on even-numbered years but only 80% on odd-numbered years, because the second model was rewarded into existence for exploiting a characteristic that only exists in some of the training rows. If you pull that characteristic out as an explicit input feature, the classifier will in theory perform better.

    Roll your own solution and convince yourself it makes sense; then you can examine the source of roc_auc(), which attempts to do all of this for you, work out what is going wrong in your call, and perhaps leave a note that the method needs more documentation and demos for correct usage. A quick cross-check of a hand-rolled version against yardstick is sketched below.
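
    For example, using the made-up vectors from the sketch above (roc_auc_vec() is yardstick's vector interface to the same metric, and "rip" is the first factor level, which is the default event level):

        library(yardstick)

        # The hand-rolled number and yardstick's should agree up to floating point error.
        manual_roc_auc(toy_prob, toy_outcome)
        roc_auc_vec(truth = toy_outcome, estimate = toy_prob)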

    As a mnemonic for the name: the receiver is the model, operating means it is being used to make calls, the characteristic is the trade-off it exhibits as you move the decision threshold, and the area under the curve is its total performance over every operating point.

    Suspicious bumps or kinks in the ROC curve mean your model is finding information gain only when some set of characteristics is available. This is supremely useful, because you can eliminate that criterion from the labelled training set and force the classifier to use another source of data to isolate the outcome. Stop using, say, zip code to predict patient outcome, because that information betrays something we're not supposed to know, and make the model use something else; see the sketch below.
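
    A minimal sketch of that last step, assuming a hypothetical leaky column called zipcode (it is not in the question's data; it just stands in for whichever column is leaking):

        library(dplyr)

        # Drop the hypothetical leaky column so the next fit must rely on the
        # physiological columns instead.
        train_no_leak <- select(train_pre_post, -zipcode)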