Calculating AUC for multiple simple logistic regression models using a for loop


Let me just start off by saying this is my first time posting a question on Stack Overflow, so I hope I explain this well.

I am trying to calculate the c-stat (area under the ROC curve) for multiple simple logistic regression models.

I have the code for how to do it for one simple model. I have one binary response variable (which is a factor with levels 0 and 1) and 100 predictor variables which are all numeric. Here I use just one numeric predictor variable. This code works.

simple_model <- glm(target_variable ~ pred1, family = binomial, data = training_data)
pROC::auc(roc(training_data$target_variable, predict(simple_model, type = "response")))

Now what I am trying to do is create a separate data frame which has the name of each predictor variable in one column and its c-stat in the second column.

This is what I have tried so far without any success:

        auc <- sapply(training_data, 2, function (x) {
               temp_data <- cbind(training_data$target_variable, x)
               multiple_simple_models <- glm(target_variable ~ ., family = binomial, data = temp_data)
               proc::auc(roc(temp_data$target_variable, predict(multiple_simple_models, type = "response")))
})

But I get an error that says:

Error in match.fun(FUN): '2' is not a function, character or symbol

Solution

  • Your solution is not far off!

    Two things are getting in the way. First, the error you quote comes from the call itself: sapply() has no MARGIN argument (that belongs to apply()), so the 2 is matched to FUN, hence "'2' is not a function". Second, as raised in your comments above, there is a 'matrix / data frame' expectancy issue: glm() expects a data frame, or at the very least something which can be coerced to a data frame with the names of the columns retained. As a result, you can't use cbind() here, since it gives you a matrix in which the target column loses its name.
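
    To make the cbind() versus data.frame() difference concrete, here is a small sketch on made-up data (the values and the training_data object below are hypothetical, chosen only for illustration):

    ```r
    # Hypothetical toy data, for illustration only
    training_data <- data.frame(
      target_variable = factor(c(0, 1, 0, 1)),
      pred1 = c(0.2, 1.5, 0.7, 2.1)
    )
    x <- training_data$pred1

    # cbind() builds a matrix; the factor is reduced to its integer codes
    m <- cbind(training_data$target_variable, x)
    is.matrix(m)
    colnames(m)   # inspect the names: the first column is not called target_variable

    # data.frame() keeps a named column for the response and keeps it a factor
    target_variable <- training_data$target_variable
    d <- data.frame(x, target_variable)
    is.data.frame(d)
    names(d)
    is.factor(d$target_variable)
    ```

    Because d has a column that is actually called target_variable, a formula such as target_variable ~ . can be resolved against it.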

    So - assuming you have access to a target_variable vector and a data frame called predictors that holds only the predictor columns - my slight amendment to your code would look something like this:

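    results <- sapply(predictors, function(p) {
        # Pair one predictor column with the response in a named data frame
        temp_data <- data.frame(p, target_variable)
        temp_model <- glm(target_variable ~ ., family = binomial, data = temp_data)
        # In-sample AUC (c-statistic) from the fitted probabilities
        pROC::auc(pROC::roc(target_variable, predict(temp_model, type = "response")))
    })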
    
    results_data <- data.frame(predictor = names(results), auc = results)
    

    Note that you need the extra line for results_data since sapply() on its own returns a named vector (it automatically simplifies its output whenever possible).
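
    If you want to try the whole approach end to end, here is a minimal sketch on simulated data. Everything about the data is an assumption made for illustration (three predictors named pred1 to pred3, with only pred1 related to the outcome), and it assumes the pROC package is installed:

    ```r
    library(pROC)  # assumes pROC is installed

    set.seed(1)    # simulated data, purely illustrative
    n <- 200
    predictors <- data.frame(pred1 = rnorm(n), pred2 = rnorm(n), pred3 = rnorm(n))
    # In this made-up example only pred1 drives the outcome
    target_variable <- factor(rbinom(n, 1, plogis(1.5 * predictors$pred1)))

    results <- sapply(predictors, function(p) {
      temp_data <- data.frame(p, target_variable)
      temp_model <- glm(target_variable ~ ., family = binomial, data = temp_data)
      fitted_prob <- predict(temp_model, type = "response")
      # as.numeric() drops the "auc" class so a plain number is returned
      as.numeric(pROC::auc(pROC::roc(target_variable, fitted_prob, quiet = TRUE)))
    })

    # One row per predictor; row.names = NULL avoids duplicating the names
    results_data <- data.frame(predictor = names(results), auc = results,
                               row.names = NULL)
    results_data[order(-results_data$auc), ]
    ```

    The last line sorts the table so the predictors with the highest in-sample AUC come first. Keep in mind these are AUCs on the same data used to fit each model, as in your original single-model code.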