Let me just start off by saying this is my first time posting a question on Stack Overflow, so I hope I explain this well.
I am trying to calculate the c-stat (area under the ROC curve) for multiple simple logistic regression models, one per predictor.
I have the code for how to do it for one simple model. I have one binary response variable (which is a factor with levels 0 and 1) and 100 predictor variables which are all numeric. Here I use just one numeric predictor variable. This code works.
simple_model <- glm(target_variable ~ pred1, family = binomial, data = training_data)
pROC::auc(roc(training_data$target_variable, predict(simple_model, type = "response")))
Now what I am trying to do is create a separate data frame which has the name of each predictor variable in one column and its c-stat in the second column.
This is what I have tried so far without any success:
auc <- sapply(training_data, 2, function (x) {
temp_data <- cbind(training_data$target_variable, x)
multiple_simple_models <- glm(target_variable ~ ., family = binomial, data = temp_data)
proc::auc(roc(temp_data$target_variable, predict(multiple_simple_models, type = "response")))
})
But I get an error that says:
Error in match.fun(FUN): '2' is not a function, character or symbol
Your solution is not far off!
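The only outstanding issue, as raised in your comments above, is that you are not able to process the code due to a 'matrix / data frame' expectancy issue. This is because glm() expects a data frame, or at the very least something which can be coerced to a data frame with the column names retained. As a result, you can't use cbind() here, since it returns a matrix rather than a data frame and the target_variable column name is lost. (The match.fun error itself comes from the stray 2: unlike apply(), sapply() has no MARGIN argument, so the 2 is taken as the function to apply.)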
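To see the difference concretely, here is a minimal sketch in base R. The toy training_data below is made up purely for illustration (it is not your data); it only shows what cbind() and data.frame() each hand to glm().

```r
# Hypothetical toy data standing in for training_data (illustration only)
training_data <- data.frame(
  target_variable = factor(c(0, 1, 0, 1, 1)),
  pred1 = c(0.2, 1.5, 0.7, 2.1, 1.9)
)
x <- training_data$pred1

# cbind() on vectors builds a matrix; the factor is replaced by its integer codes
m <- cbind(training_data$target_variable, x)
is.matrix(m)       # TRUE
is.data.frame(m)   # FALSE
colnames(m)        # no "target_variable" column for the formula to find

# data.frame() keeps the column names and the factor type
d <- data.frame(x, target_variable = training_data$target_variable)
is.data.frame(d)               # TRUE
names(d)                       # "x" "target_variable"
is.factor(d$target_variable)   # TRUE
```

With d, the formula target_variable ~ . can resolve both the response and the single predictor by name.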
So - assuming you have access to a target_variable vector and a data frame with predictors in it - my slight amendment to your code would look something like this:
# Fit one single-predictor logistic model per column and collect its AUC (c-stat)
results <- sapply(predictors, function(p) {
  temp_data <- data.frame(p, target_variable)
  temp_model <- glm(target_variable ~ ., family = binomial, data = temp_data)
  pROC::auc(pROC::roc(target_variable, predict(temp_model, type = "response")))
})
# Turn the named vector into a two-column data frame
results_data <- data.frame(predictor = names(results), auc = results)
Note that you need the extra line for results_data since sapply() on its own returns a named vector (it automatically simplifies its outputs whenever possible).
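If it helps to try this end to end, here is a self-contained sketch on simulated data. The predictors data frame, the column names pred1 to pred3 and the simulated target_variable are all assumptions for illustration, and pROC needs to be installed. It uses vapply() with as.numeric() as one possible variation, so the result is guaranteed to be a plain numeric vector rather than relying on sapply()'s simplification.

```r
library(pROC)

set.seed(42)  # simulated data, purely illustrative
n <- 200
predictors <- data.frame(
  pred1 = rnorm(n),
  pred2 = rnorm(n),
  pred3 = rnorm(n)
)
# Hypothetical outcome that depends on pred1 only
target_variable <- factor(rbinom(n, 1, plogis(1.5 * predictors$pred1)))

# One single-predictor logistic model per column; return its in-sample AUC
results <- vapply(predictors, function(p) {
  temp_data <- data.frame(p, target_variable)
  temp_model <- glm(target_variable ~ ., family = binomial, data = temp_data)
  roc_obj <- pROC::roc(target_variable,
                       predict(temp_model, type = "response"),
                       quiet = TRUE)  # suppress the level/direction messages
  as.numeric(pROC::auc(roc_obj))
}, numeric(1))

# Named vector -> data frame with one row per predictor
results_data <- data.frame(predictor = names(results),
                           auc = results,
                           row.names = NULL)

# Show the strongest predictors first
results_data[order(-results_data$auc), ]
```

Keep in mind these are in-sample AUCs computed on the same data used to fit each model, which matches the original code but will tend to be slightly optimistic.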