rdatabasestatisticsmedicalstatistical-test

How can I find the c-ctatistic or AUROC using logistic regression in R?


I am running a logistic regression to see how these factors/variables affect an outcome (Neurological Complication).

How can I obtain the c-statistic -- also known as the area under the Receiver Operating Characteristic (AUROC) Curve?

    NeuroLogit2 <- glm(Neurologic Complication? ~ HTN + stroke + Gender + Embol + Drain, data=Tevar.new, family=binomial)
    summary(NeuroLogit2) 

Solution

  • Well, obviously I don't have your data, so let's make some up. Here, we'll pretend we're modelling the probability of people catching a cold in any given year based on age and sex. Our outcome variable is just a 1 for "caught a cold" and 0 for "didn't catch a cold"

    set.seed(69)
    outcome <- c(rbinom(1000, 1, seq(0.4, 0.6, length.out = 1000)),
                 rbinom(1000, 1, seq(0.3, 0.5, length.out = 1000)))
    sex     <- rep(c("M", "F"), each = 1000)
    age     <- rep((601:1600)/20, 2)
    
    df      <- data.frame(outcome, age, sex)
    

    Now we'll create the model and have a look at it:

    my_mod  <- glm(outcome ~ age + sex, data = df, family = binomial())
    
    summary(my_mod)
    #> 
    #> Call:
    #> glm(formula = outcome ~ age + sex, family = binomial(), data = df)
    #> 
    #> Deviance Residuals: 
    #>     Min       1Q   Median       3Q      Max  
    #> -1.3859  -1.0993  -0.8891   1.1847   1.5319  
    #> 
    #> Coefficients:
    #>             Estimate Std. Error z value Pr(>|z|)    
    #> (Intercept) -1.20917    0.18814  -6.427 1.30e-10 ***
    #> age          0.01346    0.00317   4.246 2.18e-05 ***
    #> sexM         0.61000    0.09122   6.687 2.28e-11 ***
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> (Dispersion parameter for binomial family taken to be 1)
    #> 
    #>     Null deviance: 2760.1  on 1999  degrees of freedom
    #> Residual deviance: 2697.1  on 1997  degrees of freedom
    #> AIC: 2703.1
    #> 
    #> Number of Fisher Scoring iterations: 4
    

    Looks good. Older people and men are more likely to catch colds.

    Now suppose we wanted to use this model to get a prediction of whether someone of a given age and sex will catch a cold in the next year. If we use the predict function with type = "response", we get a probability estimate for each of the people in our data frame based on their age and sex.

    predictions <- predict(my_mod, type = "response")
    

    We can use these probabilities to construct our ROC. Here I'll use the pROC package to help:

    library(pROC)
    
    roc(outcome, predictions)
    #> Setting levels: control = 0, case = 1
    #> Setting direction: controls < cases
    #> 
    #> Call:
    #> roc.default(response = outcome, predictor = predictions)
    #> 
    #> Data: predictions in 1079 controls (outcome 0) < 921 cases (outcome 1).
    #> Area under the curve: 0.6027
    

    So the area under the ROC is 60.27%. We can plot the ROC itself to see what this looks like:

    library(ggplot2)
    
    ggroc(roc(outcome, predictions)) +
      theme_minimal() + 
      ggtitle("My ROC curve") + 
      geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color="grey", linetype="dashed")
    #> Setting levels: control = 0, case = 1
    #> Setting direction: controls < cases
    

    Created on 2020-06-07 by the reprex package (v0.3.0)