Tags: r, logistic-regression, mlr3

How are residuals calculated in a logistic regression model?


I created a logistic regression model using the mlr3 package in R. I extracted the residuals from the model, but I can't work out how they have been calculated: they do not correspond to any residual calculation that I know of.

Suppose I use the mlr3 package to create a logistic regression model:

library(mlr3)
library(mlr3learners)  # provides the "classif.log_reg" learner
library(tidyverse)

#create dummy data
set.seed(123)  # for reproducibility of the dummy data
data <- data.frame(
  predictor = c(rnorm(50, mean = 0), rnorm(50, mean = 1)),
  dependant = as.factor(c(rep(0, 50), rep(1, 50)))
)

#define and train a logistic regression model
classifier_log_reg <- mlr_learners$get("classif.log_reg")
task <- mlr3::TaskClassif$new(
  id = "my_data",
  backend = data,
  target = "dependant",  # target variable
  positive = "1"
)
classifier_log_reg$train(task, row_ids = 1:100)

I can get the residuals from the model using: residuals <- classifier_log_reg$model$residuals

My question is: how are these residuals calculated? I cannot reproduce them manually, and they don't match the numbers I get when I calculate Pearson or deviance residuals using the functions below:

pearson_residuals <- function(p, actual) {
  # Standard deviation of the predicted binomial distribution
  std_dev <- sqrt(p * (1 - p))
  
  # Avoid division by zero in case of p values being 0 or 1
  std_dev[std_dev == 0] <- .Machine$double.eps
  
  # Calculate the Pearson residuals
  residuals <- (actual - p) / std_dev
  
  return(residuals)
}

deviance_residuals <- function(p, actual) {
  # Ensure p is within valid range to avoid log(0) issues
  p <- ifelse(p == 0, .Machine$double.eps, ifelse(p == 1, 1 - .Machine$double.eps, p))
  
  # Calculate the deviance residuals
  residuals <- sign(actual - p) * sqrt(-2 * (actual * log(p) + (1 - actual) * log(1 - p)))
  
  return(residuals)
}
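
As a sanity check, these helpers appear to match what base R reports for an ordinary logistic regression fitted with glm on the same data (the glm call below is only for comparison and is not part of the mlr3 workflow):

#sanity check (sketch): the helpers should match residuals(fit, type = ...) for a plain glm fit
fit <- glm(dependant ~ predictor, family = "binomial", data = data)
p <- fitted(fit)                                 #fitted probabilities P(dependant = "1")
y <- as.numeric(as.character(data$dependant))    #0/1 outcome

all.equal(unname(pearson_residuals(p, y)), unname(residuals(fit, type = "pearson")))
all.equal(unname(deviance_residuals(p, y)), unname(residuals(fit, type = "deviance")))
#both comparisons should return TRUE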

What I have found, strangely, is that the residuals from classifier_log_reg$model$residuals do appear to correspond systematically to the residuals I can calculate manually as the simple difference between the predicted probabilities and the actual values of the dependent variable. Note that I have adjusted both my manual calculations and the residuals output by the model object to best illustrate the apparent sigmoid relationship:


#get residuals directly from the model object
residuals <- classifier_log_reg$model$residuals

#####calculate residuals manually

#specify that predictions should be probabilities rather than class labels
classifier_log_reg$predict_type <- "prob"

#get the predictions
predictions <- classifier_log_reg$predict(task, row_ids = 1:100)

#isolate the predicted probabilities of the positive class ("1")
predictions <- predictions$prob[, "1"]

#subtract predictions from actual values of dependant variable
actual <- data$dependant %>% as.character() %>% as.numeric()
my_resid <-  actual - predictions 


#put the residuals from the model, the manually calculated residuals
#and the actual values into a dataframe.
#I have adjusted them a bit to illustrate the (apparent) sigmoid relationship that 
#emerges after these adjustments.
df <- data.frame(
  x = residuals - (actual * 2) + 1,
  y = (my_resid + 1) / 2,
  actual = actual
)

#plot the relationship between the manually calculated residuals (with adjustment)
#and the residuals straight from the model (with adjustment).
#The curve is completely smooth, but I cannot find the function linking x to y
ggplot(df) + geom_point(aes(x = x, y = y))

[Plot output by ggplot: the points trace a smooth, sigmoid-looking curve]

As can be seen, there seems to be a sigmoid relationship here. However, when I use the nls function to find the parameters of the best-fitting sigmoid curve linking x and y, it doesn't fit well at all. Below is what I tried; I have not pasted the resulting plot, but it does not show a straight line (which is what I would expect if the relationship between x and y really were sigmoid):

sigmoid <- function(x, L, k, x0) {
  L / (1 + exp(-k * (x - x0)))
}

model <- nls(y ~ sigmoid(x, L, k, x0), 
             data = df, 
             start = list(L = 1, k = 1,  x0 = 1),
             control = nls.control(maxiter = 100))

df$fitted <- predict(model, df)
ggplot(df) + geom_point(aes(x = fitted, y = y))

So what IS the relationship between x and y here? And more to the point, how are the residuals from the mlr3 logistic regression model being calculated under the hood?


Solution

  • Probably the same way that the glm command calculates them:

    glm1 <- glm(dependant~predictor, family="binomial", data=data)
    
    identical(residuals, glm1$residuals)
    # [1] TRUE
    

    And the residuals stored in the $residuals component of a glm fit are the "working" residuals, i.e. the residuals from the final iteration of the IWLS fit, as mentioned in the docs (see ?glm).

    That is, the working residual is $r_i = z_i - \hat{\eta}_i$, where $z_i$ is the working response and $\hat{\eta}_i$ is the linear predictor; equivalently, $r_i = (y_i - \hat{\mu}_i) \big/ \frac{d\hat{\mu}_i}{d\hat{\eta}_i}$. For a binomial GLM with a logit link, $d\mu/d\eta = \mu(1-\mu)$, so this works out to $(y_i - \hat{p}_i) / (\hat{p}_i (1 - \hat{p}_i))$.
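
    A quick way to verify this against the glm1 fit above (just a base-R sketch):

    y  <- as.numeric(as.character(data$dependant))   # 0/1 outcome
    mu <- fitted(glm1)                               # fitted probabilities
    all.equal(unname((y - mu) / (mu * (1 - mu))), unname(glm1$residuals))
    # should return TRUE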

    Some useful links:

    Understanding glm$residuals and resid(glm)

    Calculating working residuals of a Gamma GLM model