rglmnetmse

Does cv.glmnet in R return double MSE for binary data?


I noticed that the minimized values for the MSE (mean of squared errors) and MAE (mean of absolute values of the errors) criteria returned by "$cvm" or "plot()" in cv.glmnet are twice the actual values when the outcome is binary. This is very confusing. There must be some reason for this. Is this explained somewhere or do I understand something wrong?

I published a figure with MAE-values larger than 0.5 on the y-axis (binary outcome) produced by a code similar to what I posted below. This actually implies that my predictions would be worse than random guessing. So I got a complaint from a referee which I could not answer. And when I calculate MAE from the predicted values, it seems to be correct and it is exactly 50% of the value what cv.glmnet gives. I find this discrepancy extremely irritating.

# sample R code to illustrate huge discrepancy between MSE and MAE
# values from cv.glmnet when compared to prediction errors
# calculated applying a test data set 

# generate binomial data
n <- 10000
set.seed(1234)
x1 <- runif(n, -2, 2); x2 <- runif(n, -2, 2)
p <- exp(x1 + x2)/(1 + exp(x1 + x2))
y <- rbinom(n, 1, p)
  # Note: if the line above is replaced by  y <- 5*p + rnorm(n)
  #  and later family="gaussian" in cv.glmnet() 
  #  then there are no similar discrepancies as with binary data

# predictors
x <- matrix(0, n, 10)
for (k in 1:ncol(x)) {a <- runif(1) 
 x[, k] <- 0.5*(a*x1 + (1-a)*x2 + runif(n, -2, 2))}

# training data
xa <- x[seq(n/2), ]; ya <- y[seq(n/2)]
# test data
xb <- x[-1*seq(n/2), ]; yb <- y[-1*seq(n/2)]

install.packages("glmnet")
library(glmnet)
# set alpha=1 i.e. apply Lasso regression with optimal regularization
# parameter lambda chosen by MSE or MAE criterion
cvfit_MSE <- cv.glmnet(xa, ya, family="binomial", alpha=1, 
                       type.measure="mse", nfolds=10)
cvfit_MAE <- cv.glmnet(xa, ya, family="binomial", alpha=1, 
                       type.measure="mae", nfolds=10)

min(cvfit_MSE$cvm) # MSE=0.3734 (mean of squared errors)
min(cvfit_MAE$cvm) # MAE=0.7477 (mean of absolute values of the errors)
plot(cvfit_MSE) 
           # also this has min 0.3734 on y-axis (labelled as "MSE")
plot(cvfit_MAE) # here y-axis (labelled as MAE) starts at 0.75 
                # but even random guessing would produce MAE about 0.5, 
                # extremely confusing... should y-axis label be 2*MAE??

# obtain predictions for the test data:
pp <- predict(cvfit_MSE, newx=xb, s="lambda.min", type="response")

# compare these to the actual observed values (variable "yb") in test data.
# surprisingly, MSE and MAE from these predictions are half of those from cv.glmnet:
mean((pp-yb)^2) # MSE=0.17650 (mean of squared errors)
mean(abs(pp-yb)) # MAE=0.36380 (mean of absolute values of the errors)

Solution

  • I don't know where this is justified/described, but I can point you to the actual code, which is in glmnet:::cv.lognet (lognet is the subclass of glmnet model that applies to fitting binomial data).

    The formula used is

    (y[, 1] - (1 - predmat))^2 + (y[, 2] - predmat)^2
    

    which effectively treats a binary outcome as a (0,1) or (1,0) pair and computes the error for both the "failures" (0 outcomes) and "successes" (1 outcomes).

    For example:

    predmat <- 0.2
    y <- matrix(c(0, 1), nrow = 1)
    mse = (y[, 1] - (1 - predmat))^2 + 
        (y[, 2] - predmat)^2
    

    which gives 2*0.8^2 = 1.28 rather than 0.8^2 = 0.64 exactly as you point out.

    Clearly for binary data the two parts of this expression will be equivalent. It's probably written this way to generalize more easily to binomial data with N>1 (in R these data are typically represented as a two-column matrix of failures and successes).

    I would guess that the authors only cared that the MSE be proportionally correct, so that it would be useful for picking the right model ... I guess they could have written it as

    (y[,2]/rowSums(y) - predmat)^2
    

    instead.