r performance logistic-regression lme4 mixed-models

Why is the glmer function for logistic regression taking so long to run in R?


I am running a multiple logistic regression model. The dataset has ~350,000 observations, and the outcome is a binary 0/1 variable. Most predictors are also dichotomous, but several have multiple levels (the maximum is 12). My initial code using only fixed effects was this:

model_fixed <- glm(outcome ~ var1 + var2 + var3 + var4 + var5 + var6 + var7 +
                     var8 + var9 + var10,
                   family = binomial(link = "logit"), data = df)

This runs within 2 minutes. No issues.

I decided to make var1, a variable with 4 levels ("A", "B", "C", "D"), a random effect, so I switched to this code:

model_mixed <- glmer(outcome ~ var2 + var3 + var4 + var5 + var6 + var7 + var8 + var9 + var10 +
    (1 | var1), data = df, family = binomial)

This model takes over an hour to run. The good news is that the odds ratios from the mixed-effects model look similar to the ORs from the fixed-effects model. But I'd like to speed up the process, and the run time makes me wonder whether I'm doing something wrong.
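(By "similar" I mean eyeballing a comparison along these lines; note the two coefficient sets aren't identical, since var1 is a fixed effect in one model and a grouping factor in the other:)

exp(coef(model_fixed))   # ORs from the fixed-effects glm
exp(fixef(model_mixed))  # ORs for the fixed effects in the glmer fit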

I tried specifying an optimizer, control = glmerControl(optimizer = "bobyqa"), but it takes even longer to run and gives me this warning:

Warning in commonArgs(par, fn, control, environment()) : maxfun < 10 * length(par)^2 is not recommended.
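If I read the warning correctly, bobyqa's function-evaluation budget (maxfun) should be at least 10 * (number of parameters)^2; I assume it can be raised through optCtrl (see ?glmerControl), though I haven't verified that this helps:

model_mixed2 <- glmer(outcome ~ var2 + var3 + var4 + var5 + var6 + var7 +
                        var8 + var9 + var10 + (1 | var1),
                      data = df, family = binomial,
                      control = glmerControl(optimizer = "bobyqa",
                                             optCtrl = list(maxfun = 1e5)))  # maxfun value is a guess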

In addition, I can't generate Wald confidence intervals because of the run time.

This code works just fine: exp(confint.default(model_fixed))

This one takes forever (I waited 2 hours before canceling): exp(confint.default(model_mixed))

I've combed this site and found similar questions (e.g. Speed up lmer function in R), but none specific to logistic regression.

I can't help but feel I am doing something wrong, or that my statistical approach is off (although the similar ORs from the fixed and mixed models seem encouraging). Any advice?


Solution

  • The main issue here, I think, is the large number of columns in your fixed-effect (X) model matrix. glmer has to minimize a nonlinear function of (ncol(X) + number of RE parameters) parameters; in your case there's only one RE parameter (the variation of the intercept among var1 groups), but (in particular) a categorical (factor) variable with n levels adds n - 1 columns to X. Furthermore, when you calculate the confidence intervals, you may need to compute the Hessian, whose size grows as the square of the number of parameters.
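    As a quick check of that count, you can build the fixed-effect model matrix yourself (a sketch using the question's variable names):

    ## number of fixed-effect parameters glmer has to estimate;
    ## each k-level factor contributes k - 1 columns, plus 1 for the intercept
    X <- model.matrix(~ var2 + var3 + var4 + var5 + var6 + var7 +
                        var8 + var9 + var10, data = df)
    ncol(X)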

    I also agree with @Aaron that you don't gain much by treating a factor with only four levels as a random effect. Here's a simulated example on a similar scale: turning off the derivative calculation (calc.derivs = FALSE) speeds up the fit, and Wald intervals can then be computed from a Hessian-free vcov():

    library(lme4)
    set.seed(101)
    ## simulate data with a 10-level grouping factor and a 20-level fixed factor
    dd <- expand.grid(f1 = factor(1:10), f2 = factor(1:20), rep = 1:200)
    dd$y <- rbinom(nrow(dd), prob = 0.2, size = 1)
    ## calc.derivs = FALSE skips the expensive finite-difference derivative
    ## (Hessian/gradient) computation at the end of the fit
    m1 <- glmer(y ~ f2 + (1 | f1), family = binomial, data = dd, verbose = 1,
                control = glmerControl(calc.derivs = FALSE))
    ## do the equivalent of stats::confint.default() with a Hessian-less vcov
    vv <- vcov(m1, use.hessian = FALSE)
    ss <- sqrt(diag(vv))              # Wald standard errors
    cimat <- fixef(m1) + outer(ss, c(-1, 1)) * qnorm(0.975)
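    To put the intervals on the odds-ratio scale, matching the exp(confint.default(...)) calls in the question, just exponentiate:

    colnames(cimat) <- c("2.5 %", "97.5 %")  # label the Wald limits
    exp(cimat)                               # odds-ratio scale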