rlapplylme4lsmeans

lapply: Fitting thousands of mixed models and being able to extract lsmeans


I have a list of formulas (> 10,000) for linear mixed models (lme4) that I fitted to a data set. Successfully, I have used lapply() and a custom function that incorporated tryCatch() to fit these models. Now I would like to extract the P-values and lsmeans for all of these models. I have successfully extracted the P-values, but the lsmeans function is encountering errors.

library(lme4)
library(lmerTest)
library(pbkrtest)
library(lsmeans)

formulaS <- list() #Not going to detail generation of list, generically: 'Yvar~X1*X2+(1|subject)'
dataSET <- X #dataframe with first 3 columns containing fixed and random factors, 
             # as well as >10,000 columns of variables of interest

modelSeq <- function (x, dat) {
  return(tryCatch(lmer(x, data = dat), error=function(e) NULL))
}

modelsOutput <- lapply(formulaS, function(x) modelSeq(x, dat = dataSET))

lsmeans(modelsOutput[[1]], pairwise ~ X1:X2) #recieves error

Error in solve.default(L %% V0 %% t(L), L) : Lapack routine dgesv: system is exactly singular: U[1,1] = 0

The reason I don't think it's a model problem is that if I fit the models individually I can extract the lsmeans just fine. Is there any commentary on 1) why I cannot extract lsmeans, 2) how to efficiently extract means, or 3) an alternative, efficient method.

Thanks!

__ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __ __

UPDATE & EDIT: This is RNAseq data with repeated samples of subjects over time I am playing with, so the >10,000 models have the same fixed and random effects that describe the experimental design. The response (a gene) is the only variable that varies. I have tried to make that more explicit in the code below. Recognizing that a mixed model with an identity link might not be ideal for the data, I have the new wrapper below. I'm still having issues extracting means. Also, any commentary on more appropriate, time-efficient methods for computing P-values is appreciated.

library(lme4)
library(blmeco)
library(ggeffects)

formulaS <- list() #Not going to detail generation of list, generically: 'GeneI~TRT*TIME+(1|subject)'
dataSET <- X #dataframe with first 3 columns containing fixed and random factors, 
             # as well as >10,000 columns of variables of interest (gene TPM)

wrap.glmer.nb <- function (modelForm, dat) {
  m <- tryCatch(glmer.nb(formula = modelForm, data = dat), error = function(e) NULL)
  if (!is.null(m)) {
    m.disp <- tryCatch(dispersion_glmer(m), error = function(e) NULL)
    m.wald <- tryCatch(anova(m), error = function(e) NULL)
    m.means.c <- tryCatch(ggemmeans(model = m, terms = c('TRT')), error = function(e) NULL)
    m.means.e <- tryCatch(ggemmeans(model = m, terms = c('TIME')), error = function(e) NULL)
    m.means.cxe <- tryCatch(ggemmeans(model = m, terms = c('TRT', 'TIME')), error = function(e) NULL)
    x <- list(m.disp, m.wald, m.means.c, m.means.e, m.means.cxe)
    print(paste0('Done with a model at ', Sys.time()))
    return(x)
  } else{
    x <- m
    return(x)
  }
}

startTime <- Sys.time()
modelOUTPUTS <- lapply(formulaS, function(modelForm) wrap.glmer.nb(modelForm, dat = dataSET))
endTime <- Sys.time()
print(paste('Victory! The analysis took:', endTime - startTime))


Solution

  • Your original setup would work if you add one line to modelSeq():

    modelSeq <- function (x, dat) {
      environment(x) <- environment()
      return(tryCatch(lmer(x, data = dat), error=function(e) NULL))
    }
    

    This sets the environment of the formula to that of the function body, making it possible to find the dataset named dat.

    A similar example:

    fitm <- function(formula, data, ...) {
        environment(formula) <- environment()
        lm(formula, data = data, ...)
    }
    
    fl <- list(breaks ~ tension, breaks ~ wool + tension, breaks ~ wool*tension)
    
    md <- lapply(fl, fitm, data = warpbreaks[c(1,2,3,5,8,13,21,34,54), ])
    
    lapply(md, function(m) emmeans(m, "tension"))
    

    Which produces:

    NOTE: Results may be misleading due to involvement in interactions
    
    [[1]]
     tension emmean    SE df lower.CL upper.CL
     L         41.2  6.64  6    24.91     57.4
     M         17.0 16.27  6   -22.82     56.8
     H         26.0 11.51  6    -2.16     54.2
    
    Confidence level used: 0.95 
    
    [[2]]
     tension emmean    SE df lower.CL upper.CL
     L         41.6  8.91  5    18.73     64.5
     M         17.7 19.41  5   -32.21     67.6
     H         26.0 12.59  5    -6.38     58.4
    
    Results are averaged over the levels of: wool 
    Confidence level used: 0.95 
    
    [[3]]
     tension emmean   SE df lower.CL upper.CL
     L         41.1 10.9  4     10.9     71.3
     M       nonEst   NA NA       NA       NA
     H         26.0 14.1  4    -13.0     65.0
    
    Results are averaged over the levels of: wool 
    Confidence level used: 0.95 
    

    BTW, you don't need the lsmeans package; it is just a front-end for emmeans. In fact, the lsmeans function itself is in emmeans; it just runs emmeans and re-labels the results.