oftentimes I need to fit several linear models (dozens) in R. These models may have the same predictor but varying response variables, data distribution families, and more.
What I've been doing: fitting models one by one, looking at diagnostic plots one by one, and altering the model as needed (see below).
My goal: fit models with functions that fit initial models using a specified set of predictors and response variables and families and datasets. Put those models into one place, like a dataframe, where I can access the model object itself as well as characteristics like the model's AIC or formula. Provide diagnostic plots with DHARMa. If necessary, modify the model and then update the dataframe to reflect that model's changes.
How can I achieve this ideal workflow? I've tried to make a reprex, below. Thanks for any advice!
library(tidyverse)
library(DHARMa)
library(magrittr)
library(glmmTMB)
cars <- datasets::mtcars
# 3 models, each with different response variables. Some (e.g. glm2)
# have non-Gaussian families; some have filtered datasets
lm1 <- glmmTMB(mpg ~ hp, data = cars)
glm2 <- glmmTMB(cyl ~ hp,
data = cars,
family = "poisson")
lm3 <- glmmTMB(drat ~ hp,
data = cars %>%
dplyr::filter(carb != 1))
# next, look at diagnostic plots
plot(DHARMa::simulateResiduals(lm1))
plot(DHARMa::simulateResiduals(glm2))
plot(DHARMa::simulateResiduals(lm3))
# transform one model's response variable to satisfy assumptions
glm2 <- glmmTMB(cyl^2 ~ hp,
data = cars,
family = "poisson")
# check diagnostics again
plot(DHARMa::simulateResiduals(lm2))
# organize models into one place. A dataframe? A list? Somewhere where
# I can access the model itself and the model formula so I can extract
# those things to be used as arguments in other functions
?
You can put "everything" into a dataframe, as long as the receiving column is a list-column (meaning cell contents of this columns are lists). That's a convenient way to slice & dice your data and keep them and/or model formulae, recipes, plots, you name it ... in a tabular structure. Just remember to encapsulate items in a list, where necessary.
In case of your example, you could:
the_grid <-
expand.grid(dependent = c('mpg', 'cyl', 'drat'),
type = 'glmmTMB',
predictors = c('hp'),
predicate = c('TRUE', 'TRUE', 'carb != 1'),
stringsAsFactors = FALSE
)
library(DHARMa)
library(glmmTMB)
library(dplyr)
library(ggeffects) ## to plot model effects
the_models <-
the_grid |>
rowwise() |> ## !important
mutate(
m = do.call(type,
args = list(formula = reformulate(response = dependent,
termlabels = predictors,
),
data = mtcars
)
) |> list(), ## don't forget to wrap the result in a list
## add a ggeffect plot object for fun:
p = list(ggpredict(m) |> plot()) ## again, use a list
)
## > the_models
## # A tibble: 9 x 6
## # Rowwise:
## dependent type predictors predicate m p
## <fct> <fct> <fct> <fct> <list> <list>
## 1 mpg glmmTMB hp TRUE <glmmTMB> <named list [1]>
## 2 cyl glmmTMB hp TRUE <glmmTMB> <named list [1]>
## 3 drat glmmTMB hp TRUE <glmmTMB> <named list [1]>
## 4 mpg glmmTMB hp TRUE <glmmTMB> <named list [1]>
## 5 cyl glmmTMB hp TRUE <glmmTMB> <named list [1]>
## 6 drat glmmTMB hp TRUE <glmmTMB> <named list [1]>
## 7 mpg glmmTMB hp carb != 1 <glmmTMB> <named list [1]>
## 8 cyl glmmTMB hp carb != 1 <glmmTMB> <named list [1]>
## > the_models$m[[1]] |> summary()
## Family: gaussian ( identity )
## Formula: mpg ~ hp
## Data: filter(mtcars, TRUE)
##
## AIC BIC logLik deviance df.resid
## 181.2 185.6 -87.6 175.2 29
##
##
## Dispersion estimate for gaussian family (sigma^2): 14
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 30.098862 1.582037 19.025 < 2e-16 ***
## hp -0.068228 0.009798 -6.964 3.32e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
the_models$p[[5]]