r logistic-regression mixed-models glmmtmb

glmmTMB issue with number of observations and groups


I have a dataset with 125 animals across 3 sites and 100500 observations. Both counts look correct when I inspect the structure of the data, but when I fit the model to an updated data frame (I added a variable), the summary reports fewer observations and fewer animals than expected. Any idea how to fix this?

Formula:
type.x ~ std_rdden + std_forest2 + std_dev2 + (1 | site) + 
      (1 | Animal.ID) + Age.x + time.cat + Sex.x + road_category + season
Data: df7

AIC BIC logLik deviance df.resid
75769.3 75903.4 -37869.7 75739.3 56275

Random effects:

Conditional model:
Groups    Name        Variance  Std.Dev.
site      (Intercept) 2.936e-08 0.0001713
Animal.ID (Intercept) 2.000e-01 0.4472232
Number of obs: 56290, groups: site, 3; Animal.ID, 70

str() output showing that Animal.ID is a factor and that the data frame has the expected counts:

'data.frame':   100501 obs. of  55 variables:
  Animal.ID         : Factor w/ 125 levels "D1060

Solution

  • It's hard to know for sure, but the most likely issue is that glmmTMB (like most R modeling functions) performs complete-case analysis: that is, it discards any observation with a missing value in the response or in any of the predictor variables in the model.

    One way to test this is to use model.frame on the formula and see how many rows are left, e.g.

    library(reformulas) ## for subbars
    form <- type.x ~ std_rdden + std_forest2 + std_dev2 + (1 | site) + 
          (1 | Animal.ID) + Age.x + time.cat + Sex.x + road_category + season
    form <- subbars(form) ## substitute '+' for '|' in formula so model.frame can handle it
    mf <- model.frame(form, data = df7)
    nrow(mf)   ## total observations remaining
    length(unique(mf$Animal.ID)) ## number of animals remaining
    

    If this issue is indeed due to missing (i.e. NA) values, you'll have to make some decisions about what to do: you can (1) live with the reduced data set size, (2) exclude some predictors if they have too many NA values, or (3) if you are willing to work harder, you can use the mice package to do multiple imputation to fill in the missing values based on relationships with the other (non-missing) data. Chapter 3 of Harrell's Regression Modeling Strategies discusses handling of missing data in detail.

    PS: it's generally not advisable to model grouping variables with fewer than 5 levels (e.g. site) as random effects, even if they conceptually/philosophically fall into that category; note that the estimated variance for site in your output is essentially zero (2.936e-08) ...
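
Returning to the missing-value check: if model.frame does confirm that rows are being dropped, a useful next step is to find out which variables carry the NA values and which animals disappear entirely as a result. The following is only a minimal sketch using base R on a made-up toy data frame; the names toy, id, y, x1 and x2 are hypothetical stand-ins for df7 and its columns, and the same calls should apply to the real data after swapping in the actual formula.

```r
## Toy data standing in for df7; all names and values are made up for illustration
toy <- data.frame(
  id = factor(c("a", "a", "b", "b", "c", "c")),
  y  = c(0, 1, 1, 0, 1, 0),
  x1 = c(0.5, NA, 1.2, 0.3, NA, NA),
  x2 = c(1, 2, 3, NA, 5, 6)
)

## Hypothetical model formula; all.vars() pulls out every variable name it uses,
## including the grouping variable inside the random-effect term
form <- y ~ x1 + x2 + (1 | id)
model_vars <- all.vars(form)

## Number of missing values in each model variable
na_counts <- colSums(is.na(toy[model_vars]))
print(na_counts)

## Rows with no missing values in any model variable (the rows a
## complete-case fit would keep)
keep <- complete.cases(toy[model_vars])
n_kept <- sum(keep)
print(n_kept)

## Grouping levels that lose all of their rows
ids_kept <- unique(as.character(toy$id[keep]))
ids_lost <- setdiff(levels(toy$id), ids_kept)
print(ids_lost)
```

On the real data, a predictor with a very large NA count would be a natural candidate for option (2) above, while NA values spread thinly over many predictors would point more toward imputation.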