I have a dataset with 125 animals across 3 sites and 100500 observations. Both counts show up properly when I look at the structure of the data, but when I run the model on an updated data frame (I added a variable), the numbers of groups and observations no longer come out right. Any idea how to fix this?
Formula:
type.x ~ std_rdden + std_forest2 + std_dev2 + (1 | site) +
(1 | Animal.ID) + Age.x + time.cat + Sex.x + road_category + season
Data: df7
AIC BIC logLik deviance df.resid
75769.3 75903.4 -37869.7 75739.3 56275
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
site (Intercept) 2.936e-08 0.0001713
Animal.ID (Intercept) 2.000e-01 0.4472232
Number of obs: 56290, groups: site, 3; Animal.ID, 70
str() output showing that they show up as factors, with the right numbers, in the data frame:
'data.frame': 100501 obs. of 55 variables:
Animal.ID : Factor w/ 125 levels "D1060
It's hard to know for sure, but the most likely issue is that glmmTMB (like most R modeling functions) does complete case analysis: that is, it discards any observations with missing values for the response or any of the predictor variables in the model.
One way to test this is to run model.frame on the formula and see how many rows are left, e.g.
library(reformulas) ## for subbars
form <- type.x ~ std_rdden + std_forest2 + std_dev2 + (1 | site) +
(1 | Animal.ID) + Age.x + time.cat + Sex.x + road_category + season
form <- subbars(form) ## substitute '+' for '|' in formula so model.frame can handle it
mf <- model.frame(form, data = df7)
nrow(mf) ## total observations remaining
length(unique(mf$Animal.ID)) ## number of animals remaining
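It can also help to see which variables are responsible for the dropped rows. A minimal sketch along the same lines (the object name vars is just for illustration; all.vars() pulls every variable name out of the formula, including the grouping variables):

vars <- all.vars(form)            ## every variable used in the formula
colSums(is.na(df7[vars]))         ## number of NAs per variable
sum(!complete.cases(df7[vars]))   ## rows that would be dropped
nrow(df7) - nrow(mf)              ## should match the number of dropped rows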
If this issue is indeed due to missing (i.e. NA) values, you'll have to make some decisions about what to do: you can (1) live with the reduced data set size, (2) exclude some predictors if they have too many NA values, or (3) if you are willing to work harder, use the mice package to do multiple imputation, filling in the missing values based on their relationships with the other (non-missing) data. Chapter 3 of Harrell's Regression Modeling Strategies discusses handling of missing data in detail.
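If you go the multiple-imputation route, the basic mice workflow looks roughly like the sketch below (the family is a placeholder since you haven't told us what you fit; pooling glmmTMB fits relies on broom.mixed supplying tidy methods, and you'll want to think about which variables belong in the imputation model):

library(mice)
library(glmmTMB)
library(broom.mixed)   ## tidiers so pool() can combine glmmTMB fits

imp <- mice(df7, m = 5, seed = 101)   ## 5 imputed data sets
fits <- with(imp,
             glmmTMB(type.x ~ std_rdden + std_forest2 + std_dev2 + (1 | site) +
                       (1 | Animal.ID) + Age.x + time.cat + Sex.x +
                       road_category + season,
                     family = binomial))   ## placeholder -- use whatever family you fit originally
summary(pool(fits))                        ## pooled estimates across imputations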
PS: it's not generally advisable to model grouping variables with fewer than 5 levels (e.g. site) as random effects, even if they conceptually/philosophically fall into that category ...
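If you decide to follow that advice, one option is simply to move site into the fixed-effects part of the formula, e.g. (sketch only; the family argument is whatever you used originally):

form_fixed <- type.x ~ std_rdden + std_forest2 + std_dev2 + site +
  (1 | Animal.ID) + Age.x + time.cat + Sex.x + road_category + season
## fit_fixed <- glmmTMB(form_fixed, data = df7, family = ...)  ## same family as your original fit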