When you estimate a model, the estimation function will, by default, drop observations (i.e., rows) for which at least one variable (i.e., column) used on either the LHS or the RHS of the formula is missing.
For example:
dt <- mtcars
dt[1:5, "wt"] <- NA
mod <- lm(mpg ~ wt + cyl + disp, data = dt)
summary(mod)
print(paste0(nrow(dt), " observations in the dataframe"))
print(paste0(nobs(mod), " have been used in the estimator"))
print(paste0(nrow(dt) - nobs(mod), " have been dropped"))
In this case, lm will drop the first 5 rows because in those rows the wt variable, used in the RHS, is missing.
The above code in fact returns:
"32 observations in the dataframe"
"27 have been used in the estimator"
"5 have been dropped"
Now, given a model, I would like to write a function that returns the dataframe of observations dropped, i.e., not used in the estimation.
This function should ideally work with as many estimation functions as possible.
Here are some possible estimation functions:
library(magrittr)
mods <- list()
mods %<>% rlist::list.append(lm(mpg ~ wt + cyl + disp, data = dt))
mods %<>% rlist::list.append(glm(am ~ wt, data = dt, family = binomial()))
mods %<>% rlist::list.append(glmmTMB::glmmTMB(mpg ~ wt + cyl + disp, data = dt))
mods %<>% rlist::list.append(glmmTMB::glmmTMB(am ~ wt + vs, data = dt, family = binomial()))
mods %<>% rlist::list.append(glmmTMB::glmmTMB(carb ~ wt + gear, data = dt, family = glmmTMB::nbinom2()))
mods %<>% rlist::list.append(pscl::zeroinfl(vs ~ wt + cyl + disp, data = dt))
mods %<>% rlist::list.append(pscl::hurdle(vs ~ wt + cyl + disp, data = dt))
mods %<>% rlist::list.append(fixest::feols(vs ~ wt | gear, data = dt))
mods %<>% rlist::list.append(fixest::feglm(am ~ wt + cyl | vs, data = dt, family = binomial()))
mods %<>% rlist::list.append(lme4::lmer(vs ~ wt + (1 | gear), data = dt))
Supposing we have such a function called get_dropped_obs, here is a test that such a function should pass (given the previous mods object):
for (i in seq_along(mods)) {
  tryCatch(
    {
      df_dropped <- get_dropped_obs(mods[[i]], dt)
      cat(paste0(
        "Model ",
        i,
        ": ",
        nrow(df_dropped),
        " observations have been dropped\n"
      ))
    },
    error = function(e) {
      # Report only the error message, not the whole condition object
      cat(paste0("Model ", i, ": ", conditionMessage(e), "\n"))
    }
  )
}
@iroha provided an implementation based on complete.cases.
It uses base R's terms to extract the terms object from the model, and base R's all.vars to get the variables used by the model:
> all.vars(terms(mods[[1]]))
[1] "mpg" "wt" "cyl" "disp"
It then subsets the dataframe to keep only the variables used by the model, uses complete.cases to flag the observations without any missing data, negates this mask to flag the observations with at least one missing value, and finally subsets the dataframe to return those observations:
get_dropped_obs <- function(mod, df) df[!complete.cases(df[all.vars(terms(mod))]), ]
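For readability, the same one-liner can be unpacked into its individual steps. This is only a sketch on the lm example from the question; the intermediate names (vars, used_cols, is_complete, dropped) are illustrative and not part of the original answer.

```r
# Same data and model as in the question
dt <- mtcars
dt[1:5, "wt"] <- NA
mod <- lm(mpg ~ wt + cyl + disp, data = dt)

# 1. Names of the variables that appear in the model formula
vars <- all.vars(terms(mod))

# 2. Keep only those columns of the original dataframe
used_cols <- dt[vars]

# 3. TRUE for rows with no missing value in any of those columns
is_complete <- complete.cases(used_cols)

# 4. Negate the mask and return the full rows that were not complete
dropped <- dt[!is_complete, ]

nrow(dropped)
```

Note that step 2 assumes every name returned by all.vars is a column of the dataframe; if the formula refers to an object that lives outside the dataframe, that step will fail with an "undefined columns" error.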
get_dropped_obs works on all models in OP's question:
Model 1: 5 observations have been dropped
Model 2: 5 observations have been dropped
Model 3: 5 observations have been dropped
Model 4: 5 observations have been dropped
Model 5: 5 observations have been dropped
Model 6: 5 observations have been dropped
Model 7: 5 observations have been dropped
Model 8: 5 observations have been dropped
Model 9: 5 observations have been dropped
Model 10: 5 observations have been dropped
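As a possible sanity check for the lm case only, the result can be compared with the rows that lm itself recorded as omitted, which base R exposes through na.action. This is a different technique from the complete.cases approach above, and I am only assuming it for lm here; whether model objects from the other packages support it is not something this answer relies on.

```r
# Same data and model as in the question
dt <- mtcars
dt[1:5, "wt"] <- NA
mod <- lm(mpg ~ wt + cyl + disp, data = dt)

# The function from the answer above
get_dropped_obs <- function(mod, df) df[!complete.cases(df[all.vars(terms(mod))]), ]

# Row positions that lm() recorded as omitted (NULL if nothing was dropped)
omitted <- na.action(mod)
dropped_by_lm <- dt[as.integer(omitted), ]

# Both approaches should point at the same rows
same_rows <- identical(
  rownames(get_dropped_obs(mod, dt)),
  rownames(dropped_by_lm)
)
print(same_rows)
```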