Update: The fixest package now includes a function obs() that retrieves the indices of the observations used in a regression.
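A minimal sketch of how obs() might be used, assuming an installed fixest version that exports it and assuming the returned indices are row positions in the data passed to the estimation. The names dat_na, mod, idx and est_sample are illustrative only and do not appear in the original post.

```r
library(fixest)

# same NA pattern as in the original post, built in base R:
# cyl is set to NA wherever it equals 4
dat_na <- mtcars
dat_na$cyl[dat_na$cyl == 4] <- NA

mod <- feols(mpg ~ cyl * disp, data = dat_na)

# assumption: obs() returns the row positions that entered the estimation
idx <- obs(mod)

# estimation sample: the rows of the original data used in the fit
est_sample <- dat_na[idx, ]

# sanity check: the subset should have as many rows as the model has observations
nrow(est_sample) == nobs(mod)
```

Subsetting the original data by position keeps all columns, including those that were not part of the regression.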
Original post: I would like to obtain the estimation sample from a model object, i.e. the observations that were not dropped due to missing values. This seems to be simple for standard lm regressions (using case.names()) but less so for more recent packages such as fixest.
Is there any general way to access the estimation sample, irrespective of the package used for estimation?
My attempts for both lm and fixest objects are:
library(tidyverse)
library(insight)
library(fixest)
# create data with NA -----------------------------------------------------
dat <- mtcars %>%
  as_tibble(rownames = "model") %>%
  mutate(cyl = na_if(cyl, 4))
# lm ----------------------------------------------------------------------
mod_lm <- lm(mpg ~ cyl * disp, data = dat)
obs <- as.integer(case.names(mod_lm))
dat %>%
  filter(row_number() %in% obs)
#> # A tibble: 21 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 4 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 5 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 6 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 7 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> 8 Merc 280C 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#> 9 Merc 450SE 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
#> 10 Merc 450SL 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
#> # … with 11 more rows
# fixest ------------------------------------------------------------------
mod_fe <- fixest::feols(mpg ~ cyl * disp, data = dat)
#> NOTE: 11 observations removed because of NA values (RHS: 11).
# does not work
case.names(mod_fe)
#> NULL
# remove missing values manually for all variables used in the regression
vars <- find_predictors(mod_fe, flatten = TRUE)
dat %>%
  filter(if_all(
    all_of(vars),
    ~ !is.na(.x)
  ))
#> # A tibble: 21 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 4 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 5 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 6 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 7 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> 8 Merc 280C 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#> 9 Merc 450SE 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
#> 10 Merc 450SL 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
#> # … with 11 more rows
Created on 2021-06-09 by the reprex package (v2.0.0)
The generic function case.names has no method for objects of class "fixest". The solution is to look at str(mod_fe) and write your own method.
case.names.fixest <- function(object, ...) {
  no <- object$obsRemoved
  seq_len(object$nobs_origin)[-no]
}
case.names(mod_fe)
# [1] 1 2 4 5 6 7 10 11 12 13 14 15 16 17 22 23 24 25 29 30 31
fixest version ‘0.12.1’
Following a comment by user @robertspierre, fixest objects no longer have a member named obsRemoved. These objects now have a member obs_selection, which is a named list with a member obsRemoved. The case.names.fixest method can be adapted to extract this member.
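case.names.fixest <- function(object, full = FALSE, ...) {
  # a vector of negative integers; NULL if the element is absent
  # (e.g. when nothing was removed)
  no <- object$obs_selection$obsRemoved
  all_obs <- seq_len(object$nobs_origin)
  # indexing with NULL would return integer(0), so keep all rows in that case
  if (is.null(no)) all_obs else all_obs[no]
}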
case.names(mod_fe)
# [1] 1 2 4 5 6 7 10 11 12 13 14 15 16 17 22 23 24 25 29 30 31
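For model classes that have no case.names() method at all, a package-agnostic fallback could be sketched in base R as below. The helper name estimation_rows() is hypothetical. The sketch assumes that rows are dropped only because of NA values in variables named in the model formula, so it ignores subset and weights arguments and any other reason a package may have for removing observations.

```r
# hypothetical helper: row positions with no NA in any variable of the model formula
estimation_rows <- function(model, data) {
  vars <- all.vars(formula(model))  # names of all variables in the formula
  which(complete.cases(data[, vars, drop = FALSE]))
}

# same NA pattern as in the question, built in base R
dat_na <- mtcars
dat_na$cyl[dat_na$cyl == 4] <- NA

mod <- lm(mpg ~ cyl * disp, data = dat_na)
rows <- estimation_rows(mod, dat_na)

# sanity check: the number of complete rows should match the fitted sample size
length(rows) == nobs(mod)
```

Unlike the approach based on find_predictors(), this also checks the response variable for missing values, because all.vars() returns the names on both sides of the formula.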