rregression

How do you retrieve the estimation sample in R?


Update: The fixest package now includes a function obs() that retrieves the indices of the observations used in a regression.

Original post: I would like to obtain the estimation sample from a model object, i.e. the observations that were not dropped due to missing values. This seems to be simple for standard lm regressions (using case.names()) but less so for more recent packages such as fixest.

Is there any general way to access the estimation sample, irrespective of the package used for estimation?

My attempts for both lm and fixest objects are:

library(tidyverse)
library(insight)
library(fixest)

# create data with NA -----------------------------------------------------

dat <- mtcars %>% 
  as_tibble(rownames = "model") %>% 
  mutate(cyl = na_if(cyl, 4))


# lm ----------------------------------------------------------------------

mod_lm <- lm(mpg ~ cyl * disp, data = dat)
obs <- as.integer(case.names(mod_lm))

dat %>% 
  filter(row_number() %in% obs)
#> # A tibble: 21 x 12
#>    model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  4 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  5 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  6 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  7 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#>  8 Merc 280C    17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#>  9 Merc 450SE   16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#> 10 Merc 450SL   17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#> # … with 11 more rows


# fixest ------------------------------------------------------------------

mod_fe <- fixest::feols(mpg ~ cyl * disp, data = dat)
#> NOTE: 11 observations removed because of NA values (RHS: 11).

# does not work
case.names(mod_fe)
#> NULL

# remove missing values manually for all variables used in the regression
vars <- find_predictors(mod_fe, flatten = TRUE)

dat %>% 
  filter(if_all(
    all_of(vars),
    ~ !is.na(.x)
  ))
#> # A tibble: 21 x 12
#>    model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  4 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  5 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  6 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  7 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#>  8 Merc 280C    17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#>  9 Merc 450SE   16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#> 10 Merc 450SL   17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#> # … with 11 more rows

Created on 2021-06-09 by the reprex package (v2.0.0)


Solution

  • Generic function case.names has no method written for objects of class "fixest". The solution is to look at str(mod_fe) and write your own method.

    case.names.fixest <- function(object, ...){
      no <- object$obsRemoved
      seq_len(object$nobs_origin)[-no]
    }
    
    case.names(mod_fe)
    # [1]  1  2  4  5  6  7 10 11 12 13 14 15 16 17 22 23 24 25 29 30 31
    

    Package fixest version ‘0.12.1’

    Following a comment by user @robertspierre, fixest objects no longer have a member named obsRemoved.
    These objects now have a named member nobs_origin which is a named list with a member obsRemoved.
    A method case.names.fixest can be adapted to extract this member.

    case.names.fixest <- function(object, full = FALSE, ...) {
      # a vector of negative integers
      no <- object$obs_selection$obsRemoved
      seq_len(object$nobs_origin)[no]
    }
    
    case.names(mod_fe)
    # [1]  1  2  4  5  6  7 10 11 12 13 14 15 16 17 22 23 24 25 29 30 31