I want to compute predictions for a fixed effects ols regression by group with the marginaleffects
package. When I generate new data for the prediction and do not specify the specific level of the fixed effects variables that I want the prediction to use, the prediction function calculates the values at am==0
, gear==0
and carb==4
(see below).
Where do these values come from? What's the most sensible way to calculate average predictions in the presence of fixed effects in this example?
library(tidyverse)
library(fixest)
library(marginaleffects)
dat <- mtcars
m1 <- dat %>%
mutate(cyl = as_factor(cyl)) %>%
feols(mpg ~ hp*cyl | am + gear + carb)
etable(m1)
#> m1
#> Dependent Var.: mpg
#>
#> hp -0.1316 (0.0504)
#> cyl6 -47.87 (27.42)
#> cyl8 -21.31 (14.80)
#> hp x cyl6 0.4434 (0.2619)
#> hp x cyl8 0.1471 (0.0665)
#> Fixed-Effects: ----------------
#> am Yes
#> gear Yes
#> carb Yes
#> _______________ ________________
#> S.E.: Clustered by: am
#> Observations 32
#> R2 0.88892
#> Within R2 0.38565
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pred1 <- predictions(m1,
newdata = datagrid(
hp = c(min(dat$hp):max(dat$hp)),
cyl = levels(dat$cyl)))
pred1
#>
#> Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 % cyl am gear carb hp
#> 9.76 8.72 1.119 0.263 1.9 -7.33 26.9 8 0 3 4 52
#> 9.78 8.60 1.136 0.256 2.0 -7.09 26.6 8 0 3 4 53
#> 9.79 8.49 1.154 0.249 2.0 -6.84 26.4 8 0 3 4 54
#> 9.81 8.37 1.172 0.241 2.1 -6.60 26.2 8 0 3 4 55
#> 9.82 8.25 1.190 0.234 2.1 -6.35 26.0 8 0 3 4 56
#> --- 274 rows omitted. See ?avg_predictions and ?print.marginaleffects ---
#> 14.09 23.89 0.590 0.555 0.8 -32.73 60.9 8 0 3 4 331
#> 14.11 24.01 0.588 0.557 0.8 -32.94 61.2 8 0 3 4 332
#> 14.12 24.12 0.586 0.558 0.8 -33.15 61.4 8 0 3 4 333
#> 14.14 24.24 0.583 0.560 0.8 -33.37 61.6 8 0 3 4 334
#> 14.16 24.36 0.581 0.561 0.8 -33.58 61.9 8 0 3 4 335
#> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, mpg, cyl, am, gear, carb, hp
Created on 2023-08-22 with reprex v2.0.2
You ask:
When I generate new data for the prediction and do not specify the specific level of the fixed effects variables that I want the prediction to use, the prediction function calculates the values at am==0, gear==0 and carb==4
(I think you meant gear=6. Must be a typo.)
When a variable is not explicitly specified by name in datagrid
, the value is picked by a function which depends on the variable class. Please read ?datagrid
and check out the FUN_character
, FUN_numeric
, and other arguments to that function.
In fixest
models, the variables after the |
are treated as factor variables, so the default is to take the mode. When there is more than one mode, the choice is arbitrary (I think it’s the first one to appear in the dataset, but not 100% sure). The user can of course customize this by specifying the relevant FUN_*
argument.
You ask:
What’s the most sensible way to calculate average predictions in the presence of fixed effects in this example?
In my opinion, it is typically most sensible to not specify the newdata
argument at all when computing average predictions by group. When you leave, newdata=NULL
, the function will compute predictions for each row of the original data, and then average them by group. This default behavior gives you predicted values averaged across the empirical distribution of the covariates.
That said, this is a statistical question whose answer may vary depending on your specific research context and application. If I were you, I would consult a local statistician or open a new question on Cross Validated (not Stack Overflow) with a lot more detail about the specific research question that you hope to answer.
library(fixest)
library(marginaleffects)
dat <- mtcars
mod <- feols(mpg ~ hp*cyl | am + gear + carb, data = dat)
avg_predictions(mod, by = c("am", "cyl"))
#
# am cyl Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
# 0 4 23.8 12.1 1.973 0.0484 4.4 0.163 47.4
# 0 6 18.8 14.2 1.321 0.1863 2.4 -9.088 46.7
# 0 8 14.9 17.2 0.867 0.3858 1.4 -18.807 48.7
# 1 4 27.8 11.8 2.360 0.0183 5.8 4.720 51.0
# 1 6 20.4 15.4 1.327 0.1845 2.4 -9.746 50.6
# 1 8 16.5 22.1 0.747 0.4550 1.1 -26.859 59.9
#
# Columns: am, cyl, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high