I am hoping to replicate what the fabulous Chuck Huber produced here (margins). In this example, he was able to calculate the predicted birthweight for mothers who were smokers and who were non-smokers. The State code went something like this:
regress bwt age i.smoke
margins smoke
Where bwt = birthweight, age=age and i.smoke= smoker vs. non-smoker as a factor variable.
I have gotten this far
library(tidyverse)
library(margins)
set.seed(42)
n <- 1000
dat <- data.frame(id=1:n,
treat = factor(sample(c('Treat','Control'), n, rep=TRUE, prob=c(.5, .5))),
age=sample(18:80, n, replace=TRUE),
sex = factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4))),
smoke=factor(sample(c("Never", 'Former', 'Current'), n, rep=TRUE, prob=c(.25, .6, .15))),
score=runif(n, min=16, max=45))
mod <- lm(score~treat+age+sex+smoke, data=dat)
summary(margins(mod, variables = "treat"))
The output provides the average marginal effect (AME) for treat
but only for a single level of this variable.
My questions are:
Many thanks,
Sandro
The margins
package is no longer actively developed, and as of 2024-04-23 was not available on CRAN. This archiving may be temporary.
You can use the marginaleffects
package instead (disclaimer: I am the author). The website includes over 30 chapters of tutorials and case studies, including an introductory Get Started page: https://marginaleffects.com
Before showing code, we need to agree on terminology. The margins smoke
command in Stata computes “marginal predictions”, that is, average predictions by subgroups. This is not the same as the Average Marginal Effect that the margins
package in R computes. The AME in this package is better called an “Average Slope”, and is defined as a partial derivative of the outcome equation with respect to the variable of interest.
You can compute all of these quantities easily with marginaleffects
. First, fit the model:
library(dplyr)
library(marginaleffects)
set.seed(42)
n <- 1000
dat <- data.frame(id=1:n,
treat = factor(sample(c('Treat','Control'), n, rep=TRUE, prob=c(.5, .5))),
age=sample(18:80, n, replace=TRUE),
sex = factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4))),
smoke=factor(sample(c("Never", 'Former', 'Current'), n, rep=TRUE, prob=c(.25, .6, .15))),
score=runif(n, min=16, max=45))
mod <- lm(score~treat+age+sex+smoke, data=dat)
Now, compute average predictions for each subgroup of the treat
variable. I show dplyr
code so you understand exacly what is going on:
avg_predictions(mod, by = "treat")
treat Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
Treat 30.9 0.38 81.4 <0.001 Inf 30.2 31.7
Control 30.9 0.36 86.0 <0.001 Inf 30.2 31.6
Columns: treat, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
Type: response
dat |> mutate(prediction = predict(mod, newdata = dat)) |>
summarize(prediction = mean(prediction), .by = treat)
treat prediction
1 Treat 30.90587
2 Control 30.92291
You can use the hypothesis
argument to conduct a hypothesis test that compares the two quantities:
avg_predictions(mod, by = "treat", hypothesis = "sequential")
Term Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
Control - Treat 0.017 0.523 0.0326 0.974 0.0 -1.01 1.04
Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
Type: response
dat |> mutate(prediction = predict(mod, newdata = dat)) |>
summarize(prediction = mean(prediction), .by = treat) |>
summarize(diff = diff(prediction))
diff
1 0.01703766
Alternatively, you can compare average counterfactual predictions using the avg_comparisons()
function:
avg_comparisons(mod, variables = "treat")
Term Contrast Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
treat Treat - Control -0.0108 0.523 -0.0207 0.983 0.0 -1.04 1.01
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
Type: response
d0 <- mutate(dat, treat = "Control")
d1 <- mutate(dat, treat = "Treat")
mean(predict(mod, newdata = d1) - predict(mod, newdata = d0))
[1] -0.01083374