Let's say I am interested in predicting the probability of surviving after the Titanic sunk across different social classes. Since this isn't an RCT, the groups are unlikely to be balanced, so I also want to account for any differences in surviving that arise from gender and age. I ran a logit model, but I find communicating odds ratios difficult whilst, risk differences and risk ratios are a little easier for laypeople to understand. My questions are:
library(tidyverse)
## Load Titanic library to get the dataset
library(titanic)
library(marginaleffects)
## Load the datasets
data("titanic_train")
data("titanic_test")
## Setting Survived column for test data to NA
titanic_test$Survived <- NA
## Combining Training and Testing dataset
complete_data <- rbind(titanic_train, titanic_test) %>%
filter(!Survived == "Unknown") %>%
mutate(Pclass = factor(Pclass),
Sex = factor(Sex),
Survived = factor(Survived))
lrm_unadj <- glm(Survived ~ Pclass,
data = complete_data, family = binomial(link = "logit"))
lrm_adj <- glm(Survived ~ Pclass + Sex + Age,
data = complete_data, family = binomial(link = "logit"))
# Risk difference (difference in probabilities) - should this be similar to LPM?
avg_comparisons(lrm_unadj, variables = "Pclass")
avg_comparisons(lrm_adj, variables = "Pclass")
lm(as.numeric(Survived) ~ Pclass, data = complete_data)%>%
lmtest::coeftest(., vcov = sandwich::vcovHC, type = "HC3") %>%
broom::tidy(conf.int = T)
lm(as.numeric(Survived) ~ Pclass + Sex + Age, data = complete_data)%>%
lmtest::coeftest(., vcov = sandwich::vcovHC, type = "HC3") %>%
broom::tidy(conf.int = T)
I strongly recommend you read this page, which explains all of this in a lot of details, using the same data: https://marginaleffects.com/vignettes/comparisons.html
Is the risk difference the same as the difference in probability, and in turn, approximately the same as the coefficients (i.e. change in probability) from an Linear Probability Model?
Yes.
how does marginaleffects calculate...
There are two basic approaches. First, you can do:
PClass
to "1st" for all observation and compute the predicted probability for each row.PClass
to "2nd" for all observation and compute the predicted probability for each row.This is what code like this would do:
avg_comparisons(fit)
The second strategy would be:
SexCode
=1, Age
=25.PClass
="1st"PClass
="2nd"This is what code like this would do:
comparisons(fit, newdata = datagrid(SexCode=1, Age=25))
Again, please refer to the documentation. This is all explained in a lot of detail, with many examples of manual computation: