My goal is to interpret the coefficients of a hurdle model through estimated marginal means. I prefer to interpret probabilities (back-transformed from the logit scale), rather than log-odds (model coefficients) or odd-ratio (exp(log-odds)). I would like to use emmeans()
for this goal, as it is compatible with many models, and I have experience using it in linear models and binomial GLMs.
The hurdle part of the hurdle model yields the same coefficients as a binomial GLM, which has been commented elsewhere (here or here).
However, I don't fully understand why, depending on the setting, the estimated means from the hurdle do not match those from the binomial GLM. In particular
lin.pred = FALSE
. This yields different probabilites from a binomial GLM. Following the emmeans
documentation, I think they are averaged at the probability scale.lin.pred = TRUE, type = "response"
. This setting yields the same probabilites as binomial GLM. Following the emmeans documentation, I think they are averaged at the logit scale.I have two questions:
Inf
degrees of freedom?See below a reproducible example, using the number of papers produced by Biochemstry students during the least 3 years of PhD, focusing on a pairwise comparison between two factor levels (single vs married).
My preferred approach would be the one that matches the estimated probabilites from the binomial GLM. I would interpret that, married PhD students are as likely than single students, to publish at least one paper (= overcome the hurdle). Married students have a 74% chance of producing at least one paper, whereas single ones have a 67% change: this 7% change is a small difference, and not significant.
Many thanks in advance!
library(emmeans)
library(pscl)
#> Warning: package 'pscl' was built under R version 4.2.3
#> Classes and Methods for R developed in the
#> Political Science Computational Laboratory
#> Department of Political Science
#> Stanford University
#> Simon Jackman
#> hurdle and zeroinfl functions by Achim Zeileis
data("bioChemists", package = "pscl")
# str(bioChemists)
hurdle <- hurdle(art ~ ., data = bioChemists, dist = "poisson", zero.dist = "binomial", link = "logit")
glm <- glm(art > 0 ~ ., data = bioChemists, family = binomial(link = "logit"))
# Same coefficients
coef(summary(hurdle))$zero
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.23679601 0.29551883 0.8012891 4.229643e-01
#> femWomen -0.25115113 0.15910522 -1.5785222 1.144457e-01
#> marMarried 0.32623358 0.18081823 1.8042074 7.119880e-02
#> kid5 -0.28524872 0.11113043 -2.5667921 1.026441e-02
#> phd 0.02221940 0.07955710 0.2792887 7.800233e-01
#> ment 0.08012135 0.01301763 6.1548321 7.515710e-10
coef(glm)
#> (Intercept) femWomen marMarried kid5 phd ment
#> 0.23679601 -0.25115113 0.32623358 -0.28524872 0.02221940 0.08012135
#######
## lin.pred = FALSE
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = FALSE)
#> $emmeans
#> mar emmean SE df lower.CL upper.CL
#> Single 0.806 0.167 903 0.478 1.13
#> Married 0.858 0.122 903 0.619 1.10
#>
#> Results are averaged over the levels of: fem
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df t.ratio p.value
#> Single - Married -0.0522 0.213 903 -0.245 0.8066
#>
#> Results are averaged over the levels of: fem
#######
## lin.pred = TRUE, type = "response"
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = TRUE, type = "response")
#> $emmeans
#> mar prob SE df lower.CL upper.CL
#> Single 0.677 0.0306 903 0.615 0.734
#> Married 0.744 0.0200 903 0.703 0.781
#>
#> Results are averaged over the levels of: fem
#> Confidence level used: 0.95
#> Intervals are back-transformed from the logit scale
#>
#> $contrasts
#> contrast odds.ratio SE df null t.ratio p.value
#> Single / Married 0.722 0.13 903 1 -1.804 0.0715
#>
#> Results are averaged over the levels of: fem
#> Tests are performed on the log odds ratio scale
emmeans::emmeans(glm, specs = pairwise ~ mar, regrid = "response")
#> $emmeans
#> mar response SE df asymp.LCL asymp.UCL
#> Single 0.677 0.0305 Inf 0.617 0.736
#> Married 0.743 0.0201 Inf 0.704 0.783
#>
#> Results are averaged over the levels of: fem
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df z.ratio p.value
#> Single - Married -0.0667 0.0377 Inf -1.772 0.0764
#>
#> Results are averaged over the levels of: fem
#######
# Estimates and contrasts at the logit scale appear to match
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = TRUE)
#> $emmeans
#> mar emmean SE df lower.CL upper.CL
#> Single 0.741 0.140 903 0.466 1.02
#> Married 1.068 0.105 903 0.862 1.27
#>
#> Results are averaged over the levels of: fem
#> Results are given on the logit (not the response) scale.
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df t.ratio p.value
#> Single - Married -0.326 0.181 903 -1.804 0.0715
#>
#> Results are averaged over the levels of: fem
#> Results are given on the log odds ratio (not the response) scale.
emmeans::emmeans(glm, specs = pairwise ~ mar)
#> $emmeans
#> mar emmean SE df asymp.LCL asymp.UCL
#> Single 0.741 0.140 Inf 0.467 1.02
#> Married 1.068 0.105 Inf 0.862 1.27
#>
#> Results are averaged over the levels of: fem
#> Results are given on the logit (not the response) scale.
#> Confidence level used: 0.95
#>
#> $contrasts
#> contrast estimate SE df z.ratio p.value
#> Single - Married -0.326 0.181 Inf -1.804 0.0712
#>
#> Results are averaged over the levels of: fem
#> Results are given on the log odds ratio (not the response) scale.
sessionInfo()
#> R version 4.2.0 (2022-04-22 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22000)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8 LC_CTYPE=Spanish_Spain.utf8
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C
#> [5] LC_TIME=Spanish_Spain.utf8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] pscl_1.5.5.1 emmeans_1.8.5-9000004
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.2.0 highr_0.9 tools_4.2.0
#> [4] digest_0.6.29 evaluate_0.15 lifecycle_1.0.3
#> [7] lattice_0.20-45 rlang_1.1.1 reprex_2.0.2
#> [10] Matrix_1.4-1 cli_3.6.1 rstudioapi_0.13
#> [13] yaml_2.3.5 mvtnorm_1.1-3 xfun_0.40
#> [16] fastmap_1.1.0 coda_0.19-4 withr_2.5.0
#> [19] stringr_1.5.0 knitr_1.39 fs_1.5.2
#> [22] vctrs_0.6.3 grid_4.2.0 glue_1.6.2
#> [25] survival_3.3-1 rmarkdown_2.14 multcomp_1.4-19
#> [28] TH.data_1.1-1 magrittr_2.0.3 codetools_0.2-18
#> [31] htmltools_0.5.6 splines_4.2.0 MASS_7.3-57
#> [34] xtable_1.8-4 numDeriv_2016.8-1.1 sandwich_3.0-1
#> [37] estimability_1.4.1 stringi_1.7.6 zoo_1.8-10
Created on 2023-09-06 with reprex v2.0.2
Russell Lenth (developper of the emmeans package), provided an answer over at GitHub. I paste it here, with a comparison between a hurdle model fitted with emmeans and glmmTMB, which show consistent results.
library(emmeans)
#> Warning: package 'emmeans' was built under R version 4.2.3
library(glmmTMB)
#> Warning: package 'glmmTMB' was built under R version 4.2.3
#> Warning in checkMatrixPackageVersion(): Package version inconsistency detected.
#> TMB was built with Matrix version 1.6.1
#> Current Matrix version is 1.4.1
#> Please re-install 'TMB' from source using install.packages('TMB', type = 'source') or ask CRAN for a binary version of 'TMB' matching CRAN's 'Matrix' package
library(pscl)
#> Warning: package 'pscl' was built under R version 4.2.3
#> Classes and Methods for R developed in the
#> Political Science Computational Laboratory
#> Department of Political Science
#> Stanford University
#> Simon Jackman
#> hurdle and zeroinfl functions by Achim Zeileis
# Create datasets: number of papers during a Phd
data("bioChemists", package = "pscl")
# Declare models
pscl.hurdle <- pscl::hurdle(art ~ fem + mar, data = bioChemists, dist = "poisson", zero.dist = "binomial", link = "logit")
glmmtmb.hurdle <- glmmTMB::glmmTMB(art ~ fem + mar, data = bioChemists, family = truncated_poisson(link = "log"), zi = ~ fem + mar)
# Truncated count
emmeans(pscl.hurdle, ~ mar, mode = "count", lin.pred = TRUE, type = "response")
#> mar count SE df lower.CL upper.CL
#> Single 2.07 0.1122 909 1.86 2.31
#> Married 2.09 0.0807 909 1.94 2.26
#>
#> Results are averaged over the levels of: fem
#> Confidence level used: 0.95
#> Intervals are back-transformed from the log scale
emmeans(glmmtmb.hurdle, ~ mar, comp = "cond", type = "response")
#> mar rate SE df asymp.LCL asymp.UCL
#> Single 2.07 0.1122 Inf 1.86 2.30
#> Married 2.09 0.0807 Inf 1.94 2.26
#>
#> Results are averaged over the levels of: fem
#> Confidence level used: 0.95
#> Intervals are back-transformed from the log scale
# Binomial
emmeans(pscl.hurdle, ~ mar, mode = "prob0")
#> mar emmean SE df lower.CL upper.CL
#> Single 0.313 0.0265 909 0.261 0.365
#> Married 0.297 0.0191 909 0.259 0.334
#>
#> Results are averaged over the levels of: fem
#> Confidence level used: 0.95
emmeans(glmmtmb.hurdle, ~ mar, comp = "zi", type = "response")
#> mar response SE df asymp.LCL asymp.UCL
#> Single 0.313 0.0267 Inf 0.263 0.367
#> Married 0.296 0.0190 Inf 0.261 0.335
#>
#> Results are averaged over the levels of: fem
#> Confidence level used: 0.95
#> Intervals are back-transformed from the logit scale
# Response (binomial * truncated count)
emmeans(pscl.hurdle, ~ mar, mode = "response")
#> mar emmean SE df lower.CL upper.CL
#> Single 1.64 0.0900 909 1.47 1.82
#> Married 1.69 0.0633 909 1.57 1.82
#>
#> Results are averaged over the levels of: fem
#> Confidence level used: 0.95
emmeans(glmmtmb.hurdle, ~ mar, comp = "response")
#> mar emmean SE df asymp.LCL asymp.UCL
#> Single 1.64 0.0900 Inf 1.47 1.82
#> Married 1.69 0.0633 Inf 1.57 1.82
#>
#> Results are averaged over the levels of: fem
#> Confidence level used: 0.95
sessionInfo()
#> R version 4.2.0 (2022-04-22 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22000)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8 LC_CTYPE=Spanish_Spain.utf8
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C
#> [5] LC_TIME=Spanish_Spain.utf8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] pscl_1.5.5.1 glmmTMB_1.1.7 emmeans_1.8.8
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.10 nloptr_2.0.1 compiler_4.2.0
#> [4] highr_0.9 TMB_1.9.6 tools_4.2.0
#> [7] boot_1.3-28 lme4_1.1-29 digest_0.6.29
#> [10] nlme_3.1-157 evaluate_0.15 lifecycle_1.0.3
#> [13] lattice_0.20-45 rlang_1.1.1 reprex_2.0.2
#> [16] Matrix_1.4-1 cli_3.6.1 rstudioapi_0.13
#> [19] yaml_2.3.5 mvtnorm_1.1-3 xfun_0.40
#> [22] fastmap_1.1.0 coda_0.19-4 withr_2.5.0
#> [25] stringr_1.5.0 knitr_1.39 fs_1.5.2
#> [28] vctrs_0.6.3 grid_4.2.0 glue_1.6.2
#> [31] survival_3.3-1 rmarkdown_2.14 multcomp_1.4-19
#> [34] minqa_1.2.4 TH.data_1.1-1 magrittr_2.0.3
#> [37] codetools_0.2-18 htmltools_0.5.6 splines_4.2.0
#> [40] MASS_7.3-57 xtable_1.8-4 numDeriv_2016.8-1.1
#> [43] sandwich_3.0-1 estimability_1.4.1 stringi_1.7.6
#> [46] zoo_1.8-10
Created on 2023-09-14 with reprex v2.0.2