My goal is to calculate a 95% prediction interval using a Bayesian zero-inflated beta model in R. I have had no issues doing so but when I plotted the posterior predictive density using stat_halfeye()
the density looks extremely flat. This must be due to the zeros in the data (density is infinite at zero) and the way the density is scaled between zero and one?! I tried to reproduce this behavior using a scaled density (from 0 to 1) via geom_density()
to better understand what stat_halfeye()
is doing but geom_density()
does look not as flat. In fact it looks similar to the histogram approach using stat_histinterval()
.
My question is how is stat_halfeye()
scaling the density? I looked for ?stat_halfeye
and ?stat_slabinterval
but couldn't find an answer there.
Here's a reproducible example:
library(tidyverse)
library(tidybayes)
library(brms)
set.seed(1232)
# set zero inflation
zero_inflation <- 0.3
# generate roughly 30% zeros and 70% ones
zero_obs <- rbinom(n = 30,
size = 1,
prob = 1 - zero_inflation)
# generate beta random observations
cover_obs <- rbeta(n = 30, shape1 = 2, shape2 = 9)
# multiply the above vectors to get the final zero-inflated observation vector
combined_obs <- tibble(y_beta_zero = zero_obs * cover_obs)
# fit zero-inflated beta model
beta_zero_model <-
brm(y_beta_zero ~ 1, family = zero_inflated_beta(),
data = combined_obs)
# now I want the 95% prediction interval from the posterior predictive
combined_obs %>%
add_predicted_draws(beta_zero_model) %>%
ggplot(aes(x = .prediction)) +
stat_halfeye(.width = c(.95)) +
ggtitle("stat_halfeye")
# It looks like the density in stat_halfeye is scaled (y axis from 0 to 1)
# Tried to reproduce with the `geom_density()` function but it didn't work.
combined_obs %>%
add_predicted_draws(beta_zero_model) %>%
ggplot(aes(x = .prediction, after_stat(scaled))) +
geom_density() +
ggtitle("geom_density, scaled")
# also tried the histogram approach and this looks more like the `geom_density()` approach
combined_obs %>%
add_predicted_draws(beta_zero_model) %>%
ggplot(aes(x = .prediction)) +
stat_histinterval(.width = c(.95)) +
ggtitle("stat_histinterval")
Created on 2023-12-05 with reprex v2.0.2
The main difference here is that the two approaches use different bandwidth estimators by default. ggplot2::geom_density()
uses the default bandwidth estimator of density()
, which is "nrd0"
(i.e. bw.nrd0()
), while ggdist::stat_slabinterval()
uses "dpi"
(i.e. ggdist::bandwidth_dpi()
, which is equivalent to bw.SJ(..., method = "dpi")
). Notably, the documentation of density()
recommends using "SJ"
instead of "nrd0"
, which is only the default in that function for historical reasons.
You can see what's happening by plotting densities with both estimators on the same scale. I'll use ggdist::stat_slab()
instead of stat_slabinterval()
because the interval makes it harder to see the comparison here:
combined_obs %>%
add_predicted_draws(beta_zero_model) %>%
ggplot(aes(x = .prediction)) +
stat_slab(density = ggdist::density_bounded(bandwidth = "nrd0"), color = "gray25") +
stat_slab(color = "red", fill = NA, linetype = "11") +
# ensure both slabs are scaled together
scale_thickness_shared() +
theme_ggdist()
The gray slab is (roughly) what you would get from geom_density()
(except that ggdist::stat_slab()
also applies a correction for the boundedness of the data, which will impact the size of the spike, but this is dwarfed by the bandwidth differences). The red dotted line is what you get from ggdist::stat_slab()
by default.
It's a little hard to see, so let's zoom in:
combined_obs %>%
add_predicted_draws(beta_zero_model) %>%
ggplot(aes(x = .prediction)) +
stat_slab(density = ggdist::density_bounded(bandwidth = "nrd0"), color = "gray25") +
stat_slab(color = "red", fill = NA, linetype = "11") +
scale_thickness_shared() +
# zoom in
coord_cartesian(ylim = c(0, 0.2)) +
theme_ggdist()
Basically what is happening is that ggdist's default bandwidth estimator is selecting a smaller bandwidth, so the spike is narrower and taller. Other than that, they essentially agree on the shape of the rest of the distribution. Arguably (though I'm biased here), ggdist's choice is more accurate, as it doesn't have a thick portion of values around (but not equal to) 0 with high density.
Ultimately though, the problem with point masses in densities or histograms is that you can make them arbitrarily tall by adjusting the bandwidth. Here's an example with histograms:
combined_obs %>%
add_predicted_draws(beta_zero_model) %>%
ggplot(aes(x = .prediction)) +
stat_slab(density = "histogram", breaks = breaks_fixed(width = 0.05), align = "boundary", color = "gray70", fill = NA) +
stat_slab(density = "histogram", breaks = breaks_fixed(width = 0.04), align = "boundary", color = "gray60", fill = NA) +
stat_slab(density = "histogram", breaks = breaks_fixed(width = 0.03), align = "boundary", color = "gray50", fill = NA) +
stat_slab(density = "histogram", breaks = breaks_fixed(width = 0.02), align = "boundary", color = "gray40", fill = NA) +
stat_slab(density = "histogram", breaks = breaks_fixed(width = 0.01), align = "boundary", color = "gray30", fill = NA) +
scale_thickness_shared() +
theme_ggdist()
The correct solution here is probably something like a plot of the proportion of zeros alongside a density of the rest of the distribution; currently ggdist doesn't have an out-of-the-box solution, but there is an open issue related to this.