I am trying to create a graph plotting distributions of estimated difference between several groups using the ggdist()
package in R. The help has been useful but it does not cover anywhere near the breadth of functionality possible with the package. Here is an example analysis using the iris dataset in R and brms, examining the association between Species (three-level categorical) and Sepal Length (also three-level categorical, created for this analysis from a continuous variable) on Sepal Width, a numeric variable.
library(tidyverse)
library(brms)
library(ggdist)
library(tidybayes)
library(marginaleffects)
library(modelr)
# create dataset
iris %>%
mutate(SepalLength_cat = factor(case_when(Sepal.Length >= 4.3 & Sepal.Length < 5 ~ "cat1",
Sepal.Length >= 5.1 & Sepal.Length < 6.4 ~ "cat2",
TRUE ~ "cat3"))) -> irisAlt
# run model
mod <- brm(formula = Sepal.Width ~ Species*SepalLength_cat,
data = irisAlt)
# sample from the posterior distribution at each level of the two three-level variables (so nine cells)
data_grid(data = irisAlt,
Species = c("setosa", "versicolor", "virginica"),
SepalLength_cat = c("cat1", "cat2", "cat3")) %>%
add_epred_draws(mod,
re_formula = NA,
seed = 1234) -> modPred
Now using the compare_levels()
function from tidybayes
we compare estimated difference in Sepal Width between levels of Sepal Length at each level of Species.
# compare levels
modPred %>%
compare_levels(variable = .epred,
by = SepalLength_cat) %>%
mutate(SepalLength_cat = factor(x = SepalLength_cat,
levels = c("cat2 - cat1",
"cat3 - cat1",
"cat3 - cat2")),
Species = factor(Species)) -> compCat
What I want to do is first plot the distribution, shading the 95% HDI, adding a point interval below also showing the HDI and adding spikes. What I also want to include is
(a) an interval with a ROPE between -0.25 and 0.25
(b) shading the distribution falling within that ROPE a different colour.
(c) Adding labels, in a convenient place, indicating the proportion of posterior density falling within that ROPE.
Here is my attempt
compCat %>%
ggplot(mapping = aes(x = .epred,
y = SepalLength_cat)) +
stat_histinterval(mapping = aes(fill = after_stat(level)),
point_interval = median_hdi,
.width = c(.95)) +
stat_spike(mapping = aes(linetype = after_stat(at)),
at = c("median",
"hdi")) +
stat_spike(mapping = aes(linetype = after_stat(at)),
at = c(-0.25, 0.25),
colour = "red") +
scale_fill_manual(values = "skyblue",
na.value = "gray75") +
scale_thickness_shared() + # makes sure spike same size as distribution
facet_grid(.~Species) +
labs(y = "Sepal Length comparison",
x = "Sepal Width (in mm)") +
theme_minimal() +
theme(legend.position = "none")
Which should come out looking like this
I used spikes for the ROPE instead of an interval, but I would really like to use an interval and especially to shade the area of the distribution that falls between -0.25 and 0.25, and annotate with labels indicating proportion of density within the ROPE. One of the problems is that stat_interval()
and its related functions don't seem to accept a fixed cartesian interval, instead calculating HDI, which moves around depending on the distribution of estimates.
I don't know if ggdist
contains the necessary functionality to do this, but if it does, or if anyone can suggest another way of achieving my aims, I would appreciate it.
I'm not sure how to do this within ggdist
(or if it can be done). So, here's a way to construct the plot by directly calculating the desired density values from the posterior draws and then creating a custom ggplot.
The overlapping shading of the ROPE and HDI might be confusing, but hopefully you can play around with the code to get something that meets your needs.*
# Calculate HDIs
hdi = compCat %>%
group_by(Species, SepalLength_cat) %>%
summarise(as_tibble(hdi(.epred))) %>%
rename(hdi.lwr=V1, hdi.upr=V2)
# Calculate densities
adj = 0.5
dens.hist = compCat %>%
group_by(Species, SepalLength_cat) %>%
group_split() %>%
map_df(~{
dens = density_bounded(.x$.epred, adjust=adj)
tibble(Species=.x$Species[1], SepalLength_cat=.x$SepalLength_cat[1],
x=dens$x, .epred=dens$y)
}) %>%
left_join(hdi)
# Calculate ROPE density
rope.rng = c(-0.25, 0.25)
rope.dens = dens.hist %>%
filter(x >= min(rope.rng), x <= max(rope.rng)) %>%
group_by(Species, SepalLength_cat) %>%
summarise(rope.dens = sum(mean(diff(x)) * .epred))
rope.color = hcl(0, 80, 30, alpha=c(1, 0.7))
hdi.color = hcl(240, 100, 30, alpha=c(1,0.5))
theme_set(theme_minimal(base_size=14))
dens.hist %>%
ggplot(aes(x, .epred)) +
geom_hline(yintercept=0, linewidth=0.1, colour="black") +
# Overall densities
geom_line(linewidth=0.3) +
# Shade ROPE
geom_area(data=. %>%
filter(x >= min(rope.rng), x <= max(rope.rng)),
aes(fill="ROPE")) +
# Shade HDI
geom_area(data=. %>%
filter(x >= hdi.lwr, x <= hdi.upr),
aes(fill="HDI")) +
# HDI segment
geom_segment(data=hdi,
aes(x=hdi.lwr, xend=hdi.upr, y=-0.1, yend=-0.1,
colour="HDI")) +
# ROPE segment
geom_line(data=. %>%
group_by(Species, SepalLength_cat) %>%
reframe(x=rope.rng, .epred=-0.2),
aes(colour="ROPE")) +
# ROPE density value
geom_text(data=rope.dens,
size=3.5, vjust=0.5,
aes(label=paste0(round(rope.dens*100, 1),"%"),
x = mean(rope.rng),
y = -0.4,
colour="ROPE")) +
facet_grid(cols=vars(Species), row=vars(fct_rev(SepalLength_cat)), switch="y") +
theme(
strip.text.y.left=element_text(angle=0, vjust=0.02),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()
) +
scale_y_continuous(expand=c(0.03, 0.03)) +
scale_fill_manual(values=c(ROPE=rope.color[2], HDI=hdi.color[2])) +
scale_colour_manual(values=c(ROPE=rope.color[1], HDI=hdi.color[1])) +
guides(colour="none") +
labs(y="Sepal Length comparison", x="Sepal Width", fill=NULL)
* After not receiving an answer for a few days here, this question was also cross-posted to the Stan Community, so I've also posted this answer there.