I am fitting GAMs in R using the mgcv R package. I then want to perform posterior predictive checks to compare observed and simulated outcomes. A toy example using the performance R package:
library(mgcv)
library(performance)
df <- mtcars |>
  tibble::as_tibble() |>
  dplyr::mutate(carb = as.integer(carb))
model <- mgcv::gam(
  carb ~ s(wt, k = 5),
  data = df,
  family = mgcv::ocat(R = 8)
)
plot(check_model(model))
That does not work:
Error in d[, !(names(data) %in% all.varying), drop = FALSE] :
incorrect number of dimensions
I am trying to get something similar to this:
The only issue seems to be that when performance calls simulate(), what it gets is not supported. I can obtain simulated outcomes using mgcViz:
library(ggplot2)
sims <- mgcViz::simulate.gam(model, newdata = df, nsim = 10) |>
  tibble::as_tibble() |>
  tidyr::pivot_longer(dplyr::everything())
ggplot(sims, aes(x = value)) +
  geom_point(aes(y = after_stat(count)), stat = "count")
The problem is that I only obtain one point, not the jitter of points as in the image above. How can I achieve that?
EDIT:
After accepting the answer below, I realised that the results of the proposed solution and those of performance are not the same.
library(mgcv)
library(performance)
library(patchwork)
library(ggplot2) # needed for the plotting code below
df <- mtcars |>
  tibble::as_tibble() |>
  dplyr::mutate(carb = factor(carb, ordered = TRUE))
model <- MASS::polr(
  formula = carb ~ wt,
  data = df,
  Hess = TRUE,
  method = "logistic"
)
sims <- simulate(model, newdata = df, nsim = 100) |>
  tibble::as_tibble() |>
  tidyr::pivot_longer(dplyr::everything(),
    names_to = 'replicate',
    values_to = 'carb')
sims_summ <- sims |> dplyr::count(carb, replicate)
p1 <- ggplot(mapping = aes(carb)) +
  geom_jitter(
    aes(y = n, color = 'Model-predicted data'),
    data = sims_summ,
    height = 0,
    width = 0.1,
    alpha = 0.1
  ) +
  stat_summary(
    aes(y = n, color = 'Model-predicted data'),
    data = sims_summ,
    geom = 'pointrange',
    fun.data = 'mean_cl_boot'
  ) +
  geom_point(aes(color = 'Observed data'),
    data = df,
    stat = 'count',
    size = 5) +
  scale_color_manual(values = c('black', 'firebrick')) +
  theme_classic()
p2 <- plot(check_predictions(model, iterations = 100))
p1 + p2
Here is one way to do this using my gratia package. For this example, however, you will need the development version of the package; the current released version on CRAN does not support the ocat() family, but I recently added support in the dev version on GitHub. Instructions on how to install the dev version from my R Universe package repo, without needing any of the build tools, are in the repo's readme.
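A typical R Universe install looks something like the snippet below; the repository URL shown is an assumption based on the usual R Universe naming, so treat the readme as the authoritative source:
install.packages(
  "gratia",
  # the repository URL below is assumed -- see the gratia readme for the exact command
  repos = c("https://gavinsimpson.r-universe.dev", "https://cloud.r-project.org")
)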
Begin by loading required packages and fitting your example model.
library("mgcv")
library("gratia")
library("dplyr")
library("ggplot2")
df <- mtcars |>
as_tibble() |>
mutate(carb = as.integer(carb))
model <- gam(
carb ~ s(wt, k=5),
data = df,
family = mgcv::ocat(R=8)
)
gratia does provide a simulate.gam() method, but I think that the tidy alternatives also provided by gratia are more useful for this kind of thing, especially if you are using ggplot2.
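For comparison, the simulate() route is a one-liner, but the result is less convenient for plotting. The sketch below assumes the method follows the standard simulate(object, nsim, seed, ...) interface and returns one column of simulated responses per draw; check the gratia documentation for the actual return format.
# sketch only: assumes gratia's simulate() method follows the standard
# simulate() generic and returns one column of simulated responses per draw,
# which would need reshaping before it plays nicely with ggplot2
raw_sims <- simulate(model, nsim = 50, seed = 42)
dim(raw_sims)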
If we just want to simulate new data from the model then we can use predicted_samples(). The values it returns are conditional upon the fitted values of the model; in other words, as with simulate(), they do not reflect the uncertainty in the fitted model itself. We can change that later by using posterior_samples(), but for now I'll proceed under the assumption that you want to simulate new data from the estimated model.
sims <- predicted_samples(model, n = 50, seed = 42, data = df) |>
  rename(carb = .response)
Here, I'm asking for 50 simulated data sets (n = 50). To make the plotting easier in this example I have renamed the .response variable returned by predicted_samples() to carb.
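A quick glimpse() shows the long format that the plotting code below relies on; the exact set of columns may differ between gratia versions, but the .draw column indexing each simulated data set is the one we need.
# one row per observation per simulated data set; .draw indexes the draw and
# is what the plot below groups by (other columns may vary by gratia version)
glimpse(sims)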
Now we can prepare the posterior predictive check (PPC) plot:
df |>
  ggplot(
    aes(
      x = carb
    )
  ) +
  geom_point(stat = "count", size = 3) +
  geom_point( # add on the data simulated from the model
    data = sims,
    mapping = aes(x = carb, group = .draw), # group by .draw for the stat summary
    colour = "steelblue",
    alpha = 0.3,
    stat = "count", # apply a summary to count the obs per carb
    position = position_dodge(width = 0.25) # jitter
  )
In this PPC plot, the focus is on whether the model can generate new data sets with roughly the same number of observations per value of carb as was observed in the data used to fit the model. Hence, while I could use dplyr to compute these values and then plot, here I'm using ggplot2's ability to apply statistical transforms to the data provided to the geom; in this case the stat_count() stat is used to count the observations per carb per simulated data set (because of the grouping by .draw). The resulting PPC plot is shown below.
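The dplyr route mentioned above would look something like this sketch, which counts the simulated observations per value of carb within each draw and then jitters those precomputed counts:
# count simulated observations per carb value within each simulated data set
sim_counts <- sims |>
  count(.draw, carb)
# observed counts as black points, simulated counts jittered in blue
ggplot(df, aes(x = carb)) +
  geom_point(stat = "count", size = 3) +
  geom_jitter(
    data = sim_counts,
    mapping = aes(x = carb, y = n),
    colour = "steelblue",
    alpha = 0.3,
    width = 0.1,
    height = 0
  )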
A true posterior predictive check plot would use the posterior predictive distribution, which includes both the sampling variation and the model uncertainty. We can modify the above code to do just that with posterior_samples():
sims2 <- posterior_samples(model, n = 50, seed = 42, data = df) |>
  rename(carb = .response)
df |>
  ggplot(
    aes(
      x = carb
    )
  ) +
  geom_point(stat = "count", size = 3) +
  geom_point( # add on the data simulated from the model
    data = sims2,
    mapping = aes(x = carb, group = .draw), # group by .draw for the stat summary
    colour = "steelblue",
    alpha = 0.3,
    stat = "count", # apply a summary to count the obs per carb
    position = position_dodge(width = 0.25) # jitter
  )
which produces the corresponding PPC plot, now incorporating the model uncertainty as well.