rggplot2mgcv

Posterior predictive check for GAM (mgcv in R)


I am fitting GAMs in R using the mgcv R package. I then want to perform posterior predictive checks to compare observed and simulated outcomes. A toy example using the performance R package:

library(mgcv)
library(performance)

df <- mtcars |>
  tibble::as_tibble() |>
  dplyr::mutate(carb = as.integer(carb))

model <- mgcv::gam(
  carb~s(wt, k=5),
  data = df,
  family = mgcv::ocat(R=8)
)

plot(check_model(model))

That does not work:

Error in d[, !(names(data) %in% all.varying), drop = FALSE] : 
  incorrect number of dimensions

I am trying to get something similar to this:

posterior predictive check of toy model

The only issue seems to be that when the performance calls simulate, what it gets is not supported. I can obtain simulated outcomes using mgcViz:

library(ggplot2)
sims <- mgcViz::simulate.gam(model, newdata=df, nsim=10) |> tibble::as_tibble() |>
    tidyr::pivot_longer(dplyr::everything())
ggplot(sims, aes(x = value)) +
    geom_point(aes(y = after_stat(count)), stat = "count")

The problem is that I only obtain one point, not the jitter of points as in the image above. How can I achieve that?


EDIT:

After accepting the answer below, I realised that the results of the proposed solution and those of performance are not the same.

library(mgcv)
library(performance)
library(patchwork)
df <- mtcars |>
  tibble::as_tibble() |>
  dplyr::mutate(carb = factor(carb, ordered = T))
model <- MASS::polr(
  formula = carb ~ wt,
  data = df,
  Hess = T,
  method = "logistic"
)
sims <- simulate(model, newdata = df, nsim = 100) |>
  tibble::as_tibble() |>
  tidyr::pivot_longer(dplyr::everything(),
                      names_to = 'replicate',
                      values_to = 'carb')
sims_summ <- sims |> dplyr::count(carb, replicate)
p1 <- ggplot(mapping = aes(carb)) +
  geom_jitter(
    aes(y = n, color = 'Model-predicted data'),
    sims_summ,
    height = 0,
    width = 0.1,
    alpha = 0.1
  ) +
  stat_summary(
    aes(y = n, color = 'Model-predicted data'),
    sims_summ,
    geom = 'pointrange',
    fun.data = 'mean_cl_boot'
  ) +
  geom_point(aes(color = 'Observed data'),
             df,
             stat = 'count',
             size = 5) +
  scale_color_manual(values = c('black', 'firebrick')) +
  theme_classic()
p2 <- plot(check_predictions(model, iterations = 100))
p1 + p2

Comparison of results PPC.


Solution

  • performance does posterior predictive using check_predictions. Using:

    check_predictions(model)
    

    Gives the much more informative error:

    Error: Could not simulate responses. Maybe there is no `simulate()` for 
    objects of class `gam`?
    

    As you say, there is mgcViz::simulate.gam, and loading that package will now find a simulate function. However, it returns data in a different format (an array) than stats::simulate (which gives a list), and so it will not work with the functions that performance provides.

    You can plot the simulated data and observed data pretty easily yourself though. E.g.:

    library(ggplot2)
    
    sims <- mgcViz::simulate.gam(model, newdata=df, nsim=100) |> 
      tibble::as_tibble() |>
      tidyr::pivot_longer(dplyr::everything(), names_to = 'replicate', values_to = 'carb')
    
    ggplot(mapping = aes(carb)) +
      geom_density(
        aes(color = 'Model-predicted data', group = replicate), 
        sims, key_glyph = 'smooth'
      ) +
      geom_density(aes(color = 'Observed data'), df, linewidth = 2, key_glyph = 'smooth') + 
      scale_color_manual(values = c(alpha('black', 0.3), 'firebrick')) +
      theme_classic()
    

    enter image description here

    Or perhaps like so:

    sims_summ <- sims |> dplyr::count(carb, replicate)
    ggplot(mapping = aes(carb)) +
      geom_jitter(
        aes(y = n, color = 'Model-predicted data'), 
        sims_summ, height = 0, width = 0.1, alpha = 0.1
      ) +
      stat_summary(
        aes(y = n, color = 'Model-predicted data'), 
        sims_summ, geom = 'pointrange', fun.data = 'mean_cl_boot'
      ) +
      geom_point(aes(color = 'Observed data'), df, stat = 'count', size = 5) + 
      scale_color_manual(values = c('black', 'firebrick')) +
      theme_classic()
    

    enter image description here