Predicting from a gam (mgcv) with multiple "by" smooth terms

I fitted a gam (mgcv) where y is modelled as a function of the time (t) by factor x1 and by factor x2 in separate smooth terms. I also included r, which should mimic a random factor.

library(mgcv)
dat <- data.frame(y = rnorm(1500,500,100),
                  t = rep(c(1:5),100),
                  x1 = as.factor(rep(letters[1:3],length.out=1500)),
                  x2 = as.factor(rep(LETTERS[1:5],length.out=1500)),
                  r = as.factor(rep(letters[4:26],length.out=1500)))


fit <- gam(y~s(t,by=x1,k=4)+s(x1,bs="re")+s(t,by=x2,k=4)+s(x2,bs="re")+s(r,bs="re"),data=dat,
           family="gaussian",method="REML")

I now wanted to predict y for a level of x1 across all levels of x2. However, a new level in x2 will lead to NAs in the prediction.

pred <- data.frame(predict(fit,type="response",newdata = data.frame(t=seq(0,5,0.1),
                                                                    x1="a",
                                                                    x2="new_level",
                                                                    r="new_level")
                           ,exclude = c("x2","r"),se.fit = T))

pred$fit

If I set x2 to one of the levels included in x2, it will predict values which will vary depending on the value/level I use even though I specified it to be excluded. Interestingly, the r takes new levels without leading to NAs. So, I figured this is connected to the "by" smooth terms. However, I could not come up with a solution. One might be fitting tow models including only one of the smooth terms w ith by, each? Would that be ok?

Solution

It sounds like you just need to specify exclude properly. For example,

library("gratia")
library("mgcv")
su_eg4 <- data_sim("eg4", n = 400,  dist = "normal", scale = 2, seed = 1)
m <- gam(y ~ fac + s(x2, by = fac) + s(x0, by = fac),
         data = su_eg4, method = "REML")

This model has the following smooth terms

> smooths(m)
[1] "s(x2):fac1" "s(x2):fac2" "s(x2):fac3" "s(x0):fac1" "s(x0):fac2" "s(x0):fac3"

and hence if you want to exclude all the smooths containing s(x0) here, you need to use their full names. The factor by terms are named following this template: s(<x>):<f><level>, where

<x> is the smooth covariate(s),
<f> is the name of the factor
<level> is the level of <f> that this smooth is for.

If you have lots of levels, it might get tedious to write all that out, so we can search for the relevant string

sms <- smooths(m)
excl <- grepl("x0", sms)

and then adjust the exclude argument value accordingly

exclude = sms[excl]

as the sms[excl] evaluates to

> sms[excl]
[1] "s(x0):fac1" "s(x0):fac2" "s(x0):fac3"

In your case it seems you want to use

sms <- smooths(fit)
excl <- grepl("x2", sms) | grepl("s\\(r\\)", sms)

to capture the random intercept for r as well as any smooths that involve x2.

I actually think your model would be more simply specified using:

m <- <- gam(y ~
    s(t) + # "average" smooth of t
    s(t, x1, bs = "sz") + # smooth difference from `s(t)` for each level of x1
    s(t, x2, bs = "sz") + # smooth difference from `s(t)` for each level of x2
    s(r, bs = "re"), # random intercept term for `r`
  family = "gaussian", method = "REML")

Then you would just be able to use

exclude = c("s(t,x2)", "s(r)")

to exclude the smooths you want to exclude.

Note that with the sz basis, the group means are included in the basis, hence no parametric effects of x1 or x2 or their ranef equivalents are included in the model shown.