r gam mgcv

Is it possible to predict terms from a GAM with partial newdata?


Is it possible to get predictions from a GAM object for specific terms from 'partial' newdata, which only provides values for the terms to predict? Running predict.gam with type = 'terms' for specific terms still seems to require me to provide "complete" newdata:

library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.
library(data.table)

# some data for training a gam:
train = data.table(response = rnorm(100),
                   a = rnorm(100),
                   b = rnorm(100),
                   d = rnorm(100))

mod = gam(response ~ s(a) + s(b) + s(d),data = train)

newdat = data.table(a = -1:1,b = -1:1)

# this is not possible:
predict(mod,newdata = newdat,type = 'terms',terms = c('s(a)','s(b)'))
#> Warning in predict.gam(mod, newdata = newdat, type = "terms", terms = c("s(a)", : not all required variables have been supplied in  newdata!
#> Error in eval(predvars, data, env): object 'd' not found

So the model expects newdata that has values for the predictor d, even though d is never used for the requested terms:

# adding any value for d works:
predict(mod,newdata = newdat[,d:=1],type = 'terms',terms = c('s(a)','s(b)'))
#>          s(a)         s(b)
#> 1 -0.14909452 -0.096316246
#> 2 -0.01305186  0.001326293
#> 3 -0.05200030  0.098968833
#> attr(,"constant")
#> (Intercept) 
#>  -0.0468407

# results do not depend on the value of d:
predict(mod,newdata = newdat[,d:=10000],type = 'terms',terms = c('s(a)','s(b)'))
#>          s(a)         s(b)
#> 1 -0.14909452 -0.096316246
#> 2 -0.01305186  0.001326293
#> 3 -0.05200030  0.098968833
#> attr(,"constant")
#> (Intercept) 
#>  -0.0468407

Created on 2024-04-03 with reprex v2.0.2

Specifically, I am working with several big GAMs with many different terms. The number and names of the terms vary between the GAMs, but they share some terms for which I need to provide newdata. I am looking for a way to do this that does not depend on "the rest of the GAM" (e.g. the names and number of the other terms), which is not actually used in the prediction.


Solution

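  • As pointed out by users langtang and Gavin Simpson, you can simply set newdata.guaranteed = TRUE. This turns off the checking of newdata (apart from the sanity of factor levels), so variables that are not needed for the requested terms can be left out. The numbers below differ from those in the question because the training data are regenerated at random without a seed: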

    library(mgcv)
    #> Loading required package: nlme
    #> This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.
    library(data.table)
    
    train = data.table(response = rnorm(100),
                       a = rnorm(100),
                       b = rnorm(100),
                       d = rnorm(100))
    
    mod = gam(response ~ s(a) + s(b) + s(d),data = train)
    
    newdat = data.table(a = -1:1,b = -1:1)
    
    predict(mod,newdata = newdat,type = 'terms',terms = c('s(a)','s(b)'),newdata.guaranteed = TRUE)
    #>           s(a)        s(b)
    #> 1  0.034319305 -0.32623950
    #> 2  0.004482982  0.18215385
    #> 3 -0.025353341 -0.04922302
    #> attr(,"constant")
    #> (Intercept) 
    #>  0.04415271
    

    Created on 2024-04-04 with reprex v2.0.2
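
For the use case in the question (several GAMs that share some terms), the same call can be wrapped in a small helper. This is only a sketch: `predict_shared_terms` is a hypothetical name, not part of mgcv, and the two models, the extra covariate `e`, and the seed are made up for illustration. It assumes the shared terms are smooths and that the models have no parametric terms beyond the intercept. The approach is confirmed above for mgcv 1.8-42; other versions may behave differently.

```r
library(mgcv)

set.seed(42)  # assumption: fixed seed so the sketch is reproducible
n <- 200
train <- data.frame(response = rnorm(n), a = rnorm(n), b = rnorm(n),
                    d = rnorm(n), e = rnorm(n))

# Two GAMs that share s(a) and s(b) but differ in their remaining term.
mods <- list(
  with_d = gam(response ~ s(a) + s(b) + s(d), data = train),
  with_e = gam(response ~ s(a) + s(b) + s(e), data = train)
)

# Hypothetical helper (not an mgcv function): predict only the requested
# terms, skipping the completeness check on newdata.
predict_shared_terms <- function(model, newdata, terms) {
  predict(model, newdata = newdata, type = "terms", terms = terms,
          newdata.guaranteed = TRUE)
}

shared <- c("s(a)", "s(b)")
newdat <- data.frame(a = c(-1, 0, 1), b = c(-1, 0, 1))

# Partial newdata: only the covariates of the shared terms are supplied.
partial <- lapply(mods, predict_shared_terms, newdata = newdat, terms = shared)

# Sanity check: complete newdata with arbitrary values for the unused
# covariates should give the same term contributions.
full_newdat <- cbind(newdat, d = 0, e = 0)
full <- lapply(mods, predict, newdata = full_newdat,
               type = "terms", terms = shared)

for (m in names(mods)) {
  print(all.equal(partial[[m]], full[[m]], check.attributes = FALSE))
}
```

Because newdata.guaranteed = TRUE disables the usual checks, a misspelled or missing covariate for one of the requested terms is no longer caught with a helpful warning, so it is worth running a comparison like the one above once per model.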