rsplinemgcv

How can I set "cr" as the default basis type in mgcv::gam instead of "tp"?


I'm using the mgcv package in R to build generalized additive models (GAMs), and I often use cubic regression splines (cr) instead of the default thin plate regression splines (tp) (mainly because cr is much faster than tp, and sometimes more accurate, as below). However, specifying bs = "cr" for each predictor in the s() function gets repetitive, especially when I have many predictors.

For example, if I want to predict mpg from some variables in the mtcars dataset, by default, the s() smooth function uses "tp" when the bs argument is not specified:

# default bs = "tp"
mgcv::gam(
  mpg ~ s(hp) + s(wt) + s(qsec), 
  data = mtcars
) |>
  summary()
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> mpg ~ s(hp) + s(wt) + s(qsec)
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  20.0906     0.3729   53.88   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>           edf Ref.df      F p-value    
#> s(hp)   2.209  2.799  1.258   0.347    
#> s(wt)   2.095  2.535 12.640 9.2e-05 ***
#> s(qsec) 1.000  1.000  1.199   0.284    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.878   Deviance explained = 89.8%
#> GCV = 5.5412  Scale est. = 4.4497    n = 32

To use cubic regression splines, I must explicitly specify the "cr" basis for every single smooth:

# manual bs = "cr"
mgcv::gam(
  mpg ~ s(hp, bs = "cr") + s(wt, bs = "cr") + s(qsec, bs = "cr"), 
  data = mtcars
) |> 
  summary()
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> mpg ~ s(hp, bs = "cr") + s(wt, bs = "cr") + s(qsec, bs = "cr")
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  20.0906     0.2923   68.72   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>           edf Ref.df      F  p-value    
#> s(hp)   8.867  8.981  3.012   0.0203 *  
#> s(wt)   2.380  2.987 13.509 8.82e-05 ***
#> s(qsec) 1.000  1.000  6.242   0.0218 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.925   Deviance explained = 95.4%
#> GCV = 4.6666  Scale est. = 2.7348    n = 32

Created on 2025-05-09 with reprex v2.1.1

When there are many predictors, this approach becomes quite cumbersome. Ideally, I would like a more automatic way to use "cr" as the default basis for all smooths.

I've looked into mgcv::gam.control(), but it doesn't seem to have an option for changing the default basis type. Is there a built-in way to make "cr" the default, or any other more elegant solution?

I know that I could try to override the s() function with my own custom version but I really want to avoid that, especially since my solution might end up in a package and I wouldn't like to mess with mgcv built-in functions.

Thanks in advance for any tips or suggestions!


Solution

  • The function s is defined with bs = "tp" as a default argument. There aren't any settings anywhere that can change this.

    Furthermore, you can't even define a wrapper of s because of how gam handles the calls to smoother functions inside formulas: only the specific names "s", "te", "ti", "t2" are treated as function calls due to the inner workings of interpret.gam. Even if you tried to overwrite s, it would be mgcv::s rather than your new function that would be used by gam, so that isn't a valid option either unless you overwrite s in the mgcv namespace (which you shouldn't do within a package).

    The only way I can see to do this is by writing a thin wrapper around gam that parses the formula and inserts the argument bs = "cr" into each s call.

    Although this can be done a few ways using gsub or substitute, I think the safest and most portable way to do it would be to walk the syntax tree of the formula and conditionally insert the bs arguments as necessary (if none already exist). We need to first extract the environment of the formula so we can reassign it later to our corrected formula.

    To ensure the call member of the model object is the same as it would be if we had just typed out the full call to gam we can build and evaluate the final call to gam rather than calling it directly:

    cr_gam <- function(formula, ...) {
      
      e <- environment(formula)
    
      walk_formula <- function(formula) {
        if(is.call(formula)) {
          if(identical(formula[[1]], as.symbol("s"))) {
            if(!is.null(names(formula))) {
              if("bs" %in% names(formula)) return(formula)
            } else {
              return(as.call(c(as.list(formula), bs = "cr")))
            }
            return(formula)
          }
          as.call(lapply(formula, walk_formula))
        } else if(is.symbol(formula)) {
          return(formula)
        } else {
          stop("Unexpected input in formula")
        }
      }
    
      formula <- walk_formula(formula)
      environment(formula) <- e
      mc <- match.call()
      mc[[1]] <- quote(mgcv::gam)
      mc[[2]] <- formula
      eval(mc, envir = parent.frame())
    }
    

    Now we can use cr_gam as a direct replacement for mgcv::gam. It acts identically except that s calls are all automatically given bs = "cr" unless otherwise specified. You are free to pass whatever arguments you want to each s(), including bs arguments, which will be honoured if directly specified. You could safely use it in a package as a replacement for gam

    cr_gam(mpg ~ s(hp) + s(wt) + s(qsec), data = mtcars) |> 
      summary()
    #> Family: gaussian 
    #> Link function: identity 
    #> 
    #> Formula:
    #> mpg ~ s(hp, bs = "cr") + s(wt, bs = "cr") + s(qsec, bs = "cr")
    #> 
    #> Parametric coefficients:
    #>             Estimate Std. Error t value Pr(>|t|)    
    #> (Intercept)  20.0906     0.2923   68.72   <2e-16 ***
    #> ---
    #> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    #> 
    #> Approximate significance of smooth terms:
    #>           edf Ref.df      F  p-value    
    #> s(hp)   8.867  8.981  3.012   0.0203 *  
    #> s(wt)   2.380  2.987 13.509 8.82e-05 ***
    #> s(qsec) 1.000  1.000  6.242   0.0218 *  
    #> ---
    #> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    #> 
    #> R-sq.(adj) =  0.925   Deviance explained = 95.4%
    #> GCV = 4.6666  Scale est. = 2.7348    n = 32