rggplot2gam

Extract model fit from geom_smooth


I have a ggplot where I have used the geom_smooth(method='gam') function. I was wondering if there is a way to extract model parameters, like coefficients and deviance explained.

Similar to the summary() of mgcv::gam()'s output:

> summary(gam)

Family: gaussian 
Link function: identity 

Formula:
mAODscale ~ s(numDate, bs = "cr")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.041461   0.002198   18.86   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
             edf Ref.df     F p-value    
s(numDate) 8.731  8.979 74.54  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.379   Deviance explained = 38.4%
GCV = 0.0053239  Scale est. = 0.0052765  n = 1092

Data

test <- structure(list(numDate = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 
5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10, 11, 
11, 11, 12, 12, 12, 13, 13, 13, 14, 14, 14, 15, 15, 15, 16, 16, 
16, 17, 17, 17, 18, 18, 18, 19, 19, 19, 20, 20, 20, 21, 21, 21, 
22, 22, 22, 23, 23, 23, 24, 24, 24, 25, 25, 25, 26, 26, 26, 27, 
27, 27, 28, 28, 28, 29, 29, 29, 30, 30, 30, 31, 31, 31, 32, 32, 
32, 33, 33, 33, 34), mAODscale = c(0.0388813764134284, 0.0461877433707656, 
0.096886995092774, 0.0371438764134382, 0.0394419100374392, 0.0893994950927635, 
0.0413855430800965, 0.0418085767040992, 0.101328661759439, 0.0428313764134316, 
0.0403835767041016, 0.0978328284261067, 0.0444813764134295, 0.0566460767040979, 
0.136811995092771, 0.0404647097467716, 0.0461127433707702, 0.104611995092768, 
0.0391105430800991, 0.0388294100374367, 0.0883828284261057, 0.0377355430801032, 
0.0334335767040983, 0.0744786617594428, 0.0346647097467638, 0.0316752433707705, 
0.0691911617594343, 0.0365730430800966, 0.0329127433707725, 0.0721203284261094, 
0.0337897097467703, 0.0271960767041008, 0.0568703284261005, 0.0321188764134348, 
0.0226835767040967, 0.0467286617594311, 0.0389522097467676, 0.0317169100374315, 
0.0724911617594444, 0.0374147097467699, 0.0301335767041024, 0.066378661759444, 
0.0359688764134347, 0.0245710767041061, 0.0492828284261009, 0.0340355430801083, 
0.0208835767040938, 0.0401828284261114, 0.033193876413435, 0.0170627433707722, 
0.0363119950927739, 0.0320272097467722, 0.0147294100374324, 0.0298161617594417, 
0.0305563764134433, 0.0123752433707693, 0.0200744950927714, 0.0294522097467649, 
0.00966691003743847, 0.0144453284261061, 0.029439709746768, 0.00845441003743019, 
0.0127494950927769, 0.029218876413438, 0.00767941003742578, 0.0118203284260971, 
0.0283438764134303, 0.00608774337077023, 0.00807032842610056, 
0.027393235387791, 0.00498582029383954, 0.00612513611841337, 
0.0261313764134314, 0.004612743370771, 0.00541616175944171, 0.0260813764134298, 
0.00472524337077118, 0.00372449509276862, 0.0256522097467666, 
0.00535024337077061, 0.00356616175943714, 0.0250313764134376, 
0.0161044100374284, 0.00060366175944182, 0.0241147097467689, 
0.025246076704093, -0.00540050490722876, 0.0233022097467739, 
0.022454410037426, -0.00630883824055672, 0.0208772097467715, 
0.0152252433707645, -0.00819217157389573, 0.0194897097467646, 
-0.00307475662923196, -0.00896300490722979, 0.0178938764134386, 
0.00323774337076088, -0.00836717157389444, 0.0153022097467641
)), row.names = c(NA, -100L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x000002cacef91f60>)

Code and Plot with geom_smooth()

Note: this is only using a subset of my dataset.

p <- ggplot(test50, aes(x=numDate, y=mAODscale)) +
  geom_point() +
  stat_smooth(method='gam',
              formula=y~s(x,bs="cs",fx=TRUE,k=10))

enter image description here


Solution

  • When specifying a geom_smooth model within ggplot, you use the symbols x and y in the formula to represent the variables mapped to the x and y aesthetics (as specified in aes). In your case, x is numDate and y is mAODscale.

    Since there are no transformations going on in the data prior to the model being run (other than this change in the variable names), all you need to do is substitute the symbols x and y in the formula for the appropriate variable names in your original data frame. Then use that formula to run a gam. The summary of this model is the result you are looking for:

    library(mgcv)
    
    mod <- gam(mAODscale ~ s(numDate, bs = "cs", fx = TRUE, k = 10), data = test)
    summary(mod)
    #> 
    #> Family: gaussian 
    #> Link function: identity 
    #> 
    #> Formula:
    #> mAODscale ~ s(numDate, bs = "cs", fx = TRUE, k = 10)
    #> 
    #> Parametric coefficients:
    #>             Estimate Std. Error t value Pr(>|t|)    
    #> (Intercept) 0.031871   0.001839   17.33   <2e-16 ***
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> Approximate significance of smooth terms:
    #>            edf Ref.df     F  p-value    
    #> s(numDate)   9      9 13.25 3.01e-13 ***
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> R-sq.(adj) =  0.527   Deviance explained =   57%
    #> GCV = 0.00037584  Scale est. = 0.00033826  n = 100
    

    The model itself is not stored in the ggplot object. It is built on-the-fly and discarded during the ggplot_build process. We can see that a direct call to gam gives the same result as the model used in ggplot if we set a trace on gam and get it to print out a summary of its output object to the console whenenver it is called:

    trace(gam, quote(print(summary(object))), at = 40)
    #> Tracing function "gam" in package "mgcv"
    #> [1] "gam"
    

    Now when we print our ggplot, we should get a summary of the gam that ggplot uses to create the smooth:

    ggplot(test, aes(x=numDate, y=mAODscale)) +
      geom_point() +
      stat_smooth(method='gam',
                  formula=y ~ s(x, bs = "cs", fx = TRUE, k = 10))
    

    #> Tracing method(formula, data = data, weights = weight, method = "REML") step 40 
    #> 
    #> Family: gaussian 
    #> Link function: identity 
    #> 
    #> Formula:
    #> y ~ s(x, bs = "cs", fx = TRUE, k = 10)
    #> 
    #> Parametric coefficients:
    #>             Estimate Std. Error t value Pr(>|t|)    
    #> (Intercept) 0.031871   0.001839   17.33   <2e-16 ***
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> Approximate significance of smooth terms:
    #>      edf Ref.df     F  p-value    
    #> s(x)   9      9 13.25 3.01e-13 ***
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> R-sq.(adj) =  0.527   Deviance explained =   57%
    #> -REML = -220.56  Scale est. = 0.00033826  n = 100
    

    You can see that this is the same as our direct call, except that the variable names have changed to x and y.

    Don't forget to untrace(gam) when you are done.

    Created on 2023-10-03 with reprex v2.0.2