I am interested in testing the linear trend separately from the nonlinear component in a GAMM.
I have followed an example from the mgcv documentation, and answers to relevant questions on Stack Exchange, to use m=c(2,0) with the default tp smooth. The example works according to my expectations: (1) the slope estimated for the parametric coefficient for x is pretty close to the estimate from simple linear regression; and (2) the smooth appears detrended and of the relevant (very small) magnitude.
However, when I try to do this with my data, the result makes no sense to me. The linear component appears arbitrary (and sometimes does not even have the correct sign, as in the following example), and the smooth is not detrended (it obviously includes a strong linear component).
I wonder if I have misunderstood what this is supposed to do or if I am coding this incorrectly somehow. Since the doc example works fine I do not suppose it is a bug in mgcv.
How else can I test the linear trend separately from the nonlinear component in a GAMM?
This is the relevant example from the mgcv documentation:
require(mgcv)
n <- 100
set.seed(2)
x <- runif(n)
y <- x + x^2 * 0.2 + rnorm(n) * 0.1   # mild quadratic on top of a linear trend
ed <- data.frame(y, x)
mg <- gam(y ~ s(x, m = c(2, 0)) + x, data = ed, method = "REML")
As I understand it, this removes the linear component from the smooth so that it can be separately estimated.
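For reference, what m=c(2,0) actually does to the basis can be inspected without fitting anything, using mgcv::smoothCon (a sketch; the object names are mine):

```r
library(mgcv)  # ships with R as a recommended package

set.seed(2)
ed <- data.frame(x = runif(100))

# smoothCon() constructs the smooth without fitting a model, so the
# penalty null space (the unpenalized part of the smooth) can be
# inspected directly:
sm_default <- smoothCon(s(x), data = ed)[[1]]              # default m = 2
sm_nolin   <- smoothCon(s(x, m = c(2, 0)), data = ed)[[1]]

sm_default$null.space.dim  # 2: constant and linear functions
sm_nolin$null.space.dim    # 0: the linear part is dropped from the basis
```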
Here are the slope estimates from gam and lm, pretty close:
> summary(mg)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x, m = c(2, 0)) + x

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.02133    0.05315  -0.401    0.689    
x            1.18249    0.10564  11.193   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
        edf Ref.df     F p-value  
s(x) 0.9334      8 0.304   0.076 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =   0.91   Deviance explained = 91.1%
-REML = -70.567  Scale est. = 0.012767  n = 100
> summary(lm(y ~ x, ed))

Call:
lm(formula = y ~ x, data = ed)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.24082 -0.07942 -0.01136  0.09430  0.22836 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.03019    0.02212  -1.365    0.175    
x            1.20052    0.03851  31.175   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1144 on 98 degrees of freedom
Multiple R-squared:  0.9084, Adjusted R-squared:  0.9075 
F-statistic: 971.9 on 1 and 98 DF,  p-value: < 2.2e-16
Here is the graph of the data with the superimposed predictions from lm (red) and gam (blue):
plot(y ~ x, ed)
abline(lm(y ~ x, ed), col = "red")
points(ed$x, predict(mg, ed), col = "blue")
[plot of the data with superimposed lm (red) and gam (blue) predictions]
And here is the smooth, which looks perfectly detrended and very small, consistent with the graph:
plot.gam(mg)
Here is the same analysis using my data.
ev <- read.table("ev.txt")
me <- gam(logart ~ s(item, m = c(2, 0)) + item, data = ev, method = "REML")
The linear slope for item estimated by gam is negative (-0.11) although the actual slope is positive (+0.032):
> summary(me)

Family: gaussian 
Link function: identity 

Formula:
logart ~ s(item, m = c(2, 0)) + item

Parametric coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  6.305740   0.003587 1758.100   <2e-16 ***
item        -0.112921   0.049174   -2.296   0.0217 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
          edf Ref.df     F  p-value    
s(item) 6.125      8 3.583 3.15e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.0298   Deviance explained = 3.14%
-REML = -842.54  Scale est. = 0.039768  n = 4453
> summary(lm(logart ~ item, ev))

Call:
lm(formula = logart ~ item, data = ev)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.65009 -0.12830  0.00627  0.13301  0.61885 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.311575   0.003001 2103.47   <2e-16 ***
item        0.031751   0.003048   10.42   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2001 on 4451 degrees of freedom
Multiple R-squared:  0.0238,  Adjusted R-squared:  0.02359 
F-statistic: 108.5 on 1 and 4451 DF,  p-value: < 2.2e-16
Here is the plot of the data along with the predictions from lm and gam, both of which look perfectly reasonable:
plot(logart ~ item, ev, col = "#20202020")
abline(lm(logart ~ item, ev), col = "red")
points(ev$item, predict(me, ev), col = "blue")
[plot of my data with superimposed lm (red) and gam (blue) predictions]
So the overall model seems fine except that the smooth is not in fact detrended, and seems to be doing most of the work (note the y axis range):
plot.gam(me)
(The results are the same with bam, and explicitly specifying bs="tp" makes no difference, as expected.)
Is this even possible? How can different datasets lead to such different results, with the linear component from gam/bam matching the lm/lmer estimate in one case but not the other? Have I misunderstood the point of (or the procedure for) testing for nonlinearity separately? How else should I do it?
The following answer was sent by Simon Wood by e-mail:
The smoothing basis is not orthogonalized to the linear trend, so you would not expect the slope estimates to match those from lm. The point of this construction is really to enable testing of whether anything is needed beyond a linear trend, i.e. whether the smooth is needed rather than a simple linear effect.
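On that reading, the approximate test reported for s(x) is the test for anything beyond a straight line. A sketch, reusing the simulated data from the documentation example above:

```r
library(mgcv)  # ships with R as a recommended package

# Simulated data from the documentation example in the question:
set.seed(2)
n <- 100
x <- runif(n)
y <- x + x^2 * 0.2 + rnorm(n) * 0.1
ed <- data.frame(y, x)

mg <- gam(y ~ s(x, m = c(2, 0)) + x, data = ed, method = "REML")

# With the linear trend excluded from the smooth's null space, the
# approximate test on s(x) asks whether anything beyond the linear
# term is needed; here p = 0.076, so only weak evidence of nonlinearity.
summary(mg)$s.table
```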
I can't say I understand exactly what the linear slope represents in this model, but evidently the model is not supposed to produce a separation into a linear component and a detrended nonlinear component, as I had wrongly assumed. To achieve that, Wood suggested one could clone the smooth constructor and add code to orthogonalize the basis.
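A minimal sketch of that idea, without cloning the constructor: build the m=c(2,0) basis with smoothCon, project its columns onto the orthogonal complement of span{1, x}, and refit with the orthogonalized basis as a penalized parametric term via paraPen. (The names Xo and mo are mine, not from the e-mail; this is my attempt at the idea, not Wood's code.)

```r
library(mgcv)  # ships with R as a recommended package

# Simulated data from the documentation example in the question:
set.seed(2)
n <- 100
x <- runif(n)
y <- x + x^2 * 0.2 + rnorm(n) * 0.1
ed <- data.frame(y, x)

# Build the m = c(2, 0) thin-plate basis without fitting anything:
sm <- smoothCon(s(x, m = c(2, 0)), data = ed)[[1]]

# Project each basis column onto the orthogonal complement of
# span{1, x}, so the smooth can no longer absorb a linear trend:
Q  <- qr.Q(qr(cbind(1, ed$x)))
Xo <- sm$X - Q %*% crossprod(Q, sm$X)

# Refit with the orthogonalized basis as a penalized parametric term;
# because Xo is orthogonal to both the intercept and x, the estimate
# of the x coefficient now coincides with the slope from lm(y ~ x):
mo <- gam(y ~ x + Xo, paraPen = list(Xo = sm$S), data = ed, method = "REML")
coef(mo)[["x"]]
```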