rmgcvlongitudinalrandom-effects

Nested sample design in mgcv syntax


I'm having trouble translating my sample design into the correct mgcv package syntax for random effects.

Here is the set-up: we trawl for fish, once per "collection" site, at the same 12 sites per month (hopefully), for 7 consecutive months, every year (these are my random intercepts, I think). However, sometimes sites in the database are either skipped by accident (NA), or are replaced by back-up sites (which is a near-by site to replace the intended one when poor conditions hinder us from trawling there). There is also a repeated measures aspect to the data wherein many fish lengths are recorded within the same site (this is just a covariate, but thought random intercepts are also for repeated measures). If my goal is to faithfully represent this design, how should I structure my random effects? Something like this (?):

(fSite = factor site; fCYR = factor calendar year)

s(fSite, bs='re') + s(fCYR, , bs='re')

or

s(fSite, fCYR, bs='re') <- Does this only work if the same 12 sites are present in each fCYR level?

Data:

> dput(station)
structure(list(CYR = c(2009, 2009, 2009, 2009, 2009, 2009, 2010, 
2010, 2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011, 
2011, 2011, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2013, 2013, 
2013, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2014, 2014, 
2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 
2016, 2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2018, 
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 
2019, 2019, 2019, 2019, 2019, 2019), Month = c(6, 7, 8, 9, 10, 
11, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 
9, 10, 11, 4, 5, 6, 7, 8, 9, 10, 5, 6, 7, 8, 9, 10, 5, 6, 7, 
8, 9, 10, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12, 1, 
3, 5, 6, 7, 8, 9, 10, 11, 2, 3, 5, 6, 7, 8, 9, 10, 11), `Coll. Site 1` = c(21, 
23, 22, 23, 22, 23, 22, 22, 21, 68, 23, 22, 22, 70, 20, 23, 22, 
20, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 22, 22, 22, 
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 
22, 22, NA, NA, 22, 22, 22, NA, NA, 22, 22, 22, 22, 22), `Coll. Site 2` = c(40, 
70, 70, 70, 68, 68, 68, 68, 70, 101, 112, 70, 70, 143, 70, 70, 
68, 67, 70, 67, 67, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70), 
    `Coll. Site 3` = c(70, 112, 112, 122, 123, 112, 122, 122, 
    111, 136, 134, 124, 135, 158, 122, 123, 124, 124, 112, 123, 
    123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 
    123, 123, 123, 123, 123, 123, 124, 123, 123, 123, 123, 123, 
    123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 
    123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 
    123, 123, 123, NA, 123, 123, 123, 123, 123, 123, 123), `Coll. Site 4` = c(147, 
    147, 133, 134, 146, 159, 54, 134, 134, 159, 156, 145, 146, 
    167, 159, 146, 144, 146, 147, 146, 145, 146, 146, 146, 146, 
    146, 146, 146, 146, 146, 146, 146, 145, 146, 146, 146, 146, 
    146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 
    146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 
    146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 
    146, 146, 146, 146, 146, 146), `Coll. Site 5` = c(176, 173, 
    168, 168, 169, 172, 168, 168, 168, 168, 168, 169, 167, 241, 
    169, 172, 146, 168, 168, 169, 167, 168, 168, 168, 168, 168, 
    168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 
    168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 
    168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 
    168, 168, 168, 168, 168, 168, 168, 168, NA, 168, 168, 168, 
    168, 168, 168, 168, 168), `Coll. Site 6` = c(241, 258, 241, 
    241, 241, 241, 144, 241, 169, 241, 241, 241, 241, 266, 241, 
    241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 240, 241, 
    241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 
    241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 240, 
    241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 
    241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 
    241, 241, 241, 241), `Coll. Site 7` = c(280, 279, 266, 266, 
    253, 266, 24, 266, 241, 270, 266, 266, 266, 270, 270, 266, 
    266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 257, 266, 
    266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 
    266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 
    257, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 
    266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 
    266, NA, 266), `Coll. Site 8` = c(281, 283, 270, 270, 257, 
    270, NA, 270, 266, 279, 270, 270, 270, 301, 301, 270, 270, 
    270, 270, 270, 270, NA, 270, 270, 270, 270, 265, 270, 270, 
    270, 270, 270, 270, 270, 270, NA, 270, 270, 270, 270, 270, 
    270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 266, 
    270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 
    270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 
    270, 270), `Coll. Site 9` = c(314, 302, 301, 314, 301, 301, 
    NA, 301, 270, 301, 301, 301, 301, 402, 402, 301, 301, 301, 
    314, 301, 314, 312, 301, 301, 302, 301, 301, 301, 301, 301, 
    NA, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 
    301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 
    301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 
    301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 
    301), `Coll. Site 10` = c(401, 314, 303, 609, 314, 402, NA, 
    314, 301, 314, 314, 314, 314, 618, 618, 314, 314, 314, 403, 
    402, 402, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 
    314, 314, 314, 314, 314, 314, 314, 315, 314, 314, 314, 314, 
    314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 
    314, 314, 314, 314, 314, 314, 314, 314, NA, 314, 314, 314, 
    314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314
    ), `Coll. Site 11` = c(618, 402, 402, 618, 402, 618, NA, 
    402, 314, 402, 402, 402, 402, 314, 266, 402, 402, 402, 609, 
    618, 618, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 
    402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 
    402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 
    402, NA, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 
    402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402
    ), `Coll. Site 12` = c(NA, 621, 618, NA, 621, NA, NA, 618, 
    402, 618, 619, 618, 618, NA, NA, 621, 618, 618, 616, 314, 
    301, 618, 621, 618, 618, 618, 616, 618, 618, 618, 618, 618, 
    618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 
    618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 
    618, 618, 618, 618, 618, 618, 621, 618, 618, 618, 618, 618, 
    618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618)), row.names = c(NA, 
-79L), class = c("tbl_df", "tbl", "data.frame"))

Solution

  • Does this only work if the same 12 sites are present in each fCYR level?

    Try it and see. tldr: yes, it works; no it doesn't require the same 12 sites to be present in each year. The key thing is that unique sites should be coded uniquely. However, using both the random effect forms you mention in the model can be beneficial as it can lead to a more parsimonious fit.

    Assuming your data is in df and properly formatted for what I think your model is

    library("mgcv")
    library("tidyr")
    library("dplyr")
    library("janitor")
    
    df2 <- pivot_longer(df, cols = c(-Month, -CYR), names_to = "site",
                 values_to = "value", names_prefix = "^Coll\\. ") |>
    janitor::clean_names() |>
    mutate(fcyr = factor(cyr), site = factor(site))
    

    we can fit three models to compare what random effects are doing:

    knots <- list(month = c(0.5, 12.5))
    m1 <- gam(value ~ s(month, bs = "cc") +
                s(fcyr, bs = "re") + s(site, bs = "re"),
              data = df2, family = nb(), knots = knots)
    m2 <- gam(value ~ s(month, bs = "cc") + s(fcyr, site, bs = "re"),
              data = df2, family = nb(), knots = knots)
    m3 <- gam(value ~ s(month, bs = "cc") +
                s(fcyr, bs = "re") + s(site, bs = "re") +
                s(fcyr, site, bs = "re"),
              data = df2, family = nb(), knots = knots)
    

    As there are 132 unique (site, year) pairs

    > with(df2, nlevels(interaction(fcyr, site, drop = TRUE)))                    
    [1] 132
    

    m2 uses almost 131 EDF (one df gets lost due to the model intercept IIRC)

    > model_edf(m1)                                                               
    # A tibble: 1 × 2
      model   edf
      <chr> <dbl>
    1 m1     19.8
    
    r$> model_edf(m2)                                                               
    # A tibble: 1 × 2
      model   edf
      <chr> <dbl>
    1 m2     132.
    
    r$> model_edf(m3)                                                               
    # A tibble: 1 × 2
      model   edf
      <chr> <dbl>
    1 m3     58.1
    

    while m3 which contains all three ranef terms uses on ~58 EDF.

    As your comment speculates, you get a more parsimonious fit to the data if you include the site and year random effects plus their combination than if you just fitted the site:fcyr form as per m2.

    Both m2 and m3 allow you to estimate a random intercept for all 132 (site,year) pairs. So what's the difference?

    As your comment surmises, the two "main" random effects encode the average effects of each site and the average effect of each year. The s(site, fcyr, bs = "re") term then estimates the deviations from the site and year averages for individual (site,year) pairs. Once you have estimated the average site and year effects, the deviations around individual sites due to year (or equivalently deviations over years about individual site means) are relatively small. Hence the s(site,fcyr, bs = "re") term only uses ~40 EDF; much of the potential deviation is shrunk away because the random site and year effects are already accounted for in the model.

    So in this instance you could use either form:

    1. ~ s(fcyr, bs = "re") + s(site, bs = "re") + s(fcyr, site, bs = "re"), or
    2. ~ s(fcyr, site, bs = "re")

    but option 1 is more parsimonious. That need not be the case however. If the within site and within year variation is large, it is not inconceivable that the decomposed form (1.) will offer no advantage over 2.

    The main thing with random effects is to make sure that the groups are labelled uniquely. Say you have 3 fields and take measurements repeatedly from 6 locations within each field. You wouldn't code the six locations as loc1, loc2, ..., loc6 in each field, because the model would think that any row with location equal to loc1 is the same location. Say the fields are labelled A, B and C, you would code the locations as locA1, locA2, ..., locB1, ..., locC6 to avoid any ambiguity. In the mixed effects software world in R, this kind of question comes up a lot; there, with the way the formula interface to things like lme() and lmer() work you could get away with the loc1 coding form if you said location was nested in field. Hence discussions about crossed or nested random effects. That whole discussion goes away if you code your grouping variable correctly.

    With mgcv, if you used the loc1 form and then fitted the formula 1. form of the model you'd get things wrong

    y ~ s(field, bs="re") + s(location, bs="re") + s(field, location, bs="re")
    

    as the s(location, bs="re") term would think all observations labelled loc1 were all the same location ("subject"), when they aren't. In that case, the 2. form of the model would be the correct one

    y ~ s(field, location, bs="re")
    

    In summary, which form you used really comes down to how you coded your grouping factors and not whether the design is nested or crossed.