rsascovariance-matrix

Reproducing complex SAS `PROC MIXED` specification in R `nlme::lme()` with random effects and correlation structure


TL;DR I'm struggling to specify a PROC MIXED SAS model in R using nlme:::lme.

What I'm trying to do

I am trying to reproduce the SAS code provided in Hamlet et al. (2003) in R code. The SAS code that the authors use makes use of PROC MIXED:

proc mixed;
class persnum vtype replicate;
model response = vtype / solution ddfm=kr;
random vtype / type=un subject=persnum g gcorr v vcorr;
repeated vtype / type=un subject=replicate(persnum) r rcorr;
run;

What I've done

I know that solution ddfm=kr specifies the Kenward-Roger method for computing the denominator degrees of freedom for the fixed effects, which isn't available in R. Ignoring that, I've tried to reproduce the model using nlme::lme() (and even geepack::geeglm() even though I know I'm switching out the random effects). My nlme::lme() R code is:

model <-
  nlme::lme(
    fixed = response ~ vtype
    ,random = list( ID = pdSymm( ~ vtype) )
    ,correlation = corSymm( form = ~ vtype | persnum/replicate)
    ,method = "ML"
    ,data = phPACO_long
  )

The model errors:

Error in if (length(uCov) != maxCov) { : missing value where TRUE/FALSE needed

I can't find uCov or maxCov in the nlme git repo so I don't know how to find and fix the error. Does anyone know about this error, where it comes from, and how I might fix it?

What others have done

I've seen that TueBimet reproduced PROC MIXED code that had a REPEATED statement but not a RANDOM statement, just by using the random argument in nlme::lme(). I don't have SAS so I can't check the validity of this approach. Also, the SAS code I'm trying to translate has different groupings for the RANDOM statement and the REPEATED statement, so my gut tells me this won't work for me.

The data

The structure of the long-form data is:

structure(list(persnum= structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
8L, 8L, 8L, 8L, 8L, 8L), levels = c("1", "2", "3", "4", "5", 
"6", "7", "8"), class = "factor"), replicate= structure(c(1L, 1L, 
2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 
2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 
1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 1L, 1L, 2L, 2L, 3L, 3L, 
4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 1L, 1L, 2L, 2L, 3L, 3L, 
4L, 4L, 5L, 5L, 6L, 6L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 
3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L), levels = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9"), class = "ordered"), vtype= structure(c(2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L), levels = c("pH", "P_aCO2"), class = "factor"), response= c(6.68, 3.97, 6.53, 4.12, 
6.43, 4.09, 6.33, 3.97, 6.85, 5.27, 7.06, 5.37, 7.13, 5.41, 7.17, 
5.44, 7.4, 5.67, 7.42, 3.64, 7.41, 4.32, 7.37, 4.73, 7.34, 4.96, 
7.35, 5.04, 7.28, 5.22, 7.3, 4.82, 7.34, 5.07, 7.36, 5.67, 7.33, 
5.1, 7.29, 5.53, 7.3, 4.75, 7.35, 5.51, 7.35, 4.28, 7.3, 4.44, 
7.3, 4.32, 7.37, 3.23, 7.27, 4.46, 7.28, 4.72, 7.32, 4.75, 7.32, 
4.99, 7.38, 4.78, 7.3, 4.73, 7.29, 5.12, 7.33, 4.93, 7.31, 5.03, 
7.33, 4.93, 6.86, 6.85, 6.94, 6.44, 6.92, 6.52, 7.19, 5.28, 7.29, 
4.56, 7.21, 4.34, 7.25, 4.32, 7.2, 4.41, 7.19, 3.69, 6.77, 6.09, 
6.82, 5.58), persnum.replicate= structure(c(1L, 1L, 9L, 9L, 17L, 17L, 
25L, 25L, 2L, 2L, 10L, 10L, 18L, 18L, 26L, 26L, 3L, 3L, 11L, 
11L, 19L, 19L, 27L, 27L, 32L, 32L, 37L, 37L, 41L, 41L, 44L, 44L, 
47L, 47L, 4L, 4L, 12L, 12L, 20L, 20L, 28L, 28L, 33L, 33L, 5L, 
5L, 13L, 13L, 21L, 21L, 29L, 29L, 34L, 34L, 38L, 38L, 42L, 42L, 
45L, 45L, 6L, 6L, 14L, 14L, 22L, 22L, 30L, 30L, 35L, 35L, 39L, 
39L, 7L, 7L, 15L, 15L, 23L, 23L, 8L, 8L, 16L, 16L, 24L, 24L, 
31L, 31L, 36L, 36L, 40L, 40L, 43L, 43L, 46L, 46L), levels = c("1.1", 
"2.1", "3.1", "4.1", "5.1", "6.1", "7.1", "8.1", "1.2", "2.2", 
"3.2", "4.2", "5.2", "6.2", "7.2", "8.2", "1.3", "2.3", "3.3", 
"4.3", "5.3", "6.3", "7.3", "8.3", "1.4", "2.4", "3.4", "4.4", 
"5.4", "6.4", "8.4", "3.5", "4.5", "5.5", "6.5", "8.5", "3.6", 
"5.6", "6.6", "8.6", "3.7", "5.7", "8.7", "3.8", "5.8", "8.8", 
"3.9"), class = c("ordered", "factor"))), row.names = c(NA, -94L
), class = c("tbl_df", "tbl", "data.frame"))

p.s.

The model fits if I exclude the mvar in the correlation argument but the resulting covariance matrix does not match Hamlett et al.'s (which makes sense because it assumes a different correlation structure).


Solution

  • This is indeed a bit of a challenge, let's start with the "ground truth" results from SAS (ignoring the Kenward-Roger degrees of freedom & P-values [edit] and using maximum likelihood, see below).

    The one thing I did was change the repeated statement from a variance-covariance parametrization to a variance-correlation one; this has no effect on the fit but more closely reports results in the same scale as nlme::lme which parametrizes R-side as correlation only:

            Solution for Fixed Effects
    Effect       vtype     Estimate          Std.Err 
    
    Intercept              5.0081686402      0.2436367946
    vtype        P_aCO2    2.1068507515      0.2529814138
    
    
           Covariance Parameter Estimates
    Cov Parm      Subject               Estimate
    
    UN(1,1)       persnum                0.074648628
    UN(2,1)       persnum                0.025216431
    UN(2,2)       persnum                0.425242835
    Var(1)        replicate(persnum)     0.011559757
    Var(2)        replicate(persnum)     0.254671475
    Corr(2,1)     replicate(persnum)    -0.509151762
    

    You will need several changes to your nlme::lme call to get here, but it is possible.

    The cause of your error is that lme can only handle numeric covariates in its random effect specification. You'll have to manually encode dummies for vtype, though as we'll see next you only need one:

    phPACO_long <- within(phPACO_long, {vPH = 0+(vtype=="pH")})
    

    By default SAS does not add an overall intercept to its random/repeated effects, so you get one dummy per level of vtype. If you try to do this in your current model you'll get an error from lme because it requires covariates to have a single level within a persnum/replicate block, which isn't true in your data. But fret not, we can include an overall intercept plus a single dummy and get to the same final fit. This applies to both random and correlation.

    Second, there's and error in your random specification: the effect needs to have the name of the block, persnum. Also, correlation blocks must be nested within random blocks, but luckily that is the case here. (They must have the same names to pass this check)

    Third, SAS defaults to restricted maximum likelihood, as does lme. Remove method="ML" or explicitly set the default method="REML". Edit: as it turns out they use ML in the paper (your SAS specification left it at the default REML), so to reproduce the paper exactly you'll need to use ML.

    And last, as mentioned lme only accepts a correlation matrix in correlation, so you need to allow for heterogeneous residual variance via weights = varIdent(). This effectively produces an unstructured R-side covariance (or variance-correlation) matrix.

    Putting all of that together:

    model <- nlme::lme(
      fixed = response ~ vtype,
      random = list("persnum" = pdSymm(form = ~1+vPH)),
      correlation = corSymm(form = ~1+vPH | persnum/replicate),
      weights = varIdent(form = ~1 | vtype),
      method = "ML",
      data = phPACO_long
    )
    
    summary(model)
    
    #> Random effects:
    #>             StdDev 
    #> (Intercept) 0.2732194
    #> vPH         0.6704198
    #> Residual    0.1075163      
    ## Note that above is sigma, SAS reports sigma^2
    
    #> Correlation: 
    #>    1     
    #> 2 -0.509
    
    #> Variance function:
    #>  Parameter estimates:
    #>   P_aCO2       pH 
    #> 1.000000 4.693707
    
    #> Fixed effects:  response ~ vtype 
    #>                Value Std.Error
    #> (Intercept) 5.008169 0.2460781
    #> vtypeP_aCO2 2.106851 0.2555021
    

    Not everything is reported on the same scale here, but the estimates match very well. For example, SAS UN(1,1) = 0.074648628 is the random intercept SD 0.2732194^2 = 0.07464884, the R-side Var(1) = 0.011559757 is the residual SD 0.1075163^2 = 0.01155975, Var(2) = 0.254671475 is the residual SD times its weight (0.1075163 * 4.693707)^2 = 0.2546716, et cetera. Thanks to the shared variance-correlation parametrization the R off-diagonal also matches (within rounding).

    To convert between variance-covariance and variance-correlation parametrization, recall that corr_AB = cov_AB / (sd_A * sd_B) -- this sure was easier in MathJax. So, to recover the off-diagonal covariance as in the paper you simply calculate (using the SAS values here) sqrt(0.011559757) * sqrt(0.254671475) * -0.509151762 = -0.0276256.