rvariancenlme

reference category in varIdent function (nlme package) depends on data order


I found that function varIdent has an undesired behavior which is dependent on data order. In detail, the reference category depends on data order, see the next example

set.seed(123)
library(tidyverse)
library(nlme)
#> 
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse

n_smpl<-10
z0<-data.frame(sex=sample(c("F","M"),n_smpl,prob = c(0.6,0.4),replace = T) %>% factor)
summary(z0)
#>  sex  
#>  F:6  
#>  M:4
X<-model.matrix(~sex, z0)

sd_e<-1.3
e<-rnorm(nrow(z0),0,sd_e)
my_bts<-c(10,1.4)

z0$y<-(X %*% my_bts+e)%>% drop

m0<-gls(model=y~sex,weights=varIdent(form=~1|sex),data=z0[order(z0$sex),])
m1<-gls(model=y~sex,weights=varIdent(form=~1|sex),data=z0[order(z0$sex,decreasing =T),])

summary(m0)
#> Generalized least squares fit by REML
#>   Model: y ~ sex 
#>   Data: z0[order(z0$sex), ] 
#>        AIC      BIC    logLik
#>   35.82722 36.14498 -13.91361
#> 
#> Variance function:
#>  Structure: Different standard deviations per stratum
#>  Formula: ~1 | sex 
#>  Parameter estimates:
#>         F         M 
#> 1.0000000 0.5305269 
#> 
#> Coefficients:
#>                 Value Std.Error   t-value p-value
#> (Intercept) 10.344227 0.5847687 17.689434  0.0000
#> sexM         0.967754 0.6973690  1.387721  0.2026
#> 
#>  Correlation: 
#>      (Intr)
#> sexM -0.839
#> 
#> Standardized residuals:
#>         Min          Q1         Med          Q3         Max 
#> -1.38845834 -0.72023259 -0.02681155  0.85333098  1.31623645 
#> 
#> Residual standard error: 1.432385 
#> Degrees of freedom: 10 total; 8 residual
summary(m1)
#> Generalized least squares fit by REML
#>   Model: y ~ sex 
#>   Data: z0[order(z0$sex, decreasing = T), ] 
#>        AIC      BIC    logLik
#>   35.82722 36.14498 -13.91361
#> 
#> Variance function:
#>  Structure: Different standard deviations per stratum
#>  Formula: ~1 | sex 
#>  Parameter estimates:
#>        M        F 
#> 1.000000 1.884919 
#> 
#> Coefficients:
#>                 Value Std.Error   t-value p-value
#> (Intercept) 10.344227 0.5847687 17.689434  0.0000
#> sexM         0.967754 0.6973690  1.387721  0.2026
#> 
#>  Correlation: 
#>      (Intr)
#> sexM -0.839
#> 
#> Standardized residuals:
#>         Min          Q1         Med          Q3         Max 
#> -1.38845834 -0.72023259 -0.02681155  0.85333098  1.31623645 
#> 
#> Residual standard error: 0.7599187 
#> Degrees of freedom: 10 total; 8 residual

Created on 2023-04-04 by the reprex package (v2.0.1)

Look at Variance function outputs for m0 and m1 models, the reference category is different in each model. This does not happens in the mean structure component of the model.


Solution

  • This seems to be a bug, or at least an infelicity, in nlme. If your sex variable wasn't already a factor with defined levels I would say it was your fault, or at least expected - but it is.

    I have reported this here.

    A patched version that fixes this problem is available at https://github.com/bbolker/nlme: remotes::install_github("bbolker/nlme") to install it (you'll need to have development tools installed to install from source).

    A slightly more compact reproducible example:

    library(nlme)
    dd <- transform(mtcars,
                    am = factor(am))
    dd <- dd[order(dd$am),]
    dd_rev <- dd[order(dd$am, decreasing = TRUE),]
    
    m0 <- gls(mpg ~ hp, weights = varIdent(form = ~1|am), data = dd)
    m1 <- update(m0, data = dd_rev)
    
    coef(m0$modelStruct$varStruct)  # 0.9776946
    coef(m1$modelStruct$varStruct)  # -0.9776946