tree, offset, mixed-models, party

GLMERTREE ranefstart and offset


I am using lmertree to fit a degradation model of the form

ln(y)=offset(ln(t0_value))+b*time

where y is the outcome of interest, t0_value is the initial concentration of the substance at time 0, b is the parameter to be estimated, and time is a variable measuring time. This is a longitudinal study, so there is an id variable (HC) that indexes measurements from the same subject, and finally there are some covariates at the level of the subject (i.e., non-time-dependent) which are of interest.

I have been experimenting with different kinds of lmertree models and exploring different options of the function, and I got confused by the options ranefstart and offset. In particular, if I set ranefstart = TRUE I get stunningly different results from those with ranefstart = NULL.

Now I show some of the code used to fit the models:

lmm_tree1 <- lmertree(log(y) ~ -1 + time | ((-1 + time) | HC) |
                        TD + EP2 + DFA + DTCX + TIF + SCV,
                      data = z0l,
                      offset = log(z0l[, "value_t0"]),
                      ranefstart = TRUE)
lmm_tree2 <- lmertree(log(y) ~ -1 + time | ((-1 + time) | HC) |
                        TD + EP2 + DFA + DTCX + TIF + SCV,
                      data = z0l,
                      offset = log(z0l[, "value_t0"]),
                      ranefstart = NULL)
lmm_tree  <- lmertree(log(y) ~ -1 + time | ((-1 + time) | HC) |
                        TD + EP2 + DFA + DTCX + TIF + SCV,
                      data = z0l,
                      offset = log(z0l[, "value_t0"]),
                      ranefstart = z0l[, "value_t0"])

Note that I have eliminated the intercept and used the offset option to specify the model that I want.

Models lmm_tree2 and lmm_tree are very similar (they differ in depth, but the split criteria are quite similar); however, model lmm_tree1 has only one node.

The question is: when and why should I use the ranefstart option?


Solution

  • Argument ranefstart

    Function lmertree iterates between estimating the LM-tree part of the model (here log(y) ~ -1 + time | TD + EP2 + DFA + DTCX + TIF + SCV) and the random-effects part of the model (here log(y) ~ ((-1 + time) | HC)). Argument ranefstart controls the initialization of these iterations, which will often be inconsequential. It can be consequential when there is variation in the response (here log(y)) that can potentially be explained by both the LM tree and the random effects:

    With ranefstart = TRUE, the random-effects part of the model is estimated first, and its predictions are used as an offset for growing the first tree. Variation that could be captured by either the tree or the random effects is then attributed to the random effects, which can yield a (much) smaller tree; this would explain why lmm_tree1 has only a single node, while with the default ranefstart = NULL the tree is grown first.
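    To make the alternating estimation concrete, here is a toy base-R sketch of such a scheme. All names are my own, and the "tree" step is reduced to a single fixed slope with unshrunken group means standing in for the random effects; this is an illustration of the idea, not glmertree's actual implementation:

```r
## Toy sketch of alternating estimation (NOT glmertree's actual code):
## the "tree" step is reduced to a single fixed slope, and unshrunken
## group means stand in for the random effects.
set.seed(1)
id   <- rep(1:10, each = 5)
time <- rep(0:4, times = 10)
y    <- 0.5 * time + rnorm(10)[id] + rnorm(50, sd = 0.1)

ranef_pred <- rep(0, length(y))  # zero starting offset (like ranefstart = NULL)
for (iter in 1:10) {
  ## "tree" step: estimate the fixed part, holding the random part fixed
  b <- coef(lm(y ~ -1 + time, offset = ranef_pred))[["time"]]
  ## "random-effects" step: re-estimate the group effects from the residuals
  ranef_pred <- ave(y - b * time, id)
}
b  # recovers approximately the true slope 0.5
```

    Each step holds one part of the model fixed as an offset while re-estimating the other, which is the same division of labor that lmertree's iterations follow.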

    By specifying ranefstart = z0l[,"value_t0"], this variable will be included as an additional offset in the linear predictor in the first iteration of the estimation; in further iterations, this starting offset is no longer used. Because you already specified offset = log(z0l[,"value_t0"]), the baseline enters the first iteration twice (and on two different scales: once as log(value_t0) and once as the unlogged value_t0). This can send the iterative estimation in a wrong direction, and further iterations might not be able to correct for this anymore, which could explain strongly diverging results.
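    The effect of the baseline entering the first step twice can be illustrated with a small base-R example (the variable names are made up, and a plain lm stands in for the first iteration of the estimation):

```r
## Toy illustration (base R, not glmertree): if the baseline enters both as
## the model offset and again as the starting offset, the first-step
## estimate of the slope is badly biased.
set.seed(2)
t0   <- runif(50, min = 1, max = 2)   # made-up baseline concentrations
time <- rep(0:4, times = 10)
logy <- log(t0) - 0.3 * time + rnorm(50, sd = 0.05)

## Correct first step: the offset enters once
b_once  <- coef(lm(logy ~ -1 + time, offset = log(t0)))[["time"]]
## First step with the baseline added again (here on the raw scale)
b_twice <- coef(lm(logy ~ -1 + time, offset = log(t0) + t0))[["time"]]
c(b_once = b_once, b_twice = b_twice)  # b_once is near -0.3, b_twice is not
```

    The biased first-step estimate then becomes the starting point that later iterations would have to recover from.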

    The ranefstart argument is useful only when you expect that a substantial part of the variance in the response can potentially be accounted for by both the tree and the random effects, and you prefer this variation to be attributed to the random-effects part of the model rather than to the tree.

  • Argument cluster

    In addition: you mention that you have subject-level (level-II) partitioning covariates. I would advise using the cluster argument in that case, because by default the parameter-stability tests employed for selecting the splitting variables assume the partitioning variables are measured at the lowest level (level I). By specifying cluster = HC, the parameter-stability tests will account for the partitioning variables being measured at level II. When the parameter-stability tests are performed at level I for level-II covariates, the tests may be overpowered.
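    The overpowering can be seen in a small base-R example (made-up data; plain lm fits stand in for the parameter-stability tests): for a subject-level covariate, the level-I standard error ignores the within-subject correlation and is far too small.

```r
## Toy illustration (base R, not glmertree's actual tests): a level-II
## covariate tested on level-I rows gets a much smaller standard error
## than it should, because within-subject correlation is ignored.
set.seed(3)
n_subj <- 30; n_obs <- 10
id <- rep(seq_len(n_subj), each = n_obs)
x  <- rep(rnorm(n_subj), each = n_obs)     # subject-level covariate
y  <- rep(rnorm(n_subj), each = n_obs) +   # strong subject effect
      rnorm(n_subj * n_obs, sd = 0.2)      # small within-subject noise

## Level-I test: pretend all 300 rows are independent
se_level1 <- summary(lm(y ~ x))$coefficients["x", "Std. Error"]
## Level-II test: one (averaged) observation per subject
se_level2 <- summary(lm(tapply(y, id, mean) ~
                        tapply(x, id, mean)))$coefficients[2, "Std. Error"]
c(level1 = se_level1, level2 = se_level2)  # level-I SE is much smaller
```

    Since x has no true effect here, the too-small level-I standard error translates directly into too-frequent "significant" splits.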

  • Plotting level-II group size in terminal nodes

    In a future version of glmertree it would be nice to have an argument for the plotting function specifying whether n in the terminal nodes should reflect sample size at level I, level II, etc. For now, I can suggest the following approach:

    ## Fit an example lmertree with partitioning variables at level II
    library("glmertree")
    data("GrowthCurveDemo", package = "glmertree")
    form <- y ~ time | person | x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
    lt <- lmertree(form, cluster = person, data = GrowthCurveDemo)
    
    ## Create a tree copy with node sample sizes at level II
    node_ids <- predict(lt, type = "node")
    lt_node <- as.list(lt$tree$node)
    for (i in unique(node_ids)) {
      lt_node[[i]]$info$nobs <-
        length(unique(lt$tree$data[node_ids == i, "(cluster)"]))
    }
    lt2 <- lt
    lt2$tree$node <- as.partynode(lt_node)
    
    ## Compare resulting trees:
    plot(lt, which = "tree", fitted = "marginal",
         main = "n is group size at level I")
    plot(lt2, which = "tree", fitted = "marginal", 
         main = "n is group size at level II")
    

    Note that to adapt this to your own code/tree, your fitted (g)lmertree object should be named lt (or adjust the code accordingly). Also, the use of fitted = "marginal" is not required, but for longitudinal data it often makes for a more intuitive plot. See ?plot.lmertree for more info.