
Confidence intervals: using mob() trees (partykit package) with a logistic model


How can I calculate confidence intervals for the models in the terminal nodes of a MOB object? Alternatively, can I extract the model from each terminal node (in my case a logistic regression) and compute the confidence intervals for each?

Thank you


Solution

  • Basics

    With refit.modelparty() you can refit the models associated with the nodes of a modelparty object, as returned by mob() or glmtree(). You can then apply the usual extractor functions to these models to obtain the information you are interested in.

    Caveats

    1. The inference after the learning of the tree is not "honest" anymore, i.e., in your case the coverage of the confidence intervals will likely not be at the nominal level. To obtain honest results, you would have to split your data and learn the tree on one subsample, and then refit the models in the terminal nodes on the other subsample.

    2. If you use glmtree() then the models within each node are not full glm objects in order to make the fitting of the tree more efficient. Therefore, the profile likelihood confidence intervals from the confint() method do not work out of the box. You can either resort to Wald confidence intervals (e.g., available in coefci() from lmtest) or refit the models (similar to comment 1.).
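    To make explicit what coefci() computes, here is a minimal sketch of a Wald interval built by hand from the coefficient estimates and their standard errors, estimate ± z × SE. The mtcars-based fit is just an assumed stand-in for a node model:

```r
## hypothetical logistic fit standing in for a node model
m <- glm(am ~ hp, data = mtcars, family = binomial)

## Wald 95% interval: estimate +/- z * standard error
est <- coef(m)
se  <- sqrt(diag(vcov(m)))
ci  <- cbind(lower = est - qnorm(0.975) * se,
             upper = est + qnorm(0.975) * se)
ci
## for a binomial GLM this should agree with lmtest::coefci(m),
## which uses the same normal (z) quantiles
```

    Unlike the profile likelihood intervals from confint(), this only needs the coefficients and the covariance matrix, which is why it works even for the reduced model objects stored inside a glmtree.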

    Illustration

    As a starting point let us first consider a naive example based on glmtree() and coefci():

    ## Pima Indians diabetes data
    data("PimaIndiansDiabetes", package = "mlbench")
     
    ## recursive partitioning of a logistic regression model
    library("partykit")
    pid_tree <- glmtree(diabetes ~ glucose | pregnant +
      pressure + triceps + insulin + mass + pedigree + age,
      data = PimaIndiansDiabetes, family = binomial)
    
    ## extract fitted models in terminal nodes
    pid_glm <- refit.modelparty(pid_tree,
      node = nodeids(pid_tree, terminal = TRUE))
    
    ## compute Wald confidence intervals
    library("lmtest")
    lapply(pid_glm, coefci)
    ## $`2`
    ##                    2.5 %      97.5 %
    ## (Intercept) -13.36226424 -6.54075507
    ## glucose       0.03496439  0.08245134
    ## 
    ## $`4`
    ##                   2.5 %      97.5 %
    ## (Intercept) -8.27393574 -5.13723535
    ## glucose      0.03466962  0.05900533
    ## 
    ## $`5`
    ##                   2.5 %      97.5 %
    ## (Intercept) -3.84548740 -1.69642032
    ## glucose      0.01529946  0.03177217
    

    For the honest approach we could do:

    ## ids for random sample splitting (set a seed to make this reproducible)
    n <- nrow(PimaIndiansDiabetes)
    id <- sample(1:n, round(n/2))
    
    ## estimate tree on learning sample
    pid_tree <- glmtree(diabetes ~ glucose | pregnant +
      pressure + triceps + insulin + mass + pedigree + age,
      data = PimaIndiansDiabetes[id,], family = binomial)
    
    ## out-of-sample prediction
    pid_new <- PimaIndiansDiabetes[-id,]
    pid_new$node <- predict(pid_tree, newdata = pid_new, type = "node")
    
    ## fit separate models on each subset, split by predicted node
    pid_glm <- lapply(
      split(pid_new, pid_new$node),
      function(d) glm(diabetes ~ glucose, data = d, family = binomial)
    )
    
    ## obtain profile likelihood confidence intervals
    lapply(pid_glm, confint)
    ## Waiting for profiling to be done...
    ## Waiting for profiling to be done...
    ## Waiting for profiling to be done...
    ## $`2`
    ##                   2.5 %     97.5 %
    ## (Intercept) -8.25379802 -4.5031005
    ## glucose      0.02743191  0.0569824
    ## 
    ## $`4`
    ##                    2.5 %     97.5 %
    ## (Intercept) -25.32777607 -4.5078931
    ## glucose       0.02090339  0.1617026
    ## 
    ## $`5`
    ##                   2.5 %      97.5 %
    ## (Intercept) -6.38422374 -2.66969443
    ## glucose      0.02222873  0.05060984
    

    In models based on a linear predictor, like the logistic GLM, it is also possible to fit a single model representing the entire tree: you just have to include interactions of all regressors with the node indicator. To obtain the same parameters and confidence intervals as above, you need to use the nested coding (/) rather than the default interaction coding (*):

    pid_glm2 <- glm(diabetes ~ 0 + factor(node)/glucose,
      data = pid_new, family = binomial)
    confint(pid_glm2)
    ## Waiting for profiling to be done...
    ##                              2.5 %      97.5 %
    ## factor(node)2          -8.25379802 -4.50310050
    ## factor(node)4         -25.32777607 -4.50789313
    ## factor(node)5          -6.38422332 -2.66969490
    ## factor(node)2:glucose   0.02743191  0.05698240
    ## factor(node)4:glucose   0.02090339  0.16170262
    ## factor(node)5:glucose   0.02222874  0.05060983
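    To see why the nested coding is needed, the sketch below contrasts the two parametrizations on a generic grouped regression (iris is just an assumed stand-in for data split by node). Both codings fit the same model; only the meaning of the coefficients differs:

```r
## nested "/" coding: one intercept and one slope per group,
## directly comparable to the per-node models above
m_nested <- lm(Sepal.Length ~ 0 + Species/Petal.Length, data = iris)

## interaction "*" coding: baseline group plus differences from it
m_inter <- lm(Sepal.Length ~ Species * Petal.Length, data = iris)

coef(m_nested)
coef(m_inter)

## identical fits, different parametrization
all.equal(fitted(m_nested), fitted(m_inter))
```

    With the * coding, confint() would give intervals for the differences relative to the baseline group, not for each group's own intercept and slope, which is why the / coding reproduces the per-node intervals.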