How can I calculate the CI for the models in my terminal nodes of MOB objects? Alternatively, Can I extract each model of terminal nodes so that it is, in my case, a logistic regression model and calculate the CI of each?
Thank you
With refit.modelparty()
you can refit the models associated with the nodes of a modelparty
object as returned by mob()
or glmtree()
. Then you can apply the usual functions to these to extract information that you are interested in.
The inference after the learning of the tree is not "honest" anymore, i.e., in your case the coverage of the confidence intervals will likely not be at the nominal level. To obtain honest results, you would have to split your data and learn the tree on one subsample, and then refit the models in the terminal nodes on the other subsample.
If you use glmtree()
then the models within each node are not full glm
objects in order to make the fitting of the tree more efficient. Therefore, the profile likelihood confidence intervals from the confint()
method do not work out of the box. You can either resort to Wald confidence intervals (e.g., available in coefci()
from lmtest
) or refit the models (similar to comment 1.).
As a starting point let us first consider a naive example based on glmtree()
and coefci()
:
## Pima Indians diabetes data
data("PimaIndiansDiabetes", package = "mlbench")
## recursive partitioning of a logistic regression model
library("partykit")
pid_tree <- glmtree(diabetes ~ glucose | pregnant +
pressure + triceps + insulin + mass + pedigree + age,
data = PimaIndiansDiabetes, family = binomial)
## extract fitted models in terminal nodes
pid_glm <- refit.modelparty(pid_tree,
node = nodeids(pid_tree, terminal = TRUE))
## compute Wald confidence intervals
library("lmtest")
lapply(pid_glm, coefci)
## $`2`
## 2.5 % 97.5 %
## (Intercept) -13.36226424 -6.54075507
## glucose 0.03496439 0.08245134
##
## $`4`
## 2.5 % 97.5 %
## (Intercept) -8.27393574 -5.13723535
## glucose 0.03466962 0.05900533
##
## $`5`
## 2.5 % 97.5 %
## (Intercept) -3.84548740 -1.69642032
## glucose 0.01529946 0.03177217
For the honest approach we could do:
## id for sample splitting
n <- nrow(PimaIndiansDiabetes)
id <- sample(1:n, round(n/2))
## estimate tree on learning sample
pid_tree <- glmtree(diabetes ~ glucose | pregnant +
pressure + triceps + insulin + mass + pedigree + age,
data = PimaIndiansDiabetes[id,], family = binomial)
## out-of-sample prediction
pid_new <- PimaIndiansDiabetes[-id,]
pid_new$node <- predict(pid_tree, newdata = pid_new, type = "node")
## fit separate models on each subset, splitted by predicted node
pid_glm <- lapply(
split(pid_new, pid_new$node),
function(d) glm(diabetes ~ glucose, data = d, family = binomial)
)
## obtain profile likelihood confidence intervals
lapply(pid_glm, confint)
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## $`2`
## 2.5 % 97.5 %
## (Intercept) -8.25379802 -4.5031005
## glucose 0.02743191 0.0569824
##
## $`4`
## 2.5 % 97.5 %
## (Intercept) -25.32777607 -4.5078931
## glucose 0.02090339 0.1617026
##
## $`5`
## 2.5 % 97.5 %
## (Intercept) -6.38422374 -2.66969443
## glucose 0.02222873 0.05060984
In models based on a linear predictor like the logistic GLM it is also possible to fit a single model representing the entire tree. You just have to include interactions of all regressors with the node indicator. To obtain the same parameters and confidence intervals you need to use a nested coding (/
) rather than the default interaction codeing (*
):
pid_glm2 <- glm(diabetes ~ 0 + factor(node)/glucose,
data = pid_new, family = binomial)
confint(pid_glm2)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## factor(node)2 -8.25379802 -4.50310050
## factor(node)4 -25.32777607 -4.50789313
## factor(node)5 -6.38422332 -2.66969490
## factor(node)2:glucose 0.02743191 0.05698240
## factor(node)4:glucose 0.02090339 0.16170262
## factor(node)5:glucose 0.02222874 0.05060983