I am trying to predict the frequency of an outcome and I have a lot of data. I have already fitted a glm to the data and now I am trying to use ctree to understand any complex interaction in the dataset that I may have missed.
Instead of directly predicting the residual, I have tried to use the glm prediction as an offset in the ctree model. However, I seem to get the same results when I: (a) use no offset at all, (b) specify the offset as a function argument, and (c) include the offset in the ctree formula.
I have looked at the documentation (here and here), but I have not found it helpful.
I have created some dummy data to mimic what I am doing:
library(partykit)
# Set random number seed
set.seed(15)
# Create Dataset
freq <- rpois(10000, 1.2)
example_df <- data.frame(var_1 = rnorm(10000, 180, 20) * freq / 10,
                         var_2 = runif(10000, 1, 8),
                         var_3 = runif(10000, 1, 2.5) + freq / 1000)
example_df$var_4 = example_df$var_1 * example_df$var_3 + rnorm(10000, 0.1, 0.5)
example_df$var_5 = example_df$var_2 * example_df$var_3 + rnorm(10000, 2, 50)
# Create GLM
base_mod <- glm(freq ~ ., family="poisson", data=example_df)
base_pred <- predict(base_mod)
# Create trees
exc_offset <- ctree(freq ~ ., data = example_df, control = ctree_control(alpha = 0.01, minbucket = 1000))
func_offset <- ctree(freq ~ ., data = example_df, offset = base_pred, control = ctree_control(alpha = 0.01, minbucket = 1000))
equ_offset <- ctree(freq ~ . + offset(base_pred), data = example_df, control = ctree_control(alpha = 0.01, minbucket = 1000))
I expected the trees fitted with an offset to differ from the tree fitted without one. However, the outputs seem to be the same:
# Predict outcomes
summary(predict(exc_offset, example_df))
summary(predict(func_offset, example_df))
summary(predict(equ_offset, example_df))
# Show trees
exc_offset
func_offset
equ_offset
Does anyone know what is going on? How should I use the offsets?
The ctree() algorithm is not based on a linear predictor, and hence including an offset is not possible out of the box. It is possible to include an offset by using a model-based ytrafo score, though. See vignette("ctree", package = "partykit") for more details (also available on CRAN at https://CRAN.R-project.org/web/packages/partykit/vignettes/ctree.pdf).
However, the more natural solution is to use a GLM model-based tree with the glmtree() function. I think you are trying to fit this tree:
glmtree(freq ~ ., data = example_df, offset = base_pred, family = poisson,
        alpha = 0.01, minsize = 1000)
See vignette("mob", package = "partykit") for more details (also available on CRAN at https://CRAN.R-project.org/web/packages/partykit/vignettes/mob.pdf).
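As a rough check that glmtree() actually uses the offset, the sketch below fits the same tree specification with and without it. The data, the column names (x1, x2, z, base_pred), and the settings alpha = 0.01 and minsize = 200 are made up for illustration and are not taken from the question. The formula y ~ 1 | x1 + x2 + z fits an intercept-only Poisson model in each node and keeps base_pred out of the partitioning variables.

```r
library(partykit)

set.seed(15)
n <- 2000

# Hypothetical data: the log-rate is linear in x1 plus a step in z
d <- data.frame(x1 = runif(n), x2 = runif(n), z = runif(n))
d$y <- rpois(n, exp(0.2 + 0.8 * d$x1 + 0.7 * (d$z > 0.5)))

# Base GLM that ignores z; its linear predictor (link scale) serves as the offset
base_mod <- glm(y ~ x1 + x2, family = poisson, data = d)
d$base_pred <- predict(base_mod, type = "link")

# Same tree specification, fitted with and without the offset
with_offset <- glmtree(y ~ 1 | x1 + x2 + z, data = d, offset = base_pred,
                       family = poisson, alpha = 0.01, minsize = 200)
without_offset <- glmtree(y ~ 1 | x1 + x2 + z, data = d,
                          family = poisson, alpha = 0.01, minsize = 200)

# Compare the number of terminal nodes and the node-wise intercepts
width(with_offset)
width(without_offset)
coef(with_offset)
coef(without_offset)
```

With the offset, each node intercept is on the log scale relative to the GLM prediction, so an intercept close to zero suggests the GLM already describes that subgroup reasonably well.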
But rather than estimating the offset once and then the tree once, it is also easily possible to iterate this process to obtain a better fit. We called this a PALM tree (partially additive linear model tree), available in the palmtree package (https://doi.org/10.1007/s11634-018-0342-1).
Finally, I would encourage you to explore which of the available covariates is used as:
Possibly, the resulting model might be more interpretable when the right part is chosen for each covariate.