I am trying to predict what a patient's response to a treatment will be. I am an enthusiastic amateur when it comes to machine learning, but I can usually muddle my way through eventually. This one is beyond me.
I am using R 4.2.2 and tidymodels in RStudio 2023.06.0+421 on an M1 MacBook Air.
My dataset is, unfortunately, very small. I have used it to build logistic regression, Gaussian Naive Bayes, and C5.0 decision tree models with varying degrees of success, but I figure XGBoost is at least worth trying. The data consists of 131 observations of various blood tests and ventilator settings; after pre-processing, there are 39 variables for each observation.
I have initially created a resampled object:
xgb_v_fold <-
  vfold_cv(data = prone_session_1,
           v = 5,
           repeats = 5,
           strata = mortality_28)
The recipe for data processing is:
xgb_recipe <-
  recipe(prone_session_1, formula = mortality_28 ~ .) %>%
  step_rm(patient_id,
          bmi,
          weight_kg,
          fi_o2_supine,
          pa_o2_supine) %>%
  step_dummy(all_factor_predictors(), -mortality_28) %>%
  step_impute_bag(all_predictors()) %>%
  step_zv(all_predictors())
I have set the model up for hyperparameter tuning:
xgb_mod <-
  boost_tree(mode = 'classification',
             engine = 'xgboost',
             mtry = tune(),
             trees = tune(),
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune(),
             stop_iter = tune()
  )
For the tuning grid I have used a mix of default settings from the dials package, supplemented with results from the finalize() function.
xgb_param_fin <-
  extract_parameter_set_dials(xgb_mod) %>%
  finalize(xgb_recipe %>% prep() %>% juice())
xgb_grid <- grid_regular(mtry(range = c(1, 39)),
                         trees(),
                         min_n(),
                         tree_depth(range = c(1, 5)),
                         learn_rate(),
                         loss_reduction(),
                         sample_size(range = c(1, 1)),
                         stop_iter(),
                         levels = 10)
Combining all these gives a workflow object:
xgb_results <-
  workflow() %>%
  add_model(xgb_mod) %>%
  add_recipe(xgb_recipe) %>%
  tune_grid(resamples = xgb_v_fold,
            grid = xgb_grid)
When I run the workflow I get a series of repeated, similar error messages. I ran it overnight just in case, and in the morning it was still emitting the same message without ever completing the calculations. The error message is below.
→ NA | error: ℹ In index: 2.
Caused by error in `predict.xgb.Booster()`:
! [07:19:55] src/gbm/gbtree.cc:549: Check failed: tree_end <= model_.trees.size() (223 vs. 7) : Invalid number of trees.
Stack trace:
[bt] (0) 1 xgboost.so 0x000000013a10bd3c dmlc::LogMessageFatal::~LogMessageFatal() + 124
[bt] (1) 2 xgboost.so 0x000000013a16f3b0 xgboost::gbm::GBTree::PredictBatch(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, bool, unsigned int, unsigned int) + 496
[bt] (2) 3 xgboost.so 0x000000013a271434 xgboost::LearnerImpl::PredictRaw(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, bool, unsigned int, unsigned int) const + 116
[bt] (3) 4 xgboost.so 0x000000013a261fb4 xgboost::LearnerImpl::Predict(std::__1::shared_ptr<xgboost::DMatrix>, bool, xgboost::HostDeviceVector<float>*, unsigned int, unsigned int, bool, bool, bool, bool, bool) + 628
[bt] (4) 5 xgboost.so 0x000000013a2ca9e0 XGBoosterPredictFromDMatrix + 800
[b
→
How can I resolve this?
I was able to reproduce this, and agree it's a tricky one! In:

Check failed: tree_end <= model_.trees.size() (223 vs. 7) : Invalid number of trees.

model_.trees.size() refers to the number of trees actually trained (the value of trees()), and tree_end is the last tree in 1:trees() requested when predicting on new observations. XGBoost is saying that it can't predict with trees from later iterations than were actually trained.

Commenting out the call to tune trees() resolves the error. This isn't really a reduction in the size of your search space, as tuning over stop_iter across resamples will already result in varying numbers of trees. There is some debate as to whether trees() ought to be regarded as a tuning parameter at all.
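To see the check in isolation, outside of tune, a raw-xgboost sketch along these lines should trip the same (or a very similar) assertion. This is purely illustrative and assumes a reasonably recent xgboost (>= 1.6, where predict() accepts an iterationrange argument):

library(xgboost)

# Train a deliberately tiny model: only 7 boosting rounds
dtrain <- xgb.DMatrix(data = as.matrix(mtcars[, -1]), label = mtcars$mpg)
small_fit <- xgb.train(params = list(objective = "reg:squarederror"),
                       data = dtrain,
                       nrounds = 7)

# Ask for predictions using the first 223 iterations when only 7 exist,
# i.e. tree_end > model_.trees.size(), mirroring the "223 vs. 7" in your log
# (exact behaviour may vary a little across xgboost versions)
predict(small_fit, dtrain, iterationrange = c(1, 224))

As far as I can tell, that is essentially what happens inside tune_grid(): with trees() tuned, tune fits at the largest requested tree count and reuses that fit to predict at the smaller values, but early stopping via stop_iter means the booster can stop well short of the count being requested.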
I used the following reprex to reproduce the error:
library(tidymodels)

mtcars <- tibble(mtcars[rep(1:32, 10), ])

xgb_v_fold <-
  vfold_cv(data = mtcars,
           v = 5,
           repeats = 5)

xgb_recipe <-
  recipe(mtcars, formula = mpg ~ cyl + disp) %>%
  step_dummy(all_factor_predictors()) %>%
  step_impute_bag(all_predictors()) %>%
  step_zv(all_predictors())
xgb_mod <-
  boost_tree(mode = 'regression',
             engine = 'xgboost',
             mtry = tune(),
             trees = tune(),
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune(),
             stop_iter = tune()
  )
xgb_grid <- grid_regular(mtry(range = c(1, 5)),
                         trees(),
                         min_n(),
                         tree_depth(range = c(1, 5)),
                         learn_rate(),
                         loss_reduction(),
                         sample_size(range = c(1, 1)),
                         stop_iter(),
                         levels = 10)
xgb_results <-
  workflow() %>%
  add_model(xgb_mod) %>%
  add_recipe(xgb_recipe) %>%
  tune_grid(resamples = xgb_v_fold,
            grid = xgb_grid)
and resolved it by writing:
xgb_mod <-
  boost_tree(mode = 'regression',
             engine = 'xgboost',
             mtry = tune(),
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune(),
             stop_iter = tune()
  )
xgb_grid <- grid_regular(mtry(range = c(1, 5)),
                         min_n(),
                         tree_depth(range = c(1, 5)),
                         learn_rate(),
                         loss_reduction(),
                         sample_size(range = c(1, 1)),
                         stop_iter(),
                         levels = 10)
xgb_results <-
  workflow() %>%
  add_model(xgb_mod) %>%
  add_recipe(xgb_recipe) %>%
  tune_grid(resamples = xgb_v_fold,
            grid = xgb_grid)
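If you would rather not lean on parsnip's default number of trees after dropping trees = tune(), another option in the same spirit is to fix trees at a generous ceiling and let stop_iter decide how many trees are actually used. A rough sketch (the 500 below is just an arbitrary value I picked for illustration, not something from your post):

xgb_mod_fixed_trees <-
  boost_tree(mode = 'regression',
             engine = 'xgboost',
             trees = 500,   # fixed ceiling, not tuned; 500 is an arbitrary illustrative choice
             mtry = tune(),
             min_n = tune(),
             tree_depth = tune(),
             learn_rate = tune(),
             loss_reduction = tune(),
             sample_size = tune(),
             stop_iter = tune()
  )

The grid and workflow pieces stay exactly as in the resolved version above, since trees no longer appears as a tuned parameter.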