Thanks in advance for your help! I'm new to tidymodels (and modeling in general) and am having a hard time working out what's going wrong in my workflow setup.
I'm running four different models to predict baseball win percentages based on a historical dataset. They are a linear model, elastic net model, random forest model, and XGBoost model. I know all the models work (I have tested them individually), but I am trying to use a workflow to test, cross-validate, and select the best models.
I have two recipes. The basic recipe, used for the random forest and XGBoost models, includes some preprocessing steps (selecting variables, step_zv, step_nzv, step_interact, step_corr, and step_impute_bag). The linear and elastic net models use a recipe that adds a normalization step.
After setting up my workflows and grids, when I try to run workflow_map(), I get two errors:
My questions:
--
For reference, here is some of the relevant code:
# Split data
team_split <- initial_split(mlb_final)
# Extract training and testing data
team_train <- training(team_split)
team_test <- testing(team_split)
# Resampling strategy
team_rs <- vfold_cv(team_train)
# Random forest model
mlb_forest <- rand_forest(
  mtry = tune(),   # rf_grid below varies mtry, so it has to be marked for tuning too
  min_n = tune()
) %>%
  set_engine("ranger",
             importance = "permutation") %>%
  set_mode("regression")
# Linear model
mlb_linear <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
# XGBoost
mlb_xgb <- boost_tree(
  trees = tune(),
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
# Elastic Net
mlb_elastic <- linear_reg(
  penalty = tune(),
  mixture = tune()
) %>%
  set_engine("glmnet") %>%
  set_mode("regression")
I've set up my workflows like this:
linear_workflow <- workflow() |>
  add_model(mlb_linear) |>
  add_recipe(normalized_recipe)
elastic_workflow <- workflow() |>
  add_model(mlb_elastic) |>
  add_recipe(normalized_recipe)
rf_workflow <- workflow() |>
  add_model(mlb_forest) |>
  add_recipe(basic_recipe)
xgb_workflow <- workflow() |>
  add_model(mlb_xgb) |>
  add_recipe(basic_recipe)
And my grids like this:
grid_ctrl <- control_grid(
  save_pred = TRUE,
  parallel_over = NULL,
  save_workflow = TRUE,
  verbose = TRUE
)
rf_grid <- grid_regular(
  min_n(range = c(5, 50)), # Min number of observations per leaf (tuning parameter)
  mtry(range = c(2, 10)),  # Number of variables to randomly sample at each split
  levels = 5               # Levels of grid search
)
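xgb_grid <- grid_regular(
  trees(range = c(100, 500)),
  min_n(range = c(5, 15)),
  tree_depth(range = c(3, 6)),
  # learn_rate() is on the log10 scale by default; trans = NULL keeps 0.05 to 0.1 as raw values
  learn_rate(range = c(0.05, 0.1), trans = NULL),
  levels = 5
)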
elastic_grid <- grid_regular(
  penalty(range = c(-2, 1), trans = log10_trans()),
  mixture(range = c(0, 1)),
  levels = 5
)
linear_grid <- 5
I then combined them into normalized and basic workflow sets:
normalized_mlb <- workflow_set(
  preproc = list(normalized = normalized_recipe),
  models = list(linear = mlb_linear,
                elastic = mlb_elastic)
)
basic_mlb <- workflow_set(
  preproc = list(basic = basic_recipe),
  models = list(rf = mlb_forest,
                xgb = mlb_xgb)
)
And then I tried to use workflow_map() on both the normalized and basic workflow sets:
lm_models <- normalized_mlb |>
  workflow_map("fit_resamples",
               seed = 100,
               verbose = TRUE,
               resamples = team_rs,
               control = grid_ctrl)
basic_models <- basic_mlb |>
  workflow_map("fit_resamples",
               seed = 100,
               verbose = TRUE,
               resamples = team_rs,
               control = grid_ctrl)
The workflows are split into normalized and basic workflows because, initially, I was trying to run them together and running into issues. However, I'm still not sure how to address these errors.
I used some simulated data to try to reproduce the results (and could).
Some of the workflows have tuning parameters and some don't. workflow_map() has the default argument of fn = "tune_grid" but will fall back to "fit_resamples" if the workflow doesn't have tuning parameters. If you take out "fit_resamples" from your code, so that the default is used, it runs.
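A minimal sketch of that fallback on a built-in dataset. The names lm_spec, tree_spec, mixed_set, and mixed_res are made up for illustration, and the rpart engine is an assumption chosen only because it needs no extra packages:

```r
library(tidymodels)

set.seed(100)
folds <- vfold_cv(mtcars, v = 5)

# One spec without tuning parameters, one with
lm_spec <- linear_reg() |>
  set_engine("lm")

tree_spec <- decision_tree(cost_complexity = tune()) |>
  set_engine("rpart") |>
  set_mode("regression")

mixed_set <- workflow_set(
  preproc = list(all = mpg ~ .),
  models = list(lm = lm_spec, tree = tree_spec)
)

# fn is left at its default ("tune_grid"); the lm workflow has nothing
# to tune, so it is resampled with fit_resamples() instead
mixed_res <- mixed_set |>
  workflow_map(resamples = folds, seed = 100, verbose = TRUE)

collect_metrics(mixed_res)
```

The lm workflow ends up with a single configuration, while the tree workflow gets one row per candidate value of cost_complexity.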
I can't reproduce "Error in summary.connection(connection) : invalid connection". I assume it is related to parallel processing? If you are working over a remote session, it could be related to a connection problem too.
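One common source of that message is a foreach backend that still points at a cluster which has already been stopped. A sketch of resetting to sequential processing, assuming the doParallel/foreach backend is what was in use (that is a guess; the post doesn't show how parallel processing was set up):

```r
library(doParallel)  # brings in foreach and parallel as well

cl <- parallel::makePSOCKcluster(2)  # hypothetical 2-worker cluster
registerDoParallel(cl)

# ... tuning would run here ...

parallel::stopCluster(cl)

# After stopCluster(), foreach still references the dead workers;
# registering the sequential backend clears that reference
foreach::registerDoSEQ()

foreach::getDoParName()
```

If the error goes away after this, the stale cluster was the likely culprit; if not, the remote-session explanation is worth checking.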
One other thing... we don't have an obvious way of adding custom grids (yet). You can do this though:
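basic_models <- basic_mlb |>
  # Add the custom grids before mapping so that tune_grid() actually uses them
  # (grid columns must match the parameters marked with tune() in each model spec)
  option_add(grid = xgb_grid, id = "basic_xgb") |>
  option_add(grid = rf_grid, id = "basic_rf") |>
  workflow_map(seed = 100, #<- removed "fit_resamples"
               verbose = TRUE,
               resamples = team_rs,
               control = grid_ctrl)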
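For a version that can be run without the baseball data, here is a sketch of the same pattern on mtcars, followed by ranking the results and pulling out the best tuning values. Everything named demo_* or tree_* is hypothetical, and rpart again stands in for the real models:

```r
library(tidymodels)

set.seed(100)
folds <- vfold_cv(mtcars, v = 5)

tree_spec <- decision_tree(cost_complexity = tune(), min_n = tune()) |>
  set_engine("rpart") |>
  set_mode("regression")

# Grid columns match the two parameters marked with tune() above
tree_grid <- grid_regular(
  cost_complexity(),
  min_n(range = c(2, 10)),
  levels = 3
)

demo_set <- workflow_set(
  preproc = list(all = mpg ~ .),
  models = list(lm = linear_reg(), tree = tree_spec)
) |>
  # Attach the custom grid to the tree workflow only
  option_add(grid = tree_grid, id = "all_tree")

demo_res <- demo_set |>
  workflow_map(resamples = folds, seed = 100)

# Best configuration of each workflow, ranked by RMSE
rank_results(demo_res, rank_metric = "rmse", select_best = TRUE)

# Tuning values of the best tree candidate
best_tree <- demo_res |>
  extract_workflow_set_result("all_tree") |>
  select_best(metric = "rmse")
```

Because the grid is stored in the workflow set's option column, it is only passed to the workflow whose id it was attached to; the lm workflow is still resampled as-is.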