I'm encountering an issue when using tidymodels with xgboost in a workflow. After applying a recipe that includes step_dummy()
to convert categorical variables into dummy variables, I receive the following error when trying to make predictions:
Error in `validate_column_names()`:
! The following required columns are missing: 'A', 'B', 'C', 'D'.
Here's a simplified version of my code:
library(tidymodels)
library(xgboost)
library(dplyr)
set.seed(123)
datensatz <- tibble(
  outcome = rnorm(100, mean = 60, sd = 10),
  A = factor(sample(c("h", "i", "j"), 100, replace = TRUE)),
  B = factor(sample(c("e", "f", "g"), 100, replace = TRUE)),
  C = factor(sample(1:3, 100, replace = TRUE)),
  D = factor(sample(c("a", "b"), 100, replace = TRUE))
)
# Split into training and test sets
data_split <- initial_split(datensatz, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)
# Recipe
recipe_obj <- recipe(outcome ~ ., data = train_data) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())
prepared_recipe <- prep(recipe_obj)
test_data_prepared <- bake(prepared_recipe, new_data = test_data)
# XGBoost model specification
xgboost_spec <- boost_tree(
  trees = 1000,
  tree_depth = 6,
  min_n = 10,
  loss_reduction = 0.01,
  sample_size = 0.8,
  mtry = 0.8,
  learn_rate = 0.01
) %>%
  set_mode("regression") %>%
  set_engine("xgboost", count = FALSE, colsample_bytree = 0.8)
# Workflow
workflow_obj <- workflow() %>%
  add_recipe(recipe_obj) %>%
  add_model(xgboost_spec)
# Train the model
xgboost_fit <- fit(workflow_obj, data = train_data)
# Predict on the prepared (already baked) test data
# Error occurs here
predictions <- predict(xgboost_fit, new_data = test_data_prepared)
# Results
predictions
I suspect the issue is related to the fact that step_dummy() removes the original categorical columns (A, B, C, D) and replaces them with dummy variables. However, the workflow seems to expect the original columns when making predictions.
How can I resolve this issue and ensure that the prediction step correctly uses the dummy variables created by step_dummy()?
Additional Info:
I'm using the `xgboost` engine within the `tidymodels` framework.
The error message suggests that the workflow expects the original categorical variables, but these are no longer present after applying `step_dummy()`.
If you are using a recipe in a workflow, you don't need to manually prep() and bake() the test data set. The fitted workflow applies the trained recipe to new_data itself when you call predict(), which is why it expects the original columns A, B, C and D. So you can delete the following lines
prepared_recipe <- prep(recipe_obj)
test_data_prepared <- bake(prepared_recipe, new_data = test_data)
and predict with predict(xgboost_fit, new_data = test_data) instead of predict(xgboost_fit, new_data = test_data_prepared):
library(tidymodels)
library(xgboost)
library(dplyr)
set.seed(123)
datensatz <- tibble(
  outcome = rnorm(100, mean = 60, sd = 10),
  A = factor(sample(c("h", "i", "j"), 100, replace = TRUE)),
  B = factor(sample(c("e", "f", "g"), 100, replace = TRUE)),
  C = factor(sample(1:3, 100, replace = TRUE)),
  D = factor(sample(c("a", "b"), 100, replace = TRUE))
)
# Split into training and test sets
data_split <- initial_split(datensatz, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)
# Recipe (not prepped by hand; the workflow takes care of that)
recipe_obj <- recipe(outcome ~ ., data = train_data) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())
# XGBoost model specification
xgboost_spec <- boost_tree(
  trees = 1000,
  tree_depth = 6,
  min_n = 10,
  loss_reduction = 0.01,
  sample_size = 0.8,
  mtry = 0.8,
  learn_rate = 0.01
) %>%
  set_mode("regression") %>%
  set_engine("xgboost", count = FALSE, colsample_bytree = 0.8)
# Workflow
workflow_obj <- workflow() %>%
  add_recipe(recipe_obj) %>%
  add_model(xgboost_spec)
# Train the model
xgboost_fit <- fit(workflow_obj, data = train_data)
# Predict on the raw test data; the workflow applies the recipe itself
predictions <- predict(xgboost_fit, new_data = test_data)
# Results
predictions
#> # A tibble: 25 × 1
#> .pred
#> <dbl>
#> 1 62.9
#> 2 58.2
#> 3 57.8
#> 4 59.5
#> 5 60.0
#> 6 61.9
#> 7 58.2
#> 8 61.4
#> 9 60.7
#> 10 54.9
#> # ℹ 15 more rows
Created on 2024-08-30 with reprex v2.1.1
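To make it clearer why the raw test data is the right input, here is a minimal sketch of what the fitted workflow roughly does at prediction time, spelled out by hand. It is an illustration under assumptions, not part of the original reprex: the toy data and the names toy, toy_train, toy_test, toy_recipe and toy_fit are made up, and linear_reg() is swapped in for the xgboost model only to keep the example small and fast. The two routes should give matching predictions.

```r
library(tidymodels)

set.seed(123)
# Hypothetical toy data, smaller than the data in the question
toy <- tibble(
  outcome = rnorm(60, mean = 60, sd = 10),
  A = factor(sample(c("h", "i", "j"), 60, replace = TRUE)),
  D = factor(sample(c("a", "b"), 60, replace = TRUE))
)
toy_train <- toy[1:45, ]
toy_test <- toy[46:60, ]

# Recipe is added to the workflow unprepped
toy_recipe <- recipe(outcome ~ ., data = toy_train) %>%
  step_dummy(all_nominal_predictors())

# Assumption: a linear model stands in for the boosted tree
toy_fit <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(linear_reg()) %>%
  fit(data = toy_train)

# Route 1: pass the raw test data and let the workflow preprocess it
via_workflow <- predict(toy_fit, new_data = toy_test)

# Route 2: bake with the trained recipe stored in the workflow,
# then predict with the underlying parsnip model
baked_test <- bake(extract_recipe(toy_fit), new_data = toy_test, all_predictors())
via_manual <- predict(extract_fit_parsnip(toy_fit), new_data = baked_test)

# Compare the two sets of predictions
all.equal(via_workflow$.pred, via_manual$.pred)
```

Route 2 is only useful for inspecting the intermediate data. In normal use, route 1 is all you need, and passing already baked data to the workflow fails because the columns the recipe expects are gone.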