Error in validate_column_names(): missing required columns after applying a recipe in a tidymodels workflow with XGBoost

I'm encountering an issue when using tidymodels with xgboost in a workflow. After applying a recipe that includes step_dummy() to convert categorical variables into dummy variables, I receive the following error when trying to make predictions:

Error in `validate_column_names()`:
! The following required columns are missing: 'A', 'B', 'C', 'D'.

Here's a simplified version of my code:

library(tidymodels)
library(xgboost)
library(dplyr)

set.seed(123)
datensatz <- tibble(
  outcome = rnorm(100, mean = 60, sd = 10),
  A = factor(sample(c("h", "i", "j"), 100, replace = TRUE)),
  B = factor(sample(c("e", "f", "g"), 100, replace = TRUE)),
  C = factor(sample(1:3, 100, replace = TRUE)),
  D = factor(sample(c("a", "b"), 100, replace = TRUE))
)

# splitting
data_split <- initial_split(datensatz, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)


# Recipe
recipe_obj <- recipe(outcome ~ ., data = train_data) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

prepared_recipe <- prep(recipe_obj)
test_data_prepared <- bake(prepared_recipe, new_data = test_data)

# XGBoost model specification
xgboost_spec <- boost_tree(
  trees = 1000,                    
  tree_depth = 6,                  
  min_n = 10,                      
  loss_reduction = 0.01,           
  sample_size = 0.8,               
  mtry = 0.8,                      
  learn_rate = 0.01                
) %>%
  set_mode("regression") %>%
  set_engine("xgboost", counts = FALSE, colsample_bytree = 0.8)

# Workflow
workflow_obj <- workflow() %>%
  add_recipe(recipe_obj) %>%
  add_model(xgboost_spec)

# Train the model
xgboost_fit <- fit(workflow_obj, data = train_data)

# Model prediction on the prepared test data (the error occurs here)
predictions <- predict(xgboost_fit, new_data = test_data_prepared)

# Results
predictions

I suspect the issue is related to the fact that step_dummy() removes the original categorical columns (A, B, C, D) and replaces them with dummy variables. However, the workflow seems to expect the original columns when making predictions.

How can I resolve this issue and ensure that the prediction step correctly uses the dummy variables created by step_dummy()?

Additional Info:

I'm using the `xgboost` engine within the `tidymodels` framework.
The error message suggests that the workflow expects the original categorical variables, but these are no longer present after applying `step_dummy()`.
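
The mismatch described above can be checked directly by comparing column names before and after baking. The following is a minimal sketch on a small made-up data set; `toy`, `rec`, `baked`, and `missing_after_bake` are illustrative names, not part of the code above.

```r
library(tidymodels)

# Hypothetical toy data with two factor predictors
toy <- tibble(
  outcome = c(1.2, 3.4, 2.2, 5.1, 4.8, 3.9, 2.7, 4.1),
  A = factor(rep(c("h", "i", "j", "h"), 2)),
  D = factor(rep(c("a", "b"), 4))
)

rec <- recipe(outcome ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors())

# Apply the trained recipe; step_dummy() drops the original factor columns
baked <- bake(prep(rec), new_data = toy)

# Columns present in the raw data but gone after baking
missing_after_bake <- setdiff(names(toy), names(baked))
missing_after_bake
#> [1] "A" "D"
```

These are the kind of columns that `validate_column_names()` reports as missing when already-baked data is passed to `predict()` on a workflow.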

Solution

If you are using a recipe in a workflow, you don't need to manually prep() and bake() the test data set. A fitted workflow stores the trained recipe and applies it to new_data inside predict(), so it expects the original columns (A, B, C, D), not the dummy columns. You can delete the following lines:

prepared_recipe <- prep(recipe_obj)
test_data_prepared <- bake(prepared_recipe, new_data = test_data)

and predict with predict(xgboost_fit, new_data = test_data) instead of predict(xgboost_fit, new_data = test_data_prepared).
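
If the manual prep() and bake() calls were only there to look at the processed predictors, the fitted workflow can supply them. Here is a minimal sketch, assuming a small made-up data set and `linear_reg()` in place of XGBoost purely to keep the example quick (the preprocessing is done by the workflow, not the model). All object names are illustrative.

```r
library(tidymodels)

set.seed(123)
# Hypothetical toy data; the first 30 rows are used for training
toy <- tibble(
  outcome = rnorm(40, mean = 60, sd = 10),
  A = factor(rep(c("h", "i", "j", "h"), 10)),
  D = factor(rep(c("a", "b"), 20))
)
toy_train <- toy[1:30, ]
toy_test  <- toy[31:40, ]

rec <- recipe(outcome ~ ., data = toy_train) %>%
  step_dummy(all_nominal_predictors())

# Assumption: linear_reg() stands in for the XGBoost specification
wf_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg()) %>%
  fit(data = toy_train)

# The fitted workflow holds the trained recipe; baking the raw test data
# with it shows the columns the model actually receives
baked_test <- bake(extract_recipe(wf_fit), new_data = toy_test)
names(baked_test)

# predict() runs the same preprocessing internally, so it takes the raw data
preds <- predict(wf_fit, new_data = toy_test)
```

The baked data is for inspection only here; `predict()` still gets `toy_test` with its original factor columns.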

library(tidymodels)
library(xgboost)
library(dplyr)

set.seed(123)
datensatz <- tibble(
  outcome = rnorm(100, mean = 60, sd = 10),
  A = factor(sample(c("h", "i", "j"), 100, replace = TRUE)),
  B = factor(sample(c("e", "f", "g"), 100, replace = TRUE)),
  C = factor(sample(1:3, 100, replace = TRUE)),
  D = factor(sample(c("a", "b"), 100, replace = TRUE))
)

# splitting
data_split <- initial_split(datensatz, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

# Recipe
recipe_obj <- recipe(outcome ~ ., data = train_data) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%  
  step_zv(all_predictors()) %>%  
  step_normalize(all_numeric_predictors())  

# XGBoost model specification
xgboost_spec <- boost_tree(
  trees = 1000,                    
  tree_depth = 6,                  
  min_n = 10,                      
  loss_reduction = 0.01,           
  sample_size = 0.8,               
  mtry = 0.8,                      
  learn_rate = 0.01                
) %>%
  set_mode("regression") %>%
  set_engine("xgboost", counts = FALSE, colsample_bytree = 0.8)

# Workflow
workflow_obj <- workflow() %>%
  add_recipe(recipe_obj) %>%
  add_model(xgboost_spec)

# Train the model
xgboost_fit <- fit(workflow_obj, data = train_data)

# Predict on the raw (unbaked) test data; the workflow applies the recipe itself
predictions <- predict(xgboost_fit, new_data = test_data)

# Results
predictions
#> # A tibble: 25 × 1
#>    .pred
#>    <dbl>
#>  1  62.9
#>  2  58.2
#>  3  57.8
#>  4  59.5
#>  5  60.0
#>  6  61.9
#>  7  58.2
#>  8  61.4
#>  9  60.7
#> 10  54.9
#> # ℹ 15 more rows

Created on 2024-08-30 with reprex v2.1.1
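
If you would rather keep the manual prep() and bake() steps, the other consistent option is to drop the workflow: fit the parsnip model on baked training data and predict on baked test data. This is a sketch under stated assumptions, with a made-up data set and `linear_reg()` standing in for the XGBoost specification so that it runs quickly. All object names are illustrative.

```r
library(tidymodels)

set.seed(123)
# Hypothetical toy data; the first 30 rows are used for training
toy <- tibble(
  outcome = rnorm(40, mean = 60, sd = 10),
  A = factor(rep(c("h", "i", "j", "h"), 10)),
  D = factor(rep(c("a", "b"), 20))
)
toy_train <- toy[1:30, ]
toy_test  <- toy[31:40, ]

rec <- recipe(outcome ~ ., data = toy_train) %>%
  step_dummy(all_nominal_predictors())

# Train the recipe once, then bake both data sets with it
rec_prepped <- prep(rec, training = toy_train)
train_baked <- bake(rec_prepped, new_data = NULL)      # processed training set
test_baked  <- bake(rec_prepped, new_data = toy_test)  # processed test set

# Assumption: linear_reg() stands in for the XGBoost specification.
# The model is fitted outside a workflow, so it expects the baked columns.
model_fit <- linear_reg() %>%
  fit(outcome ~ ., data = train_baked)

preds_manual <- predict(model_fit, new_data = test_baked)
```

The point is to pick one route and stay on it: raw data with a workflow, or baked data with a bare model fit. Mixing the two produces the missing-columns error from the question.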