r tidymodels

Recipe steps on response cannot be replicated with the test data set


I defined a workflow_set with a couple of different models with a common recipe. The recipe is created via the following function (irrelevant steps removed):

library(tidymodels)
preprocess_data <- function(data, resp, threshold = 20L) {
  resp <- ensym(resp)
  resp_str <- as_string(resp)
  valid_resp <- c("resp_1", "resp_2")
  other_resp <- setdiff(valid_resp, resp_str)
  recipe(data) %>%
    update_role(everything(), new_role = "predictor") %>%
    update_role(all_of(resp_str), new_role = "outcome") %>%
    update_role(all_of(other_resp), new_role = "alternative outcome") %>%
    step_other({{resp}}, threshold = {{threshold}}) %>%
    step_filter({{resp}} != "other") %>%
    step_mutate({{resp}} := factor({{resp}})) %>%
    step_rm(has_role("alternative outcome"))
}

Fitting works as expected, and I finally extracted and fitted my favorite model:

final_workflow <- extract_workflow(wfs, id = "relevant_id")
final_fit <- fit(final_workflow, data = training(data_split))

However, if I try to predict values, I get the following error:

predict(final_fit, new_data = testing(data_split))
# Error in `step_other()`:
# ! The following required column is missing from `new_data`: resp2

This is strange because, first, the column is of course present in the testing data frame and, second, prep() + bake() on the recipe with the testing data set works as expected.

Where is my mistake?


Solution

  • I think that there are two issues.

    First, we generally firewall the outcome from the predictors (and other columns) when making predictions. This is one of our "hidden guardrails" to help prevent inadvertent data leakage. While it can be convenient to get the outcome column along with everything else, you shouldn't need it to make predictions, so we don't include it.

    Second, we generally discourage using the outcome in a recipe step (related to the first point above). You may be able to avoid failure by using skip = TRUE for the steps that use {{resp}} (see the documentation for more information), but we can't guarantee it.
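One way to act on the second point is to handle the outcome before the recipe is built, so that no recipe step needs the outcome at prediction time. The following is only a minimal sketch under assumptions, not the asker's actual pipeline: `prepare_outcome()`, `min_n`, the `toy` data, and the `logistic_reg()` model are hypothetical stand-ins, and the rare-level filter is done with plain dplyr instead of `step_other()` + `step_filter()`.

```r
library(tidymodels)

# Hypothetical helper (not part of tidymodels): keep only rows whose outcome
# level occurs at least `min_n` times, then rebuild the factor so that the
# dropped levels disappear from its level set.
prepare_outcome <- function(data, resp, min_n = 20L) {
  data %>%
    add_count({{ resp }}, name = ".n_level") %>%
    filter(.n_level >= min_n) %>%
    select(-.n_level) %>%
    mutate({{ resp }} := factor({{ resp }}))
}

# Toy data (assumption): level "c" is rare and should be dropped.
set.seed(1)
toy <- tibble(
  resp_1 = rep(c("a", "b", "c"), times = c(120, 75, 5)),
  x      = rnorm(200)
)

cleaned    <- prepare_outcome(toy, resp_1, min_n = 20L)
data_split <- initial_split(cleaned, strata = resp_1)

# The recipe now only transforms predictors; the outcome is left untouched.
rec <- recipe(resp_1 ~ x, data = training(data_split)) %>%
  step_normalize(all_numeric_predictors())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg())

final_fit <- fit(wf, data = training(data_split))

# Prediction works even if the outcome column is absent from new_data.
preds <- predict(final_fit, new_data = testing(data_split) %>% select(-resp_1))
```

Note that this sketch counts outcome levels on the full data set before splitting, whereas `step_other()` would estimate them on the training set only. Whether that difference matters depends on your use case.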