I'm using the R package {recipes} for data preprocessing. Assume that I would like to transform some variable and then declare the transformed variable as the outcome variable for modeling.
However, an error is thrown:
library(tidymodels)
rec <- recipe( ~ ., data = mtcars) |>
step_mutate(mpg2 = mpg * 2) |>
update_role(mpg2, new_role = "outcome")
#> Error in `update_role()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `mpg2` doesn't exist.
Created on 2023-01-15 with reprex v2.0.2
The help pages of step_mutate() and update_role() do not mention the case of updating the role of an mutated variables. When I update the role of a variable without having mutated it, no error is thrown.
There are SO questions around with a similar error message (such as here, here, or here), but those questions seem to tap into different aspects.
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] yardstick_1.1.0 workflowsets_1.0.0 workflows_1.1.2 tune_1.0.1
#> [5] tidyr_1.2.1 tibble_3.1.8 rsample_1.1.1 recipes_1.0.4
#> [9] purrr_1.0.1 parsnip_1.0.3 modeldata_1.0.1 infer_1.0.4
#> [13] ggplot2_3.4.0 dplyr_1.0.10 dials_1.1.0 scales_1.2.1
#> [17] broom_1.0.2 tidymodels_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] foreach_1.5.2 splines_4.2.1 R.utils_2.12.2
#> [4] prodlim_2019.11.13 assertthat_0.2.1 highr_0.10
#> [7] GPfit_1.0-8 yaml_2.3.6 globals_0.16.2
#> [10] ipred_0.9-13 pillar_1.8.1 backports_1.4.1
#> [13] lattice_0.20-45 glue_1.6.2 digest_0.6.31
#> [16] hardhat_1.2.0 colorspace_2.0-3 htmltools_0.5.4
#> [19] Matrix_1.5-3 R.oo_1.25.0 timeDate_4022.108
#> [22] pkgconfig_2.0.3 lhs_1.1.6 DiceDesign_1.9
#> [25] listenv_0.9.0 gower_1.0.1 lava_1.7.1
#> [28] timechange_0.2.0 styler_1.8.1 generics_0.1.3
#> [31] ellipsis_0.3.2 furrr_0.3.1 withr_2.5.0
#> [34] nnet_7.3-18 cli_3.6.0 survival_3.5-0
#> [37] magrittr_2.0.3 evaluate_0.19 R.methodsS3_1.8.2
#> [40] fs_1.5.2 fansi_1.0.3 future_1.30.0
#> [43] parallelly_1.34.0 R.cache_0.16.0 MASS_7.3-58.1
#> [46] class_7.3-20 tools_4.2.1 lifecycle_1.0.3
#> [49] stringr_1.5.0 munsell_0.5.0 reprex_2.0.2
#> [52] compiler_4.2.1 rlang_1.0.6 grid_4.2.1
#> [55] iterators_1.0.14 rstudioapi_0.14 rmarkdown_2.19
#> [58] gtable_0.3.1 codetools_0.2-18 DBI_1.1.3
#> [61] R6_2.5.1 lubridate_1.9.0 knitr_1.41
#> [64] fastmap_1.1.0 future.apply_1.10.0 utf8_1.2.2
#> [67] stringi_1.7.12 parallel_4.2.1 Rcpp_1.0.9
#> [70] vctrs_0.5.1 rpart_4.1.19 tidyselect_1.2.0
#> [73] xfun_0.36
```
This behavior is currently not properly documented.
The reason why you are having problems is because add_role(), update_role() and remove_role() can only be applied to the variables passed to recipe(), and they are all executed before the step functions.
This means that the following two snippets of code returns the same result
recipe( ~ ., data = mtcars) |>
step_mutate(mpg2 = mpg * 2) |>
update_role(mpg2, new_role = "outcome")
recipe( ~ ., data = mtcars) |>
update_role(mpg2, new_role = "outcome") |>
step_mutate(mpg2 = mpg * 2)
Reference here https://github.com/tidymodels/recipes/blob/ab2405a0393bba06d9d7a52b4dbba6659a6dfcbd/R/roles.R#L132 :
Roles can only be changed on the original data supplied to
recipe()
More talk here https://github.com/tidymodels/recipes/issues/437.
The role argument of step_mutate() allows you to specify the role of the variables it creates
library(recipes)
recipe( ~ ., data = mtcars) |>
step_mutate(mpg2 = mpg * 2, role = "outcome") |>
prep() |>
summary()
#> # A tibble: 12 × 4
#> variable type role source
#> <chr> <list> <chr> <chr>
#> 1 mpg <chr [2]> predictor original
#> 2 cyl <chr [2]> predictor original
#> 3 disp <chr [2]> predictor original
#> 4 hp <chr [2]> predictor original
#> 5 drat <chr [2]> predictor original
#> 6 wt <chr [2]> predictor original
#> 7 qsec <chr [2]> predictor original
#> 8 vs <chr [2]> predictor original
#> 9 am <chr [2]> predictor original
#> 10 gear <chr [2]> predictor original
#> 11 carb <chr [2]> predictor original
#> 12 mpg2 <chr [2]> outcome derived
Additionally, it is not recommended that you try to create/modify the outcome inside a recipe. Such modifications should happen before, preferable before data splitting.