I'm using the R package {recipes} for data preprocessing. I would like to transform a variable and then declare the transformed variable as the outcome for modeling.
However, an error is thrown:
```r
library(tidymodels)

rec <- recipe(~ ., data = mtcars) |>
  step_mutate(mpg2 = mpg * 2) |>
  update_role(mpg2, new_role = "outcome")
#> Error in `update_role()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `mpg2` doesn't exist.
```
Created on 2023-01-15 with reprex v2.0.2
The help pages of `step_mutate()` and `update_role()` do not mention updating the role of a mutated variable. When I update the role of a variable that has not been mutated, no error is thrown.
There are SO questions with a similar error message, but they seem to address different aspects.
```r
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] yardstick_1.1.0 workflowsets_1.0.0 workflows_1.1.2 tune_1.0.1
#> [5] tidyr_1.2.1 tibble_3.1.8 rsample_1.1.1 recipes_1.0.4
#> [9] purrr_1.0.1 parsnip_1.0.3 modeldata_1.0.1 infer_1.0.4
#> [13] ggplot2_3.4.0 dplyr_1.0.10 dials_1.1.0 scales_1.2.1
#> [17] broom_1.0.2 tidymodels_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] foreach_1.5.2 splines_4.2.1 R.utils_2.12.2
#> [4] prodlim_2019.11.13 assertthat_0.2.1 highr_0.10
#> [7] GPfit_1.0-8 yaml_2.3.6 globals_0.16.2
#> [10] ipred_0.9-13 pillar_1.8.1 backports_1.4.1
#> [13] lattice_0.20-45 glue_1.6.2 digest_0.6.31
#> [16] hardhat_1.2.0 colorspace_2.0-3 htmltools_0.5.4
#> [19] Matrix_1.5-3 R.oo_1.25.0 timeDate_4022.108
#> [22] pkgconfig_2.0.3 lhs_1.1.6 DiceDesign_1.9
#> [25] listenv_0.9.0 gower_1.0.1 lava_1.7.1
#> [28] timechange_0.2.0 styler_1.8.1 generics_0.1.3
#> [31] ellipsis_0.3.2 furrr_0.3.1 withr_2.5.0
#> [34] nnet_7.3-18 cli_3.6.0 survival_3.5-0
#> [37] magrittr_2.0.3 evaluate_0.19 R.methodsS3_1.8.2
#> [40] fs_1.5.2 fansi_1.0.3 future_1.30.0
#> [43] parallelly_1.34.0 R.cache_0.16.0 MASS_7.3-58.1
#> [46] class_7.3-20 tools_4.2.1 lifecycle_1.0.3
#> [49] stringr_1.5.0 munsell_0.5.0 reprex_2.0.2
#> [52] compiler_4.2.1 rlang_1.0.6 grid_4.2.1
#> [55] iterators_1.0.14 rstudioapi_0.14 rmarkdown_2.19
#> [58] gtable_0.3.1 codetools_0.2-18 DBI_1.1.3
#> [61] R6_2.5.1 lubridate_1.9.0 knitr_1.41
#> [64] fastmap_1.1.0 future.apply_1.10.0 utf8_1.2.2
#> [67] stringi_1.7.12 parallel_4.2.1 Rcpp_1.0.9
#> [70] vctrs_0.5.1 rpart_4.1.19 tidyselect_1.2.0
#> [73] xfun_0.36
```
This behavior is currently not properly documented.
The reason you are running into problems is that `add_role()`, `update_role()`, and `remove_role()` can only be applied to the variables passed to `recipe()`, and they are all executed before the step functions.
This means that the following two snippets return the same result:
```r
recipe(~ ., data = mtcars) |>
  step_mutate(mpg2 = mpg * 2) |>
  update_role(mpg2, new_role = "outcome")

recipe(~ ., data = mtcars) |>
  update_role(mpg2, new_role = "outcome") |>
  step_mutate(mpg2 = mpg * 2)
```
Reference (https://github.com/tidymodels/recipes/blob/ab2405a0393bba06d9d7a52b4dbba6659a6dfcbd/R/roles.R#L132):

> Roles can only be changed on the original data supplied to `recipe()`

Further discussion: https://github.com/tidymodels/recipes/issues/437.
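To make the ordering concrete, here is a minimal sketch, assuming a recipes 1.0.x installation that behaves as described above. The `"id"` role and the object names are arbitrary choices for illustration. Changing the role of an original column works even when `update_role()` is called after `step_mutate()`, while the derived column cannot be found:

```r
library(recipes)

base_rec <- recipe(~ ., data = mtcars) |>
  step_mutate(mpg2 = mpg * 2)

# `cyl` is an original column, so its role can be changed even though
# update_role() is called after step_mutate().
ok <- update_role(base_rec, cyl, new_role = "id")
roles <- summary(ok)
roles[roles$variable == "cyl", c("variable", "role")]

# `mpg2` is not among the variables passed to recipe(), so selecting it
# raises an error. tryCatch() captures the error instead of stopping.
res <- tryCatch(
  update_role(base_rec, mpg2, new_role = "outcome"),
  error = function(e) e
)
inherits(res, "error")
```

Note that `summary()` on the unprepped recipe lists only the original columns; `mpg2` appears only after `prep()`.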
The `role` argument of `step_mutate()` allows you to specify the role of the variables it creates:
```r
library(recipes)

recipe(~ ., data = mtcars) |>
  step_mutate(mpg2 = mpg * 2, role = "outcome") |>
  prep() |>
  summary()
#> # A tibble: 12 × 4
#>    variable type      role      source
#>    <chr>    <list>    <chr>     <chr>
#>  1 mpg      <chr [2]> predictor original
#>  2 cyl      <chr [2]> predictor original
#>  3 disp     <chr [2]> predictor original
#>  4 hp       <chr [2]> predictor original
#>  5 drat     <chr [2]> predictor original
#>  6 wt       <chr [2]> predictor original
#>  7 qsec     <chr [2]> predictor original
#>  8 vs       <chr [2]> predictor original
#>  9 am       <chr [2]> predictor original
#> 10 gear     <chr [2]> predictor original
#> 11 carb     <chr [2]> predictor original
#> 12 mpg2     <chr [2]> outcome   derived
```
Additionally, creating or modifying the outcome inside a recipe is not recommended. Such modifications should happen beforehand, preferably before data splitting.
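A minimal sketch of that recommendation, assuming the doubling of `mpg` stands in for whatever outcome transformation is needed. The object names (`cars`, `cars_split`, `cars_train`), the seed, and the 75/25 split are illustrative choices, not part of the original question:

```r
library(recipes)
library(rsample)

# Transform the outcome up front, outside the recipe.
cars <- mtcars
cars$mpg2 <- cars$mpg * 2
cars$mpg <- NULL  # drop the untransformed column so it is not used as a predictor

# Split only after the outcome has been created.
set.seed(123)
cars_split <- initial_split(cars, prop = 0.75)
cars_train <- training(cars_split)

# The outcome role now comes from the formula; no update_role() is needed.
rec <- recipe(mpg2 ~ ., data = cars_train)
role_info <- summary(rec)
role_info[role_info$variable == "mpg2", c("variable", "role")]
```

Because `mpg2` is an original column of the data passed to `recipe()`, its role is set directly by the formula.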