tidymodels, r-recipes

Can't update role of mutated variables


Background

I'm using the R package {recipes} for data preprocessing. Assume that I would like to transform some variable and then declare the transformed variable as the outcome variable for modeling.

Problem and minimal example:

However, an error is thrown:

library(tidymodels)
rec <- recipe( ~ ., data = mtcars) |> 
  step_mutate(mpg2 = mpg * 2) |> 
  update_role(mpg2, new_role = "outcome")
#> Error in `update_role()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `mpg2` doesn't exist.

Created on 2023-01-15 with reprex v2.0.2

What I've tried

The help pages of step_mutate() and update_role() do not mention the case of updating the role of a mutated variable. When I update the role of a variable without having mutated it, no error is thrown.

There are several SO questions with a similar error message (such as here, here, or here), but they seem to address different issues.

Session info

sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] yardstick_1.1.0    workflowsets_1.0.0 workflows_1.1.2    tune_1.0.1        
#>  [5] tidyr_1.2.1        tibble_3.1.8       rsample_1.1.1      recipes_1.0.4     
#>  [9] purrr_1.0.1        parsnip_1.0.3      modeldata_1.0.1    infer_1.0.4       
#> [13] ggplot2_3.4.0      dplyr_1.0.10       dials_1.1.0        scales_1.2.1      
#> [17] broom_1.0.2        tidymodels_1.0.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] foreach_1.5.2       splines_4.2.1       R.utils_2.12.2     
#>  [4] prodlim_2019.11.13  assertthat_0.2.1    highr_0.10         
#>  [7] GPfit_1.0-8         yaml_2.3.6          globals_0.16.2     
#> [10] ipred_0.9-13        pillar_1.8.1        backports_1.4.1    
#> [13] lattice_0.20-45     glue_1.6.2          digest_0.6.31      
#> [16] hardhat_1.2.0       colorspace_2.0-3    htmltools_0.5.4    
#> [19] Matrix_1.5-3        R.oo_1.25.0         timeDate_4022.108  
#> [22] pkgconfig_2.0.3     lhs_1.1.6           DiceDesign_1.9     
#> [25] listenv_0.9.0       gower_1.0.1         lava_1.7.1         
#> [28] timechange_0.2.0    styler_1.8.1        generics_0.1.3     
#> [31] ellipsis_0.3.2      furrr_0.3.1         withr_2.5.0        
#> [34] nnet_7.3-18         cli_3.6.0           survival_3.5-0     
#> [37] magrittr_2.0.3      evaluate_0.19       R.methodsS3_1.8.2  
#> [40] fs_1.5.2            fansi_1.0.3         future_1.30.0      
#> [43] parallelly_1.34.0   R.cache_0.16.0      MASS_7.3-58.1      
#> [46] class_7.3-20        tools_4.2.1         lifecycle_1.0.3    
#> [49] stringr_1.5.0       munsell_0.5.0       reprex_2.0.2       
#> [52] compiler_4.2.1      rlang_1.0.6         grid_4.2.1         
#> [55] iterators_1.0.14    rstudioapi_0.14     rmarkdown_2.19     
#> [58] gtable_0.3.1        codetools_0.2-18    DBI_1.1.3          
#> [61] R6_2.5.1            lubridate_1.9.0     knitr_1.41         
#> [64] fastmap_1.1.0       future.apply_1.10.0 utf8_1.2.2         
#> [67] stringi_1.7.12      parallel_4.2.1      Rcpp_1.0.9         
#> [70] vctrs_0.5.1         rpart_4.1.19        tidyselect_1.2.0   
#> [73] xfun_0.36

Solution

  • This behavior is currently not properly documented.

    You are running into this because add_role(), update_role(), and remove_role() can only be applied to the variables passed to recipe(), and they all take effect before any of the step functions are executed.

    This means that the following two snippets of code return the same result (both fail with the same error):

    recipe( ~ ., data = mtcars) |> 
      step_mutate(mpg2 = mpg * 2) |> 
      update_role(mpg2, new_role = "outcome")
    
    recipe( ~ ., data = mtcars) |>
      update_role(mpg2, new_role = "outcome") |>
      step_mutate(mpg2 = mpg * 2)
    

    The source code says as much (https://github.com/tidymodels/recipes/blob/ab2405a0393bba06d9d7a52b4dbba6659a6dfcbd/R/roles.R#L132):

    Roles can only be changed on the original data supplied to recipe()

    There is more discussion in https://github.com/tidymodels/recipes/issues/437.

    The role argument of step_mutate() lets you specify the role of the variables it creates:
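As a quick check of the point above, here is a minimal sketch (the object names rec_ok, info, and msg are only illustrative). It assumes the same mtcars setup as in the question and contrasts a role update on an original column with one on a column that only step_mutate() would create:

```r
library(recipes)

# update_role() on a column that exists in the data passed to recipe():
# this works even though it is written after step_mutate().
rec_ok <- recipe(~ ., data = mtcars) |>
  step_mutate(mpg2 = mpg * 2) |>
  update_role(mpg, new_role = "outcome")

info <- summary(rec_ok)
info$role[info$variable == "mpg"]

# update_role() on a column that only a step creates: this fails at the
# update_role() call, because mpg2 is not among the original variables.
msg <- tryCatch(
  {
    recipe(~ ., data = mtcars) |>
      step_mutate(mpg2 = mpg * 2) |>
      update_role(mpg2, new_role = "outcome")
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

Note that summary() on the unprepped recipe lists only the original columns, which is the set that the role functions can see.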

    library(recipes)
    
    recipe( ~ ., data = mtcars) |> 
      step_mutate(mpg2 = mpg * 2, role = "outcome") |>
      prep() |>
      summary()
    #> # A tibble: 12 × 4
    #>    variable type      role      source  
    #>    <chr>    <list>    <chr>     <chr>   
    #>  1 mpg      <chr [2]> predictor original
    #>  2 cyl      <chr [2]> predictor original
    #>  3 disp     <chr [2]> predictor original
    #>  4 hp       <chr [2]> predictor original
    #>  5 drat     <chr [2]> predictor original
    #>  6 wt       <chr [2]> predictor original
    #>  7 qsec     <chr [2]> predictor original
    #>  8 vs       <chr [2]> predictor original
    #>  9 am       <chr [2]> predictor original
    #> 10 gear     <chr [2]> predictor original
    #> 11 carb     <chr [2]> predictor original
    #> 12 mpg2     <chr [2]> outcome   derived
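
If you then want to confirm that the derived column really carries the outcome role, one option is to bake the prepped recipe with a role selector. This is a small sketch under the same assumptions as the example above; prepped and outcome_only are just illustrative names:

```r
library(recipes)

prepped <- recipe(~ ., data = mtcars) |>
  step_mutate(mpg2 = mpg * 2, role = "outcome") |>
  prep()

# new_data = NULL returns the processed training data;
# all_outcomes() keeps only the columns whose role is "outcome".
outcome_only <- bake(prepped, new_data = NULL, all_outcomes())
names(outcome_only)
```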
    

    Additionally, creating or modifying the outcome inside a recipe is not recommended. Such modifications should happen beforehand, preferably before data splitting.
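
A rough sketch of that recommended workflow could look like the following. The use of rsample::initial_split(), the seed, and the decision to drop the untransformed mpg column are my assumptions for illustration, not something the question specifies:

```r
library(recipes)
library(rsample)

# Transform the outcome on the full data set, before any splitting.
cars <- mtcars
cars$mpg2 <- cars$mpg * 2
cars$mpg <- NULL  # assumption: drop the raw outcome so it cannot leak in as a predictor

set.seed(123)  # arbitrary seed, only for reproducibility
car_split <- initial_split(cars)
car_train <- training(car_split)

# mpg2 is now an original column, so the formula can assign its role directly.
rec <- recipe(mpg2 ~ ., data = car_train)
info <- summary(rec)
info$role[info$variable == "mpg2"]
```

Because mpg2 is part of the data handed to recipe(), update_role() would also work on it here if a different role were needed.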