r tidymodels

Why does recipes::bake() turn my character ID into a factor with value NA?


I have a character id that I'd like to keep in the data but not use for training. When I look at the baked data, I see my ID column is now a factor with a level for each value, but the values are all NA.

REPREX:

library(tidymodels)

# Make a dataset with a unique <chr> ID
myData <- mtcars %>% 
  mutate(myID = paste("j", row_number()))

myDataSplit <- initial_split(myData)

class(testing(myDataSplit)$myID) # [1] "character"

# change role of myID from 'predictor' to 'ID'
myRecipe <- recipe(mpg ~., data = training(myDataSplit)) %>%
  update_role(myID, new_role="ID") 
  

myTesting <- myRecipe  %>%
  prep() %>% 
  bake(testing(myDataSplit)) 

myID is now a factor with all the levels, but its values are always NA:

> unique(myTesting$myID)
[1] <NA>
Levels: j1 j10 j13 j15 j16 j17 j19 j21 j22 j23 j24 j25 j26 j27 j28 j29 j3 j30 j32 j4 j5 j6 j7 j8

I have also tried (unsuccessfully):

step_factor2string(myID)

and

myRecipe <- recipe(mpg ~., data = training(myDataSplit), convert_strings = FALSE)

I'm using a workflow(), which automatically bakes the data when I call predict(myworkflow, testing(myDataSplit)). That seems to work, but what happens to myID when I call bake() makes me think that I am missing something important. Thanks!


Solution

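  • By default, prep() converts character columns to factors using the levels seen in the training set, so IDs that appear only in the test set become NA. In this case, you'll want to set the strings_as_factors argument of prep() to FALSE: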

    library(tidymodels)
    df <- mtcars |> 
      mutate(myID = paste("j", row_number()))
    df_split <- initial_split(df)
    
    rec <- recipe(mpg ~ ., data = training(df_split)) |> 
      update_role(myID, new_role="ID") 
      
    
    rec  |> 
      prep(strings_as_factors = FALSE) |> 
      bake(testing(df_split)) 
    #> # A tibble: 8 × 12
    #>     cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb myID    mpg
    #>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
    #> 1     6 225     105  2.76  3.46  20.2     1     0     3     1 j 6    18.1
    #> 2     4  75.7    52  4.93  1.62  18.5     1     1     4     2 j 19   30.4
    #> 3     8 318     150  2.76  3.52  16.9     0     0     3     2 j 22   15.5
    #> 4     8 304     150  3.15  3.44  17.3     0     0     3     2 j 23   15.2
    #> 5     8 400     175  3.08  3.84  17.0     0     0     3     2 j 25   19.2
    #> 6     4 120.     91  4.43  2.14  16.7     0     1     5     2 j 27   26  
    #> 7     6 145     175  3.62  2.77  15.5     0     1     5     6 j 30   19.7
    #> 8     4 121     109  4.11  2.78  18.6     1     1     4     2 j 32   21.4
    

    Created on 2023-12-09 with reprex v2.0.2
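
To see why the IDs turn into NA, here is a minimal base R sketch of the presumed mechanism. It is not the internals of recipes, only an illustration, and the ID vectors and the name train_levels are made up for this example: when factor levels are fixed from the training values, any test value outside those levels has nothing to map to.

```r
# Hypothetical IDs standing in for the training and testing rows
train_ids <- c("j 1", "j 2", "j 3")
test_ids  <- c("j 4", "j 5")

# Levels are fixed from the values seen in training ...
train_levels <- sort(unique(train_ids))

# ... so test values that are not among those levels become NA
test_factor <- factor(test_ids, levels = train_levels)
test_factor
#> [1] <NA> <NA>
#> Levels: j 1 j 2 j 3
```

Because every ID is unique, no test ID ever appears in the training set, which is why the baked column is entirely NA.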

    If you are in a real-life situation where you do want some of your predictors to be factors, you might have to convert those columns to factors ahead of time, before passing the data to the recipe.
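
A rough sketch of that approach, assuming for illustration that cyl and am are the predictors you want as factors (that choice and the seed are arbitrary, not part of the original question):

```r
library(tidymodels)

# Convert the columns that should be factors BEFORE building the recipe;
# myID is left as a character column.
df <- mtcars |>
  mutate(
    myID = paste("j", row_number()),
    cyl  = factor(cyl),  # illustrative choice of factor predictor
    am   = factor(am)    # illustrative choice of factor predictor
  )

set.seed(123)  # arbitrary seed, only so the split is reproducible
df_split <- initial_split(df)

rec <- recipe(mpg ~ ., data = training(df_split)) |>
  update_role(myID, new_role = "ID")

# With strings_as_factors = FALSE, character columns stay character,
# while columns that are already factors stay factors.
baked <- rec |>
  prep(strings_as_factors = FALSE) |>
  bake(testing(df_split))

class(baked$myID)
class(baked$cyl)
```

Creating the factors before the split also means the training and testing sets share the same levels.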