I have a character id that I'd like to keep in the data but not use for training. When I look at the baked data, I see my ID column is now a factor with a level for each value, but the values are all NA.
REPREX:
library(tidymodels)
# Make a dataset with a unique <chr> ID
myData <- mtcars %>%
  mutate(myID = paste0("j", row_number()))
myDataSplit <- initial_split(myData)
class(testing(myDataSplit)$myID) # [1] "character"
# change role of myID from 'predictor' to 'ID'
myRecipe <- recipe(mpg ~ ., data = training(myDataSplit)) %>%
  update_role(myID, new_role = "ID")
myTesting <- myRecipe %>%
  prep() %>%
  bake(testing(myDataSplit))
myID is now a factor with all the levels, but every value is NA:
> unique(myTesting$myID)
[1] <NA>
Levels: j1 j10 j13 j15 j16 j17 j19 j21 j22 j23 j24 j25 j26 j27 j28 j29 j3 j30 j32 j4 j5 j6 j7 j8
I have also tried (unsuccessfully):
step_factor2string(myID)
and
myRecipe <- recipe(mpg ~ ., data = training(myDataSplit), convert_strings = FALSE)
I'm using a workflow(), which automatically bakes the data when I call predict(myworkflow, testing(myDataSplit)). That seems to work, but what happens to myID when I call bake() makes me think that I am missing something important. Thanks!
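The NAs appear because prep() converts character columns to factors by default, using only the levels it saw in the training set; the IDs in the test set are not among those levels, so they become NA when baked. In this case, you'll want to use the strings_as_factors argument for prep():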
library(tidymodels)
df <- mtcars |>
  mutate(myID = paste("j", row_number()))
df_split <- initial_split(df)
rec <- recipe(mpg ~ ., data = training(df_split)) |>
  update_role(myID, new_role = "ID")
rec |>
  prep(strings_as_factors = FALSE) |>
  bake(testing(df_split))
#> # A tibble: 8 × 12
#>     cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb myID    mpg
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1     6 225     105  2.76  3.46  20.2     1     0     3     1 j 6    18.1
#> 2     4  75.7    52  4.93  1.62  18.5     1     1     4     2 j 19   30.4
#> 3     8 318     150  2.76  3.52  16.9     0     0     3     2 j 22   15.5
#> 4     8 304     150  3.15  3.44  17.3     0     0     3     2 j 23   15.2
#> 5     8 400     175  3.08  3.84  17.0     0     0     3     2 j 25   19.2
#> 6     4 120.     91  4.43  2.14  16.7     0     1     5     2 j 27   26
#> 7     6 145     175  3.62  2.77  15.5     0     1     5     6 j 30   19.7
#> 8     4 121     109  4.11  2.78  18.6     1     1     4     2 j 32   21.4
Created on 2023-12-09 with reprex v2.0.2
If you are in a real-life situation where you do want some of your predictors to be factors, you might have to set those ahead of time, before the recipe.
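As a rough sketch of what "setting those ahead of time" could look like: convert the columns you want as factors with factor() before building the recipe, then keep strings_as_factors = FALSE in prep() so the ID stays a string. Treating cyl and gear as the categorical predictors is only an assumption for illustration, as are the seed and the name baked; this also assumes a recipes version where prep() accepts strings_as_factors, as in the code above.

```r
library(tidymodels)

# Assumption for illustration: cyl and gear are the predictors
# that should be treated as factors.
df <- mtcars |>
  mutate(
    myID = paste("j", row_number()),
    cyl  = factor(cyl),
    gear = factor(gear)
  )

set.seed(123)  # arbitrary seed so the split is reproducible
df_split <- initial_split(df)

rec <- recipe(mpg ~ ., data = training(df_split)) |>
  update_role(myID, new_role = "ID")

# Strings are left alone; columns that are already factors stay factors.
baked <- rec |>
  prep(strings_as_factors = FALSE) |>
  bake(testing(df_split))

class(baked$myID)
class(baked$cyl)
```

Here myID should come back as a character column with no NAs, while cyl and gear keep the factor type they were given before the recipe.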