I am trying to use the tidymodels R package for an ML pipeline. I can define a preprocessing pipeline (a recipe) on the training data and apply it to each resample of my cross-validation. But this uses the (global) training data to preprocess the folds. What I would consider correct is to define a preprocessing recipe on the "analysis" (i.e., training) part of each fold and apply it to the "assessment" (i.e., testing) part of that fold.
The following code gives an example of my problem:
library(tidyverse)
library(tidymodels)
set.seed(1000)
mtcars = mtcars |> select(mpg, hp)
init_split <- initial_split(mtcars, prop = 0.9)
preprocessing_recipe <- recipe(mpg ~ hp,
  data = training(init_split)
) |>
  step_normalize(all_predictors())
preprocessing_recipe = preprocessing_recipe %>% prep()
preprocessing_recipe
cv_folds <- bake(preprocessing_recipe, new_data = training(init_split)) %>%
  vfold_cv(v = 3)
## these resamples are not properly scaled:
training(cv_folds$splits[[1]]) %>% lapply(mean)
## $hp
## [1] 0.1442218
training(cv_folds$splits[[1]]) %>% lapply(sd)
## $hp
## [1] 1.167365
## while the preprocessing on the training data leads to exactly scaled data:
preprocessing_recipe$template %>% lapply(mean)
## $hp
## [1] -1.249001e-16
preprocessing_recipe$template %>% lapply(sd)
## $hp
## [1] 1
The reason why the above fails is clear. But how can I change the above pipeline (efficiently, elegantly) to define a recipe on the analysis part of each fold and apply it to the assessment part? In my view this is the way to do it that avoids data leakage. I haven't found any hints in the documentation or in any posts. Thanks!
When you are using a recipe as part of a full pipeline, you are unlikely to want to prep() or bake() it yourself outside of diagnostic purposes. What we recommend is to use the recipe with a workflow(), so it can be combined with a model specification. Here I'm adding a linear regression specification. These two together can be fit() and predict()ed on, but you can also fit them inside your cross-validation loop with fit_resamples() or tune_grid(), depending on your needs.
library(tidyverse)
library(tidymodels)
set.seed(1000)
mtcars <- mtcars |>
  select(mpg, hp)
init_split <- initial_split(mtcars, prop = 0.9)
mtcars_training <- training(init_split)
mtcars_folds <- vfold_cv(mtcars_training, v = 3)
preprocessing_recipe <- recipe(mpg ~ hp,
  data = mtcars_training
) |>
  step_normalize(all_predictors())
lm_spec <- linear_reg()
wf_spec <- workflow() |>
  add_recipe(preprocessing_recipe) |>
  add_model(lm_spec)
resampled_fits <- fit_resamples(
  wf_spec,
  resamples = mtcars_folds,
  control = control_resamples(extract = function(x) {
    tidy(x, "recipe", number = 1)
  })
)
We can see that the workflow is fit inside each fold by looking at the estimates from the recipe. I added a function to the extract argument of control_resamples() that pulls out the trained mean and sd that were calculated in the recipe.
resampled_fits |>
  collect_extracts() |>
  pull(.extracts)
#> [[1]]
#> # A tibble: 2 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 hp mean 140. normalize_x5pUR
#> 2 hp sd 77.3 normalize_x5pUR
#>
#> [[2]]
#> # A tibble: 2 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 hp mean 144. normalize_x5pUR
#> 2 hp sd 57.4 normalize_x5pUR
#>
#> [[3]]
#> # A tibble: 2 × 4
#> terms statistic value id
#> <chr> <chr> <dbl> <chr>
#> 1 hp mean 150. normalize_x5pUR
#> 2 hp sd 74.9 normalize_x5pUR
And we can see that they match the mean and sd from the original folds:
mtcars_folds$splits |>
  map(analysis) |>
  map(~ tibble(mean = mean(.x$hp), sd = sd(.x$hp)))
#> [[1]]
#> # A tibble: 1 × 2
#> mean sd
#> <dbl> <dbl>
#> 1 140. 77.3
#>
#> [[2]]
#> # A tibble: 1 × 2
#> mean sd
#> <dbl> <dbl>
#> 1 144. 57.4
#>
#> [[3]]
#> # A tibble: 1 × 2
#> mean sd
#> <dbl> <dbl>
#> 1 150. 74.9
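Once the resampled fits look acceptable, you would typically summarize performance across the folds with collect_metrics(), and then fit the finalized workflow once on the full training set and evaluate it on the held-out test set with last_fit() on the initial split. A minimal sketch continuing the code above (the default regression metrics are RMSE and R-squared; no new objects are assumed beyond wf_spec, init_split, and resampled_fits defined earlier):

```r
# Average RMSE / R^2 across the three folds
collect_metrics(resampled_fits)

# Fit on the full training portion of init_split and
# evaluate once on the 10% held-out test set
final_fit <- last_fit(wf_spec, split = init_split)
collect_metrics(final_fit)
```

last_fit() preps the recipe on the training portion only and bakes the test portion with those trained values, so the same no-leakage principle applies at the final evaluation step.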