In this workflow for regularized regression with cross-validation, the data is supplied twice: once in the `recipe()` call and once in the `vfold_cv()` call. Doesn't this cause a conflict if the recipe performs some pre-processing on the data and the `vfold_cv()` function does not? It seems the recipe can go as far as reordering rows, expanding dummy variables, and so on, changing both the order of the observations and the data types of the columns in the data frames.
```r
library(tidymodels)

model <- workflow() |>
  add_recipe(
    recipe(mpg ~ wt + cyl + drat + qsec + vs + am, data = mtcars) |> # here
      step_scale() |>
      step_center()
  ) |>
  add_model(
    linear_reg(penalty = tune()) |>
      set_engine("glmnet")
  ) |>
  tune_grid(
    resamples = vfold_cv(v = 5, data = mtcars), # and here
    # grid = 30 # doesn't capture the minimum rmse
    grid = tibble(penalty = 10^seq(-8, 1, length.out = 40))
  )
#> Warning: package 'glmnet' was built under R version 4.3.3
#> → A | warning: A correlation computation is required, but `estimate` is constant and has 0
#>                standard deviation, resulting in a divide by 0 error. `NA` will be returned.
#> There were issues with some computations   A: x5

model |>
  autoplot()
```
Created on 2024-04-15 with reprex v2.0.2
A recipe is a specification of what to do with a dataset; by itself, it doesn't do anything. It does take a dataset as input so that it can know which columns are of which type (numeric, character, etc.). That's necessary for being able to use selectors like `all_numeric_predictors()` in, say, `step_pca(all_numeric_predictors())`.
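You can see the "specification vs. execution" split directly: `prep()` is what actually estimates the preprocessing parameters from a dataset, and `bake()` applies them. A minimal sketch (the variables chosen here are arbitrary, not taken from your model):

```r
library(recipes)

# The recipe is only a blueprint at this point -- mtcars is untouched.
rec <- recipe(mpg ~ wt + cyl, data = mtcars) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())

# prep() estimates the means and standard deviations from the training data.
prepped <- prep(rec, training = mtcars)

# bake() applies them; new_data = NULL returns the transformed training data.
baked <- bake(prepped, new_data = NULL)
head(baked)
```

Until `prep()` runs, nothing has been computed from the data beyond the column types.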
The output of `vfold_cv()` holds the data you give it and an allocation of rows to the different folds.
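A quick way to see that allocation, using `analysis()` and `assessment()` from rsample to pull out the two row subsets of a single fold (a small illustration, not part of your tuning code):

```r
library(rsample)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

# Each row of `folds` is one resample split of the same underlying data.
first_split <- folds$splits[[1]]
nrow(analysis(first_split))    # roughly 4/5 of the rows
nrow(assessment(first_split))  # the held-out fifth
```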
The pieces come together when you tune (via `tune_grid()` in your case): it takes the fold allocations from the rsample object, puts the rows of one fold into the assessment set and the rest into the analysis set, estimates the recipe's preprocessing parameters on the analysis set, fits the model to the preprocessed analysis set, and then applies the same preprocessing to the assessment set before predicting on it. That ensures your preprocessing steps are properly cross-validated along with the model.
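Hand-rolling one iteration makes this concrete. The sketch below is an assumed simplification of what `tune_grid()` does internally for a single resample (with made-up predictors and no tuning grid); the key point is that `prep()` only ever sees the analysis set:

```r
library(tidymodels)

rec <- recipe(mpg ~ wt + cyl, data = mtcars) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())

set.seed(123)
split <- vfold_cv(mtcars, v = 5)$splits[[1]]

# Preprocessing parameters come from the analysis set only...
prepped <- prep(rec, training = analysis(split))
train   <- bake(prepped, new_data = NULL)

# ...and are then reused, unchanged, on the assessment set.
test <- bake(prepped, new_data = assessment(split))

fit <- linear_reg() |> fit(mpg ~ wt + cyl, data = train)
predict(fit, new_data = test)
```

If the assessment set were centered and scaled with its own statistics instead, information from the held-out rows would leak into the evaluation.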
There are a few more details to it but this hopefully gives you an idea of how the pieces fit together.