r tidymodels

How does the data sent to `recipe` relate to the data sent to the `resample` function?


In this workflow for regularized regression with cross-validation, the data is supplied twice: in the recipe call and the vfold_cv call.

Doesn't this cause a conflict if the recipe pre-processes the data and vfold_cv does not? A recipe can go as far as arranging rows and expanding dummy variables, which changes the order of the observations and the data types of the columns in the data frame.

library(tidymodels)

model <- workflow() |> 
  add_recipe(
    recipe(mpg ~ wt + cyl + drat + qsec + vs + am, data = mtcars) |> # here
      step_scale(all_numeric_predictors()) |> # without a selector the step selects no columns
      step_center(all_numeric_predictors())
  ) |> 
  add_model(
    linear_reg(penalty = tune()) |> 
      set_engine("glmnet")
  ) |> 
  tune_grid(
    resamples = vfold_cv(v = 5, data = mtcars), # and here
    #grid = 30 # doesn't capture the minimum rmse
    grid = tibble(penalty = 10^seq(-8, 1, length.out = 40))
  )
#> Warning: package 'glmnet' was built under R version 4.3.3
#> → A | warning: A correlation computation is required, but `estimate` is constant and has 0
#>                standard deviation, resulting in a divide by 0 error. `NA` will be returned.
#> There were issues with some computations   A: x5

model |> 
  autoplot()

Created on 2024-04-15 with reprex v2.0.2


Solution

  • A recipe is a specification of what to do with a dataset; it does not do anything by itself. It takes a dataset as input only so that it knows which columns exist and what type each one is (numeric, character, etc.). That is what makes selectors like all_numeric_predictors() work in, say, step_pca(all_numeric_predictors()).

    The output of vfold_cv() holds the data you give it and an allocation of rows to the different folds.

    The pieces come together when you tune (via tune_grid() in your case):

    For each fold, it takes the allocation from the rsample object, puts that fold's rows in the assessment set and the remaining rows in the analysis set, estimates the recipe on the analysis set only, and applies the estimated recipe to both sets. The model is then fit on the processed analysis set and evaluated on the processed assessment set. Estimating the preprocessing inside each fold ensures that your preprocessing steps are properly cross-validated too.

    There are a few more details to it but this hopefully gives you an idea of how the pieces fit together.
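
To make the flow above concrete, here is a rough sketch of what happens for a single fold, written out by hand. It is an illustration only, not the actual internals of tune_grid(): the seed, the fixed penalty of 0.1, the use of step_normalize(), and all variable names are assumptions chosen for the example.

```r
library(tidymodels)

set.seed(123)  # arbitrary seed, only so the fold allocation is reproducible

# The recipe only records the steps. The data is used to learn column names
# and types; nothing is centered or scaled at this point.
rec <- recipe(mpg ~ wt + cyl + drat + qsec + vs + am, data = mtcars) |>
  step_normalize(all_numeric_predictors())

# vfold_cv() stores the untouched data plus the row allocation for each fold.
folds <- vfold_cv(mtcars, v = 5)

# Take the first fold and split it into its two parts.
split <- folds$splits[[1]]
analysis_set   <- analysis(split)    # rows used to estimate the recipe and fit the model
assessment_set <- assessment(split)  # held-out rows

# Estimate the preprocessing (means and standard deviations) on the analysis set only.
prepped <- prep(rec, training = analysis_set)

# Apply those same estimates to both sets.
analysis_baked   <- bake(prepped, new_data = analysis_set)
assessment_baked <- bake(prepped, new_data = assessment_set)

# Fit on the processed analysis set (penalty = 0.1 is an arbitrary illustrative value).
fitted_model <- linear_reg(penalty = 0.1) |>
  set_engine("glmnet") |>
  fit(mpg ~ ., data = analysis_baked)

# Evaluate on the processed assessment set.
preds <- predict(fitted_model, new_data = assessment_baked)
fold_rmse <- rmse_vec(truth = assessment_baked$mpg, estimate = preds$.pred)
```

Because the recipe is estimated separately inside each fold, the data passed to recipe() and the data passed to vfold_cv() never need to be reconciled: the first only serves as a template for column names and types, and the second is the data that actually gets split, preprocessed, and modeled.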