Tags: r, machine-learning, tidymodels

Use fit_resamples with custom split data?


I have a custom function that splits my data into training and testing sets based on various criteria and rules. I'd like to use this function in a tidymodels workflow together with fit_resamples. However, even when I make my list look like one produced by vfold_cv, it does not work. The example code I am using:

library(tidymodels)

data(ames, package = "modeldata")

split_data <- function(df, n) {
  set.seed(123) # for reproducibility
  df$id <- seq.int(nrow(df))
  list_of_splits <- list()
  
  for(i in 1:n) {
    train_index <- sample(df$id, size=ceiling(nrow(df)*.8))
    train_set <- df[train_index,]
    test_set <- df[-train_index,]
    list_of_splits[[i]] <- list(train_set = train_set, test_set = test_set)
  }
  
  return(list_of_splits)
}

splits <- split_data(ames, 5)

resamples <- map(splits, ~rsample::make_splits(
  x = .$train_set |> select(colnames(.$test_set)),
  assessment = .$test_set
))

names(resamples) <- paste0("Fold", seq_along(resamples))

resamples <- tibble::tibble(splits = resamples,
                            id = names(resamples))

lm_model <- 
  linear_reg() %>% 
  set_engine("lm")

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>%
  add_formula(Sale_Price ~ Longitude + Latitude)

res <- lm_wflow %>%
  fit_resamples(resamples = resamples)

The error returned after running that last line is:

Error in `check_rset()`:
! The `resamples` argument should be an 'rset' object, such as the type produced by `vfold_cv()` or other 'rsample' functions.

If I try to force the class to be "rset" with class(resamples) <- "rset", the object no longer looks correct and I get the same error.

What is the correct method of using a custom crossfold data set?

Note - additional question: In the example code above, the test and training set size is consistent across folds. In my actual data, this will vary slightly - does this matter at all?


Solution

  • This sounds like rsample::manual_rset() might do what you want?
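
A minimal sketch of how that suggestion could be applied to the example in the question, assuming a recent rsample in which make_splits() accepts a pair of data frames and manual_rset() takes a list of rsplit objects plus a character vector of ids. The split_data() helper below is a condensed stand-in for the asker's function, not a recommended way to split data:

```r
library(tidymodels)

data(ames, package = "modeldata")

# Condensed stand-in for the custom splitting function from the question:
# returns a list of train/test data frame pairs.
split_data <- function(df, n) {
  set.seed(123)
  lapply(seq_len(n), function(i) {
    train_index <- sample(nrow(df), size = ceiling(nrow(df) * 0.8))
    list(train_set = df[train_index, ], test_set = df[-train_index, ])
  })
}

splits <- split_data(ames, 5)

# Turn each train/test pair into an rsplit object.
split_objs <- lapply(splits, function(s) {
  make_splits(x = s$train_set, assessment = s$test_set)
})

# Build a proper rset from the rsplit objects instead of a plain tibble.
resamples <- manual_rset(
  splits = split_objs,
  ids    = paste0("Fold", seq_along(split_objs))
)

lm_wflow <-
  workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_formula(Sale_Price ~ Longitude + Latitude)

res <- fit_resamples(lm_wflow, resamples = resamples)
collect_metrics(res)
```

The only real change from the question's code is that the final tibble::tibble() call is replaced by manual_rset(), which attaches the classes and attributes that fit_resamples() checks for. On the additional question: as far as I am aware, manual_rset() does not require the splits to be the same size, so folds that vary slightly should be accepted. Each fold's metrics are computed on whatever assessment set that split carries.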