I have a custom function that splits my data into training and testing sets based on various criteria and rules. I'd like to use this function in a tidymodels workflow together with fit_resamples(). However, even when I make my list look like one created by vfold_cv(), it does not work. The example code I am using:
library(tidymodels) # attaches rsample, parsnip, workflows, tune, purrr, dplyr, tibble

data(ames, package = "modeldata")

split_data <- function(df, n) {
  set.seed(123) # for reproducibility
  df$id <- seq.int(nrow(df))
  list_of_splits <- list()
  for (i in 1:n) {
    train_index <- sample(df$id, size = ceiling(nrow(df) * .8))
    train_set <- df[train_index, ]
    test_set <- df[-train_index, ]
    list_of_splits[[i]] <- list(train_set = train_set, test_set = test_set)
  }
  return(list_of_splits)
}

splits <- split_data(ames, 5)
resamples <- map(splits, ~rsample::make_splits(
  x = .$train_set |> select(colnames(.$test_set)),
  assessment = .$test_set
))
names(resamples) <- paste0("Fold", seq_along(resamples))
resamples <- tibble::tibble(splits = resamples,
                            id = names(resamples))

lm_model <-
  linear_reg() %>%
  set_engine("lm")

lm_wflow <-
  workflow() %>%
  add_model(lm_model) %>%
  add_formula(Sale_Price ~ Longitude + Latitude)

res <- lm_wflow %>%
  fit_resamples(resamples = resamples)
The error returned after running that last line is:
Error in `check_rset()`:
! The `resamples` argument should be an 'rset' object, such as the type produced by `vfold_cv()` or other 'rsample' functions.
If I try to force the class with class(resamples) <- "rset", the object no longer looks correct and I get the same error.
What is the correct way to use a custom cross-validation data set with fit_resamples()?
An additional question: in the example code above, the training and test set sizes are consistent across folds. In my actual data, they will vary slightly. Does this matter at all?
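It sounds like rsample::manual_rset() might do what you want. It takes a list of rsplit objects (such as those returned by make_splits()) and a vector of ids, and returns an object with the rset class that fit_resamples() checks for.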
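A minimal sketch of how that could look with the example from the question. The split_data() function here is only a simplified stand-in for the real custom splitting function (it assumes plain random 80/20 partitions), and the fold ids are arbitrary labels; the rest uses make_splits() and manual_rset() from rsample.

```r
library(tidymodels)

data(ames, package = "modeldata")

# Simplified stand-in for the custom splitting function in the question:
# returns n random 80/20 train/test partitions as plain data frames.
split_data <- function(df, n) {
  set.seed(123) # for reproducibility
  lapply(seq_len(n), function(i) {
    train_index <- sample(nrow(df), size = ceiling(nrow(df) * 0.8))
    list(train_set = df[train_index, ], test_set = df[-train_index, ])
  })
}

splits <- split_data(ames, 5)

# Convert each train/test pair into an rsplit object.
split_objs <- purrr::map(
  splits,
  ~ rsample::make_splits(x = .x$train_set, assessment = .x$test_set)
)

# Build an rset from the list of rsplits and a vector of ids,
# instead of assembling a tibble by hand.
resamples <- rsample::manual_rset(
  splits = split_objs,
  ids = paste0("Fold", seq_along(split_objs))
)

lm_wflow <-
  workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_formula(Sale_Price ~ Longitude + Latitude)

res <- fit_resamples(lm_wflow, resamples = resamples)
collect_metrics(res)
```

As far as I know, manual_rset() does not require the splits to be the same size, so slightly varying training and test set sizes should not be a problem. Each fold's metrics are computed on that fold's own assessment set.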