
Tidymodels and Imbalanced datasets - Subsampling when resampling


When dealing with imbalanced datasets, my understanding is that possible solutions are undersampling or oversampling the training set. The test set, however, should still reflect the imbalance of the original dataset.

The question is what happens when using cross-validation.

Let's say we are using 5-fold cross-validation with oversampling. My understanding is that, in the first iteration for example, oversampling should be performed to balance the training set (folds 1-4), while the testing set (fold 5) should remain imbalanced.

Does tidymodels take care of that on its own or not? It is still unclear to me even though I read the following resources:

https://community.rstudio.com/t/concerns-about-how-data-leakage-is-managed-using-tidymodels-thermis-package/76791

https://www.tidymodels.org/learn/models/sub-sampling/

Example taken from the second link:

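# Load the packages first: %>% and mutate() come with tidymodels
library(tidymodels)
library(themis)  # provides step_rose() and the other subsampling steps

imbal_data <- 
  readr::read_csv("https://tidymodels.org/learn/models/sub-sampling/imbal_data.csv") %>% 
  mutate(Class = factor(Class))
dim(imbal_data)
table(imbal_data$Class)

# ROSE subsampling is added as a recipe step
imbal_rec <- 
  recipe(Class ~ ., data = imbal_data) %>%
  step_rose(Class)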

library(discrim)
qda_mod <- 
  discrim_regularized(frac_common_cov = 0, frac_identity = 0) %>% 
  set_engine("klaR")

qda_rose_wflw <- 
  workflow() %>% 
  add_model(qda_mod) %>% 
  add_recipe(imbal_rec)
qda_rose_wflw

set.seed(5732)
# vfold_cv() defaults to v = 10, so this is 10-fold CV repeated 5 times
cv_folds <- vfold_cv(imbal_data, strata = "Class", repeats = 5)

cls_metrics <- metric_set(roc_auc, j_index)

set.seed(2180)
qda_rose_res <- fit_resamples(
  qda_rose_wflw, 
  resamples = cv_folds, 
  metrics = cls_metrics
)

collect_metrics(qda_rose_res)


Solution

  • Does tidymodels take care of that on its own or not?

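    It does. The subsampling steps in the themis package (such as step_rose()) are set up to skip those computations for data being predicted. They default to skip = TRUE, so within each resample they are applied when the recipe is estimated on the data used to fit the model, but not when the held-out data is processed for prediction.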

    Let's say the data used to build the model in a given fold has a 10% event rate and n = 1000, and you use down-sampling. After down-sampling, the model would be fit on equal class frequencies of 100 samples per class (n = 200), while the data being predicted would keep the full 100 samples of the held-out fold, still imbalanced (untouched).
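
The behavior described above can be checked by hand on a single resample. The following is a minimal sketch, not the code that fit_resamples() runs internally. It assumes a made-up data set (the names toy, x, rec, fit_counts and pred_counts are hypothetical) and uses step_downsample() from themis, which has skip = TRUE by default like step_rose().

```r
library(tidymodels)
library(themis)

set.seed(1)
# Hypothetical toy data: 1000 rows with a 10% event rate
toy <- tibble(
  x = rnorm(1000),
  Class = factor(rep(c("event", "other"), times = c(100, 900)))
)

# Stratified 10-fold CV; look at the first resample only
folds <- vfold_cv(toy, v = 10, strata = Class)
first_split <- folds$splits[[1]]

# Down-sampling step; skip = TRUE is the default for themis steps
rec <- recipe(Class ~ x, data = toy) %>%
  step_downsample(Class)

# Estimate the recipe on the analysis (model-building) part of the fold
prepped <- prep(rec, training = analysis(first_split))

# Data the model would be fit on: the down-sampling step has been applied
fit_counts <- table(bake(prepped, new_data = NULL)$Class)

# Data being predicted: the step is skipped, so the fold is left as it was
pred_counts <- table(bake(prepped, new_data = assessment(first_split))$Class)

fit_counts
pred_counts
```

The first table should show equal class counts, while the second should keep every row of the held-out fold at its original class imbalance. fit_resamples() applies the same prep-on-analysis, bake-on-assessment pattern to each resample, which is why the workflow in the question needs no extra handling.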