[SOLVED] rsample group_bootstraps is ~2000 times slower than bootstraps. Why?

The rsample package provides bootstraps() for ordinary bootstrapping and group_bootstraps() for bootstrapping whole groups of the data.

The grouped version is a lot slower (~2000 times). I expected it to be somewhat slower, but I do not understand why the difference is that large.

(The groups in my example are not meaningful, but this should be irrelevant for testing the speed of both functions.)

library(tidyverse)
library(rsample)

dat <- tibble(x = 1:1000)

microbenchmark::microbenchmark(
  grouped = {
    dat %>%
      group_bootstraps(group = x, times = 10)}, 
  simple = {
    dat %>%
      bootstraps(times = 10)},
  times = 5)

Solution

An alternative to grouping is bootstrapping on data which is nested on the grouping variable.

dat <- tibble(x = rep(1:1000, 2), y = 1:2000) 

f <- function(df, column){
  tibble(
    "estimate" = mean(pull(df, {{column}})),
    "term" = "mean"
  )
}


# Version 1 using rsample::group_bootstraps
start_time <- Sys.time()

dat %>%
  group_bootstraps(group = x, times = 10) %>%
  mutate(mean_stats = purrr::map(splits, ~ f(analysis(.), y))) %>% 
  int_pctl(mean_stats)

print(difftime(Sys.time(), start_time, units = "secs"))

# Version 2 using tidyr nest and unnest
start_time <- Sys.time()
dat %>%
  nest(.by = x) %>%
  bootstraps(times = 10) %>%
  mutate(mean_stats = purrr::map(splits, ~ f(unnest(analysis(.), data), y))) %>% 
  int_pctl(mean_stats)

print(difftime(Sys.time(), start_time, units = "secs"))

Version 2 is roughly 10 times slower than not grouping at all, which for my specific problem is much better than group_bootstraps().

I think the downside is that the resamples are not guaranteed to have the same number of rows: each resample draws whole groups, so the row count varies whenever the groups have different numbers of rows.
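
To make the row-count caveat concrete, here is a minimal sketch that counts the rows in each resample when the nest-then-bootstrap approach is used on groups of unequal size. The data set dat_unequal and its columns g and y are made up for illustration and are not from the example above; nest(.by = ...) assumes tidyr 1.3.0 or later, as in Version 2.

```r
library(dplyr)
library(tidyr)
library(rsample)

set.seed(1)

# Hypothetical data: 20 groups, where group i has i rows (210 rows in total)
dat_unequal <- tibble(
  g = rep(1:20, times = 1:20),
  y = seq_len(210)
)

# Nest so that each group is one row, then bootstrap the groups
resamples <- dat_unequal %>%
  nest(.by = g) %>%
  bootstraps(times = 10)

# Count the original rows in each resample after unnesting
rows_per_resample <- vapply(
  resamples$splits,
  function(s) nrow(unnest(analysis(s), data)),
  integer(1)
)

rows_per_resample
```

Each resample always contains 20 groups, but because groups are drawn with replacement and differ in size, the unnested row counts will generally differ from each other and from the original 210. If all groups had the same size, every resample would have the same number of rows.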