rparallel-processingbootstrappingtidymodelsrsample

How to speed up the tidymodels bootstrapping with parallelization


I have the following code, that performs bootstrapping and calculates the confidence interval.

library(resample)
ibrary(broom)
library(dplyr)
library(purrr)
library(tibble)

lm_est <- function(split, ...) {
  lm(mpg ~ disp + hp, data = analysis(split)) %>%
    tidy()
}

set.seed(52156)
car_rs <-
  bootstraps(mtcars, 500, apparent = TRUE) %>%
  mutate(results = map(splits, lm_est))

int_pctl(car_rs, results) # this is important 

It produces

> int_pctl(car_rs, results)
# A tibble: 3 × 6
  term         .lower .estimate   .upper .alpha .method   
  <chr>         <dbl>     <dbl>    <dbl>  <dbl> <chr>     
1 (Intercept) 27.7      31.0    34.1       0.05 percentile
2 disp        -0.0431   -0.0295 -0.0123    0.05 percentile
3 hp          -0.0643   -0.0281 -0.00930   0.05 percentile

But it runs very slowly. How can I speed it up with parallelization? Note that the output of the parallelization needs to be able to be processed by int_pctl().

I tried this but failed:

library(parallel)
# set the number of cores to use for parallelization
cores <- detectCores() - 1
cl <- makeCluster(cores)

# use mcmapply to parallelize the bootstrapping process
car_rs$results <- mcmapply(lm_est, car_rs$splits, mc.cores = cores, mc.preschedule = TRUE)

stopCluster(cl)


Solution

  • There are parallel versions of purrr::map*() functions in the furrr package that you can use.

    library(rsample)
    library(broom)
    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    library(purrr)
    library(tibble)
    library(furrr)  #<- added
    #> Loading required package: future
    
    plan(multisession, workers = parallel::detectCores())  #<- added
    
    
    lm_est <- function(split, ...) {
      library(broom) #<- added to load inside of remote workers
      lm(mpg ~ disp + hp, data = analysis(split)) %>%
        tidy()
    }
    
    set.seed(52156)
    car_rs <-
      bootstraps(mtcars, 1500, apparent = TRUE) %>%
      mutate(results = future_map(splits, lm_est))   #<- changed
    
    int_pctl(car_rs, results) # this is important 
    #> # A tibble: 3 × 6
    #>   term         .lower .estimate   .upper .alpha .method   
    #>   <chr>         <dbl>     <dbl>    <dbl>  <dbl> <chr>     
    #> 1 (Intercept) 27.7      30.8    33.6       0.05 percentile
    #> 2 disp        -0.0443   -0.0298 -0.0146    0.05 percentile
    #> 3 hp          -0.0584   -0.0267 -0.00718   0.05 percentile
    

    Created on 2023-01-26 by the reprex package (v2.0.1)