Tags: r, linux, future, tidymodels, furrr

Error in future_map: argument ".f" is missing, with no default


I would appreciate your help or expert opinion on a parallelization issue I am facing.

I regularly run an XGBoost classifier on a rather large dataset for multiclass prediction (dim(train_data) is 357,401 x 281; after recipe prep() the dimensions are 147,304 x 1159). In base R the model runs in just over 4 hours using registerDoParallel() with all 24 cores of my server. I am now trying to run it in the tidymodels framework, but I have yet to find a robust parallelization option for tuning the grid.

I tried the following parallelization options within tidymodels. All of them seem to work on a smaller subsample (e.g. 20% of the data), but options 1-4 fail when I run the entire dataset, mostly due to memory allocation issues.

  1. makePSOCKcluster(), library(doParallel)
  2. registerDoFuture(), library(doFuture)
  3. doMC::registerDoMC()
  4. plan(cluster, workers), doFuture, parallel
  5. registerDoParallel(), library(doParallel)
  6. future::plan(multisession), library(furrr)

Option 5 (doParallel) has worked with 100% of the data in the tidymodels environment; however, it takes 4-6 hours to tune the grid. I would like to draw your attention to option 6 (future/furrr), which appeared to be the most efficient of all the methods I tried. However, this method worked only once (the successful code is included below; note that I have incorporated a racing method and a stopping grid into the tuning).

doParallel::registerDoParallel(cores = 24)
library(furrr)
future::plan(multisession, gc = T) 

tic()
race_rs <-  future_map_dfr(
  tune_race_anova(
    xgb_earlystop_wf,
    resamples     = cv_folds,
    metrics       = xgb_metrics,
    grid          = stopping_grid,
    control       = control_race(
      verbose       = TRUE,
      verbose_elim  = TRUE,
      allow_par     = TRUE,
      parallel_over = 'everything'
    )
  ),
  .progress = T,
  .options = furrr_options(packages = "parsnip"),
)
toc()

Interestingly, after that one success, all subsequent attempts have failed, always with the same error (below). Each time, the tuning progresses through all CV folds (n = 5) and runs until the racing method has eliminated all but one parameter combination, but it then fails with this error:

Error in future_map(.x = .x, .f = .f, ..., .options = .options, .env_globals = .env_globals, :
argument ".f" is missing, with no default

My OS and R version details are as follows:

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so

I am intrigued by how the furrr/future option worked once but has failed in every attempt since. I have also tried using the development version of tune.

Any help or advice on parallelization options will be greatly appreciated.

Thanks, Rj


Solution

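  • In tidymodels, parallelization happens internally: the tuning functions use a registered parallel backend (such as doParallel), so there is no need to wrap them in furrr/future for manual parallel computation. Moreover, the code above is not a valid use of future_map_dfr(): the entire tune_race_anova() call is passed as the first argument (.x) and no function is supplied as .f, which is exactly what the error message reports. This would also explain why the tuning appears to run to completion before the error is raised. For a more detailed explanation, see the post by mattwarkentin on the RStudio Community forum.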
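
To illustrate why the tuning can finish before the error appears, here is a minimal base-R sketch. Both map_like() and expensive_call() are hypothetical stand-ins (they are not furrr's or tune's actual code); the only assumption is that the mapping function evaluates .x before it evaluates .f.

```r
# Hypothetical stand-in for a map-style function; NOT furrr's implementation.
# It only assumes that .x is evaluated before .f is.
map_like <- function(.x, .f) {
  n <- length(.x)  # forces .x, so the expensive call runs here
  f <- .f          # evaluates .f; errors if no function was supplied
  lapply(.x, f)
}

# Hypothetical stand-in for the long-running tune_race_anova() call.
call_log <- character(0)
expensive_call <- function() {
  call_log <<- c(call_log, "expensive call ran")
  list(a = 1, b = 2)
}

# Same shape as the call in the question: the work is passed as .x
# and no .f is given.
err_msg <- tryCatch(
  map_like(expensive_call()),
  error = function(e) conditionMessage(e)
)

print(call_log)  # the expensive call completed before the failure
print(err_msg)   # the message names the missing .f argument
```

Applied to the question, this suggests removing the future_map_dfr() wrapper and assigning the result of tune_race_anova(...) to race_rs directly, keeping a doParallel backend registered as in option 5.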