Requesting your help or expert opinion on a parallelization issue I am facing.
I regularly run an XGBoost classifier model on a rather large dataset (dim(train_data) = 357,401 x 281; dims after recipe prep() are 147,304 x 1,159) for a multiclass prediction. In base R the model runs in just over 4 hours using registerDoParallel() (with all 24 cores of my server). I am now trying to run it in the tidymodels environment; however, I have yet to find a robust parallelization option for tuning the grid.
I attempted the following parallelization options within tidymodels. All of them seem to work on a smaller subsample (e.g., 20% of the data), but options 1-4 fail when I run the entire dataset, mostly due to memory-allocation issues.
Option 5 (doParallel) has worked with 100% of the data in the tidymodels environment; however, it takes 4-6 hours to tune the grid. I would direct your attention to option 6 (future/furrr), which appeared to be the most efficient of all the methods I tried. This method, however, worked only once (the successful code is included below; please note I have incorporated a racing method and a stopping grid into the tuning).
doParallel::registerDoParallel(cores = 24)
library(furrr)
future::plan(multisession, gc = TRUE)

tic()
race_rs <- future_map_dfr(
  tune_race_anova(
    xgb_earlystop_wf,
    resamples = cv_folds,
    metrics = xgb_metrics,
    grid = stopping_grid,
    control = control_race(
      verbose = TRUE,
      verbose_elim = TRUE,
      allow_par = TRUE,
      parallel_over = 'everything'
    )
  ),
  .progress = TRUE,
  .options = furrr_options(packages = "parsnip"),
)
toc()
Interestingly, after that one success all subsequent attempts have failed, and I always get the same error. Each time the tuning progresses through all five CV folds and runs until the racing method has eliminated all but one parameter combination, yet it eventually fails with:
Error in future_map(.x = .x, .f = .f, ..., .options = .options, .env_globals = .env_globals, :
argument ".f" is missing, with no default
The OS & Version details I use are as follows:
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so
I am intrigued that the furrr/future option worked once but has failed in every attempt since. I have also tried using the development version of tune.
Any help or advice on parallelization options will be greatly appreciated.
Thanks, Rj
Apparently, in tidymodels the parallelization happens internally, and there is no need to use furrr/future for manual parallel computation. Moreover, the above code is syntactically incorrect: future_map_dfr() expects a list-like .x and a function .f, but here the result of tune_race_anova() is passed as .x and no .f is supplied at all, which is exactly what the "argument '.f' is missing" error reports. For a more detailed explanation of why this is, please see the post by mattwarkentin on the RStudio Community forum.
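For reference, here is a minimal sketch of that approach: register a parallel backend once and call tune_race_anova() directly, letting tune distribute the work internally via foreach rather than wrapping the call in future_map_dfr(). The object names (xgb_earlystop_wf, cv_folds, xgb_metrics, stopping_grid) are the ones from the question above and are assumed to already exist; the worker count is the 24 cores mentioned earlier.

```r
library(doParallel)
library(finetune)

# Register a parallel backend; tune picks it up automatically
# when allow_par = TRUE (the default).
cl <- makePSOCKcluster(24)
registerDoParallel(cl)

race_rs <- tune_race_anova(
  xgb_earlystop_wf,          # workflow from the question (assumed defined)
  resamples = cv_folds,      # 5-fold CV resamples (assumed defined)
  metrics   = xgb_metrics,   # metric set (assumed defined)
  grid      = stopping_grid, # tuning grid (assumed defined)
  control   = control_race(
    verbose_elim = TRUE,
    allow_par    = TRUE,
    parallel_over = "everything"  # parallelize over folds x candidates
  )
)

stopCluster(cl)
```

With this pattern there is no .x/.f pair to get wrong, and the racing/stopping-grid setup from the original code is preserved unchanged.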