I'm trying to parallelize the training of multiple ML models using the AutoML feature provided by H2O. The core code I'm using is the following:
library(foreach)
library(doParallel)
project_folder <- "/home/user/Documents/"
ncores <- parallel::detectCores(logical = FALSE)
nlogiccpu <- parallel::detectCores()
max_mem_size <- "4G"
cl <- makeCluster(nlogiccpu)
registerDoParallel(cl)
df4 <- foreach(i = as.numeric(seq(1, length(divisions))), .combine = rbind) %dopar% {
  library(dplyr)
  library(h2o)
  h2o.init(nthreads = ncores, max_mem_size = max_mem_size)
  div <- divisions[i]
  df.h2o <- as.h2o(
    df %>% filter(code == div)
  )
  y <- "TARGET"
  x <- names(df.train.x.discretized)
  automl.models.h2o <- h2o.automl(
    x = x,
    y = y,
    training_frame = df.h2o,
    nfolds = 10,
    seed = 111,
    project_name = paste0("PRJ_", div)
  )
  leader <- automl.models.h2o@leader
  div_folder <- file.path(project_folder, paste0("Division_", div))
  h2o.saveModel(leader,
                path = file.path(div_folder, "TARGET_model_bin"))
  ...
}
Only some of the models are trained and saved to their folders, because at some point I get the following error:
water.exceptions.H2OIllegalArgumentException: Illegal argument: training_frame of function: grid: Cannot append new models to a grid with different training input
I suppose grids are used during the AutoML phase, so I tried to find a parameter for passing a grid_id, as I can with the h2o.grid function:
grid <- h2o.grid("gbm", grid_id = paste0("gbm_grid_id", div),
                 ...)
but I can't find a way to do that. I'm using H2O package version 3.24.0.2.
Any suggestions?
The short answer to the question is that you cannot use different training frames in a single grid. Each grid of models must be associated with a single training set (the idea is that you do not want to compare models trained on different training sets). This is why you are hitting the error: each of your df.h2o training frames is a different subset of the original df frame.
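One way to stay within that constraint is to train the divisions one after another on a single H2O cluster. Below is a minimal sketch, assuming the `df`, `divisions`, and `x` objects from the question. The helper names `model_dir` and `train_one_division` are made up for illustration, and the `h2o.removeAll()` call is my own precaution to clear frames, models, and grids between runs; I have not verified that it is strictly required on 3.24.0.2.

```r
# Hypothetical helper: folder where a division's leader model is saved.
model_dir <- function(project_folder, div) {
  file.path(project_folder, paste0("Division_", div), "TARGET_model_bin")
}

# Hypothetical helper: run AutoML for one division on the already running cluster.
train_one_division <- function(df, div, x, y, project_folder) {
  # Upload only this division's rows (base R subsetting, no dplyr needed).
  frame <- h2o::as.h2o(df[df$code == div, , drop = FALSE])
  aml <- h2o::h2o.automl(
    x = x,
    y = y,
    training_frame = frame,
    nfolds = 10,
    seed = 111,
    project_name = paste0("PRJ_", div)
  )
  # h2o.saveModel() returns the path of the saved model file.
  saved <- h2o::h2o.saveModel(aml@leader, path = model_dir(project_folder, div))
  # Assumption: wipe the cluster so the next division starts from a clean state.
  h2o::h2o.removeAll()
  saved
}

# Usage (not run here): one cluster, all cores, divisions in sequence.
# h2o::h2o.init(nthreads = -1, max_mem_size = "4G")
# x <- names(df.train.x.discretized)
# saved <- lapply(divisions, function(div) {
#   train_one_division(df, div, x, "TARGET", project_folder)
# })
```

Since each AutoML run already uses all the cores given to the cluster, the sequential loop should not leave the machine idle.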
Another note: H2O and R's parallel functionality don't mix well. H2O model training is already parallelized, but in a different way (for scalability reasons). The training of a single model is parallelized within H2O (on multiple cores), but H2O is not designed to train multiple models at once. If you want to train multiple models at once on a single machine, then you would have to start multiple H2O clusters in different R sessions on different ports.
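If you do want several clusters on one machine, each R worker has to start its own H2O instance on its own port. Note that calling `h2o.init()` with the default ip and port in every worker, as in the question, attaches all of them to the same cluster. The following is only a sketch under some assumptions: `worker_port`, `threads_per_worker`, and `train_on_own_cluster` are hypothetical helpers, 54321 is the default H2O port, and the ports are spaced by 2 because, as far as I know, each H2O node also uses the next port up for internal communication.

```r
# Hypothetical helper: a distinct port for task i, stepping by 2.
worker_port <- function(i, base_port = 54321L) {
  base_port + 2L * (as.integer(i) - 1L)
}

# Hypothetical helper: split the physical cores between the workers.
threads_per_worker <- function(total_cores, n_workers) {
  max(1L, as.integer(total_cores) %/% as.integer(n_workers))
}

# Hypothetical helper: train one division on a private H2O cluster.
train_on_own_cluster <- function(i, df, divisions, x, y, nthreads, max_mem_size) {
  # Starts a new local cluster on this port if none is running there.
  h2o::h2o.init(port = worker_port(i), nthreads = nthreads,
                max_mem_size = max_mem_size)
  # Shut the private cluster down when the task ends, even on error.
  on.exit(h2o::h2o.shutdown(prompt = FALSE), add = TRUE)
  div <- divisions[i]
  frame <- h2o::as.h2o(df[df$code == div, , drop = FALSE])
  aml <- h2o::h2o.automl(x = x, y = y, training_frame = frame,
                         nfolds = 10, seed = 111,
                         project_name = paste0("PRJ_", div))
  # Save aml@leader with h2o.saveModel() here, before the cluster shuts down.
  as.character(div)
}

# Usage inside the question's foreach loop (not run here):
# n_workers <- 2
# nthreads <- threads_per_worker(parallel::detectCores(logical = FALSE), n_workers)
# foreach(i = seq_along(divisions), .combine = c) %dopar% {
#   train_on_own_cluster(i, df, divisions, x, "TARGET", nthreads, "4G")
# }
```

Keep in mind that every cluster reserves its own `max_mem_size`, so total memory use grows with the number of workers.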