I use targets as a pipelining tool for an ML project with H2O.
The main peculiarity of using H2O here is that it starts a new "cluster" (essentially a separate local process/server that communicates via REST APIs, as far as I understand).
The issue I am having is two-fold.
A minimal working example I came up with looks like this (this being the _targets.R file):
library(targets)
library(h2o)
# start the h2o cluster once _targets.R gets evaluated
h2o.init(nthreads = 2, max_mem_size = "2G", port = 54322, name = "TESTCLUSTER")
create_dataset_h2o <- function() {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  # convert the data to h2o dataframe
  as.h2o(iris)
}

train_model <- function(hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  h2o.randomForest(x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                   y = c("Species"),
                   training_frame = hex_data,
                   model_id = "our.rf",
                   seed = 1234)
}

predict_model <- function(model, hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  h2o.predict(model, newdata = hex_data)
}

list(
  tar_target(data, create_dataset_h2o()),
  tar_target(model, train_model(data), format = "qs"),
  tar_target(predict, predict_model(model, data), format = "qs")
)
This kind of works, but it runs into the two issues I mentioned above and describe below...
Usually I would put an h2o::h2o.shutdown(prompt = FALSE) at the end of my script, but that does not work in this case.
Alternatively, I came up with a new target that is always run.
# in _targets.R in the final list
tar_target(END, h2o.shutdown(prompt = FALSE), cue = tar_cue(mode = "always"))
This works when I run tar_make() but not when I use tar_visnetwork().
Another option is to use:
# after the h2o.init(...) call inside _targets.R
on.exit(h2o.shutdown(prompt = FALSE), add = TRUE)
Another alternative that I came up with is to handle the server outside of targets and only connect to it. But I feel that this might break the targets workflow...
Do you have any other idea how to handle this?
The code in the MWE does not properly save the data for the targets model and predict, despite format = "qs". Sometimes (I think when the cluster gets restarted), the stored data gets "invalidated" and h2o throws an error. The reason is that an h2o object in the R session is only a pointer to the corresponding data frame on the h2o cluster (see also docs).
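To illustrate what I mean by pointer (a small example; the printed id is just whatever key the frame happens to get on the cluster):
hex_data <- as.h2o(iris)
class(hex_data)      # "H2OFrame" -- an R-side handle, not the data itself
h2o.getId(hex_data)  # the key under which the frame lives on the h2o cluster
# once the cluster restarts, that key no longer exists on the server,
# so a target that reads the old handle back from the qs store errors out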
For keras, which similarly stores the models outside of R, there is the option format = "keras", which calls keras::save_model_hdf5() behind the scenes. Similarly, H2O would require h2o::h2o.exportFile() and h2o::h2o.importFile() for the dataset and h2o::h2o.saveModel() and h2o::h2o.loadModel() for models (see also docs).
Is there a way to create additional formats for tar_target(), or do I need to write the data to a file and return the file path? The downside of the latter is that the file would live outside of the _targets folder system, if I am not mistaken.
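For now, the best workaround I can think of looks something like this (a sketch; save_model_to_file() is a helper I would write myself, not an existing API):
# hypothetical helper: save the trained model to disk and return the file path
save_model_to_file <- function(model, dir = "h2o_models") {
  dir.create(dir, showWarnings = FALSE)
  # h2o.saveModel() writes the model under `dir` and returns the full path
  h2o.saveModel(model, path = dir, force = TRUE)
}

# in the final list in _targets.R: track the exported file instead of the pointer
tar_target(model_file, save_model_to_file(train_model(data)), format = "file")
# downstream targets would reload the model with h2o.loadModel(model_file)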
I would recommend handling the H2O cluster outside the pipeline in a separate script. That way, tar_visnetwork() would not start or stop the cluster, and you could more cleanly separate the software engineering from the data analysis.
# run_pipeline.R
start_h2o_cluster(port = ...)
on.exit(stop_h2o_cluster(port = ...))
targets::tar_make_clustermq(workers = 4)
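Here, start_h2o_cluster() and stop_h2o_cluster() are placeholders rather than functions from h2o; thin wrappers along these lines should do (untested sketch):
# possible implementations of the placeholder wrappers above
start_h2o_cluster <- function(port) {
  h2o::h2o.init(nthreads = 2, max_mem_size = "2G", port = port, name = "TESTCLUSTER")
}

stop_h2o_cluster <- function(port) {
  # connect to the running cluster, then shut it down without prompting
  h2o::h2o.init(ip = "localhost", port = port, startH2O = FALSE)
  h2o::h2o.shutdown(prompt = FALSE)
}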
It sounds like H2O objects are not exportable. Currently, you would need to save those files manually, identify the paths, and write format = "file" in tar_target(). I am willing to consider H2O-based formats. Are all objects in some way covered by h2o.exportFile(), h2o.importFile(), h2o::h2o.saveModel(), and h2o::h2o.loadModel(), or are there more kinds of objects with different serialization functions? And does h2o have utilities to perform this (un)serialization in memory like serialize_model()/unserialize_model() in keras?
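For reference, by in-memory (un)serialization I mean the kind of round trip keras exposes (fit here stands for any keras model object):
# keras can turn a model into a raw vector and back without touching disk
raw_model <- keras::serialize_model(fit)          # model object -> raw vector
restored  <- keras::unserialize_model(raw_model)  # raw vector -> model object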