Tags: r, h2o, targets-r-package

R targets with H2O


I use targets as a pipelining tool for an ML project with H2O. What makes H2O special here is that it spins up a new "cluster" (essentially a new local process/server that communicates via REST APIs, as far as I understand).

The issue I am having is twofold:

  1. How can I stop/operate the cluster within the targets framework in a smart way?
  2. How can I save & load the data/models within the targets framework?

MWE

A minimal working example I came up with looks like this (this is the _targets.R file):

library(targets)
library(h2o)

# start the h2o cluster once _targets.R gets evaluated
h2o.init(nthreads = 2, max_mem_size = "2G", port = 54322, name = "TESTCLUSTER")

create_dataset_h2o <- function() {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  # convert the data to h2o dataframe
  as.h2o(iris)
}
train_model <- function(hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)

  h2o.randomForest(x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                   y = c("Species"),
                   training_frame = hex_data,
                   model_id = "our.rf",
                   seed = 1234)
}
predict_model <- function(model, hex_data) {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  h2o.predict(model, newdata = hex_data)
}

list(
  tar_target(data, create_dataset_h2o()),
  tar_target(model, train_model(data), format = "qs"),
  tar_target(predict, predict_model(model, data), format = "qs")
)
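
For completeness, the pipeline is then run as usual with:

targets::tar_make()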

This mostly works, but it faces the two issues I outlined above and describe in more detail below...

Ad 1 - stopping the cluster

Usually I would put an h2o::h2o.shutdown(prompt = FALSE) at the end of my script, but this does not work in this case. As an alternative, I came up with a new target that always runs.

# in _targets.R in the final list
  tar_target(END, h2o.shutdown(prompt = FALSE), cue = tar_cue(mode = "always"))

This works when I run tar_make() but not when I use tar_visnetwork().

Another option is to use:

# after the h2o.init(...) call inside _targets.R
on.exit(h2o.shutdown(prompt = FALSE), add = TRUE)

Another alternative I came up with is to handle the server outside of targets and only connect to it from within the pipeline, but I feel that this might break the targets workflow...

Do you have any other idea how to handle this?

Ad 2 - saving the dataset and model

The code in the MWE does not persist the data for the targets model and predict correctly (format = "qs" stores only the R-side object). Sometimes (I think when the cluster gets restarted), the data gets "invalidated" and h2o throws an error. This is because an h2o object in the R session is only a pointer to the data frame that lives on the h2o cluster (see also the docs).

For keras, which likewise stores its models outside of R, there is the option format = "keras", which calls keras::save_model_hdf5() behind the scenes. H2O would analogously require h2o::h2o.exportFile() and h2o::h2o.importFile() for datasets and h2o::h2o.saveModel() and h2o::h2o.loadModel() for models (see also the docs).

Is there a way to create additional formats for tar_target(), or do I need to write the data to a file and return the file path, as in the sketch below? The downside of the latter is that the file lives outside the _targets folder system, if I am not mistaken.
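
For concreteness, here is roughly what I have in mind for the dataset (the function name is my own invention):

create_dataset_file <- function(path = "iris.csv") {
  # connect to the h2o cluster
  h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
  # write the frame to disk so that targets can hash and track the file
  h2o.exportFile(as.h2o(iris), path, force = TRUE)
  path # a format = "file" target must return the file path
}

# in the target list:
# tar_target(data_file, create_dataset_file(), format = "file")
# downstream targets would then call h2o.importFile(data_file) themselves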


Solution

  • Ad 1

    I would recommend handling the H2O cluster outside the pipeline in a separate script. That way, tar_visnetwork() would not start or stop the cluster, and you could more cleanly separate the software engineering from the data analysis.

    # run_pipeline.R
    start_h2o_cluster(port = ...)
    on.exit(stop_h2o_cluster(port = ...))
    targets::tar_make_clustermq(workers = 4)
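
    The start_h2o_cluster() and stop_h2o_cluster() helpers above are hypothetical, not functions from targets or h2o. A minimal sketch wrapping h2o's own API could look like this:

    # hypothetical helpers; adjust resources and port to your setup
    start_h2o_cluster <- function(port) {
      h2o::h2o.init(nthreads = 2, max_mem_size = "2G", port = port, name = "TESTCLUSTER")
    }
    stop_h2o_cluster <- function(port) {
      h2o::h2o.init(ip = "localhost", port = port, startH2O = FALSE)
      h2o::h2o.shutdown(prompt = FALSE)
    }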
    

  • Ad 2

    It sounds like H2O objects are not exportable. Currently, you would need to save those files manually, identify the paths, and write format = "file" in tar_target(). I am willing to consider H2O-based formats. Are all objects in some way covered by h2o::h2o.exportFile(), h2o::h2o.importFile(), h2o::h2o.saveModel(), and h2o::h2o.loadModel(), or are there more kinds of objects with different serialization functions? And does h2o have utilities to perform this (un)serialization in memory, like serialize_model()/unserialize_model() in keras?
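
    As a rough sketch of the manual format = "file" approach for the model (the function names are made up for illustration, and data_file refers to the exported dataset target from the question):

    save_model_file <- function(data_path) {
      h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
      hex_data <- h2o.importFile(data_path) # re-import rather than reuse a stale pointer
      model <- h2o.randomForest(x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                                y = "Species",
                                training_frame = hex_data,
                                seed = 1234)
      h2o.saveModel(model, path = "models", force = TRUE) # returns the path to the saved model
    }

    predict_from_file <- function(model_path, data_path) {
      h2o.init(ip = "localhost", port = 54322, name = "TESTCLUSTER", startH2O = FALSE)
      model <- h2o.loadModel(model_path)
      hex_data <- h2o.importFile(data_path)
      as.data.frame(h2o.predict(model, newdata = hex_data)) # a plain data.frame is safe to store
    }

    # tar_target(model_file, save_model_file(data_file), format = "file"),
    # tar_target(predictions, predict_from_file(model_file, data_file))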