
How to use imported names from box modules inside parallel code?


Here is a minimal example showing the issue:

mod.r:

#' @export
run_sqrt <- function (x) {
  sqrt(x)
}

mwe.r:

box::use(
  ./mod[...],
  parallel,
  dp = doParallel,
  foreach[foreach, `%dopar%`],
)

cl <- parallel$makeCluster(2L)
dp$registerDoParallel(cl)

foreach(i = 1 : 5) %dopar% {
  run_sqrt(i)
}

parallel$stopCluster(cl)

This raises the error:

Error in { : task 1 failed - "could not find function "run_sqrt""

I found this

parallel::clusterExport(cluster, setdiff(ls(), "cluster"))

in the question "How to use `foreach` and `%dopar%` with an `R6` class in R?", but it didn't work.


Solution

  • As you found, this is a limitation of the ‘parallel’ package: it only knows about names defined in the current environment.

    There are several solutions for this. The following list is roughly in order of (my personal) preference, from most preferred to least preferred.

    1. Use explicitly qualified module access instead of attaching. So:

      1. Change ./mod[...] to ./mod inside box::use()

      2. Fully qualify the name inside foreach:

        foreach(i = 1 : 5) %dopar% {
          mod$run_sqrt(i)
        }
        

      Due to how parallel searches names, this will only work if the above code is executed in the global environment.

    2. Import ./mod inside the foreach body instead of at the beginning of your script. However, note that there is currently an open bug regarding this solution.

    3. Use parallel::clusterExport; this solution works if the correct names are provided, in this case run_sqrt. To make the minimal example work, add the following line before the foreach call:

      parallel$clusterExport(cl, "run_sqrt", envir = environment())
      

      Your version didn’t work because ls() won’t list run_sqrt: since the name is attached, it does not exist in the local scope. The same issue would exist with attached packages instead of modules. Furthermore, for reasons I do not understand, clusterExport by default searches for names in the global environment only; you need to explicitly provide the current environment via envir = environment().
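To illustrate the two points about ls() and envir, here is a minimal sketch that uses only base R and the ‘parallel’ package. It does not use ‘box’ at all: attach() serves as a rough stand-in for an attached module, which is an assumption made for illustration and not a description of how ‘box’ attaches names internally. The names demo_mod, offset and export_and_call are hypothetical.

```r
library(parallel)

# Stand-in for an attached module (assumption: this only mimics attaching).
# run_sqrt becomes reachable via the search path, but it is not defined in
# the environment this script runs in.
attach(list(run_sqrt = function(x) sqrt(x)), name = "demo_mod")

visible <- exists("run_sqrt")     # TRUE: lookup follows the search path
listed  <- "run_sqrt" %in% ls()   # FALSE: ls() lists the current environment only

export_and_call <- function() {
  offset <- 1                     # exists only inside this function call
  cl <- makeCluster(2L)
  on.exit(stopCluster(cl))
  # Name the variables explicitly and start the lookup in the current
  # environment; the default (.GlobalEnv) would not find `offset`.
  clusterExport(cl, c("run_sqrt", "offset"), envir = environment())
  # Evaluate on every worker, using only what was exported.
  clusterEvalQ(cl, run_sqrt(16) + offset)
}

result <- export_and_call()       # one value per worker
detach("demo_mod")
```

Because the names are passed to clusterExport explicitly, it does not matter that ls() cannot see run_sqrt; and because envir = environment() is given, the lookup starts in the function’s own scope before falling back to the enclosing environments.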