
R: running a mirai_cluster via parallelly::makeClusterPSOCK


My goal is to perform some heavy computation in R on a cluster of several Linux hosts, using Docker containers. Within R I want to use foreach.

To do this, I believe the best approach is the futureverse. Since I want the system to be able to use more than 125 cores, I want to use the future.mirai backend. I also want fine-grained control over how the cluster is set up, e.g. to have it run in Docker containers. That's why I want to use makeClusterPSOCK (see Section 5 of the Examples).

This whole setup feels rather complicated to me, so I am starting with a very small example on my Windows 11 laptop (8 cores) and trying to bootstrap myself up from there. My current problem is that I cannot find any tutorial, example, or documentation on how to combine the different parts of the futureverse into a working whole.
Here is an MWE:

rm(list = ls())
.rs.restartR()

library(doFuture)
library(future.mirai)

########################

# plan(sequential)
# plan(multisession)

# plan(mirai_multisession)
# plan(mirai_cluster)

# plan(cluster, workers = c("n1", "n2", "n3"))

workers <- rep("localhost", parallelly::availableCores())
cl <- parallelly::makeClusterPSOCK(
  workers = workers,
  dryrun = FALSE,
  verbose = TRUE,
  autoStop = TRUE)

plan(cluster, workers = cl)

#plan(mirai_cluster, workers = cl)

# Error in MiraiFuture(expr = expr, substitute = FALSE, envir = envir, workers = NULL, : 
# formal argument “workers” matches several given arguments
# Additionally: warning message:
# Detected 1 unknown future arguments: 'workers' 

##############################

y <- foreach(x = 1:1000, .combine = c) %dofuture% {
  Sys.getpid()
}
table(y)

The idea here is to try out the different strategies (sequential, multisession, mirai_multisession) and check that the number of distinct PIDs is what I expect it to be. How can I overcome the 125-core limit in this scenario, i.e. how can I combine the future.mirai backend with parallelly::makeClusterPSOCK?


Solution

  • Author of Futureverse here.

    Using 'multisession'

    In R (>= 4.4.0) [2024-04-24], you can increase the number of connections that R can handle by passing the command-line option --max-connections=N when you launch R or Rscript, e.g.

    $ Rscript --max-connections=512 -e "parallelly::availableConnections()"
    [1] 512
    

    This is documented in https://parallelly.futureverse.org/reference/availableConnections.html. It means you can use multisession with, say, 148 parallel workers (one connection per worker) plus some spare connections for other purposes:

    $ R --max-connections=192
    ...
    > library(future)
    > plan(multisession, workers = 148)
    > nbrOfWorkers()
    [1] 148
    
    > library(doFuture)
    > pids <- foreach(ii = seq_len(nbrOfWorkers())) %dofuture% Sys.getpid()
    > length(unique(pids))
    [1] 148
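
    Before choosing a worker count, it can help to check how many connections the current session actually has to spare. The following is only a sketch: the target of 148 workers is taken from the example above, the variable names are made up for illustration, and the check assumes one connection per worker as described above.

    ```r
    library(parallelly)

    # Hypothetical target: the number of workers we would like to run
    wanted_workers <- 148L

    # Total number of connections this R session supports, and how many are still free
    total <- availableConnections()
    free  <- freeConnections()

    # Assumption from the text above: each multisession worker needs one connection
    fits <- wanted_workers <= free

    if (fits) {
      message("OK: ", wanted_workers, " workers fit into ", free, " free connections")
    } else {
      message("Too few connections (", free, " free of ", total,
              "); restart R with a larger --max-connections value")
    }
    ```

    If the check fails, restarting R with a larger --max-connections value (R >= 4.4.0) is the remedy described above.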
    

    Using 'future.callr::callr'

    The future.callr package does not rely on R connections, so the connection limit does not apply and you can use it as:

    > library(future)
    > plan(future.callr::callr, workers = 148)
    > nbrOfWorkers()
    [1] 148
    

    Using 'future.mirai::mirai_multisession'

    The future.mirai package does not rely on R connections either, so you can use it as:

    > library(future)
    > plan(future.mirai::mirai_multisession, workers = 148)
    > nbrOfWorkers()
    [1] 148
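
    To repeat the PID check from the question with this backend, something along the following lines should work. It is a minimal sketch for a laptop test: the worker count of 4 and the 100 iterations are arbitrary illustrative values, not recommendations.

    ```r
    library(future.mirai)
    library(doFuture)

    # Hypothetical small worker count for a local test
    n_workers <- 4L
    plan(mirai_multisession, workers = n_workers)

    # Collect the process ID of whichever worker evaluates each iteration
    pids <- foreach(ii = seq_len(100), .combine = c) %dofuture% Sys.getpid()

    # Number of distinct worker processes that were actually used
    n_pids <- length(unique(pids))
    print(table(pids))

    # Return to sequential processing, which also shuts the workers down
    plan(sequential)
    ```

    The number of distinct PIDs can be smaller than the number of workers when tasks finish quickly, so treat it as an upper bound check rather than an exact match.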
    

    ... how can I combine the future.mirai backend with parallelly::makeClusterPSOCK?

    You cannot; the future.mirai backend is separate from PSOCK clusters of parallelly/parallel.
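
    This is also why plan(mirai_cluster, workers = cl) fails in the question: as far as I understand the future.mirai documentation, mirai_cluster takes no workers argument and instead uses whatever daemons were set up beforehand with mirai::daemons(). A minimal local sketch under that assumption (two daemons and 20 iterations are arbitrary illustrative values):

    ```r
    library(future.mirai)
    library(doFuture)

    # Launch two local mirai daemons; the cluster is defined here, not in plan()
    mirai::daemons(2)

    # Assumption: mirai_cluster picks up the daemons configured above,
    # so no 'workers' argument is passed
    plan(mirai_cluster)

    # Collect the process ID of whichever daemon evaluates each iteration
    pids <- foreach(ii = 1:20, .combine = c) %dofuture% Sys.getpid()
    n_pids <- length(unique(pids))
    print(table(pids))

    # Clean up: go back to sequential processing and stop the daemons
    plan(sequential)
    mirai::daemons(0)
    ```

    Daemons on other hosts or in containers would likewise be configured through mirai itself rather than through makeClusterPSOCK; see the mirai documentation for the remote launch options.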

    PS. Questions specific to Futureverse are probably better asked on https://github.com/HenrikBengtsson/future/discussions/