My goal is to perform some heavy computation in R on a cluster of several Linux hosts, using Docker containers. Within R I want to use foreach.
To do this, I believe the best approach is the futureverse. Since I want the system to be able to use more than 125 cores, I want to use the future.mirai backend. I also want fine-grained control over the way the cluster is set up, e.g. to have it run in Docker containers. That's why I want to use makeClusterPSOCK (see Section 5 of the Examples).
This whole setup feels rather complicated to me, so I start with a very small example on my Windows 11 laptop (8 cores) and try to bootstrap myself up from there.
My current problem is that I cannot find any tutorial, example, or documentation showing how to combine the different parts of the futureverse into a working whole.
Here is an MWE:
rm(list = ls())
.rs.restartR()
library(doFuture)
library(future.mirai)
########################
# plan(sequential)
# plan(multisession)
# plan(mirai_multisession)
# plan(mirai_cluster)
# plan(cluster, workers = c("n1", "n2", "n3"))
workers <- rep("localhost", parallelly::availableCores())
cl <- parallelly::makeClusterPSOCK(
  workers = workers,
  dryrun = FALSE,
  verbose = TRUE,
  autoStop = TRUE)
plan(cluster, workers = cl)
#plan(mirai_cluster, workers = cl)
# Error in MiraiFuture(expr = expr, substitute = FALSE, envir = envir, workers = NULL, :
# formal argument “workers” matches several given arguments
# Additionally: warning message:
# Detected 1 unknown future arguments: 'workers'
##############################
y <- foreach(x = 1:1000, .combine = c) %dofuture% {
Sys.getpid()
}
table(y)
The idea here is to try out the different strategies (sequential, multisession, mirai_multisession) and check that the number of pids is what I expect it to be.
How can I overcome the 125 core limit in this scenario (how can I combine the future.mirai backend with parallelly::makeClusterPSOCK)?
Author of Futureverse here.
In R (>= 4.4.0) [2024-04-24], you can increase the number of connections that R can handle by specifying the option --max-connections=N when you launch R or Rscript, e.g.
$ Rscript --max-connections=512 -e "parallelly::availableConnections()"
[1] 512
This is documented in https://parallelly.futureverse.org/reference/availableConnections.html. This means you can use multisession with, say, 148 parallel workers (one connection per worker) plus some spare connections for other purposes by using:
$ R --max-connections=192
...
> library(future)
> plan(multisession, workers = 148)
> nbrOfWorkers()
[1] 148
> library(doFuture)
> pids <- foreach(ii = seq_len(nbrOfWorkers())) %dofuture% Sys.getpid()
> length(unique(pids))
[1] 148
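Before requesting that many workers, it can help to confirm that the session really has enough connections. The following is a minimal sketch, assuming one connection per worker as described above; the helper name `has_connection_budget` and the default margin of 16 spare connections are illustrative choices, not part of parallelly.

```r
library(parallelly)

# Hypothetical helper (not part of parallelly): TRUE if the session has
# enough free connections for `n_workers` workers plus `spare` extras.
has_connection_budget <- function(n_workers, spare = 16L) {
  (n_workers + spare) <= freeConnections()
}

availableConnections()        # upper limit for this R session
freeConnections()             # how many of those are still unused
has_connection_budget(148L)   # FALSE unless R was started with a raised limit
```

If the check fails, restart R with a larger --max-connections value rather than lowering the margin.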
The future.callr package does not rely on connections, so you can use that as:
> library(future)
> plan(future.callr::callr, workers = 148)
> nbrOfWorkers()
[1] 148
The future.mirai package does not rely on connections either, so you can use that as:
> library(future)
> plan(future.mirai::mirai_multisession, workers = 148)
> nbrOfWorkers()
[1] 148
... how can I combine the future.mirai backend with parallelly::makeClusterPSOCK?
You cannot; the future.mirai backend is separate from PSOCK clusters of parallelly/parallel.
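If the aim is control over how the workers are started, the mirai-side equivalent is to set up the daemons yourself and then point the future plan at them. The sketch below assumes that `plan(mirai_cluster)` takes no `workers` argument and uses whatever daemons were configured beforehand with `mirai::daemons()`; the count of 4 local daemons is an arbitrary illustration. It only covers local daemons; launching daemons on remote hosts or inside Docker containers is configured through mirai itself and is not shown here.

```r
library(future.mirai)
library(doFuture)

# Start four local mirai daemons (illustrative count; adjust as needed).
mirai::daemons(4)

# Use the daemons configured above; note that no 'workers' argument is given.
plan(mirai_cluster)

pids <- foreach(ii = 1:100, .combine = c) %dofuture% Sys.getpid()
table(pids)

# Clean up: return to sequential processing and shut the daemons down.
plan(sequential)
mirai::daemons(0)
```

This would also explain the error in the question: the cluster object was passed as `workers`, which the mirai backend does not accept.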
PS. Questions specific to Futureverse are probably better asked on https://github.com/HenrikBengtsson/future/discussions/