I am running simulations where some computing should be parallelized and some should not.
I am trying to figure out how to ensure reproducibility across purrr::map() and furrr::future_map() so that they yield the same result.
For some reason, I cannot use set.seed() inside the mapped function.
For instance, consider the following code:
library(purrr)
library(furrr)
#> Loading required package: future
set.seed(42)
rnorm(1)
#> [1] 1.370958
set.seed(42)
map(1, ~rnorm(1))
#> [[1]]
#> [1] 1.370958
set.seed(42)
future_map(1, ~rnorm(1), .options=furrr_options(seed=TRUE))
#> [[1]]
#> [1] -0.1691382
set.seed(42)
future_map(1, ~rnorm(1), .options=furrr_options(seed=42))
#> [[1]]
#> [1] -0.02648871
future_map(1, ~rnorm(1), .options=furrr_options(seed=list(42L)))
#> Error in `validate_seed_list()`:
#> ! All pre-generated random seed elements of a list `seed` must be valid `.Random.seed` seeds, which means they should be all integers and consists of two or more elements, not just one.
Created on 2023-02-21 with reprex v2.0.2
As you can see, I could not get the 1.37 value using furrr. Every call is reproducible, but they yield different results.
In my real code, each function will run 100-200 times, which is less than length(.Random.seed) (==626).
I thus thought setting the seed as a list could be a solution, but I don't really understand the documentation or the error message.
For reference, here is the help file that addresses random seed management: link
Is there a way to have purrr::map() and furrr::future_map() yield the same result?
EDIT: for reference, here is the related GitHub issue.
Author of futureverse here.
1. R uses RNGkind("Mersenne-Twister") by default. This type of random number generator (RNG) is valid only in sequential processing.
2. For parallel processing, we have to use an RNG that is designed for parallel processing. If not, we will not get statistically sound random numbers, and our results risk being biased. This is true for all parallel frameworks. R provides RNGkind("L'Ecuyer-CMRG") for parallel processing. Most parallel solutions rely on this, if they handle parallel RNG at all (some don't). There are alternative parallel RNG methods available in different CRAN packages.
3. Because of (1) and (2), it is impossible to (a) reproduce random numbers produced in standard sequential processing in R when (b) running in parallel. The only way to achieve it is to change the sequential processing to also use parallel RNG (e.g. RNGkind("L'Ecuyer-CMRG")). Unfortunately, it's not just a matter of changing the RNG-kind settings. One also has to update the implementation of the underlying algorithm (here purrr).
4. In contrast, the futureverse does this at the core (and makes it part of the design requirements). So, it uses RNGkind("L'Ecuyer-CMRG") everywhere, regardless of parallel backend ("plan") and number of parallel workers. Thus, in your case using furrr, you will get the exact same random numbers when you use plan(sequential) (default), plan(multicore), plan(multisession), plan(future.callr::callr), plan(future.batchtools::batchtools_slurm), etc.
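To make the idea of a parallel-safe RNG more concrete, here is a minimal sketch using only base R and the parallel package. It is not how furrr is implemented internally. It only illustrates the general technique of giving each element its own L'Ecuyer-CMRG stream, so that the result for an element does not depend on the order in which elements are processed, or on which worker processes them. The names seeds and draw_with are hypothetical and exist only for this illustration.

```r
# Switch to the parallel-safe RNG and derive one stream per element.
RNGkind("L'Ecuyer-CMRG")
set.seed(42)

n <- 3
seeds <- vector("list", n)
s <- .Random.seed                    # integer vector of length 7 under L'Ecuyer-CMRG
for (i in seq_len(n)) {
  seeds[[i]] <- s
  s <- parallel::nextRNGStream(s)    # jump to the start of the next independent stream
}

# Hypothetical helper: install a pre-generated seed, then draw one number.
draw_with <- function(seed) {
  assign(".Random.seed", seed, envir = globalenv())
  rnorm(1)
}

# Process the elements in two different orders.
forward  <- vapply(seeds, draw_with, numeric(1))
backward <- rev(vapply(rev(seeds), draw_with, numeric(1)))

# Each element is tied to its own stream, so the order does not matter.
same <- identical(forward, backward)
```

Each element of seeds is a complete .Random.seed vector rather than a single integer, which is consistent with the error message above rejecting list(42L). Whether such a list can be passed directly as seed is an assumption to check against the furrr_options() documentation.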
So, in summary:
You have to accept that:
library(purrr)
set.seed(42)
map(1:10, ~rnorm(1))
and
library(furrr)
set.seed(42)
future_map(1:10, ~rnorm(1), .options=furrr_options(seed = TRUE))
will produce different sequences of random numbers, but both are still statistically sound. Once you accept that, it is nice to know that regardless of which plan() you set, you'll get identical random-number sequences, e.g.
plan(sequential)
set.seed(42)
future_map(1:10, ~rnorm(1), .options=furrr_options(seed = TRUE))
gives the same results as:
plan(multisession)
set.seed(42)
future_map(1:10, ~rnorm(1), .options=furrr_options(seed = TRUE))
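If you want to check this on your own machine, a small sketch along the following lines should do. The helper run_once is hypothetical, and future_map_dbl() is used only so that the two results are plain numeric vectors that are easy to compare. Running it requires the furrr package, and the multisession plan starts background R sessions.

```r
library(furrr)   # also attaches 'future', which provides plan()

# Hypothetical helper: same seed, same call, whatever plan is active.
run_once <- function() {
  set.seed(42)
  future_map_dbl(1:10, ~ rnorm(1), .options = furrr_options(seed = TRUE))
}

plan(sequential)
res_sequential <- run_once()

plan(multisession, workers = 2)
res_multisession <- run_once()

plan(sequential)  # switch back so the background workers are shut down

# Compare the two runs element by element.
same_across_plans <- identical(res_sequential, res_multisession)
```

For the original use case, this suggests a practical workaround: use future_map() with plan(sequential) for the parts that should not be parallelized, instead of purrr::map(), so that both parts draw from the same kind of RNG streams.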