rparallel-processingrandom-seedreproducible-researchmclapply

Parallel processing in R - setting seed with mclapply() vs. pbmclapply()


I'm parallelizing simulations in R (using mclapply() from the parallel package) and wanted to track my progress with each function call. So I instead decided to use pbmclapply() from the pbmcapply package in order to have a progress bar each time I run my simulations (pbmclapply() is specifically created as a wrapper for mclapply(), so they should have the same functionality except for the progress bar).

I was able to set a seed and get reproducible results without a problem using mclapply(), but pbmclapply() is giving me different results with each run, which I'm perplexed by. I've included a pretty simple reprex below.

For example, this is using mcapply():

## GIVES THE SAME RESULT EACH TIME IT IS RUN
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- mclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))

And this is the same code using pbmclapply():

## GIVES DIFFERENT RESULTS EACH TIME IT IS RUN
library(pbmcapply)
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
x <- pbmclapply(1:100, function(i) {rnorm(1)}, mc.cores = 2)
y <- do.call(rbind, x)
z <- mean(y)
print(mean(z))

The only difference between the two blocks of code above is the use of pbmclapply() in the second and mclapply() in the first, yet the first block gives me a consistent result every time I run it, and the second block gives different results each time it is run (though a seed is set in the same way).

What is the difference in the seeding procedure between these two functions? I would appreciate any feedback as to why this is happening. Thanks!


Solution

  • The issue is that in the utils.R file within the pbmcapply package it runs the following line:

    if (isTRUE(mc.set.seed))
          mc.set.stream()
    

    If we compare this to what is being called when we run the mclapply() function in the parallel package we see that it runs:

    if (mc.set.seed) 
            mc.reset.stream()
    

    This affects the results as reset stream will allow the code to be run from the globally set seed, whereas running set stream sets it to the a new random starting value using the initial seed. We can see this in the functions attached below:

    mc.reset.stream <- function () 
    {
        if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
            if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) 
                sample.int(1L)
            # HERE! sets the seed to the global seed variable we set
            assign("LEcuyer.seed", get(".Random.seed", envir = .GlobalEnv, 
                inherits = FALSE), envir = RNGenv)
        }
    }
    
    mc.set.stream <- function () 
    {
        if (RNGkind()[1L] == "L'Ecuyer-CMRG") {
            assign(".Random.seed", get("LEcuyer.seed", envir = RNGenv), 
                envir = .GlobalEnv)
        }
        else {
            if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) 
                rm(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
        }
    }
    

    I believe this change may be due to an issue with mclapply when you want to call the mclapply function more than once after setting the seed it will use the same random numbers. (i.e. by resetting the r session you should get the same results in the same order with pbmclapply so first time I get 0.143 then 0.064 and then -0.015). This is usually the preferred behaviour so you can call the funciton multiple times. See R doesn't reset the seed when "L'Ecuyer-CMRG" RNG is used? for more information.

    The differences between these two implementations can be tested with the following code if you change the line in the .customized_mcparallel funciton definition from mc.set.stream() to mc.reset.stream(). Here I have simplified the function calls in the package to strip out the progress bar and leave in only the calculation (removing error checks also) and the change in setting the random seed. (Additionally note these functions will no longer run on a Windows machine only Linux or MacOS).

    library(pbmcapply)
    RNGkind("L'Ecuyer-CMRG")
    set.seed(1)
    pbmclapply <- function()  {
    
      pkg <- asNamespace('pbmcapply')
      .cleanup <- get('.cleanup', pkg)
    
    
      progressMonitor <- .customized_mcparallel({
    
        mclapply(1:100, function(i) {
                rnorm(1)
            }, mc.cores = 2, mc.preschedule = TRUE, mc.set.seed = TRUE,
                           mc.cleanup = TRUE, mc.allow.recursive = TRUE)
      })
    
      # clean up processes on exit
      on.exit(.cleanup(progressMonitor$pid), add = T)
    
      # Retrieve the result
      results <- suppressWarnings(mccollect(progressMonitor$pid)[[as.character(progressMonitor$pid)]])
    
      return(results)
    }
    
    .customized_mcparallel <- function (expr, name, detached = FALSE){
      # loading hidden functions
      pkg <- asNamespace('parallel')
      mcfork <- get('mcfork', pkg)
      mc.advance.stream <- get('mc.advance.stream', pkg)
      mcexit <- get('mcexit', pkg)
      mcinteractive <- get('mcinteractive', pkg)
      sendMaster <- get('sendMaster', pkg)
      mc.set.stream <- get('mc.set.stream', pkg)
      mc.reset.stream <- get('mc.reset.stream', pkg)
    
      f <- mcfork(F)
      env <- parent.frame()
      mc.advance.stream()
      if (inherits(f, "masterProcess")) {
    
        mc.set.stream()
        # reset the group process id of the forked process
        mcinteractive(FALSE)
    
        sendMaster(try(eval(expr, env), silent = TRUE))
        mcexit(0L)
      }
    
      f
    }
    
    x <- pbmclapply()
    y <- do.call(rbind, x)
    z <- mean(y)
    print(z)
    
    

    For a complete remedy my best suggestion would be to either reimplement the functions in your own code (I copy pasted with some minor modifications to the functions from pbmcapply) or by forking the package and replacing the mc.set.seed in the utils.R file with mc.reset.seed. I can't think of a simpler solution at the moment, but hopefully this clarifies the issue.