r parallel-processing

R parallel inside of function


I regularly make use of the packages parallel and pbapply. However, I have come across some odd behavior that I assume is by design, but I can't figure out how to work around it. Basically, if I use a cluster within a function, the function's entire environment gets exported to each worker regardless of whether I specify an export. Below is a trivial, meaningless example, but it illustrates the point. Inside of the function I create a matrix that is 800 MB in size. I never ask for that to be exported, yet when the workers start going they all immediately expand to 800 MB. Is there some way to stop this implicit export of x from happening?

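library(parallel)
library(pbapply)

f = function()
{
    # 10000 x 10000 doubles, roughly 800 MB; never explicitly exported
    x = matrix(runif(10000*10000), nrow = 10000)
    
    cl = makeCluster(10)
    on.exit(stopCluster(cl))  # stop the workers even if an error occurs
    
    # the worker function is defined here, inside f()
    ans = pbsapply(1:1000, function(i){
        w = matrix(runif(1000*1000), nrow = 1000)
        return(sum(w))
    }, cl = cl)
    
    return(ans)
}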

Solution

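  • Depending on the cluster type (see snow::getClusterOption("type")), makeCluster calls either a socket constructor (makeSOCKcluster in snow, makePSOCKcluster in parallel) or a fork constructor (makeForkCluster in parallel). With a 'FORK' cluster the workers are forks of the current R process, so everything in the current environment is visible to them. Either restructure the code so that such large objects are not in the environment, or create a socket cluster and export only what the workers need with clusterExport.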
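The socket-cluster route can be sketched as below. The sketch also illustrates a mechanism the answer does not mention, and which may be what the question is seeing: when R serializes a function for a socket worker, it serializes the function's enclosing environment as well, unless that environment is the global environment. A worker function defined inside f() therefore carries x with it. The names payload_size, make_worker, multiplier and f_sketch are hypothetical, chosen only for illustration, and an 8 MB vector stands in for the 800 MB matrix.

```r
library(parallel)

# Bytes that would be sent to a socket worker for this function
payload_size <- function(fun) length(serialize(fun, NULL))

make_worker <- function(strip_env) {
    x <- runif(1e6)  # about 8 MB; stands in for the large matrix
    worker <- function(i) sum(runif(10))
    if (strip_env) {
        # Re-home the closure so that x is no longer reachable from it
        environment(worker) <- globalenv()
    }
    worker
}

heavy <- make_worker(strip_env = FALSE)  # x travels with the closure
light <- make_worker(strip_env = TRUE)   # global env is sent by reference only

print(payload_size(heavy))
print(payload_size(light))

# Same idea with an explicit socket cluster and clusterExport
f_sketch <- function() {
    x <- runif(1e6)    # large object that the workers do not need
    multiplier <- 2    # small object that the workers do need

    cl <- makePSOCKcluster(2)
    on.exit(stopCluster(cl))

    # Copy only `multiplier` from this function's environment to each worker
    clusterExport(cl, "multiplier", envir = environment())

    # On the worker, `multiplier` is found in the worker's global environment
    worker <- function(i) multiplier * i
    environment(worker) <- globalenv()

    parSapply(cl, 1:4, worker)
}

res <- f_sketch()
print(res)
```

In the question's f(), the same approach would mean either defining the worker function at top level or resetting its environment before passing it to pbsapply. Another option under the same assumption is to call rm(x) before starting the parallel step, if x is no longer needed.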