Tags: r, parallel-processing

Error in parallel processing: port cannot be opened


I am running several R scripts in batch mode at the same time on a Linux cluster to estimate the same model on different data sets (the same thing happens when I run them on a Mac). The scripts are identical except for the data set they use. When I do this, I get the following message:

Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, : 
cannot open the connection
Calls: makePSOCKcluster -> newPSOCKnode -> socketConnection
In addition: Warning message:
In socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
port 11426 cannot be opened

Here is a reproducible example. Create three files, tmp1.R, tmp2.R, and tmp.sh, with the following contents.

Content of the files tmp1.R and tmp2.R:

library(parallel)  # provides makePSOCKcluster() and parLapply()
l <- list(1:100,1:100,1:100,1:100)
cl <- makePSOCKcluster(4)
parLapply(cl, X=l, fun=function(x) {Sys.sleep(2); sum(x); })
stopCluster(cl)

Content of the tmp.sh file:

#!/bin/sh
R CMD BATCH tmp1.R &
R CMD BATCH tmp2.R &

The first script will run to completion; the second will fail with the error above. Does anyone know how to solve this and still run all the scripts at once, automatically, without any manual intervention?

PS: I have read all the similar questions; none of them has a reproducible example or an answer to the question above.
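
For reference, `makePSOCKcluster()` accepts a `port` argument, so one workaround (a sketch, assuming you can assign each batch job its own port number) is to give each script a non-overlapping port:

```r
library(parallel)

# tmp1.R could use port 11001, tmp2.R port 11002, and so on
cl <- makePSOCKcluster(4, port = 11001)
parLapply(cl, list(1:100, 1:100, 1:100, 1:100),
          function(x) { Sys.sleep(2); sum(x) })
stopCluster(cl)
```

Recent versions of R also consult the `R_PARALLEL_PORT` environment variable when choosing a default port, so setting that differently for each batch job is another option.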


Solution

  • You don't need to start multiple clusters to run the same code on multiple data sets. Just send the correct data to each node.

    # make 4 distinct datasets
    df1 <- mtcars[1:8,]
    df2 <- mtcars[9:16,]
    df3 <- mtcars[17:24,]
    df4 <- mtcars[25:32,]
    
    # make the cluster (makeCluster() is in the parallel package)
    library(parallel)
    cl <- makeCluster(4)
    
    clusterApply(cl, list(df1, df2, df3, df4), function(df) {
        # do stuff with df
        # each node will use a different subset of data
        lm(mpg ~ disp + wt, df)
    })
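
    `clusterApply()` returns the workers' results as a list in the same order as the inputs, so the fitted models come back ready to inspect. A self-contained sketch:

```r
library(parallel)

datasets <- split(mtcars, rep(1:4, each = 8))  # four 8-row chunks
cl <- makeCluster(4)
fits <- clusterApply(cl, datasets, function(df) lm(mpg ~ disp + wt, df))
stopCluster(cl)

# one coefficient vector per data set
lapply(fits, coef)
```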
    

    If you want the data to be persistent on each node, so you can use it for subsequent analyses:

    clusterApply(cl, list(df1, df2, df3, df4), function(df) {
        assign("df", df, globalenv())
        NULL
    })
    

    This creates a data frame named df in the global environment of each node, and each node's copy is unique to that node.
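
    Once each node has its own df, subsequent calls can refer to it by name, for example with `clusterEvalQ()`. A self-contained sketch of that pattern (reusing the mtcars split for illustration):

```r
library(parallel)

cl <- makeCluster(4)
chunks <- split(mtcars, rep(1:4, each = 8))

# store one chunk persistently on each node
clusterApply(cl, chunks, function(df) { assign("df", df, globalenv()); NULL })

# later analyses can now refer to each node's private df by name
models <- clusterEvalQ(cl, lm(mpg ~ disp + wt, df))
rows   <- clusterEvalQ(cl, nrow(df))   # each node reports 8

# shut the cluster down when finished
stopCluster(cl)
```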