
Turn on all CPUs for all nodes on a cluster: snow/snowfall package


I am working on a cluster and am using the snowfall package to establish a socket cluster on 5 nodes with 40 CPUs each with the following command:

 > sfInit(parallel=TRUE, cpus = 200, type="SOCK", socketHosts=c("host1", "host2", "host3", "host4", "host5"));
 R Version:  R version 3.1.0 (2014-04-10) 

 snowfall 1.84-6 initialized (using snow 0.3-13): parallel execution on 5 CPUs.

I am seeing a much lower load on the slaves than expected when I check the cluster report and was disconcerted by the fact that it says "parallel execution on 5 CPUs" instead of "parallel execution on 200 CPUs". Is this merely an ambiguous reference to CPUs or are the hosts only running one CPU each?

EDIT: Here is an example of why this concerns me. If I use only the local machine and specify the maximum number of cores, I get:

 > sfInit(parallel=TRUE, type="SOCK", cpus = 40);
 snowfall 1.84-6 initialized (using snow 0.3-13): parallel execution on 40 CPUs.

I ran an identical job on the single node, 40 CPU cluster and it took 1.4 minutes while the 5 node, apparently 5 CPU cluster took 5.22 minutes. To me this confirms my suspicions that I am running with parallelism on 5 nodes but am only turning on 1 of the CPUs on each node.

My question is then: how do you turn on all CPUs for use across all available nodes?

EDIT: @SimonG I used the underlying snow package's initialization, and we can clearly see that only 5 workers are started, one per host:

 > cl <- makeSOCKcluster(names = c("host1", "host2", "host3", "host4", "host5"), count = 200)
 > clusterCall(cl, runif, 3)
 [[1]]
 [1] 0.9854311 0.5737885 0.8495582

 [[2]]
 [1] 0.7272693 0.3157248 0.6341732

 [[3]]
 [1] 0.26411931 0.36189866 0.05373248

 [[4]]
 [1] 0.3400387 0.7014877 0.6894910

 [[5]]
 [1] 0.2922941 0.6772769 0.7429913

 > stopCluster(cl)
 > cl <- makeSOCKcluster(names = rep("localhost", 40), count = 40)
 > clusterCall(cl, runif, 3)
 [[1]]
 [1] 0.6914666 0.7273244 0.8925275

 [[2]]
 [1] 0.3844729 0.7743824 0.5392220

 [[3]]
 [1] 0.2989990 0.7256851 0.6390770     

 [[4]]
 [1] 0.07114831 0.74290601 0.57995908

 [[5]]
 [1] 0.4813375 0.2626619 0.5164171

 .
 .
 .

 [[39]]
 [1] 0.7912749 0.8831164 0.1374560

 [[40]]
 [1] 0.2738782 0.4100779 0.0310864

I think this shows it pretty clearly. I tried this in desperation:

 > cl <- makeSOCKcluster(names = rep(c("host1", "host2", "host3", "host4", "host5"), each = 40), count = 200)

and predictably got:

 Error in socketConnection(port = port, server = TRUE, blocking = TRUE,  : 
   all connections are in use
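
The error is consistent with R's limit on open connections. As a rough sketch of the arithmetic, assuming a stock R build that allows 128 connections in total with three of them taken by stdin, stdout and stderr (the variable names below are illustrative only):

```r
# Assumption: a stock R build allows 128 connections in total, and three of
# them are already used by stdin, stdout and stderr.
max_connections <- 128L
reserved <- 3L
n_hosts <- 5L
requested <- 200L

available <- max_connections - reserved   # connections left for workers
per_host <- available %/% n_hosts         # whole workers per host

cat("workers available:", available, "\n")
cat("max workers per host:", per_host, "\n")
cat("request for", requested, "workers fits:", requested <= available, "\n")
```

Under that assumption, a socket cluster started from a single master R session cannot reach 200 workers, because each worker holds one connection open on the master.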

Solution

  • After thoroughly reading the snow documentation, I have come up with a (partial) solution.

    I read that the standard R build can hold only 128 connections open at once, and I have found that to be true in practice: I can start 25 workers on each node (125 in total), but the cluster will not start if I try 26 on each (130). Here is the proper structure of the host list that needs to be passed to makeCluster:

    > library(snow);
    
    > unixHost13 <- list(host = "host1");
    > unixHost14 <- list(host = "host2");
    > unixHost19 <- list(host = "host3");
    > unixHost29 <- list(host = "host4");
    > unixHost30 <- list(host = "host5");
    
    > kCPUs <- 25;
    > hostList <- c(rep(list(unixHost13), kCPUs), rep(list(unixHost14), kCPUs),
    +               rep(list(unixHost19), kCPUs), rep(list(unixHost29), kCPUs),
    +               rep(list(unixHost30), kCPUs));
    > cl <- makeCluster(hostList, type = "SOCK")
    > clusterCall(cl, runif, 3)
    [[1]]
    [1] 0.08430941 0.64479036 0.90402362
    
    [[2]]
    [1] 0.1821656 0.7689981 0.2001639
    
    [[3]]
    [1] 0.5917363 0.4461787 0.8000013
    .
    .
    .
    [[123]]
    [1] 0.6495153 0.6533647 0.2636664
    
    [[124]]
    [1] 0.75175580 0.09854553 0.66568129
    
    [[125]]
    [1] 0.79336203 0.61924813 0.09473841
    

    I found a reference saying that, to raise the connection limit, R needs to be rebuilt with NCONNECTIONS set higher (see here).
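
The host list above can also be built programmatically. This is a minimal sketch, not part of snow: `build_host_list` and its `max_workers` argument are hypothetical names, and the default of 125 simply reflects the limit observed above on an unmodified R build.

```r
# Hypothetical helper: build the host list that snow's makeCluster() expects,
# with `per_host` worker entries for each host name.
build_host_list <- function(hosts, per_host, max_workers = 125L) {
  total <- length(hosts) * per_host
  if (total > max_workers) {
    stop("requested ", total, " workers, but the assumed limit is ", max_workers)
  }
  # One list(host = ...) entry per worker, grouped by host.
  lapply(rep(hosts, each = per_host), function(h) list(host = h))
}

hosts <- c("host1", "host2", "host3", "host4", "host5")
hostList <- build_host_list(hosts, per_host = 25L)
length(hostList)

# Starting the cluster needs the snow package and reachable hosts, so it is
# left commented out here. The table() call counts workers per node name.
# library(snow)
# cl <- makeCluster(hostList, type = "SOCK")
# table(unlist(clusterCall(cl, function() Sys.info()[["nodename"]])))
# stopCluster(cl)
```

Tabulating the node name reported by each worker is a more direct check than reading through the `runif` output, since it shows how many workers actually started on each host.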