I have recently been trying to use my department's cluster for parallel computing in R. The cluster is managed by SGE. Open MPI has been installed and has passed the installation test.
I submit my job to the cluster with the qsub command. In the job script, I request the number of nodes I want with the following directive:
#PBS -l nodes=2:ppn=24
(two nodes with 24 threads each)
Then I launch R with:
mpirun -np 1 R --slave -f test.R
I checked $PBS_NODEFILE afterwards, and two nodes were allocated as requested: the file contains the two node names, node1 and node2, each appearing 24 times.
The content of test.R is as follows:
library(Rmpi)
library(snow)
cl <- makeCluster(41,type="MPI")
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
stopCluster(cl)
mpi.quit()
The output of clusterCall() is quite disappointing. Only one node name, node1, is reported, and it appears 41 times. This is definitely wrong, since there are only 24 threads on node1. It seems that my R script finds only one node, or even only one thread on it. What is the right way to construct an MPI cluster?
First of all, your cluster is definitely not managed by SGE, even if SGE is installed. SGE doesn't understand the #PBS sentinel in job files, and it doesn't export the PBS_NODEFILE environment variable (most environment variables that SGE exports start with SGE_). It also won't accept the nodes=2:ppn=24 resource request, because under SGE the distribution of slots among the allocated nodes is controlled by the specified parallel environment. What you have is either PBS Pro or Torque. SGE gives its command-line utilities the same names, and its qsub takes more or less the same arguments, which is probably why you think you have SGE.
The problem you describe usually occurs if Open MPI is not able to properly obtain the node list from the environment, e.g. if it wasn't compiled with support for PBS Pro/Torque. In that case, it starts all MPI processes on the node on which mpirun was executed. Check that the proper RAS (resource allocation subsystem) module was compiled by running:
ompi_info | grep ras
It should list the various RAS modules, and among them should be one called tm:
...
MCA ras: tm (MCA v2.0, API v2.0, Component v1.6.5)
...
If the tm module is not listed, then Open MPI will not automatically obtain the node list and the hostfile must be explicitly specified:
mpiexec ... -machinefile $PBS_NODEFILE ...
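The check can also be scripted, so that a job reports up front whether the hostfile has to be passed by hand. What follows is only a minimal sketch: the helper name has_tm_component is made up for illustration, and it assumes that ompi_info prints component lines in the "MCA ras: tm (...)" form shown above.

```shell
#!/bin/sh
# has_tm_component FRAMEWORK
# Reads `ompi_info` output on stdin and succeeds if a "tm" component
# is listed for the given MCA framework (e.g. ras or plm).
# Hypothetical helper; assumes the "MCA <framework>: tm (...)" line format.
has_tm_component() {
    grep -Eq "MCA[[:space:]]+$1:[[:space:]]+tm[[:space:]]"
}

# Usage inside a job script (skipped when Open MPI is not on the PATH).
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_tm_component ras; then
        echo "tm RAS component found: mpiexec should read the allocation itself"
    else
        echo "tm RAS component missing: pass -machinefile \$PBS_NODEFILE to mpiexec"
    fi
fi
```

The same helper can be called with plm instead of ras to look for the launcher component of the same name.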
Under PBS Pro/Torque, Open MPI also needs the tm PLM module. Without it, Open MPI cannot use the TM API to launch processes on the second node and falls back to SSH. In that case, make sure that passwordless SSH login (e.g. using public-key authentication) is possible from every cluster node to every other node.
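If SSH is the launcher, one way to test the logins is to attempt a non-interactive connection to every allocated node from within a job. This is a rough sketch that assumes the OpenSSH client, whose BatchMode option makes a failed key login return an error instead of prompting for a password. The helper name unique_hosts is illustrative.

```shell
#!/bin/sh
# unique_hosts FILE
# Prints each distinct hostname in a PBS-style node file once
# (the node file repeats a host once per allocated slot).
unique_hosts() {
    sort -u "$1"
}

# Inside a job: try a passwordless login to every allocated node.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    for host in $(unique_hosts "$PBS_NODEFILE"); do
        # -n keeps ssh from reading the script's stdin;
        # BatchMode=yes makes ssh fail instead of asking for a password.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "$host: OK"
        else
            echo "$host: passwordless SSH failed"
        fi
    done
fi
```

This only tests logins from the node that runs the job script. To cover every pair of nodes, the same loop would have to be run on each node in turn.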
Your first step in solving the issue is to check for the presence of the correct modules as shown above. If the modules are there, launch hostname under mpiexec and check whether that works, e.g.:
#PBS -l nodes=2:ppn=24
echo "Allocated nodes:"
cat $PBS_NODEFILE
echo "MPI nodes:"
mpiexec --mca ras_base_display_alloc 1 hostname
Then compare the two lists and also examine the ALLOCATED NODES block. The lists should be more or less identical, and both nodes should appear in the allocated nodes table with 24 slots each (cf. Num slots). If the second list contains only one hostname, then Open MPI is not able to obtain the hostfile properly because something is preventing the tm modules (given that they do exist) from initialising or being selected. The cause could be either the system-wide Open MPI configuration or some other RAS module with a higher priority. Passing --mca ras_base_verbose 10 to mpiexec helps in determining whether that is the case.
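With dozens of lines in each list, comparing them by eye is error-prone, and counting the entries per host makes a mismatch easier to spot. This is a small sketch in that spirit. The count_hosts helper is illustrative and not part of Open MPI or PBS.

```shell
#!/bin/sh
# count_hosts
# Reads hostnames on stdin (one per line) and prints "<host> <count>"
# for each distinct host, sorted by hostname.
count_hosts() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job script: summarise both lists so they can be compared.
if [ -n "${PBS_NODEFILE:-}" ] && command -v mpiexec >/dev/null 2>&1; then
    echo "Allocated:"; count_hosts < "$PBS_NODEFILE"
    echo "Launched:";  mpiexec hostname | count_hosts
fi
```

The two summaries may differ in spelling if hostname prints fully qualified names while the node file uses short ones, so compare the counts rather than the exact strings.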