Tags: rmpi, qsub, snow

Initialize MPI cluster using Rmpi


Recently I have been trying to use the department cluster for parallel computing in R. The cluster is managed by SGE. Open MPI has been installed and has passed the installation test.

I submit my job to the cluster with the qsub command. In the job script, I specify the number of nodes I want with the following directive (two nodes with 24 threads each):
#PBS -l nodes=2:ppn=24
Then I launch R with:
mpirun -np 1 R --slave -f test.R
Afterwards I checked $PBS_NODEFILE. Two nodes are allocated as requested: the file lists the names node1 and node2, and each of them appears 24 times.

The content of test.R is as follows.

library(Rmpi)
library(snow)

cl <- makeCluster(41,type="MPI")
clusterCall(cl, function() Sys.info()[c("nodename","machine")])
stopCluster(cl)
mpi.quit()

The output of clusterCall() is quite disappointing. Only one node name, node1, is reported, and it appears 41 times. This is definitely wrong, since there are only 24 threads on node1. It seems that my R script finds only one node, or even only one thread on it. What is the right way to construct an MPI cluster?


Solution

  • First of all, your cluster is definitely not managed by SGE even if the latter is installed. SGE doesn't understand the #PBS sentinel in the job files and it doesn't export the PBS_NODEFILE environment variable (most environment variables that SGE exports start with SGE_). It also won't accept the nodes=2:ppn=24 resource request as the distribution of the slots among the allocated nodes is controlled by the specified parallel environment. What you have is either PBS Pro or Torque. But SGE names the command line utilities the same and qsub takes more or less the same arguments, which probably is why you think it is SGE that you have.

    The problem you describe usually occurs if Open MPI is not able to properly obtain the node list from the environment, e.g. if it wasn't compiled with support for PBS Pro/Torque. In that case, it will start all MPI processes on the node on which mpirun was executed. Check that the proper RAS module was compiled by running:

    ompi_info | grep ras
    

    It should list the various RAS modules and among them should be one called tm:

    ...
    MCA ras: tm (MCA v2.0, API v2.0, Component v1.6.5)
    ...
    

    If the tm module is not listed, then Open MPI will not automatically obtain the node list and the hostfile must be explicitly specified:

    mpiexec ... -machinefile $PBS_NODEFILE ...
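
    The check and the fallback can be combined in the job script. The following is only a sketch, assuming a POSIX shell and the ompi_info output format shown above; the function names has_tm_ras and machinefile_args are made up for illustration. It tells you whether the tm RAS component was built, not whether it is actually selected at run time, and it does not look at the PLM component.

```shell
#!/bin/sh

# Hypothetical helper: succeeds if the ompi_info output read from stdin
# lists the tm RAS component (a line such as "MCA ras: tm (MCA v2.0, ...").
has_tm_ras() {
    grep -q 'MCA ras: tm '
}

# Hypothetical helper: reads ompi_info output from stdin and prints the
# extra mpiexec arguments. Nothing is printed when tm is available;
# otherwise the node file is passed explicitly.
# Assumes the path in PBS_NODEFILE contains no whitespace.
machinefile_args() {
    if has_tm_ras; then
        printf '\n'
    else
        printf '%s\n' "-machinefile $PBS_NODEFILE"
    fi
}

# Usage inside a job script (not executed here):
#   mpiexec $(ompi_info | machinefile_args) -np 1 R --slave -f test.R
```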
    

    Under PBS Pro/Torque, Open MPI also needs the tm PLM module. Without that module, Open MPI cannot use the TM API to launch the processes remotely on the second node and will fall back to SSH. In that case, you should make sure that passwordless SSH login, e.g. one using public key authentication, is possible from each cluster node into every other node.
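
    One way to test the SSH fallback from inside a job is to attempt a non-interactive login to every allocated host. This is a rough sketch: the function name check_passwordless_ssh and the SSH_CMD override are assumptions introduced here, and it only tests logins from the node where the script runs, not from every node into every other node.

```shell
#!/bin/sh

# Hypothetical helper: try a non-interactive login to every distinct host
# listed in the file given as $1. Prints the hosts that fail and returns
# non-zero if any of them failed.
# SSH_CMD is an illustrative override of the ssh binary; it defaults to ssh.
check_passwordless_ssh() {
    failed=0
    for host in $(sort -u "$1"); do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if ! ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "cannot log in without a password: $host"
            failed=1
        fi
    done
    return "$failed"
}

# Usage inside a job script (not executed here):
#   check_passwordless_ssh "$PBS_NODEFILE"
```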

    Your first step in solving the issue is to check for the presence of the correct modules as shown above. If the modules are there, you should launch hostname under mpiexec and check if that works, e.g.:

    #PBS -l nodes=2:ppn=24
    
    echo "Allocated nodes:"
    cat $PBS_NODEFILE
    echo "MPI nodes:"
    mpiexec --mca ras_base_display_alloc 1 hostname
    

    then compare the two lists and also examine the ALLOCATED NODES block. The lists should be more or less equal and both nodes should be shown in the allocated nodes table with 24 slots per node (cf. Num slots). If the second list contains only one hostname, then Open MPI is not able to properly obtain the hostfile because something is preventing the tm modules (given that they do exist) from initialising or being selected. This could either be the system-wide Open MPI configuration or some other RAS module having higher priority. Passing --mca ras_base_verbose 10 to mpiexec helps in determining if that is the case.
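
    Comparing two lists of 48 lines by eye is error-prone, so it can help to reduce each list to one line per host with a count. A minimal sketch, assuming both lists use the same form of the hostname (short versus fully qualified names would show up as a mismatch); slots_per_host and same_allocation are hypothetical names, and mpi_hosts.txt is a made-up file name.

```shell
#!/bin/sh

# Hypothetical helper: print "hostname count" for each distinct host
# in the file given as $1, sorted by hostname.
slots_per_host() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Hypothetical helper: succeeds when both files list the same hosts
# the same number of times, regardless of line order.
same_allocation() {
    [ "$(slots_per_host "$1")" = "$(slots_per_host "$2")" ]
}

# Usage inside a job script (not executed here):
#   mpiexec hostname > mpi_hosts.txt
#   slots_per_host "$PBS_NODEFILE"
#   slots_per_host mpi_hosts.txt
#   same_allocation "$PBS_NODEFILE" mpi_hosts.txt || echo "lists differ"
```

    With the allocation from the question, both summaries would be expected to show node1 and node2 with 24 entries each.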