python parallel-processing mpi cluster-computing slurm

How to run an MPI Python script across multiple nodes on a Slurm cluster? Warning: can't run 1 processes on 2 nodes, setting nnodes to 1


I'm running a script on a Slurm cluster that could benefit from parallel processing, so I'm trying to implement MPI. However, it doesn't seem to allow me to run processes on multiple nodes. I don't know if this is normally done automatically, but whenever I set --nodes=2 in the batch file for submission, I get the error message:

"Warning: can't run 1 processes on 2 nodes, setting nnodes to 1."

I've been trying to get it to work with a simple Hello World script, but still run into the above error. I added --oversubscribe to the options when I run the MPI script, but still get this error.

#SBATCH --job-name=a_test
#SBATCH --mail-type=ALL
#SBATCH --ntasks=1
#SBATCH --cpu-freq=high
#SBATCH --nodes=2
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1gb
#SBATCH --mem-bind=verbose,local
#SBATCH --time=01:00:00
#SBATCH --output=out_%x.log

module load python/3.6.2
mpirun -np 4 --oversubscribe python par_PyScript2.py

I still get the expected output, but only after the error message:

"Warning: can't run 1 processes on 2 nodes, setting nnodes to 1."

I'm worried that without being able to run on multiple nodes, my actual script will be a lot slower.


Solution

  • The reason for the warning is this line:

    #SBATCH --ntasks=1
    

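    where you tell Slurm that the job runs only 1 MPI process, right before requesting 2 nodes. A single task cannot be spread across two nodes, so Slurm sets the node count to 1 and prints the warning.

    In your case, --ntasks sets the number of processes (MPI ranks) to run. You then override it with the equivalent -np 4 on the mpirun line, which is why you still get the expected output. To get rid of the warning, request at least as many tasks as nodes, for example --ntasks=4 to match the 4 ranks you launch.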
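To make the rule behind the warning concrete, here is a minimal Python sketch. It is only an illustrative model built from the wording of the warning in the question, not code taken from Slurm; the function name `effective_nodes` is hypothetical, and the generalization "nnodes is set to the task count" is an assumption drawn from that one message.

```python
def effective_nodes(ntasks, nodes):
    """Return (node_count, warning) for a request of `ntasks` tasks on `nodes` nodes.

    Illustrative model only: it mirrors the warning quoted in the question
    and is not derived from Slurm's source code.
    """
    if ntasks < nodes:
        # Fewer tasks than nodes: some nodes would have nothing to run,
        # so the node count is lowered (assumed: to the task count).
        warning = (f"Warning: can't run {ntasks} processes on {nodes} nodes, "
                   f"setting nnodes to {ntasks}")
        return ntasks, warning
    return nodes, None


# The request from the question: --ntasks=1 with --nodes=2.
print(effective_nodes(1, 2))
# A consistent request: --ntasks=4 with --nodes=2.
print(effective_nodes(4, 2))
```

With at least as many tasks as nodes, the model returns the requested node count and no warning, which is the situation you want before launching 4 ranks.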

    For reference, this is the script I run on my system:

    #!/bin/bash
    
    #SBATCH -C knl 
    #SBATCH -q regular
    #SBATCH -t 00:10:00
    
    #SBATCH --nodes=2
    
    module load python3
    
    START_TIME=$SECONDS
    
    srun -n 4 python mpi_py.py >& py_${SLURM_JOB_ID}.log
    
    ELAPSED_TIME=$(($SECONDS - $START_TIME))
    echo $ELAPSED_TIME
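To check that the ranks really land on more than one node, a small test script along the following lines may help. This is a sketch, not the script from the question or the answer: it assumes mpi4py is available in the loaded Python module, and the file name `check_placement.py` and the helper `group_by_host` are hypothetical. Without mpi4py it falls back to a single local "rank" so it still runs.

```python
# Hypothetical file name: check_placement.py (not part of the original post).
import socket
from collections import defaultdict

try:
    from mpi4py import MPI  # assumption: mpi4py is installed on the cluster
except ImportError:
    MPI = None  # fall back to a single local "rank" when mpi4py is missing


def group_by_host(pairs):
    """Map each host name to the sorted list of ranks that ran on it."""
    hosts = defaultdict(list)
    for rank, host in pairs:
        hosts[host].append(rank)
    return {host: sorted(ranks) for host, ranks in hosts.items()}


def main():
    if MPI is None:
        pairs = [(0, socket.gethostname())]
    else:
        comm = MPI.COMM_WORLD
        # Every rank sends (rank, host name); only rank 0 receives the list.
        pairs = comm.gather((comm.Get_rank(), MPI.Get_processor_name()), root=0)
        if comm.Get_rank() != 0:
            return
    # One line per host, listing the ranks placed there.
    for host, ranks in sorted(group_by_host(pairs).items()):
        print(f"{host}: ranks {ranks}")


if __name__ == "__main__":
    main()
```

Launched inside a two-node allocation, for example with `srun -n 4 python check_placement.py`, it should print one line per host; seeing two different host names would indicate that the ranks are spread across both nodes.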
    

    Performance notes: