bash conda openmpi slurm mpi4py

Error in Slurm when using mpirun in a conda environment


I get an error whenever I use mpirun inside a batch script while a conda environment is active. The error does not occur if I run mpirun outside a batch script, or if I am not in a conda environment.

I have a simple test script called test.py:

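from mpi4py import MPI

comm = MPI.COMM_WORLD
n_proc = comm.Get_size()   # total number of MPI processes
proc_id = comm.Get_rank()  # rank of this process

# Only rank 0 reports the total process count
if proc_id == 0:
    print('Number of processors = '+str(n_proc))

print('Hello from proc id = '+str(proc_id))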

If I just run mpirun -np 5 python test.py on the login node, I get the expected result:

Number of processors = 5
Hello from proc id = 0
Hello from proc id = 1
Hello from proc id = 2
Hello from proc id = 3
Hello from proc id = 4

But if I use the following batch script:

#!/bin/bash

# Submit this script with: sbatch <this-filename>

#SBATCH --time=0:30:00   # walltime
#SBATCH -n 5
#SBATCH --mem-per-cpu=10G   # memory per CPU core
#SBATCH --qos=normal # qos
#SBATCH -J "mpi"   # job name

## /SBATCH -p general # partition (queue)
## /SBATCH -o slurm.%N.%j.out # STDOUT
## /SBATCH -e slurm.%N.%j.err # STDERR

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
mpirun python test.py 

and submit it with sbatch batch_script, I get the following error:

Error: node list format not recognized. Try using '-hosts=<hostnames>'.
/var/spool/slurmd/job12649152/slurm_script: line 21: 224459 Aborted                 (core dumped) mpirun python test.py

I tried adding the line #SBATCH -hosts=n1, but I still got the exact same error (except that the filename of the output file became sts=n1). I also tried building another conda environment with an older version of mpich (mpich/3.2.1), but it didn't work either.


Solution

  • If any of the commands depend on Conda being initialized and/or an environment being activated, then the current shebang needs to be adjusted. Try this instead:

    #!/bin/bash -l
    

    The -l flag makes bash run the script as a login shell, so it reads the login startup files first (e.g., ~/.bash_profile, which on many systems in turn sources ~/.bashrc), where the Conda initialization code is located by default.
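
    As a rough sketch of how the adjusted batch script might look, the version below keeps the #SBATCH lines from the question and adds a few diagnostic lines. The helper functions report_login_shell and report_tool, the environment name myenv, and the SLURM_JOB_ID guard are illustrative assumptions, not part of the original script. The diagnostics print whether bash really started as a login shell and which conda, mpirun, and python are found on PATH, which may help confirm that the job sees the same tools as the interactive session.

```shell
#!/bin/bash -l
# -l asks bash to start as a login shell, so the startup files
# (and the Conda initialization they contain) are read first.

#SBATCH --time=0:30:00   # walltime
#SBATCH -n 5
#SBATCH --mem-per-cpu=10G   # memory per CPU core
#SBATCH --qos=normal # qos
#SBATCH -J "mpi"   # job name

# Hypothetical helper: report whether bash was started as a login shell.
report_login_shell() {
    if shopt -q login_shell; then
        echo "login shell: yes"
    else
        echo "login shell: no"
    fi
}

# Hypothetical helper: print where a command resolves to, or note that
# it is missing. For shell functions (such as an initialized conda),
# "command -v" prints just the name.
report_tool() {
    local tool="$1"
    local resolved
    if resolved="$(command -v "$tool")"; then
        echo "$tool -> $resolved"
    else
        echo "$tool -> not found on PATH"
    fi
}

report_login_shell
report_tool conda
report_tool mpirun
report_tool python

# Only launch the MPI run inside a Slurm job (Slurm sets SLURM_JOB_ID),
# so the diagnostics above can also be checked by hand on a login node.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    conda activate myenv   # assumption: replace myenv with your environment name
    mpirun python test.py
fi
```

    If the job output shows mpirun resolving to a different path than it does in the interactive conda session, that mismatch is a likely place to start looking.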