I am getting errors whenever I use mpirun inside a batch script in an active conda environment (the error does not happen if I run the command without a batch script, or if I am not in a conda environment).
I have a simple test script called test.py:
from mpi4py import MPI
comm = MPI.COMM_WORLD
n_proc = comm.Get_size()
proc_id = comm.Get_rank()
if proc_id == 0:
    print('Number of processors = ' + str(n_proc))
print('Hello from proc id = ' + str(proc_id))
If I just run mpirun -np 5 python test.py on the login node, I get the expected result:
Number of processors = 5
Hello from proc id = 0
Hello from proc id = 1
Hello from proc id = 2
Hello from proc id = 3
Hello from proc id = 4
But if I use the following batch script:
#!/bin/bash
# Submit this script with: sbatch <this-filename>
#SBATCH --time=0:30:00 # walltime
#SBATCH -n 5
#SBATCH --mem-per-cpu=10G # memory per CPU core
#SBATCH --qos=normal # qos
#SBATCH -J "mpi" # job name
##SBATCH -p general # partition (queue)
##SBATCH -o slurm.%N.%j.out # STDOUT
##SBATCH -e slurm.%N.%j.err # STDERR
# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
mpirun python test.py
and submit it with sbatch batch_script, then I get the following error:
Error: node list format not recognized. Try using '-hosts=<hostnames>'.
/var/spool/slurmd/job12649152/slurm_script: line 21: 224459 Aborted (core dumped) mpirun python test.py
I tried adding the line #SBATCH -hosts=n1, but I still got the exact same error (except that the filename of the output file became sts=n1). I also tried building another conda environment with an older version of mpich (mpich/3.2.1), but it didn't work either.
If any of the commands in the batch script depend on Conda being initialized and/or an environment being activated, then the current shebang needs to be adjusted. Try instead:
#!/bin/bash -l
This tells the script to run in a login shell, which sources the shell initialization files (e.g., ~/.profile or ~/.bash_profile, which typically source ~/.bashrc), where the Conda initialization code is placed by default.
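For reference, here is a minimal sketch of the adjusted batch script. The explicit conda activate line and the environment name mpi_env are assumptions (the question only says an environment was active at submission time); substitute the environment that actually has mpi4py installed:

#!/bin/bash -l
# Submit this script with: sbatch <this-filename>
#SBATCH --time=0:30:00       # walltime
#SBATCH -n 5                 # number of tasks
#SBATCH --mem-per-cpu=10G    # memory per CPU core
#SBATCH --qos=normal         # qos
#SBATCH -J "mpi"             # job name

# Because this is a login shell, the shell startup files (and the conda
# initialization block they contain) have been sourced, so the `conda`
# command is available here.
conda activate mpi_env       # placeholder environment name

mpirun python test.py

With the login shell in place, conda activate (and anything else defined in your initialization files) works inside the job script, just as it does in an interactive session on the login node.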