I want to run two MPI programs in parallel within the same job script. In SLURM I would usually just write a script for sbatch (shortened):
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
mpirun program1 &
mpirun program2
This works fine. The two programs communicate with each other internally and coordinate execution, so overcommitting is fine. Moreover, they depend on each other and cannot run stand-alone in the present configuration.
However, if I want to extend this to several nodes, e.g.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
SLURM does not start the first step in the background. Instead, it runs in the foreground, fails because it does not find the second step, and the second step then also fails -- because it does not find the first.
I am a bit at a loss here because that is the suggested solution to similar problems (e.g. Run a "monitor" task alongside mpi task in SLURM), and I see no reason why it should not work across several nodes. Indeed it does elsewhere, for instance on PBS.
With SLURM the main issue is that you should use srun instead of mpirun, i.e.:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
srun --overlap program1 &
srun --overlap program2
wait
When running on multiple nodes, using srun is crucial, --overlap allows the job steps to share the allocated resources, and wait ensures that the batch script does not exit before all steps have finished.
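Putting this together for the two-node case from the question, a full batch script might look like the following sketch. The program names and the task counts are placeholders for your setup; note that --overlap was introduced in Slurm 20.11, so on older versions you may need srun --oversubscribe (-s) instead:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2

# Launch both steps with srun so tasks are distributed across all
# allocated nodes. --overlap lets the two steps share the same CPUs
# (Slurm >= 20.11); adjust -n if the programs should split the
# allocation between them instead of each using all four tasks.
srun -n 4 --overlap ./program1 &
srun -n 4 --overlap ./program2 &

# Block until both backgrounded job steps have exited; otherwise the
# batch script ends and Slurm terminates the remaining step.
wait
```

If the two programs should instead divide the allocation (two tasks each), srun -n 2 for each step works the same way; --overlap is then only needed if their tasks must land on the same CPUs.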