Tags: mpi, slurm, sbatch, intel-mpi

SLURM: Run two MPI jobs with different settings on the same set of nodes


I have a SLURM batch script, and I'm running Intel MPI.

I want to run two different MPI codes on the same set of nodes with different process placement configurations.

I'm running two MPI codes, one with -np 8 and the other with -np 2. For the -np 8 case, I want the first mpiexec to place ranks [0, 1, 2, 3] on node0 and ranks [4, 5, 6, 7] on node1.

For the -np 2 case, I want the second mpiexec to place rank [0] on node0 and rank [1] on node1.

I've tried -ppn, -perhost, I_MPI_PERHOST, and the other options in Intel's process-placement documentation, but none of them work.
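For reference, my attempts looked roughly like the following sketch (the flag values are just an example of the placement I'm asking for):

# 4 ranks per node for the 8-rank run, 1 rank per node for the 2-rank run
mpiexec.hydra -n 8 -ppn 4 ./hello.out &
mpiexec.hydra -n 2 -ppn 1 ./world.out &
wait

# I also tried the equivalent environment-variable form, e.g.
#   export I_MPI_PERHOST=4
# before the corresponding mpiexec.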

The mpiexec launches seem to inherit their placement directly from the SBATCH options instead of honoring their own local settings. Please don't suggest srun; there is some issue between MPI and srun on my cluster. I'm unable to run MPI with srun across multiple nodes (PMI error), whereas mpiexec works directly on multiple nodes.

Is there any way I can achieve the above task? Here's the SBATCH script I'm currently using.

#!/bin/bash

#SBATCH -p small
#SBATCH -N 2
#SBATCH --exclusive
#SBATCH --time=01:00:00
#SBATCH --error=err.out
#SBATCH --output=out.out
#SBATCH --ntasks=10

module load compiler/intel/2018.2.199
module load apps/ucx/ucx_1.13.1

source /opt/ohpc/pub/apps/intel/2018_2/compilers_and_libraries_2018.2.199/linux/mpi/intel64/bin/mpivars.sh intel64
export I_MPI_FALLBACK=disable

#--> First mpiexec
mpiexec.hydra -n 8 ./hello.out & 

#--> Second mpiexec
mpiexec.hydra -n 2 ./world.out & 

wait

With this script, my first mpiexec runs ranks [0,1,2,3,4] on node0 and ranks [5,6,7] on node1, whereas my second mpiexec runs both ranks [0,1] on node0.

I want my first mpiexec to run ranks [0,1,2,3] on node0 and ranks [4,5,6,7] on node1, and my second mpiexec to run rank [0] on node0 and rank [1] on node1.
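One explicit way to express this layout, which I would expect to work if each launcher honored its local options, is to expand the SLURM nodelist and pass the host list plus a per-node rank count to each mpiexec. This is only a sketch of that idea:

# Expand the SLURM allocation into one hostname per array element
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))

# 8 ranks, 4 per node: ranks [0-3] on ${nodes[0]}, ranks [4-7] on ${nodes[1]}
mpiexec.hydra -hosts "${nodes[0]},${nodes[1]}" -ppn 4 -n 8 ./hello.out &

# 2 ranks, 1 per node: rank [0] on ${nodes[0]}, rank [1] on ${nodes[1]}
mpiexec.hydra -hosts "${nodes[0]},${nodes[1]}" -ppn 1 -n 2 ./world.out &

wait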

I can't use srun; I can only use mpiexec.

Is there any way to give each mpiexec its own local settings? Any suggestions would be helpful.


Solution

  • The issue was with the version of Intel MPI.

    Using the latest version of Intel MPI fixed the issue.

    I credit Gilles Gouaillardet for the solution. Please refer to the comments section for the discussion.
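
    For completeness, a minimal sketch of the working setup after moving to a newer Intel MPI runtime (the module name below is a placeholder for whatever recent release your site provides). With a recent Intel MPI, each mpiexec honors its own per-launcher placement options, so the job body reduces to:

    # Placeholder: load a recent Intel MPI instead of the 2018 release
    module load compiler/intel/2021.x

    # 4 ranks per node -> ranks [0-3] on node0, ranks [4-7] on node1
    mpiexec.hydra -n 8 -ppn 4 ./hello.out &

    # 1 rank per node -> rank [0] on node0, rank [1] on node1
    mpiexec.hydra -n 2 -ppn 1 ./world.out &

    wait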