Tags: mpi, slurm, sbatch

When using slurm, how do I instruct it to use a different number of tasks per node? Heterogeneous jobs don't seem to work


Let's say I have two nodes that I want to run a job on, with node1 having 64 cores and node2 having 48.

If I want to run 47 tasks on node2 and 1 task on node1, that is easy enough with a hostfile like

node1 max-slots=1
node2 max-slots=47

and then something like this jobfile:

#!/bin/bash

#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --nodelist=node1,node2
#SBATCH --partition=partition_name
#SBATCH --ntasks-per-node=48 
#SBATCH --cpus-per-task=1

export OMP_NUM_THREADS=1
mpirun --display-allocation --hostfile hosts --report-bindings hostname

The output of --display-allocation is

======================   ALLOCATED NODES   ======================
    node1: slots=48 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: node1
    node2: slots=48 max_slots=0 slots_inuse=0 state=UP
    Flags: SLOTS_GIVEN
    aliases: NONE
=================================================================

======================   ALLOCATED NODES   ======================
    node1: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: node1
    node2: slots=47 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
    aliases: <removed>
=================================================================

so all good, all expected.

The problem arises when I want to launch a job with more tasks than one of the nodes can allocate, i.e. with the hostfile

node1 max-slots=63
node2 max-slots=1

Then,

  1. --ntasks-per-node=63 gives an error in node allocation.
  2. --ntasks=64 does an equitable division like node1:slots=32 node2:slots=32, which is then reduced to node1:slots=32 node2:slots=1 once the hostfile is read. --ntasks=112 (64+48, to grab both nodes in full) also gives an error in node allocation.
  3. #SBATCH --distribution=arbitrary with a properly formatted slurm hostfile (see the sketch after this list) runs with just 1 rank on the node in the first line of the hostfile, and doesn't automatically calculate ntasks from the number of lines in the hostfile. EDIT: It turns out SLURM_HOSTFILE only controls the node list, not the CPU distribution within those nodes, so this won't work for my case anyway.
  4. Same as #3, but with --ntasks given, makes slurm complain that SLURM_NTASKS_PER_NODE is not set.
  5. A heterogeneous job with
#!/bin/bash

#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --nodelist=node1
#SBATCH --partition=partition_name
#SBATCH --ntasks-per-node=63 --cpus-per-task=1
#SBATCH hetjob
#SBATCH --nodes=1
#SBATCH --nodelist=node2
#SBATCH --partition=partition_name
#SBATCH --ntasks-per-node=1 --cpus-per-task=1

export OMP_NUM_THREADS=1
mpirun --display-allocation --hostfile hosts --report-bindings hostname

puts all ranks on the first node. The head of the output is

======================   ALLOCATED NODES   ======================
    node1: slots=63 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: node1
=================================================================

======================   ALLOCATED NODES   ======================
    node1: slots=63 max_slots=0 slots_inuse=0 state=UP
        Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
        aliases: node1
=================================================================

It seems like mpirun tries to launch the executable independently within each component of the heterogeneous allocation, instead of launching one executable across the two nodes.
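For concreteness, the "properly formatted slurm hostfile" in attempt 3 means one hostname per line, one line per intended task, pointed to through SLURM_HOSTFILE. A rough sketch of that attempt (placeholder node names as above, ./a.out standing in for the actual executable):

#!/bin/bash

#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --nodelist=node1,node2
#SBATCH --partition=partition_name
#SBATCH --distribution=arbitrary

# one hostname per line, one line per task: 63 x node1, 1 x node2
for i in $(seq 63); do echo node1; done >  slurm_hosts
echo node2                              >> slurm_hosts
export SLURM_HOSTFILE=$PWD/slurm_hosts

srun ./a.out

As the EDIT in attempt 3 says, this only controls which nodes the tasks land on, not how many CPUs each node contributes, so it doesn't fix the problem.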

What else can I try? I can't think of anything else.


Solution

  • With this slurm script I get a heterogeneous job allocation:

    #!/bin/bash
    #SBATCH --job-name=Test-hetjob
    #SBATCH --time=00:10:00
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=2
    #SBATCH hetjob
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=6
    
    srun -l /usr/bin/hostname : /usr/bin/hostname
    

    The output:

    0: host1
    1: host1
    6: host2
    2: host2
    5: host2
    3: host2
    4: host2
    7: host2
    

    It is important to use the MPMD notation (:) for the execution of the application. It might be an artifact of our cluster setup or a general problem, but the second application seems to be executed with a broken environment (e.g. an empty PATH). For this reason, I execute hostname with its absolute path. You might want to wrap the execution of both apps in a bash script (a sketch follows) to ensure proper bash initialization and the loading of any necessary modules.
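    A minimal sketch of such a wrapper, assuming a module-based environment and an MPI binary called ./a.out (both placeholders for whatever your site actually uses):

    #!/bin/bash
    # wrapper.sh -- hypothetical wrapper so each hetjob component starts from a
    # sane shell environment before the real application runs.
    source /etc/profile           # re-initialize the login environment (PATH etc.)
    module load openmpi           # site-specific; load whatever the application needs
    exec ./a.out "$@"             # hand control over to the actual MPI binary

    With wrapper.sh made executable (chmod +x wrapper.sh), the srun line becomes srun -l ./wrapper.sh : ./wrapper.sh.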

    Updated result with MPI-hello-world:

    #!/bin/bash
    #SBATCH --job-name=Test-hetjob
    #SBATCH --time=00:10:00
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=2
    #SBATCH hetjob
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=6
    
    srun -l --export=ALL ./a.out : --export=ALL ./a.out
    
    0: Hello world from processor host1, rank 0 out of 8 processors
    2: Hello world from processor host2, rank 2 out of 8 processors
    1: Hello world from processor host1, rank 1 out of 8 processors
    3: Hello world from processor host2, rank 3 out of 8 processors
    4: Hello world from processor host2, rank 4 out of 8 processors
    5: Hello world from processor host2, rank 5 out of 8 processors
    6: Hello world from processor host2, rank 6 out of 8 processors
    7: Hello world from processor host2, rank 7 out of 8 processors
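
    Applied to the original question, the same pattern would look roughly like this; this is only a sketch, reusing the placeholder node and partition names from the question, with ./a.out standing in for the real MPI application:

    #!/bin/bash
    #SBATCH --job-name=Test-hetjob
    #SBATCH --time=00:30:00
    #SBATCH --nodes=1
    #SBATCH --nodelist=node1
    #SBATCH --partition=partition_name
    #SBATCH --ntasks-per-node=63
    #SBATCH --cpus-per-task=1
    #SBATCH hetjob
    #SBATCH --nodes=1
    #SBATCH --nodelist=node2
    #SBATCH --partition=partition_name
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=1

    export OMP_NUM_THREADS=1
    # MPMD notation (:) makes srun launch one MPI job spanning both hetjob components
    srun -l --export=ALL ./a.out : --export=ALL ./a.out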