[SOLVED] how to make job distribution to nodes depend on partition

We have a heterogeneous cluster with some small nodes (64 cores) in partition_smallnodes and some larger nodes (256 cores) in partition_largenodes.

I have hundreds of jobs to submit. For simplicity, let's assume there are only 2 jobs: 1 and 2. Every job runs as 2 MPI tasks: 1_1, 1_2, 2_1, 2_2. I want to submit them to any of the available partitions.

How can I ask Slurm to distribute the tasks in a way that depends on the partition? That is, if jobs 1 and 2 go to partition_largenodes, run 1_1 and 1_2 on the same node and 2_1 and 2_2 on another node (not 1_1 and 2_1 on one node and 1_2 and 2_2 on the other); but if they go to partition_smallnodes, run 1_1, 1_2, 2_1, and 2_2 each on a separate node.

Solution

After searching for ideas, and making use of this thread, I managed to solve it as follows:

I divided the submission script into header and body.
There are several headers, one per partition/queue, and a single body shared by all of them.
Every header defines its own number of nodes and number of tasks.
Jobs submitted to different partitions are given different nice values, according to my priority preference.
When the first job on any queue starts, it cancels all other jobs with the same name, then sources the body file.
Do NOT add #SBATCH --dependency singleton to your script. With singleton, a job can begin only after all previously launched jobs with the same name and user have terminated, so every job waited for the first submitted one and none could start before it was dispatched (at least in my experience).
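
The submission side of the steps above can be sketched as a small loop. This is only a sketch under assumptions: the function name `submit_variants`, the `SBATCH_CMD` variable, and the header file names are hypothetical. It relies on the fact that options given on the sbatch command line override the `#SBATCH` directives in the script, so the same headers can be reused for every job by passing a different `--job-name`.

```shell
#!/bin/bash
# Hypothetical helper: submit one copy of a job per partition header.
# SBATCH_CMD defaults to sbatch; set SBATCH_CMD="echo sbatch" for a dry run.
submit_variants() {
    local job_name=$1
    shift
    local header
    for header in "$@"; do
        # Same job name for every variant, so the first copy to start
        # can find and cancel the others.
        ${SBATCH_CMD:-sbatch} --job-name="$job_name" "$header"
    done
}

# Example (file names are placeholders, not executed here):
# submit_variants job_0001 header1.sh header2.sh
```

A dry run with `SBATCH_CMD="echo sbatch"` prints the commands instead of submitting them, which is a cheap way to check the loop before queuing hundreds of jobs.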

Example header1

#!/bin/bash
#SBATCH --job-name="your_Unique_Job_Name_Here"
##SBATCH --dependency singleton

#SBATCH --partition=mw256
#SBATCH --cpus-per-task=40
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --nice=500
#SBATCH --mem=64

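# Cancel all other jobs with the same name (the copies queued on other partitions).
# Only do this inside a Slurm job, where SLURM_JOB_ID and SLURM_JOB_NAME are set.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    # grep -x matches whole IDs (job 123 does not also exclude job 1234);
    # xargs -r (GNU) skips scancel when no other jobs are left.
    squeue -o "%i" -h -n "$SLURM_JOB_NAME" | grep -vx "$SLURM_JOB_ID" | xargs -r scancel
fi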

# check if script is started via SLURM or bash
# if with SLURM: the variable '$SLURM_JOB_ID' will exist
# `if [ -n "${SLURM_JOB_ID:-}" ]` checks if $SLURM_JOB_ID is not an empty string
if [ -n "${SLURM_JOB_ID:-}" ];  then
    # check the original location through scontrol and $SLURM_JOB_ID
    echo "running from Slurm"
    SCRIPT_PATH=$(scontrol show job "$SLURM_JOB_ID" | awk -F= '/Command=/{print $2}')
else
    # otherwise: started with bash. Get the real location.
    echo "running from bash"
    SCRIPT_PATH=$(realpath "$0")
fi
# keep only the script path itself (drop any arguments listed after it)
SCRIPT_PATH=$(echo "$SCRIPT_PATH" | awk '{print $1}')
echo "SCRIPT_PATH=$(realpath "$SCRIPT_PATH")"

source "$(dirname "$SCRIPT_PATH")/submission_script_body.sh"
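
The cancellation step in header1 can also be written as a small filter that is testable outside Slurm. This is a sketch, not part of the original solution: `other_job_ids` is a hypothetical helper name, and the usage line assumes GNU `xargs` (for `-r`) and that limiting `squeue` to your own jobs with `-u "$USER"` is acceptable.

```shell
#!/bin/bash
# Hypothetical helper: read job IDs (one per line) on stdin and print
# every ID except our own. -x matches whole lines, so job 123 does not
# also filter out job 1234; -F treats the ID as a literal string.
other_job_ids() {
    local my_id=$1
    grep -vxF -- "$my_id" || true   # grep exits 1 when nothing is left
}

# Inside a running job this could be used as (not executed here):
# squeue -h -o "%i" -n "$SLURM_JOB_NAME" -u "$USER" \
#     | other_job_ids "$SLURM_JOB_ID" | xargs -r scancel
```

Keeping the filter separate from `squeue` and `scancel` means it can be checked with plain `printf` input before it is trusted to cancel real jobs.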

Example header2

#!/bin/bash
#SBATCH --job-name="your_Unique_Job_Name_Here"
##SBATCH --dependency singleton

#SBATCH --partition=mw128
#SBATCH --cpus-per-task=40
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --nice=500
#SBATCH --mem=64
# the rest of the header is similar to header1
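
To confirm that the two headers place tasks as intended (both tasks on one node with header1, one task per node with header2), one option is to count tasks per host from inside the job. A minimal sketch: `tasks_per_node` is a hypothetical helper name, and the usage line assumes `srun hostname` prints one line per task.

```shell
#!/bin/bash
# Hypothetical helper: read one hostname per line on stdin (one line per
# task) and print "<hostname> <task count>" for each node.
tasks_per_node() {
    sort | uniq -c | awk '{print $2, $1}'
}

# Inside a job (not executed here):
# srun hostname | tasks_per_node
```

With header1 this should list a single host with a count of 2; with header2 it should list two hosts with a count of 1 each.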

Example body

#!/bin/bash

if [[ -n $SLURM_CPUS_PER_TASK ]]; then
    ntomp=$SLURM_CPUS_PER_TASK
else
    case $SLURM_JOB_PARTITION in
        "mw256")
            ntomp=56
        ;;
        "exx96"| "mw128")
            ntomp=48
        ;;
        "tinymem")
            ntomp=40
        ;;
        *)
            # unknown partition: report it and stop the job
            echo "unknown partition: $SLURM_JOB_PARTITION" >&2
            exit 5
        ;;
    esac
fi
# continue your code here
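
The partition-to-threads mapping in the body can also be wrapped in a function, which makes it possible to test the mapping with plain bash before submitting anything. This is a hypothetical refactoring, not the original author's code: the function name `threads_for_partition` is made up, and the `OMP_NUM_THREADS` line assumes `ntomp` is meant as an OpenMP thread count.

```shell
#!/bin/bash
# Hypothetical refactoring of the case statement into a function.
# Arguments: partition name, optional cpus-per-task value.
# Prints the thread count; returns 5 for an unknown partition.
threads_for_partition() {
    local partition=$1 cpus_per_task=${2:-}
    if [[ -n $cpus_per_task ]]; then
        # an explicit --cpus-per-task wins over the per-partition default
        echo "$cpus_per_task"
        return 0
    fi
    case $partition in
        mw256)        echo 56 ;;
        exx96|mw128)  echo 48 ;;
        tinymem)      echo 40 ;;
        *)            return 5 ;;
    esac
}

# In the body (not executed here):
# ntomp=$(threads_for_partition "$SLURM_JOB_PARTITION" "${SLURM_CPUS_PER_TASK:-}") || exit 5
# export OMP_NUM_THREADS=$ntomp   # assumption: ntomp is an OpenMP thread count
```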