slurm, qos

Slurm: Prioritizing jobs that have fewer tasks


I'm very new to Slurm. I checked out the QoS documentation on this, but I need advice from experienced Slurm users.

To explain the title, let's say someone submitted a long job with 300 subtasks to our Slurm service. After that, someone else submitted a job with 3 subtasks. Since our default configuration uses the FIFO model, the 3-subtask job will never run until the 300-subtask job has finished. This doesn't seem fair for our work plan, so we need the 3-subtask job to run before the 300-subtask job even though it was queued later.

Since I'm new to Slurm, I don't know the terminology well, so I had no idea what to google. Thank you.

I was also looking at this documentation, but I'm not sure whether these are the right pages to be reading:

https://slurm.schedmd.com/qos.html

https://slurm.schedmd.com/resource_limits.html

The Job QoS limits seem very similar to my topic.


Solution

  • From what I understood of the question, you want the number of tasks/job steps (say, srun calls) to be taken into consideration for priority and scheduling.

    If my understanding is correct, Slurm doesn't take this characteristic (the number of srun calls) into account for priority or scheduling. This is because the behaviour of a task cannot be estimated from the task syntax. Using historical data you can indeed estimate it, but only at job granularity, not task granularity; Slurm uses this for advanced scheduling. (Except for the application developer, nobody can estimate the time taken by two tasks simply by looking at the task definitions.)

    For example, consider your scenario with one job having 300 tasks (300 srun call or so) and another with 3 tasks.

    Case 1:

    Assume that each job got 1 node (say, with 24 cores). Your first job finished 300 tasks using 24 cores in 1 hour, and the second job finished its 3 tasks in the same amount of time using the same number of cores. What Slurm sees here (if accounting is enabled) is that both of your applications used the same amount of resources.

    Case 2:

    Assume that each job got 1 node (say, with 24 cores). Your first job finished 300 tasks using 24 cores in 1 hour, and the second job finished its 3 tasks in 3 minutes. Here again, what Slurm sees is that both of your applications fully utilised their resources, at per-job granularity.

    How can Slurm tell the difference between case 1 and case 2? It cannot, because both jobs are utilising all of their resources (the same number of nodes), and Slurm's scheduling does not consider the number of tasks/job steps for either priority or scheduling.

    In other words, it doesn't make sense to use the number of tasks as a metric, because a job with 3 large tasks can take more time than a job with 300 small tasks, and vice versa. Slurm cannot distinguish large from small based on the number of tasks because that is application dependent.

    So, coming back to your problem, we can think of it in terms of time and resources. You could create partitions such that, say, a small partition runs jobs shorter than 30 minutes while a large partition runs jobs longer than an hour, and give higher priority to jobs in the small partition during scheduling (see the partition sketch below). This is one way to look at the issue, and most centers use this concept.
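
    A minimal sketch of what such a setup could look like in slurm.conf. The partition names (short, long), the node range and the exact limits are only assumptions for illustration; adjust them to your cluster:

        # Hypothetical example: two partitions over the same nodes,
        # distinguished by their maximum walltime.
        # PriorityTier makes the scheduler consider "short" jobs first;
        # PriorityJobFactor feeds the partition factor of multifactor priority.
        PartitionName=short Nodes=node[01-10] MaxTime=00:30:00   PriorityTier=10 PriorityJobFactor=10 State=UP
        PartitionName=long  Nodes=node[01-10] MaxTime=7-00:00:00 PriorityTier=1  PriorityJobFactor=1  Default=YES State=UP

    After editing slurm.conf you would run scontrol reconfigure (or restart slurmctld) for the change to take effect.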

    If you go for multifactor priority, Slurm will consider age (how long the job has been waiting in the queue), association fair-share, etc. to schedule the job, but it still won't consider the number of tasks even in this case. You can set the priority flags to take time into account (using PriorityFavorSmall and SMALL_RELATIVE_TO_TIME in multifactor priority), but this is calculated with respect to the job size (number of nodes). A configuration sketch follows below.
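
    As a rough sketch, the relevant slurm.conf lines for the multifactor plugin could look like this; the weight values are arbitrary examples chosen for illustration, not recommendations:

        # Hypothetical multifactor priority configuration.
        PriorityType=priority/multifactor
        PriorityFavorSmall=YES                 # smaller jobs get a higher job-size factor
        PriorityFlags=SMALL_RELATIVE_TO_TIME   # size factor becomes job size divided by time limit
        PriorityWeightAge=1000                 # rewards time spent waiting in the queue
        PriorityWeightJobSize=2000
        PriorityWeightPartition=5000           # weight given to the partition priority factor
        PriorityWeightFairshare=10000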

    In short: you should create different partitions based on job time requirements, and higher priority should be given to the partition with the smaller time limit. In the job script, the smaller jobs can then request the small partition and get a faster allocation; an example job script follows.
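
    For the 3-task job, a submission script along these lines would target the higher-priority partition. The partition name, time limit and executable are placeholders, not fixed names:

        #!/bin/bash
        #SBATCH --partition=short      # assumed name of the higher-priority partition
        #SBATCH --time=00:20:00        # must fit within the partition's MaxTime
        #SBATCH --nodes=1
        #SBATCH --ntasks=3

        srun ./my_task                 # placeholder for the real application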

    Assumption on my side: both jobs (300 tasks and 3 tasks) use the same number of nodes and CPUs. If the larger job uses more CPUs per node than the smaller job, there are other approaches (fair-share etc.); a sketch follows.
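
    If you do go the fair-share route, the usage-based part is set up through the accounting database. A hedged sketch, assuming slurmdbd accounting is already running and PriorityWeightFairshare is non-zero; the account and user names here are placeholders:

        # Give each group an account with a fair-share value; heavy recent usage
        # then lowers that group's priority relative to its share.
        sacctmgr add account research_a Description="Research group A" Organization=mylab
        sacctmgr add user alice Account=research_a
        sacctmgr modify account research_a set fairshare=50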