Tags: python, bash, gnu, slurm, gnu-parallel

SLURM: Run Same Parallel Script on Multiple Files


What I'm trying to do is to run the same python script on thousands of input csv files. The python script is a single-process script that takes as input a single file and creates an output file with some summaries about the input. I saw the question here, and tried something similar but it did not work (see end).

My SLURM submission script looks like the one below; the call to the python script is nested inside a bash loop. I'm using the loop because I have a large file with the names of the input csv files (one per line) and use it to dispatch the python script. The script is contained in a file, submit_script.sh, and is submitted to the scheduler with sbatch: sbatch submit_script.sh.

I'm also using sem (GNU parallel) to parallelize the python script.

#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks=1024
#SBATCH --ntasks-per-node=128

OUTPUT_DIR='/path/to/output/dir'
INPUT_FILES='/some/path/to/list/of/inputs.txt'
INPUT_DIR='path/to/input/dir'
while IFS= read -r line; do

 FILE_NAME=${INPUT_DIR}/${line}.txt
 sem -j -1 --id ${FILE_NAME} python_script.py --input ${FILE_NAME} > ${OUTPUT_DIR}/${line}.txt

done < <(tr -d '\r' < ${INPUT_FILES})

sem --wait

What I was expecting is that with the above SLURM submission I would get 8 nodes, within which I would be able to create 1024 instances of the python script, sectioned into buckets of 128 per node. Is that right? I'm still getting the hang of SLURM, and I apologize for any misunderstandings on my part.

What is happening is that I do get multiple instances of the script going, but it's only about 70, not the 1024 I was expecting.

I tried dropping sem and replacing it with srun (with no extra parameters), but that did not work: it created multiple copies of the script alright, but they all ran with the same input file and the output went into a single file.

What I'm trying to do is get a SLURM allocation with N nodes and be able to spawn some number M of copies of the python script, so that each copy can tackle a single file. The python script does not use threads and/or processes.

Many thanks in advance, and I really appreciate any help provided!


Solution

  • First, you need to be careful with your usage of --ntasks-per-node. The directives for sbatch can be a little bit confusing at first:

    If you know the number of CPUs per node (for instance, if you are using a homogeneous cluster), you should just use --ntasks and --cpus-per-task instead. That way you avoid oversubscribing a node (which will throttle your performance).
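
    For example (a minimal sketch, with purely illustrative numbers and assuming homogeneous nodes with 128 CPUs each), the header could simply request the total number of single-CPU tasks and let SLURM work out the node count:

    #!/bin/bash
    #SBATCH --ntasks=1024       # one task per concurrent copy of the python script
    #SBATCH --cpus-per-task=1   # the script is single-process, so one CPU per task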

    Second, you want to use srun, not sem. See here: https://slurm.schedmd.com/srun.html

    Third, your script itself is not quite right. When you run sbatch, you say: "please run the contents of this script later". SLURM runs it on the first allocated node.

    When you use srun, you immediately run a job with the available SLURM resources at your disposal. Within an sbatch, that means you get to pick from the resources that were allocated with directives (#SBATCH ...). sbatch sets up some environment variables to make life easier.
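
    For instance, you can echo a few of those variables at the top of the batch script to see what you actually got (the variable names are standard SLURM ones; the echo lines are just for illustration):

    # Inside submit_script.sh, after the #SBATCH directives
    echo "Job ID:          $SLURM_JOB_ID"
    echo "Nodes allocated: $SLURM_NNODES"
    echo "Total tasks:     $SLURM_NTASKS"
    echo "Node list:       $SLURM_JOB_NODELIST"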

    When you want to partition stuff, you need to set that up all by yourself. SLURM won't intelligently "figure it out" -- you'll need to explicitly schedule.

    I believe the final result needs to look something more like this:

    #!/bin/bash
    #SBATCH --nodes=8
    #SBATCH --ntasks-per-node=128
    
    INPUT_DIR='path/to/input/dir'
    OUTPUT_DIR='/path/to/output/dir'
    
    # Read the file names into an array
    INPUT_STEMS_FILE='/some/path/to/list/of/inputs.txt'
    INPUT_STEMS=()
    while IFS= read -r line; do
      INPUT_STEMS+=("$line")
    done < <(tr -d '\r' < "$INPUT_STEMS_FILE")
    
    for j in $(seq 0 $(( ${#INPUT_STEMS[@]} - 1 ))); do
      # Iterate over the indices for each of the N files
    
      # Round-robin allocation to nodes (0, 1, ..., 7, 0, 1, ...)
      NODE_NUMBER=$(($j % $SLURM_NNODES))
    
      # Dynamically generate filename
      INPUT_FILE_NAME="$INPUT_DIR/${INPUT_STEMS[$j]}.txt"
      OUTPUT_FILE_NAME="$OUTPUT_DIR/$j.txt"
    
      # Run a job on 1 task on 1 node, using the round-robin allocation.
      # The jobs run on different nodes, this way
      srun -N1 -n1 -r$NODE_NUMBER python_script.py --input $INPUT_FILE_NAME > $OUTPUT_FILE_NAME &
    done
    
    wait
    

    I am not 100% sure about the following line:

    srun -N1 -n1 -r$NODE_NUMBER python_script.py --input $INPUT_FILE_NAME > $OUTPUT_FILE_NAME
    

    but I believe you need something like this. IIRC, srun with that syntax should keep scheduling tasks onto 1 node, that node being identified by its "index" $NODE_NUMBER relative to the allocation (the -r flag is short for --relative).

    This executes on each node using round-robin allocation. There are other ways of allocating the work.
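
    For example, a contiguous block allocation (again only a sketch) would send the first chunk of files to node 0, the next chunk to node 1, and so on; only the NODE_NUMBER line inside the loop changes:

    # Alternative: contiguous blocks instead of round-robin
    # (file j goes to node floor(j * num_nodes / num_files))
    NODE_NUMBER=$(( j * SLURM_NNODES / ${#INPUT_STEMS[@]} ))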


    As an aside, you might find job arrays useful for your task: https://slurm.schedmd.com/job_array.html

    They are a bit easier to use, but you create many more jobs. When you schedule a job array, you create M jobs instead of 1 job with M tasks.
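
    A minimal sketch of that approach, assuming the same inputs.txt list and, say, at most 1000 files (the array range and the sed-based lookup are illustrative and would need adjusting to your file count):

    #!/bin/bash
    #SBATCH --array=0-999            # one array task per input file

    INPUT_DIR='path/to/input/dir'
    OUTPUT_DIR='/path/to/output/dir'
    INPUT_STEMS_FILE='/some/path/to/list/of/inputs.txt'

    # Pick the line of the list that corresponds to this array task
    # (sed lines are 1-based, array indices here start at 0)
    LINE=$(sed -n "$(( SLURM_ARRAY_TASK_ID + 1 ))p" "$INPUT_STEMS_FILE" | tr -d '\r')

    python_script.py --input "$INPUT_DIR/$LINE.txt" > "$OUTPUT_DIR/$SLURM_ARRAY_TASK_ID.txt"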