python, slurm, tcsh

How to submit parallel (Python) SLURM jobs with arguments in a for loop from tcsh?


I have a Python script. I want to run 4 slightly different versions of it in a for loop (they differ in only one variable). When all of those finish, I want to run the same scripts again with a second variable changed, because each subsequent loop iteration uses the HDF5 outputs generated by the previous one.

My underlying problem is that I am not very knowledgeable about tcsh or bash. I need to submit from tcsh because the Python script relies on environment variables and loaded modules that are set up in my .cshrc file.

The other problem is that until now I submitted the jobs to SLURM one by one, with the batch script containing a line like ./python_script1.py > output1.out. (I'm not actually interested in the output1.out file, but it's nice to have.) I found many similar questions, but all of the solutions use the srun command inside the for loop.

I spent a couple of hours on this and scrambled together a very basic bash script, but when I tried to run it under tcsh it failed with a bunch of "command not found" errors. I understand the two shells' syntax is not quite the same. The relevant lines:

#!/bin/tcsh
#SBATCH --job-name=looptest    ## Name of the job
#SBATCH --output=looptest.out  ## Output file
#SBATCH --get-user-env
OUTPUT = file

for i in `seq 1 3`; do
  for j in `seq 1 3`; do
    srun \
      -N1 \
      --cpus-per-task=48 \
      ./slurmtest.py $i $j > "$OUTPUT_$i_$j.out" &
  done
done

wait
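
For reference, my rough (untested) understanding is that the same loop written in native tcsh syntax would use set for assignment and foreach/end blocks instead of the bash for/do/done, roughly like this:

#!/bin/tcsh
#SBATCH --job-name=looptest
#SBATCH --output=looptest.out
#SBATCH --get-user-env
# tcsh needs 'set' for assignment; a bare OUTPUT=file is treated as a command in tcsh
set OUTPUT = file

foreach i (`seq 1 3`)
  foreach j (`seq 1 3`)
    # braces keep the variable names from swallowing the underscores
    srun -N1 --cpus-per-task=48 ./slurmtest.py $i $j > "${OUTPUT}_${i}_${j}.out" &
  end
end

wait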

I understand that I can read the arguments given after the script name from Python with sys.argv[i].

For comparison, the relevant parts of the SLURM script that I used to run until I started experimenting look like this:

#!/bin/tcsh
#SBATCH --job-name=job1_1
#SBATCH --get-user-env
./script.py > output1_1.out

Then, before every run, I would manually change the two variables inside the Python script and edit the job name and output name. The relevant parts of my ideal Python script would look like:

import os
import sys

# First argument: which iteration's outputs to continue from (int, so that +1 works below)
continue_from_number = int(sys.argv[1])
# Second argument: which folder this copy of the script works in
folder_name = 'folder_' + sys.argv[2] + '/'

for filename in os.listdir(folder_name):
    if filename.startswith("output_" + str(continue_from_number) + "_"):
        oname = filename

# Do some things with 'oname', then
# save an output file with str(continue_from_number + 1) in its name

Ideally, I would like to run several of these scripts in parallel (only argument #2 differing between them) in a for loop, in such a way that each iteration waits for the jobs of the previous iteration to finish, otherwise they have no input to work with. Do I have to use the --dependency=afterok:jobID1:jobID2:jobID3:jobID4 syntax for that?
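
If that is the way to go, my rough, untested picture is that I would submit each round from tcsh, capture the job IDs, and make the next round depend on them, something like this (step_A.sh and step_B.sh are just placeholder batch scripts, not my real ones):

#!/bin/tcsh
# submit the first round; --parsable makes sbatch print only the job ID
set jid1 = `sbatch --parsable step_A.sh 1`
set jid2 = `sbatch --parsable step_A.sh 2`

# the second round starts only after both first-round jobs have finished successfully
sbatch --dependency=afterok:${jid1}:${jid2} step_B.sh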

I could write the loop in Python but that's not an option since the runtime of my scripts is close to the job time limit on the cluster I'm running them on.

If I have to use srun, that is fine, but I would like to stay in tcsh if at all possible.


Solution

  • I eventually figured it out. I used a bash script after all, submitted from tcsh with sbatch slurm_script.sh; the tcsh environment variables are passed through to the script this way (thanks @yut23). slurm_script.sh looks like this:

    #!/bin/bash
    #SBATCH --job-name=cont_loop
    #SBATCH --output=cont_loop.out
    #SBATCH --time=48:00:00
    #SBATCH --nodes=1
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=16G      
    
    for i in `seq 1 4`; do
      for j in 10 11; do
        # launch both j-variants in the background so they run in parallel
        ./cont_loop.py $i $j > "${j}_cont_loop_${i}.out" &
      done
      # wait for both to finish before starting the next i, which needs their output
      wait
    done
    

    It's important to set the --ntasks flag to the number of jobs you want to run simultaneously; in this example it's 2, because the inner loop over j launches two script instances in each outer iteration.
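
    A variant I did not test, but which the srun-based examples I found suggest, launches each instance as its own job step so that SLURM itself hands out the allocation's CPUs per task; with the same #SBATCH header, the loop would look roughly like this:

    for i in `seq 1 4`; do
      for j in 10 11; do
        # each instance becomes a job step; --ntasks=1 --exclusive asks SLURM
        # to give the step its own share of the allocation's CPUs
        srun --ntasks=1 --exclusive ./cont_loop.py $i $j > "${j}_cont_loop_${i}.out" &
      done
      wait
    done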

    (Also, when testing without the outer loop, I tried setting i = 1, which doesn't work in bash because of the spaces around the equals sign; use i=1 instead.)