I'm trying to use GNU Parallel together with SLURM to run several instances of the same script with different input parameters. To do that, I allocate 3 nodes via SLURM and then start several jobs via GNU Parallel; each job runs the Python script and uses just one CPU core.
Also, the scripts are quite memory-heavy, so I need to be able to restart a job if it fails because of insufficient RAM. For that, I resorted to using the --retry-failed and --retries flags.
My problem is that all jobs except the ones on the last node finish with this output:
/bin/bash: 0: command not found
/bin/bash: 1: command not found
/bin/bash: 2: command not found
/bin/bash: 3: command not found
/bin/bash: 4: command not found
/bin/bash: 5: command not found
/bin/bash: 6: command not found
Obviously, my input is somehow misinterpreted, but I have no idea how, as I'm not an experienced user of GNU Parallel.
My jobscript looks like this:
#!/usr/bin/env bash
#SBATCH --job-name job-name
#SBATCH --cpus-per-task=1
#SBATCH --array=0-2
[ -z "$PARALLEL_SEQ" ] && { exec parallel --retry-failed --retries 5 --joblog joblog.txt -a numtasks $0 ; }
TASKS_PER_NODE=`cat numtasks | wc -l`
IDX=$(( ${TASKS_PER_NODE} * ${SLURM_ARRAY_TASK_ID} + ${PARALLEL_SEQ} - 1 ))
mkdir "res-${IDX}"
cd "res-${IDX}"
source ${HOME}/.bashrc
conda activate myenv
cp ../myscript.py .
python3 ./myscript.py ${IDX}
It is unclear to me what numtasks contains. Is it just a sequence?
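If it is, the index arithmetic from the script can be checked in isolation. The sketch below assumes numtasks holds the sequence 0 to 6, one value per line (a guess based on the error messages), creates that file in a temporary directory, and passes the array task ID and PARALLEL_SEQ by hand through a hypothetical helper compute_idx; in a real run SLURM and GNU Parallel set those values.

```bash
#!/usr/bin/env bash
# Sanity check of the IDX formula, outside of SLURM and GNU Parallel.
# Assumption: numtasks holds the sequence 0..6, one value per line.
workdir=$(mktemp -d)
seq 0 6 > "$workdir/numtasks"

# Number of lines in numtasks, i.e. jobs per array task.
TASKS_PER_NODE=$(wc -l < "$workdir/numtasks")

# compute_idx is a hypothetical helper, not part of the original script.
compute_idx() {
    # $1 = array task ID (0-based), $2 = PARALLEL_SEQ (1-based)
    echo $(( TASKS_PER_NODE * $1 + $2 - 1 ))
}

first=$(compute_idx 0 1)   # first job of array task 0
last=$(compute_idx 2 7)    # last job of array task 2
echo "indices run from ${first} to ${last}"

rm -r "$workdir"
```

Under that assumption the three array tasks cover the indices 0 to 20 without gaps or overlaps.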
I would use a bash function. To me that is much more readable than conditionally exec'ing $0.
#!/usr/bin/env bash
#SBATCH --job-name job-name
#SBATCH --cpus-per-task=1
#SBATCH --array=0-2
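doit() {
    # Number of lines in numtasks, i.e. jobs per array task
    TASKS_PER_NODE=$(wc -l < numtasks)
    # Global index: PARALLEL_SEQ starts at 1, hence the -1
    IDX=$(( TASKS_PER_NODE * SLURM_ARRAY_TASK_ID + PARALLEL_SEQ - 1 ))
    # -p so that a retried job does not fail on an existing directory
    mkdir -p "res-${IDX}"
    cd "res-${IDX}" || return 1
    source "${HOME}/.bashrc"
    conda activate myenv
    cp ../myscript.py .
    python3 ./myscript.py "${IDX}"
}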
export -f doit
export SLURM_ARRAY_TASK_ID
parallel --retry-failed --retries 5 --joblog joblog.txt -a numtasks doit
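The export -f doit line matters because GNU Parallel runs each job in a new shell, and a new bash only sees functions that were exported. A minimal sketch of that effect without GNU Parallel, using a child bash and a made-up function greet (this assumes the shell in use is bash):

```bash
#!/usr/bin/env bash
# greet is a made-up function, used only for illustration.
greet() {
    echo "job $1"
}

# Not exported yet: the child shell does not know the function.
before=$(bash -c 'greet 3' 2>/dev/null || echo "not found")

export -f greet

# Exported: the child shell inherits the definition.
after=$(bash -c 'greet 3')

echo "before export: ${before}"
echo "after export:  ${after}"
```

The same applies to doit: without the export, the shells started by GNU Parallel would report that the command was not found.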
You might also want to check out the options --memfree/--memsuspend.