I'm trying to use GNU parallel to run a set of experiments using MATLAB on our supercomputer, which uses SLURM. I have a text file containing combinations of 4 parameters that are read in and passed to a MATLAB function. That text file is called gnu_parameters.txt and has 4 columns separated by a single space:
fs_method data_name use_vars 1
fs_method1 data_name use_vars 1
fs_method3 data_name use_vars 1
where the parameters in columns 1-3 should be read in as strings, and parameter 4 is a number.
I want to run each combination of parameters in parallel to speed up the process. My SLURM script is below, but when I tell GNU parallel where to put each parameter using the notation {1} {2} {3} {4}, I get an error that MATLAB doesn't recognize the variable fs_method. Looking at the log tells me that the error means fs_method isn't read as a string by MATLAB. To fix that, I tried adding single quotes in the SLURM script like so:
#!/bin/bash -l
#SBATCH --time=4-00:00:00
#SBATCH --ntasks=1
#SBATCH --mem=1200g
#SBATCH --tmp=500g
#SBATCH --cpus-per-task=115
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=myemail
#SBATCH -p groupPartition
cd $WRK_DIR
module load matlab
module load parallel
export JOBS_PER_NODE=$(( $SLURM_CPUS_ON_NODE / $SLURM_CPUS_PER_TASK ))
echo $JOBS_PER_NODE
cat gnu_parameters.txt | parallel --jobs $JOBS_PER_NODE --joblog tasklog.log --progress --colsep ' ' 'matlab -nodisplay -r "run_holdout_parallel('{1}', '{2}', '{3}', {4});exit" '
Below are excerpts from the log file, the error file, and the output file.
Log
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 : 1719498346.300 14.911 0 298 0 0 matlab -nodisplay -r "run_holdout_parallel(fs_method, data_name, use_vars, 1);exit"
2 : 1719498361.751 14.387 0 298 0 0 matlab -nodisplay -r "run_holdout_parallel(fs_method1, data_name, use_vars, 1);exit"
3 : 1719498376.666 14.385 0 298 0 0 matlab -nodisplay -r "run_holdout_parallel(fs_method3, data_name, use_vars, 1);exit"
Error File
local:1/0/100%/0.0s sh: /dev/tty: No such device or address
local:1/0/100%/0.0s sh: /dev/tty: No such device or address
local:1/0/100%/0.0s {Unrecognized function or variable 'fs_method'.
}
local:0/1/100%/15.0s
Output file
< M A T L A B (R) >
Copyright 1984-2023 The MathWorks, Inc.
R2023b Update 7 (23.2.0.2515942) 64-bit (glnxa64)
January 30, 2024
To get started, type doc.
For product information, visit www.mathworks.com.
< M A T L A B (R) >
Copyright 1984-2023 The MathWorks, Inc.
R2023b Update 7 (23.2.0.2515942) 64-bit (glnxa64)
January 30, 2024
To get started, type doc.
For product information, visit www.mathworks.com.
But that returns the same error. How can I get these parameters passed as strings to MATLAB? Is there a better way to run these experiments in parallel than the method I'm using?
I hate quoting. man parallel says:
Conclusion: If this is confusing consider avoiding having to deal with quoting by writing a small script or a function (remember to export -f the function) and have GNU parallel call that.
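To see why the single quotes in the question never reach MATLAB, here is a minimal sketch that needs neither MATLAB nor GNU parallel. It uses printf as a stand-in for parallel and a shortened, hypothetical one-argument call f(...) with the same quoting shape as the command in the question. The shell joins adjacent quoted and unquoted pieces into one word, so each inner ' only ends or restarts the outer quoting and is removed before the command is passed on.

```shell
#!/bin/bash
# Same quoting shape as in the question: the outer '...' ends at the first
# inner ', so {1} is plain unquoted text and the inner quotes are dropped.
lost=$(printf '%s' 'matlab -r "f('{1}');exit"')
echo "$lost"    # the quotes around {1} are gone

# Escaped double quotes inside a double-quoted string do survive, which is
# what the function approach relies on.
x=fs_method
kept=$(printf '%s' "matlab -r \"f(\"$x\");exit\"")
echo "$kept"    # the argument is still wrapped in double quotes
```

This matches the joblog in the question, where the recorded command contains fs_method without any quotes.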
So in your case make a function:
run_holdout() {
    echo "This should run_holdout_parallel on $1 $2 $3 $4"
    matlab -nodisplay -r "run_holdout_parallel(\"$1\", \"$2\", \"$3\", $4);exit"
}
When you can run that on the command line:
$ run_holdout fs_method3 data_name use_vars 1
and that works, then parallelize with:
$ export -f run_holdout
$ ... | parallel --colsep ' ' run_holdout {1} {2} {3} {4}
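Put together inside the batch script, this might look like the sketch below. It reuses gnu_parameters.txt, JOBS_PER_NODE, and the --jobs, --joblog and --colsep options from the question. The helper name build_matlab_cmd is hypothetical, added only so the MATLAB statement can be printed and checked without starting MATLAB. The guard around the parallel call is likewise an assumption, so the file does nothing harmful on a machine without the tools or the parameter file.

```shell
#!/bin/bash
# Hypothetical helper: build the MATLAB statement from the 4 columns.
# Columns 1-3 are wrapped in double quotes, column 4 is passed as a number.
build_matlab_cmd() {
    printf 'run_holdout_parallel("%s", "%s", "%s", %s);exit' "$1" "$2" "$3" "$4"
}

# The function GNU parallel calls once per input line.
run_holdout() {
    matlab -nodisplay -r "$(build_matlab_cmd "$@")"
}

# Launch only when both tools and the parameter file are present (assumption).
if command -v parallel >/dev/null 2>&1 && command -v matlab >/dev/null 2>&1 \
        && [ -r gnu_parameters.txt ]; then
    # Exported functions are visible to the shells that parallel starts.
    export -f build_matlab_cmd run_holdout
    # --colsep ' ' splits each line so that {1}..{4} refer to the columns.
    parallel --jobs "${JOBS_PER_NODE:-1}" --joblog tasklog.log --colsep ' ' \
        run_holdout {1} {2} {3} {4} < gnu_parameters.txt
fi
```

Running build_matlab_cmd fs_method3 data_name use_vars 1 on its own prints the statement MATLAB would receive, which is a quick way to check the quoting before submitting the job.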