I've huge size(few million) job contain list and wants to run java written tool to perform the features comparison. This tool completes the calculation in
real 0m0.179s
user 0m0.005s
sys 0m0.000s sec
Running 5 nodes(each have 72 cpus) with pbs torque scheduler in the GNU parallel, tool runs fine and produces the results but as I set 72 jobs per node, it should run 72 x 5 jobs at a time but I can see only it runs 25-35 jobs! Checking of cpu utilization on each node also shows low utilization.
I desire to run 72 X 5 jobs or more at a time and produce the results by utilizing all the available source (72 X 5 cpus).
As I mentioned have ~200 millions of job to run, I desire to complete it faster(1-2 hours) by using/increasing the number of nodes/cpus.
Current code, input and job state:
example.lst (it has ~300 million lines)
ZNF512-xxxx_2_N-THRA-xxtx_2_N
ZNF512-xxxx_2_N-THRA-xxtx_3_N
ZNF512-xxxx_2_N-THRA-xxtx_4_N
.......
cat job_script.sh
#!/bin/bash
#PBS -l nodes=5:ppn=72
#PBS -N job01
#PBS -j oe
#work dir
export WDIR=/shared/data/work_dir
cd $WDIR;
# use available 72 cpu in each node
export JOBS_PER_NODE=72
#gnu parallel command
parallelrun="parallel -j $JOBS_PER_NODE --slf $PBS_NODEFILE --wd $WDIR --joblog process.log --resume"
$parallelrun -a example.lst sh run_script.sh {}
cat run_script.sh
#!/bin/bash
# parallel command options
i=$1
data=/shared/TF_data
# create tmp dir and work in
TMP_DIR=/shared/data/work_dir/$i
mkdir -p $TMP_DIR
cd $TMP_DIR/
# get file name
mk=$(echo "$i" | cut -d- -f1-2)
nk=$(echo "$i" | cut -d- -f3-6)
#run a tool to compare the features of pair files
/shared/software/tool_v2.1/tool -s1 $data/inf_tf/$mk -s1cf $data/features/$mk-cf -s1ss $data/features/$mk-ss -s2 $data/inf_tf/$nk.pdb -s2cf $data/features/$nk-cf.pdb -s2ss $data/features/$nk-ss.pdb > $data/$i.out
# move output files
mv matrix.txt $data/glosa_tf/matrix/$mk"_"$nk.txt
mv ali_struct.pdb $data/glosa_tf/aligned/$nk"_"$mk.pdb
# move back and remove tmp dir
cd $TMP_DIR/../
rm -rf $TMP_DIR
exit 0
PBS submission
qsub job_script.sh
Login to one of the node : ssh ip-172-31-9-208
top - 09:28:03 up 15 min, 1 user, load average: 14.77, 13.44, 8.08
Tasks: 928 total, 1 running, 434 sleeping, 0 stopped, 166 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 98.4%id, 1.4%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 193694612k total, 1811200k used, 191883412k free, 94680k buffers
Swap: 0k total, 0k used, 0k free, 707960k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15348 ec2-user 20 0 16028 2820 1820 R 0.3 0.0 0:00.10 top
15621 ec2-user 20 0 169m 7584 6684 S 0.3 0.0 0:00.01 ssh
15625 ec2-user 20 0 171m 7472 6552 S 0.3 0.0 0:00.01 ssh
15626 ec2-user 20 0 126m 3924 3492 S 0.3 0.0 0:00.01 perl
.....
All of the nodes top shows the similar state and produces the results by running only ~26 at a time!
I've aws-parallelcluster contains 5 nodes(each have 72 cpus) with torque scheduler and GNU Parallel 2018, Mar 2018
Update
By introducing the new function that takes input on stdin and running the script in parallel works great and utilizes all the CPU in local machine.
However, when its runs over remote machines it produces a
parallel: Error: test.lst is neither a file nor a block device
MCVE:
A simple code that echoing list gives the same error while running it in remote machines but works great in local machine:
cat test.lst # contains list
DNMT3L-5yx2B_1_N-DNMT3L-5yx2B_2_N
DNMT3L-5yx2B_1_N-DNMT3L-6brrC_3_N
DNMT3L-5yx2B_1_N-DNMT3L-6f57B_2_N
DNMT3L-5yx2B_1_N-DNMT3L-6f57C_2_N
DNMT3L-5yx2B_1_N-DUX4-6e8cA_4_N
DNMT3L-5yx2B_1_N-E2F8-4yo2A_3_P
DNMT3L-5yx2B_1_N-E2F8-4yo2A_6_N
DNMT3L-5yx2B_1_N-EBF3-3n50A_2_N
DNMT3L-5yx2B_1_N-ELK4-1k6oA_3_N
DNMT3L-5yx2B_1_N-EPAS1-1p97A_1_N
cat test_job.sh # GNU parallel submission script
#!/bin/bash
#PBS -l nodes=1:ppn=72
#PBS -N test
#PBS -k oe
# introduce new function and Run from ~/
dowork() {
parallel sh test_work.sh {}
}
export -f dowork
parallel -a test.lst --env dowork --pipepart --slf $PBS_NODEFILE --block -10 dowork
cat test_work.sh # run/work script
#!/bin/bash
i=$1
data=pwd
#create temporary folder in current dir
TMP_DIR=$data/$i
mkdir -p $TMP_DIR
cd $TMP_DIR/
# split list
mk=$(echo "$i" | cut -d- -f1-2)
nk=$(echo "$i" | cut -d- -f3-6)
# echo list and save in echo_test.out
echo $mk, $nk >> $data/echo_test.out
cd $TMP_DIR/../
rm -rf $TMP_DIR
From your timing:
real 0m0.179s
user 0m0.005s
sys 0m0.000s sec
it seems the tool uses very little CPU power. When GNU Parallel runs local jobs it has an overhead of 10 ms CPU time per job. Your jobs use 179 ms time, and 5 ms CPU time. So GNU Parallel will be using quite a bit of the time spent.
The overhead is much worse when running jobs remotely. Here we are talking 10 ms + running an ssh command. This can easily be in the order of 100 ms.
So how can we minimize the number of ssh commands and how can spread the overhead over multiple cores?
First let us make a function that can take input on stdin and run the script - one job per CPU thread in parallel:
dowork() {
[...set variables here. that becomes particularly important we when run remotely...]
parallel sh run_script.sh {}
}
export -f dowork
Test that this actually works by running:
head -n 1000 example.lst | dowork
Then let us look at running jobs locally. This can be done similar to described here: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround
parallel -a example.lst --pipepart --block -10 dowork
This will split example.lst
into 10 blocks per CPU thread. So on a machine with 72 CPU threads this will make 720 blocks. It will the start 72 doworks
and when one is done it will get another of the 720 blocks. The reason I choose 10 instead of 1 is if one of the jobs "get stuck" for a while, then you are unlikely to notice this.
This should make sure 100% of the CPUs on the local machine is busy.
If that works, we need to distribute this work to remote machines:
parallel -j1 -a example.lst --env dowork --pipepart --slf $PBS_NODEFILE --block -10 dowork
This should in total start 10 ssh
per CPU thread (i.e. 5*72*10) - namely one for each block. With 1 running per server listed in $PBS_NODEFILE
in parallel.
Unfortunately this means that --joblog
and --resume
will not work. There is currently no way to make that work, but if it is valuable to you contact me via parallel@gnu.org.