amazon-ec2, parallel-processing, openmpi, gnu-parallel, pbs

Run millions of jobs from a list in PBS with GNU Parallel


I have a huge job list (a few million entries) and want to run a Java-written tool to perform a feature comparison for each entry. The tool completes one calculation in:

real    0m0.179s
user    0m0.005s
sys     0m0.000s

Running on 5 nodes (each with 72 CPUs) under the PBS Torque scheduler with GNU Parallel, the tool runs fine and produces results. But since I set 72 jobs per node, it should run 72 x 5 jobs at a time, yet I can see only 25-35 jobs running! Checking CPU utilization on each node also shows it is low.

I want to run 72 x 5 jobs or more at a time and produce the results by utilizing all the available resources (72 x 5 CPUs).

As mentioned, I have ~200 million jobs to run, and I want to complete them faster (in 1-2 hours) by using/increasing the number of nodes/CPUs.

Current code, input and job state:

example.lst (it has ~300 million lines)

ZNF512-xxxx_2_N-THRA-xxtx_2_N
ZNF512-xxxx_2_N-THRA-xxtx_3_N
ZNF512-xxxx_2_N-THRA-xxtx_4_N
.......

cat job_script.sh

#!/bin/bash
#PBS -l nodes=5:ppn=72
#PBS -N job01
#PBS -j oe

#work dir
export WDIR=/shared/data/work_dir

cd $WDIR;  

# use available 72 cpu in each node    
export JOBS_PER_NODE=72

#gnu parallel command
parallelrun="parallel -j $JOBS_PER_NODE --slf $PBS_NODEFILE --wd $WDIR --joblog process.log --resume"

$parallelrun -a example.lst sh run_script.sh {}

cat run_script.sh

#!/bin/bash 
# parallel command options
i=$1
data=/shared/TF_data

# create tmp dir and work in
TMP_DIR=/shared/data/work_dir/$i
mkdir -p $TMP_DIR
cd $TMP_DIR/

# get file name
mk=$(echo "$i" | cut -d- -f1-2) 
nk=$(echo "$i" | cut -d- -f3-6) 

#run a tool to compare the features of pair files
/shared/software/tool_v2.1/tool -s1 $data/inf_tf/$mk -s1cf $data/features/$mk-cf -s1ss $data/features/$mk-ss -s2 $data/inf_tf/$nk.pdb -s2cf $data/features/$nk-cf.pdb -s2ss $data/features/$nk-ss.pdb > $data/$i.out

# move output files    
mv matrix.txt $data/glosa_tf/matrix/$mk"_"$nk.txt
mv ali_struct.pdb $data/glosa_tf/aligned/$nk"_"$mk.pdb
# move back and remove tmp dir
cd $TMP_DIR/../
rm -rf $TMP_DIR
exit 0

PBS submission

qsub job_script.sh

Logging in to one of the nodes: ssh ip-172-31-9-208

top - 09:28:03 up 15 min,  1 user,  load average: 14.77, 13.44, 8.08
Tasks: 928 total,   1 running, 434 sleeping,   0 stopped, 166 zombie
Cpu(s):  0.1%us,  0.1%sy,  0.0%ni, 98.4%id,  1.4%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  193694612k total,  1811200k used, 191883412k free,    94680k buffers
Swap:        0k total,        0k used,        0k free,   707960k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                       
15348 ec2-user  20   0 16028 2820 1820 R  0.3  0.0   0:00.10 top                            
15621 ec2-user  20   0  169m 7584 6684 S  0.3  0.0   0:00.01 ssh                            
15625 ec2-user  20   0  171m 7472 6552 S  0.3  0.0   0:00.01 ssh                            
15626 ec2-user  20   0  126m 3924 3492 S  0.3  0.0   0:00.01 perl                                                     
.....

top on all of the nodes shows a similar state, and the results are produced with only ~26 jobs running at a time!

I'm using aws-parallelcluster with 5 nodes (each with 72 CPUs), the Torque scheduler, and GNU Parallel 2018 (March 2018).


Update

Introducing a new function that takes input on stdin and runs the script in parallel works great and utilizes all the CPUs on the local machine.

However, when it runs over remote machines it produces:

parallel: Error: test.lst is neither a file nor a block device

MCVE:

A simple script that just echoes the list gives the same error when running on remote machines, but it works great on the local machine:

cat test.lst # contains list

DNMT3L-5yx2B_1_N-DNMT3L-5yx2B_2_N
DNMT3L-5yx2B_1_N-DNMT3L-6brrC_3_N
DNMT3L-5yx2B_1_N-DNMT3L-6f57B_2_N
DNMT3L-5yx2B_1_N-DNMT3L-6f57C_2_N
DNMT3L-5yx2B_1_N-DUX4-6e8cA_4_N
DNMT3L-5yx2B_1_N-E2F8-4yo2A_3_P
DNMT3L-5yx2B_1_N-E2F8-4yo2A_6_N
DNMT3L-5yx2B_1_N-EBF3-3n50A_2_N
DNMT3L-5yx2B_1_N-ELK4-1k6oA_3_N
DNMT3L-5yx2B_1_N-EPAS1-1p97A_1_N

cat test_job.sh # GNU parallel submission script

#!/bin/bash
#PBS -l nodes=1:ppn=72
#PBS -N test
#PBS -k oe

# introduce new function and Run from ~/
dowork() {
  parallel sh test_work.sh {}
}
export -f dowork

parallel -a test.lst --env dowork --pipepart --slf $PBS_NODEFILE --block -10 dowork

cat test_work.sh # run/work script

#!/bin/bash 
i=$1
data=$(pwd)
#create temporary folder in current dir
TMP_DIR=$data/$i
mkdir -p $TMP_DIR
cd $TMP_DIR/
# split list
mk=$(echo "$i" | cut -d- -f1-2) 
nk=$(echo "$i" | cut -d- -f3-6) 
# echo list and save in echo_test.out
echo $mk, $nk >> $data/echo_test.out
cd $TMP_DIR/../
rm -rf $TMP_DIR

Solution

  • From your timing:

    real    0m0.179s
    user    0m0.005s
    sys     0m0.000s
    

    it seems the tool uses very little CPU power. When GNU Parallel runs local jobs it has an overhead of 10 ms CPU time per job. Your jobs use 179 ms of wall-clock time and 5 ms of CPU time, so GNU Parallel's overhead will account for a sizeable fraction of the time spent.

    The overhead is much worse when running jobs remotely. Here we are talking 10 ms + running an ssh command. This can easily be in the order of 100 ms.
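
    As a rough back-of-envelope check (my own arithmetic, using the ~200 million jobs from the question, 5*72 = 360 CPU threads, and the overhead figures above), even if that per-job overhead were spread perfectly across all threads it would already eat into a 1-2 hour target:

    # Assumed figures: 200e6 jobs, 360 CPU threads, 10 ms local / 100 ms remote overhead per job
    jobs=200000000
    threads=$((5 * 72))
    echo "local  overhead: $((jobs * 10  / 1000 / threads)) s"   # ~5,500 s  (~1.5 h)
    echo "remote overhead: $((jobs * 100 / 1000 / threads)) s"   # ~55,500 s (~15 h)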

    So how can we minimize the number of ssh commands, and how can we spread the overhead over multiple cores?

    First let us make a function that can take input on stdin and run the script - one job per CPU thread in parallel:

    dowork() {
      [...set variables here; that becomes particularly important when we run remotely...]
      parallel sh run_script.sh {}
    }
    export -f dowork
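
    As an illustration only (the paths are copied from job_script.sh above, not prescribed by this answer), the placeholder might be filled in like this so the variables are also defined when the function runs on a remote node:

    dowork() {
      # Assumed values taken from the question's job_script.sh; adjust to your setup.
      export WDIR=/shared/data/work_dir
      cd $WDIR
      parallel sh run_script.sh {}
    }
    export -f dowork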
    

    Test that this actually works by running:

    head -n 1000 example.lst | dowork
    

    Then let us look at running jobs locally. This can be done similarly to what is described here: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround

    parallel -a example.lst --pipepart --block -10 dowork
    

    This will split example.lst into 10 blocks per CPU thread, so on a machine with 72 CPU threads this makes 720 blocks. It will then start 72 doworks, and when one is done it will get another of the 720 blocks. The reason I choose 10 blocks per thread instead of 1 is that if one of the jobs "gets stuck" for a while, you are unlikely to notice it.

    This should make sure 100% of the CPUs on the local machine are busy.

    If that works, we need to distribute this work to remote machines:

    parallel -j1 -a example.lst --env dowork --pipepart --slf $PBS_NODEFILE --block -10 dowork
    

    This should start a total of 10 ssh commands per CPU thread (i.e. 5*72*10 = 3600), namely one for each block, with one running at a time on each server listed in $PBS_NODEFILE.
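
    To spot-check that the remote nodes really are being kept busy, something along these lines can be run from the head node (my suggestion, not part of the original recipe); --nonall makes GNU Parallel run the command once on each host in the node file, and a fully busy node should show a load average near 72:

    parallel --nonall --slf $PBS_NODEFILE 'hostname; uptime'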

    Unfortunately this means that --joblog and --resume will not work. There is currently no way to make that work, but if it is valuable to you, contact me via parallel@gnu.org.
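
    If some form of restartability is still needed, one coarse workaround (my suggestion, not part of the original answer) is to split example.lst into chunks up front and record which chunks have finished, so a rerun only replays the unfinished chunks:

    # Hypothetical chunk-level resume; chunk size and file names are arbitrary.
    split -l 1000000 example.lst chunk_
    for c in chunk_*; do
      grep -qx "$c" done.log 2>/dev/null && continue   # skip chunks already completed
      parallel -j1 -a "$c" --env dowork --pipepart --slf $PBS_NODEFILE --block -10 dowork \
        && echo "$c" >> done.log
    done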