Suppose I have a file of size N, for example a 30 GB file.
The file has a proportional number of lines. It is an interleaved FastQ file (not important for the question, but useful for someone).
The file content is paired, or interleaved, DNA sequence strings. Each pair is 8 lines long.
I want to process the interleaved FastQ file with GNU parallel in order to speed up the process.
The reason for using parallel instead of the native threads feature of the bwa tool is that parallel helps reduce the amount of RAM needed, because of the nature of bwa memory allocation.
Given that the interleaved file is 30 GB in size, I want to process it in chunks of --block 500M. The command-line parameters look like --pipe --block 500M -L 8 -j 10. Each chunk is then sent as stdin to bwa, so this will run 10 bwa tasks, each getting 500M chunks made up of records of 8 lines.
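For concreteness, here is a minimal sketch of the kind of invocation described above. The function name align_chunks, the file names, and the choice of bwa mem -p (interleaved input) reading from stdin via - are assumptions for illustration, not the exact command I ran.

```shell
# Sketch only: align_chunks, the reference, and the input file are placeholders.
# Usage: align_chunks ref.fa interleaved.fq
align_chunks() {
    ref=$1
    input=$2
    # --pipe splits stdin into blocks; -L 8 treats 8 lines as one record;
    # -j 10 runs up to 10 jobs at once; {#} is the job sequence number,
    # so every chunk gets its own output file.
    parallel --pipe --block 500M -L 8 -j 10 \
        "bwa mem -p $ref - > chunk_{#}.sam" < "$input"
}
```

Each job writes its own SAM file here, so the per-chunk outputs would still need to be merged afterwards.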
Is my assumption correct that --block 500M and -L 8 will be managed by parallel, and that I can be certain my bwa tool will always get whole records of 8 lines, up to N MB of data?
What I am not clear about is: will parallel "repeat" the last "chunk" if 8 lines are not present?
And will it appropriately control the chunk inputs for the N processes I start with parallel?
Or does --block 500M "blindly" send a 500M chunk to a single process, regardless of whether the last part of the 500M chunk contains a full 8 lines, so to speak?
Update:
After a whole day of reading questions and answers on Biostars and SEQanswers, I have realised that my testing/"benchmarking" was wrong.
This helped me realise that I need to update the question, and I will post a separate question.
I was testing inside a Docker container, which by default has a very small /dev/shm, so I misled myself into going down a totally different path.
Yes, you can be certain.
The --block parameter is described here:
https://www.gnu.org/software/parallel/parallel_tutorial.html#chunk-size
The -L parameter is described here:
https://www.gnu.org/software/parallel/parallel_tutorial.html#records
Quick summary: parallel will always send full lines to each process until the block/buffer capacity is filled. If you specify that one record spans several lines (8 in your case), it will fill the buffer capacity in units of 8 lines each.
Only the last record can be shorter than 8 lines, if fewer lines remain in the input.
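If you want to confirm this on your own data, one option is to replace bwa with a small checker that reports how many lines each chunk received. The sketch below is only an illustration; check_chunk is a hypothetical helper name, and the file name in the usage comment is a placeholder.

```shell
# Report the number of lines read from stdin and the remainder modulo 8.
# A remainder of 0 means the chunk holds only complete 8-line records.
check_chunk() {
    awk 'END { printf "lines=%d remainder=%d\n", NR, NR % 8 }'
}

# Hypothetical usage with GNU parallel:
#   export -f check_chunk    # bash only, so the jobs can see the function
#   parallel --pipe --block 500M -L 8 -j 10 check_chunk < interleaved.fq
```

Every line of output should show remainder=0 when the input itself has a multiple of 8 lines.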
Side note:
In the case of properly formatted and interleaved fastq files, the number of lines will always be a multiple of 8. The fastq format specifies that each record is 4 lines, and paired-end fastq files must contain the same number of records.
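A quick sanity check of that structure before splitting could look like the sketch below. It assumes the common 4-line layout with no wrapped sequence lines, where the first line of each record starts with @ and the third with +. The name validate_interleaved is made up for this example, and the check does not verify that read names within a pair actually match.

```shell
# Check the 4-line record layout and that the total line count is a multiple of 8.
# Prints "ok" and exits 0 on success; prints the problems and exits 1 otherwise.
validate_interleaved() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            printf "line %d: expected @ header\n", NR; bad = 1
        }
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            printf "line %d: expected + separator\n", NR; bad = 1
        }
        END {
            if (NR % 8 != 0) {
                printf "%d lines is not a multiple of 8\n", NR; bad = 1
            }
            if (!bad) print "ok"
            exit (bad ? 1 : 0)
        }
    '
}

# Hypothetical usage: validate_interleaved < interleaved.fq
```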