
GNU parallel --block and -L clarification


Suppose I have a file of size N; for the sake of example, a 30 GB file.

The file's line count is proportional to its size. It is an interleaved FASTQ file (not important for the question, but possibly useful to someone).

The file contains paired (interleaved) DNA sequence strings. Each pair is 8 lines long.

I want to process the interleaved FASTQ file with GNU parallel to speed things up. The reason for using parallel instead of bwa's native threading is that parallel helps reduce the amount of RAM needed, because of the way bwa allocates memory.

Given that the interleaved file is 30 GB, I want to process it in chunks using --block 500M. The command-line parameters look like --pipe --block 500M -L 8 -j 10. The data is then sent as stdin to bwa, which runs as 10 bwa tasks, each receiving 500M chunks made up of 8-line records.

Is my assumption correct that --block 500M and -L 8 are both managed by parallel, so I can be certain that bwa always receives whole 8-line records, roughly N MB at a time?

What I am not clear on is whether parallel will "repeat" the last "chunk" if 8 lines are not present, and whether it will appropriately control the chunk inputs for the N processes I start with parallel.

Or does --block 500M "blindly" send a 500M chunk to a single process, regardless of whether the end of that chunk contains a complete 8-line record?

Update:

After a whole day of reading questions and answers on biostars and seqanswers, I've realised that my testing/"benchmarking" was wrong.

But this helped me realise that I need to update the question, and I will post a separate question.

I was testing inside a Docker container, which by default has a very small /dev/shm, so I misled myself into going down a totally different path.



Solution

  • Yes, you can be certain.

    The --block parameter is described here: https://www.gnu.org/software/parallel/parallel_tutorial.html#chunk-size

    The -L parameter here: https://www.gnu.org/software/parallel/parallel_tutorial.html#records

    Quick summary: parallel always sends full lines to each process until the block/buffer capacity is filled. If you specify that one record spans several lines (8 in your case), it fills the buffer in whole records of 8 lines each, so a block ends on a record boundary and its size can differ from --block by up to about one record.

    Only the very last record of the input can be shorter than 8 lines, and only if fewer than 8 lines remain at the end of the file.

    Side note: In a properly formatted interleaved FASTQ file, the total line count is always a multiple of 8. Each FASTQ record is 4 lines by convention, and paired-end FASTQ files must contain the same number of records for both mates.
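
    If you would rather verify this than take it on trust, here is a minimal sketch of a check. It assumes GNU parallel is on your PATH; the function names (all_multiples_of, has_whole_pairs, chunk_line_counts) and the test input of 8000 numbered lines are made up for illustration and are not part of parallel or bwa. The block size is deliberately tiny (1k) so that many chunks are produced.

```shell
#!/bin/sh
# Reads one line count per line on stdin; succeeds only if every
# count is a multiple of the record size given as $1.
all_multiples_of() {
    awk -v n="$1" '$1 % n != 0 { bad = 1 } END { exit (bad ? 1 : 0) }'
}

# Succeeds if the file named in $1 has a line count divisible by 8,
# i.e. it could be a complete interleaved FASTQ file.
has_whole_pairs() {
    lines=$(wc -l < "$1")
    [ $((lines % 8)) -eq 0 ]
}

# Feeds 8000 numbered lines through GNU parallel using 8-line records
# and prints the number of lines each job received (-k keeps job order).
chunk_line_counts() {
    seq 1 8000 | parallel -k --pipe --block 1k -L 8 'wc -l'
}

# Only run the live check when GNU parallel is actually installed.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    if chunk_line_counts | all_multiples_of 8; then
        echo "every chunk held whole 8-line records"
    else
        echo "some chunk split a record"
    fi
fi
```

    Running has_whole_pairs on your own interleaved file before starting the real job is a cheap way to catch a truncated input, which is the one case where the final record would come up short.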