Tags: python, multiprocessing, biopython, pysam

Multiprocessing: multiple processes reading the same file


I am trying to simulate some DNA-sequencing reads and, in order to speed up the code, I need to run it in parallel.

Basically, what I am trying to do is the following: I am sampling reads from the human genome, and I think that when two of the processes from the multiprocessing module try to get data from the same file (the human genome), they get corrupted and cannot retrieve the desired DNA sequence. I have tried different things, but I am very new to parallel programming and I cannot solve my problem.

When I run the script with one core, it works fine.

This is the way I am calling the function:

if __name__ == '__main__':
    jobs = []
    length = 100
    # A single shared lock for all workers; creating a new
    # Manager().Lock() inside the loop would give each process
    # its own lock, which synchronises nothing.
    lock = mp.Manager().Lock()
    # init the processes
    for i in range(number_of_cores):
        p = mp.Process(target=simulations.sim_reads,
                       args=(lock, FastaFile, "/home/inigo/msc_thesis/genome_data/hg38.fa",
                             length, paired, results_dir, spawn_reads[i], temp_file_names[i]))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()

And this is the function I am using to get the reads, where each process writes the data to a different file.

from io import StringIO
from Bio import SeqIO

def sim_single_end(lc, fastafile, chrom, chr_pos_start, chr_pos_end, read_length, unique_id):
    half = read_length // 2  # integer division: fetch() expects integer coordinates
    with lc:  # serialise access to the shared pysam handle
        left_split_read = fastafile.fetch(chrom, chr_pos_end - half, chr_pos_end)
        right_split_read = fastafile.fetch(chrom, chr_pos_start, chr_pos_start + half)
    reversed_left_split_read = left_split_read[::-1]
    total_read = reversed_left_split_read + right_split_read
    seq_id = "id:%s-%s|left_pos:%s-%s|right:%s-%s " % (
        unique_id, chrom, chr_pos_end - half, chr_pos_end,
        chr_pos_start, chr_pos_start + half)
    quality = "I" * read_length
    fastq_string = "@%s\n%s\n+\n%s\n" % (seq_id, total_read, quality)
    new_record = SeqIO.read(StringIO(fastq_string), "fastq")
    return new_record
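
As far as I can tell, the "(0 and 100)" in the traceback below means fetch() returned an empty sequence while the quality string was still 100 characters long. A hand-built FASTQ record with an empty sequence trips the same check (this snippet is only to illustrate the failure mode):

from io import StringIO
from Bio import SeqIO

# An empty sequence paired with a 100-character quality line triggers
# the same length check inside Biopython's FASTQ parser.
broken = "@test\n\n+\n%s\n" % ("I" * 100)
SeqIO.read(StringIO(broken), "fastq")  # ValueError: ... (0 and 100)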

Here is the traceback:

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/inigo/Dropbox/PycharmProjects/circ_dna/simulations.py", line 107, in sim_ecc_reads
    new_read = sim_single_end(lc, fastafile, chr, chr_pos_start, chr_pos_end, read_length, read_id)
  File "/home/inigo/Dropbox/PycharmProjects/circ_dna/simulations.py", line 132, in sim_single_end
    new_record = SeqIO.read(StringIO(fastq_string), "fastq")
  File "/usr/local/lib/python3.5/dist-packages/Bio/SeqIO/__init__.py", line 664, in read
    first = next(iterator)
  File "/usr/local/lib/python3.5/dist-packages/Bio/SeqIO/__init__.py", line 600, in parse
    for r in i:
  File "/usr/local/lib/python3.5/dist-packages/Bio/SeqIO/QualityIO.py", line 1031, in FastqPhredIterator
    for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
  File "/usr/local/lib/python3.5/dist-packages/Bio/SeqIO/QualityIO.py", line 951, in FastqGeneralIterator
    % (title_line, seq_len, len(quality_string)))
ValueError: Lengths of sequence and quality values differs  for id:6-chr1_KI270707v1_random|left_pos:50511537-50511587|right:50511214-50511264 (0 and 100).

Solution

  • I am the OP of this question, which I asked almost a year ago. The problem was that the package I was using to read the human genome file (pysam) was failing. The cause was a typo in the multiprocessing call: args was not being passed as a tuple.

    From the author's response, this should work:

     p = mp.Process(target=get_fasta, args=(genome_fa,))
    

    Note the ',' to ensure you pass a tuple.
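
    The distinction matters because parentheses alone do not make a tuple in Python; a one-element tuple needs the trailing comma (illustrative snippet):

     genome_fa = "/home/inigo/msc_thesis/genome_data/hg38.fa"

     args = (genome_fa)   # just a parenthesised string; Process would
                          # unpack it character by character
     args = (genome_fa,)  # a real one-element tuple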

    See https://github.com/pysam-developers/pysam/issues/409 for more details.
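
    For anyone hitting the same wall, here is a minimal sketch of the working pattern; the body of get_fasta is not shown in the issue, so this version is illustrative. The key points are passing the path rather than an open handle, and the trailing comma:

     import multiprocessing as mp
     import pysam

     def get_fasta(genome_path):
         # Each worker opens its own pysam.FastaFile, so no file
         # handle is shared between processes.
         fastafile = pysam.FastaFile(genome_path)
         try:
             print(fastafile.fetch("chr1", 0, 100))  # illustrative region
         finally:
             fastafile.close()

     if __name__ == '__main__':
         genome_fa = "/home/inigo/msc_thesis/genome_data/hg38.fa"
         p = mp.Process(target=get_fasta, args=(genome_fa,))  # note the ','
         p.start()
         p.join()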