I'm trying to choose unique regions for my Non-invasive Prenatal Testing (NIPT) project. I have done following steps:
I have a .sam file about 40gb, on the next step I try to merge all overlapping coordinates to one .bed file for using in samtools. This is my python script to do that:
import glob
import time
folder = glob.glob('analysys/*.sam')
core = 22
nst = [f'chr{x+1}' for x in range(22)] + ['chrX','chrY']
for file in folder:
now = time.time()
print(f'Analyzing {file}')
raw = []
treat1 = []
treat2 = []
with open(file) as samfile:
print('Importing coordinates')
for line in samfile:
coord_raw = line.split('\t')[0]
coord_chr = coord_raw.split('_')[0]
coord_loc = [int(r) for r in coord_raw.split('_')[1].split('-')] #0: begin, 1: end
raw.append(({coord_chr},{coord_loc[0]},{coord_loc[1]}))
print('Begin merging overlapping area...')
last_coord = () #0:chr, 1:begin, 2:end
for chrom ,begin , end in raw:
if last_coord == () or last_coord[0] != chrom:
last_coord = (chrom,begin,end)
else:
if last_coord[0] == chrom:
if begin > last_coord[1] and end < last_coord[2]:
last_coord = (chrom,last_coord[1],end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
print('Begin merging nearby area...')
last_coord = ()
for chrom ,begin , end in treat1:
if last_coord == ():
last_coord = (chrom,begin,end)
else:
if chrom == last_coord[0]:
if begin == last_coord[2] + 1:
last_coord = (last_coord[0],last_coord[1],end)
else:
treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
last_coord = (chrom,begin,end)
else:
treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
last_coord = (chrom,begin,end)
print('Outputting...')
with open('unique_coord.bed','w') as bedfile:
bedfile.write('\n'.join(treat2))
print(f'Program completed! Total time: {int(time.time() - now)} seconds')
However, after 30 minutes running, I found that the program consumed all of my memory and crashed. Is there any advice for me to reduce the amount of memory this script consumed? thank you very much
Hopefully this does the trick. Hard to know since I'm working without having the input file or knowing what the output should look like and can't really run the code to check for errors, but here's what I've tried to do:
raw
and go straight to creating treat1
; andtreat2
and go straight to writing to bedfile
. I opened bedfile
right off the bat and everything in your program is done with that file open.If this does what you need but still crashes, then perhaps you can read through each file twice to get last_coord
at the end of the treat1
and then read through it again to "recreate" each line of treat1
individually and apply it to defining what needs to be written into the file.
Without really knowing the details of what you're doing (I do not work anywhere close to a field that would apply samtools).
import glob
import time
folder = glob.glob('analysys/*.sam')
# These two lines are not used
#core = 22
#nst = [f'chr{x+1}' for x in range(22)] + ['chrX','chrY']
# moved this to outside the for loop since the time check was
# after the for loop
now = time.time()
with open('unique_coord.bed','w') as bedfile:
for file in folder:
print(f'Analyzing {file}')
raw = []
treat1 = []
with open(file) as samfile:
print('Importing coordinates')
last_coord = ()
for line in samfile:
chrom = line.split('\t')[0]
begin = coord_raw.split('_')[0]
end = [int(r) for r in coord_raw.split('_')[1].split('-')] #0: begin, 1: end
if last_coord == () or last_coord[0] != chrom:
last_coord = (chrom,begin,end)
else:
if last_coord[0] == chrom:
if begin > last_coord[1] and end < last_coord[2]:
last_coord = (chrom,last_coord[1],end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
print('Begin merging nearby area...')
last_coord = ()
for chrom ,begin , end in treat1:
if last_coord == ():
last_coord = (chrom,begin,end)
else:
if chrom == last_coord[0]:
if begin == last_coord[2] + 1:
last_coord = (last_coord[0],last_coord[1],end)
else:
bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
last_coord = (chrom,begin,end)
else:
bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
last_coord = (chrom,begin,end)
print(f'Program completed! Total time: {int(time.time() - now)} seconds')