I am trying to analyse a simple but "large" (52 MB) .ped file I generated from a 1000 Genomes project .vcf file: it has 107 lines and 248189 columns. I do not care about the first 6 cols, and the ones I am interested in contain only the letters 'A', 'C', 'G', 'T', whose frequencies I need to calculate. Each column contains either only one of them or a combination of two (e.g. A,A,A,C,A,A,C,...). The problem is that I am not able to load the whole file, as memory usage goes from 7 GB to 16 GB (which I can't understand since the file is only 52 MB), so I tried another approach.
The columns complement each other, so cols 6 and 7 represent the same thing, as do 8 and 9, 10 and 11, etc., so I only really need to import two cols at a time to analyse this file. By building a dict I can avoid loading the whole file and keep only the necessary info, so I did the following:
import pandas as pd
import io
import matplotlib.pyplot as plt
import numpy as np
import time

def main(end):
    for col in range(6, end, 2):
        # read only the current pair of columns
        allels = pd.read_csv('filtered_conv.ped', sep='\t', header=None, usecols=[col, col + 1])
        allel_1 = allels[col].value_counts().rename_axis('Nb').reset_index(name='counts')
        allel_2 = allels[col + 1].value_counts().rename_axis('Nb').reset_index(name='counts')
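For reference, this is roughly what those frames look like for one pair of columns (made-up counts, just to illustrate; the real file has 107 rows per column):

import pandas as pd

# made-up stand-in for cols 6 and 7 of the real file
allels = pd.DataFrame({6: ['A'] * 100 + ['C'] * 7,
                       7: ['A'] * 95 + ['C'] * 12})
allel_1 = allels[6].value_counts().rename_axis('Nb').reset_index(name='counts')
print(allel_1)
#   Nb  counts
# 0  A     100
# 1  C       7

So index 0 always holds the more common (reference) allele and index 1 the alternative one, which is what the dict logic below relies on.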
And for the dict I am using this inside the loop:
if len(allel_1['counts']) == 2 and len(allel_2['counts']) == 2:
    alt_nb = allel_1['Nb'][1]
    alt_c = allel_1['counts'][1] + allel_2['counts'][1]
elif len(allel_1['counts']) == 2:
    alt_nb = allel_1['Nb'][1]
    alt_c = allel_1['counts'][1]
else:
    alt_nb = allel_2['Nb'][1]
    alt_c = allel_2['counts'][1]

info = {'ref_Nb': allel_1['Nb'][0],
        'c_ref': allel_1['counts'][0] + allel_2['counts'][0],
        'alt_Nb': alt_nb,
        'c_alt': alt_c}
freq_count.append(info)
And although it works, running this for cols 6 to 240 (117 column pairs) takes 30 seconds, so covering all ~124,000 pairs would take about 8 hours to complete.
freq_count = []
start_time = time.time()
main(240)
print("--- %s seconds ---" % (time.time() - start_time))
Is there any other way to do this? I saw some other SO answers, but categorizing the data (...dtype='category') increased this time from 33 to 38 sec, and I am not sure what else to do...
So all rows just contain 6 meta columns to skip, and then 248183 columns where every value can only be 'A', 'C', 'G' or 'T'? And the file is tab-separated?
with open('filtered_conv.ped') as file:
    # read line by line
    for line in file:
        chars = iter(line)
        # skip characters until 6 tabs were found (the 6 meta columns)
        tab_count = 0
        while tab_count < 6:
            if next(chars) == "\t":
                tab_count += 1
        # iterate over the remaining characters
        for next_char in chars:
            # do something in case A, C, G or T is found; ' ', \t and \n are skipped
            if next_char == "A":
                pass  # do something
            elif next_char == "C":
                pass  # do something
            elif next_char == "G":
                pass  # do something
            elif next_char == "T":
                pass  # do something
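For example, one way to fill in those branches (a sketch, assuming one character per allele and single tabs between columns) is to keep a Counter per column and track the current column index while iterating:

from collections import Counter

per_column = {}  # column index -> Counter of 'A'/'C'/'G'/'T'

with open('filtered_conv.ped') as file:
    for line in file:
        chars = iter(line)
        tab_count = 0
        while tab_count < 6:  # skip the 6 meta columns
            if next(chars) == "\t":
                tab_count += 1
        col = 6  # first allele column (0-based, matching the pandas code above)
        for next_char in chars:
            if next_char in "ACGT":
                per_column.setdefault(col, Counter())[next_char] += 1
            elif next_char == "\t":
                col += 1

# counts for the pair of columns 6 and 7, analogous to allel_1/allel_2 above
print(per_column.get(6), per_column.get(7))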
If you need a progress bar, you can use tqdm like so:
from tqdm import tqdm
...
    # read line by line
    for line in tqdm(file):
        ...
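Since iterating over a file object doesn't expose the number of lines, you can also pass the known line count so the bar shows a percentage (107 lines in your case):

    for line in tqdm(file, total=107):
        ...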