Tags: python, pandas, vcf-variant-call-format

50MB file taking too long to read in Python


I am trying to analyse a simple but "large" (52MB) .ped file I generated from a 1000 Genomes Project .vcf file: it has 107 lines and 248,189 columns. I do not care about the first 6 columns, and the ones I am interested in contain only the letters 'A', 'C', 'G' and 'T', for which I need to calculate frequencies. A column holds either just one of them or a combination of two (e.g. A,A,A,C,A,A,C,...). The problem is that I am not able to load the whole file, as memory usage climbs from 7GB to 16GB (which I can't understand, since the file is only 52MB), so I tried another approach.

The columns complement each other, so columns 6 and 7 represent the same thing, as do 8 and 9, 10 and 11, and so on, so I only really need to import two columns at a time to analyse this file. By building a dict I can avoid loading the whole file while keeping the necessary info, and for that I did the following:

import pandas as pd
import io
import matplotlib.pyplot as plt
import numpy as np
import time

def main(end):
    for col in range(6, end, 2):
        # read only the current pair of columns
        allels = pd.read_csv('filtered_conv.ped', sep='\t', header=None, usecols=[col, col+1])
        # count how often each letter occurs in each column of the pair
        allel_1 = allels[col].value_counts().rename_axis('Nb').reset_index(name='counts')
        allel_2 = allels[col+1].value_counts().rename_axis('Nb').reset_index(name='counts')

And for the dict I am using this inside the loop:

        # determine the alternative allele and its count, depending on
        # whether one or both columns contain two distinct letters
        if len(allel_1['counts']) == 2 and len(allel_2['counts']) == 2:
            alt_nb = allel_1['Nb'][1]
            alt_c = allel_1['counts'][1] + allel_2['counts'][1]
        elif len(allel_1['counts']) == 2:
            alt_nb = allel_1['Nb'][1]
            alt_c = allel_1['counts'][1]
        else:
            alt_nb = allel_2['Nb'][1]
            alt_c = allel_2['counts'][1]

        info = {'ref_Nb': allel_1['Nb'][0],
                'c_ref': allel_1['counts'][0] + allel_2['counts'][0],
                'alt_Nb': alt_nb,
                'c_alt': alt_c}
        freq_count.append(info)

And although it works, running this for columns 6 to 240 (117 pairs) takes 30 seconds, so covering all ~124,000 pairs would take about 8 hours to complete.

freq_count = []
start_time = time.time()
main(240)
print("--- %s seconds ---" % (time.time() - start_time))

Is there any other way to do this? I saw some other SO answers, but categorizing the data (...dtype='category') increased the time from 33 to 38 seconds, and I am not sure what else to try...
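
For clarity, the category attempt looked roughly like this (just the read_csv call above with a dtype argument added):

allels = pd.read_csv('filtered_conv.ped', sep='\t', header=None,
                     usecols=[col, col+1], dtype='category')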


Solution

  • So all rows just contain 6 meta columns to skip, and then 248,183 columns where every value can only be 'A', 'C', 'G' or 'T'? And the file is tab-separated?

    with open('filtered_conv.ped') as file:
        # read line by line
        for line in file:
            # skip characters until the 6 tabs after the meta columns have been consumed
            chars = iter(line)
            tab_count = 0
            while tab_count < 6:
                if next(chars) == "\t":
                    tab_count += 1
            # iterate over the remaining characters
            for next_char in chars:
                # do something in case A, C, G or T is found; ' ', \t and \n are skipped
                if next_char == "A":
                    pass  # do something
                elif next_char == "C":
                    pass  # do something
                elif next_char == "G":
                    pass  # do something
                elif next_char == "T":
                    pass  # do something
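
    As a concrete sketch of the "do something" parts, a collections.Counter can tally the bases. The counts variable and the frequency printout below are illustrative additions; this counts the file as a whole rather than per column pair, so a real solution would also track the column index:

    from collections import Counter

    counts = Counter()
    with open('filtered_conv.ped') as file:
        for line in file:
            # skip the 6 tab-separated meta columns
            chars = iter(line)
            tab_count = 0
            while tab_count < 6:
                if next(chars) == "\t":
                    tab_count += 1
            # tally every A/C/G/T in the rest of the line
            for next_char in chars:
                if next_char in "ACGT":
                    counts[next_char] += 1

    total = sum(counts.values())
    for base, n in counts.most_common():
        print(f"{base}: {n} ({n / total:.2%})")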

    If you need a progress bar you can use tqdm like so:

    from tqdm import tqdm
    ...
    
    # read line by line
    for line in tqdm(file):
        ...
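
    Note that without a known total, tqdm shows a running line count and speed rather than a percentage; with only 107 lines here, the bar mainly confirms that the per-line work is progressing.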