python pandas

How to query a large file using pandas (or an alternative)?


I want to query a large file that I generated from the dbSNP VCF (16GB) using this shell command:

gzcat 00-All.vcf.gz | grep -v '^##' | awk -v FS='\t' -v OFS='\t' '{print $3, $1, $2, $4, $5}' | gzip > 00-All_relevant.vcf.gz

From this 00-All_relevant.vcf.gz I want to query multiple rsIDs (column ID in the MRE) and retrieve the corresponding genomic location and change (columns #CHROM, POS, REF, ALT). I use pandas for my project, so I started by trying pandas for this too, but I am open to other Python-compatible options.

I cannot read the whole file at once since my kernel always crashes when I use this command:

import pandas as pd

rsid_df = pd.read_csv('00-All_relevant.vcf.gz',
                      sep='\t')

That's why I tried reading it in chunks, and after implementing user27243451's solution I came up with this:

rsid_df = pd.read_csv('00-All_relevant.vcf.gz',
                      sep='\t',
                      chunksize=1000000)

rsids = ['rs537152180','rs376204250','rs181326522']
variants = pd.DataFrame()

for data in rsid_df:
    rsid_found = data['ID'].isin(rsids)       # boolean mask of rows whose ID is in rsids
    if rsid_found.astype(int).sum() > 0:
        variant = data.loc[rsid_found]
        variants = pd.concat([variants, variant])
        for rsid in variant['ID'].tolist():   # drop the rsIDs that were just found
            rsids.remove(rsid)
        if not rsids:                         # stop once every rsID has been found
            break

print(variants)

This works, but it's terribly slow, especially if one of the rsIDs is in the last chunk. Is there any way to speed this up?


Solution

  • According to the pandas documentation, read_csv invoked with the chunksize argument returns an iterator over DataFrames rather than a single DataFrame.
    Your approach is generally correct. We can optimize the loop by using the isin() operator, as already pointed out by others.
    I refactored your code a bit with more explicit names and an additional variable c_ to make the filtering of the DataFrame by a boolean array explicit.

    import pandas as pd
    
    df_collection = pd.read_csv('large_file.gz',
                                chunksize=1000000)
    
    keys = ["find_this", "and that"] # elements of column 1
    
    for df in df_collection:
        c_ = df['col1'].isin(keys)   # boolean array with True where the condition is satisfied
        if c_.any():                 # if there is at least one True, print and exit the loop
            print("found\n", df.loc[c_])
            break
    else:
        print("Not found")