I want to query a large file that I generated from the dbSNP VCF (16GB) using this shell command:
gzcat 00-All.vcf.gz | grep -v ## | awk -v FS='\t' -v OFS='\t' '{print $3, $1, $2, $4, $4}' | gzip > 00-All_relevant.vcf.gz
From this 00-All_relevant.vcf.gz
I want to query multiple rsIDs (column ID in the MRE) and retrieve the corresponding genomic location and change (columns #CHROM, POS, REF, ALT). I use pandas for my project so I started with trying pandas for this too. I am open to other Python compatible options though.
I cannot read the whole file at once since my kernel always crashes, if using this command:
import pandas as pd
rsid_df = pd.read_csv('00-All_relevant.vcf.gz',
sep='\t')
That's why i tried reading it in chunks and after implementing user27243451's solution I came up with that solution:
rsid_df = pd.read_csv('00-All_relevant.vcf.gz',
sep='\t',
chunksize=1000000)
rsids = ['rs537152180','rs376204250','rs181326522']
variants = pd.DataFrame()
for data in rsid_df:
rsid_found = data['ID'].isin(rsids)
if rsid_found.astype(int).sum() > 0:
variant = data.loc[rsid_found]
variants = pd.concat([variants,variant])
for id in variant['ID'].tolist():
rsids.remove(id)
if not rsids:
break
print(variants)
This works, but it's terribly slow, especially if one of the rsIDs is on the last chunk. Is there any way to speed this up?
Accordingly to Pandas documentation, read_csv invoked with the chunksize
argument returns a collection of dataframes.
Your approach is generally correct. We can optimize the for
by using the isin()
operator as already pointed out by others.
I refactor a bit your code with more explicit names and the additional variable c_
to explain the dataframe filtering by boolean arrays.
import pandas as pd
df_collection = pd.read_csv('large_file.gz',
chunksize=1000000)
keys = ["find_this", "and that"] # elements of column 1
for df in df_collection:
c_ = df['col1'].isin(keys) # returns an array of booleans with True where the condition is satisfied
if c_.astype(int).sum() > 0: # if there are some True in the array, print and exit the loop
print("found\n", df.loc[c_])
break
else:
print("Not found")