My database has about 600Mio entries that I want to query (Pandas is too slow). This local dbSNP only contains rsIDs and genomic positions. I used:
import sqlite3
import gzip
import csv
rsid_db = sqlite3.connect('rsid.db')
rsid_cursor = rsid_db.cursor()
rsid_cursor.execute(
"""
CREATE TABLE rsids (
rsid TEXT,
chrom TEXT,
pos INTEGER,
ref TEXT,
alt TEXT
)
"""
)
with gzip.open('00-All.vcf.gz', 'rt') as vcf: # from https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz
reader = csv.reader(vcf, delimiter="\t")
i = 0
for row in reader:
if not ''.join(row).startswith('#'):
rsid_cursor.execute(
f"""
INSERT INTO rsids (rsid, chrom, pos, ref, alt)
VALUES ('{row[2]}', '{row[0]}', '{row[1]}', '{row[3]}', '{row[4]}');
"""
)
i += 1
if i % 1000000 == 0:
print(f'{i} entries written')
rsid_db.commit()
rsid_db.commit()
rsid_db.close()
I want to query multiple rsIDs and get their genomic position and alteration (query rsid
and get chrom
, pos
, ref
, alt
and rsid
). One entry looks like:
rsid | chrom | pos | ref | alt |
---|---|---|---|---|
rs537152180 | 1 | 4002401 | G | A,C |
I query using:
import sqlite3
import pandas as pd
def query_rsid(rsid_list,
rsid_db_path='rsid.db'):
with sqlite3.connect(rsid_db_path) as rsid_db:
rsid_cursor = rsid_db.cursor()
rsid_cursor.execute(
f"""
SELECT * FROM rsids
WHERE rsid IN ('{"', '".join(rsid_list)}');
"""
)
query = rsid_cursor.fetchall()
return query
It takes about 1.5 minutes no matter how many entries. Is there a way to speed this up?
Others have suggested defining your rsid
column as the primary key, or alternatively creating a unique index on it. That's a good idea.
Another thing: rsid IN ('dirty','great','list','of',items')
may use a so-called skip-scan to get its results. If your rsid_list
is very large, or if it pulls in values that are lexically widely separated, you may get a benefit from putting the items in the list into a temporary table then doing
SELECT rsids.*
FROM rsids
JOIN temp_rsids_list ON rsids.rsid = temp_rsids_list.rsids
to get a more efficient lookup.
I would declare the table like this:
CREATE TABLE rsids (
rsid TEXT PRIMARY KEY COLLATE BINARY,
chrom TEXT,
pos INTEGER,
ref TEXT,
alt TEXT
) WITHOUT ROWID
COLLATE BINARY
is the default. But still, it's helpful to show it because you know upfront you don't want case-insensitive matching on that column. This will remind your future self and your colleagues of that important optimization.
WITHOUT ROWID
tells SQLite to organize the table as a so-called "clustered index" where the other values are stored along with the easily searchable primary key.
If you can make your primary key into an INTEGER that is a good idea for the sake of performance.