pythonsqlsqlitequery-optimization

How do I speed up querying my >600Mio rows?


My database has about 600Mio entries that I want to query (Pandas is too slow). This local dbSNP only contains rsIDs and genomic positions. I used:

import sqlite3
import gzip
import csv

rsid_db = sqlite3.connect('rsid.db')

rsid_cursor = rsid_db.cursor()

rsid_cursor.execute(
"""
CREATE TABLE rsids (
rsid TEXT,
chrom TEXT,
pos INTEGER,
ref TEXT,
alt TEXT
)
"""
)

with gzip.open('00-All.vcf.gz', 'rt') as vcf: # from https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz
    reader = csv.reader(vcf, delimiter="\t")
    i = 0
    for row in reader:
        if not ''.join(row).startswith('#'):
            rsid_cursor.execute(
            f"""
            INSERT INTO rsids (rsid, chrom, pos, ref, alt)
            VALUES ('{row[2]}', '{row[0]}', '{row[1]}', '{row[3]}', '{row[4]}');
            """
            )
            i += 1
            if i % 1000000 == 0:
                print(f'{i} entries written')
                rsid_db.commit()

rsid_db.commit()

rsid_db.close()

I want to query multiple rsIDs and get their genomic position and alteration (query rsid and get chrom, pos, ref, alt and rsid). One entry looks like:

rsid chrom pos ref alt
rs537152180 1 4002401 G A,C

I query using:

import sqlite3
import pandas as pd


def query_rsid(rsid_list,
               rsid_db_path='rsid.db'):

    with sqlite3.connect(rsid_db_path) as rsid_db:

        rsid_cursor = rsid_db.cursor()
        rsid_cursor.execute(
        f"""
        SELECT * FROM rsids
        WHERE rsid IN ('{"', '".join(rsid_list)}');
        """
        )

        query = rsid_cursor.fetchall()

    return query

It takes about 1.5 minutes no matter how many entries. Is there a way to speed this up?


Solution

  • Others have suggested defining your rsid column as the primary key, or alternatively creating a unique index on it. That's a good idea.

    Another thing: rsid IN ('dirty','great','list','of',items') may use a so-called skip-scan to get its results. If your rsid_list is very large, or if it pulls in values that are lexically widely separated, you may get a benefit from putting the items in the list into a temporary table then doing

    SELECT rsids.*
      FROM rsids
      JOIN temp_rsids_list ON rsids.rsid = temp_rsids_list.rsids
    

    to get a more efficient lookup.

    I would declare the table like this:

    CREATE TABLE rsids (
      rsid TEXT PRIMARY KEY COLLATE BINARY,
      chrom TEXT,
      pos INTEGER,
      ref TEXT,
      alt TEXT
    ) WITHOUT ROWID
    

    COLLATE BINARY is the default. But still, it's helpful to show it because you know upfront you don't want case-insensitive matching on that column. This will remind your future self and your colleagues of that important optimization.

    WITHOUT ROWID tells SQLite to organize the table as a so-called "clustered index" where the other values are stored along with the easily searchable primary key.

    If you can make your primary key into an INTEGER that is a good idea for the sake of performance.