sqliteselectsql-order-bypysqlite

Big SELECT optimization


I am using SQLite in Python for a big file management system. I have a big flat file (100 millions lines) that I want to sort using the values of 3 columns (which are integers), so that I could iterate and do some computation.

I used SQLite with a big SELECT ... ORDER BY (with an index on one column). Since this big SELECT is too memory demanding I need to call it several times (with OFFSET and LIMIT).

I could use Linux sort, but I want it to be platform independent. It is working fine (as long as the right PRAGMA are correctly set), but slow. How to optimize this?

Commands are like:

PRAGMA journal_mode = OFF
PRAGMA synchronous = 0
PRAGMA locking_mode = EXCLUSIVE
PRAGMA count_change = OFF
PRAGMA temp_store = 2
CREATE TABLE tmpTranscripts_arm_3R_transcripts (id INTEGER PRIMARY KEY, name varchar(255), chromosome varchar(255), start int(11), end int(11), direction tinyint(4), tags varchar(1023), bin int(11), exons varchar(10000))
CREATE INDEX 'iTranscript_arm_3R_14943' ON 'tmpTranscripts_arm_3R_transcripts' (start, end, direction)
INSERT INTO tmpTranscripts_arm_3R_transcripts (name, chromosome, start, end, direction, tags, bin, exons) VALUES ('SRR060644.1', 'arm_3R', 11450314, 11450337, -1, 'feature=transcript;bestRegion=(self);nbGaps=0;nbMismatches=0;ID=SRR060644.1;identity=100.0', 300011450, '')
(this, more than 10 millions times)
SELECT * FROM tmpTranscripts_arm_3R_transcripts ORDER BY start, end, direction LIMIT 0, 10000
(this, as much as needed)

Solution

  • I have written some sample script that create your database and go through all its elements. And it looks like it works much faster than your wrote in comments. Are you sure that database access is a bottleneck? Maybe in your script you do something more and this takes so much time.

    I have checked 2 databases SQLite and MongoDB with 5 millions of items. For SQLite inserting all rows took ~1200 seconds and selection them around 300 seconds. MongoDB was faster and insert took ~400 seconds while select less than 100 seconds.

    Please check your code with my samples and check if your select is similar. I used cursor instead of LIMIT/OFFSET. If this still doesn't help then I think MongoDB is worth a shot. It has one disadvantage - it require 64-bit OS to support large database (like yours). If you newer used it before then here is shortest installation guide for windows:

    And here are my python 3.x test scripts for SQLite

    import sqlite3
    from time import time
    
    conn = sqlite3.connect('test.dbase')
    
    c = conn.cursor()
    
    c.execute("""PRAGMA journal_mode = OFF""")
    c.execute("""PRAGMA synchronous = 0""")
    c.execute("""PRAGMA locking_mode = EXCLUSIVE""")
    c.execute("""PRAGMA count_change = OFF""")
    c.execute("""PRAGMA temp_store = 2""")
    
    c.execute("""CREATE TABLE tmpTranscripts_arm_3R_transcripts (id INTEGER PRIMARY KEY, name varchar(255), chromosome varchar(255), start int(11), end int(11), direction tinyint(4), tags varchar(1023), bin int(11), exons varchar(10000))""")
    c.execute("""CREATE INDEX 'iTranscript_arm_3R_14943' ON 'tmpTranscripts_arm_3R_transcripts' (start, end, direction)""")
    
    t1 = time()
    
    for i in range(0, 5000000):
        c.execute("""INSERT INTO tmpTranscripts_arm_3R_transcripts (name, chromosome, start, end, direction, tags, bin, exons) VALUES ('SRR060644.1', 'arm_3R', %d, %d, %d, 'feature=transcript;bestRegion=(self);nbGaps=0;nbMismatches=0;ID=SRR060644.1;identity=100.0', 300011450, '')""" % ((i+123)%352, (i+523)%422, (i+866)%536))
        if(not i%10000):
            print("Insert:", i)
    
    t2 = time()
    print("Insert time", t2-t1)
    
    conn.commit()
    
    t1 = time()
    c.execute("""SELECT * FROM tmpTranscripts_arm_3R_transcripts ORDER BY start, end, direction""")
    
    i = 0
    for row in c:
        a = row[0]
        if(not i%10000):
            print("Get:", i, row)
        i+=1
    
    t2 = time()
    print("Sort time", t2-t1)
    
    c.close()
    

    and for MongoDB

    from pymongo import Connection
    from pymongo import ASCENDING, DESCENDING
    from time import time
    
    connection = Connection()
    connection = Connection('localhost', 27017)
    db = connection['test-database']
    collection = db['test-collection']
    posts = db.posts
    
    posts.create_index([("start", ASCENDING), ("end", ASCENDING), ("direction", ASCENDING)])
    
    t1 = time()
    
    for i in range(0, 5000000):
        post = { "name": 'SRR060644.1',
                "chromosome": 'arm_3R',
                "start": (i+123)%352,
                "end": (i+523)%422,
                "direction": (i+866)%536,
                "tags": 'feature=transcript;bestRegion=(self);nbGaps=0;nbMismatches=0;ID=SRR060644.1;identity=100.0',
                "bin": 300011450,
                "exons": ''}
    
        posts.insert(post)
    
        if(not i%10000):
            print("Insert:", i)
    
    t2 = time()
    print("Insert time", t2-t1)
    
    t1 = time()
    
    i = 0
    for post in posts.find().sort([("start", ASCENDING), ("end", ASCENDING), ("direction", ASCENDING)]):
        if(not i%10000):
            print("Get:", i, post)
        i+=1
    
    t2 = time()
    print("Sort time", t2-t1)