
How to store time series in SQLite with fast time-range queries?


Let's say we log events in a SQLite database with a Unix timestamp column ts:

CREATE TABLE data(ts INTEGER, text TEXT);   -- more columns in reality

and that we want fast lookup for datetime ranges, for example:

SELECT text FROM data WHERE ts BETWEEN 1608710000 AND 1608718654;

As it stands, EXPLAIN QUERY PLAN reports SCAN TABLE data, i.e. a full table scan, which is slow. One obvious solution is to create an index with CREATE INDEX dt_idx ON data(ts).

That solves the problem, but maintaining an index seems a poor solution for a column ts that is already increasing (already sorted), and on which a B-tree search in O(log n) should be possible directly. Internally, the index will look like this:
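To make the difference concrete, here is a minimal sketch that prints the query plan before and after creating the index. It uses an in-memory database, and the helper query_plan is a hypothetical name introduced only for this illustration. The exact wording of the plan depends on the SQLite version, so no expected output is shown.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE data(ts INTEGER, text TEXT)")

QUERY = "SELECT text FROM data WHERE ts BETWEEN 1608710000 AND 1608718654"


def query_plan(connection, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as a single string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[-1] for row in rows)


# Without an index, the only option is a full table scan.
plan_without_index = query_plan(db, QUERY)

# With an index on ts, the range can be located by a B-tree search.
db.execute("CREATE INDEX dt_idx ON data(ts)")
plan_with_index = query_plan(db, QUERY)

print(plan_without_index)
print(plan_with_index)
```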

ts           rowid
1608000001   1
1608000002   2
1608000012   3
1608000077   4

which wastes database space (and CPU time, since a query has to look in the index first).

To avoid this:

More generally, what is the best way to store time series in SQLite so that queries of the form WHERE timestamp BETWEEN a AND b are fast?


Solution

  • First solution

    The method (2) detailed in the question seems to work well. In a benchmark, I obtained:

    The key point here is to declare dt as an INTEGER PRIMARY KEY, so that it is the row id itself (see also Is an index needed for a primary key in SQLite?), stored in a B-tree, with no other hidden rowid column. This avoids an extra index mapping dt => rowid: here dt is the row id.
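    As a quick check of this claim, the following sketch (in-memory database, illustrative values only) verifies that dt and rowid are the same value, that no separate index exists for the table, and that a range query is resolved through the primary key. The plan text varies between SQLite versions, so it is only printed.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE data(dt INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT)")
db.execute("INSERT INTO data(dt, label) VALUES (?, ?)", (16000000000000, "hello"))

# dt is an alias for the rowid, so both columns return the same value.
rowid, dt = db.execute("SELECT rowid, dt FROM data").fetchone()

# No index is listed for the table: the table's own B-tree is keyed by dt.
indexes = db.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'data'"
).fetchall()

# The range query is answered by searching on the integer primary key.
plan = "; ".join(
    row[-1]
    for row in db.execute(
        "EXPLAIN QUERY PLAN SELECT label FROM data WHERE dt BETWEEN ? AND ?",
        (16000000000000, 16000001000000),
    )
)

print(rowid == dt, indexes, plan)
```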

    We also use AUTOINCREMENT, which internally creates a sqlite_sequence table that keeps track of the last added ID. This is useful when inserting: two events can have the same timestamp in seconds (this could happen even with millisecond or microsecond timestamps, since the OS could truncate the precision), so we use the maximum of timestamp*10000 and last_added_ID + 1 to make sure the ID is unique:

     MAX(?, (SELECT seq FROM sqlite_sequence) + 1)
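    Here is a small sketch of how this expression resolves collisions, using three events in the same second followed by one in the next second. As an adjustment that is not part of the expression above, the subquery is wrapped in COALESCE and filtered on the table name: this guards against sqlite_sequence having no row yet for data, which appears to be the case before the first insert, where MAX would otherwise receive NULL. INSERT_SQL is a name introduced only for this illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE data(dt INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT)")

# Assumption: COALESCE(..., 0) handles the case where sqlite_sequence
# has no row for 'data' yet (before the first insert).
INSERT_SQL = (
    "INSERT INTO data(dt, label) VALUES ("
    "MAX(?, COALESCE((SELECT seq FROM sqlite_sequence WHERE name = 'data'), 0) + 1), ?)"
)

t = 1600000000
for label in ("a", "b", "c"):  # three events within the same second
    db.execute(INSERT_SQL, (t * 10000, label))
db.execute(INSERT_SQL, ((t + 1) * 10000, "d"))  # one event in the next second

# dt / 10000 is integer division in SQLite and recovers the timestamp in seconds.
rows = db.execute("SELECT dt, dt / 10000, label FROM data ORDER BY dt").fetchall()
for row in rows:
    print(row)
```

    The events sharing a second get consecutive IDs starting at timestamp*10000, and the event in the next second starts again at its own timestamp*10000.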
    

    Code:

    import sqlite3, random, time
    db = sqlite3.connect('test.db')
    db.execute("CREATE TABLE data(dt INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT);")
    
    t = 1600000000
    for i in range(1000*1000):
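        if random.randint(0, 100) == 0:  # timestamp advances by 1 second with probability ~1%
            t += 1
        # COALESCE covers the first insert, when sqlite_sequence has no row for 'data' yet and the subquery yields NULL
        db.execute("INSERT INTO data(dt, label) VALUES (MAX(?, COALESCE((SELECT seq FROM sqlite_sequence WHERE name='data'), 0) + 1), 'hello');", (t*10000, ))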
    db.commit()
    
    # t will range in a ~ 10 000 seconds window
    t1, t2 = 1600005000*10000, 1600005100*10000  # time range of width 100 seconds (i.e. 1%)
    start = time.time()
    for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)): 
        pass
    print(time.time()-start)
    

    Using a WITHOUT ROWID table

    Here is another method, using WITHOUT ROWID, which gives an 8 ms query time. We have to implement an auto-incrementing id ourselves, since AUTOINCREMENT is not available with WITHOUT ROWID.
    WITHOUT ROWID is useful when we want a PRIMARY KEY(dt, another_column1, another_column2, id) and want to avoid an extra rowid column. Instead of one B-tree for rowid and one B-tree for (dt, another_column1, ...), we'll have just one.
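    As a rough check of the single B-tree claim, the sketch below (in-memory database, one illustrative row) shows that such a table has no rowid column at all and that a range query on dt is resolved through the primary key. The plan wording depends on the SQLite version, so it is only printed.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE data(dt INTEGER, id INTEGER, label TEXT, PRIMARY KEY(dt, id)) WITHOUT ROWID"
)
db.execute("INSERT INTO data(dt, id, label) VALUES (1600000000, 0, 'hello')")

# A WITHOUT ROWID table has no hidden rowid column, so selecting it fails.
try:
    db.execute("SELECT rowid FROM data")
    has_rowid = True
except sqlite3.OperationalError:
    has_rowid = False

# The leading primary-key column dt is used directly for the range search.
plan = "; ".join(
    row[-1]
    for row in db.execute(
        "EXPLAIN QUERY PLAN SELECT label FROM data WHERE dt BETWEEN ? AND ?",
        (1600000000, 1600000100),
    )
)

print(has_rowid, plan)
```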

    db.executescript("""
        CREATE TABLE autoinc(num INTEGER); INSERT INTO autoinc(num) VALUES(0);
    
        CREATE TABLE data(dt INTEGER, id INTEGER, label TEXT, PRIMARY KEY(dt, id)) WITHOUT ROWID;
        
        CREATE TRIGGER insert_trigger BEFORE INSERT ON data BEGIN UPDATE autoinc SET num=num+1; END;
        """)
    
    t = 1600000000
    for i in range(1000*1000):
        if random.randint(0, 100) == 0: # timestamp advances by 1 second with probability ~1%
            t += 1
        db.execute("INSERT INTO data(dt, id, label) VALUES (?, (SELECT num FROM autoinc), ?);", (t, 'hello'))
    db.commit()
    
    # t will range in a ~ 10 000 seconds window
    t1, t2 = 1600005000, 1600005100  # time range of width 100 seconds (i.e. 1%)
    start = time.time()
    for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)): 
        pass
    print(time.time()-start)
    

    Roughly-sorted UUID

    More generally, the problem is linked to having IDs that are "roughly-sorted" by datetime. More about this:

    All these methods use an ID which is:

    [---- timestamp ----][---- random and/or incremental ----]
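
    A minimal sketch of this layout, assuming a 64-bit integer ID with a millisecond timestamp in the high bits and a random suffix in the low bits. The 20-bit width and the names RANDOM_BITS, make_id and timestamp_of are hypothetical choices for illustration and do not come from any specific scheme.

```python
import secrets
import time

RANDOM_BITS = 20  # hypothetical width of the random suffix


def make_id(timestamp_ms):
    """Pack the timestamp into the high bits and random bits into the low bits."""
    return (timestamp_ms << RANDOM_BITS) | secrets.randbits(RANDOM_BITS)


def timestamp_of(event_id):
    """Recover the millisecond timestamp by dropping the random suffix."""
    return event_id >> RANDOM_BITS


now_ms = int(time.time() * 1000)
earlier = make_id(now_ms)
later = make_id(now_ms + 1)

print(earlier, later, timestamp_of(earlier))
```

    IDs from different milliseconds sort by time, while IDs within the same millisecond are ordered arbitrarily by their random suffix, hence "roughly sorted". With these widths a current millisecond timestamp still fits in a signed 64-bit integer, so such an ID could be stored as an INTEGER PRIMARY KEY.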