sqlite · indexing · group-by · large-data · sql · datetime

Improving group by query for large time-series dataset


I have a large table with time-series data for a whole year, with 24 values for each day (one per hour), for multiple customers.

Customer-ID  Date                 Value
001          2020-01-01 00:00:00  xx
001          2020-01-01 01:00:00  xx
001          2020-01-01 02:00:00  xx
...          ...                  ...
002          2020-01-01 00:00:00  xx
002          2020-01-01 01:00:00  xx
002          2020-01-01 02:00:00  xx
...          ...                  ...

I am currently storing the whole dataset in a single SQLite table, which I then query from a Python application that offers different forms of visualization (by month, which shows 12 values per year, or by day, which shows 365 values per year).

CREATE TABLE "data" (
    "index" INTEGER NOT NULL,
    "customer_id"   INTEGER NOT NULL,
    "date"  DATETIME NOT NULL,
    "value" FLOAT NOT NULL,
    "year"  INTEGER NOT NULL,
    "month" INTEGER NOT NULL,
    "day"   INTEGER NOT NULL,
    PRIMARY KEY("index")
);

CREATE INDEX "idx_data_ym" ON "data" (
    "year",
    "month"
);

CREATE INDEX "idx_data_ymd" ON "data" (
    "year",
    "month",
    "day"
);
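For reference, the redundant year, month and day columns have to be derived from the timestamp at insert time. A minimal loader sketch using the stdlib sqlite3 module (`insert_reading` is a hypothetical helper, not part of the original application):

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "data" (
    "index" INTEGER NOT NULL,
    "customer_id" INTEGER NOT NULL,
    "date" DATETIME NOT NULL,
    "value" FLOAT NOT NULL,
    "year" INTEGER NOT NULL,
    "month" INTEGER NOT NULL,
    "day" INTEGER NOT NULL,
    PRIMARY KEY("index")
);
CREATE INDEX "idx_data_ym" ON "data" ("year", "month");
CREATE INDEX "idx_data_ymd" ON "data" ("year", "month", "day");
""")

def insert_reading(conn, idx, customer_id, ts, value):
    # Derive the redundant year/month/day columns from the timestamp string.
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    conn.execute('INSERT INTO "data" VALUES (?, ?, ?, ?, ?, ?, ?)',
                 (idx, customer_id, ts, value, dt.year, dt.month, dt.day))

insert_reading(conn, 1, 1, "2020-01-01 00:00:00", 1.5)
```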

One naive query I could use to show the monthly data would be

SELECT date, sum(value) FROM data GROUP BY CAST(STRFTIME('%Y', date) AS INTEGER), CAST(STRFTIME('%m', date) AS INTEGER)

which is rather slow on a large dataset and, from my understanding, can't use indexes. That is why I am storing year, month and day in extra columns, so that I can use

SELECT date, sum(value) from data GROUP BY year, month

Result:

Date                 Sum
2017-01-01 00:00:00  yy
2017-02-01 00:00:00  yy
..                   ..
2018-01-01 00:00:00  yy
2018-02-01 00:00:00  yy
..                   ..

For a small test dataset (100 customers, 24 values per day, two years = 100 * 24 * 365 * 2 = 1,752,000 records), the first query takes about 20 s while the second one takes only 1.8 s.

Checking the queries with EXPLAIN QUERY PLAN, the second query is using the idx_data_ym index, which is what I wanted, while the first query does not use an index.

Output of EXPLAIN QUERY PLAN for the first query:

id  parent  notused  detail
6   0       0        SCAN TABLE data
8   0       0        USE TEMP B-TREE FOR GROUP BY

Output for the second query:

id  parent  notused  detail
7   0       0        SCAN TABLE data USING INDEX idx_data_ym
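The same plans can be checked from Python. A minimal sketch with the stdlib sqlite3 module, using the schema above (the exact plan wording varies slightly between SQLite versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "data" (
    "index" INTEGER NOT NULL,
    "customer_id" INTEGER NOT NULL,
    "date" DATETIME NOT NULL,
    "value" FLOAT NOT NULL,
    "year" INTEGER NOT NULL,
    "month" INTEGER NOT NULL,
    "day" INTEGER NOT NULL,
    PRIMARY KEY("index")
);
CREATE INDEX "idx_data_ym" ON "data" ("year", "month");
""")

queries = {
    "naive":   "SELECT date, SUM(value) FROM data "
               "GROUP BY CAST(STRFTIME('%Y', date) AS INTEGER), "
               "CAST(STRFTIME('%m', date) AS INTEGER)",
    "indexed": "SELECT date, SUM(value) FROM data GROUP BY year, month",
}
for label, query in queries.items():
    # The last column of each EXPLAIN QUERY PLAN row is the 'detail' text.
    details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
    print(label, details)
```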

Now I'm wondering: 1.8 s may be okay, but the dataset in production will be much larger, which will slow down the application tremendously. Grouping by year, month and day to get the sum for each day is even slower.

Is there a way to improve my query performance? Am I doing something completely wrong in how I aggregate the data?

Thanks for your help!


Solution

  • There is no need for the additional columns year, month and day.

    You can create Indexes On Expressions instead:

    CREATE TABLE "data" (
        "index" INTEGER NOT NULL,
        "customer_id" INTEGER NOT NULL,
        "date" TEXT NOT NULL, -- there is no DATETIME data type in SQLite
        "value" FLOAT NOT NULL,
        PRIMARY KEY("index")
    );
    
    CREATE INDEX "idx_data_ym" ON "data"(strftime('%Y-%m', date));
    
    CREATE INDEX "idx_data_ymd" ON "data"(date(date)); -- equivalent of strftime('%Y-%m-%d', date)
    

    For these queries:

    SELECT STRFTIME('%Y-%m', date) AS year_month, 
           SUM(value) AS total
    FROM data 
    GROUP BY year_month;
    
    SELECT date(date) AS year_month_day, 
           SUM(value) AS total
    FROM data 
    GROUP BY year_month_day; 
    

    the proper index will be used.


    Creating the right indexes is the most effective thing you can do to speed up these aggregations.
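Putting the pieces together, a self-contained sketch of the expression-index approach with the stdlib sqlite3 module (indexes on expressions need SQLite 3.9.0 or later; the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "data" (
    "index" INTEGER NOT NULL,
    "customer_id" INTEGER NOT NULL,
    "date" TEXT NOT NULL,
    "value" FLOAT NOT NULL,
    PRIMARY KEY("index")
);
CREATE INDEX "idx_data_ym"  ON "data"(strftime('%Y-%m', date));
CREATE INDEX "idx_data_ymd" ON "data"(date(date));
""")

conn.executemany('INSERT INTO "data" VALUES (?, ?, ?, ?)', [
    (1, 1, "2020-01-01 00:00:00", 1.0),
    (2, 1, "2020-01-01 01:00:00", 2.0),
    (3, 1, "2020-02-01 00:00:00", 4.0),
])

# GROUP BY the same expression the index was built on, so the index applies.
monthly = conn.execute("""
    SELECT strftime('%Y-%m', date) AS year_month, SUM(value) AS total
    FROM data
    GROUP BY year_month
""").fetchall()
print(monthly)  # [('2020-01', 3.0), ('2020-02', 4.0)]
```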