I have a large table with time-series data for a whole year, with 24 values for each day (one per hour), for multiple customers.
Customer-ID | Date | Value |
---|---|---|
001 | 2020-01-01 00:00:00 | xx |
001 | 2020-01-01 00:01:00 | xx |
001 | 2020-01-01 00:02:00 | xx |
... | ... | ... |
002 | 2020-01-01 00:00:00 | xx |
002 | 2020-01-01 00:01:00 | xx |
002 | 2020-01-01 00:02:00 | xx |
... | ... | ... |
I am currently storing the whole dataset in a single sqlite table, which I then query from an python application in which you can select different forms of visualization (by month which shows 12 values per year or by day, which shows 365 values per year).
CREATE TABLE "data" (
"index" INTEGER NOT NULL,
"customer_id" INTEGER NOT NULL,
"date" DATETIME NOT NULL,
"value" FLOAT NOT NULL,
"year" INTEGER NOT NULL,
"month" INTEGER NOT NULL,
"day" INTEGER NOT NULL,
PRIMARY KEY("index")
);
CREATE INDEX "idx_data_ym" ON "data" (
"year",
"month"
);
CREATE INDEX "idx_data_ymd" ON "data" (
"year",
"month",
"day"
);
One naive query I could use for showing the monthly data, would be
SELECT date, sum(value) FROM data GROUP BY CAST(STRFTIME('%Y', date) AS INTEGER), CAST(STRFTIME('%m', date) AS INTEGER)
which is rather slow on a large dataset and from my understanding can't use indexes, which is why I am storing year
, month
and day
as an extra column, so that I can use
SELECT date, sum(value) from data GROUP BY year, month
Result:
Date | Sum |
---|---|
2017-01-01 00:00:00 | yy |
2017-01-02 00:00:00 | yy |
.. | .. |
2018-01-01 00:00:00 | yy |
2018-01-02 00:00:00 | yy |
.. | .. |
For a small test dataset (100 customers, 24 hours per value, two years = 100 * 24 * 365 * 2
= 1.752.000
records), the first query takes about 20s
while the second one takes only 1.8s
.
Checking the queries with EXPLAIN QUERY PLAN
, the second query is using the idx_data_ym
index, which is what I wanted, while the first query does not use an index.
Output EXPLAIN QUERY PLAN
for the first query:
id | parent | notused | detail |
---|---|---|---|
6 | 0 | 0 | SCAN TABLE data |
8 | 0 | 0 | USE TEMP B-TREE FOR GROUP BY |
Output for the second query:
id | parent | notused | detail |
---|---|---|---|
7 | 0 | 0 | SCAN TABLE data USING INDEX idx_data_ym |
Now I'm wondering, 1.8s
may be okay, but the dataset in production will be much larger which will slow down the application tremendously. Grouping by year, month, day
to get the sum values for each day is even slower.
Is there a way to improve my query performance? Am I doing something complety wrong with understanding how to aggregate the data?
Thanks for your help!
There is no need for the additional columns year, month and day.
You can set Indexes On Expressions:
CREATE TABLE "data" (
"index" INTEGER NOT NULL,
"customer_id" INTEGER NOT NULL,
"date" TEXT NOT NULL, -- there is no DATETIME data type in SQLite
"value" FLOAT NOT NULL,
PRIMARY KEY("index")
);
CREATE INDEX "idx_data_ym" ON "data"(strftime('%Y-%m', date));
CREATE INDEX "idx_data_ymd" ON "data"(date(date)); -- equivalent of strftime('%Y-%m-%d', date)
For these queries:
SELECT STRFTIME('%Y-%m', date) AS year_month,
SUM(value) AS total
FROM data
GROUP BY year_month;
SELECT date(date) AS year_month_day,
SUM(value) AS total
FROM data
GROUP BY year_month_day;
the proper index will be used.
See the demo.
Indexes are the best that you can do to speed up the operation.