Found a strange(?) behaviour: when importing several hundred GB of data, aggregation works with a delay even if you force optimization process.
Here is the process:
CREATE TABLE IF NOT EXISTS temp.table
(
date Date,
session String,
total Int32
)
ENGINE = SummingMergeTree
ORDER BY ( date, session );
-- Importing '2024-01-01'
INSERT INTO temp.table
SELECT date, session, sum(n) AS total FROM temp.data WHERE date = '2024-01-01' GROUP BY date, session
SETTINGS async_insert = 1;
-- Importing '2024-01-02'
INSERT INTO temp.table
SELECT date, session, sum(n) AS total FROM temp.data WHERE date = '2024-01-02' GROUP BY date, session
SETTINGS async_insert = 1;
-- date = '2024-01-03'
-- date = '2024-01-04'
-- a lote of data ...
-- Run optimiztion process
OPTIMIZE TABLE temp.table FINAL;
-- Final aggregated report
SELECT * FROM temp.table;
I need the last SELECT
to return the correct sum, no matter how long it takes. Maybe there is some trick to deal with this kind of case? It's a bit frustrating when getting more data than it should be.
Don't use OPTIMIZE TABLE
for any kind of business logic. (In fact, rarely should you ever run that command.)
Add FINAL
after your table name and, at query time, the summing will occur:
SELECT * FROM temp.table FINAL;
Alternately, you can have FINAL
be the default behavior of all queries:
SET final = 1;
SELECT * FROM temp.table;
Check the docs: https://clickhouse.com/docs/en/operations/settings/settings#final