optimizationclickhouse

How to force SummingMergeTree do the summing


Found a strange(?) behaviour: when importing several hundred GB of data, aggregation works with a delay even if you force optimization process.

Here is the process:

CREATE TABLE IF NOT EXISTS temp.table
(
    date     Date, 
    session  String, 
    total    Int32
)
ENGINE = SummingMergeTree
ORDER BY ( date, session );

-- Importing '2024-01-01'
INSERT INTO temp.table
SELECT date, session, sum(n) AS total FROM temp.data WHERE date = '2024-01-01' GROUP BY date, session
SETTINGS async_insert = 1;

-- Importing '2024-01-02'
INSERT INTO temp.table
SELECT date, session, sum(n) AS total FROM temp.data WHERE date = '2024-01-02' GROUP BY date, session
SETTINGS async_insert = 1;

-- date = '2024-01-03'
-- date = '2024-01-04'
-- a lote of data ...


-- Run optimiztion process
OPTIMIZE TABLE temp.table FINAL;


-- Final aggregated report
SELECT * FROM temp.table;

I need the last SELECT to return the correct sum, no matter how long it takes. Maybe there is some trick to deal with this kind of case? It's a bit frustrating when getting more data than it should be.


Solution

  • Don't use OPTIMIZE TABLE for any kind of business logic. (In fact, rarely should you ever run that command.)

    Add FINAL after your table name and, at query time, the summing will occur:

    SELECT * FROM temp.table FINAL;
    

    Alternately, you can have FINAL be the default behavior of all queries:

    SET final = 1;
    SELECT * FROM temp.table;
    

    Check the docs: https://clickhouse.com/docs/en/operations/settings/settings#final