clickhouseclickhouse-client

Clickhouse MergeTree not aggregating inserts


I want to create a clickhouse table that stores the max value of stats over a period of every one hour. My base table looks like below

CREATE TABLE base_table
(
    `time` DateTime,
    `stats` Float64
)
ENGINE = MergeTree
ORDER BY time
SETTINGS index_granularity = 8192

The materialized view look like below

CREATE MATERIALIZED VIEW hourly_max_stats_mv TO hourly_max_stats
(
    `hour_start` DateTime,
    `hour_end` DateTime,
    `max_stats` Float64
) AS
SELECT
    toStartOfHour(time) AS hour_start,
    addHours(toStartOfHour(time), 1) AS hour_end,
    max(stats) AS max_stats
FROM base_table
GROUP BY
    hour_start,
    hour_end

When I insert the below 2 rows in the base table, I am expecting the hourly_max_stats_mv table to contain only 1 row with the max value over an interval of 1 hour.

INSERT INTO base_table (time, stats) VALUES ('2024-01-18 12:30:00', 10);
INSERT INTO base_table (time, stats) VALUES ('2024-01-18 12:40:00', 20);

Instead I see both the records in hourly_max_stats. I checked the clickhouse logs and there are no errors. I even waited for more than an hour to let the background process aggregate the result, but I still see 2 records. Any idea what am I doing wrong?


Solution

  • You are using MergeTree on your destination table when you actually want to be using AggregatingMergeTree. Here is what you need (I'm a little confused as to why you need the "end" of an hour so I removed it):

    CREATE TABLE hourly_max_stats (
        hour_start DateTime,
        max_stats SimpleAggregateFunction(max, Float64)
    )
    ENGINE = AggregatingMergeTree
    PRIMARY KEY hour_start;
    
    CREATE  MATERIALIZED VIEW hourly_max_stats_mv 
    TO hourly_max_stats
    AS
    SELECT
        toStartOfHour(time) AS hour_start,
        maxSimpleState(stats) AS max_stats
    FROM base_table
    GROUP BY
        hour_start;
    
    INSERT INTO base_table (time, stats) VALUES ('2024-01-18 12:30:00', 10);
    INSERT INTO base_table (time, stats) VALUES ('2024-01-18 12:40:00', 20);
    
    
    SELECT
        hour_start,
        max(max_stats)
    FROM hourly_max_stats
    GROUP BY hour_start;
    
    Query id: 194b3229-f94f-4503-abc7-552a11f4fc7a
    
    ┌──────────hour_start─┬─max(max_stats)─┐
    │ 2024-01-18 12:00:00 │             20 │
    └─────────────────────┴────────────────┘
    
    1 row in set. Elapsed: 0.025 sec.
    

    I explain the details here: https://www.youtube.com/watch?v=-Ma2ohy_6lM&list=PL0Z2YDlm0b3gtIdcZI3B_8bMJclDOvY8s&index=10