[SOLVED] Clickhouse aggregating merge tree table issue

Clickhouse aggregating merge tree table issue

I have aggregating merge tree table

-- app_events.events_v2_b definition

CREATE TABLE app_events.events_v2_b on cluster 'cluster_name'
(
    `event_date` Date,
    `user_pseudo_id` String,
    `event_params_ga_session_id` String,
    `event_params_ga_session_number` String,
    `min_time` AggregateFunction(min, String)
)
ENGINE = Distributed('cluster_name',
 'app_events',
 'events_v2_b_local',
 rand());


-- app_events.events_v2_b_local definition
CREATE TABLE app_events.events_v2_b_local on cluster 'cluster_name'
( 
    `event_date` Date,
    `user_pseudo_id` String,
    `event_params_ga_session_id` String,
    `event_params_ga_session_number` String,
    `min_time` AggregateFunction(min,String)
)
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/tables/{layer}-{shard}/events_v2_b_local',
 '{replica}')
PARTITION BY event_date
ORDER BY (event_date,
 user_pseudo_id,
 event_params_ga_session_id,
 event_params_ga_session_number)
SETTINGS index_granularity = 8192;

When I load data to app_events.events_v2_b using select :

INSERT INTO app_events.events_v2_b
(
    `event_date` ,
    `user_pseudo_id` ,
    `event_params_ga_session_id` ,
    `event_params_ga_session_number` ,
    `min_time` 
) 
SELECT
    event_date,
    user_pseudo_id,
    event_params_ga_session_id,
    event_params_ga_session_number,
    minState(event_timestamp) AS min_time
FROM app_events.events_v2 
and event_date = '2025-01-01'
GROUP BY
    event_date,
    user_pseudo_id,
    event_params_ga_session_id,
    event_params_ga_session_number

and then trying to calculate row count :

select 'events_v2_b' ,count(*) from (
SELECT event_date, 
user_pseudo_id, 
event_params_ga_session_id, 
event_params_ga_session_number, 
minMerge(min_time) as min_time
FROM app_events.events_v2_b
where event_date = '2025-01-01'
group by 
event_date, 
user_pseudo_id, 
event_params_ga_session_id, 
event_params_ga_session_number
)

I get result 100 123 rows

If I repeat proccess of loading to event_v2_b same dataset based on select (I truncate table befoore new load ) and then again select row count from event_v2_b using the same provided select mentioned above I getting 100 115 rows. Why results can be different ? Per my undestanding despite on rand() shardkey and (probably) unmerged parts I should get same results with every loading.

Solution

Issue happened because of before of insertion I had started drop partition sql command. This command didnt performed completely in time of I started insertion. While syncronization is not completed I had started insertion , when insertion have been completed a truncation command completed after the process insertion and deleted some of my inserted rows. Using following hint for drop partition process I have resolved the issue , every run of sequence drop - insert commands do the same and return same results every time.

Setting alter_sync = 1 forces the command to wait until the operation is fully complete before returning.