I know that the distributed node does not combine intermediate results from shards by using distributed_group_by_no_merge.
The following SQL
select sum(xxxxx),xxxxx from (
select sum(xxxx),xxxx
from (
select count(xxx),xxx
from distributed_table group by xxx )
group by xxxx SETTINGS distributed_group_by_no_merge = 1
) group by xxxxx
I want to know that which part of sql will be sent to MergeTree node to execute by using distributed_group_by_no_merge? is it?select count(xxx),xxx from distributed_table group by xxx ) group by xxxx SETTINGS distributed_group_by_no_merge = 1
how does the parameter of distributed_group_by_no_merge change the behavior of distributed query?which part of sql execute on MergeTree node and which part of sql execute on distributed node?
distributed_group_by_no_merge-param affects the way how the initiator-node (it is a node which runs distributed query) will form the final result of a distributed query:
either by merging aggregated intermediate states coming from shards by itself (it required copying full aggregated intermediate states from shards to initiator-node) [distributed_group_by_no_merge = 0 (default mode)]
or get already final result from shards (when each shard merges an intermediate aggregation state on its side and send to initiator-node only the final result). It provides a significant improvement in performance and resource consumption but requires the right selection of the sharding key [distributed_group_by_no_merge = 1]
I would put distributed_group_by_no_merge at the same level of subquery where defined distributed table to explicitly define your intention and avoid confusion when there are several distributed-subqueries.
Let's look at the way how to check the differences between the two modes (will use _shard_num-virtual column):
SELECT
groupUniqArray(_shard_num) AS shards,
..
FROM table
WHERE ..
GROUP BY ..
SETTINGS distributed_group_by_no_merge = 0
/* Aggregated states were merged into ONE result set on initiator-node.
┌─shards────┬─ ..
│ [2, 1, 3] │ ..
└───────────┴─ ..
*/
SELECT
groupUniqArray(_shard_num) AS shards,
..
FROM table
WHERE ..
GROUP BY ..
SETTINGS distributed_group_by_no_merge = 1
/* Get a set of final results (not aggregated states) from each shard. They should be unioned manually.
┌─shards─┬─ ..
│ [2] │ ..
│ [1] │ ..
│ [3] │ ..
└────────┴─ ..
*/
See https://clickhouse.com/docs/en/operations/settings/settings#distributed_group_by_no_merge.
How to avoid merging high cardinality sub-select aggregations on distributed tables