I have a table clustered on s_nation_key as below.
create or replace table t1
( S_SUPPKEY string,
S_NAME string,
S_NATIONKEY string,
S_ADDRESS string,
S_ACCTBAL string) cluster by (S_NATIONKEY);
Now i have added data to it
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=7
limit 50000;
When i check data distribution in underlying micro partition itlooks good .
>select system$clustering_information('t1','S_NATIONKEY');
{ "cluster_by_keys" : "LINEAR(S_NATIONKEY)", "total_partition_count" : 1, "total_constant_partition_count" : 0, "average_overlaps" : 0.0, "average_depth" : 1.0, "partition_depth_histogram" : {
"00000" : 0,
"00001" : 1,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0 } }
Again i have loaded few more record as below for particular s_nation_key set as below.
--batch load 2
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=3
LIMIT 50000;
--batch load 3
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=1
limit 50000;
--batch load 3
INSERT INTO T1
SELECT S_SUPPKEY , S_NAME,S_NATIONKEY,S_ADDRESS,S_ACCTBAL
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1000"."SUPPLIER"
WHERE S_NATIONKEY=2
and S_ACCTBAL>0
limit 50000;
Now when i check clustering information again ,this also looks good . Now total 4 micro-partition and each distinct S_NATIONKEY value set is loaded into individual partition with no overlapping in range.So all micro-partition is having clustering depth 1.
>select system$clustering_information('t1','S_NATIONKEY');
{
"cluster_by_keys" : "LINEAR(S_NATIONKEY)",
"total_partition_count" : 4,
"total_constant_partition_count" : 4,
"average_overlaps" : 0.0,
"average_depth" : 1.0,
"partition_depth_histogram" : {
"00000" : 0,
"00001" : 4,
"00002" : 0,
"00003" : 0,
"00004" : 0,
"00005" : 0,
"00006" : 0,
"00007" : 0,
"00008" : 0,
"00009" : 0,
"00010" : 0,
"00011" : 0,
"00012" : 0,
"00013" : 0,
"00014" : 0,
"00015" : 0,
"00016" : 0
}
}
Now as per Snowflake documentation and concept of query pruning, when ever we search for records belong to one cluster_key value , it should scan only particular micro-partition which will be holding that cluster_key value (basing on min/max value range of each micro-partition). But in my case it is scanning all underlying micro partition(as below)
. As per above query planning stats,it is scanning all the partitions, instead of scanning 1 .
Am i missing anything here ??What is the logic behind it ?? Please help me in understanding this scenario in Snowflake.
Thanks, @Himanshu
Here the size of the table is not that big, that is why it is scanning all the partition rather one. Even if you check the total size scanned it is just 7.96 mb which is small hence SF scans all partitions