snowflake-cloud-data-platformclustering-key

Snowflake - Clustering


What is the best approach for clustering snowflake tables

  1. Absolute clustering by manually reloading the tables at a certain frequency based on retrieval order
  2. Create cluster key and turn on auto recluster but suspend it most of them, run it only at certain intervals may be by looking at the partition scanned column of the table

Thanks Rajib


Solution

  • There is not general across all data use patterns that applies, and also that applies across time, as clustering that the implementation is evolving (said as an outside, but watching it change over time).

    Auto clustering is just like hard drive fragmentation management. Because they are both the same idea, of locating like data near, to make read perf better. And just like disk defragmentation different usage loads/patterns make the need for clstuering/defrag more important, and some usages conflict with auto-clustering.

    For example we have some tables that are written in as tight a loop as we can, and we want it clustered in a pattern that is 90% aligned with the insert order. So the auto clustering is not costly to the insert pattern. But once a month we delete from these tables GDPR/PII reasons, and after update/delete change 1/3 of partitions. So it would seem doing a full table rewrite with an ORDER applied would be overkill. But because of the insert rate auto-clustering (as it stands today) thrashes for hours and costs 5x the cost to do a full table rewrite.

    Also we have other tables (the contain address information) and the table is "rather small" so is full tables scanned a lot, so ordering it in the sense of auto-cluster does not make sense, but re-build the table daily, to keep the partition size small as possible, so full tables scans are the fastest they can be.. the point being auto-clustering also does micro-partition optimization, which would be useful, but we don't need the table ordered, so are not running clustering..