I have a table with 56 million rows.
This table handles a heavy load of UPSERTs every 5 minutes as it loads streaming data from Kafka, approximately 200-500k updates per load.
When I run a SELECT with an ORDER BY on one of the timestamp columns, it takes 5-7 minutes to return a result.
I tried a cluster key on that column, but because of the heavy DML on the table and the high cardinality of the column itself, the clustering was ineffective and costly.
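Roughly what I tried, with placeholder names (events / event_ts) standing in for the real table and timestamp column:

```sql
-- Placeholder names; the real table and column differ.
ALTER TABLE events CLUSTER BY (event_ts);

-- One way to see how well the key holds up (average depth / overlap):
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_ts)');
```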
So far, the only thing that has significantly reduced the query time (to about 15 seconds) is increasing the warehouse size from a Small to an X-Large.
I am not convinced that the only solution is to increase the warehouse size. Any advice here would be great!
Clustering on date(timestamp) (or something else with lower cardinality) would be more effective, although given the volume of updates it will still be expensive.
At a happy hour event, I heard from a Snowflake user who achieved acceptable results in a similar(ish) scenario by clustering on a late-arriving-facts flag (e.g. iff(event_date < current_date, true, false)), although I think they were INSERTing rather than UPSERTing, and in the latter case the micro-partitions have to be rewritten anyway, so it might not help much.
There are other things to consider too.
Inspect the query plan to confirm that ordering is actually the problem (e.g. is a lot of time spent on the sort?). Without seeing your actual query, I suspect the majority of the time is spent on the table scan, i.e. reading the data from remote storage. If a larger warehouse improves performance, this is likely the case, since every added node in the cluster means more micro-partitions can be read concurrently.
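One way to pull the operator-level breakdown with SQL (rather than eyeballing the Query Profile in Snowsight) is something along these lines:

```sql
-- Run the slow SELECT ... ORDER BY, then in the same session:
SELECT operator_id,
       operator_type,                -- e.g. TableScan, Sort, Result
       execution_time_breakdown,     -- where the time actually went
       operator_statistics           -- partitions scanned vs. total, bytes, rows
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()))
ORDER BY operator_id;
```

If the TableScan shows partitions scanned close to partitions total, pruning isn't helping and most of the time is remote I/O, which is exactly what the bigger warehouse is masking.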