snowflake-cloud-data-platform

Partitioning a large table in Snowflake through a custom partitioner


We have a large table in Snowflake which has more than 55 billion records. Users retrieve data from this table by providing YEAR and SERIAL_NUMBER as predicates. It takes a lot of time to retrieve the records. We do not want to enable auto-clustering as that is becoming expensive for us. We have, therefore, come up with the following approach and would like input on it.

We are dividing the table into multiple partitions. For example, if the table name is TABLE_XYZ, we are creating a physical table for each year, such as TABLE_XYZ_2016, TABLE_XYZ_2017, TABLE_XYZ_2018, TABLE_XYZ_2019, and TABLE_XYZ_2020. The latest year's table is the one that changes frequently; the others are largely untouched. We have written a partitioner/splitter which reads a similar source table, splits the records by year, and loads them into the corresponding year table, ordered by year and serial_number. We have also created a view on top of all these tables. The idea is that people who need all years will query the view, while people who are interested in a certain year will query the physical year table. We expect this to reduce the time to retrieve the data. However, if there is a new year (let's say 2021), we will have to create that table and also redefine the view. I have not been able to find a way to create a dynamic view.
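For illustration, the view is essentially a UNION ALL over the per-year tables; a minimal sketch (the view name TABLE_XYZ_ALL and SELECT * are placeholders, not the actual definition):

    CREATE OR REPLACE VIEW TABLE_XYZ_ALL AS
        SELECT * FROM TABLE_XYZ_2016
        UNION ALL
        SELECT * FROM TABLE_XYZ_2017
        UNION ALL
        SELECT * FROM TABLE_XYZ_2018
        UNION ALL
        SELECT * FROM TABLE_XYZ_2019
        UNION ALL
        SELECT * FROM TABLE_XYZ_2020;  -- a new year (e.g. 2021) means creating a table and recreating the view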

Please let me know if this approach has any loopholes.

Thanks


Solution

  • There's a far simpler way of achieving incredible performance using cluster keys. See the documentation here: https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html

    In summary, create a cluster key on (YEAR, SERIAL_NUMBER) and in the background Snowflake will automatically recluster (sort) the data by those keys (see the sketch at the end of this answer). Initially, this will probably re-sequence the entire table, but subsequently, it will only re-sort newly added values.

    I did a test on a billion-row table where I created a unique sequence and put a cluster key on it; performing random look-ups by the key, each fetch returned in milliseconds with just one I/O.

    Snowflake also has a customer in the USA with over a PETABYTE of data in a single table (yes, 1,000 terabytes) who uses the same technique to achieve terrific query performance against the most recently added data.

    To help tune the performance, I'd recommend (if possible) loading the data in YEAR order as it arrives. I.e., if you're doing a bulk load of several years, load one year at a time. Otherwise, don't worry.

    You can then remove your physical partitioning and views and have Snowflake keep the entire solution clean and automatically updated.

    You will find the background clustering will have an initial cost to sort the data, but subsequently there should be little ongoing cost involved, and the performance gains will be worth the effort.
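    As a concrete sketch of the clustering approach (table and column names are taken from the question; the literal values in the sample query are placeholders):

        -- Define the clustering key; Snowflake's background service will
        -- recluster (sort) micro-partitions by these columns over time.
        ALTER TABLE TABLE_XYZ CLUSTER BY (YEAR, SERIAL_NUMBER);

        -- Check how well the table is currently clustered on those columns.
        SELECT SYSTEM$CLUSTERING_INFORMATION('TABLE_XYZ', '(YEAR, SERIAL_NUMBER)');

        -- Queries filtering on the clustering key can then prune most micro-partitions.
        SELECT *
        FROM   TABLE_XYZ
        WHERE  YEAR = 2020
        AND    SERIAL_NUMBER = 'ABC123';  -- placeholder; adjust to SERIAL_NUMBER's actual data type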