I’ve been building an ETL/ELT pipeline with Azure Databricks Workflows, Spark and Azure Data Lake. It should process changes from an Azure SQL Server database in near real time (a Change Data Capture process).
On the SQL Server I have a bunch of big tables of 30–50 GB (the biggest of them 190–220 GB) and a lot of small tables. Compressed into Parquet-backed Delta format they come down to 3–5 GB, the biggest ones to 45–55 GB, on the Data Lake, but they are expected to grow. I am storing the data in Delta tables.
For easier management of the data (lifecycle management, debugging of the process, etc.) I think it would be better to store the data of the big tables in partitioned Delta tables. I ran some tests and observed that in some cases (when the written batch is big, i.e. millions of records) writing into the partitioned tables takes less time, but in most cases (thousands of records) it takes about the same time as with non-partitioned tables.
When a table is created for the first time, the partitioned table takes longer to create, which makes sense, since the metadata of a partitioned table is more complex.
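For illustration, a rough sketch of this kind of timing comparison (the source path, table names and the `ingest_ts` column are hypothetical; `spark` is the ambient session in a Databricks notebook):

```python
import time

from pyspark.sql import functions as F

# Hypothetical batch of CDC changes; substitute the real source.
batch = spark.read.format("delta").load("/mnt/bronze/changes")

# Append to an unpartitioned Delta table.
start = time.time()
batch.write.format("delta").mode("append").saveAsTable("silver.orders_flat")
print(f"unpartitioned append: {time.time() - start:.1f}s")

# Append to a table partitioned by ingestion year/month.
start = time.time()
(batch
 .withColumn("ingest_year", F.year("ingest_ts"))
 .withColumn("ingest_month", F.month("ingest_ts"))
 .write.format("delta").mode("append")
 .partitionBy("ingest_year", "ingest_month")
 .saveAsTable("silver.orders_partitioned"))
print(f"partitioned append: {time.time() - start:.1f}s")
```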
So far so good, but at the same time the Databricks documentation states that one should only partition tables bigger than 1 TB: https://docs.databricks.com/en/tables/partitions.html
https://learn.microsoft.com/en-us/azure/databricks/tables/partitions
Do you think I should follow this rule and create the tables as non-partitioned? And what is the reasoning behind this rule, i.e. why should I only partition tables bigger than 1 TB?
I am partitioning based on the year and month of ingestion of the data. The data will not be queried on those columns, but I have no requirement for a column that the data will be queried on.
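Worth noting: partition pruning only kicks in when a query filters on the partition columns, so with this scheme a typical lookup would not benefit. A sketch against the hypothetical table above:

```python
# Prunes: only the 2024/05 partition directories are scanned.
(spark.table("silver.orders_partitioned")
      .where("ingest_year = 2024 AND ingest_month = 5")
      .count())

# No pruning: the predicate is on a non-partition column,
# so files from all partitions may be scanned.
(spark.table("silver.orders_partitioned")
      .where("customer_id = 42")
      .count())
```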
Many thanks for your answer in advance!
Tables smaller than 1 TB are optimized automatically by built-in features, as mentioned in the article you shared.
Built-in features like ingestion time clustering automatically cluster the data in unpartitioned tables by ingestion time; that is why you are seeing the same write times.
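In other words, on recent Databricks Runtimes (11.2 and later) a plain unpartitioned write already gets this behaviour with no extra options; a minimal sketch (source path and table name are hypothetical):

```python
# Hypothetical batch of CDC changes.
batch = spark.read.format("delta").load("/mnt/bronze/changes")

# No partitionBy: the written files are still clustered by ingestion
# time automatically, which keeps small appends fast.
(batch.write
 .format("delta")
 .mode("append")
 .saveAsTable("silver.orders_flat"))
```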
Since you say the data is expected to grow, you can use liquid clustering, which manages the data layout automatically (neither too coarse nor too fine) and adapts as the table grows.
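A minimal sketch of enabling liquid clustering, assuming Databricks Runtime 13.3 LTS or later (table and column names are hypothetical):

```python
# Create a Delta table with liquid clustering instead of hive-style partitions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.orders_lc (
        order_id    BIGINT,
        customer_id BIGINT,
        ingest_ts   TIMESTAMP
    )
    CLUSTER BY (ingest_ts)
""")

# Clustering is applied incrementally; run OPTIMIZE periodically to cluster new data.
spark.sql("OPTIMIZE silver.orders_lc")
```

Depending on the runtime version, an existing unpartitioned Delta table can also be switched over with `ALTER TABLE ... CLUSTER BY (...)`, and the clustering keys can be changed later without rewriting the whole table.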