Tags: apache-spark, databricks, azure-databricks, delta-lake, hadoop-partitioning

Is it relevant to partition by Business Date and Ingest Date for a FACT table on Delta Lake?


I am working on a data engineering case where I have a table, Table_Movie, partitioned by ingest date. From time to time, I receive some old data, and I need to perform operations based on business date.

For example: today I received new data corresponding to the business date 12/05/2020:

  1. I need to delete from Table_Movie the data corresponding to the business date 12/05/2020
  2. Once the old data is deleted, I need to insert the new data corresponding to 12/05/2020 (see the sketch after this list).
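
A minimal sketch of that delete-and-insert step, assuming a Databricks notebook where `spark` is already defined, a hypothetical table path `/mnt/delta/Table_Movie`, and a `new_data` DataFrame holding the freshly received rows:

```python
from delta.tables import DeltaTable

# Hypothetical path for illustration; adjust to your environment.
table_path = "/mnt/delta/Table_Movie"

# 1. Delete the existing rows for the business date being reloaded.
target = DeltaTable.forPath(spark, table_path)
target.delete("business_date = '2020-05-12'")

# 2. Append the newly received data for that business date.
new_data.write.format("delta").mode("append").save(table_path)
```

Both steps can also be collapsed into a single atomic overwrite with the `replaceWhere` option (on recent Delta versions the predicate may reference arbitrary columns, not only partition columns):

```python
# Overwrite only the rows matching the predicate, in one transaction.
new_data.write.format("delta") \
    .mode("overwrite") \
    .option("replaceWhere", "business_date = '2020-05-12'") \
    .save(table_path)
```
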

Would it improve performance to partition by both ingest date and business date?

==> e.g. Table_Movie/ingestdate=20220812/business_date=20200512/
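
For concreteness, a sketch of the write that would produce that layout, assuming a DataFrame `df` with `ingest_date` and `business_date` columns (names and path are illustrative):

```python
# Two-level partitioning as considered in the question.
df.write.format("delta") \
  .partitionBy("ingest_date", "business_date") \
  .save("/mnt/delta/Table_Movie")
```
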

Note: we need to keep the partitioning by ingest date for other reasons.

Please let me know if you have any suggestions or guidance.


Solution

  • If you only occasionally receive old data, then partitioning by business date may produce very small files, and that will hurt performance. Delta on Databricks uses several optimization techniques to avoid reading unnecessary data. One of these techniques is Data Skipping: when writing data, Delta records the min and max values of indexed columns for each file, so when a query filters on an indexed column, the driver checks the predicate against these statistics and skips files that cannot contain matching data. Data Skipping is enabled by default for the first N columns (32 by default, configurable via the delta.dataSkippingNumIndexedColumns table property), so you need to make sure that the business date column is near the beginning of the column list.
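
A sketch of how this plays out in practice, assuming a Databricks/Delta environment where `spark` is defined; the path and column names are illustrative, and OPTIMIZE/ZORDER requires Databricks or a recent open-source Delta release:

```python
# Put business_date among the first N columns so Delta collects
# min/max statistics for it (N defaults to 32; adjustable via the
# delta.dataSkippingNumIndexedColumns table property).
other_cols = [c for c in df.columns
              if c not in ("business_date", "ingest_date")]
df.select("business_date", "ingest_date", *other_cols) \
  .write.format("delta") \
  .partitionBy("ingest_date") \
  .save("/mnt/delta/Table_Movie")

# Optionally co-locate rows with similar business_date values so
# data skipping prunes more files on business-date predicates.
spark.sql("OPTIMIZE delta.`/mnt/delta/Table_Movie` ZORDER BY (business_date)")

# A query filtering on business_date can now skip every file whose
# min/max range excludes the value, without a business_date partition.
spark.read.format("delta") \
     .load("/mnt/delta/Table_Movie") \
     .where("business_date = '2020-05-12'") \
     .show()
```

This way you keep the ingest-date partitioning you need while still pruning effectively on business date, without creating a second partition level full of tiny files.
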