Tags: amazon-web-services, amazon-redshift, aws-glue, aws-glue-data-catalog, amazon-redshift-spectrum

Partitioning by date on Glue: 1 date column vs 3 columns (year/month/day)?


I was wondering why, in the Glue/Athena/Redshift Spectrum documentation and workshops, all the partitioning examples on dates use 3 columns (year/month/day), which translates to something like s3://../year=2022/month=08/day=30/[filename].parquet on Amazon S3.

Here are some examples:

I created a data pipeline for data on the Ethereum Blockchain, and I started partitioning by creating one partition_date column in the %Y-%m-%d format based on the block timestamp which gives this on S3: s3://.../partition_date=2022-08-30/[filename].parquet.

Is there any advantage to splitting the date column you want to partition on into 3 partition columns for year, month and day?
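For reference, both key styles can be derived from the same timestamp. Here is a minimal plain-Python sketch (the `partition_keys` helper and the example Unix timestamp are illustrative, not part of any pipeline):

```python
from datetime import datetime, timezone

def partition_keys(block_timestamp: int) -> tuple[str, str]:
    """Build both S3 key prefixes from a Unix block timestamp (hypothetical helper)."""
    dt = datetime.fromtimestamp(block_timestamp, tz=timezone.utc)
    # single-column layout: one partition value in %Y-%m-%d format
    single = f"partition_date={dt:%Y-%m-%d}/"
    # three-column layout: year / month / day as separate partition columns
    triple = f"year={dt:%Y}/month={dt:%m}/day={dt:%d}/"
    return single, triple

single, triple = partition_keys(1661860800)  # 2022-08-30 12:00:00 UTC
print(single)  # partition_date=2022-08-30/
print(triple)  # year=2022/month=08/day=30/
```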


Solution

  • Yes, there is an advantage. If you want to filter on the year only, for example, Spark just needs to go into the path year=xyz/ and load the files beneath it. With your partitioning scheme, Spark needs to list all the partition_date=... paths and then filter out the ones that fulfil the condition.
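    The listing difference can be sketched in plain Python, without Spark (the paths below are made up for illustration): with 3 columns a year filter is a pure prefix match, while with 1 column every partition path has to be listed and its date value parsed first.

    ```python
    paths_triple = [
        "year=2021/month=12/day=31/f.parquet",
        "year=2022/month=01/day=01/f.parquet",
        "year=2022/month=08/day=30/f.parquet",
    ]
    paths_single = [
        "partition_date=2021-12-31/f.parquet",
        "partition_date=2022-01-01/f.parquet",
        "partition_date=2022-08-30/f.parquet",
    ]

    # 3-column layout: a year filter is a prefix listing -- only the
    # year=2022/ subtree is ever touched.
    pruned = [p for p in paths_triple if p.startswith("year=2022/")]

    # 1-column layout: every partition path must be listed, then the
    # date value parsed out of each one before it can be filtered.
    def year_of(path: str) -> str:
        value = path.split("=", 1)[1].split("/", 1)[0]  # e.g. "2022-08-30"
        return value[:4]

    filtered = [p for p in paths_single if year_of(p) == "2022"]

    print(len(pruned), len(filtered))  # both keep the same 2 files
    ```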

    But it also comes with disadvantages, mainly that the more granularly you partition, the more (and smaller) files you end up with. This can become a bottleneck, since the number of partitions multiplies with each additional partition level.
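    A back-of-the-envelope count shows how fast this grows. Assuming one partition per time bucket over a 5-year span (the schemes and span are illustrative):

    ```python
    DAYS = 5 * 365  # ignoring leap days for a rough estimate

    schemes = {
        "year": 5,
        "year/month": 5 * 12,
        "year/month/day": DAYS,
        "year/month/day/hour": DAYS * 24,
    }
    for name, count in schemes.items():
        print(f"{name}: {count} partitions")
    ```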