apache-spark databricks parquet delta-lake

Would Zordering a Delta Table affect performance if the table was later converted to a Parquet Table?


I am the owner of a Delta Table that some consumers would like a copy of as a Parquet Table. For various reasons, some people within my company will not use Delta. I have Zordered this Delta Table to improve read performance. If I make a copy of this table and convert it to Parquet (i.e. remove the delta log and vacuum logically removed files) will the converted Parquet Table still benefit from the original Zordering of the Delta Table? I have heard of 'row group filtering' in Parquet that I think would still benefit from the data clustering. But I don't know enough about how row group filtering works to confirm this.

Please ignore any side effect of compaction of files when performing Optimize Zorder. I know the Parquet Table will still benefit from the compaction of files but I am not sure about the ordering specifically.


Solution

  • You will likely still get some benefit. Parquet tracks min/max statistics for each column at both the row group level and the page level (a column chunk in Parquet is divided into multiple pages). The columns that were in the Z-order will typically have narrower min/max ranges within each row group, so if query predicates use those columns, readers are more likely to be able to prune row groups (and pages) thanks to the tighter bounds. There is some potential performance loss because more advanced filtering can be applied when data is known to be Z-ordered (I'm not sure whether Delta Lake actually does this in practice), and Parquet has no knowledge of the ordering. Also, row group/page filtering has higher overhead because each Parquet file must be opened and its metadata parsed before the filter can be applied, whereas Delta can skip whole files using the per-file statistics kept in its transaction log.
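To make the min/max argument concrete, here is a minimal pure-Python sketch of the idea. It is a toy model, not how any Parquet reader is actually implemented: the helper names (`morton_key`, `row_group_stats`, `groups_to_scan`), the two columns `x` and `y`, and the group size of 64 rows are all assumptions made for illustration. It sorts random rows by an interleaved-bit (Z-order) key, splits them into fixed-size "row groups", and counts how many groups a reader would still have to scan for an equality predicate when only per-group min/max values are available.

```python
import random


def morton_key(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y to get a Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def row_group_stats(rows, group_size):
    """Split rows into consecutive groups and record per-column (min, max)."""
    stats = []
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        xs = [r[0] for r in group]
        ys = [r[1] for r in group]
        stats.append({"x": (min(xs), max(xs)), "y": (min(ys), max(ys))})
    return stats


def groups_to_scan(stats, column, value):
    """Count groups whose [min, max] range could contain `value`."""
    return sum(1 for s in stats if s[column][0] <= value <= s[column][1])


# Hypothetical data: 4096 rows with two integer columns in [0, 256).
rng = random.Random(0)
rows = [(rng.randrange(256), rng.randrange(256)) for _ in range(4096)]
zordered = sorted(rows, key=lambda r: morton_key(r[0], r[1]))

unsorted_stats = row_group_stats(rows, 64)
zorder_stats = row_group_stats(zordered, 64)

for column in ("x", "y"):
    before = groups_to_scan(unsorted_stats, column, 100)
    after = groups_to_scan(zorder_stats, column, 100)
    print(f"{column} = 100: scan {before} of 64 groups unsorted, {after} z-ordered")
```

In this model the Z-ordered layout should leave far fewer candidate groups for a predicate on either column, which is the same effect the row order of the files would keep after the Delta log is removed. How much a real table benefits depends on the row group size, the data distribution, and whether the query engine has predicate pushdown enabled.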