apache-spark, optimization, databricks, delta-lake, data-lake

Does auto compaction break z-ordering?


Will auto compaction break existing z-ordered tables in Delta Lake? I'm curious what the recommended way is to combine optimized writes, auto compaction, and z-ordering for Spark performance.


Solution

  • Good question. I am doing the Certification study for Databricks as it so happens.

    They seem at odds. Short answer to your question is YES.

    Why?

    ZORDER, as in `%sql OPTIMIZE delta./mnt/delta/t1 ZORDER BY (c1, c2)`, is essentially clustering: it co-locates related data in the same files to speed up filter / WHERE operations.

    Auto Compaction, when enabled, coalesces many small files into fewer, larger files to improve performance and address the small-files problem. It does this without being cognisant of the ZORDER clustering, thus undoing the ZORDER benefits already achieved. NB: I have also read that it is most applicable to Structured Streaming workloads.
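
    As a sketch of how this plays out in practice (assuming a Databricks environment; the table name `t1` and columns `c1`, `c2` are placeholders, and the `delta.autoOptimize.*` property names are my assumption of the standard Databricks table properties):

    ```sql
    -- Enable optimized writes and auto compaction on a Delta table.
    -- Note: auto compaction rewrites small files without preserving
    -- Z-order clustering, so on a Z-ordered table you may prefer to
    -- leave autoCompact off.
    ALTER TABLE t1 SET TBLPROPERTIES (
      'delta.autoOptimize.optimizeWrite' = 'true',
      'delta.autoOptimize.autoCompact'   = 'true'
    );

    -- Re-cluster the data; because compaction does not maintain the
    -- Z-order, this command must be re-run periodically to restore it.
    OPTIMIZE t1 ZORDER BY (c1, c2);
    ```

    One pattern that follows from this is to keep optimized writes enabled, disable auto compaction on tables you Z-order, and instead schedule `OPTIMIZE ... ZORDER BY` as a regular maintenance job.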