Tags: scala, apache-spark, delta-lake, delta

deltaTable.generate("symlink_format_manifest") is not incremental


I am using Spark 2.4.5 and Scala 2.11.

I have a delta table set up on S3. In every run of my application, a new partition of the data is generated and appended.

df
.write
.format("delta")
.mode("append")
.save(deltaPath)

Once the partition is appended, the job also runs:

import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath(deltaPath)
deltaTable.generate("symlink_format_manifest")

The generate("symlink_format_manifest") call takes around 20 minutes of the job's 28-minute total runtime. I checked the generated files under _symlink_format_manifest/ and it seems that the manifests of all older partitions are rewritten every time. I confirmed this by checking the last-modified timestamps of the manifest files for the older partitions.
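The last-modified check described above could be scripted along these lines. This is only a minimal sketch: the object and method names (ManifestInspection, manifestModificationTimes) are made up for illustration, and it assumes each partition directory under _symlink_format_manifest/ holds a file named manifest and that the Hadoop client libraries are on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, not part of the Delta Lake API.
object ManifestInspection {

  // Returns (directory containing the manifest, modification time in epoch millis)
  // for every file named "manifest" below <deltaPath>/_symlink_format_manifest,
  // sorted from oldest to newest. Returns an empty list if the folder is missing.
  def manifestModificationTimes(deltaPath: String, hadoopConf: Configuration): Seq[(String, Long)] = {
    val manifestRoot = new Path(deltaPath, "_symlink_format_manifest")
    val fs = manifestRoot.getFileSystem(hadoopConf)
    val results = ArrayBuffer.empty[(String, Long)]

    if (fs.exists(manifestRoot)) {
      // Recursive listing, so nested partition columns are covered as well.
      val files = fs.listFiles(manifestRoot, true)
      while (files.hasNext) {
        val status = files.next()
        if (status.getPath.getName == "manifest") {
          results += ((status.getPath.getParent.toString, status.getModificationTime))
        }
      }
    }
    results.sortBy(_._2).toList
  }
}
```

With a running session it might be called as ManifestInspection.manifestModificationTimes(deltaPath, spark.sparkContext.hadoopConfiguration).foreach(println). If every old partition shows a timestamp from the latest run, the manifests are being rewritten in full.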

What do I need to change so that generate("symlink_format_manifest") only registers the new partition instead of rewriting all the previous ones every time?


Solution

  • This seems to happen because the _symlink_format_manifest folder contains partitions from the last 4 years, whereas my Delta table has vacuum running and therefore only keeps the last 30 days of partitions.

    The issue appears to be that vacuum does not clean up partitions from the _symlink_format_manifest folder. The same issue has been reported on GitHub: https://github.com/delta-io/delta/issues/443

    The suggested fix is to upgrade the Delta Lake version.

    Manual workaround:

    After deleting the _symlink_format_manifest folder manually, the next run of the job recreates it with only the required partitions and no longer spends the extra 20 minutes.
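The manual workaround could also be done from the job itself. The following is a rough sketch, not a tested fix: ManifestReset and resetManifest are hypothetical names, and it assumes the job's Hadoop configuration has delete permission on the table path. Note that between the delete and the regenerate step, any reader that relies on the manifest will find no manifest files.

```scala
import io.delta.tables.DeltaTable
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of the Delta Lake API.
object ManifestReset {

  // Removes the whole _symlink_format_manifest folder (including stale
  // partition directories left behind after vacuum) and regenerates it
  // from the current state of the Delta table.
  def resetManifest(spark: SparkSession, deltaPath: String): Unit = {
    val manifestDir = new Path(deltaPath, "_symlink_format_manifest")
    val fs = manifestDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

    if (fs.exists(manifestDir)) {
      fs.delete(manifestDir, true) // true = delete recursively
    }

    // Rebuild manifests for the partitions the table currently contains.
    DeltaTable.forPath(spark, deltaPath).generate("symlink_format_manifest")
  }
}
```

Calling ManifestReset.resetManifest(spark, deltaPath) once should be enough to clear the backlog. Running it on every job would rebuild all manifests each time, which is the cost the question is trying to avoid.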