I am using Spark 2.4.5 with Scala 2.11.
I have a Delta table on S3. On every run of my application, a new partition of data is generated and appended:
df.write
  .format("delta")
  .mode("append")
  .save(deltaPath)
Once the partition is appended, the application also runs:
import io.delta.tables.DeltaTable

val deltaTable = DeltaTable.forPath(deltaPath)
deltaTable.generate("symlink_format_manifest")
This generate("symlink_format_manifest") step takes around 20 minutes, while the total job time is 28 minutes. I checked the files generated under _symlink_format_manifest/ and it seems that all of the older partitions get rewritten every time. I confirmed this by checking the last-modified timestamps of the manifest files for the older partitions.
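For reference, this is roughly how I checked the timestamps (a sketch, assuming a SparkSession named spark is in scope; the variable names are just for illustration):
import org.apache.hadoop.fs.Path

// List every file under _symlink_format_manifest/ with its last-modified time.
val manifestDir = new Path(s"$deltaPath/_symlink_format_manifest")
val fs = manifestDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
val files = fs.listFiles(manifestDir, true) // recursive
while (files.hasNext) {
  val f = files.next()
  println(s"${f.getPath} lastModified=${new java.util.Date(f.getModificationTime)}")
}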
What do I need to change so that generate("symlink_format_manifest") only registers the new partition and does not re-update all the previous ones every time?
This seems to happen because the _symlink_format_manifest folder contains partitions from the last 4 years, while the Delta table itself has vacuum running and therefore only retains the last 30 days of partitions.
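For context, the vacuum on this table looks roughly like the following (a sketch; the 720-hour retention corresponds to the 30 days mentioned above):
import io.delta.tables.DeltaTable

// Remove files no longer referenced by the table and older than the retention window.
val deltaTable = DeltaTable.forPath(deltaPath)
deltaTable.vacuum(720) // retention in hours: 30 days * 24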
The issue appears to be that vacuum does not clean up partitions from the _symlink_format_manifest folder. The same issue is reported on GitHub: https://github.com/delta-io/delta/issues/443. The suggested fix there is to upgrade the Delta version.
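Upgrading usually just means bumping the delta-core dependency; the version below is purely illustrative and should be replaced with whichever release actually contains the fix from the linked issue:
// build.sbt (illustrative version; for Spark 2.4.x with Scala 2.11 the artifact is delta-core_2.11)
libraryDependencies += "io.delta" %% "delta-core" % "0.6.1"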
Manual workaround: after deleting the _symlink_format_manifest folder manually, the next run of the job recreates it with only the required partitions and no longer takes the extra 20 minutes.
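A minimal sketch of that cleanup, assuming the path is reachable through Hadoop's FileSystem API (e.g. an s3a:// path) and that a SparkSession named spark is available:
import org.apache.hadoop.fs.Path

// Delete the manifest folder; the next generate("symlink_format_manifest")
// rebuilds it with only the partitions still present in the table.
val manifestDir = new Path(s"$deltaPath/_symlink_format_manifest")
val fs = manifestDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.delete(manifestDir, true) // recursive delete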