Tags: delta-lake, change-data-capture

Delta Lake change data feed - delete, vacuum, read - java.io.FileNotFoundException


I used the following to write to Google Cloud Storage:

df.write.format("delta")
  .partitionBy("g", "p")
  .option("delta.enableChangeDataFeed", "true")
  .mode("append")
  .save(path)

I then appended data in versions 1, 2, 3, and 4, and deleted some of the data in version 5, roughly as sketched below.
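In outline (a minimal sketch; moreData and the delete predicate are placeholders, not the actual values):

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col

// versions 1-4: further appends to the same path (moreData is a stand-in DataFrame)
moreData.write.format("delta")
  .partitionBy("g", "p")
  .mode("append")
  .save(path)

// version 5: delete a subset of rows (hypothetical predicate)
val deltaTable = DeltaTable.forPath(spark, path)
deltaTable.delete(col("g") === "someValue")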

Then I ran:

deltaTable.vacuum(8)
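Note: since 8 hours is below Delta's default 168-hour retention, a call like this normally only runs after disabling the retention-duration check, along these lines:

// vacuuming with a retention below the default 168 hours is normally blocked
// by Delta's retention-duration safety check, so it has to be disabled first
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
deltaTable.vacuum(8)  // remove unreferenced files older than 8 hours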

Then I tried to read the change feed starting at version 3:

spark.read.format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 3)
  .load(path)

Caused by: java.io.FileNotFoundException: File not found: gs://xxx/yyy.snappy.parquet It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

I deleted the cluster and tried to read again. Same issue. Why is it looking for the vacuumed files?

I expected to see all the data inserted starting from version 3.


Solution

  • Adding the setting spark.sql.files.ignoreMissingFiles = true worked!
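A minimal sketch of applying that setting before re-reading the change feed (same path and starting version as above):

// skip data files that have been removed from storage instead of failing
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

val changes = spark.read.format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 3)
  .load(path)

The likely explanation: the delete in version 5 dropped some data files from the latest snapshot, and vacuum(8) then physically removed them, but the change feed read starting at version 3 still references those files to reconstruct the earlier inserts. With spark.sql.files.ignoreMissingFiles enabled, Spark skips the vacuumed files instead of throwing FileNotFoundException; be aware that any changes contained only in those files will be missing from the result.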