Tags: databricks, spark-structured-streaming, delta-lake

Is it safe to run VACUUM and DELETE against a Delta table while a Spark Structured Streaming query is ingesting data?


I've got a 24/7 Spark Structured Streaming query (Kafka as a source) that appends data to a Delta Table.

Is it safe to periodically run VACUUM and DELETE against the same Delta table from a different cluster while the first one is still processing incoming data?

The table is partitioned on date, and the DELETE will be done at the partition level.

P.S. The infrastructure runs on AWS.


Solution

  • If your streaming job is strictly append-only (no updates or merges), then it shouldn't have any conflicts with DELETE or VACUUM:
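As a concrete sketch of the maintenance side, the periodic job on the second cluster could issue a partition-level DELETE followed by a VACUUM. This is a minimal illustration assuming a partition column named `date`; the helper name and table name are hypothetical, and the statements would be passed to `spark.sql(...)` on the maintenance cluster:

```python
from datetime import date, timedelta


def build_maintenance_sql(table, retention_days, today=None):
    """Hypothetical helper: builds the partition-level DELETE and the
    VACUUM statements for a periodic maintenance job."""
    today = today or date.today()
    cutoff = today - timedelta(days=retention_days)
    # DELETE restricted to the partition column only drops whole date
    # partitions, so it does not rewrite the files the append-only
    # stream is currently adding.
    delete_sql = f"DELETE FROM {table} WHERE date < '{cutoff.isoformat()}'"
    # VACUUM with Delta's default retention (168 hours) physically
    # removes files no longer referenced by the current table version.
    vacuum_sql = f"VACUUM {table}"
    return delete_sql, vacuum_sql
```

Because the DELETE predicate touches only old date partitions and the stream only appends new files, the two transactions operate on disjoint files and Delta's optimistic concurrency control can commit both.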