I've got a 24/7 Spark Structured Streaming query (Kafka as a source) that appends data to a Delta Table.
Is it safe to periodically run `VACUUM` and `DELETE` against the same Delta Table from a different cluster while the first one is still processing incoming data?
The table is partitioned by date and the `DELETE` will be done at the partition level.
P.S. The infrastructure runs on AWS.
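
For context, the streaming side is essentially the following (a minimal PySpark sketch; the broker, topic, bucket, and checkpoint paths are placeholders, not my actual config):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

# Read the Kafka topic as a stream (broker and topic are placeholders).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Append-only write into a date-partitioned Delta table on S3.
query = (
    events
    .select(
        col("value").cast("string").alias("payload"),
        to_date(col("timestamp")).alias("date"),
    )
    .writeStream
    .format("delta")
    .outputMode("append")
    .partitionBy("date")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/events")
    .start("s3://my-bucket/tables/events")
)

query.awaitTermination()
```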
If your streaming job is really append-only, then it shouldn't have any conflicts:

- `DELETE` at the partition level can't conflict under the WriteSerializable isolation level (the default) as long as the write happens without reading the table first (i.e. an append-only workload).
- `VACUUM` only removes files that are no longer referenced in the latest table version (and that are older than the retention threshold), so it won't conflict with appends either.
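
For illustration, the maintenance job on the other cluster could look roughly like this (a sketch using the delta-spark Python API; the table path and the cutoff date are placeholders matching the hypothetical paths above, not anything from your setup):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-maintenance").getOrCreate()

# Same table the streaming job appends to (path is a placeholder).
table = DeltaTable.forPath(spark, "s3://my-bucket/tables/events")

# Partition-level DELETE: the predicate only selects whole date partitions,
# so it never rewrites the files the streaming job is currently appending.
table.delete("date < '2024-01-01'")  # placeholder cutoff date

# VACUUM: physically removes files that are no longer referenced by the
# latest table version and are older than the retention threshold
# (default 7 days), so freshly appended files are never touched.
table.vacuum()
```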