Tags: pyspark, azure-databricks, delta-lake, vacuum, tbl-properties

Vacuuming Delta tables in Databricks does not work


I'm trying to vacuum my Delta tables in Databricks, but somehow it is not working and I don't understand why. As a result, our storage usage is constantly increasing.

I have set the following table properties:

%sql
ALTER TABLE <table_name> SET TBLPROPERTIES 
("delta.deletedFileRetentionDuration" = "interval 2 hours");

%sql
ALTER TABLE <table_name> SET TBLPROPERTIES 
("delta.logRetentionDuration" = "interval 2 hours");

Then I run the following vacuum command in a Databricks notebook:

%sql
VACUUM db_name.table_name RETAIN 2 HOURS

or like this:

%sql
VACUUM db_name.table_name

The files that show up in DBFS as candidates for removal are still there after running these commands.
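One way to see which files VACUUM actually considers deletable, without removing anything, is a dry run (assuming the same table and retention window):

%sql
VACUUM db_name.table_name RETAIN 2 HOURS DRY RUN

If the dry run lists no files, nothing is eligible yet under the retention rules.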

Example of a remove entry in the _delta_log JSON:

{"remove":{"path":"year=2021/month=05/day=06/part-00001-52dd3cf7-9afc-46b0-9a03-7be3d1ee533e.c000.snappy.parquet","deletionTimestamp":1622798688231,"dataChange":true}}

I also added and deleted some data for testing purposes, because I read that the table needs to be modified before a vacuum can be executed successfully.

What am I missing here?


Solution

  • After some research I found out I was trying to remove log files with vacuum instead of old data files (more on that distinction below). However, I did find out why our storage usage in Databricks kept increasing: Databricks creates a locked container in the Azure storage account and saves cluster metrics/logs there. This accounted for 2 TB of the 3 TB we were using in a specific storage account.
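For completeness: VACUUM only removes data files that are no longer referenced by the table; the JSON and checkpoint files under _delta_log are cleaned up automatically when checkpoints are written, based on delta.logRetentionDuration, so no command needs to be run for them. To see how much storage the current table snapshot actually accounts for (as opposed to history, logs, or other container contents), something like DESCRIBE DETAIL can help:

%sql
DESCRIBE DETAIL db_name.table_name

The numFiles and sizeInBytes columns reflect the current version of the table; anything beyond that in the storage account comes from retained history, the transaction log, or, as in this case, the Databricks-managed container.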