I'm trying to vacuum my Delta tables in Databricks, but somehow it is not working and I don't understand why. As a result, our storage usage keeps increasing.
I have set the following table properties:
%sql
ALTER TABLE <table_name> SET TBLPROPERTIES
("delta.deletedFileRetentionDuration" = "interval 2 hours");
%sql
ALTER TABLE <table_name> SET TBLPROPERTIES
("delta.logRetentionDuration" = "interval 2 hours");
Then I run the following vacuum command in a Databricks notebook:
%sql
VACUUM db_name.table_name RETAIN 2 HOURS
or like this:
%sql
VACUUM db_name.table_name
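One thing worth checking (the exact command output isn't shown above): Delta refuses to vacuum with a retention window shorter than the default 7 days unless its safety check is disabled, and a DRY RUN only lists the candidate files without deleting anything. A sketch of that check, using the same placeholder table name:
%sql
-- The safety check rejects retention windows below 168 hours (7 days);
-- disable it only if no concurrent readers/writers need the older files.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
%sql
-- List the files that would be deleted, without removing them
VACUUM db_name.table_name RETAIN 2 HOURS DRY RUN;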
The files that show up in DBFS as candidates for removal are still there after running these commands.
An example remove entry from the _delta_log JSON:
{"remove":{"path":"year=2021/month=05/day=06/part-00001-52dd3cf7-9afc-46b0-9a03-7be3d1ee533e.c000.snappy.parquet","deletionTimestamp":1622798688231,"dataChange":true}
I also added and deleted some data for testing purposes, because I read that the table needs to be modified before VACUUM can run successfully.
What am I missing here?
After some research I found out I was trying to remove log files with VACUUM instead of old data files. However, I did find out why our storage usage in Databricks kept increasing: Databricks creates a locked container in the Azure storage account and saves cluster metrics/logs there. This accounted for 2 TB of the 3 TB we were using in that storage account.