I have a partitioned Delta table. I noticed that some partitions contain a huge number of tiny data files.
Here is a comparison between two partitions with very similar data.
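(For reference, a minimal sketch of how such a per-partition file count can be produced with dbutils.fs; the Partition_Date=... directory layout is an assumption based on the OPTIMIZE predicate below:)

base = "dbfs:/mnt/backup/bronze/DeviceData"
for part in dbutils.fs.ls(base):
    if part.name.startswith("Partition_Date="):
        # note: this lists everything on disk, including files no longer referenced by the Delta log
        files = [f for f in dbutils.fs.ls(part.path) if f.name.endswith(".parquet")]
        print(part.name, len(files), "files,", sum(f.size for f in files), "bytes")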
I then ran the following:
Running OPTIMIZE:
OPTIMIZE delta.`dbfs:/mnt/backup/bronze/DeviceData`
WHERE Partition_Date = '2023-01-19'
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|path |metrics |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|dbfs:/mnt/backup/bronze/DeviceData|{0, 0, {NULL, NULL, 0.0, 0, 0}, {NULL, NULL, 0.0, 0, 0}, 1, NULL, 0, 63, 63, true, 0, 0, 1705194826563, 1705194828732, 8, 0, NULL, {0, 0}, 7, 7, 0, 0, NULL}|
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
And after that:
Running VACUUM:
VACUUM delta.`dbfs:/mnt/backup/bronze/DeviceData` RETAIN 0 HOURS
Deleted 0 files and directories in a total of 653 directories.
+----------------------------------+
|path |
+----------------------------------+
|dbfs:/mnt/backup/bronze/DeviceData|
+----------------------------------+
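(Side note: VACUUM can only delete files that are no longer referenced by the Delta log, so after a no-op OPTIMIZE there is nothing for it to remove. Also, RETAIN 0 HOURS is only accepted once the retention safety check is disabled; a sketch, assuming a Python notebook:)

# Delta rejects a retention below the default 7 days unless this check is off
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM delta.`dbfs:/mnt/backup/bronze/DeviceData` RETAIN 0 HOURS")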
As the metrics already show (numFilesAdded = 0, numFilesRemoved = 0, and all 63 considered files skipped), OPTIMIZE does absolutely nothing related to bin-packing. I still have the same number of tiny files and no optimization.
I'm using the latest Databricks Runtime: 14.2 (includes Apache Spark 3.5.0, Scala 2.12).
What am I doing wrong?
--- UPDATE 1 ------------------------
@rainingdistros
No. When I sort the files of this partition by timestamp, I can see that nothing has been written since the data was first created. In other partitions the optimization worked (see the second screenshot).
--- UPDATE 2 ------------------------
Based on an old Databricks community post, I have now re-run the command with spark.databricks.delta.optimize.preserveInsertionOrder
disabled, and also printed out the result:
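(Roughly like this, as a Python sketch; extracting the metrics struct as JSON is just one way to print it:)

import json

spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", "false")
result = spark.sql("""
    OPTIMIZE delta.`dbfs:/mnt/backup/bronze/DeviceData`
    WHERE Partition_Date = '2023-01-19'
""")
# OPTIMIZE returns one row with a nested 'metrics' struct; pretty-print it
metrics = result.collect()[0].asDict(recursive=True)["metrics"]
print(json.dumps(metrics, indent=2, default=str))  # prints the JSON below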
{
"numFilesAdded": 70,
"numFilesRemoved": 4381,
"filesAdded": {
"min": 16899041,
"max": 613054015,
"avg": 303944954.8857143,
"totalFiles": 70,
"totalSize": 21276146842
},
"filesRemoved": {
"min": 159655,
"max": 9243024,
"avg": 4173938.1842045197,
"totalFiles": 4381,
"totalSize": 18286023185
},
"partitionsOptimized": 1,
"zOrderStats": {
"strategyName": "minCubeSize(107374182400)",
"inputCubeFiles": {
"num": 0,
"size": 0
},
"inputOtherFiles": {
"num": 4381,
"size": 18286023185
},
"inputNumCubes": 0,
"mergedFiles": {
"num": 4381,
"size": 18286023185
},
"numOutputCubes": 1,
"mergedNumCubes": null
},
"numBatches": 1,
"totalConsideredFiles": 4381,
"totalFilesSkipped": 0,
"preserveInsertionOrder": false,
"numFilesSkippedToReduceWriteAmplification": 0,
"numBytesSkippedToReduceWriteAmplification": 0,
"startTimeMs": 1705705885113,
"endTimeMs": 1705710361132,
"totalClusterParallelism": 4,
"totalScheduledTasks": 1,
"autoCompactParallelismStats": null,
"deletionVectorStats": {
"numDeletionVectorsRemoved": 0,
"numDeletionVectorRowsRemoved": 0
},
"numTableColumns": 7,
"numTableColumnsWithStats": 7,
"totalTaskExecutionTimeMs": 8150342
}
This option solved the problem:
spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", false)
The Spark configuration option spark.databricks.delta.optimize.preserveInsertionOrder
is specific to Delta Lake on Databricks. This option controls how the OPTIMIZE
command orders the data in a Delta table when it compacts small files into larger ones.
Here's a brief explanation of what this option does:
- Preserving insertion order (true): when set to true, the OPTIMIZE command preserves the original insertion order of the data in the table. This means that during file compaction, data is organized in the same order as it was originally written.
- Not preserving insertion order (false): when set to false, the OPTIMIZE command does not preserve the insertion order. Instead, it may reorder data based on other considerations, such as optimizing for query performance. For example, it may sort data based on certain columns if ZORDER BY is used in the OPTIMIZE command. This way, the OPTIMIZE command can organize data in a way that is more optimal for read access patterns.
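(For example, a sketch combining the setting with ZORDER; the DeviceId column is hypothetical and only illustrates the syntax:)

spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", "false")
# DeviceId is a hypothetical column name, used only to show the ZORDER BY syntax
spark.sql("""
    OPTIMIZE delta.`dbfs:/mnt/backup/bronze/DeviceData`
    WHERE Partition_Date = '2023-01-19'
    ZORDER BY (DeviceId)
""")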