databricks, delta-lake

Databricks OPTIMIZE does nothing


I have a partitioned Delta table. I noticed that some partitions contain a huge number of tiny data files.

[screenshot: file listing of a partition containing many tiny data files]

Here is a comparison between two partitions with very similar data.

[screenshot: size comparison of two partitions with very similar data]

I then ran OPTIMIZE:

OPTIMIZE delta.`dbfs:/mnt/backup/bronze/DeviceData`
WHERE Partition_Date = '2023-01-19'
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|path                              |metrics                                                                                                                                                     |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|dbfs:/mnt/backup/bronze/DeviceData|{0, 0, {NULL, NULL, 0.0, 0, 0}, {NULL, NULL, 0.0, 0, 0}, 1, NULL, 0, 63, 63, true, 0, 0, 1705194826563, 1705194828732, 8, 0, NULL, {0, 0}, 7, 7, 0, 0, NULL}|
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
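
For reference, OPTIMIZE returns a DataFrame with a path column and a metrics struct, so the result can also be inspected from a notebook cell. A minimal PySpark sketch, assuming the table path above:

# `spark` is the SparkSession predefined in Databricks notebooks
result = spark.sql("""
    OPTIMIZE delta.`dbfs:/mnt/backup/bronze/DeviceData`
    WHERE Partition_Date = '2023-01-19'
""")

# The metrics column comes back as a nested Row; asDict() exposes the field names
print(result.collect()[0]["metrics"].asDict())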

and after that, VACUUM:

VACUUM delta.`dbfs:/mnt/backup/bronze/DeviceData` RETAIN 0 HOURS
Deleted 0 files and directories in a total of 653 directories.
+----------------------------------+
|path                              |
+----------------------------------+
|dbfs:/mnt/backup/bronze/DeviceData|
+----------------------------------+
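
Side note for anyone reproducing this: Delta normally refuses retention periods shorter than the default 7 days, so RETAIN 0 HOURS only runs once the safety check has been disabled, along these lines:

# Required for RETAIN 0 HOURS, but risky: it can delete files that
# concurrent readers of older table versions still need
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

spark.sql("VACUUM delta.`dbfs:/mnt/backup/bronze/DeviceData` RETAIN 0 HOURS")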

As you can already see in the metrics, it does absolutely nothing related to bin-packing: numFilesAdded and numFilesRemoved (the first two fields of the metrics struct) are both 0, and I still have the same number of tiny files.

I'm using the latest Databricks runtime: 14.2 (includes Apache Spark 3.5.0, Scala 2.12).

What am I doing wrong?

--- UPDATE 1 ------------------------

@rainingdistros

No; when I order the files of this partition by timestamp, nothing has been written since the data was first created. In other partitions the optimization worked (see the second screenshot).

[screenshot: files of the affected partition ordered by timestamp]

--- UPDATE 2 ------------------------

Based on an old Databricks community post, I then reran the command with spark.databricks.delta.optimize.preserveInsertionOrder disabled, and also printed out the result.
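
Roughly like this (a sketch; the exact notebook cell isn't in the post):

import json

spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", "false")

result = spark.sql("""
    OPTIMIZE delta.`dbfs:/mnt/backup/bronze/DeviceData`
    WHERE Partition_Date = '2023-01-19'
""")

# Recursively convert the metrics Row to a dict and pretty-print it as JSON
print(json.dumps(result.collect()[0]["metrics"].asDict(recursive=True), indent=4))

This time the metrics show real work being done: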

{
    "numFilesAdded": 70,
    "numFilesRemoved": 4381,
    "filesAdded": {
        "min": 16899041,
        "max": 613054015,
        "avg": 303944954.8857143,
        "totalFiles": 70,
        "totalSize": 21276146842
    },
    "filesRemoved": {
        "min": 159655,
        "max": 9243024,
        "avg": 4173938.1842045197,
        "totalFiles": 4381,
        "totalSize": 18286023185
    },
    "partitionsOptimized": 1,
    "zOrderStats": {
        "strategyName": "minCubeSize(107374182400)",
        "inputCubeFiles": {
            "num": 0,
            "size": 0
        },
        "inputOtherFiles": {
            "num": 4381,
            "size": 18286023185
        },
        "inputNumCubes": 0,
        "mergedFiles": {
            "num": 4381,
            "size": 18286023185
        },
        "numOutputCubes": 1,
        "mergedNumCubes": null
    },
    "numBatches": 1,
    "totalConsideredFiles": 4381,
    "totalFilesSkipped": 0,
    "preserveInsertionOrder": false,
    "numFilesSkippedToReduceWriteAmplification": 0,
    "numBytesSkippedToReduceWriteAmplification": 0,
    "startTimeMs": 1705705885113,
    "endTimeMs": 1705710361132,
    "totalClusterParallelism": 4,
    "totalScheduledTasks": 1,
    "autoCompactParallelismStats": null,
    "deletionVectorStats": {
        "numDeletionVectorsRemoved": 0,
        "numDeletionVectorRowsRemoved": 0
    },
    "numTableColumns": 7,
    "numTableColumnsWithStats": 7,
    "totalTaskExecutionTimeMs": 8150342
}
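
So with preserveInsertionOrder disabled, OPTIMIZE compacted 4,381 tiny files (avg ~4 MB, ~18.3 GB total) into 70 files (avg ~304 MB, ~21.3 GB total) in that single partition.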

Solution

  • This option solved the problem:

    spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", false)
    

    The Spark configuration option spark.databricks.delta.optimize.preserveInsertionOrder is specific to Delta Lake on Databricks. This option controls how the OPTIMIZE command orders the data in a Delta table when it compacts small files into larger ones.

    Here's a brief explanation of what this option does: