Setup
spark-submit conf args (not all of them, only the ones related to performance):
--num-executors 100
--executor-memory 64g
--conf spark.executor.memoryOverhead=54g
--executor-cores 15
--conf spark.default.parallelism=1500
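(For reference, a minimal sketch of the same settings expressed as Spark properties, for the case where the driver program builds its own SparkSession; the flag-to-property mapping, e.g. --num-executors to spark.executor.instances, is standard Spark, and the values simply mirror the flags above.)

import org.apache.spark.sql.SparkSession

// Sketch only: the same performance-related settings as the spark-submit flags above.
val spark = SparkSession.builder()
  .config("spark.executor.instances", "100")      // --num-executors
  .config("spark.executor.memory", "64g")         // --executor-memory
  .config("spark.executor.memoryOverhead", "54g")
  .config("spark.executor.cores", "15")           // --executor-cores
  .config("spark.default.parallelism", "1500")
  .getOrCreate()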
Command I execute
OPTIMIZE delta.`<path>` WHERE partition_by_hour_column between <...> and <...>
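(As an aside, the same compaction can be triggered through the Scala API if you are on Delta Lake 2.0 or later; the call below is just a sketch that keeps the placeholders from the SQL command above.)

import io.delta.tables.DeltaTable

// Sketch: equivalent of the SQL OPTIMIZE ... WHERE ... command above.
DeltaTable.forPath(spark, "<path>")
  .optimize()
  .where("partition_by_hour_column between <...> and <...>")
  .executeCompaction()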
Here is the problem: when I check the Spark UI, there are never more than 14-15 parallel jobs, despite the fact that I have 100 executors.
Question: how do I increase the parallelism? Even if it cannot reach 1500, I would like at least one job per executor.
There is a configuration that allows more threads to be used for OPTIMIZE, defined in Delta Lake's DeltaSQLConf:
buildConf("optimize.maxThreads")
  .internal()
  .doc(
    """
      |Maximum number of parallel jobs allowed in OPTIMIZE command. Increasing the maximum
      | parallel jobs allows the OPTIMIZE command to run faster, but increases the job
      | management on the Spark driver side.
      |""".stripMargin)
  .intConf
  .checkValue(_ > 0, "'optimize.maxThreads' must be positive.")
  .createWithDefault(15)
So in my case it should be an additional option passed to Spark (the key gets the spark.databricks.delta prefix):
--conf spark.databricks.delta.optimize.maxThreads=100
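If restarting the job is not convenient, the same property can also be set on a running session before issuing the command, since it is a regular Delta SQL conf; a minimal sketch, assuming the session already has Delta Lake's SQL extensions enabled:

// Sketch: raise the OPTIMIZE thread limit for this session, then run the command.
spark.conf.set("spark.databricks.delta.optimize.maxThreads", "100")
spark.sql("OPTIMIZE delta.`<path>` WHERE partition_by_hour_column between <...> and <...>")

Keep in mind the trade-off stated in the doc string above: more parallel OPTIMIZE jobs means more job management overhead on the driver.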