I'm running a Python job in Azure Databricks that reads from a Delta Lake table and writes to a SQL Server database. I have 4 workers (each with 4 cores and 32 GB of RAM), and I repartition to 16 to create 16 tasks:
df = df.repartition(16)
While it runs, I see 16 tasks spread over the 4 workers, 4 tasks per worker. The job is mainly I/O-bound, which is why CPU usage on each worker is low (around 20%).
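For reference, a simplified sketch of the kind of pipeline I'm describing (the Delta path, JDBC URL, table name and credentials below are placeholders, and I'm showing the generic JDBC writer with the Microsoft SQL Server driver as one way to do the write):

# Simplified sketch; `spark` is the session Databricks provides in the notebook.
# All paths and connection details are placeholders.
df = spark.read.format("delta").load("/mnt/delta/source_table")
df = df.repartition(16)
(df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=mydb")
    .option("dbtable", "dbo.target_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .mode("append")
    .save())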
The default in Spark is one core per task (spark.task.cpus = 1).
I tried to increase the number of concurrent tasks per worker by adding this to the Spark cluster config:
spark.task.cpus 0.25
But that doesn't work: the value must be an integer, so 1 is the lowest allowed value.
So, how can I run 16 tasks in parallel on a worker with 4 cores? I assume there must be a way, either by running more than one task per core or by launching more than one executor per worker (each with 4 tasks). That would let me run 64 tasks in parallel across my 4 workers.
You can set the environment variable SPARK_WORKER_CORES, which makes Spark assume the worker has that number of cores regardless of how many it actually has.
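For example, on a Spark standalone worker you would set it in conf/spark-env.sh before the worker starts (or, if your Databricks setup lets you, as a cluster environment variable); the value 16 below is just an illustration:

# conf/spark-env.sh on each worker (illustrative value):
# advertise 16 cores to the scheduler even though the VM only has 4,
# so each worker gets 16 task slots (4 workers x 16 = 64 slots in total)
SPARK_WORKER_CORES=16

With 64 slots available you could repartition the DataFrame to 64 instead of 16, and all of them should run in parallel as long as the job really is I/O-bound.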