apache-spark pyspark databricks azure-databricks

How to run more tasks than cores in a worker?


I'm running a Python job in Azure Databricks that reads from a Delta Lake table and writes to a SQL Server database. I have 4 workers available (each with 4 cores and 32 GB), and I repartition to 16 to create 16 tasks.

df = df.repartition(16)

While it runs, I see 16 tasks spread over the 4 workers, 4 tasks per worker. The job is mainly I/O-bound, which is why CPU usage on every worker is low (around 20%).
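For context, the job looks roughly like this (just a sketch; the Delta path, JDBC URL, table name and credentials are placeholders, not my real values):

# Read from Delta, fan out to 16 partitions, write to SQL Server over JDBC
df = spark.read.format("delta").load("/mnt/delta/source_table")
df = df.repartition(16)  # 16 partitions -> 16 write tasks
(df.write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
   .option("dbtable", "dbo.target_table")
   .option("user", "<user>")
   .option("password", "<password>")
   .mode("append")
   .save())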

The default configuration in Spark is 1 task per core (spark.task.cpus 1).

I tried to maximise the number of tasks per worker by setting this in the cluster's Spark config:

spark.task.cpus 0.25

But that isn't possible: the value must be an integer, so 1 is the lowest allowed value.

So, how can I run 16 tasks in parallel on a worker with 4 cores? I assume there must be a way, either by running more than 1 task per core or by launching more than 1 executor per worker (each with 4 tasks). That way I would parallelise 64 tasks across my 4 workers.


Solution

  • You can set the environment variable SPARK_WORKER_CORES, which makes Spark assume the worker has that many cores regardless of how many it actually has, for example as shown below.
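A minimal sketch, assuming a standalone-style worker that reads conf/spark-env.sh (on Databricks the equivalent place would be the cluster's environment-variables settings); the value 16 is just the 4-tasks-per-core target from the question:

# Advertise 16 scheduling slots even though the VM has 4 physical cores,
# so up to 16 tasks can run concurrently on this worker.
export SPARK_WORKER_CORES=16

Oversubscribing cores like this is generally acceptable for I/O-bound jobs such as a JDBC write, since the tasks spend most of their time waiting rather than using the CPU.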