apache-spark · databricks

Why independent Actions in a Spark App / Notebook do not run in parallel by default


In the past I was aware that Actions in a Notebook or Spark App run sequentially - even completely independent Actions. That is what I thought until someone stated on SO - I cannot find the post anymore - that independent Actions can run in parallel in the same Spark App. That is not the case, although I have at times seen Stages running in parallel.

I just re-tested this in a Databricks cluster Notebook, as follows.

  1. Two counts on two independently created RDDs. From the Stages tab I can see the submits occur in short succession, so there is no parallel processing; it runs sequentially. (A sketch of this test follows the list.)

  2. Two Delta saveAsTable writes - they also run sequentially, although all logic is again completely independent.
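
Here is a minimal sketch of test 1, assuming the notebook-provided SparkContext sc (the saveAsTable variant behaves the same way):

    // Two completely independent RDDs, two count() Actions.
    // Each count() is a blocking call on the driver thread, so Job 2 is
    // only submitted to the scheduler after Job 1 has fully completed.
    val rdd1 = sc.parallelize(1 to 10000000)
    val rdd2 = sc.parallelize(1 to 10000000)

    val c1 = rdd1.count()  // Job 1: the driver blocks here
    val c2 = rdd2.count()  // Job 2: submitted only after Job 1 returns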

So why does Spark's DAG processing - given that evaluation is lazy - not see that there are 2 independent Actions to run and allow for .par-like processing internally?

Is there a parameter for this? No. ChatGPT - dare I mention it - after much pushing, states that if there are enough resources Spark can decide to run them in parallel.
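
For what it is worth, the closest knob I know of does not change this either. A sketch, again assuming the notebook-provided sc; note that spark.scheduler.mode is read once at SparkContext creation, so on Databricks it has to be set in the cluster's Spark config:

    // FAIR mode only governs how Jobs that are *already submitted
    // concurrently* (i.e. from different driver threads) share executors;
    // it never turns two sequential Action calls into parallel Jobs.
    println(sc.getConf.get("spark.scheduler.mode", "FIFO"))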

It seems a little odd that Futures or .par are needed for completely independent Actions in the same Spark App. Sure, Databricks Workflow Job Tasks can assist here, but still.


Solution

  • From my readings in preparation for Spark certification - making use of idle time between assignments:

    The DAG Scheduler is a single-threaded event loop on the driver, and every Action is a blocking call, so from a single driver thread it only ever caters for one Job at a time - not so concurrent.

    You can of course use Threads, Futures or other techniques to get parallelism - as sketched below - or just submit multiple Spark Apps.
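
    A minimal sketch with Scala Futures, again assuming the notebook-provided sc:

        import scala.concurrent.{Await, Future}
        import scala.concurrent.ExecutionContext.Implicits.global
        import scala.concurrent.duration.Duration

        // Each Future submits its Action from its own driver thread, so
        // the scheduler receives both Jobs and can run them concurrently,
        // resources permitting.
        val f1 = Future { sc.parallelize(1 to 10000000).count() }
        val f2 = Future { sc.parallelize(1 to 10000000).count() }

        // Block the main thread until both Jobs have finished.
        val total = Await.result(f1, Duration.Inf) + Await.result(f2, Duration.Inf)

    A parallel collection gives the same effect - e.g. Seq(rdd1, rdd2).par.map(_.count()) submits the counts from the pool's worker threads (.par is built in on Scala 2.12; 2.13 needs the scala-parallel-collections module). Either way, the concurrency comes from your driver code, not from Spark.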