I used to believe that Actions in a Notebook or Spark App always run sequentially, even completely independent Actions. That is what I thought until someone stated on SO - I cannot find the post anymore - that independent Actions can run in parallel within the same Spark App. That is not the case, although I have seen Stages running in parallel at times.
I just re-tested this in a single Databricks Cluster Notebook, as follows (a minimal sketch follows this list):

- 2 counts on 2 independently created RDDs: from the Stages tab I can see the submits occur in short succession, so there is no parallel processing; they run sequentially.
- 2 Delta saveAsTable writes: they also run sequentially, and again all the logic is independent.
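A minimal sketch of the test, assuming a Databricks notebook where `spark` and `sc` are already defined (RDD sizes and table names are illustrative only):

```scala
// Two RDDs with no shared lineage.
val rdd1 = sc.parallelize(1 to 10000000)
val rdd2 = sc.parallelize(1 to 10000000)

// Two independent Actions, invoked one after the other on the driver thread.
// In the Spark UI they show up as two Jobs submitted in quick succession,
// the second only starting once the first has finished.
val c1 = rdd1.count()
val c2 = rdd2.count()

// Same pattern with two independent Delta writes.
spark.range(10000000).write.format("delta").mode("overwrite").saveAsTable("tbl_a")
spark.range(10000000).write.format("delta").mode("overwrite").saveAsTable("tbl_b")
```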
So, why does Spark's DAG processing not see - given that everything is lazily evaluated - that there are 2 independent Actions to run, and allow for .par-like processing internally?
Is there a parameter for this? No. ChatGPT - dare I mention it - after much pushing, states that if there are enough resources it can make the decision to run in parallel.
It seems a little odd that Futures or .par are needed for completely independent Actions in the same Spark App. Sure, Databricks Workflow Job Tasks can assist here, but still.
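For reference, a minimal Futures sketch of what I mean, assuming the two independent RDDs from the test above; the execution context and timeout are arbitrary choices:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each Future submits its Action from its own driver-side thread, so the two
// Jobs can be active at the same time and share the cluster's executors.
val f1 = Future { rdd1.count() }
val f2 = Future { rdd2.count() }

val count1 = Await.result(f1, Duration.Inf)
val count2 = Await.result(f2, Duration.Inf)
```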
From my reading in preparation for Spark certification (thanks to idle time between assignments): the DAG Scheduler is a single-threaded process, which implies it only caters for a single Job at a time - not so concurrent.
You can of course use Threads and other techniques to get parallelism, or just submit multiple Spark Apps.
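As an illustration only, a .par sketch of the same idea, assuming a Scala version where parallel collections are in the standard library (e.g. 2.12, as on current Databricks runtimes):

```scala
// Each element of the parallel sequence triggers its Action on a separate
// driver thread, so the two Jobs are submitted concurrently.
val counts = Seq(rdd1, rdd2).par.map(_.count())
```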