python · apache-spark · databricks

Non-Spark code (plain Python) run on a Spark cluster


As per the doc (https://docs.databricks.com/en/optimizations/spark-ui-guide/spark-job-gaps.html), any execution of code that is not Spark will show up in the timeline as gaps. For example, you could have a loop in Python that calls native Python functions. This code is not executing in Spark, and it can show up as a gap in the timeline.

Where does this non-Spark code (plain Python) run? Is it on a worker or on the driver?


Solution

  • The driver node is responsible for orchestrating the application, so native Python code that is not tied directly to a Spark operation runs on the driver node. Because this code does not run in parallel on the worker nodes, it shows up as a gap in the Spark timeline.
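A minimal sketch of the distinction, with hypothetical function names for illustration: `driver_side` is a plain Python loop that executes only on the driver (and would appear as a gap between Spark jobs in the timeline), while `spark_side` pushes the same work to the executors as a Spark job via `parallelize`/`map`. The `local[2]` session is an assumption for running outside Databricks; on Databricks a `spark` session already exists.

```python
def cpu_heavy(x):
    # Plain Python work; where it runs depends on how it is invoked.
    return x * x

def driver_side(values):
    # A native Python loop: executes entirely on the driver node.
    # No Spark job is launched, so this time shows up as a gap
    # in the Spark UI timeline.
    return [cpu_heavy(v) for v in values]

def spark_side(spark, values):
    # The same work distributed to the executors: parallelize the
    # input into an RDD and map the function over its partitions.
    # This appears as a Spark job/stage in the timeline, not a gap.
    return spark.sparkContext.parallelize(values).map(cpu_heavy).collect()

if __name__ == "__main__":
    # Local session for illustration only (assumption: pyspark installed).
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[2]").getOrCreate()
    print(driver_side([1, 2, 3]))        # runs on the driver
    print(spark_side(spark, [1, 2, 3]))  # runs on the executors
    spark.stop()
```

Moving heavy loops like `driver_side` into a distributed form (an RDD `map`, a pandas UDF, or a DataFrame operation) is the usual way to close such gaps, since only then does the cluster's parallelism apply.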