apache-spark

Do stages in an application run in parallel in Spark?


I have a question about how stages execute in a Spark application. Is the execution order of stages something that can be defined by the programmer, or is it derived by the Spark engine?


Solution

  • Check the entities (stages, partitions) in this pic:

    [DAG diagram: partitions of Stage 1 and Stage 2 feed into Stage 0; pink lines mark the shuffle boundaries between stages]

    pic credits

    Do stages in a job (Spark application?) run in parallel in Spark?

    Yes, they can be executed in parallel as long as there is no sequential dependency between them.

    Here the partitions of Stage 1 and Stage 2 can be executed in parallel, but not those of Stage 0, because Stage 0 depends on the others: the partitions in Stages 1 and 2 have to be processed first (the first sketch below produces a DAG of this shape).

    Is the execution order of stages something that can be defined by the programmer, or is it derived by the Spark engine?

    A stage boundary is defined by where data shuffling happens among partitions (check the pink lines in the pic). The stage graph, and therefore the execution order, is derived by the Spark engine from the DAG of transformations; the programmer only influences it indirectly through which shuffle-producing transformations are used (see the second sketch below).
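
Here is a minimal sketch of a job that yields the DAG shape in the pic: two independent lineages feeding a join, so the two upstream stages have no dependency on each other and can run in parallel. The app name, master setting, and data are illustrative, not from the question:

```scala
import org.apache.spark.sql.SparkSession

object ParallelStages {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parallel-stages-sketch") // illustrative name
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Two independent lineages: each becomes its own stage up to the shuffle.
    val left  = sc.parallelize(1 to 1000).map(i => (i % 10, i))     // like Stage 1: no dependency on `right`
    val right = sc.parallelize(1 to 1000).map(i => (i % 10, i * 2)) // like Stage 2: no dependency on `left`

    // The join forces a shuffle; the stage that reads the shuffled data
    // (Stage 0 in the pic's numbering) cannot start until both inputs finish.
    val joined = left.join(right)

    println(joined.count()) // triggers the job; the Spark UI shows the stage DAG
    spark.stop()
  }
}
```

With enough executor cores, the Spark UI will show the two upstream stages active at the same time, while the post-join stage waits for both.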
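And a second sketch showing how a single shuffle-producing transformation (`reduceByKey` here) introduces a stage boundary; the sample data is made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object StageBoundary {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stage-boundary-sketch") // illustrative name
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.parallelize(Seq("a b", "b c", "a c"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _) // shuffle: everything before this is one stage,
                          // the reduce side belongs to the next stage

    // toDebugString prints the lineage; the ShuffledRDD and the indentation
    // shift mark where the stage boundary falls.
    println(counts.toDebugString)
    println(counts.collect().mkString(", "))
    spark.stop()
  }
}
```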