apache-spark, spark-launcher

Unexpected State Transitions with SparkAppHandle.Listener and SparkLauncher


I'm using SparkAppHandle.Listener to monitor the state of a PySpark application submitted with SparkLauncher. When the application fails, I expect the following state transitions:

Connected -> Submitting -> Running -> Failed

However, the actual state transitions I observe are:

Connected -> Submitting -> Running -> **Finished** -> Failed

Additionally, when I submit a pure Python script, it immediately transitions to the Lost state.

Questions:

  • Is it expected to see a Finished state before a Failed state? Under what conditions could this happen?
  • Why does a pure Python script immediately result in a Lost state? What should I check in my script or cluster configuration to resolve this?

I have implemented a listener using SparkAppHandle.Listener to capture and print state changes of the Spark job. I also reviewed the Spark logs to understand the sequence of events leading to the state transitions.
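
For reference, a minimal sketch of that kind of listener (the script path, master, and deploy mode below are placeholders, not my actual configuration):

```java
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LauncherStateMonitor {
    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
                // Placeholder values; substitute the real script, master and mode.
                .setAppResource("/path/to/job.py")
                .setMaster("yarn")
                .setDeployMode("client")
                .startApplication(new SparkAppHandle.Listener() {
                    @Override
                    public void stateChanged(SparkAppHandle h) {
                        // Print every transition so the full sequence is visible.
                        System.out.println("State changed: " + h.getState());
                    }

                    @Override
                    public void infoChanged(SparkAppHandle h) {
                        System.out.println("App id: " + h.getAppId());
                    }
                });

        // Keep the JVM alive until the launcher reports a state it considers final.
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
        System.out.println("Last observed state: " + handle.getState());
    }
}
```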


Solution

  • Yes, it is expected to see a FINISHED state before FAILED in some cases. For example, when running Spark on YARN in client mode, the job monitoring loop ends before the final application master (AM) state is known, so a FINISHED state is reported first: lacking knowledge of the actual AM state, the monitor assumes the job has finished normally. Then, once the YARN job has finished, the YARN client in the launcher polls the final AM state from the job report and sends a new state update to the launch server (the part of SparkLauncher that listens for various events from the Spark application, including state changes). In other words, the handle can still receive a later FAILED update after reporting FINISHED, so the launcher side should not treat the first final state as conclusive (see the sketch at the end of this answer).

    A pure Python script that never actually starts a Spark application doesn't talk to the launch server before exiting, so the handle has nothing to report and the state becomes LOST.
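
Given that behaviour, one rough way to make the launcher side robust is to record the latest reported state and allow a short grace period after the first final state, so a later FAILED can overwrite an initial FINISHED. This is only a sketch under that assumption; the script path and the 10-second grace period are arbitrary placeholders:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class RobustLauncherMonitor {
    public static void main(String[] args) throws Exception {
        AtomicReference<SparkAppHandle.State> lastState = new AtomicReference<>();
        CountDownLatch firstFinalState = new CountDownLatch(1);

        SparkAppHandle handle = new SparkLauncher()
                .setAppResource("/path/to/job.py")   // placeholder
                .setMaster("yarn")
                .setDeployMode("client")
                .startApplication(new SparkAppHandle.Listener() {
                    @Override
                    public void stateChanged(SparkAppHandle h) {
                        SparkAppHandle.State state = h.getState();
                        System.out.println("State changed: " + state);
                        lastState.set(state);
                        if (state.isFinal()) {
                            firstFinalState.countDown();
                        }
                    }

                    @Override
                    public void infoChanged(SparkAppHandle h) { }
                });

        // Wait for the first final state...
        firstFinalState.await();
        // ...then allow a grace period in which a later correction, such as
        // the FAILED that may follow an initial FINISHED, can still arrive.
        TimeUnit.SECONDS.sleep(10);

        SparkAppHandle.State outcome = lastState.get();
        System.out.println("Final outcome: " + outcome);
        if (outcome != SparkAppHandle.State.FINISHED) {
            System.exit(1);
        }
    }
}
```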