Tags: hadoop, apache-spark, hadoop-yarn, failover, apache-spark-standalone

Understand Spark: Cluster Manager, Master and Driver nodes


Having read this question, I would like to ask additional questions:

  1. The Cluster Manager is a long-running service; on which node does it run?
  2. Is it possible for the Master and the Driver nodes to be the same machine? I presume there is a rule somewhere stating that these two nodes should be different?
  3. In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly, i.e. how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?
  4. Similarly to the previous question: in the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?

Solution

    1. The Cluster Manager is a long-running service; on which node does it run?

    In Spark standalone mode, the Cluster Manager is the Master process. It can be started on any node with ./sbin/start-master.sh. On YARN, the Cluster Manager is the ResourceManager.
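As a minimal sketch of standing up a standalone cluster (paths are relative to the Spark installation, and master-host is a placeholder; on Spark 2.x the worker script is named start-slave.sh):

```shell
# On the node chosen to act as the Cluster Manager, start the standalone Master:
./sbin/start-master.sh

# On each worker node, start a Worker and register it with that Master
# (spark://master-host:7077 is the default Master URL):
./sbin/start-worker.sh spark://master-host:7077
```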

    2. Is it possible for the Master and the Driver nodes to be the same machine? I presume there is a rule somewhere stating that these two nodes should be different?

    The Master is per cluster, and the Driver is per application. For standalone and YARN clusters, Spark currently supports two deploy modes.

    1. In client mode, the driver is launched in the same process as the client that submits the application.
    2. In cluster mode, the driver is launched on one of the Worker nodes (for standalone) or inside the Application Master (for YARN), and the client process exits as soon as it fulfils its responsibility of submitting the application, without waiting for the application to finish.

    If an application is submitted with --deploy-mode client from the Master node, both the Master and the Driver run on the same machine. See the deployment of a Spark application over YARN for details.
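The two deploy modes can be compared with spark-submit; com.example.App and app.jar below are placeholder names:

```shell
# Client mode: the driver runs inside the spark-submit process itself,
# on whatever machine the command is run on
./bin/spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --class com.example.App app.jar

# Cluster mode: the driver is launched on a Worker (standalone) or inside
# the YARN ApplicationMaster, and spark-submit returns once the
# application has been accepted
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.App app.jar
```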

    3. In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly? i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?

    If the driver fails, all executors and their running tasks for that Spark application are killed. Whether the application is re-launched depends on how it was submitted: the standalone Master can restart a supervised driver, and on YARN the ResourceManager can re-attempt the application.
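In standalone cluster mode, the Master re-launches a failed driver if the application was submitted with --supervise; on YARN, spark.yarn.maxAppAttempts controls how many times the ResourceManager will re-attempt the ApplicationMaster (and with it the driver, in cluster mode). A sketch, again with com.example.App and app.jar as placeholders:

```shell
# Standalone cluster mode: the Master restarts the driver automatically
# if it exits with a non-zero exit code
./bin/spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.App app.jar

# YARN cluster mode: allow up to two attempts of the ApplicationMaster
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=2 \
  --class com.example.App app.jar
```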

    4. In the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?

    Master node failures are handled in two ways.

    1. Standby Masters with ZooKeeper:

      Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected “leader” and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master’s state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling new applications – applications that were already running during Master failover are unaffected. See the Spark standalone documentation for the relevant configurations.
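The ZooKeeper recovery mode is enabled through SPARK_DAEMON_JAVA_OPTS on every machine that may run a Master; the ZooKeeper addresses and directory below are illustrative values:

```shell
# conf/spark-env.sh on each Master candidate
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```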

    2. Single-Node Recovery with Local File System:

      ZooKeeper is the best way to go for production-level high availability, but if you just want to be able to restart the Master if it goes down, FILESYSTEM mode can take care of it. When applications and Workers register, they have enough state written to the provided directory so that they can be recovered upon a restart of the Master process. See the Spark standalone documentation for configuration details.
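FILESYSTEM mode is configured the same way; the recovery directory below is an example path:

```shell
# conf/spark-env.sh on the Master node
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
  -Dspark.deploy.recoveryDirectory=/var/spark/recovery"
```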