h2o sparkling-water

Are the nodes in H2O Sparkling Water preemptible?


I am running Sparkling Water over 36 Spark executors. Due to YARN's scheduling, some executors get preempted and come back later. Overall, there are 36 executors for the majority of the time, just not always.

So far, my experience is that as soon as one executor fails, the entire H2O instance halts, even if the missing executor comes back to life later. Is this how Sparkling Water is meant to behave, or does some preemption-handling capability need to be turned on?

Does anyone have a clue about this?


Solution

  • [Summary]

    What you are seeing is how Sparkling Water behaves.


    [ Details... ]

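    Sparkling Water on YARN can run in two different ways:

      • the default way, where the H2O nodes are embedded inside the Spark executors and everything runs as a single Spark job
      • the external H2O cluster way, where the H2O cluster runs separately from the Spark job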

    H2O nodes do not support elastic cloud formation. That is, once an H2O cluster is formed, new nodes cannot join the cluster (they are rejected) and existing nodes cannot leave the cluster (the cluster becomes unusable).

    As a result, YARN preemption must be disabled for the queue where H2O nodes are running. In the default way, it means the entire Spark job must run with YARN preemption disabled (and Spark dynamicAllocation disabled). For the external H2O cluster way, it means the H2O cluster must be run in a YARN queue with preemption disabled.
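
    As a rough illustration of the Spark side of this, here is a minimal Python sketch that assembles a spark-submit command with dynamic allocation turned off, a fixed executor count, and a dedicated YARN queue. The queue name h2o_nopreempt, the application file name, and the helper build_spark_submit are hypothetical; the three spark.* properties are standard Spark settings. Disabling preemption for the queue itself is a YARN scheduler setting that your cluster administrator has to apply, and it is not shown here.

```python
import shlex


def build_spark_submit(app, executors=36, queue="h2o_nopreempt"):
    """Return a spark-submit argument list for a fixed-size Spark job.

    `queue` is a hypothetical name for a YARN queue that has been
    configured (on the YARN side) with preemption disabled.
    """
    conf = {
        # H2O cannot cope with executors leaving or joining, so keep
        # Spark from scaling the executor set up or down on its own.
        "spark.dynamicAllocation.enabled": "false",
        # Request a fixed number of executors up front.
        "spark.executor.instances": str(executors),
        # Submit into the queue where preemption is disabled.
        "spark.yarn.queue": queue,
    }
    cmd = ["spark-submit", "--master", "yarn"]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it.
    print(shlex.join(build_spark_submit("my_sparkling_water_app.py")))
```

    The sketch only prints the command; it does not change any YARN configuration, so it helps only if the target queue really is exempt from preemption.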

    Other pieces of information that might help: