I am running Sparkling Water over 36 Spark executors.
Due to YARN's scheduling, some executors get preempted and come back later.
Overall, there are 36 executors most of the time, just not always.
So far, my experience is that as soon as one executor fails, the entire H2O
instance halts, even if the missing executor comes back to life later.
I wonder: is this just how Sparkling Water behaves? Or does some
preemption-related capability need to be turned on?
Does anyone have a clue about this?
[Summary]
What you are seeing is how Sparkling Water behaves.
[Details]
Sparkling Water on YARN can run in two different ways:
1. The default way, where H2O nodes are embedded inside Spark executors and there is a single (Spark) YARN job.
2. The external H2O cluster way, where the Spark cluster and the H2O cluster are separate YARN jobs. (Running in this mode requires more setup; if you were running this way, you would know it.)
H2O nodes do not support elastic cloud formation behavior. Which is to say, once an H2O cluster is formed, new nodes may not join the cluster (they are rejected) and existing nodes may not leave the cluster (the cluster becomes unusable).
As a result, YARN preemption must be disabled for the queue where H2O nodes are running. In the default way, it means the entire Spark job must run with YARN preemption disabled (and Spark dynamicAllocation disabled). For the external H2O cluster way, it means the H2O cluster must be run in a YARN queue with preemption disabled.
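As a concrete sketch of the default way, launching with a fixed executor count and dynamic allocation disabled looks something like the following. The queue name and jar path are placeholders for your environment, not anything prescribed by Sparkling Water:

```shell
# Hypothetical sketch: pin a fixed set of executors so H2O nodes
# never come and go after the cloud forms.
spark-submit \
  --master yarn \
  --queue h2o-no-preempt \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=36 \
  your-sparkling-water-app.jar
```

On the YARN side, if your cluster uses the Capacity Scheduler, newer Hadoop versions let you disable preemption per queue via the `yarn.scheduler.capacity.<queue-path>.disable_preemption` property; check your Hadoop version's scheduler documentation for the exact mechanism available to you.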
Other pieces of information that might help:
If you are just starting on a new problem with Sparkling Water (or H2O in general), prefer a small number of large memory nodes to a large number of small memory nodes; fewer things can go wrong that way,
To be more specific, if you are trying to run with 36 executors that each have 1 GB of executor memory, that's a really awful configuration; start with 4 executors x 10 GB instead,
In general, don't start Sparkling Water with less than 5 GB of memory per executor; more memory is better,
If running in the default way, don't set the number of executor cores to be too small; machine learning is hungry for lots of CPU.
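Putting the sizing advice above together, a starting point along these lines (all values illustrative, to be tuned for your data and cluster) is a reasonable baseline:

```shell
# Hypothetical baseline: a few large executors rather than many small ones,
# with enough cores per executor for ML workloads.
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 4 \
  --executor-memory 10g \
  --executor-cores 4 \
  your-sparkling-water-app.jar
```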