I'm running a Tweedie GLM with Sparkling Water on datasets of different sizes: 20 MB, 400 MB, 2 GB, and 25 GB. The code works fine with 10 sampling iterations, but I also have to test a large sampling scenario.
In that scenario the number of sampling iterations is 500.
In this case the code works well for the 20 MB and 400 MB data, but it starts failing when the data is larger than 2 GB.
After searching, I found one suggested solution, disabling the topology change listener, but that did not work for the large data.
--conf "spark.scheduler.minRegisteredResourcesRatio=1" "spark.ext.h2o.topology.change.listener.enabled=false"
Here is my spark-submit configuration:
spark-submit \
--packages ai.h2o:sparkling-water-core_2.10:1.6.1,log4j:log4j:1.2.17 \
--driver-memory 8g \
--executor-memory 10g \
--num-executors 10 \
--executor-cores 5 \
--class TweedieGLM target/SparklingWaterGLM.jar \
$1 \
$2 \
--conf "spark.scheduler.minRegisteredResourcesRatio=1" "spark.ext.h2o.topology.change.listener.enabled=false"
This is the error I got:
16/07/08 20:39:55 ERROR YarnScheduler: Lost executor 2 on cfclbv0152.us2.oraclecloud.com: Executor heartbeat timed out after 175455 ms
16/07/08 20:40:00 ERROR YarnScheduler: Lost executor 2 on cfclbv0152.us2.oraclecloud.com: remote Rpc client disassociated
16/07/08 20:40:00 ERROR LiveListenerBus: Listener anon1 threw an exception
java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!
at org.apache.spark.h2o.H2OContext$$anon$1.onExecutorAdded(H2OContext.scala:203)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:58)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
After carefully reading the issue posted on GitHub (https://github.com/h2oai/sparkling-water/issues/32), I tried a couple of options. Here is what I tried.
Added:
--conf "spark.scheduler.minRegisteredResourcesRatio=1" "spark.ext.h2o.topology.change.listener.enabled=false" "spark.locality.wait=3000" "spark.ext.h2o.network.mask=10.196.64.0/24"
Changed: the number of executors from 10 to 3, 6, and 9; executor-memory from 4 to 12 GB and then from 12 to 24 GB; driver-memory from 4 to 12 GB and then from 12 to 24 GB.
This is what I learned: GLM is a memory-intensive job, so we have to provide sufficient memory for it to run.
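For reference, here is a minimal sketch of how the settings above could be assembled. It assumes the documented usage `spark-submit [options] <app jar> [app arguments]`, under which anything placed after the application jar is handed to the application rather than to Spark, and it assumes each property needs its own --conf flag. The stack trace above still shows the H2OContext listener firing, which would be consistent with the listener setting never reaching Spark, though I have not verified that. The helper name build_submit_cmd and the sample arguments ARG1 and ARG2 are hypothetical, and the memory and executor values are simply the largest ones tried above, not a verified recommendation. The function only prints the command so it can be reviewed before submitting.

```shell
#!/bin/sh
# Sketch only: prints a spark-submit command line for review; it does not submit a job.
# build_submit_cmd is a hypothetical helper; "$@" stands for the two
# application arguments ($1 and $2 in the original script).
build_submit_cmd() {
  echo spark-submit \
    --packages ai.h2o:sparkling-water-core_2.10:1.6.1,log4j:log4j:1.2.17 \
    --driver-memory 24g \
    --executor-memory 24g \
    --num-executors 9 \
    --executor-cores 5 \
    --conf spark.scheduler.minRegisteredResourcesRatio=1 \
    --conf spark.ext.h2o.topology.change.listener.enabled=false \
    --conf spark.locality.wait=3000 \
    --conf spark.ext.h2o.network.mask=10.196.64.0/24 \
    --class TweedieGLM target/SparklingWaterGLM.jar \
    "$@"
}

# ARG1 and ARG2 are placeholders for the real application arguments.
build_submit_cmd ARG1 ARG2
```

With this ordering, every Spark option comes before the jar, and the two trailing values are the only arguments the application's main method receives.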