I am trying to launch a Sparkling Water cloud within Spark using Databricks. I've attached the H2O library (3.16.0.2), PySparkling (pysparkling 0.4.6), and the Sparkling Water jar (sparkling-water-assembly_2.11-2.1.10-all.jar) to the cluster I'm running (Spark 2.1, Auto-updating Scala 1.1.1).
I successfully import the required libraries below:
from pysparkling import *
import h2o
Yet when I try to initialize the Sparkling Water cloud using the following commands:
hc = H2OContext.getOrCreate(spark)
or
H2OContext.getOrCreate(sc)
I get the same error:
NameError: name 'H2OContext' is not defined
NameError Traceback (most recent call last)
<command-4043510449425708> in <module>()
----> 1 H2OContext.getOrCreate(sc)
NameError: name 'H2OContext' is not defined
For what it's worth, I can initialize the Sparkling Water cloud using the Scala example from the documentation:
%scala
import org.apache.spark.h2o._
val h2oConf = new H2OConf(sc).set("spark.ui.enabled", "false")
val h2oContext = H2OContext.getOrCreate(sc, h2oConf)
import org.apache.spark.h2o._
h2oConf: org.apache.spark.h2o.H2OConf =
Sparkling Water configuration:
backend cluster mode : internal
workers : None
cloudName : sparkling-water-root_app-20171222131625-0000
flatfile : true
clientBasePort : 54321
nodeBasePort : 54321
cloudTimeout : 60000
h2oNodeLog : INFO
h2oClientLog : WARN
nthreads : -1
drddMulFactor : 10
h2oContext: org.apache.spark.h2o.H2OContext =
Sparkling Water Context:
* H2O name: sparkling-water-root_app-20171222131625-0000
* cluster size: 1
* list of used nodes:
(executorId, host, port)
------------------------
(x,xx.xxx.xxx.x,54321)
------------------------
Open H2O Flow in browser: http://xx.xxx.xxx.xxx:54321 (CMD + click in Mac OSX)
However, this pipeline may not always run on Databricks, so it needs to be written entirely in PySpark, and Databricks doesn't provide a corresponding PySpark example.
Thanks in advance.
For PySparkling, you first need to create a PyPI library for h2o_pysparkling_2.1, since you are using a Spark 2.1 cluster. The pysparkling library you attached is something different. You also do not need to attach all those other libraries, as the h2o_pysparkling_2.1 package already pulls in the other necessary dependencies.
Once you do that, you can run:
from pysparkling import *
# Configure Sparkling Water and disable the Spark UI, as in the Scala example
h2oConf = H2OConf(spark)
h2oConf.set("spark.ui.enabled", "false")
# Launch the H2O cloud inside the Spark cluster
h2oContext = H2OContext.getOrCreate(spark, h2oConf)
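Once the context comes up, a quick smoke test is to move a small DataFrame between Spark and H2O. This is a minimal sketch, assuming spark is your SparkSession and h2oContext is the context created above; as_h2o_frame and as_spark_frame are the PySparkling converters, and the toy DataFrame is only for illustration:
# Toy DataFrame used only to verify the cloud is working
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
# Spark -> H2O: the data now lives in the Sparkling Water cloud as an H2OFrame
h2o_frame = h2oContext.as_h2o_frame(df)
h2o_frame.show()
# H2O -> Spark: convert back to a Spark DataFrame
df_back = h2oContext.as_spark_frame(h2o_frame)
df_back.show()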