I'm trying to run a grid search for a Gradient Boosting Machine in PySpark with H2O Sparkling Water.
Here is a reproducible example using the well-known iris dataset:
from pysparkling import H2OContext, H2OConf
import pyspark
from pyspark.sql.types import StructType, StructField, FloatType, StringType
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setMaster("local").setAppName("test")
conf.set("spark.sql.shuffle.partitions", 3)
conf.set("spark.default.parallelism", 3)
conf.set("spark.debug.maxToStringFields", 100)
sc = pyspark.SparkContext(conf=conf)
sqlContext = SQLContext(sc)
hc = H2OContext.getOrCreate(sc, H2OConf(sc).set_internal_cluster_mode())
schema = StructType([
    StructField("sepal_length", FloatType(), True),
    StructField("sepal_width", FloatType(), True),
    StructField("petal_length", FloatType(), True),
    StructField("petal_width", FloatType(), True),
    StructField("class", StringType(), True)])
iris_df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .option('header', 'false') \
    .option('delimiter', ',') \
    .schema(schema) \
    .load('../../../../Downloads/iris.data')
If I follow this page of the H2O docs and simply translate the example to Python:
gbm_params = {'learnRate': [0.01, 0.1],
              'ntrees': [100, 200, 300, 500]}
gbm_grid = H2OGridSearch() \
    .setLabelCol("class") \
    .setHyperParameters(gbm_params) \
    .setAlgo(H2OGBM().setMaxDepth(30))
model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)
I get an internal NullPointerException. I guess something is missing in the configuration.
Py4JJavaError: An error occurred while calling o111.fit.
: java.lang.NullPointerException
at ai.h2o.sparkling.ml.algos.H2OGridSearch.extractH2OParameters(H2OGridSearch.scala:352)
at ai.h2o.sparkling.ml.algos.H2OGridSearch.fit(H2OGridSearch.scala:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
If I rewrite it in a different way, I get a different error:
gbm_grid = H2OGridSearch(algo=H2OGBM().setMaxDepth(30),
                         hyperParameters={'learnRate': [0.01, 0.1]},
                         withDetailedPredictionCol=True,
                         labelCol='class',
                         stoppingMetric="AUC")
model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)
This is the output, no matter how I change the hyperparameters:
Py4JJavaError: An error occurred while calling o1817.fit.
: java.lang.NoSuchFieldException: learnRate
at java.lang.Class.getField(Unknown Source)
at ai.h2o.sparkling.ml.algos.H2OGridSearch.findField(H2OGridSearch.scala:170)
at ai.h2o.sparkling.ml.algos.H2OGridSearch.processHyperParams(H2OGridSearch.scala:154)
at ai.h2o.sparkling.ml.algos.H2OGridSearch.fit(H2OGridSearch.scala:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
The following works, but it is not useful since there is no grid:
gbm_grid = H2OGridSearch(algo=H2OGBM().setMaxDepth(30),
                         #hyperParameters=gbm_params,
                         withDetailedPredictionCol=True,
                         labelCol='class',
                         stoppingMetric="AUC")
model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)
model.stages[0].transform(iris_df).head()
And finally, just to be sure that learnRate is a parameter of H2OGBM, this also works:
gbm_model = H2OGBM(labelCol='class',
                   withDetailedPredictionCol=True).setLearnRate(0.01).setMaxDepth(5).setNtrees(100)
model_pipeline = Pipeline().setStages([gbm_model])
model = model_pipeline.fit(iris_df)
model.stages[0].transform(iris_df).head()
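Since a single H2OGBM trains fine, one fallback I can think of is enumerating the grid by hand and fitting one model per combination. A minimal sketch of the enumeration part only, in plain Python; expand_grid is a hypothetical helper name of mine, not part of Sparkling Water, and the keys simply mirror the setters used above:

```python
from itertools import product


def expand_grid(hyper_params):
    """Return one dict per combination of the listed hyperparameter values."""
    names = sorted(hyper_params)  # fixed key order so the output is deterministic
    value_lists = [hyper_params[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Same values as in gbm_params: 2 learn rates x 4 tree counts
grid = expand_grid({"learnRate": [0.01, 0.1],
                    "ntrees": [100, 200, 300, 500]})

for params in grid:
    # Each entry could drive one H2OGBM fit, e.g. via
    # .setLearnRate(params["learnRate"]).setNtrees(params["ntrees"])
    print(params)
```

This gives 8 combinations, each of which would need its own fit and its own evaluation, so it is only a stopgap compared to a working H2OGridSearch.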
EDIT: missing imports
from pyspark.ml.pipeline import Pipeline
from ai.h2o.sparkling.ml.algos import H2OGridSearch
from ai.h2o.sparkling.ml.algos import H2OGBM
and Sparkling Water version
h2o-pysparkling-2-4 3.28.0.1-1 pypi_0 pypi
EDIT after comments for Spark/H2O/Java versions
Spark: 2.4.4
H2O: 3.28.0.3
Java: 1.8.0_232
EDIT java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
I get the same error if I use learn_rate instead of learnRate.
gbm_grid = H2OGridSearch(algo=H2OGBM().setMaxDepth(30),
                         hyperParameters={'learn_rate': [0.01, 0.1]},
                         withDetailedPredictionCol=True,
                         labelCol='class',
                         stoppingMetric="AUC")
model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)
...
Py4JJavaError: An error occurred while calling o1376.fit.
: java.lang.NoSuchFieldException: learn_rate
at java.lang.Class.getField(Class.java:1703)
at ai.h2o.sparkling.ml.algos.H2OGridSearch.findField(H2OGridSearch.scala:170)
at ai.h2o.sparkling.ml.algos.H2OGridSearch.processHyperParams(H2OGridSearch.scala:154)
at ai.h2o.sparkling.ml.algos.H2OGridSearch.fit(H2OGridSearch.scala:71)
at ai.h2o.sparkling.ml.algos.H2OGridSearch.fit(H2OGridSearch.scala:52)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
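There's a workaround here that I did not notice (I probably should have reported it as a bug on GitHub in the first place). The hyperparameter keys have to be the underscore-prefixed names, such as _learn_rate and _ntrees: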
gbm_grid = H2OGridSearch(algo=H2OGBM().setMaxDepth(30),
                         hyperParameters={'_learn_rate': [0.01, 0.1], '_ntrees': [100, 200]},
                         withDetailedPredictionCol=True,
                         labelCol='class',
                         stoppingMetric="AUC")
model_pipeline = Pipeline().setStages([gbm_grid])
model = model_pipeline.fit(iris_df)
model.stages[0].transform(iris_df).head()
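Judging from the stack traces above, the keys seem to be resolved through java.lang.Class.getField, which would explain why only the exact underscore-prefixed field names are accepted in this version. To avoid typing them by hand, a small helper along these lines could convert the setter-style names. This is only a sketch: to_backend_keys is a hypothetical name, and the camelCase to _snake_case mapping is my assumption, inferred from _learn_rate and _ntrees alone rather than from anything the library documents.

```python
import re


def to_backend_keys(hyper_params):
    """Convert keys like 'learnRate' or 'learn_rate' to '_learn_rate'.

    Assumption: the backend field name is the snake_case form of the
    setter-style name with a leading underscore.
    """
    converted = {}
    for name, values in hyper_params.items():
        # camelCase -> snake_case: insert '_' before every upper-case
        # letter that is not at the start, then lower-case everything
        snake = re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()
        # add the leading underscore only when it is missing
        key = snake if snake.startswith("_") else "_" + snake
        converted[key] = list(values)
    return converted


gbm_params = to_backend_keys({"learnRate": [0.01, 0.1],
                              "ntrees": [100, 200]})
print(gbm_params)
```

The resulting dictionary can then be passed as hyperParameters=gbm_params in the H2OGridSearch call above. Keys that are already in the _learn_rate form pass through unchanged.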