I am using Spark 2.2.0 with Python. I am trying to find out which default value Spark uses for the link function parameter of GeneralizedLinearRegression
when the family is Tweedie.
When I look at the documentation (https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression), the Python signature is:
class pyspark.ml.regression.GeneralizedLinearRegression(self, labelCol="label", featuresCol="features", predictionCol="prediction", family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None
It seems that the default value when family='tweedie' should be None, but when I tried the following (using a test similar to the unit test at https://github.com/apache/spark/pull/17146/files/fe1d3ae36314e385990f024bca94ab1e416476f2):
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression
df = spark.createDataFrame([(1.0, Vectors.dense(0.0, 0.0)),\
(1.0, Vectors.dense(1.0, 2.0)),\
(2.0, Vectors.dense(0.0, 0.0)),\
(2.0, Vectors.dense(1.0, 1.0)),], ["label", "features"])
glr = GeneralizedLinearRegression(family="tweedie",variancePower=1.42,link=None)
model = glr.fit(df)
transformed = model.transform(df)
it raised a Java NullPointerException:
...
Py4JJavaError: An error occurred while calling o6739.w. : java.lang.NullPointerException ...
It works fine when I remove the explicit link=None from the model initialization:
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression
df = spark.createDataFrame([(1.0, Vectors.dense(0.0, 0.0)),\
(1.0, Vectors.dense(1.0, 2.0)),\
(2.0, Vectors.dense(0.0, 0.0)),\
(2.0, Vectors.dense(1.0, 1.0)),], ["label", "features"])
glr = GeneralizedLinearRegression(family="tweedie",variancePower=1.42)
model = glr.fit(df)
transformed = model.transform(df)
I would like to be able to pass a standard set of params like
params={"family":"Onefamily","link":"OnelinkAccordingToFamily",..}
and then initialize GLM as:
glr = GeneralizedLinearRegression(family=params["family"], link=params["link"], ....)
That way the code would be uniform and work for any combination of family and link. It seems, though, that the link value is not ignored when the family is Tweedie. Any idea what default value I should use? I tried link='' and link='None', but both raise 'invalid link function'.
To use GLR with the Tweedie family, you need to specify the power link function through the "linkPower" parameter, and you shouldn't set link to None, which is what caused the exception you got.
Here is an example of how to use it:
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression

df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 0.0)),
     (1.0, Vectors.dense(1.0, 2.0)),
     (2.0, Vectors.dense(0.0, 0.0)),
     (2.0, Vectors.dense(1.0, 1.0))],
    ["label", "features"])
# linkPower is not set here, so the default link power applies
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.6)
model = glr.fit(df)
# set the link power explicitly (setLinkPower updates glr and returns it)
model2 = glr.setLinkPower(-1.0).fit(df)
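For the generic params dictionary described in the question, one possible approach is to drop unset entries before calling the constructor, so that link=None is never passed explicitly. The sketch below is only an illustration: `glr_kwargs` is a hypothetical helper name, not part of PySpark, and the choice to drop "link" for the Tweedie family is an assumption based on the advice above to use linkPower instead.

```python
def glr_kwargs(params):
    """Return constructor keyword arguments with unset (None) entries removed.

    Hypothetical helper, not part of PySpark.
    """
    kwargs = {name: value for name, value in params.items() if value is not None}
    # Assumption: for the Tweedie family the link is described by linkPower,
    # so any "link" entry is dropped rather than passed to the constructor.
    if str(kwargs.get("family", "")).lower() == "tweedie":
        kwargs.pop("link", None)
    return kwargs


params = {"family": "tweedie", "link": None, "variancePower": 1.42, "linkPower": None}
kwargs = glr_kwargs(params)
print(kwargs)  # {'family': 'tweedie', 'variancePower': 1.42}

# With a Spark session available, the estimator could then be built as:
#   from pyspark.ml.regression import GeneralizedLinearRegression
#   glr = GeneralizedLinearRegression(**kwargs)
```

Because parameters that are left out keep their defaults, the same call site should work for other families too, for example params={"family": "poisson", "link": "log"}.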
PS: The default link power for the Tweedie family is 1 - variancePower.
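To make that default concrete, here is a small plain-Python sketch of the arithmetic. It does not call Spark; `default_link_power` and `power_link` are hypothetical names used only for illustration, and it assumes the usual power-link convention in which a link power of 0 stands for the log link.

```python
import math


def default_link_power(variance_power):
    """Default link power for the Tweedie family: 1 - variancePower."""
    return 1.0 - variance_power


def power_link(mu, link_power):
    """Power link g(mu) = mu ** link_power, with link_power == 0 treated as log(mu)."""
    if link_power == 0.0:
        return math.log(mu)
    return mu ** link_power


# variancePower=1.6 as in the example above gives a default link power of about -0.6
print(round(default_link_power(1.6), 10))  # -0.6
# linkPower=-1.0 is the inverse link: g(2.0) = 1 / 2.0
print(power_link(2.0, -1.0))  # 0.5
```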