scalaapache-sparkapache-spark-sqlapache-spark-mllibapache-spark-encoders

Why does creating a Dataset with LinearRegressionModel fail with "No Encoder found for org.apache.spark.ml.regression.LinearRegressionModel"?


I get a DataFrame contians Tuple(String, org.apache.spark.ml.regression.LinearRegressionModel):

val result = rows.map(row => {
  val userid = row.getString(0)
  val frame = filterByUserId(userid ,dataFrame)
  (userid, lr.fit(frame, "topicDistribution", "s"))
}).toDF()

When I use foreach function, I get this error.

 result.foreach(row => {
  val model = row.getAs[LinearRegressionModel](1)
  val userid = row.getString(0)
  model.save(SocialTextTest.userModelPath + userid)
})
Exception in thread "main" java.lang.UnsupportedOperationException: 
No Encoder found for org.apache.spark.ml.regression.LinearRegressionModel
- field (class: "org.apache.spark.ml.regression.LinearRegressionModel", name: "_2")
- root class: "scala.Tuple2"

Should I write a Encoder by myself?


Solution

  • The issue is for a reason.

    No Encoder found for org.apache.spark.ml.regression.LinearRegressionModel

    The code simply does not make much sense once you get the gist of what Dataset data abstraction really is and the purpose of Encoders.

    In essence, prepare your dataset first (as a collection of transformations on Dataset) and only when the dataset is ready train a model (aka fit a model). The model will then be outside the realm of Dataset and you won't see the exception.

    The reason the exception happens when you call foreach is that that's when you trigger a computation and Spark tries to execute the code.

    Should I write a Encoder by myself?

    Oh, no. Rewrite the code following the guide at Machine Learning Library (MLlib) Guide and reviewing some examples to learn how to use the API.