I'm writing a solution in Scala on Spark 3.5.0 that has complex map/seq data structures of dates, where I need to represent missing dates with null or None. As this is Scala, my preference is None; however, I get an exception when trying to encode Option[LocalDate] in a sequence or map. As a workaround I switched to null, but that doesn't work in a map because null can't be used as a key.
This is code that reproduces the problem:
import java.time.LocalDate

case class SubClass[T](i: Option[T])
case class TestClass[T](i: Option[T] = None, st: Option[SubClass[T]], sq: Option[Seq[Option[T]]] = None)

it should "handle sequence of date" in {
  val sparkSession = spark
  import sparkSession.implicits._
  val aDate = LocalDate.parse("2024-04-29")
  // Fails on 3.5.0: Option[LocalDate] inside a Seq cannot be encoded
  val ds = Seq(TestClass[LocalDate](Some(aDate), Some(SubClass[LocalDate](Some(aDate))), Some(Seq(Some(aDate))))).toDS()
  ds.show()
}

it should "handle date structure" in {
  val sparkSession = spark
  import sparkSession.implicits._
  val aDate = LocalDate.parse("2024-04-29")
  val ds = Seq(TestClass[LocalDate](Some(aDate), Some(SubClass[LocalDate](Some(aDate))))).toDS()
  ds.show()
}

it should "handle sequence of double" in {
  val sparkSession = spark
  import sparkSession.implicits._
  val aNum = 2.42
  val ds = Seq(TestClass[Double](Some(aNum), Some(SubClass[Double](Some(aNum))), Some(Seq(Some(aNum))))).toDS()
  ds.show()
}

it should "handle sequence of string" in {
  val sparkSession = spark
  import sparkSession.implicits._
  val aString = "Foo"
  val ds = Seq(TestClass[String](Some(aString), Some(SubClass[String](Some(aString))), Some(Seq(Some(aString))))).toDS()
  ds.show()
}
All of the above tests pass except "should handle sequence of date", which fails with the exception:

org.apache.spark.SparkRuntimeException: Error while encoding: java.lang.RuntimeException: scala.Some is not a valid external type for schema of date

This only seems to be an issue with LocalDate, as the Double and String tests work just fine.
Can anyone suggest a workaround or fix for this?
Thanks,
David
I assume this is Spark 3.5.0; I can reproduce the problem on that version, but on 3.5.1 it no longer occurs.
This is probably the issue tracked in https://issues.apache.org/jira/browse/SPARK-45896
(the fix is present on Databricks 14.3 LTS but not on lower versions).
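If upgrading to 3.5.1 isn't an option, a possible stopgap on 3.5.0 is to carry the date as an ISO-8601 string inside the nested structures and convert at the boundary, since the String variant above encodes fine. A minimal sketch, reusing TestClass/SubClass from the question (untested beyond what the tests above demonstrate):

import java.time.LocalDate
import sparkSession.implicits._

val aDate = LocalDate.parse("2024-04-29")

// Encode with String in place of LocalDate; LocalDate.toString yields ISO-8601 (yyyy-MM-dd).
val ds = Seq(
  TestClass[String](
    Some(aDate.toString),
    Some(SubClass[String](Some(aDate.toString))),
    Some(Seq(Some(aDate.toString)))
  )
).toDS()

// Parse back to LocalDate on the consuming side.
val dates: Option[Seq[Option[LocalDate]]] =
  ds.head().sq.map(_.map(_.map(s => LocalDate.parse(s))))

On 3.5.1 and later the workaround is unnecessary; Option[LocalDate] in a Seq encodes directly.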