scala, apache-spark, apache-spark-sql, to-timestamp

Spark Scala to_timestamp() function throwing DateTimeParseException (Fail to parse) error instead of returning null


According to the documentation, the to_timestamp function should return null instead of throwing a "Fail to parse" error.

The following code throws Caused by: java.time.format.DateTimeParseException: Text '17-08-01' could not be parsed at index 0:

import org.apache.spark.sql.functions.{col, to_timestamp}
import spark.implicits._  // needed for toDF when not running in spark-shell

val df1 = Seq(("abc", "17-08-01")).toDF("id", "eventTime")
// "17-08-01" does not match the "yyyy-MM-dd" pattern, so parsing fails
val df2 = df1.withColumn("eventTime1", to_timestamp(col("eventTime"), "yyyy-MM-dd"))
df2.show()

https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html#to_timestamp(s:org.apache.spark.sql.Column,fmt:String):org.apache.spark.sql.Column

Based on the documentation, to_timestamp returns: "A timestamp, or null if s was a string that could not be cast to a timestamp or fmt was an invalid format".


Solution

  • Are you using Spark 3? It seems that this is no longer the default behaviour since Spark 3.0 (they should've updated the docs); see the error at the beginning of your stack trace:

    Exception in thread "main" org.apache.spark.SparkUpgradeException: You may get a different 
    result due to the upgrading of Spark 3.0: Fail to parse '17-08-01' in the new parser. 
    You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior 
    before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
    

    If you want to get rid of the exception, you need to set one of these configs; the second one (CORRECTED) seems to fit your needs better, since it returns null for unparseable strings (see the sketch below):

    spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")