I have an Excel reader that puts its results into Spark DataFrames, and I am having trouble parsing the timestamps. The timestamps come as strings like Wed Dec 08 10:49:59 CET 2021. I was using spark-sql version 2.4.5 and everything worked fine, until I recently updated to version 3.1.2.
Please find some minimal code below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._ // required for .toDF on a Seq

val ts: String = "Wed Dec 08 20:49:59 CET 2021"
val oldfmt: String = "E MMM dd HH:mm:ss z yyyy"
val ttdf = Seq(ts)
  .toDF("theTimestampColumn")
  .withColumn("parsedTime", to_timestamp(col("theTimestampColumn"), oldfmt))
ttdf.show()
Running this code with Spark version 2.4.5 works as expected and produces the following output:
+--------------------+-------------------+
| theTimestampColumn| parsedTime|
+--------------------+-------------------+
|Wed Dec 08 20:49:...|2021-12-08 20:49:59|
+--------------------+-------------------+
Now, executing the same code with Spark version 3.1.2 results in the following error:
Exception in thread "main" org.apache.spark.SparkUpgradeException:
You may get a different result due to the upgrading of Spark 3.0:
Fail to recognize 'E MMM dd HH:mm:ss z yyyy' pattern in the DateTimeFormatter.
1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0.
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
This website doesn't help me any further; I can't find any mistake in my format string. The symbol E represents the day-of-week as text (like Tue; Tuesday). The symbol M represents the month-of-year (like 7; 07; Jul; July). The symbols H, m, s and y represent hours, minutes, seconds and years, respectively. The symbol z denotes the time-zone name (like Pacific Standard Time; PST).
Am I missing something obvious here? Any help will be really appreciated. Thank you in advance.
You can use E only for datetime formatting, not for parsing, as stated in the datetime pattern documentation:
Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime formatting, e.g. date_format. They are not allowed used for datetime parsing, e.g. to_timestamp.
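Note that the pattern itself is not invalid in plain java.time, on which Spark 3's new parser is built; the ban on E is a restriction Spark imposes on top of it. A minimal check outside Spark (plain Scala, java.time only, no Spark session) illustrates this:

```scala
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale

// The very pattern Spark 3 rejects for parsing, fed to java.time directly.
// Locale.ENGLISH ensures "Wed" and "Dec" are recognized regardless of the JVM default locale.
val fmt = DateTimeFormatter.ofPattern("E MMM dd HH:mm:ss z yyyy", Locale.ENGLISH)
val parsed = ZonedDateTime.parse("Wed Dec 08 20:49:59 CET 2021", fmt)
// parsed.getHour == 20, parsed.getYear == 2021
```

So the day-of-week text is parseable in principle; Spark simply refuses the symbol, which is why the workarounds below either restore the legacy parser or remove the day-of-week prefix.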
If you want to restore the behavior of Spark versions before 3.0, you can set the spark.sql.legacy.timeParserPolicy option to LEGACY:
sparkSession.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
And if you don't want to change the Spark configuration, you can remove the characters representing the day with the substr SQL function:
import org.apache.spark.sql.functions.{col, expr, to_timestamp}

val ts: String = "Wed Dec 08 20:49:59 CET 2021"
val fmt: String = "MMM dd HH:mm:ss z yyyy"
val ttdf = Seq(ts)
  .toDF("theTimestampColumn")
  // drop the leading day-of-week and space ("Wed "), i.e. the first 4 characters
  .withColumn("preparedTimestamp", expr("substr(theTimestampColumn, 5, length(theTimestampColumn))"))
  .withColumn("parsedTime", to_timestamp(col("preparedTimestamp"), fmt))
  .drop("preparedTimestamp")
ttdf.show()
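The trimming logic above can be sanity-checked outside Spark. A small sketch in plain Scala (java.time only), under the assumption that the input always starts with a three-letter day abbreviation followed by a space:

```scala
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale

val ts = "Wed Dec 08 20:49:59 CET 2021"
// Spark SQL's substr(col, 5, ...) is 1-based, so it drops the first four
// characters ("Wed "); String.substring(4) is the 0-based equivalent.
val prepared = ts.substring(4) // "Dec 08 20:49:59 CET 2021"
val fmt = DateTimeFormatter.ofPattern("MMM dd HH:mm:ss z yyyy", Locale.ENGLISH)
val parsed = ZonedDateTime.parse(prepared, fmt)
```

Since the day-of-week is redundant (it is fully determined by the date), nothing is lost by stripping it before parsing.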