Tags: python, apache-spark, pyspark, to-timestamp

Pyspark to_timestamp date format parsing error


In my PySpark DataFrame, there is a string column CallDate containing the values shown below:

2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00

I am trying to convert this column from string to timestamp using pyspark.sql.functions.to_timestamp().

When I run this code:

df.withColumn('IncidentDate', to_timestamp(col('CallDate'), 'yyyy/MM/dd')).select('CallDate', 'IncidentDate').show()

... I get this output:

+-------------------+------------+
|           CallDate|IncidentDate|
+-------------------+------------+
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
+-------------------+------------+

I believe the NULL values appear because the format I specified does not match the actual date strings, so no match is found and NULL is returned.

But when I run this code:

df.withColumn('IncidentDate', to_timestamp(col('CallDate'), 'yyyy-MM-dd')).select('CallDate', 'IncidentDate').show()

I get this error:

Caused by: org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '2008-04-01T00:00:00' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.

I know the correct parse format should be "yyyy-MM-dd'T'HH:mm:ss" as shown below:

df.withColumn('IncidentDate', to_timestamp(col('CallDate'), "yyyy-MM-dd'T'HH:mm:ss")).select('CallDate', 'IncidentDate').show()

But my question is: why does Spark return NULL values when I give the parse format as yyyy/MM/dd, yet throw an error when I give it as yyyy-MM-dd?


Solution

  • This happens because with the format yyyy/MM/dd, both the old (pre-3.0) and the new datetime parsers fail to parse the input, so the result is NULL regardless of the Spark (and parser) version. With the format yyyy-MM-dd, however, the old parser, being more lenient, returns a valid value while the new one still fails. Spark detects this discrepancy through an internal check and throws a SparkUpgradeException with a dedicated message to draw your attention to the potential regression. More about the new parsing logic in Spark 3.x and the effects of the spark.sql.legacy.timeParserPolicy config can be found in the Spark Migration Guide and the source code, as illustrated by the sketch below.
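To see the difference side by side, here is a minimal sketch (assuming a running SparkSession named spark and the same df with its CallDate column as above) that toggles the spark.sql.legacy.timeParserPolicy setting:

from pyspark.sql.functions import col, to_timestamp

# Default in Spark 3.x is EXCEPTION: for 'yyyy-MM-dd' the old parser succeeds
# while the new one fails, so Spark raises the SparkUpgradeException shown above.

# CORRECTED: use only the new parser; strings it cannot parse become NULL
# instead of triggering the upgrade exception.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
df.withColumn('IncidentDate', to_timestamp(col('CallDate'), 'yyyy-MM-dd')).select('CallDate', 'IncidentDate').show()

# LEGACY: fall back to the pre-3.0 SimpleDateFormat-based parser, which is
# lenient enough to read the leading '2008-04-01' and ignore the trailing
# 'T00:00:00', giving a valid timestamp.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.withColumn('IncidentDate', to_timestamp(col('CallDate'), 'yyyy-MM-dd')).select('CallDate', 'IncidentDate').show()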