In my PySpark DataFrame there is a column CallDate of type string containing the values shown below:
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
2008-04-01T00:00:00
I am trying to convert this column from string to timestamp using pyspark.sql.functions.to_timestamp().
When I run this code:
df.withColumn('IncidentDate', to_timestamp(col('CallDate'), 'yyyy/MM/dd')).select('CallDate', 'IncidentDate').show()
... I get this output:
+-------------------+------------+
|           CallDate|IncidentDate|
+-------------------+------------+
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
|2008-04-01T00:00:00|        NULL|
+-------------------+------------+
I believe the NULL values appear because the format I specified does not match the actual date string, so no match is found and NULL is returned.
But when I run this code:
df.withColumn('IncidentDate', to_timestamp(col('CallDate'), 'yyyy-MM-dd')).select('CallDate', 'IncidentDate').show()
I get the following error:
Caused by: org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '2008-04-01T00:00:00' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
I know the correct parse format should be "yyyy-MM-dd'T'HH:mm:ss"
as shown below:
df.withColumn('IncidentDate', to_timestamp(col('CallDate'), "yyyy-MM-dd'T'HH:mm:ss")).select('CallDate', 'IncidentDate').show()
But my question is: why does Spark return NULL values when I give the parse format as yyyy/MM/dd, but throw an error when I give it as yyyy-MM-dd?
This happens because with the format yyyy/MM/dd, both the old and the new datetime parsers fail to parse the input, so the result is NULL in both cases regardless of the Spark (and parser) version. With yyyy-MM-dd, however, the old parser, being more lenient, returns a valid value, while the new one still fails and would return NULL. Spark detects such cases through an internal check and throws a dedicated SparkUpgradeException with a specific message to draw your attention to the potential regression.
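You can see what the old parser would have done by switching the session to the legacy behaviour. This is only a minimal sketch, assuming a fresh session and a one-row DataFrame with the same value (the names here are illustrative, not taken from your code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2008-04-01T00:00:00",)], ["CallDate"])

# Force the pre-3.0 (SimpleDateFormat-based) parser.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# The lenient old parser stops after matching 'yyyy-MM-dd' and simply ignores
# the trailing 'T00:00:00', so it still yields 2008-04-01 00:00:00.
df.withColumn("IncidentDate", to_timestamp(col("CallDate"), "yyyy-MM-dd")).show()

# 'yyyy/MM/dd' does not even match the '-' separators, so the old parser fails
# too and the result is NULL -- which is why that format never triggers the
# upgrade check.
df.withColumn("IncidentDate", to_timestamp(col("CallDate"), "yyyy/MM/dd")).show()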
More about the new parsing logic in Spark 3.x and the effects of the spark.sql.legacy.timeParserPolicy config can be found in the Spark Migration Guide and/or the source code.
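For completeness, here is a similar sketch (setup repeated so it stands on its own) showing the effect of the other policy values on your two statements; again, the names are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2008-04-01T00:00:00",)], ["CallDate"])

# CORRECTED: always use the new parser; a string the pattern cannot fully
# parse simply becomes NULL instead of raising SparkUpgradeException.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
df.withColumn("IncidentDate", to_timestamp(col("CallDate"), "yyyy-MM-dd")).show()

# EXCEPTION (the default): the new parser is used, and when it fails while the
# old one would have succeeded, SparkUpgradeException is thrown. With the full
# pattern the new parser succeeds, so no exception is raised.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "EXCEPTION")
df.withColumn("IncidentDate", to_timestamp(col("CallDate"), "yyyy-MM-dd'T'HH:mm:ss")).show()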