Tags: pyspark, apache-spark-sql, databricks-sql

INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER


I am very new to PySpark and I get the error below even if I drop all date-related columns or select only a single column. The dates in my DataFrame look like 1/1/2023 3:57:22 AM (single-digit month, day, and hour, 12-hour clock with AM/PM). Can anyone please suggest what I could change in the DataFrame to resolve this, or which date formats the new parser supports? It works if I set "spark.sql.legacy.timeParserPolicy" to "LEGACY".

[INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER]
You may get a different result due to the upgrading to Spark >= 3.0:
Caused by: DateTimeParseException: Text '1/1/2023 3:57:22 AM' could not be parsed at index 0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 15.0 failed 4 times,
most recent failure: Lost task 4.3 in stage 15.0 (TID 355) (10.139.64.5 executor 0):
org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER]
You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '1/1/2023 3:57:22 AM' in the new parser.
You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0,
or set to "CORRECTED" and treat it as an invalid datetime string.
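For reference, the two escape hatches the error message mentions are a session configuration and can be set from PySpark. A minimal config sketch, assuming an existing SparkSession named `spark` (the two calls are alternatives, not meant to run together):

```python
# Restore the pre-3.0 (SimpleDateFormat-based) parsing behavior:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# Or keep the new parser but treat unparsable strings as invalid
# (instead of the default EXCEPTION policy, which fails the job):
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
```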

Example:

# spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
from pyspark.sql import functions as F

emp = [(1, "AAA", "dept1", 1000, "12/22/2022  3:11:44 AM"),
       (2, "BBB", "dept1", 1100, "12/22/2022  3:11:44 AM"),
       (3, "CCC", "dept1", 3000, "12/22/2022  3:11:44 AM"),
       (4, "DDD", "dept1", 1500, "12/22/2022  3:11:44 AM"),
       (5, "EEE", "dept2", 8000, "12/22/2022  3:11:44 AM"),
       (6, "FFF", "dept2", 7200, "12/22/2022  3:11:44 AM"),
       (7, "GGG", "dept3", 7100, "12/22/2022  3:11:44 AM"),
       (8, "HHH", "dept3", 3700, "12/22/2022  3:11:44 PM"),
       (9, "III", "dept3", 4500, "12/22/2022  3:11:44 PM"),
       (10, "JJJ", "dept5", 3400, "12/22/2022 3:11:44 PM")]
empdf = spark.createDataFrame(emp, ["id", "name", "dept", "salary", "date"])

# empdf.printSchema()
df = empdf.withColumn("date", F.to_timestamp(F.col("date"), "MM/dd/yyyy hh:mm:ss a"))
df.show(12, False)

Thanks a lot in advance.


Solution

  • In general, the new parser (based on java.time's DateTimeFormatter) is stricter than the legacy one: a two-letter pattern such as MM or dd demands exactly two digits, while M or d also accepts a single digit. Since the failing values contain single-digit months, days, and hours (1/1/2023 3:57:22 AM), prefer the single-letter forms.

    Example 1:

    # Does not work:
    df = df.withColumn("date", F.to_timestamp(F.col("date"), "MM/dd/yyyy  hh:mm:ss a"))

    # Does work:
    df = df.withColumn("date", F.to_timestamp(F.col("date"), "M/d/yyyy h:m:s a"))
    

    Example 2:

    # Does not work: 
    MM/dd/yyyy H:mm:ss a
    
    # Does work: 
    MM/d/yyyy H:mm:ss a
    

    In addition, the parser is less forgiving about input that the pattern does not explicitly cover, e.g. when a timestamp string carries more text than the date pattern consumes.

    Example 3:

    String example: 02.03.2004 10:35

    # Does not work (the trailing " 10:35" is not consumed by the pattern):
    df.withColumn(
      "parsed_date",
      F.to_date("unparsed_date", "dd.MM.yyyy"))

    # Does work (cut the string down to the date part first; substr is 1-indexed):
    df.withColumn(
      "parsed_date",
      F.to_date(F.col("unparsed_date").substr(1, 10), "d.M.y"))
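If no single pattern fits every row (e.g. mixed single- and double-digit fields plus the inconsistent double spaces in the sample data), one fallback is to parse in plain Python and wrap that in a UDF. Python's `datetime.strptime` accepts unpadded values for `%m`, `%d`, and `%I`, so one format string covers both `1/1/2023 3:57:22 AM` and `12/22/2022  3:11:44 AM`. A minimal sketch; the helper name `parse_us_timestamp` is made up for illustration:

```python
from datetime import datetime

def parse_us_timestamp(s):
    """Parse 'M/d/yyyy h:m:s AM/PM' strings; returns None for unparsable input."""
    if s is None:
        return None
    try:
        # split()/join collapses the double spaces seen in the sample data
        return datetime.strptime(" ".join(s.split()), "%m/%d/%Y %I:%M:%S %p")
    except ValueError:
        return None
```

Registered as a UDF (e.g. `F.udf(parse_us_timestamp, TimestampType())`) this could replace the `to_timestamp` call, at the cost of Python UDF overhead, so prefer a fixed pattern whenever one exists.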