apache-sparkstr-to-date

spark to_date function - how to convert 31-DEC-98 to 1998-12-31 not 2098-12-31


(Py)Spark to_date convert 31-DEC-98 to 2098-12-31. Is there a way to make it 1998-12-31?

The document does not have an option to select 1000 or 2000.

to_date(date_str[, fmt]) - Parses the date_str expression with the fmt expression to a date. Returns null with invalid input. By default, it follows casting rules to a date if the fmt is omitted.

grade_type = spark.read\
    .option("header", "true")\
    .option("nullValue", "")\
    .option("inferSchema", "true")\
    .csv("student/GRADE_TYPE_DATA_TABLE.csv")

grade_type.show(3)
-----
+---------------+-----------+----------+------------+-----------+-------------+
|GRADE_TYPE_CODE|DESCRIPTION|CREATED_BY|CREATED_DATE|MODIFIED_BY|MODIFIED_DATE|
+---------------+-----------+----------+------------+-----------+-------------+
|             FI|      Final|  MCAFFREY|   31-DEC-98|   MCAFFREY|    31-DEC-98|
|             HM|   Homework|  MCAFFREY|   31-DEC-98|   MCAFFREY|    31-DEC-98|
|             MT|    Midterm|  MCAFFREY|   31-DEC-98|   MCAFFREY|    31-DEC-98|
+---------------+-----------+----------+------------+-----------+-------------+
grade_type = spark.read\
    .option("header", "true")\
    .option("nullValue", "")\
    .option("inferSchema", "true")\
    .csv("student/GRADE_TYPE_DATA_TABLE.csv")\
    .withColumn("CREATED_DATE", to_date(col('CREATED_DATE'), "dd-MMM-yy"))\
    .withColumn("MODIFIED_DATE", to_date(col('MODIFIED_DATE'), "dd-MMM-yy"))

grade_type.show(3)
-----
+---------------+-----------+----------+------------+-----------+-------------+
|GRADE_TYPE_CODE|DESCRIPTION|CREATED_BY|CREATED_DATE|MODIFIED_BY|MODIFIED_DATE|
+---------------+-----------+----------+------------+-----------+-------------+
|             FI|      Final|  MCAFFREY|  2098-12-31|   MCAFFREY|   2098-12-31|
|             HM|   Homework|  MCAFFREY|  2098-12-31|   MCAFFREY|   2098-12-31|
|             MT|    Midterm|  MCAFFREY|  2098-12-31|   MCAFFREY|   2098-12-31|
+---------------+-----------+----------+------------+-----------+-------------+

Solution

  • On Spark 3.0, a new dates parser was introduced, with a changed behavior for dealing with 2 digits year.
    You could find a reference for the change under Upgrading from Spark SQL 2.4 to 3.0

    spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY') will give you the original behavior with the required results


    from pyspark.sql import functions as F
    
    spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')
    
    (spark.createDataFrame([('31-DEC-98',)], 'my_date string')
     .select(F.to_date('my_date','dd-MMM-yy')
     .alias('my_new_date')).show()
    )
    

    +-----------+
    |my_new_date|
    +-----------+
    | 1998-12-31|
    +-----------+