I am trying to validate a date received in a file against a user-configured date format (using to_timestamp/to_date).
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("date", StringType(), True),
    StructField("active", StringType(), True),
])
input_data = [
    ("27/10/00", 'Y'),
    ("27/10/01", 'Y'),
    ("27/10/1", 'Y'),
    ("27/10/16", 'Y'),
    ("27/10", 'Y'),
    ("27/10/99", 'Y'),
    ("27/10/100", 'Y'),
    ("27/10/992", 'Y'),
    ("27/10/2000", 'Y'),
    ("27/10/9999", 'Y'),
    ("27/10/20015", 'Y'),
]
date_format = "dd/MM/yyyy"
temp_df = spark.createDataFrame(data=input_data, schema=schema)
df = temp_df.select(
    '*',
    f.when(f.date_format(f.to_timestamp(f.col('date'), date_format), date_format).isNotNull(), True)
     .otherwise(False)
     .alias('Date_validation'),
    f.date_format(f.to_timestamp(f.col('date'), date_format), date_format).alias('converted_date'),
)
df.show(truncate=False)
My expectation is that the code should return False for values like "27/10/00", since yyyy is a four-character year pattern, but the code returns True and prints the converted year as if it were valid.
Is there any way I could do a strict validation on the date?
Note: the formats are configured by the user and will be dynamic.
I tried both to_timestamp and to_date; both have the same output.
This is a bug in Spark 2.4 that has been fixed from 3.0 onward. More information here. In short, there is a configuration called spark.sql.legacy.timeParserPolicy with 3 possible values:

LEGACY: behave like 2.4, where you see the bug
EXCEPTION: raise an exception when a value has an invalid format
CORRECTED: fix the parsing and return the expected result

This is the result with CORRECTED:
+-----------+------+---------------+--------------+
|date |active|Date_validation|converted_date|
+-----------+------+---------------+--------------+
|27/10/00 |Y |false |null |
|27/10/01 |Y |false |null |
|27/10/1 |Y |false |null |
|27/10/16 |Y |false |null |
|27/10 |Y |false |null |
|27/10/99 |Y |false |null |
|27/10/100 |Y |false |null |
|27/10/992 |Y |false |null |
|27/10/2000 |Y |true |27/10/2000 |
|27/10/9999 |Y |true |27/10/9999 |
|27/10/20015|Y |false |null |
+-----------+------+---------------+--------------+
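On Spark 3.x you can opt into this behaviour with spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED"). If you are stuck on 2.4, one possible workaround (a sketch of my own, not part of Spark's API) is to validate the string in plain Python, e.g. inside a UDF. CPython's strptime is strict: %Y matches exactly four digits and leftover characters raise ValueError. The helper names below are hypothetical, and the format mapping assumes the user-configured patterns are built only from the dd, MM, and yyyy tokens:

```python
from datetime import datetime

# Assumption: user-configured Spark patterns only use these tokens.
SPARK_TO_STRPTIME = {"dd": "%d", "MM": "%m", "yyyy": "%Y"}

def to_strptime_format(spark_format: str) -> str:
    # Translate a Spark pattern like "dd/MM/yyyy" to "%d/%m/%Y".
    fmt = spark_format
    for spark_tok, py_tok in SPARK_TO_STRPTIME.items():
        fmt = fmt.replace(spark_tok, py_tok)
    return fmt

def is_valid_date(value: str, spark_format: str) -> bool:
    # Strict check: %d/%m/%Y requires exactly 2+2+4 digits, and any
    # trailing characters make strptime raise ValueError.
    try:
        datetime.strptime(value, to_strptime_format(spark_format))
        return True
    except ValueError:
        return False
```

With this check, "27/10/00" and "27/10/20015" are both rejected for the pattern "dd/MM/yyyy", while "27/10/2000" passes; you could wrap is_valid_date in f.udf(..., BooleanType()) to use it as a column expression.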