I need to convert a date string to a DateType, but I'm running into several problems with to_date.
Formatting for the day works well (1 or 2 digits). The month is a Dutch abbreviation and doesn't parse unless the abbreviation happens to be the same as the English one. The year is 2 or 4 digits (the century is sometimes missing!).
What's the best way to convert these all to a DateType?
I couldn't find an option to set the locale to NL in the format pattern.
I created a UDF, but I don't know if this is the best way to fix this.
The 19 for the century is debatable.
Code:
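from pyspark.sql import functions as F, types as T

@F.udf(T.StringType())
def convert_date(s):
    # Dutch month abbreviations mapped to two-digit month numbers
    month_dict = {"jan":"01", "feb":"02", "mrt":"03", "apr":"04", "mei":"05", "jun":"06", "jul":"07", "aug":"08", "sep":"09", "okt":"10", "nov":"11", "dec":"12" }
    day, month, year = s.split("-")
    if len(day) == 1:
        day = '0' + day  # zero-pad single-digit days
    if len(year) < 4:
        year = '19' + year  # assume the 1900s when the century is missing
    date = day + "-" + month_dict[month] + "-" + year
    return date

# Apply the UDF to the column first, then parse the normalized string
df = df.withColumn('DateOfBirth_new', F.to_date(convert_date(F.col("DateOfBirth")), "dd-M-yyyy"))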
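Since the fixed '19' prefix is the debatable part, here is a rough sketch of one alternative: pick the most recent century that does not push the year past a cut-off. This is plain Python, not a Spark API. The names parse_dutch_date and max_year are made up for illustration, and the cut-off of 2024 is only an assumption (for dates of birth, the current year would be a natural choice).

```python
from datetime import date

# Dutch month abbreviations mapped to month numbers
MONTHS_NL = {"jan": 1, "feb": 2, "mrt": 3, "apr": 4, "mei": 5, "jun": 6,
             "jul": 7, "aug": 8, "sep": 9, "okt": 10, "nov": 11, "dec": 12}

def parse_dutch_date(s, max_year=2024):
    """Parse 'd-mmm-yy' or 'dd-mmm-yyyy' with Dutch month abbreviations.

    max_year is an assumed cut-off: a two-digit year is placed in the
    latest century that keeps the result <= max_year.
    """
    day, month, year = s.strip().lower().split("-")
    year_num = int(year)
    if len(year) <= 2:
        # Start in the century of max_year, step back 100 years if too late
        year_num += max_year - max_year % 100
        if year_num > max_year:
            year_num -= 100
    return date(year_num, MONTHS_NL[month], int(day))

print(parse_dutch_date("3-mrt-83"))    # two-digit year
print(parse_dutch_date("05-mei-1974")) # four-digit year is kept as-is
```

A function like this returns a datetime.date, so it could presumably be registered with F.udf(T.DateType()) and used without the extra to_date call, but it is still a Python UDF with the usual serialization overhead.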
DataFrame:
df = spark.createDataFrame([
["2-feb-1966"],
["05-mei-1974"],
["3-mrt-83"],
["05-mrt-1983"],
["12-jun-75"]
]).toDF("DateOfBirth")
month_dict = {"jan":"01", "feb":"02", "mrt":"03", "apr":"04", "mei":"05", "jun":"06", "jul":"07", "aug":"08", "sep":"09", "okt":"10", "nov":"11", "dec":"12" }
# Replace each Dutch month abbreviation in 'column' with its month number
for key, item in month_dict.items():
    df = df.withColumn('column', F.regexp_replace('column', key, item))