
Convert day of year to date format in PySpark


I have a PySpark DataFrame with a date column formatted as yyyyddd, where yyyy is the year (e.g. 2020, 2021) and ddd is the day of the year (e.g. 001, 365, 366).

I am trying to convert it to date as:

df = df.withColumn("new_date", to_date("old_date", "yyyyddd"))

but this gives the correct result only for dates in January, and null for all other months.

old_date is StringType and new_date is DateType

old_date   new_date
2006272    null          (272nd day of 2006)
2008016    2008-01-16
2011179    null
2011026    2011-01-26

How can I convert this date format?


Solution

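  • Use the D pattern letter, which represents the day of year, with unix_timestamp as shown below. The lowercase d in the original pattern means day of month, so the last three digits were read as a day of January, and any value above 31 produced null. No UDF is needed for this operation.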

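    # Import functions
    import pyspark.sql.functions as f

    # 'D' parses the day of year; the output pattern needs both hyphens.
    # from_unixtime returns a string, so to_date is applied to get DateType.
    df = df.withColumn(
        "new_date",
        f.to_date(f.from_unixtime(f.unix_timestamp("old_date", "yyyyD"), "yyyy-MM-dd")),
    )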
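
As a sanity check outside Spark, the expected dates for the sample values in the question can be worked out with Python's standard library. This is only a minimal sketch for verifying results, not part of the Spark solution: parse_year_day is a hypothetical helper name, and it assumes the input is always a four-digit year followed by a zero-padded three-digit day of year.

```python
from datetime import date, datetime


def parse_year_day(value: str) -> date:
    """Parse a 'yyyyddd' string: 4-digit year followed by 3-digit day of year."""
    # %Y is the four-digit year, %j is the zero-padded day of year (001-366).
    return datetime.strptime(value, "%Y%j").date()


# Sample values taken from the question.
samples = ["2006272", "2008016", "2011179", "2011026"]
for s in samples:
    print(s, parse_year_day(s).isoformat())
# 2006272 2006-09-29
# 2008016 2008-01-16
# 2011179 2011-06-28
# 2011026 2011-01-26
```

The two rows that came back as null in the question should therefore become 2006-09-29 and 2011-06-28 once the day-of-year pattern is used.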