pythondatedatetimeazure-databricksjulian-date

Converting 7 digit dates to normal calendar dates in Databricks python


I am generating data using TPC-DS.

I load the customers table to a dataframe. The c_first_sales_date_sk column has values such as 2449001, which makes me think they are Julian calendar dates of type yyyyDD.

So far I have tried:

from pyspark.sql.functions import to_date, from_unixtime
df_with_date = df.withColumn("c_first_sales_date", to_date(col("c_first_sales_date_sk"), format="yyyyDDD"))
display(df_with_date)

Applying this, it will convert 2449001 to 2449-01-01, which is wrong. The online convert at http://www.longpelaexpertise.com/toolsJulian.php converts the same date to 01-Jan-2024.

What am I doing wrong? How do I convert this column properly?


Solution

  • data = [[1,'2449001'],[2,'2020111'],[3,'2010364']]
    cols = ['id','jd']
    df = spark.createDataFrame(data=data, schema=cols)
    #df.show()
    
    from pyspark.sql.functions import to_date, from_unixtime,col
    df_with_date = df.withColumn("ad", to_date(col("jd"), format="yyyyDDD"))
    display(df_with_date)
    

    enter image description here