I am generating data using TPC-DS.
I load the customers table into a DataFrame. The c_first_sales_date_sk column has values such as 2449001, which makes me think they are Julian dates in yyyyDDD format.
So far I have tried:
from pyspark.sql.functions import col, to_date
df_with_date = df.withColumn("c_first_sales_date", to_date(col("c_first_sales_date_sk"), format="yyyyDDD"))
display(df_with_date)
Applying this converts 2449001 to 2449-01-01, which is wrong. The online converter at http://www.longpelaexpertise.com/toolsJulian.php converts the same value to 01-Jan-2024.
What am I doing wrong? How do I convert this column properly?
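For what it is worth, the pattern itself seems to behave consistently. As a quick check outside Spark, here is a minimal sketch using only Python's standard library (the helper name parse_year_day is made up for illustration). It reads the first four digits as the year and the last three as the day of the year, which is how I understand yyyyDDD:

```python
from datetime import date, datetime


def parse_year_day(value: str) -> date:
    """Parse a yyyyDDD string: 4-digit year followed by 3-digit day of year."""
    # %Y consumes the four-digit year, %j the day of the year (001-366).
    return datetime.strptime(value, "%Y%j").date()


for raw in ["2449001", "2020111", "2010364"]:
    print(raw, "->", parse_year_day(raw))
```

Here too 2449001 comes out as 2449-01-01, so the parse does what the pattern says; the question is whether the pattern matches what the column actually stores.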
A minimal example that reproduces the problem:
from pyspark.sql.functions import col, to_date

data = [[1, '2449001'], [2, '2020111'], [3, '2010364']]
cols = ['id', 'jd']
df = spark.createDataFrame(data=data, schema=cols)

# Parse jd as a four-digit year followed by a three-digit day of year
df_with_date = df.withColumn("ad", to_date(col("jd"), format="yyyyDDD"))
display(df_with_date)
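One possibility I have not ruled out is that the _sk values are Julian Day Numbers (a running count of days) rather than yyyyDDD strings. The sketch below assumes that interpretation, which I have not confirmed. The helper jdn_to_date is hypothetical, and the offset 1721425 is the difference between a Julian Day Number and Python's proleptic Gregorian ordinal:

```python
from datetime import date

# Offset between a Julian Day Number and Python's proleptic Gregorian ordinal:
# 2000-01-01 has Julian Day Number 2451545 and ordinal 730120.
JDN_TO_ORDINAL_OFFSET = 1721425


def jdn_to_date(jdn: int) -> date:
    """Convert a Julian Day Number to a calendar date (assumed interpretation)."""
    return date.fromordinal(jdn - JDN_TO_ORDINAL_OFFSET)


print(jdn_to_date(2449001))
```

Under that assumption 2449001 would be 1993-01-13, which matches neither 2449-01-01 nor the online converter's result, so I am not sure which interpretation is right.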