amazon-s3, pyspark, data-lake

How to convert a timestamp to the AWS data lake S3 timestamp format


In my DataFrame I have a timestamp column with values in the format "%Y-%m-%d %H:%M:%S". I want to save this DataFrame to Parquet using:

df.write.partitionBy('p_timestamp').mode("overwrite").parquet('output/proto.parquet')

and I want the timestamp partition saved in the AWS data lake S3 timestamp format "%Y-%m-%dT%H-%M".

I tried:

df.select(F.col("p_timestamp"), F.date_format(F.col("p_timestamp"), "%Y-%m-%dT%H-%M").alias("date_format")).show()

but I got an error: Illegal pattern character 'T'

I would be grateful for any tips.


Solution

  • I think the format string you are looking for is:

    "yyyy-MM-dd'T'HH-mm"
    

    You get the error because the T is not quoted with single quotes (') to mark it as a literal. You can consult Spark's Datetime Patterns for Formatting and Parsing if you need something slightly different.

    Small example:

    import pyspark.sql.functions as F
    import pyspark.sql.types as T

    schema = T.StructType([T.StructField("Time", T.StringType(), False)])
    df = spark.createDataFrame([('1970-09-01 02:02:10',)], schema=schema)
    # Quote 'T' so it is treated as a literal, not a pattern letter
    df.withColumn("Convert", F.date_format(F.col("Time"), "yyyy-MM-dd'T'HH-mm")).show()
    

    Output:

    +-------------------+----------------+
    |               Time|         Convert|
    +-------------------+----------------+
    |1970-09-01 02:02:10|1970-09-01T02-02|
    +-------------------+----------------+
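
    The same conversion can then be applied before the partitioned write from the question. A minimal sketch, assuming a DataFrame df that already has a string column p_timestamp in "yyyy-MM-dd HH:mm:ss" format:

    import pyspark.sql.functions as F

    # Reformat the partition column to the S3-friendly pattern,
    # then partition the Parquet output by the reformatted value
    df = df.withColumn("p_timestamp", F.date_format(F.col("p_timestamp"), "yyyy-MM-dd'T'HH-mm"))
    df.write.partitionBy("p_timestamp").mode("overwrite").parquet("output/proto.parquet")

    Each partition directory is then named like p_timestamp=1970-09-01T02-02.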