pyspark

Pyspark date_trunc without modifying actual value


Consider the dataframe below:

df:

time
2022-02-21T11:23:54

I have to convert it to

time
2022-02-21T11:23:00

After using the code below:

from pyspark.sql.functions import col, date_trunc

df.withColumn("time_updated", date_trunc("minute", col("time"))).show(truncate=False)

My output (the time_updated column)

time_updated
2022-02-21 11:23:00

My desired output is

time
2022-02-21T11:23:00

Is there any way I can keep the format the same and just truncate the seconds?


Solution

  • The issue lies in the output format of your time_updated column. date_trunc returns a timestamp type, which show displays in Spark's default format (yyyy-MM-dd HH:mm:ss), with a space instead of the T separator. To match your desired format, you need to explicitly format the result as a string.

    Here’s how you can do it:

    from pyspark.sql import functions as F
    
    df = df.withColumn(
        "time_updated",
        # The quoted 'T' keeps the ISO 8601 separator, and the literal 00
        # in the pattern zeroes the seconds.
        F.date_format(F.col("time").cast("timestamp"), "yyyy-MM-dd'T'HH:mm:00"),
    )
    
    df.show(truncate=False)
    

    Output:

    +-------------------+-------------------+
    |time               |time_updated       |
    +-------------------+-------------------+
    |2022-02-21T11:23:54|2022-02-21T11:23:00|
    +-------------------+-------------------+
    

    You can confirm the schema using df.printSchema():

    root
     |-- time: string (nullable = true)
     |-- time_updated: string (nullable = true)
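
    As a cross-check outside Spark, the same truncate-then-reformat idea can be sketched in plain stdlib Python. This is only an illustration of the pattern (the `truncate_to_minute` helper is hypothetical, not part of the Spark answer):

    ```python
    from datetime import datetime

    def truncate_to_minute(ts: str) -> str:
        """Zero the seconds of an ISO-style timestamp, keeping the 'T' separator."""
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")
        # replace() zeroes the seconds; strftime re-emits the original format
        return dt.replace(second=0, microsecond=0).strftime("%Y-%m-%dT%H:%M:%S")

    print(truncate_to_minute("2022-02-21T11:23:54"))  # 2022-02-21T11:23:00
    ```

    The Spark version does both steps at once, because the `00` literal in the `date_format` pattern makes the separate truncation step unnecessary.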