Consider the below dataframe `df`:

| time                |
|---------------------|
| 2022-02-21T11:23:54 |
I have to convert it to:

| time                |
|---------------------|
| 2022-02-21T11:23:00 |
After using the below code:

```python
df.withColumn("time_updated", date_trunc("minute", col("time"))).show(truncate=False)
```
My output:

| time                |
|---------------------|
| 2022-02-21 11:23:00 |
My desired output is:

| time                |
|---------------------|
| 2022-02-21T11:23:00 |
Is there any way I can keep the data the same and just update/truncate the seconds?
The issue lies in the output format of your `time_updated` column. When you use `date_trunc`, it returns a `timestamp` type, which is displayed in the default format (`yyyy-MM-dd HH:mm:ss`). To match your desired format, you need to explicitly format the output as a string.
Here’s how you can do it:
```python
from pyspark.sql import functions as F

df = df.withColumn(
    "time_updated",
    # Zero out the seconds and render the value as a string with the literal 'T' separator.
    F.date_format(F.col("time").cast("timestamp"), "yyyy-MM-dd'T'HH:mm:00"),
)
df.show(truncate=False)
```
Output:

```
+-------------------+-------------------+
|time               |time_updated       |
+-------------------+-------------------+
|2022-02-21T11:23:54|2022-02-21T11:23:00|
+-------------------+-------------------+
```
You can confirm the schema using `df.printSchema()`:

```
root
 |-- time: string (nullable = true)
 |-- time_updated: string (nullable = true)
```
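If you would rather keep a real `timestamp` column (useful for further date arithmetic) and only produce the `T` separator when you need the string form, you can combine `date_trunc` with `date_format`. A minimal sketch, assuming Spark 3.x; the `spark.createDataFrame` call and the `time_ts` column name are just for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Throwaway session/data for illustration; use your own DataFrame instead.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2022-02-21T11:23:54",)], ["time"])

df = (
    df
    # Keep a true timestamp column, truncated to the minute.
    .withColumn("time_ts", F.date_trunc("minute", F.col("time").cast("timestamp")))
    # Render the 'T' separator only where the string form is needed.
    .withColumn("time_updated", F.date_format(F.col("time_ts"), "yyyy-MM-dd'T'HH:mm:ss"))
)

df.show(truncate=False)
df.printSchema()  # time_ts stays a timestamp; time_updated is the formatted string
```

The trade-off is that `time_ts` remains a proper `timestamp` for filtering, windowing, or arithmetic, while `time_updated` is only the display string.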