Convert string-type datetime values to a specific format in PySpark


I have a PySpark dataframe with the following kinds of datetime values (string type):

|text|date_filing|
|AAA|1998-12-22|
|BBBB|2023-08-30 12:03:17.814757+00|
|CCC|null|
|DDD|2017-11-28|

I want to keep the values as strings, but in a specific format: "yyyy-MM-ddTHH:mm:ssZ"

I tried the approach below:

df.withColumn('time_start',when((df.date_filing.isNull() | (df.date_filing == '')) ,'').otherwise(to_timestamp(col("date_filing"), "yyyy-MM-dd'T'HH:mm:ss'Z'")))

But I am getting nulls in the new column.

Expected output:

|text|date_filing|
|AAA|1998-12-22T00:00:00Z|
|BBBB|2023-08-30T12:03:17Z|
|CCC||
|DDD|2017-11-28T00:00:00Z|

Any help would be appreciated.


Solution

  • You can use the date_format function from the pyspark.sql.functions module to render the values in the desired format. Your attempt returns nulls because the pattern passed to to_timestamp describes the input string, not the output, and none of your values look like "yyyy-MM-dd'T'HH:mm:ss'Z'". Parse first, then format:

      from pyspark.sql.functions import col, date_format, to_timestamp, when

      # Null or empty input becomes ''; everything else is parsed, then formatted
      df = df.withColumn(
          'date_filing_formatted',
          when(df.date_filing.isNull() | (df.date_filing == ''), '')
          .otherwise(date_format(to_timestamp(col('date_filing')), "yyyy-MM-dd'T'HH:mm:ss'Z'")),
      )

    In this example, withColumn adds a new column named date_filing_formatted. The when and otherwise functions handle the case where date_filing is null or empty. For all other rows, to_timestamp (called without a pattern, so Spark's default parsing applies) converts the string to a timestamp, and date_format renders that timestamp as "yyyy-MM-ddTHH:mm:ssZ". Note that the 'Z' in the pattern is a literal character, and date_format renders the timestamp in the session time zone, so the output is only true UTC if spark.sql.session.timeZone is set to UTC.
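    To check the expected output without a Spark session, the same transformation can be sketched in plain Python. This is not the Spark implementation, only a reference under stated assumptions: to_iso_utc is a hypothetical helper name, values without an offset are assumed to already be in UTC, and the "null" in the sample table is assumed to be a real null rather than the literal string.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Matches a time that ends in an hours-only offset such as "+00" or "-05"
_HOURS_ONLY_OFFSET = re.compile(r"(:\d{2}(?:\.\d+)?[+-]\d{2})$")


def to_iso_utc(value: Optional[str]) -> str:
    """Hypothetical helper: format a date or timestamp string as yyyy-MM-ddTHH:mm:ssZ."""
    if value is None or value.strip() == "":
        return ""  # mirrors the empty-string branch of the when/otherwise expression
    # Expand "+00" to "+00:00" so datetime.fromisoformat accepts it on older Pythons
    normalized = _HOURS_ONLY_OFFSET.sub(r"\1:00", value.strip())
    parsed = datetime.fromisoformat(normalized)
    if parsed.tzinfo is None:
        # Assumption: values without an offset are already in UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    else:
        parsed = parsed.astimezone(timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


for sample in ["1998-12-22", "2023-08-30 12:03:17.814757+00", None, "2017-11-28"]:
    print(repr(sample), "->", repr(to_iso_utc(sample)))
```

    If the built-in functions cannot parse some of your inputs, a function like this could be wrapped with pyspark.sql.functions.udf, though built-in column functions are usually faster than Python UDFs.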