scala, apache-spark, dataframe, apache-spark-sql, user-defined-functions

Converting TZ timestamp string to a given format in UTC using spark and scala


I have a column called lastModified containing a String, as shown below, that represents a time in GMT: "2019-06-24T15:36:16.000Z"

I want to format this string as yyyy-MM-dd HH:mm:ss in Spark using Scala. To achieve this, I created a dataframe with a new column "ConvertedTS", but it gives an incorrect time.

The machine I am running this from is in the America/New_York timezone.

df.withColumn("ConvertedTS", date_format(to_utc_timestamp(to_timestamp(col("lastModified"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"), "America/New_York"), "yyyy-MM-dd HH:MM:SS").cast(StringType))

I am basically looking to format the result of the statement below as yyyy-MM-dd HH:mm:ss:

df.withColumn("LastModifiedTS", col("lastModified"))

One approach that currently works for me is a UDF, but since UDFs are not recommended, I was looking for a more direct expression that I can use.

import java.text.SimpleDateFormat
import java.util.TimeZone

val convertToTimestamp = (logTimestamp: String) => {
    println("logTimeStamp: " + logTimestamp)
    var newDate = ""
    try {
      // parse the ISO-8601 source string as GMT
      val sourceFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
      sourceFormat.setTimeZone(TimeZone.getTimeZone("GMT"))
      val convertedDate = sourceFormat.parse(logTimestamp)
      // re-format it as yyyy-MM-dd HH:mm:ss, still in GMT
      val destFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
      destFormat.setTimeZone(TimeZone.getTimeZone("GMT"))
      newDate = destFormat.format(convertedDate)
      println("newDate: " + newDate)
    } catch {
      case e: Exception => e.printStackTrace()
    }
    newDate
  }

  //register for sql
  EdlSparkObjects.sparkSession.sqlContext.udf.register("convertToTimestamp", convertToTimestamp)

  // register for scala
  def convertToTimestampUDF = udf(convertToTimestamp)
  df.withColumn("LastModifiedTS", convertToTimestampUDF(col("lastModified")))

Thanks for the help and guidance.


Solution

  • You're almost there with your first withColumn attempt. It just uses an incorrect date formatting string, yyyy-MM-dd HH:MM:SS, in which MM (month) and SS (fractional seconds) should be mm (minutes) and ss (seconds). Also, cast(StringType) is unnecessary since date_format already returns a StringType column. Below is sample code with the corrected date formatting:

    import org.apache.spark.sql.functions._
    import spark.implicits._
    
    val df = Seq(
      (1, "2019-06-24T15:36:16.000Z"),
      (2, "2019-07-13T16:25:27.000Z")
    ).toDF("id", "lastModified")
    
    df.withColumn("ConvertedTS", date_format(to_utc_timestamp(to_timestamp(
      $"lastModified", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"), "America/New_York"), "yyyy-MM-dd HH:mm:ss")
    ).
    show(false)
    // +---+------------------------+-------------------+
    // |id |lastModified            |ConvertedTS        |
    // +---+------------------------+-------------------+
    // |1  |2019-06-24T15:36:16.000Z|2019-06-24 19:36:16|
    // |2  |2019-07-13T16:25:27.000Z|2019-07-13 20:25:27|
    // +---+------------------------+-------------------+
    

    Additionally, if you always want UTC no matter what time zone this is executing in, you can use java.util.TimeZone to pick up the machine's zone dynamically, like:

    df.withColumn("ConvertedTS", date_format(to_utc_timestamp(to_timestamp(
      $"lastModified", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"), "${TimeZone.getDefault.getID}"), "yyyy-MM-dd HH:mm:ss")
    

    (NOTE: if anyone knows how to do this with the new Java date stuff, please do edit)
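
    Regarding the note above, here is a minimal java.time-based sketch. It assumes the input is always a valid ISO-8601 instant like the samples (so Instant.parse can handle the trailing Z without a source pattern), and it still goes through a UDF, so it does not address the preference for built-in expressions; the name toUtcString is just illustrative:

    import java.time.{Instant, ZoneOffset}
    import java.time.format.DateTimeFormatter

    import org.apache.spark.sql.functions.udf

    val toUtcString = udf { (logTimestamp: String) =>
      // build the formatter inside the function so the closure stays serializable
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
        .withZone(ZoneOffset.UTC)
        .format(Instant.parse(logTimestamp))
    }

    df.withColumn("LastModifiedTS", toUtcString($"lastModified")).show(false)

    Since Instant.parse understands the trailing Z natively, there is no manual GMT handling, and the output formatter is pinned to UTC regardless of where the job runs.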