I have a String column called lastModified that represents a time in GMT, for example "2019-06-24T15:36:16.000Z".
I want to format this string to the format yyyy-MM-dd HH:mm:ss in Spark using Scala. To achieve this, I created a DataFrame with a new column "ConvertedTS", but it gives an incorrect time.
Machine from where I am running this is in America/New_York timezone.
df.withColumn("ConvertedTS", date_format(to_utc_timestamp(to_timestamp(col("lastModified"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"), "America/New_York"), "yyyy-MM-dd HH:MM:SS").cast(StringType))
I am basically looking to format the result of the statement below as yyyy-MM-dd HH:mm:ss:
df.withColumn("LastModifiedTS", col("lastModified"))
One approach that currently works for me is a UDF, but since UDFs are not recommended, I was looking for a more direct expression I can use instead.
import java.text.SimpleDateFormat
import java.util.TimeZone

val convertToTimestamp = (logTimestamp: String) => {
  println("logTimestamp: " + logTimestamp)
  var newDate = ""
  try {
    val sourceFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
    sourceFormat.setTimeZone(TimeZone.getTimeZone("GMT"))
    val convertedDate = sourceFormat.parse(logTimestamp)
    val destFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    destFormat.setTimeZone(TimeZone.getTimeZone("GMT"))
    newDate = destFormat.format(convertedDate)
    println("newDate: " + newDate)
  } catch {
    case e: Exception => e.printStackTrace()
  }
  newDate
}
//register for sql
EdlSparkObjects.sparkSession.sqlContext.udf.register("convertToTimestamp", convertToTimestamp)
// register for scala
def convertToTimestampUDF = udf(convertToTimestamp)
df.withColumn("LastModifiedTS", convertToTimestampUDF(col("lastModified")))
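For reference, the same parse-and-format logic inside the UDF can be written with the java.time API instead of SimpleDateFormat (which is mutable and not thread-safe). This is a sketch of the equivalent standalone logic, not the original code:

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// DateTimeFormatter is immutable and thread-safe, unlike SimpleDateFormat.
// withZone(ZoneOffset.UTC) keeps the output in GMT, matching the UDF above.
val destFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC)

// Instant.parse understands the ISO-8601 input string directly,
// so no source format pattern is needed.
val formatted = destFormat.format(Instant.parse("2019-06-24T15:36:16.000Z"))
// formatted == "2019-06-24 15:36:16"
```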
Thanks for help and guidance.
You're almost there with your first withColumn attempt. It just contains an incorrect date formatting string: yyyy-MM-dd HH:MM:SS should be yyyy-MM-dd HH:mm:ss (in the pattern syntax, MM is month and SS is fractional seconds, whereas mm and ss are minutes and seconds). Also, cast(StringType) is unnecessary, since date_format already returns a StringType column. Below is sample code with the corrected date formatting:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1, "2019-06-24T15:36:16.000Z"),
(2, "2019-07-13T16:25:27.000Z")
).toDF("id", "lastModified")
df.withColumn("ConvertedTS", date_format(to_utc_timestamp(to_timestamp(
    $"lastModified", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"), "America/New_York"), "yyyy-MM-dd HH:mm:ss")
  ).show(false)
// +---+------------------------+-------------------+
// |id |lastModified |ConvertedTS |
// +---+------------------------+-------------------+
// |1 |2019-06-24T15:36:16.000Z|2019-06-24 19:36:16|
// |2 |2019-07-13T16:25:27.000Z|2019-07-13 20:25:27|
// +---+------------------------+-------------------+
Additionally, if you always want UTC output no matter what time zone this executes in, you can use java.util.TimeZone to supply the local zone ID, like:
import java.util.TimeZone

df.withColumn("ConvertedTS", date_format(to_utc_timestamp(to_timestamp(
    $"lastModified", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"), TimeZone.getDefault.getID), "yyyy-MM-dd HH:mm:ss")
  )
(NOTE: if anyone knows how to do this with the modern java.time API, please do edit.)