apache-spark apache-spark-sql avro spark-avro

How to force the Avro writer to write timestamps in UTC in a Spark Scala DataFrame


I need to write a Timestamp field to Avro and ensure the data is saved in UTC. Currently Avro converts it to a long (timestamp millis) in the local time zone of the server, which causes problems when the server reading the data back is in a different time zone. I looked at DataFrameWriter, which seems to mention an option called timeZone, but it does not seem to help. Is there a way to force Avro to treat all timestamp fields it receives as being in a specific time zone?

**CODE SNIPPET** 
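// write to Avro with spark-avro
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
import com.databricks.spark.avro._  // assumed source of the .avro(...) methods used below

// Timestamp.valueOf interprets the string in the JVM's default time zone
val data = Seq(Row("1", java.sql.Timestamp.valueOf("2020-05-11 15:17:57.188")))
val schemaOrig = List(
  StructField("rowkey", StringType, true),
  StructField("txn_ts", TimestampType, true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schemaOrig))
sourceDf.write.option("timeZone", "UTC").avro("/test4")

// now try to read back from Avro
val avroDf = spark.read.avro("/test4")
avroDf.show(false)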

Original value in source: 2020-05-11 15:17:57.188
Value in Avro: 1589224677188
Read back from Avro without formatting:
+-------------+-------------+
|rowkey       |txn_ts       |
+-------------+-------------+
|1            |1589224677188|
+-------------+-------------+

This maps fine, but the issue is that if the server writing the data is on EST local time and the server reading it back is on GMT, the value is interpreted differently.

println(new java.sql.Timestamp(1589224677188L))
2020-05-11 19:17:57.188   -- time in GMT
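
The relationship between the wall-clock string and the stored long can be checked without Spark. The sketch below uses only java.time; the object and method names (EpochZoneSketch, epochMillisIn, render) are made up for illustration, and America/New_York is assumed to stand in for the writing server's zone. It suggests that the stored long is an absolute instant, and that the zone only matters when the string is parsed or the long is rendered.

```scala
import java.time.{Instant, LocalDateTime, ZoneId}

object EpochZoneSketch {
  // Wall-clock value from the question, with no time zone attached yet.
  val wallClock: LocalDateTime = LocalDateTime.parse("2020-05-11T15:17:57.188")

  // Epoch milliseconds obtained when the wall-clock value is interpreted in `zone`.
  def epochMillisIn(zone: String): Long =
    wallClock.atZone(ZoneId.of(zone)).toInstant.toEpochMilli

  // Wall-clock rendering of an epoch value as seen from `zone`.
  def render(epochMillis: Long, zone: String): LocalDateTime =
    LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneId.of(zone))

  def main(args: Array[String]): Unit = {
    println(epochMillisIn("America/New_York")) // 1589224677188, the value found in Avro
    println(epochMillisIn("UTC"))              // 1589210277188, four hours earlier
    println(render(1589224677188L, "UTC"))     // 2020-05-11T19:17:57.188
  }
}
```

If this reading is right, the value to control is the zone used when the string becomes a timestamp on the writing side, not the long itself.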

Solution

  • The .option("timeZone","UTC") option does not convert timestamps to the UTC time zone.

    Instead, call spark.conf.set("spark.sql.session.timeZone", "UTC") to make UTC the default time zone for all timestamps.

    If spark.sql.session.timeZone is not set, its default value is the JVM's system local time zone.

    If the options above do not work because of an older Spark version, try passing the options below.

    --conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" --conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC"
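
As a rough illustration of the session setting, here is a minimal sketch, assuming Spark 2.2 or later, where spark.sql.session.timeZone is available. The object name UtcSessionSketch, its helper methods, and the output path /tmp/utc_sketch are hypothetical. Note that java.sql.Timestamp.valueOf always uses the JVM default time zone, so the sketch parses the string with a Spark cast instead, which is evaluated in the session time zone. The Avro write in main additionally assumes the spark-avro module of Spark 2.4+ is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, unix_timestamp}

object UtcSessionSketch {
  // Builds the question's single row, parsing the string under a UTC session time zone.
  def utcFrame(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    Seq(("1", "2020-05-11 15:17:57.188"))
      .toDF("rowkey", "txn_ts_str")
      // The string-to-timestamp cast is evaluated in the session time zone.
      .select(col("rowkey"), col("txn_ts_str").cast("timestamp").as("txn_ts"))
  }

  // Epoch seconds of txn_ts in the first row, for checking what would be stored.
  def epochSeconds(df: DataFrame): Long = {
    import df.sparkSession.implicits._
    df.select(unix_timestamp(col("txn_ts"))).as[Long].head()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("utc-session-sketch")
      .master("local[*]")
      .getOrCreate()
    val df = utcFrame(spark)
    println(epochSeconds(df))
    // Hypothetical output path; needs the spark-avro module (Spark 2.4+).
    df.write.mode("overwrite").format("avro").save("/tmp/utc_sketch")
    spark.stop()
  }
}
```

On older setups that build rows with java.sql.Timestamp.valueOf, the -Duser.timezone=UTC JVM options shown above are the more likely fix, since they change the JVM default zone that valueOf relies on.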