hive, apache-spark-sql, spark-avro, avro-tools

Hive external table on Avro timestamp field returning as long


I have Avro data with a single timestamp column, and I am trying to create an external Hive table on top of the Avro files. The data gets saved in Avro as a long, and I expect the Avro logical type to handle the conversion back to a timestamp when I query the Hive table. But that is not happening: the query simply returns the long value. How can I get this working as expected?

PS: I am using Spark 2.3 and Databricks' com.databricks:spark-avro_2.11:

    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-avro_2.11</artifactId>
        <version>4.0.0</version>
    </dependency>

STEP 1: Store the timestamp value to Avro

import spark.implicits._
import com.databricks.spark.avro._  // enables .avro() on DataFrameReader/Writer

val startTs = java.sql.Timestamp.valueOf("2020-05-11 14:17:57.188")
val df = Seq(startTs).toDF
df.write.avro("/test")
val avroDf = spark.read.avro("/test")
avroDf.show(false)

+-------------+
|value        |
+-------------+
|1589221077188|
+-------------+
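
As an aside, the long shown above is just the epoch-millis encoding of the original timestamp. A quick plain-Scala check (no Spark needed) illustrates the round trip; note that `java.sql.Timestamp.valueOf` interprets its argument in the JVM's local time zone, so the `America/New_York` zone below (UTC-4 in May) is purely an assumption for illustration:

```scala
import java.time.{Instant, LocalDateTime, ZoneId}

// The stored long decodes to this UTC instant
val instant = Instant.ofEpochMilli(1589221077188L)
println(instant) // prints 2020-05-11T18:17:57.188Z

// Viewed in the (assumed) writer time zone, it matches the original wall-clock value
println(LocalDateTime.ofInstant(instant, ZoneId.of("America/New_York")))
// prints 2020-05-11T14:17:57.188
```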


AVSC file generated with avro-tools from the Avro data files (test.avsc):
{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "value",
    "type" :  "long", "logicalType": "timestamp-millis" 
  } ]
}

hdfs dfs -copyFromLocal -f test.avsc /tmp/test.avsc

STEP 2: Create an external Hive table on the Avro data

DROP TABLE IF EXISTS test_a;

CREATE EXTERNAL TABLE  test_a 
STORED AS AVRO
LOCATION '/tenants/gwm/idr/tmp/test'
TBLPROPERTIES ('avro.schema.url'='/tmp/test.avsc');

msck repair table test_a;
select * from  test_a;

+----------------+--+
|  test_a.value  |
+----------------+--+
| 1589221077188  |
+----------------+--+

Instead of this long value, I am trying to get the timestamp value.


Solution

  • From the Avro specification:

    A timestamp-millis logical type annotates an Avro long, where the long stores the number of milliseconds from the unix epoch, 1 January 1970 00:00:00.000 UTC.

    This related question may help:

    Creating a Proper avro schema for timestamp record

    Alternatively, you could convert the long value in your Hive query, or store the timestamp as a string value from your Spark app:

    SELECT CONCAT(FROM_UNIXTIME(CAST(SUBSTR(CAST(1589221077188 AS STRING), 1, 10) AS BIGINT)), '.', SUBSTR(CAST(1589221077188 AS STRING), 11, 3)) AS `timestamp`;
    
    SELECT CONCAT(FROM_UNIXTIME(CAST(SUBSTR(CAST(time AS STRING), 1, 10) AS BIGINT)), '.', SUBSTR(CAST(time AS STRING), 11, 3)) AS `timestamp`;
    
    "2020-05-11 14:17:57.188"
    

    I hope this helps.
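
For reference, the string-splitting logic in the Hive query above can be sketched in plain Scala — a minimal illustration, not the asker's code. `millisToTimestampString` is a hypothetical helper name, and it formats in UTC here; Hive's `FROM_UNIXTIME` uses the session time zone, which is why the asker would see `14:17:57` locally. Like the SUBSTR-based query, it assumes a 13-digit millisecond value:

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper mirroring the Hive expression:
// the first 10 digits of the epoch-millis string are whole seconds,
// the remaining digits are the millisecond fraction.
def millisToTimestampString(epochMillis: Long): String = {
  val s = epochMillis.toString
  val seconds = s.substring(0, 10).toLong
  val fraction = s.substring(10)
  val fmt = DateTimeFormatter
    .ofPattern("yyyy-MM-dd HH:mm:ss")
    .withZone(ZoneOffset.UTC) // assumption: UTC session time zone
  fmt.format(Instant.ofEpochSecond(seconds)) + "." + fraction
}

println(millisToTimestampString(1589221077188L)) // prints 2020-05-11 18:17:57.188
```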