apache-sparkhadoophive

TIMESTAMP not behaving as intended with parquet in hive


I have parquet data which when read (TIMESTAMP column) using spark works perfectly fine. Below are the sample records:

scala> data.select($"BATCH_LOAD_DATE").limit(5).show()

+-------------------+
|    BATCH_LOAD_DATE|
+-------------------+
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
+-------------------+   

I checked the schema and it shows TIMESTAMP as the data type for one of the columns which is correct. So, no issue with schema either. However, when I am trying to read the same TIMESTAMP column via Hive, it throws below exception

Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritableV2 (state=,code=0)   

I can see from this link that it is an open issue in Hive. Not sure whether it is resolved yet or not. Is there any workaround for this? Something which can be done while loading data or some transformation once loaded?


Solution

  • I figured out an alternative to my own problem. I changed the column type of TIMESTAMP column to STRING and while fetching data I used from_unixtime method to cast that particular column to the intended date format and was able to fetch it.
    But, the problem over here was if my date value is 2020-02-27 15:40:22 and when I fetched the data of this column via Hive it was returning EpochSeconds i.e 15340232000000.
    So, I solved this problem in Hive via below query:

    select *, from_unixtime(cast(SOURCE_LOAD_DATE as BIGINT) DIV 1000000) as SOURCE_LOAD_DATE from table_name;   
    

    Using the above query I was able to get the proper date with timestamp value.

    Note: You will need to cast every column which has timestamp data.

    This is the only trick which I could think of. I hope this might help you or others!