[SOLVED] Querying timestamp data in Athena when the timestamp format in the underlying JSON files has changed

Querying timestamp data in Athena when the timestamp format in the underlying JSON files has changed

I'm querying data in AWS Athena from JSON files stored in S3. I've loaded all the JSON files into Athena using AWS Glue, and it's been working perfectly so far. However, the timestamp formatting has changed in the JSON files from

2018-03-23 15:00:30.998

2018-08-29T07:59:50.568Z

So the table ends up having entries like this

2018-08-29T07:59:42.803Z
2018-08-29T07:59:42.802Z
2018-08-29T07:59:32.500Z
2018-03-23 15:03:43.232
2018-03-23 15:03:44.697
2018-03-23 15:04:11.951

This results in parsing errors when I try to run queries against the full DB.

How do I accommodate this in AWS Glue (or Athena), so I don't have to split up the data when querying? I've tried looking into custom classifiers, but I'm unsure of how to use them in this particular case.

Thanks in advance.

Solution

Unfortunately you have to unify the data.

If you decide to use "2018-08-29T07:59:50.568Z" format you can read such data by using org.apache.hive.hcatalog.data.JsonSerDe library with the following serde property:

'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSZ'