I'm querying data in AWS Athena from JSON files stored in S3. I've loaded all the JSON files into Athena using AWS Glue, and it's been working perfectly so far. However, the timestamp formatting has changed in the JSON files from
2018-03-23 15:00:30.998
to
2018-08-29T07:59:50.568Z
So the table ends up having entries like this
2018-08-29T07:59:42.803Z
2018-08-29T07:59:42.802Z
2018-08-29T07:59:32.500Z
2018-03-23 15:03:43.232
2018-03-23 15:03:44.697
2018-03-23 15:04:11.951
This results in parsing errors when I try to run queries against the full DB.
How do I accommodate this in AWS Glue (or Athena), so I don't have to split up the data when querying? I've tried looking into custom classifiers, but I'm unsure of how to use them in this particular case.
Thanks in advance.
Unfortunately you have to unify the data.
If you decide to use "2018-08-29T07:59:50.568Z"
format you can read such data by using org.apache.hive.hcatalog.data.JsonSerDe
library with the following serde property:
'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSZ'