aws-glueamazon-athenahive-serde

AWS Glue crawler able to parse the struct definition but Athena fails to read correctly


So we have CSV files in a S3 bucket and when the AWS Glue crawler crawls through all the files its able to identify the schema of a struct field correctly as follows:

struct<installation_id:string,device_id:string>

and the CSV file contents are as follows:

publisher_parameters
"{""installation_id"":""a86f3071-e271-4979-8096-d501b8645c7e"",""device_id"":""864006043285537""}"
"{""installation_id"":""3d3a964e-e78f-428b-93cd-1ddfec1fb6d6"",""device_id"":""e9fe5f395a5a4e67""}"

However, when I am reading using AWS Athena its reading the entire string as the value of installation_id within our struct. I have tried using the org.apache.hadoop.hive.serde2.OpenCSVSerde SerDe as well but still facing the same issue.

Does AWS Glue use a SerDe, which is different from what is provided in the table specification, when crawling the files?


Solution

  • Glue has a tendency of creating tables that don't work with Athena, and this is just yet another example.

    The OpenCSVSerde does not have support for complex types, so there's no chance of this working. Why Glue does this is anyone's guess.

    LazySimpleSerDe has some support for complex types, but they need to be encoded differently than in your data.

    My suggestion is to stop using Glue crawler altogether and create the table manually using string as the type for the publisher_parameters column and then use JSON functions to parse it at query time. There is to my knowledge no way to get any Athena CSV serde to parse that data as a struct.