jsonaws-glueamazon-athenaaws-glue-data-cataloghive-serde

Athena DDL statement for different data structures


I have data in XML form which I have converted in JSON format through glue crawler. The problem is in writing the DDL statement for a table in Athena as you can see below there is a contact attribute in JSON data. Somewhere it is a structure (single instance) and somewhere it is in array form (multiple instances). I am sharing the DDL statements below as well for each type.

JSON Data Type 1

"ContactList": {
               "Contact": {
                          }
                }

Athena DDL Statement

CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
ContactList: struct<
             Contact: struct<
                       
             >
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3_bucket_path'
TBLPROPERTIES ('has_encrypted_data'='false')

JSON Data Type 2

"ContactList": {
                   "Contact": [
                              {},
                              {}
                              ]
               }

Athena DDL Statement

CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
ContactList: struct<
             Contact: array < 
                      struct<
                      >
             >
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3_bucket_path'
TBLPROPERTIES ('has_encrypted_data'='false')

I am able to write DDL statement for one case at a time only and it work perfectly for individual type. My question is how we can write DDL statements so it can cater to both types either it is struct or array. Thanks in advance.


Solution

  • The way you solve this in Athena is that you use the string type for the Contact field of the ContactList column, and then JSON functions in your queries.

    When you query you can for example do (assuming contacts have a "name" field):

    SELECT
      COALESCE(
        json_extract_scalar(ContactList.Contact, '$.name[0]'),
        json_extract_scalar(ContactList.Contact, '$.name')
      ) AS name
    FROM table_name
    

    This uses json_extract_scalar which parses a string as JSON and then extracts a value using a JSONPath expression. COALESCE picks the first non-null value, so if the first JSONPath expression does not yield any value (because the property is not an array), the second is attempted.