sqljsonamazon-athenaprestotrino

AWS Athena parse JSON array


I had previously had asked a question, and it was answered (AWS Athena Parse array of JSON objects to rows), about parsing JSON arrays using Athena but running into a variation.

Using the example:

SELECT user_textarray
FROM "sample"."workdetail" 
where workid = '5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195'

The results returned as:

[{"userlist":"{'id': 'd87b002d-6c75-4c5a-b546-fe04cc939da9', 'name': 'John Smith'}"}, 
 {"userlist":"{'id': '41f20d65-c333-4fe5-bbe5-f9c63566cfc3', 'name': 'Larry Johnson'}"}, 
 {"userlist":"{'id': '18106aa2-e461-4ac5-b399-b2e209c0c341', 'name': 'Kim Jackson'}"}
]

What I'm trying to return is the list of id and name as rows related to the workid in the original query. I'm not sure why the JSON is formated this way and it comes from a 3rd party so can't make adjustments so needing to figure out how to parse the object within an object.

workid, id, name
5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195,d87b002d-6c75-4c5a-b546-fe04cc939da9,'John Smith'
5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195,41f20d65-c333-4fe5-bbe5-f9c63566cfc3,'Larry Johnson'
5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195,18106aa2-e461-4ac5-b399-b2e209c0c341,'Kim Jackson'

I have tried variations of this but not working so trying to determine if I need to modify my 'with' statement to get to the object within the object or if on the select I need to further parse the object to get the elements I need.

with dataset as (workid, user_textarray
FROM "sample"."workdetail"
cross join unnest(user_textarray)
where workid = '5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195')
select workid,
       json_extract_scalar(json, '$.userlist.name') name
from dataset
, unnest(user_textarray) as t(json);

Solution

  • The problem is in your data, from the Presto/Trino point of view userlist contains a string, not a JSON object, moreover this string itself is not a valid JSON for it since it contains ' instead of '"' for props.

    To "fix" this you can take the following steps (the only workaround I know):

    1. Extract the userlist
    2. Replace ' with " (some other JSON parsers will actually handle this "correctly" and will not require this step, but not on case of Trino/Presto)
    3. Process new JSON as you need.

    Something to get you started:

    -- sample data
    with dataset(workid, user_textarray) as (
    values ('5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195', array['{"userlist":"{''id'': ''d87b002d-6c75-4c5a-b546-fe04cc939da9'', ''name'': ''John Smith''}"}',
     '{"userlist":"{''id'': ''41f20d65-c333-4fe5-bbe5-f9c63566cfc3'', ''name'': ''Larry Johnson''}"}',
     '{"userlist":"{''id'': ''18106aa2-e461-4ac5-b399-b2e209c0c341'', ''name'': ''Kim Jackson''}"}'
    ])
    )
    
    -- query
    select workid,
           json_extract_scalar(replace(json_extract_scalar(json, '$.userlist'),'''', '"'), '$.name') name
    from dataset,
         unnest(user_textarray) as t(json);
    

    Output:

    workid name
    5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195 John Smith
    5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195 Larry Johnson
    5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195 Kim Jackson

    Note that for your goal you can use CTE/subquery so you don't need to write the handling multiple times.