sqlamazon-redshiftunnest

Unnest array in Redshift using SQL SELECT statement


I am extracting values from API responses stored as strings of JSON data in AWS Redshift. I use JSON_EXTRACT_PATH_TEXT to extract key-value pairs. However, some responses contain lists of dictionaries. For example:

 {"results": ["predictions": [{"date": "2004", "days": 0, "count": 0, "stage": "test_5"}, {"date": "2005", "days": 1, "count": 0, "stage": "test_3"}, {"date": "2005", "days": 2, "count": 0, "stage": "test"}, {"date": "2004", "days": 3, "count": 0, "stage": "test_2"}]]}

JSON_EXTRACT_PATH_TEXT does not work here, because it can only extract one element per array. The final table must contain one record per element in the list of dictionaries.

Following this guide https://www.getdbt.com/blog/how-to-unnest-arrays-in-redshift/, I wrote the script below,

with response as (

    select

        json_extract_path_text(response, 'results', 'predictions', true) as predictions,

    from table

),

numbers as (

    select * from numbers

),

joined as (

    select         
        json_array_length(response.predictions, true) as n_predictions,

        json_extract_array_element_text(

            response.predictions, 

            numbers.ordinal::int, 

            true

            ) as predictions

    from response

    cross join numbers

    where numbers.ordinal <

        json_array_length(response.predictions, true)

),

parsed as (

    select 

        json_extract_path_text(predictions, 'stage', true) as stage,

        json_extract_path_text(predictions, 'date', true)::int as date,

        json_extract_path_text(predictions, 'day', true) as day,

    from joined

)

select * from parsed

FROM table

;

However, the admin for the BI tool I am using has restricted non-SELECT SQL statements, so the solution won't suffice. I tried the solution below instead:

SELECT 
  JSON_EXTRACT_PATH_TEXT(response, 'results', 'predictions', seq, 'date') AS date_,
  JSON_EXTRACT_PATH_TEXT(response, 'results', 'predictions', seq, 'day') AS day,
  JSON_EXTRACT_PATH_TEXT(response, 'results', 'predictions', seq, 'count') AS count,
  JSON_EXTRACT_PATH_TEXT(response, 'results', 'predictions', seq, 'stage') AS stage
FROM 
  (SELECT 
     JSON_ARRAY_LENGTH(response, 'results', 'predictions') AS len
   FROM 
     table) AS len_table
CROSS JOIN 
  generate_series(0, len_table.len-1) AS seq;

that produces the following error:

Error occurred while trying to execute a query: [SQLState 42883] ERROR: function json_array_length(character varying, "unknown", "unknown") does not exist HINT: No function matches the given name and argument types. You may need to add explicit type casts.

I am technically out of my depth here, so I would appreciate some guidance :)

Note: SQL statement must be a select statement and I am scripting from a BI tool.


Solution

  • There are a few issues with this query but first let's clarify the issue. You state that "the admin for the BI tool I am using has restricted non-SELECT SQL statements". Your solution only uses a SELECT statement so I have to guess what this means. I'm assuming, based on how you attempted to remedy, that the admin doesn't want you to use the WITH clause of the SELECT statement (for some reason). WITH is a valid part of a SELECT - see: https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_synopsis.html

    So if I have this assumption wrong please clarify. This seems like an arbitrary rule.

    The error you are getting is due to the misuse of the json_array_length() function. This needs to be constructed like you did in the original query - "json_array_length(response.predictions, true)". The issue is that "response" doesn't exist in your 2nd query, it needs to be created by "json_extract_path_text(response, 'results', 'predictions', true)". So some substitution is needed like:

    json_array_length(json_extract_path_text(response, 'results', 'predictions', true),true)
    

    However, this won't solve all the issues. The function generate_series() is not support on Redshift as a data operation function (it's a leader-node only function). You use a numbers table in your first query so this is a possible fix for this. But there seems to be another issue here as CROSS JOIN will combine all the array lengths with all the rows which isn't what you want. You've lost the WHERE clause from the first query that is acting to limit the results.

    Next you need to have "response" available to the outer SELECT so this will need to be passed up from the inner SELECT. An easy fix but one that is needed.

    Lastly, you state that "some responses contain lists of dictionaries" which implies that some do not and only contain a dictionary (not in a list). If this is so neither of your queries addresses this possibility. The json_extract_array_element_text() function will return NULL, not the dictionary. There needs to be some CASE structure or DECODE() function to address this case if I'm interpreting the situation correctly.