jsonamazon-redshiftamazon-kinesis

How to extract specific key, value from PARSE_JSON within Materialized View is redshift


Assuming I expect the following JSON:

{
  "key": "value"
}

I want to create a materialized view that will show the key as a column. The data comes from Kinesis Streaming Data so it has to support incremental refresh as it may contain million of new records per minute. I tried a few ways, one of which:

CREATE MATERIALIZED VIEW kd.streams
AUTO REFRESH NO
AS
SELECT approximate_arrival_timestamp   AS apx_arv_at,
       JSON_PARSE(kinesis_data).key    AS key_value
FROM kd_stream."kinesis-stream"
WHERE CAN_JSON_PARSE(kinesis_data);

Which results in:

ERROR:  syntax error at or near "."
LINE 5:        JSON_PARSE(kinesis_data).key                      AS...

I tried JSON_EXTRACT_PATH_TEXT, cast to JSON, and some other methods.

If I just create the column as json (JSON_PARSE(kinesis_data)) it works, but that means I would need to build another MV on top of it. FYI kinesis_data is binary varying, and kinesis-stream is the kinesis stream (not a table).


Solution

  • Based on what you are trying I assume that kinesis_data is a string/text. The error you are seeing is because the query parser doesn't know about "key" when it runs. You see this super value isn't created until run time. So when it sees "." it doesn't know what to do with this.

    If kinesis_data was of type super then using the "." notation would work as the parser knows this is a (potential) part of the super. If you think Redshift should be smarter about this I suspect you will need to submit a case.

    Now getting what you want is simple enough but first I need to understand what you mean by "cast to JSON" as JSON isn't a Redshift data type. I'm guessing you want a super with just a string in it but this isn't clear.

    I'm going to focus on the select query in the MV definition as this is where the issue is located.

    Fix #1 - add a with clause.

    If we start with the test table defined by:

    create table test as select '{
      "key": "value"
    }' as col;
    

    we can extract the value by:

    with super as (
    SELECT JSON_PARSE(col)    AS key_value
    FROM test
    WHERE CAN_JSON_PARSE(col))
    select key_value, key_value.key
    from super;
    

    Fix #2 - use JSON_EXTRACT_PATH_TEXT

    Using the same test table as #1 we can:

    SELECT col, json_parse('"'||JSON_EXTRACT_PATH_TEXT(col,'key')||'"')    AS key_value
    FROM test
    WHERE CAN_JSON_PARSE(col);
    

    The json_parse and all the quoting here isn't needed if you wanted a string instead of a super.

    Lastly as a sanity check here's how the parser handles the "." notation when the table data is a super:

    create table test2 as select json_parse('{
      "key": "value"
    }') as col;
    
    select col.key 
    from test2;
    

    I hope this helps you get what you need.

    ============================ UPDATE ============================

    Per the comments here's some revised code.

    Create test table as varbyte data:

    create table test as select '{
      "key": "value"
    }'::varbyte as col;
    

    Make materialized view with code 1:

    create materialized view public.ttt 
    auto refresh no
    as 
    with super as (
    SELECT JSON_PARSE(col)    AS key_value
    FROM test
    WHERE CAN_JSON_PARSE(col))
    select key_value, key_value.key
    from super;
    

    This works but cannot be incrementally updated due to the extraction of "key" from a super type (untraceable) and likely the use of a cte that also adds some indirection.

    Make materialized view from code 2:

    create materialized view public.ttt 
    auto refresh no
    as 
    SELECT col, json_parse('"'||JSON_EXTRACT_PATH_TEXT(col,'key')||'"')    AS key_value
    FROM test
    WHERE CAN_JSON_PARSE(col);
    

    This has no incremental update limitations.