Tags: sql, json, snowflake-cloud-data-platform

How Best to Extract Specific JSON Key-Value Pairs in Snowflake


I'm working with some JSON data in Snowflake, structured like so:

[
  {
    "key": "foo",
    "value": {
      "previous_value": "old_val_f",
      "current_value": "new_val_f"
    }
  },
  {
    "key": "bar",
    "value": {
      "previous_value": "old_val_b",
      "current_value": "new_val_b"
    }
  },
  {
    "key": "qux",
    "value": {
      "previous_value": "old_val_q",
      "current_value": "new_val_q"
    }
  }
]

My goal is to extract specific key-value pairs to a table, with the desired keys as column names:

foo         qux
new_val_f   new_val_q

I don't have a great deal of experience with JSON traversal, so I've muddled my way through as best I can. My current solution looks like this:

SELECT
    b.value:value:current_value::STRING AS foo,
    c.value:value:current_value::STRING AS qux
FROM (
    SELECT
        PARSE_JSON(json_field) AS parsed
    FROM source_table
)
, LATERAL FLATTEN(input => parsed, outer => true) b
, LATERAL FLATTEN(input => parsed, outer => true) c
WHERE b.value:key = 'foo'
AND c.value:key = 'qux'

Given the many, many keys within my data, this approach strikes me as pretty clunky, and I worry it won't scale well. Is there a more efficient approach I could use instead? Or is the best approach really a separate lateral flatten for every individual key?


Solution

  • I'm not sure if you'd consider this easier or not, but you could just flatten once, and then filter and pivot. With this approach, adding an attribute only means adding it to the pivot output.

    Breaking up the code...first section just gives me the same dataset that you have:

    with x as (
    select parse_json('
        [
          {
            "key": "foo",
            "value": {
              "previous_value": "old_val_f",
              "current_value": "new_val_f"
            }
          },
          {
            "key": "bar",
            "value": {
              "previous_value": "old_val_b",
              "current_value": "new_val_b"
            }
          },
          {
            "key": "qux",
            "value": {
              "previous_value": "old_val_q",
              "current_value": "new_val_q"
            }
          }
        ]'::variant) as var
    )
    

    Next section does your flatten and filter. This section wouldn't need to change for attribute addition or removal:

    , y as (
        select y.seq,
               y.this:key::string as key,
               y.value:current_value::string as new_value
        from x,
        lateral flatten(input => var, recursive => true) y
        -- y.key is the flatten's KEY column; keep only the nested "value" objects
        where y.key = 'value'
    )
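
    If it helps to see the intermediate step, you can swap the final pivot for a plain select against y. With the sample data you should get one row per array element, all sharing the same seq (the exact seq value may differ, but there's only one input record here, so it's identical across rows):

    select * from y;

    -- seq | key | new_value
    -- ----+-----+-----------
    --   1 | foo | new_val_f
    --   1 | bar | new_val_b
    --   1 | qux | new_val_q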
    

    Then, you run the select against y with a pivot that specifies the key-value pairs that you care about:

    select * 
    from y
    PIVOT(max(new_value) for key IN ('foo', 'qux'))
    AS p (seq, foo, qux);
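
    With the sample data, that pivot should come back as a single row (all three elements share one seq), matching the target output plus the seq grouping column:

    -- SEQ | FOO       | QUX
    -- ----+-----------+----------
    --   1 | new_val_f | new_val_q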
    

    And just to throw this out there as an opinion... your data source is not a great use of JSON structure! Had the author of that JSON simply used each key's value as the attribute name and nested previous_value and current_value underneath it, not only would it have been much easier to parse (no array means no flatten needed), it would also have been a smaller file to deal with. A rough sketch of what that could look like follows.
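
    Just for illustration (this restructured JSON is hypothetical, not your actual source), that flatter shape could be queried directly by path, with no flatten at all:

    with x as (
        select parse_json('
            {
              "foo": { "previous_value": "old_val_f", "current_value": "new_val_f" },
              "bar": { "previous_value": "old_val_b", "current_value": "new_val_b" },
              "qux": { "previous_value": "old_val_q", "current_value": "new_val_q" }
            }') as var
    )
    select
        var:foo:current_value::string as foo,
        var:qux:current_value::string as qux
    from x;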