I'm working with some JSON data in Snowflake, structured like so:
[
  {
    "key": "foo",
    "value": {
      "previous_value": "old_val_f",
      "current_value": "new_val_f"
    }
  },
  {
    "key": "bar",
    "value": {
      "previous_value": "old_val_b",
      "current_value": "new_val_b"
    }
  },
  {
    "key": "qux",
    "value": {
      "previous_value": "old_val_q",
      "current_value": "new_val_q"
    }
  }
]
My goal is to extract specific key-value pairs to a table, with the desired keys as column names:
| foo | qux |
| --- | --- |
| new_val_f | new_val_q |
I don't have a great deal of experience with JSON traversal, so I've muddled my way through as best I can. My current solution looks like this:
SELECT
    b.value:value:current_value::STRING AS foo,
    c.value:value:current_value::STRING AS qux
FROM (
    SELECT
        PARSE_JSON(json_field) AS parsed
    FROM source_table
)
, LATERAL FLATTEN(input => parsed, outer => true) b
, LATERAL FLATTEN(input => parsed, outer => true) c
WHERE b.value:key = 'foo'
  AND c.value:key = 'qux'
Given the many, many keys within my data, this approach strikes me as pretty clunky and I worry it won't scale well. Is there a more efficient approach I could use instead? Or is the best approach really a separate lateral flatten for every individual key?
I'm not sure if you'd consider this easier or not, but you could just flatten once, then filter and pivot. New attributes would only need to be added to the pivot output in this case.
Breaking up the code... the first section just gives me the same dataset that you have:
with x as (
    select parse_json('
    [
        {
            "key": "foo",
            "value": {
                "previous_value": "old_val_f",
                "current_value": "new_val_f"
            }
        },
        {
            "key": "bar",
            "value": {
                "previous_value": "old_val_b",
                "current_value": "new_val_b"
            }
        },
        {
            "key": "qux",
            "value": {
                "previous_value": "old_val_q",
                "current_value": "new_val_q"
            }
        }
    ]'::variant) as var
)
The next section does your flatten and filter. This section wouldn't need to change when attributes are added or removed:
, y as (
    select y.seq,
           y.this:key::string as key,                   -- sibling "key" attribute of the containing object, e.g. 'foo'
           y.value:current_value::string as new_value   -- nested current_value
    from x,
         lateral flatten(input => var, recursive => true) y
    -- keep only the rows where the flattened element is each object's nested "value" object
    where key = 'value'
)
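If it helps to see the intermediate step, running select * from y against this sample data should return something like (the exact seq value may differ):

| SEQ | KEY | NEW_VALUE |
| --- | --- | --- |
| 1 | foo | new_val_f |
| 1 | bar | new_val_b |
| 1 | qux | new_val_q |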
Then, you run the select against y with a pivot that specifies the keys that you care about:
select *
from y
PIVOT(max(new_value) for key IN ('foo', 'qux'))
AS p (seq, foo, qux);
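Against the sample data, that should give you:

| SEQ | FOO | QUX |
| --- | --- | --- |
| 1 | new_val_f | new_val_q |

which is your target output, plus the seq column to keep rows from different source records separate.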
And just to throw this out there as an opinion... your data source is not a great use of JSON structure! Had the author of that JSON string simply used the key's value as the name of the attribute and nested the previous_value and current_value underneath it, not only would it have been super easy to parse (no array means no flatten needed), but it would also have been a smaller file to deal with.
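For example, here's a rough sketch of that friendlier shape using the same sample values (the better CTE name is just for illustration):

with better as (
    select parse_json('
    {
        "foo": { "previous_value": "old_val_f", "current_value": "new_val_f" },
        "bar": { "previous_value": "old_val_b", "current_value": "new_val_b" },
        "qux": { "previous_value": "old_val_q", "current_value": "new_val_q" }
    }') as var
)
-- no array, so no flatten: plain path navigation does the job
select var:foo:current_value::string as foo,
       var:qux:current_value::string as qux
from better;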