I have a array with JSON strings and each JSON has a different key. Also, the size of the array is not fixed and may increase or decrease in occurrences. An example of this is below :
"[{"key1": 10},{"key2":"val2"}]"
I have this as a column in a delta table which is the data which has been ingested from the vendor.
My question is how do we convert this to a format so I can easily access the keys and values for each occurrence using only Databricks SQL? If the json has a fixed key, it can easily be achieved by casting into a STRUCT. But since the keys are different, I cannot use that.
I tried using explode which gave me 2 rows, each with a json string. But again, How can I get the keys and values from each of those json strings using just Databricks SQL? Also, in this case, the Integer values are also converted to strings which means I lose the datatype. What I'm looking for is a VARIANT version of this.
[{"key_name":"key1","key_value":10},{"key_name":"key2","key_value":"val2"}]
What I have done so far :
%sql
CREATE OR REPLACE TEMPORARY VIEW tmp_array1 AS
SELECT '[{"key1":10},{"key2":"val2"}]' as properties;
CREATE OR REPLACE TEMPORARY VIEW tmp_v1 AS
SELECT from_json(properties, 'ARRAY<STRING>') as col FROM tmp_array1;
SELECT * FROM tmp_v1;
Gives me
And then
%sql
SELECT EXPLODE(col) FROM tmp_v1;
Gives me
So, I can access each occurences but how I access each value when I do not know what the key name may be?
I'm not exactly sure what you're looking for, but here is some examples of transforming the data in such a way that at the end I'm working with rows of key and value columns.
;WITH json_parsed_as_array_of_maps as
(
SELECT
parse_json('[{"key1":10},{"key2":"val2"}]')::ARRAY<MAP<STRING,STRING>> as arr
)--select * from json_parsed_as_array_of_maps
, unified_map AS
(
SELECT
map_from_arrays
(
flatten(transform(arr, x -> MAP_KEYS(x))),
flatten(transform(arr, x -> MAP_VALUES(x)))
) as one_map
FROM json_parsed_as_array_of_maps
)--select * from unified_map
, exploded as
(
SELECT
explode(map_entries(one_map)) as pair
FROM unified_map
)--select * from exploded
SELECT
pair.key,
pair.value
FROM exploded
/*
key value
----------
key1 10
key2 val2
*/
and another way, this time presenting the combination of all the key/value pairs in a JSON array:
;WITH j as
(
SELECT
parse_json('[{"key1":10},{"key2":"val2"},{"key3":null}]')::ARRAY<STRING> as arr
)--select * from j
, unified_array(array_string) as
(
SELECT
'['||
ARRAY_JOIN
(
COLLECT_SET
(
--somewhat dirty string manipulation here (flawed if key or value string contains a colon)
'{"key":'||REPLACE(SUBSTRING(item, 2, LEN(item)-2), ':', ',"value":')||'}'
),
','
)
||']'
FROM (SELECT EXPLODE(arr) as item FROM j)
)--SELECT * FROM unified_array
SELECT
*
--array_string:[2].key, array_string:[1].value
FROM unified_array
--[{"key":"key2","value":"val2"},{"key":"key3","value":null},{"key":"key1","value":10}]