Tags: sql, json, google-bigquery

Flatten a column of JSON strings with different numbers of keys to a table in BigQuery (dynamic)


This is my first time posting here!

I saw that someone gave an answer to a similar question here: How to flatten a column of large JSON strings with different numbers of keys to a table in bigquery. In my case, however, the column looks as follows:

id keywords
862 [{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}]
8844 [{'id': 10090, 'name': 'board game'}, {'id': 10941, 'name': 'disappearance'}, {'id': 15101, 'name': "based on children's book"}, {'id': 33467, 'name': 'new home'}, {'id': 158086, 'name': 'recluse'}, {'id': 158091, 'name': 'giant insect'}]

How can I create a SQL table using the same method described in the other answer, which dynamically identifies the key objects and creates columns while automatically flattening the results? When I run this script I get an error:

Query error: SyntaxError: Expected property name or '}' in JSON at position 2 (line 1 column 3) at undefined line 1, columns 2-3 at [24:3]

create temp function extract_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function extract_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";

create temp function extract_all_leaves(input string) returns string language js as '''
function flattenObj(obj, parent = '', res = {}) {
  for (let key in obj) {
    let propName = parent ? parent + '.' + key : key;
    if (typeof obj[key] == 'object') {
      flattenObj(obj[key], propName, res);
    } else {
      res[propName] = obj[key];
    }
  }
  return JSON.stringify(res);
}
return flattenObj(JSON.parse(input));
''';

create temp table temp_table as (
  select offset, key, value, id
  from your_table t,
  unnest([struct(extract_all_leaves(payload) as leaves)]),
  unnest(extract_keys(leaves)) key with offset
  join unnest(extract_values(leaves)) value with offset
  using(offset)
);

execute immediate (select '''
select * from (select * except(offset) from temp_table)
 pivot (any_value(value) for replace(key, '.', '__') in (''' || keys_list || '''
 ))'''
from (select string_agg('"' || replace(key, '.', '__') || '"', ',' order by offset) keys_list from (
select key, min(offset) as offset from temp_table group by key
))
);

Solution

  • Your data is nearly JSON; in fact it looks like JavaScript, and since JSON is a strict subset of JavaScript, a few things need to change to turn it into valid, strict JSON.
    However, as BigQuery understands more than strict JSON, a single change is enough for the simple entries of your data (like the first one of your example): replace ' with " in the field values.

    To that end, replace extract_all_leaves(payload) with extract_all_leaves(replace(payload, "'", '"')).
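    As a quick sanity check, here is a sketch in JavaScript (the same language as the UDFs) of why that naive replacement is enough for simple rows, using a shortened version of the first row from the question:

```javascript
// Simple rows: no quote characters inside the string values,
// so replacing every ' with " yields valid, strict JSON.
const row = "[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}]";
const fixed = row.replace(/'/g, '"');
const parsed = JSON.parse(fixed); // parses without error
console.log(parsed[1].name); // "toy"
```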

    But as your data can embed ' and " inside the text fields (apart from the quotes delimiting those fields), you'll need a bit more intelligence to handle them.

    Taking into consideration that we may have "…'…" and '…"…', but probably also some "…'…\"…" (see my comment question about it), this requires a regex handling two cases: a string starting and ending with double quotes that contains no double quote unless it is preceded by a \, or the same with single quotes.
    We can handle those by replacing:
    extract_all_leaves(payload)
    with
    extract_all_leaves(regexp_replace(payload, "'(\\\\.|[^']+)*'|\"(\\\\.|[^\"]+)*\"", '"\\1\\2"'))

    (We cannot handle only the ' case and leave the "s as they are, because then the ' of "O'Connor" would be interpreted as the start of a new '…' sequence, shifting everything and finally making the expression fail.)
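    The behaviour of that regexp_replace can be sketched in JavaScript (the same language as the UDFs; note that JS uses $1/$2 in the replacement string where RE2 uses \1/\2), using the problematic "based on children's book" entry from the second row of the question:

```javascript
// Row containing an embedded apostrophe inside a double-quoted value.
const row = `[{'id': 15101, 'name': "based on children's book"}]`;

// The naive ' -> " replacement corrupts the double-quoted value:
let naiveFails = false;
try { JSON.parse(row.replace(/'/g, '"')); } catch (e) { naiveFails = true; }

// The quote-aware regex rewrites each quoted string as a whole,
// so the apostrophe inside "based on children's book" survives:
const fixed = row.replace(/'(\\.|[^']+)*'|"(\\.|[^"]+)*"/g, '"$1$2"');
const parsed = JSON.parse(fixed);
console.log(naiveFails, parsed[0].name); // true "based on children's book"
```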