This is my first time posting here!
I saw that someone gave an answer to a similar question here: How to flatten a colum of large JSON strings with different numbers of keys to a table in bigquery. But in my case, the column looks as follows:
id | keywords |
---|---|
862 | [{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}] |
8844 | [{'id': 10090, 'name': 'board game'}, {'id': 10941, 'name': 'disappearance'}, {'id': 15101, 'name': "based on children's book"}, {'id': 33467, 'name': 'new home'}, {'id': 158086, 'name': 'recluse'}, {'id': 158091, 'name': 'giant insect'}] |
How can I create a SQL table using the same method described in the other answer, which dynamically identifies the key objects and creates columns while automatically flattening the results? When I run this script I get an error:

> Query error: SyntaxError: Expected property name or '}' in JSON at position 2 (line 1 column 3) at undefined line 1, columns 2-3 at [24:3]
```sql
create temp function extract_keys(input string) returns array<string> language js as """
  return Object.keys(JSON.parse(input));
""";
create temp function extract_values(input string) returns array<string> language js as """
  return Object.values(JSON.parse(input));
""";
create temp function extract_all_leaves(input string) returns string language js as '''
  function flattenObj(obj, parent = '', res = {}) {
    for (let key in obj) {
      let propName = parent ? parent + '.' + key : key;
      if (typeof obj[key] == 'object') {
        flattenObj(obj[key], propName, res);
      } else {
        res[propName] = obj[key];
      }
    }
    return JSON.stringify(res);
  }
  return flattenObj(JSON.parse(input));
''';
create temp table temp_table as (
  select offset, key, value, id
  from your_table t,
  unnest([struct(extract_all_leaves(payload) as leaves)]),
  unnest(extract_keys(leaves)) key with offset
  join unnest(extract_values(leaves)) value with offset
  using(offset)
);
execute immediate (select '''
  select * from (select * except(offset) from temp_table)
  pivot (any_value(value) for replace(key, '.', '__') in (''' || keys_list || '''
  ))'''
  from (select string_agg('"' || replace(key, '.', '__') || '"', ',' order by offset) keys_list from (
    select key, min(offset) as offset from temp_table group by key
  ))
);
```
Your data is nearly JSON; in fact it looks like JavaScript, and since JSON is a strict subset of JavaScript, a few things need to change to make it valid, strict JSON. However, as GBQ understands a bit more than strict JSON, a single change can make it accept simple entries of your data (like the first one of your example): replace `'` with `"` in field values.
To that end, replace `extract_all_leaves(payload)` with `extract_all_leaves(replace(payload, "'", '"'))`.
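A quick sketch in plain JavaScript (outside BigQuery; the row literal below is taken from your first sample) of why the UDF's `JSON.parse` errors out, and why this blanket replacement fixes simple rows:

```javascript
// What the JS UDF receives: single-quoted keys/values, which JSON.parse
// rejects with "Expected property name or '}' ..." -- the error you see.
const raw = "[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}]";

let failed = false;
try {
  JSON.parse(raw);            // throws: single quotes are not valid JSON
} catch (e) {
  failed = true;
}

// The simple fix for rows containing no embedded quotes: swap ' for ".
const fixed = raw.replace(/'/g, '"');
const parsed = JSON.parse(fixed);
console.log(failed, parsed[0].name); // true jealousy
```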
But as your data can embed `'` and `"` in the text fields (apart from those delimiting the fields), you'll need a bit more intelligence to handle them.
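To see the problem concretely (again a plain-JS sketch, using your second sample row): the blanket replacement corrupts a double-quoted value that legitimately contains an apostrophe:

```javascript
// "based on children's book" becomes "based on children"s book":
// the apostrophe is turned into a double quote that terminates the
// string early, so the fixed-up row is still not valid JSON.
const row = `[{'id': 15101, 'name': "based on children's book"}]`;

const naive = row.replace(/'/g, '"');

let broken = false;
try {
  JSON.parse(naive);          // syntax error at the stray inner quote
} catch (e) {
  broken = true;
}
console.log(broken); // true
```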
Taking into consideration that we may have `"…'…"`, `'…"…'`, but probably some `"…'…\"…"` too (see my comment question about it), this requires a regex handling two cases: either the string starts and ends with a double quote and contains no double quote except one preceded by a `\`, or the same with single quotes. We can handle those by replacing:
`extract_all_leaves(payload)`

with

```sql
extract_all_leaves(regexp_replace(payload, "'(\\\\.|[^']+)*'|\"(\\\\.|[^\"]+)*\"", '"\\1\\2"'))
```

(We cannot handle only the `'` case and leave the `"`s as they are, because then the `'` of `"O'Connor"` would be interpreted as the start of a new `'…'` sequence, shifting everything and ultimately making the expression fail.)