arraysgoogle-bigquerydata-conversiongoogle-cloud-dataprep

Convert a string to an array in Google Dataprep


I'm trying to denormalize data in Dataprep so that I can use it in BigQuery.

More specifically, I'd like to turn entries in an account_profile table linked to my account table with with foreign key 'account_id' into an array in my account table. (Account_profile stores contact methods... bad name, I know.)

In dataprep, I've

  1. turned the rows in account_profile into json objects,
  2. and then joined the two tables via account_id,
  3. then grouped the rows by account_id and used the aggregate function LIST to convert all objects into an array of objects.

The problem is that when I try to unnest that column in BigQuery, or do any other array-like operation in BigQuery, I get an error like this: "Values referenced in UNNEST must be arrays."

My data looks good. For example, here is a row.

[{"profile_identifier":"ttcuongem+29@gmail.com","verification_code":"abc789","enabled":true,"id1":2818},{"profile_identifier":"xyz123,"enabled":false,"id1":2874}]

I can't find a way to make BigQuery see this as an array, nor can I find a way to make Dataprep create this kind of data as an array rather than a string. The only solutions people have posted are very specific hacks that wouldn't apply to this generic case.

I feel that I'm following denormalization best practices and am surprised that this gap exists in the Google ELT toolchain. What am I missing?


Solution

  • Below is for BigQuery Standard SQL

    You can use recently introduced JSON_EXTRACT_ARRAY function for this as in example below

    #standardSQL
    WITH `project.dataset.table` AS (
    SELECT '''[
      {"profile_identifier":"ttcuongem+29@gmail.com","verification_code":"abc789","enabled":true,"id1":2818},
      {"profile_identifier":"xyz123","enabled":false,"id1":2874}
    ]''' string_col
    )
    SELECT JSON_EXTRACT_ARRAY(string_col) AS arr_col
    FROM `project.dataset.table`   
    

    with output

    Row arr_col
    1   {"profile_identifier":"ttcuongem+29@gmail.com","verification_code":"abc789","enabled":true,"id1":2818}   
        {"profile_identifier":"xyz123","enabled":false,"id1":2874}