google-bigqueryexternal-tables

BigQuery external table using JSON files


I have e-commerce product data in JSON format. This data is frequently updated. I am planning to load this data into BigQuery. Given that the JSON files are huge (a few hundred attributes) and there are many updates, I found that it is not very cost-effective to store the entire data in BigQuery and update it frequently using merge queries. 

I am now planning to store the frequently used attributes in individual columns in a BigQuery table. However, I also need to support the use cases when somebody wants to access any other attribute of the JSON for some ad hoc analysis. To address this, can I use the following strategy:

Store the JSON for a product in its own file in GCP in a specific directory. Create an external table using all the files in that directory. Each file's content becomes a row in the external table describing a single product. When updates happen to a product, I update the BigQuery table and also replace the existing file for that product with a new file.

Is this a reasonable approach? If yes, how can I create the external table from the directory?


Solution

  • There are multiple approaches to loading JSON data to BigQuery. Every approach may have design considerations in terms of cost, complexity, Ops. To provide you with Yes/No answer really depends on these three factors.

    Having said, your approach of create external table will work too.

    Solution: All SQL-like solution

    Here's a external table query which will create a load data in a single column - see documentation

    CREATE OR REPLACE EXTERNAL TABLE YOUR_PROJECT.YOUR_DATASET.my_newline_json_arrays ( col1 STRING ) OPTIONS ( format = 'CSV', field_delimiter = '\x10', quote = '', uris = ['gs://YOUR_BUCKET/my_newline_json_arrays.json'] );

    After creating external table, you can use BigQuery json functions to select the attributes you want and run a BigQuery scheduled query to load the data into BigQuery. To identify changes to the attributes, you can consider using table valued function (TVF) as mentioned here in documentation