I have e-commerce product data in JSON format. This data is frequently updated. I am planning to load this data into BigQuery. Given that the JSON files are huge (a few hundred attributes) and there are many updates, I found that it is not very cost-effective to store the entire data in BigQuery and update it frequently using merge queries.
I am now planning to store the frequently used attributes in individual columns in a BigQuery table. However, I also need to support the use cases when somebody wants to access any other attribute of the JSON for some ad hoc analysis. To address this, can I use the following strategy:
Store the JSON for each product in its own file in a specific Cloud Storage directory. Create an external table over all the files in that directory, so each file's content becomes a row describing a single product. When a product is updated, I update the BigQuery table and also replace that product's file with a new one.
Is this a reasonable approach? If yes, how can I create the external table from the directory?
There are multiple approaches to loading JSON data into BigQuery, and each one comes with trade-offs in cost, complexity, and operations. A clear yes/no answer really depends on those three factors. That said, your approach of creating an external table will work too.
Solution: an all-SQL approach
Here is an external table definition that lands each JSON record in a single STRING column (see the documentation):
CREATE OR REPLACE EXTERNAL TABLE YOUR_PROJECT.YOUR_DATASET.my_newline_json_arrays (
  col1 STRING
)
OPTIONS (
  format = 'CSV',
  field_delimiter = '\x10',
  quote = '',
  uris = ['gs://YOUR_BUCKET/my_newline_json_arrays.json']
);
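For example, an ad hoc query over the raw JSON could look like the sketch below. The attribute names (id, price, category) are placeholders I am assuming for illustration; substitute whatever your product JSON actually contains.

-- Pull individual attributes out of the raw JSON column for ad hoc analysis
SELECT
  JSON_VALUE(col1, '$.id')    AS product_id,
  JSON_VALUE(col1, '$.price') AS price
FROM YOUR_PROJECT.YOUR_DATASET.my_newline_json_arrays
WHERE JSON_VALUE(col1, '$.category') = 'shoes';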
After creating the external table, you can use BigQuery JSON functions like these to select the attributes you want, and run a BigQuery scheduled query to load that data into your main BigQuery table. To identify changes to the attributes, you can consider using a table-valued function (TVF), as described in the documentation.
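As a rough sketch of what such a scheduled query might look like, the MERGE below upserts the frequently used attributes into a columnar table. The target table name (products) and the attribute names are assumptions for illustration; adapt the join key, columns, and casts to your own schema.

-- Scheduled query: refresh the columnar table from the external JSON table
MERGE YOUR_PROJECT.YOUR_DATASET.products AS tgt
USING (
  SELECT
    JSON_VALUE(col1, '$.id')                        AS product_id,
    JSON_VALUE(col1, '$.name')                      AS name,
    CAST(JSON_VALUE(col1, '$.price') AS NUMERIC)    AS price,
    col1                                            AS raw_json
  FROM YOUR_PROJECT.YOUR_DATASET.my_newline_json_arrays
) AS src
ON tgt.product_id = src.product_id
WHEN MATCHED THEN
  UPDATE SET name = src.name, price = src.price, raw_json = src.raw_json
WHEN NOT MATCHED THEN
  INSERT (product_id, name, price, raw_json)
  VALUES (src.product_id, src.name, src.price, src.raw_json);

Keeping the full JSON in a raw_json column of the main table is optional; you could also leave the raw documents only in the external table and join to them when somebody needs an attribute that was not promoted to its own column.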