pythongoogle-bigquerygoogle-cloud-storagegoogle-python-api

Read huge JSON line by line from Google Cloud Storage with Python


I know I should have a code but I have nothing useful yet.

There is ~300GB JSON file on my GCS gs://path/listings_all.json ultimately I'm trying to import it into BigQuery but it has some wrong data structure (I have sourced it by mongoexport from MongoDB)

invalid field name "$date". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long

So, now my approach is to somehow read source file line by line from GCS process it and upload each processed line to BigQuery using python API.

Below simple reader I have put together to test with sample 100 lines from the original huge file:

import json
from pprint import pprint

with open('schema_in_10.json') as f:
    for line in f:
        j_content = json.loads(line)

        # print(j_content['id'], j_content['city'], j_content['country'], j_content['state'], j_content['country_code'], j_content['smart_location'], j_content['address'], j_content['market'], j_content['neighborhood'])
        # // geo { lat, lng}'])
        print('------')
        pprint(j_content['is_location_exact'])
        pprint(j_content['zipcode'])
        pprint(j_content['name'])

Can you please help me on how can I read or stream a huge JSON line by line from Google Cloud Storage with Python3?


Solution

  • Reading it line by line and then trying to stream to BigQuery won't scale with 300GB on your local machine, and you'll struggle to get this working TBH.

    There's a couple of scalable options:

    1. Write a Cloud Dataflow pipeline to read your file from GCS (it will scale for you and read in parallel), correct the field name, and then write to BigQuery. See here.
    2. Load it directly into BigQuery using CSV instead JSON as the format and using a delimiter that doesn't appear in your data. This will load each record into a single String column and then you can use BigQuery's JSON functions to extract what you need. See here.