python, google-cloud-platform, google-bigquery, etl

What is the preferred way to load data from an API into BigQuery?


I am trying to get data from a REST API into BigQuery on the Google Cloud Platform (GCP). What is the best way to achieve that (without using any third party tools such as Funnel.io or Supermetrics)?

Most tutorials I could find suggest writing the data as CSV files to Cloud Storage and then using Dataflow to load it into BigQuery. This, however, seems a bit cumbersome, especially for smaller files. There should be a way to do this without the intermediate step of writing to CSV. Can this be achieved (within GCP), and if so, what is the best way?

PS: In case the size of the data is relevant to the answer: I'm trying to load a total of about 10,000 rows of data (one-off), with about 100 new columns coming in every day - ideally updating every hour.


Solution

  • Following up on the hint by @Kolban above: loading data from an API into BigQuery without using third-party tools and without writing an intermediate file to Google Cloud Storage is possible, and indeed quite simple, by "streaming" the data into BigQuery:

    rows_to_insert = [(u"Phred Phlyntstone", 32), (u"Wylma Phlyntstone", 29)]
    
    errors = client.insert_rows(table, rows_to_insert)  # Make an API request.
    if errors == []:
        print("New rows have been added.")
    

    (From the BQ documentation)
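
    For completeness, the excerpt above assumes that a client and a table object already exist. A minimal sketch of that setup (the project, dataset and table names are placeholders; the table must already exist with a matching schema, e.g. name STRING, age INTEGER):

    # Minimal setup for the streaming excerpt above - placeholder names
    from google.cloud import bigquery
    
    client = bigquery.Client()  # uses the default GCP project/credentials
    table = client.get_table("my-project.my_dataset.my_table")  # Table object incl. schema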

    To prepare the JSON data for streaming, it has to be turned into a list of tuples, one tuple per row, with the values in the same order as the table's columns. Here's an excerpt from my code that does this:

    # Turn the JSON response into a list of tuples, one per row,
    # with the values ordered to match the table schema
    data_tuples = []
    for key, value in resp_json[product_id].items():
        data_tuples.append((
            value["product_id"],
            value["downloads"],
        ))
    
    # Stream the rows into BigQuery
    errors = client.insert_rows(table, data_tuples)
    if errors == []:
        print("New rows have been added.")
    else:
        print(errors)
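
    As an alternative to the tuple conversion, the rows can also be streamed as dicts with insert_rows_json, which matches values to columns by field name rather than position. A minimal sketch, assuming the same resp_json structure as above and a table whose column names match the JSON keys (the table ID is a placeholder):

    from google.cloud import bigquery
    
    client = bigquery.Client()
    table_id = "my-project.my_dataset.my_table"  # placeholder, table must already exist
    
    # Build one dict per row; keys must match the table's column names
    json_rows = [
        {"product_id": value["product_id"], "downloads": value["downloads"]}
        for value in resp_json[product_id].values()
    ]
    
    errors = client.insert_rows_json(table_id, json_rows)  # no schema fetch needed
    if errors == []:
        print("New rows have been added.")
    else:
        print(errors)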