google-cloud-platform, google-bigquery, data-integration

How do I make requests to a third-party API and load the results into Google BigQuery periodically? Which Google services should I use?


I need to get the data from a third-party API and ingest it into Google BigQuery. I also need to automate this process with Google services so it runs periodically.

I am trying to use Cloud Functions, but it needs a trigger. I have also read about App Engine, but I believe it is not suitable for a single function that only pulls data from an API.

Another question: do I need to load the data into Cloud Storage first, or can I load it straight into BigQuery? Should I use Dataflow, and does it need any special configuration?

from google.cloud import storage
import requests


def upload_blob(bucket_name, request_url, destination_blob_name):
    """
    Fetches the API response and uploads it to the bucket.
    """
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    # Fetch the API response and write its body to the blob.
    response = requests.get(request_url['url'])
    blob.upload_from_string(response.text, content_type='application/json')

    print('File {} uploaded to {}.'.format(
        destination_blob_name,
        bucket_name))


def func_data(request_url):
    BUCKET_NAME = 'dataprep-staging'
    BLOB_NAME = 'any_name'

    upload_blob(BUCKET_NAME, request_url, BLOB_NAME)
    return 'Success!'

I would like advice about the architecture (Google services) I should use to build this pipeline. For example: use a Cloud Function to get the data from the API, schedule a job with service 'X' to load the data into storage, and finally pull the data from storage into BigQuery.


Solution

  • You can use Cloud Functions. Create an HTTP-triggered function and invoke it periodically with Cloud Scheduler (see the sketches after this answer).

    By the way, Cloud Scheduler can also call an HTTP endpoint on App Engine or Cloud Run.

    About storage, the answer is no: you don't have to stage the data in Cloud Storage. If the API result fits within the memory allowed for the function, you can write it to the /tmp directory and load that file into BigQuery. You can size your function up to 2 GB of memory if needed.
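
Here is a minimal sketch of that approach, assuming an HTTP-triggered Cloud Function. The API URL and the destination table are placeholders to adapt to your project, and the load relies on BigQuery schema autodetection for newline-delimited JSON:

# Sketch: HTTP-triggered Cloud Function that pulls an API and loads the
# result into BigQuery from /tmp, without staging it in Cloud Storage.
# API_URL and TABLE_ID are hypothetical placeholders.
import json

import requests
from google.cloud import bigquery

API_URL = 'https://api.example.com/data'          # hypothetical endpoint
TABLE_ID = 'my-project.my_dataset.api_results'    # hypothetical table

def func_data(request):
    """Entry point of the HTTP-triggered Cloud Function."""
    rows = requests.get(API_URL).json()           # assumes the API returns a JSON list

    # Write the rows as newline-delimited JSON to the /tmp filesystem.
    tmp_path = '/tmp/api_results.json'
    with open(tmp_path, 'w') as f:
        for row in rows:
            f.write(json.dumps(row) + '\n')

    # Load the file straight into BigQuery, letting it autodetect the schema.
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    with open(tmp_path, 'rb') as f:
        load_job = client.load_table_from_file(f, TABLE_ID, job_config=job_config)
    load_job.result()                             # wait for the load to finish

    return 'Success!'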
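
To schedule the call, one option is a Cloud Scheduler job that hits the function's HTTP URL on a cron schedule. Below is a sketch using the google-cloud-scheduler client library; the project, region, and function URL are hypothetical, and you can create the same job from the console or the gcloud CLI instead:

# Sketch: create a Cloud Scheduler job that calls the function every hour.
# PROJECT, LOCATION, and FUNCTION_URL below are placeholders.
from google.cloud import scheduler_v1

PROJECT = 'my-project'
LOCATION = 'us-central1'
FUNCTION_URL = 'https://us-central1-my-project.cloudfunctions.net/func_data'

client = scheduler_v1.CloudSchedulerClient()
parent = f'projects/{PROJECT}/locations/{LOCATION}'

job = scheduler_v1.Job(
    name=f'{parent}/jobs/pull-api-hourly',
    schedule='0 * * * *',                         # every hour, cron syntax
    time_zone='Etc/UTC',
    http_target=scheduler_v1.HttpTarget(
        uri=FUNCTION_URL,
        http_method=scheduler_v1.HttpMethod.GET,
    ),
)

client.create_job(parent=parent, job=job)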