python, google-cloud-platform, google-bigquery, ftp, google-cloud-functions

Extracting Excel files from an FTP server to BigQuery using Cloud Functions


I am working on an automated script to download files from an FTP server and load them into BigQuery.

The problem is that BigQuery cannot load .xlsx or .xls files directly, so I am looking for a way to handle those formats as well, given that I plan to run this batch code in the cloud.

I mention the latter because one way to convert .xlsx files to .csv is to use something like:

import pandas as pd
data_xls = pd.read_excel('file_on_ftp.xlsx')
data_xls.to_csv('csvfile.csv', encoding='utf-8', index=False)

However, this creates local files somewhere in temporary storage. With Cloud Functions I would then have to monitor whether the file gets deleted afterwards, which becomes unreliable if a function crashes.
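For reference, the conversion itself does not have to touch disk at all: pandas can read the workbook from an in-memory buffer and return the CSV as a string. A rough sketch, where the FTP host, credentials, and file name are placeholders:

import io
from ftplib import FTP

import pandas as pd

# Placeholder FTP details -- replace with your own server, credentials and path
ftp = FTP('ftp.example.com')
ftp.login('user', 'password')

# Download the Excel file into an in-memory buffer instead of a temp file
xlsx_buffer = io.BytesIO()
ftp.retrbinary('RETR file_on_ftp.xlsx', xlsx_buffer.write)
xlsx_buffer.seek(0)

# Convert to CSV entirely in memory
data_xls = pd.read_excel(xlsx_buffer)
csv_string = data_xls.to_csv(index=False)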

Is there therefore a better way to load .xlsx files into BigQuery, or is this the way to go?


Solution

  • You might be interested in this guide, which was recently published: "Streaming data from Cloud Storage into BigQuery using Cloud Functions".

    The general architecture would be:

    1. upload the files in question from FTP to Cloud Storage
    2. your Cloud Function receives the upload event from Cloud Storage
    3. your Cloud Function loads the file into memory (no storage on disk)
    4. your Cloud Function streams the data to BigQuery

    I'm not sure if #1 suits your need to ensure that the file is not left behind somewhere, but if you are operating under the assumption that the file needs to be uploaded to GCP anyway, I think it's probably the best solution available (the alternative would be streaming to BigQuery directly from your local machine, or from an instance that is under your control).
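    For steps 2–4, a rough sketch of what such a Cloud Function could look like (the table reference below is a placeholder, and it assumes the first sheet of the workbook maps cleanly onto the BigQuery table's schema):

    import io

    import pandas as pd
    from google.cloud import bigquery, storage

    BQ_TABLE = 'my_project.my_dataset.my_table'  # placeholder table reference

    def excel_to_bigquery(event, context):
        """Background Cloud Function triggered by a Cloud Storage upload."""
        bucket_name = event['bucket']
        blob_name = event['name']

        if not blob_name.lower().endswith(('.xlsx', '.xls')):
            return  # ignore non-Excel uploads

        # Read the uploaded file into memory -- nothing is written to disk
        storage_client = storage.Client()
        blob = storage_client.bucket(bucket_name).blob(blob_name)
        xlsx_bytes = io.BytesIO(blob.download_as_bytes())

        # Parse the first sheet and stream the rows into BigQuery
        df = pd.read_excel(xlsx_bytes)
        bq_client = bigquery.Client()
        errors = bq_client.insert_rows_json(BQ_TABLE, df.to_dict(orient='records'))
        if errors:
            raise RuntimeError('BigQuery insert errors: {}'.format(errors))

    Note that insert_rows_json expects JSON-serializable values, so date or timestamp columns may need to be converted to strings first, and the target table has to exist with a matching schema.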