python, pandas, google-bigquery, google-cloud-storage, google-cloud-python

Efficiently write a Pandas dataframe to Google BigQuery


I'm trying to upload a pandas.DataFrame to Google BigQuery using the pandas.DataFrame.to_gbq() function documented here. The problem is that to_gbq() takes 2.3 minutes, while uploading directly to Google Cloud Storage takes less than a minute. I'm planning to upload a bunch of dataframes (~32), each of a similar size, so I want to know which alternative is the fastest.

This is the script that I'm using:

dataframe.to_gbq('my_dataset.my_table', 
                 'my_project_id',
                 chunksize=None, # I have tried with several chunk sizes, it runs faster when it's one big chunk (at least for me)
                 if_exists='append',
                 verbose=False
                 )

dataframe.to_csv(str(month) + '_file.csv') # the file size is 37.3 MB; this takes almost 2 seconds 
# manually upload the file into GCS GUI
print(dataframe.shape)
(363364, 21)
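
If the GUI step is a bottleneck, the upload to GCS can also be scripted. Here is a minimal sketch using the google-cloud-storage client; 'my_bucket' is a placeholder for an existing bucket, and the file name matches the to_csv() call above:

from google.cloud import storage

# Upload the CSV written above to GCS programmatically instead of via the GUI.
# 'my_bucket' is a placeholder; replace it with an existing bucket in your project.
storage_client = storage.Client(project='my_project_id')
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob(str(month) + '_file.csv')
blob.upload_from_filename(str(month) + '_file.csv')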

My question is: which of these is faster?

  1. Uploading the DataFrame using the pandas.DataFrame.to_gbq() function
  2. Saving the DataFrame as a CSV and then uploading it as a file to BigQuery using the Python API
  3. Saving the DataFrame as a CSV, uploading the file to Google Cloud Storage using this procedure, and then loading it into BigQuery from there (see the load sketch after this list)
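
For reference, the BigQuery half of alternative 3 can be run as a load job from the gs:// URI. A minimal sketch with the google-cloud-bigquery client; the bucket, file, dataset, and table names are placeholders:

from google.cloud import bigquery

# Load a CSV that is already in GCS into BigQuery (the second half of alternative 3).
bigquery_client = bigquery.Client(project='my_project_id')
table_ref = bigquery_client.dataset('my_dataset').table('my_table')

job_config = bigquery.LoadJobConfig()
job_config.source_format = 'CSV'
job_config.skip_leading_rows = 1  # skip the header row written by to_csv()
job_config.autodetect = True

load_job = bigquery_client.load_table_from_uri(
    'gs://my_bucket/my_file.csv', table_ref, job_config=job_config)
load_job.result()  # wait for the load job to finish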

Update:

Alternative 1 seems to be faster than alternative 2 (using pd.DataFrame.to_csv() and load_data_from_file() took 17.9 secs more on average over 3 loops):

from google.cloud import bigquery

def load_data_from_file(dataset_id, table_id, source_file_name):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    with open(source_file_name, 'rb') as source_file:
        # This example uses CSV, but you can use other formats.
        # See https://cloud.google.com/bigquery/loading-data
        job_config = bigquery.LoadJobConfig()
        job_config.source_format = 'CSV'
        job_config.autodetect = True
        job = bigquery_client.load_table_from_file(
            source_file, table_ref, job_config=job_config)

    job.result()  # Waits for job to complete

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))
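
For completeness, this is how the function above is called for one of the monthly files (month comes from the earlier snippet):

load_data_from_file('my_dataset', 'my_table', str(month) + '_file.csv')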

Solution

  • I did the comparison for alternatives 1 and 3 in Datalab using the following code:

    from datalab.context import Context
    import datalab.storage as storage
    import datalab.bigquery as bq
    import pandas as pd
    from pandas import DataFrame
    import time
    
    # Dataframe to write
    my_data = [(1, 2, 3)]
    for i in range(0, 100000):
        my_data.append((1, 2, 3))
    not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a','b','c'])
    
    #Alternative 1
    start = time.time()
    not_so_simple_dataframe.to_gbq('TestDataSet.TestTable', 
                     Context.default().project_id,
                     chunksize=10000, 
                     if_exists='append',
                     verbose=False
                     )
    end = time.time()
    print("time alternative 1 " + str(end - start))
    
    #Alternative 3
    start = time.time()
    sample_bucket_name = Context.default().project_id + '-datalab-example'
    sample_bucket_path = 'gs://' + sample_bucket_name
    sample_bucket_object = sample_bucket_path + '/Hello.txt'
    bigquery_dataset_name = 'TestDataSet'
    bigquery_table_name = 'TestTable'
    
    # Define storage bucket
    sample_bucket = storage.Bucket(sample_bucket_name)
    
    # Define the BigQuery table, then create or overwrite it using the DataFrame's schema
    table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)
    table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
    table.create(schema=table_schema, overwrite=True)
    
    # Write the DataFrame to GCS (Google Cloud Storage)
    %storage write --variable not_so_simple_dataframe --object $sample_bucket_object
    
    # Write the DataFrame to a BigQuery table
    table.insert_data(not_so_simple_dataframe)
    end = time.time()
    print("time alternative 3 " + str(end - start))
    

    and here are the results for n = {10000,100000,1000000}:

    n       alternative_1  alternative_3
    10000   30.72s         8.14s
    100000  162.43s        70.64s
    1000000 1473.57s       688.59s
    

    Judging from the results, alternative 3 is faster than alternative 1.