Tags: python, pandas, google-cloud-platform, google-bigquery, parquet

Loading BigQuery tables from large pandas DataFrames


I am trying to load a relatively large pandas dataframe df into a Google BigQuery table table_ref using the official Python google-cloud-bigquery client library.

So far I have tried two different approaches:

1) load the table directly from the dataframe in memory:

from google.cloud import bigquery

client = bigquery.Client()
client.load_table_from_dataframe(df, table_ref).result()  # wait for the load job to finish

2) save the dataframe to a Parquet file in Google Cloud Storage at the URI parquet_uri and load the table from that file:

from google.cloud import bigquery

df.to_parquet(parquet_uri)
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)
client.load_table_from_uri(parquet_uri, table_ref, job_config=job_config).result()

Both approaches lead to the same error:

google.api_core.exceptions.BadRequest: 400 Resources exceeded during query execution: UDF out of memory.; Failed to read Parquet file [...]. This might happen if the file contains a row that is too large, or if the total size of the pages loaded for the queried columns is too large.

The dataframe df has 3 columns and 184 million rows. Saved in Parquet format, it occupies 1.64 GB.

Is there any way to upload such a dataframe into a BigQuery table using the official Python client library?

Thank you in advance,

Giovanni


Solution

  • I was able to upload the large df to BigQuery by splitting it into a few chunks and appending each chunk to the table with a WRITE_APPEND load job, e.g. (a size-based variation follows the snippet):

    import numpy as np
    from google.cloud import bigquery

    client = bigquery.Client()
    # Split the dataframe into 5 chunks and append each one to the target table
    for df_chunk in np.array_split(df, 5):
        job_config = bigquery.LoadJobConfig()
        job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
        job = client.load_table_from_dataframe(df_chunk, table_ref, job_config=job_config)
        job.result()  # wait for the chunk to finish loading before sending the next one
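
    The chunk count above (5) was picked by hand. A possible variation, sketched below under the assumption that roughly 200 MB of in-memory data per chunk is small enough (an arbitrary budget, not a documented BigQuery limit), is to derive the number of chunks from the dataframe's memory footprint:

    import math

    import numpy as np
    from google.cloud import bigquery

    TARGET_CHUNK_BYTES = 200 * 1024 * 1024  # hypothetical per-chunk budget, tune as needed

    client = bigquery.Client()
    # Number of chunks needed so each chunk is roughly TARGET_CHUNK_BYTES in memory
    n_chunks = max(1, math.ceil(df.memory_usage(deep=True).sum() / TARGET_CHUNK_BYTES))

    job_config = bigquery.LoadJobConfig()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    for df_chunk in np.array_split(df, n_chunks):
        # Each chunk is appended to the same table; result() blocks until the load completes
        client.load_table_from_dataframe(df_chunk, table_ref, job_config=job_config).result()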