google-cloud-platform, google-bigquery, google-cloud-storage, google-cloud-vertex-ai, google-cloud-api-gateway

Save the result of a BigQuery table query to Cloud Storage


I would like to know what is the OPTIMAL way to store the result of a Google BigQuery table query to Google Cloud Storage. My code, which currently runs in a Jupyter Notebook (in Vertex AI Workbench, in the same project as both the BigQuery data source and the Cloud Storage destination), looks as follows:

# CELL 1 OF 2

from google.cloud import bigquery
bqclient = bigquery.Client()

# The query string can vary:
query_string = """
        SELECT *  
        FROM `my_project-name.my_db.my_table` 
        LIMIT 2000000
        """

dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(
        create_bqstorage_client=True,
    )
)
print("Dataframe shape: ", dataframe.shape)

# CELL 2 OF 2:

import pandas as pd
dataframe.to_csv('gs://my_bucket/test_file.csv', index=False)

This code takes around 7.5 minutes to successfully complete.

Is there a more OPTIMAL way to achieve what was done above? (Mainly faster, but maybe something else could be improved as well.)

Some additional notes:

  1. I want to run it "via a Jupyter Notebook" (in Vertex AI Workbench), because sometimes some data preprocessing or special filtering must be done, which cannot easily be accomplished via SQL queries.
  2. For the first part of the code, I have discarded pandas.read_gbq, as it was giving me some weird EOF errors when (experimentally) "storing as .CSV and reading back".
  3. Intuitively, I would focus the optimization efforts on the second half of the code (CELL 2 OF 2), as the first one was borrowed from the official Google documentation. I have tried one suggested approach, but it did not work; however, another option from the same thread worked OK.
  4. It is likely that this code will be included in some Docker image afterwards, so "as few libraries as possible" must be used.

Thank you.


Solution

  • After some experiments, I think I have arrived at a solution to my original post. First, the updated code:

    import pandas as pd  # Just one library is imported this time
    
    # This SQL query can vary, modify it to match your needs
    query_string = """
    SELECT *
    FROM `my_project.my_db.my_table`
    LIMIT 2000000
    """
    
    # One liner to query BigQuery data.
    downloaded_dataframe = pd.read_gbq(query_string, dialect='standard', use_bqstorage_api=True)
    
    # Data processing (OPTIONAL, modify it to match your needs)
    # I won't do anything this time, just upload the previously queried data
    
    # Data store in GCS
    downloaded_dataframe.to_csv('gs://my_bucket/uploaded_data.csv', index=False)
    

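    A quick note on dependencies, since note 4 of my original post asks for as few libraries as possible in the Docker image: even though only pandas is imported here, the one-liner still relies on two optional pandas dependencies being installed, pandas-gbq (which backs pd.read_gbq) and gcsfs (which lets pandas write directly to gs:// paths). A minimal check that can be dropped into the notebook or the image build (the import names are the standard ones; pin versions as you prefer):

    # Sanity check for the hidden dependencies of the one-liner above:
    # pd.read_gbq() is backed by pandas-gbq, and writing to a gs:// path
    # from to_csv() goes through gcsfs; both must be installed.
    import pandas_gbq
    import gcsfs
    print("pandas-gbq:", pandas_gbq.__version__, "| gcsfs:", gcsfs.__version__)
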
    Some final notes:

    1. I have not done any in-depth research on processing speed vs. the number of rows in a BigQuery table; however, with the updated code and the original query, the whole process now takes ~6 minutes (down from ~7.5), which is enough for the time being. This answer therefore probably still has room for further improvement (one possible direction is sketched after these notes), but it is better than the original situation.
    2. The EOF error I mentioned in my original post was: ParserError: Error tokenizing data. C error: EOF inside string starting at row 70198. In the end I realized that it did not have anything to do with the pandas_gbq function, but with how I was saving the data. I was "experimentally" storing the .csv file in the Vertex AI Workbench local storage, then downloading it to my local device, and when trying to open that data on my local device I kept stumbling upon that error; I did not get it when downloading the same .csv data from Cloud Storage. Why? It turns out that if you download the .csv file from Vertex AI Workbench local storage "very quickly" after it gets generated (i.e., after a few seconds), the data is simply still incomplete, but you get no error or warning message: the download simply starts. For this reason, I think it is safer to export your data to Cloud Storage and then download it from there (a small sanity check is sketched after these notes). This behaviour is more noticeable with large files (e.g., my own generated file, which was ~3.1 GB in size).
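
    As a possible direction for the further improvement mentioned in note 1 (not benchmarked here), the pandas round-trip could be skipped entirely by letting BigQuery export the query result to Cloud Storage server-side, at the cost of going back to the google-cloud-bigquery client. A rough sketch, using the same placeholder names as above:

    from google.cloud import bigquery

    bqclient = bigquery.Client()

    query_string = """
    SELECT *
    FROM `my_project.my_db.my_table`
    LIMIT 2000000
    """

    # Run the query; a plain SELECT is materialised in a temporary table,
    # which the finished job exposes as `job.destination`.
    job = bqclient.query(query_string)
    job.result()

    # Export that temporary table straight to Cloud Storage, server-side.
    # The wildcard lets BigQuery shard the output into several files,
    # which becomes mandatory once the export exceeds 1 GB.
    extract_job = bqclient.extract_table(
        job.destination,
        "gs://my_bucket/exported_data-*.csv",
        job_config=bigquery.ExtractJobConfig(destination_format="CSV"),
    )
    extract_job.result()

    The obvious trade-off is that any pandas-side preprocessing would then have to happen elsewhere, so this only applies when the SQL query already produces the final result.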
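
    Related to note 2, this is the small sanity check I mean: before downloading, confirm that the exported object is actually present in the bucket and look at its size, instead of trusting a local file that may still be half-written (bucket and object names are the placeholders used above; this uses the google-cloud-storage client):

    from google.cloud import storage

    # Inspect the exported object in Cloud Storage before downloading it.
    storage_client = storage.Client()
    blob = storage_client.bucket("my_bucket").get_blob("uploaded_data.csv")

    if blob is None:
        print("Object not found yet - the upload may still be in progress.")
    else:
        # blob.size comes from the GCS object metadata (in bytes), so a missing
        # or unexpectedly small export is easy to spot before downloading.
        print(f"uploaded_data.csv found, size: {blob.size / 1e9:.2f} GB")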

    Hope this helps.

    Thank you.