I would like to know the OPTIMAL way to store the result of a Google BigQuery table query in Google Cloud Storage. My code, currently run in a Jupyter Notebook (in Vertex AI Workbench, in the same project as both the BigQuery data source and the Cloud Storage destination), looks as follows:
# CELL 1 OF 2
from google.cloud import bigquery
bqclient = bigquery.Client()
# The query string can vary:
query_string = """
SELECT *
FROM `my_project.my_db.my_table`
LIMIT 2000000
"""
dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(
        # Use the BigQuery Storage API for a faster download
        create_bqstorage_client=True,
    )
)
print("Dataframe shape: ", dataframe.shape)
# CELL 2 OF 2
import pandas as pd
dataframe.to_csv('gs://my_bucket/test_file.csv', index=False)
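As a side note, the gs:// path in the last line works because pandas hands the write off to the gcsfs package, so that dependency must be installed in the notebook kernel. A quick check, in case your environment differs from mine:

import importlib.util
# pandas delegates gs:// I/O to gcsfs (via fsspec); fail early if it is missing
assert importlib.util.find_spec('gcsfs'), "Missing dependency: run 'pip install gcsfs'"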
The two cells above take around 7.5 minutes to complete successfully.
Is there a more OPTIMAL way to achieve what was done above? (That would mean faster, primarily, but maybe something else could be improved as well.)
Some additional notes:
I am mostly interested in improving the second cell (CELL 2 OF 2), as the first one was borrowed from the official Google documentation. I have tried this but it does not work; however, in the same thread, this other option worked OK. Thank you.
After some experiments, I think I have arrived at a solution to my original question. First, the updated code:
import pandas as pd # Just one library is imported this time
# This SQL query can vary, modify it to match your needs
query_string = """
SELECT *
FROM `my_project.my_db.my_table`
LIMIT 2000000
"""
# One-liner to query BigQuery data (requires the pandas-gbq package)
downloaded_dataframe = pd.read_gbq(query_string, dialect='standard', use_bqstorage_api=True)
# Data processing (OPTIONAL, modify it to match your needs)
# I won't do anything this time, just upload the previously queried data
# Store the data in GCS
downloaded_dataframe.to_csv('gs://my_bucket/uploaded_data.csv', index=False)
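For reference, there is also a fully server-side route that I have not benchmarked against the code above: BigQuery's EXPORT DATA statement writes query results straight to Cloud Storage, without pulling any rows into the notebook, which may be a better fit if you do not need the data in pandas at all. A minimal sketch (bucket path and table are placeholders; note that the destination URI must contain exactly one '*' wildcard, and BigQuery may shard the output into several files):

from google.cloud import bigquery

bqclient = bigquery.Client()

# EXPORT DATA runs entirely inside BigQuery and writes the result to GCS
export_statement = """
EXPORT DATA OPTIONS(
  uri='gs://my_bucket/exported/data_*.csv',
  format='CSV',
  overwrite=true,
  header=true
) AS
SELECT *
FROM `my_project.my_db.my_table`
LIMIT 2000000
"""
bqclient.query(export_statement).result()  # Blocks until the export finishes

Since my own workflow optionally processes the DataFrame before uploading it, I kept the pandas route; for a pure table-to-bucket export, EXPORT DATA avoids the download entirely.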
Some final notes:
At some point during my tests, I stumbled upon the following error when opening the generated .csv file:

ParserError: Error tokenizing data. C error: EOF inside string starting at row 70198

In the end, I realized it did not have anything to do with the pandas_gbq function, but with how I was saving the data: I was 'experimentally' storing the .csv file in the Vertex AI Workbench local storage, then downloading it to my local device, and when trying to open it there I kept stumbling upon that error. I did not get the same error when downloading the .csv data from Cloud Storage instead.

Why? It turns out that if you download the .csv file from Vertex AI Workbench local storage "very quickly" after it gets generated (i.e., within a few seconds), the file is simply still incomplete, but you get no error or warning message: the download simply starts. For this reason, I think it is safer to export your data to Cloud Storage and then download it from there. This behaviour is more noticeable with large files (my generated file was ~3.1 GB in size).
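If you do need a local copy, one defensive pattern is to confirm that the object actually exists in Cloud Storage, and check its size, before downloading. Unlike a file that is still being written to local disk, a GCS object only becomes visible once its upload has fully completed. A short sketch (bucket and object names are placeholders):

from google.cloud import storage

storage_client = storage.Client()
blob = storage_client.bucket('my_bucket').get_blob('uploaded_data.csv')

if blob is None:
    # The object is not visible yet, so the export/upload has not finished
    raise FileNotFoundError('uploaded_data.csv not found in GCS yet')

print(f'Object size in GCS: {blob.size / 1e9:.2f} GB')
blob.download_to_filename('/tmp/uploaded_data.csv')  # Safe: the object is complete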
Hope this helps. Thank you.