I am currently exporting my data (from a destination table in BigQuery) to a bucket in GCS, doing this programmatically using the BigQuery API.
There is a constraint when exporting data from BigQuery to GCS: a single exported file cannot be larger than 1 GB, so the export has to be split across multiple files.
Here is the code snippet for the function exportDataToGCS() where this is happening:
http = authorize()
bigquery_service = build('bigquery', 'v2', http=http)
query_request = bigquery_service.jobs()

DESTINATION_PATH = constants.GCS_BUCKET_PATH + canonicalDate + '/'
query_data = {
    'projectId': 'ga-cnqr',
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': constants.PROJECT_ID,
                'datasetId': constants.DEST_TABLES_DATASET_ID,
                'tableId': canonicalDate,
            },
            'destinationUris': [DESTINATION_PATH + canonicalDate + '-*.gz'],
            'destinationFormat': 'CSV',
            'printHeader': 'false',
            'compression': 'GZIP'
        }
    }
}

query_response = query_request.insert(projectId=constants.PROJECT_NUMBER,
                                      body=query_data).execute()
After this function is executed, the exported files show up in my GCS bucket split into multiple sharded parts.
However, I am curious to know whether there is any scenario where the export was supposed to be split into 10 parts, but only 3 parts made it to the bucket because the above function failed.
That is, could there be a partial export?
Could something like a network drop, or the process running the function being killed, lead to this? Is this a blocking call or asynchronous?
Thanks in advance.
Update 1: Status parameter in query response
This is how I am checking for the DONE status.
while True:
    status = query_request.get(projectId=constants.PROJECT_NUMBER,
                               jobId=query_response['jobReference']['jobId']).execute()
    if 'DONE' == status['status']['state']:
        logging.info("Finished exporting for the date : " + stringCanonicalDate)
        return
Partial exports are possible if the job fails for some reason mid-way through execution.
If the job is in the DONE state, and there are no errors in the job, then all the data has been exported.
I recommend waiting a bit between polls for job completion -- you can hit rate-limit errors if you poll too quickly, and given that these async jobs are not fast, millisecond accuracy isn't needed.
Working with your sample code, here is how you could test for the presence of errors:
while True:
    status = query_request.get(projectId=constants.PROJECT_NUMBER,
                               jobId=query_response['jobReference']['jobId']).execute()
    if 'DONE' == status['status']['state']:
        if 'errorResult' in status['status']:
            logging.error("Error exporting for the date : " + stringCanonicalDate)
            logging.error(status['status']['errorResult'])
            return False
        logging.info("Finished exporting for the date : " + stringCanonicalDate)
        return True
    time.sleep(1)
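If you want to be gentler on the rate limits than a fixed one-second sleep, one option is to back off between polls. A minimal sketch of that idea, reusing the same query_request and query_response as above (the 60-second cap is an arbitrary choice, not something required by the API):

import time

# Poll with exponential backoff, capped at 60 seconds between polls.
delay = 1
while True:
    status = query_request.get(projectId=constants.PROJECT_NUMBER,
                               jobId=query_response['jobReference']['jobId']).execute()
    if status['status']['state'] == 'DONE':
        break
    time.sleep(delay)
    delay = min(delay * 2, 60)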
To be super robust, you could also catch HTTP errors that occasionally occur while in the polling wait loop. It looks like you are using the Python apiclient, which raises apiclient.errors.HttpError on such failures.
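As a sketch of what that could look like -- the retry policy here (log, wait five seconds, try again) is just an assumption for illustration, not something prescribed by the library:

from apiclient.errors import HttpError
import logging
import time

while True:
    try:
        status = query_request.get(projectId=constants.PROJECT_NUMBER,
                                   jobId=query_response['jobReference']['jobId']).execute()
    except HttpError as err:
        # Transient HTTP failures can occur while polling; log and retry
        # instead of letting the loop crash.
        logging.warning("Polling attempt failed, retrying: %s", err)
        time.sleep(5)
        continue
    if 'DONE' == status['status']['state']:
        if 'errorResult' in status['status']:
            logging.error(status['status']['errorResult'])
            break
        logging.info("Export job finished")
        break
    time.sleep(1)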