python, google-cloud-platform, google-bigquery, google-cloud-python

Is there a BigQuery LoadJob config to filter out unwanted columns using schema autodetect?


I appreciate this may be quite trivial, but I am struggling to find an elegant solution.

I am invoking a load job through the BigQuery Python API, using schema autodetect to parse a CSV file into a BigQuery table. I do have access to modify the job configuration, through Python in this case.

Is it possible to ignore certain columns as part of the load job?

For example

I am creating a LoadJob sourced from the following CSV file (which I have formatted to make it easier on the eyes):

First_Name, Age, Gender
   John,    26,   Male

Is it possible to invoke a LoadJob using the Python BigQuery API which will produce the following table:

----------------
| Age | Gender |
----------------
| 26  |  Male  |
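
For reference, this is roughly how I invoke the load job today (a minimal sketch; the bucket path and table IDs are placeholders):

from google.cloud import bigquery

bq_client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the CSV header row
    autodetect=True,      # infer column names and types from the file
)

load_job = bq_client.load_table_from_uri(
    'gs://my-bucket/myFile.csv',
    'my-project.my-dataset.my-table',
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete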

My current solution, shown below, creates an external table over the CSV and uses SQL to filter the columns and save the result as a new table. Surely there is an easier way to do this via the bigquery.job.LoadJobConfig class?

from google.cloud import bigquery

bq_client = bigquery.Client()

# Configure the external table, pointing at the CSV file in Cloud Storage
uri = 'gs://my-bucket/myFile.csv'
table = bigquery.Table('my-project.my-dataset.my-temporary-table')
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = [
    uri
]
external_config.options.skip_leading_rows = 1
external_config.autodetect = True
table.external_data_configuration = external_config

# Create the external table
external_table = bq_client.create_table(table)

filtered_table = bigquery.Table('my-project.my-dataset.my-filtered-table')

filtered_table_config = bigquery.QueryJobConfig(destination=filtered_table)
sql = f"""
    SELECT Age, Gender
    FROM {external_table.project}.{external_table.dataset_id}.{external_table.table_id}
"""

# Start the query, passing in the extra configuration.
query_job = bq_client.query(sql, job_config=filtered_table_config)  # Make an API request.
query_job.result()  # Block until the query finishes.
bq_client.delete_table(table=external_table, timeout=30)  # Remove the external table

Solution

  • No, you can't do that. The best approach is to load the whole file into a table and then to perform a SELECT on only the desired columns.

    OK, you will pay for some useless storage in BigQuery, but you won't pay for the processing (because you don't select the unwanted columns).

    If there is too much unused data, the next step is to query the useful data, save it in a definitive table, and then delete the temporary table with all the data, as sketched below.
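
A minimal sketch of that load-then-filter approach, reusing the placeholder bucket path, table IDs, and column names from the question (my-filtered-table is an illustrative name):

from google.cloud import bigquery

bq_client = bigquery.Client()

# 1. Load the whole CSV into a temporary table, letting BigQuery autodetect the schema.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
bq_client.load_table_from_uri(
    'gs://my-bucket/myFile.csv',
    'my-project.my-dataset.my-temporary-table',
    job_config=load_config,
).result()  # wait for the load to finish

# 2. Select only the useful columns into a definitive table.
query_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(
        'my-project.my-dataset.my-filtered-table'
    )
)
bq_client.query(
    'SELECT Age, Gender FROM `my-project.my-dataset.my-temporary-table`',
    job_config=query_config,
).result()  # wait for the query to finish

# 3. Drop the temporary table that holds the unwanted columns.
bq_client.delete_table('my-project.my-dataset.my-temporary-table')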