sql, google-cloud-platform, google-bigquery

BigQuery select last modified time using __TABLES__ from all tables in all datasets?


I would like to get a list or CSV and store it in GCS, where the list contains all the tables and their last modified dates.

I have more than 130 datasets and 10-20 tables in each dataset. Since I will be running this query every day using scheduled GitHub Actions to compare and check which tables have been modified after a 2-month gap, and to send an alert notification to my team every day, I am trying to find a simple solution to fetch this information from BQ.

I have got the query for a single dataset:

SELECT 
   table_id,
   DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified
FROM
   `project_id.dataset_name.__TABLES__`

and to find all datasets I can use this query:

SELECT
    schema_name
FROM
    `project_id`.`region-europe-west3`.INFORMATION_SCHEMA.SCHEMATA;

Solution

  • In general, this is called dynamic SQL, and it can be done in BigQuery with the EXECUTE IMMEDIATE statement.

    -- build one select per dataset and run the concatenated query dynamically
    execute immediate (
      select
        string_agg(
          concat(
            'select ',
            chr(39),  -- chr(39) is a single quote, used to wrap the dataset name as a string literal
            schema_name,
            chr(39),
            ' as schema_name, table_id, date(timestamp_millis(last_modified_time)) as last_modified from project-id.',
            schema_name,
            '.__TABLES__'
          ),
          ' union all '
        ) as query
      from `project-id`.`region-id`.INFORMATION_SCHEMA.SCHEMATA
    )
    ;
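
    For illustration, with two hypothetical datasets named dataset_a and dataset_b, the string built by the inner query and handed to execute immediate would look roughly like this (the real string comes back as a single line):

    select 'dataset_a' as schema_name, table_id, date(timestamp_millis(last_modified_time)) as last_modified from project-id.dataset_a.__TABLES__
    union all
    select 'dataset_b' as schema_name, table_id, date(timestamp_millis(last_modified_time)) as last_modified from project-id.dataset_b.__TABLES__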
    

    If you have a lot of schemata and are hitting rate limits, you may consider looping over the datasets in a scripting language and adding delays between requests. Here's a Python/SQL example:

    import csv
    import time

    from google.cloud import bigquery

    client = bigquery.Client(project='project-id')
    data = []
    for dataset in client.list_datasets():
        dataset_id = dataset.dataset_id
        # Query the hidden __TABLES__ metadata table of this dataset.
        query = f"""
            select
                table_id,
                timestamp_millis(last_modified_time) as modified
            from `{dataset_id}`.`__TABLES__`
        """
        query_job = client.query(query)
        results = query_job.result()
        for row in results:
            data.append([dataset_id, row.table_id, row.modified])
        # time.sleep(1)  # uncomment to pause between datasets if you hit rate limits

    with open('file.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(data)
    

    Here's a pure Python example:

    import csv
    import time

    from google.cloud import bigquery

    client = bigquery.Client(project='project-id')
    data = []
    for dataset in client.list_datasets():
        for item in client.list_tables(dataset.dataset_id):
            # list_tables() returns lightweight items, so fetch the full table
            # metadata to get the last-modified timestamp.
            table = client.get_table(f"{dataset.dataset_id}.{item.table_id}")
            data.append([dataset.dataset_id, table.table_id, table.modified])
            # time.sleep(1)  # uncomment to pause between tables
        # time.sleep(1)  # uncomment to pause between datasets

    with open('file.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(data)
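
    Since the goal is to store the CSV in GCS, the file written above can then be uploaded with the google-cloud-storage client. This is a minimal sketch; the bucket name and object path are placeholders, and the account running it needs write access to the bucket:

    from google.cloud import storage

    # Hypothetical bucket and object names - replace with your own.
    storage_client = storage.Client(project='project-id')
    bucket = storage_client.bucket('your-bucket-name')
    blob = bucket.blob('bq-metadata/file.csv')
    blob.upload_from_filename('file.csv')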
    

    You can try this out easily in Google Colab; just add this at the top and make sure that you are signed in to an account with access to run jobs and view the data.

    from google.colab.auth import authenticate_user
    authenticate_user()
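
    To get at the 2-month check mentioned in the question, the rows collected above can be compared against a cutoff before writing or alerting. This is a rough sketch that assumes a 60-day threshold and that you want the tables that have not changed since then:

    from datetime import datetime, timedelta, timezone

    # Assumed threshold: 60 days as an approximation of "2 months".
    cutoff = datetime.now(timezone.utc) - timedelta(days=60)

    # data holds [dataset_id, table_id, modified] rows from the examples above;
    # the modified values are timezone-aware datetimes, so they compare directly.
    stale_tables = [row for row in data if row[2] < cutoff]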