I would like to get a list or CSV and store it in GCS, where the list contains all of the tables and their last modified dates.
I have more than 130 datasets, with 10-20 tables in each dataset. I will be running this query every day via scheduled GitHub Actions to check which tables were modified after a 2-month gap and to send a daily alert notification to my team, so I am looking for a simple way to fetch this information from BigQuery.
I have got the query for a single dataset:
SELECT
table_id,
DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified
FROM
`project_id.dataset_name.__TABLES__`
and to find all datasets I can use this query:
SELECT
schema_name
FROM
`project_id`.`region-europe-west3`.INFORMATION_SCHEMA.SCHEMATA;
In general, this is called dynamic SQL, and it can be done in BigQuery with the EXECUTE IMMEDIATE statement.
execute immediate (
  select
    string_agg(
      concat(
        'select ',
        chr(39),
        schema_name,
        chr(39),
        ' as schema_name, table_id, date(timestamp_millis(last_modified_time)) as last_modified from `project-id.',
        schema_name,
        '.__TABLES__`'
      ),
      ' union all '
    ) as query
  from `project-id`.`region-id`.INFORMATION_SCHEMA.SCHEMATA
);
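If you want to capture the output of that dynamic query as a CSV from Python (for example inside your GitHub Actions job), one option is to run the inner SELECT to build the query text and then execute the generated query yourself. A minimal sketch, where project-id, region-id, and file.csv are placeholders:
import csv
from google.cloud import bigquery

client = bigquery.Client(project='project-id')

# Step 1: build the big UNION ALL query text (same SELECT used inside execute immediate).
build_sql = """
select
  string_agg(
    concat(
      'select ', chr(39), schema_name, chr(39),
      ' as schema_name, table_id,',
      ' date(timestamp_millis(last_modified_time)) as last_modified',
      ' from `project-id.', schema_name, '.__TABLES__`'
    ),
    ' union all '
  ) as query
from `project-id`.`region-id`.INFORMATION_SCHEMA.SCHEMATA
"""
rows = list(client.query(build_sql).result())
generated_query = rows[0]['query']

# Step 2: run the generated query and write the rows to a CSV file.
with open('file.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['schema_name', 'table_id', 'last_modified'])
    for row in client.query(generated_query).result():
        writer.writerow([row.schema_name, row.table_id, row.last_modified])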
If you have a lot of schemata and are hitting rate limits, you may consider using some kind of loop in a scripting language with delays. Here's a Python/SQL example:
import csv
import time

from google.cloud import bigquery

client = bigquery.Client(project='project-id')

data = []
for dataset in client.list_datasets():
    dataset_id = dataset.dataset_id
    # Pull table_id and the last modified timestamp from the dataset's __TABLES__ metadata.
    query = f"""
        select
            table_id,
            timestamp_millis(last_modified_time) as modified
        from `{dataset_id}`.`__TABLES__`
    """
    query_job = client.query(query)
    results = query_job.result()
    for row in results:
        data.append([dataset_id, row.table_id, row.modified])
    # time.sleep(1)  # optional pause between datasets if you hit rate limits

with open('file.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(data)
Here's a pure Python example:
import csv
import time

from google.cloud import bigquery

client = bigquery.Client(project='project-id')

data = []
for dataset in client.list_datasets():
    for table in client.list_tables(dataset.dataset_id):
        # list_tables() does not populate the modified timestamp, so fetch the full table.
        table = client.get_table(f"{dataset.dataset_id}.{table.table_id}")
        data.append([dataset.dataset_id, table.table_id, table.modified])
        # time.sleep(1)  # optional pause between tables
    # time.sleep(1)  # optional pause between datasets

with open('file.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(data)
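Both snippets leave file.csv on local disk (or on the GitHub Actions runner). To store it in GCS as you describe, here is a minimal sketch using the google-cloud-storage client; the bucket name and object path are placeholders:
from google.cloud import storage

storage_client = storage.Client(project='project-id')
bucket = storage_client.bucket('my-bucket')  # placeholder bucket name
blob = bucket.blob('bq-table-report/file.csv')  # placeholder object path in the bucket
blob.upload_from_filename('file.csv')  # upload the CSV written by either snippet above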
You can try all of this out easily in Google Colab; just add the following two lines at the top of the notebook and make sure you are signed in to an account with access to run jobs and view the data.
from google.colab import auth
auth.authenticate_user()