I am using the Firebase Extension Stream Firestore to BigQuery to export my data to BQ. It has been working great, and I have been using it for a couple of years now. However, the table size is growing, which is increasing the cost of my queries. I have the changelog table created by the extension, plus the schema views generated by npx @firebaseextensions/fs-bq-schema-views.
A single SELECT from the mytable_schema_latest table now processes 3.2 GB.
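For context, a quick way to see which tables in the dataset are actually taking up space is to query the dataset's __TABLES__ metadata. A rough sketch, where your_dataset is a placeholder for the dataset the extension writes to:

-- List table sizes in the dataset, largest first (your_dataset is a placeholder)
SELECT
  table_id,
  row_count,
  ROUND(size_bytes / POW(10, 9), 2) AS size_gb
FROM your_dataset.__TABLES__
ORDER BY size_bytes DESC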
Of the options I can think of, removing old rows seems the simplest, so that is what I am looking to go with for now. Is there a recommended way, or any guides, on how to do this?
I only want to remove historic changes that occurred before a certain date, and only where the document has since been updated, so the latest version of each document is always kept. It may also be a good idea to copy these rows to a historic table before deleting them, in case I need them in the future.
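Something along these lines is what I have in mind for the archive step; the archive table name and the cutoff date are just placeholders:

-- Copy changelog rows older than a cutoff into an archive table before deleting them
-- (your_dataset, your_table_changelog_archive and the cutoff date are placeholders)
CREATE TABLE your_dataset.your_table_changelog_archive AS
SELECT *
FROM your_dataset.your_table_changelog
WHERE timestamp < TIMESTAMP('2023-01-01')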
Update 2024-03-28: I noticed there is a new parameter in the latest version of the extension. There is not much info about it, but maybe this will do the trick?
Exclude old data payloads: If enabled, table rows will never contain old data (document snapshot before the update), which should be more performant, and avoid potential resource limitations.
Update 2024-05-02: I used Greg's suggestion below and it worked well, although due to the possibility of duplicate document_id values I needed to use document_name, which is the full document path and is actually unique.
DELETE FROM your_dataset.your_table_changelog
WHERE STRUCT(document_name, timestamp) NOT IN (
SELECT AS STRUCT document_name, MAX(timestamp) AS timestamp
FROM your_dataset.your_table_changelog
GROUP BY document_name
)
Update 2024-05-06: Turns out I needed to add the extra timestamp condition at the bottom of the WHERE clause, as I was getting the error: UPDATE or DELETE statement over table ... would affect rows in the streaming buffer, which is not supported
DELETE FROM your_dataset.your_table_changelog
WHERE STRUCT(document_name, timestamp) NOT IN (
SELECT AS STRUCT document_name, MAX(timestamp) AS timestamp
FROM your_dataset.your_table_changelog
GROUP BY document_name
)
AND timestamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 45 MINUTE)
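Before running the DELETE, it is worth previewing how many rows it would remove. The same predicate as a plain SELECT (same placeholder names as above):

-- Count the rows the DELETE above would remove
SELECT COUNT(*) AS rows_to_delete
FROM your_dataset.your_table_changelog
WHERE STRUCT(document_name, timestamp) NOT IN (
  SELECT AS STRUCT document_name, MAX(timestamp) AS timestamp
  FROM your_dataset.your_table_changelog
  GROUP BY document_name
)
AND timestamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 45 MINUTE)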
To create a "backup" of a given table, you can simply run a query and then use SAVE RESULTS >> BigQuery Table to save to a new table. Now you have a "backup" and if something goes wrong you can restore from this new table.
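If you prefer to stay in SQL rather than go through the console UI, a plain CREATE TABLE ... AS SELECT does the same job (the table names here are placeholders):

-- Full-table backup via CTAS (your_dataset, your_table and your_table_backup are placeholders)
CREATE TABLE your_dataset.your_table_backup AS
SELECT *
FROM your_dataset.your_table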
I think the following SQL statement might delete the "old" rows for you. It keeps only the row with the latest timestamp for each document_name and deletes everything else:
DELETE FROM your_dataset.your_table
WHERE STRUCT(document_name, timestamp) NOT IN (
SELECT AS STRUCT document_name, MAX(timestamp) AS timestamp
FROM your_dataset.your_table
GROUP BY document_name
)