I was hoping for an explanation of some unexpected BQ snapshot behavior. Here's a minimal example to create a table and snapshot it twice:
-- Create a partitioned table and populate it with data
CREATE OR REPLACE TABLE `your_project.your_dataset.snapshot_example_data` (
id STRING,
value STRING,
created_on DATE
)
PARTITION BY created_on AS
SELECT
GENERATE_UUID() AS id,
GENERATE_UUID() AS value,
CASE
WHEN n <= 500000 THEN DATE("2025-01-01")
ELSE DATE("2025-02-01")
END AS created_on
FROM
UNNEST(GENERATE_ARRAY(1, 1000000)) AS n;
-- Create the first snapshot
CREATE OR REPLACE SNAPSHOT TABLE `your_project.your_dataset.snapshot_example_data_1`
CLONE `your_project.your_dataset.snapshot_example_data`;
-- Update some rows in the January partition
UPDATE `your_project.your_dataset.snapshot_example_data`
SET value = GENERATE_UUID()
WHERE created_on = DATE('2025-01-01')
AND STARTS_WITH(value, 'a');
-- Create the second snapshot
CREATE OR REPLACE SNAPSHOT TABLE `your_project.your_dataset.snapshot_example_data_2`
CLONE `your_project.your_dataset.snapshot_example_data`;
My second snapshot is the same size as the first, in both logical bytes and physical bytes, even though I only modified a subset of rows, columns, and partitions. I'd the second snapshot to be significantly smaller, storing only the differences from the first.
Why is this the case? All I can think of is that there are unexpected consequences of creating snapshots within the time travel window, but doesn't seem likely.
INFORMATION_SCHEMA.TABLE_STORAGE
doesn't account for the way snapshot storage works. From the docs:
Clones and snapshots show *_BYTES column values as if they were complete tables, rather than showing the delta from the storage used by the base table, so they are an over-estimation. Your bill does account correctly for this delta in storage usage. For more information on the delta bytes stored and billed by clones and snapshots, see the TABLE_STORAGE_USAGE_TIMELINE view.
I wanted to test that suggestion out, but also from the docs:
The data in this table is not kept in real time. It takes approximately 72 hours for table data to be reflected in this view.
EDIT: (SpongeBob-style "Three Days Later") I came back to check. snapshot1
stores almost 50% of the original table (data
) and snapshot2
... doesn't even show up, which makes sense, even if we might like it to say 0 bytes instead.
select
usage_date,
table_name,
billable_total_logical_usage,
billable_active_logical_usage
from `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE_USAGE_TIMELINE
where table_schema = 'temp'
and table_name in ('snapshot1', 'snapshot2', 'data')
;
-- 2025-02-22 data 5335051 5335051
-- 2025-02-22 snapshot1 2667228 2667228