google-cloud-platform, google-cloud-bigtable

How to calculate the compressed storage size of rows matching a key regex in Google Cloud Bigtable?


I have a very large Google Cloud Bigtable table, currently between 52 and 57 TB, and I need to find out how much storage a specific subset of rows is consuming. The rows I'm interested in can be identified with a regular expression on their row key.

I've attempted to use the cbt command-line tool, but it seems impractical for this task. Even a simple operation to read the first 100 rows matching my regex takes a very long time or times out, as the documentation warns.

More importantly, even if I could efficiently retrieve the data for the matching keys, I'm not sure if the size I calculate locally would be accurate. I suspect cbt provides the uncompressed data, but what I need is the actual, post-compression storage size that these rows occupy on Google's servers. I need this number to plan future storage needs.

Does any tool, API, or other mechanism exist that can provide this information?


Solution

  • Google Cloud Bigtable does not provide a native tool, API, or built-in feature that directly reports the post-compression storage size of a specific set of rows matching a row-key regex, so you will have to estimate it.

    One option is a Dataflow job that reads from Bigtable, filters rows by your regex, and writes the results to GCS in a compressed format. You can then use GCS tools (for example, gsutil du) to check the resulting size.
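That pipeline boils down to: filter rows by key, serialize them, write them compressed, and measure the output. A minimal local stand-in for those steps, using gzip from the standard library in place of GCS (the row data and regex here are purely illustrative, not a real Bigtable read):

```python
import gzip
import io
import re

pattern = re.compile(rb"^user#")  # hypothetical row-key regex

# Stand-in for rows read from Bigtable: (row_key, value_bytes) pairs
fake_rows = [
    (b"user#001", b"some cell value " * 50),
    (b"order#002", b"this row does not match"),
    (b"user#003", b"another cell value " * 50),
]

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    for key, value in fake_rows:
        if pattern.match(key):
            gz.write(key + b"\t" + value + b"\n")

compressed_size = buf.getbuffer().nbytes
print(f"Compressed size of matching rows: {compressed_size} bytes")
```

In the real pipeline, the gzip sink would be replaced by a compressed write to GCS, and the size check by gsutil du on the output prefix.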

    Alternatively, use the Bigtable client libraries to scan rows matching your regex, sum their uncompressed size, and estimate the compressed size by applying an assumed compression ratio (Bigtable compresses data internally; the actual ratio depends heavily on your data).

    You can pattern your code on something like this:

    from google.cloud import bigtable
    from google.cloud.bigtable import row_filters
    
    # Initialize Bigtable client
    client = bigtable.Client(project='your-project-id', admin=True)
    instance = client.instance('your-instance-id')
    table = instance.table('your-table-name')
    
    # Use RowKeyRegexFilter to filter rows based on regex pattern
    regex_filter = row_filters.RowKeyRegexFilter(b"your-regex-pattern")
    
    # Read rows matching the regex
    rows = table.read_rows(filter=regex_filter)
    
    # Calculate the uncompressed size of these rows.
    # row.cells is a dict: {family: {qualifier: [Cell, ...]}}
    total_uncompressed_size = 0
    for row in rows:
        total_uncompressed_size += len(row.row_key)
        for family, qualifiers in row.cells.items():
            for qualifier, cells in qualifiers.items():
                for cell in cells:
                    total_uncompressed_size += len(qualifier) + len(cell.value)
    
    # Estimate compressed size (assuming a rough 3x ratio; the real ratio varies with your data)
    estimated_compressed_size = total_uncompressed_size / 3
    
    print(f"Estimated uncompressed size: {total_uncompressed_size} bytes")
    print(f"Estimated compressed size: {estimated_compressed_size} bytes")
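Rather than hardcoding the 3x ratio, you can sanity-check it against your own data by compressing a sample of cell values locally. Bigtable's internal compression will behave differently from zlib, so treat this only as a rough proxy; the sample data below is illustrative (in practice, collect values from your read_rows scan):

```python
import zlib

def estimate_compression_ratio(sample_values, level=6):
    """Compress a sample of raw cell values and return the raw/compressed ratio."""
    raw = b"".join(sample_values)
    compressed = zlib.compress(raw, level)
    return len(raw) / len(compressed)

# Hypothetical sample of cell values
sample = [b"user#1234,event=click,ts=1700000000"] * 1000
ratio = estimate_compression_ratio(sample)
print(f"Sampled compression ratio: {ratio:.1f}x")
```

Feeding this measured ratio into the estimate above should get you closer than a blanket assumption, though only the Bigtable console's table-level storage metrics reflect the true on-disk size.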