I have a very large Google Cloud Bigtable table, currently between 52 and 57 TB, and I need to find out how much storage a specific subset of rows is consuming. The rows I'm interested in can be identified by a regular expression on their row key.
I've attempted to use the cbt command-line tool, but it seems impractical for this task. Even a simple operation such as counting the first 100 rows matching my regex takes a very long time or times out, as the documentation warns.
More importantly, even if I could efficiently retrieve the data for the matching keys, I'm not sure the size I calculate locally would be accurate. I suspect cbt returns the uncompressed data, but what I need is the actual, post-compression storage size these rows occupy on Google's servers. I need this figure to plan future storage capacity.
Does any tool or API exist that can provide me with this information?
Google Cloud Bigtable does not provide a native tool, API, or built-in feature that reports the post-compression storage size of a specific set of rows, such as those matching a regex on the row key, so you will have to approximate it.
One option is to set up a Dataflow job that reads from Bigtable, filters rows by your regex, and writes the results to GCS in a compressed format; you can then use GCS tools (e.g. gsutil du) to check the size. Note that this measures your chosen output compression, not Bigtable's internal compression, but it gives a reasonable approximation.
Alternatively, use the Bigtable client libraries to scan the rows matching your regex, total their uncompressed size, and estimate the on-disk size by applying an assumed compression ratio (Bigtable compresses data internally; the exact ratio depends on how compressible your data is).
You can pattern your code on the following:
```python
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

# Initialize the Bigtable client
client = bigtable.Client(project='your-project-id', admin=True)
instance = client.instance('your-instance-id')
table = instance.table('your-table-name')

# Filter rows whose key matches the regex pattern
regex_filter = row_filters.RowKeyRegexFilter(b"your-regex-pattern")

# Stream rows matching the regex. Note: this is a full table scan on a
# 50+ TB table; consider passing limit= or a key range while testing.
rows = table.read_rows(filter_=regex_filter)

# Sum the uncompressed size of row keys, qualifiers, and cell values
total_uncompressed_size = 0
for row in rows:
    total_uncompressed_size += len(row.row_key)
    for family, qualifiers in row.cells.items():
        for qualifier, cells in qualifiers.items():
            for cell in cells:
                total_uncompressed_size += len(qualifier) + len(cell.value)

# Estimate the compressed size (assuming a 3:1 compression ratio)
estimated_compressed_size = total_uncompressed_size / 3

print(f"Uncompressed size: {total_uncompressed_size} bytes")
print(f"Estimated compressed size: {estimated_compressed_size:.0f} bytes")
```
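Rather than assuming a fixed 3:1 ratio, you can ground the estimate by compressing a sample of your own cell values locally. This is only a sketch: zlib approximates, but does not match, Bigtable's internal block compression, and the `estimate_compression_ratio` helper and the sample data below are hypothetical stand-ins for values you would collect from a limited `read_rows` scan.

```python
import zlib

def estimate_compression_ratio(samples):
    """Estimate a compression ratio by compressing sample values locally.

    zlib only approximates Bigtable's internal compression, but a ratio
    measured on your real data beats a guessed constant.
    """
    raw = b"".join(samples)
    if not raw:
        return 1.0
    compressed = zlib.compress(raw, 6)
    return len(raw) / max(len(compressed), 1)

# Hypothetical sample of cell values; in practice, collect these from a
# limited scan (e.g. the first few thousand rows matching your regex).
samples = [b"user:1234|event:click|ts:1700000000" * 10 for _ in range(100)]

ratio = estimate_compression_ratio(samples)
sampled_raw_size = sum(len(s) for s in samples)
estimated_size = sampled_raw_size / ratio
```

Once you have the ratio, divide the total uncompressed size from the full scan by it instead of by 3.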