google-cloud-platform, google-cloud-bigtable

How to calculate the compressed storage size of rows matching a key regex in Google Cloud Bigtable?


I have a very large Google Cloud Bigtable table, currently between 52 and 57 TB, and I need to find out how much storage a specific subset of rows is consuming. The rows I'm interested in can be identified with a regular expression on their row key.

I've attempted to use the cbt command-line tool, but it seems impractical for this task. Even a simple operation to read the first 100 rows matching my regex takes a very long time or times out, as the documentation warns.

More importantly, even if I could efficiently retrieve the data for the matching keys, I'm not sure if the size I calculate locally would be accurate. I suspect cbt provides the uncompressed data, but what I need is the actual, post-compression storage size that these rows occupy on Google's servers. I need this number to plan future storage needs.

Does any tool, API, or other mechanism exist that can provide this information?


Solution

  • Google Cloud Bigtable does not provide a native tool, API, or built-in feature that directly reports the post-compression storage size of a specific set of rows matching a row-key regex, so you will have to estimate it.

    One option is a Dataflow job that reads from Bigtable, filters rows by your regex, and writes the results to GCS in a compressed format. You can then use GCS tools (for example, gsutil du) to check the resulting size.
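That pipeline boils down to: filter rows by key, serialize them, write them compressed, and measure the output. A minimal local stand-in for those steps, using gzip from the standard library in place of GCS (the row data and regex here are purely illustrative, not a real Bigtable read):

```python
import gzip
import io
import re

pattern = re.compile(rb"^user#")  # hypothetical row-key regex

# Stand-in for rows read from Bigtable: (row_key, value_bytes) pairs
fake_rows = [
    (b"user#001", b"some cell value " * 50),
    (b"order#002", b"this row does not match"),
    (b"user#003", b"another cell value " * 50),
]

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    for key, value in fake_rows:
        if pattern.match(key):
            gz.write(key + b"\t" + value + b"\n")

compressed_size = buf.getbuffer().nbytes
print(f"Compressed size of matching rows: {compressed_size} bytes")
```

In the real pipeline, the gzip sink would be replaced by a compressed write to GCS, and the size check by gsutil du on the output prefix.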

    Alternatively, use the Bigtable client libraries to scan rows matching your regex, sum their uncompressed size, and estimate the compressed size by applying an assumed compression ratio (Bigtable compresses data internally; the actual ratio depends heavily on your data).

    You can pattern your code on something like this:

    from google.cloud import bigtable
    from google.cloud.bigtable import row_filters
    
    # Initialize Bigtable client
    client = bigtable.Client(project='your-project-id', admin=True)
    instance = client.instance('your-instance-id')
    table = instance.table('your-table-name')
    
    # Use RowKeyRegexFilter to filter rows based on regex pattern
    regex_filter = row_filters.RowKeyRegexFilter(b"your-regex-pattern")
    
    # Read rows matching the regex
    rows = table.read_rows(filter=regex_filter)
    
    # Calculate the uncompressed size of these rows.
    # row.cells is a dict: {family: {qualifier: [Cell, ...]}}
    total_uncompressed_size = 0
    for row in rows:
        total_uncompressed_size += len(row.row_key)
        for family, qualifiers in row.cells.items():
            for qualifier, cells in qualifiers.items():
                for cell in cells:
                    total_uncompressed_size += len(qualifier) + len(cell.value)
    
    # Estimate compressed size (assuming a rough 3x ratio; the real ratio varies with your data)
    estimated_compressed_size = total_uncompressed_size / 3
    
    print(f"Estimated uncompressed size: {total_uncompressed_size} bytes")
    print(f"Estimated compressed size: {estimated_compressed_size} bytes")
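Rather than hardcoding the 3x ratio, you can sanity-check it against your own data by compressing a sample of cell values locally. Bigtable's internal compression will behave differently from zlib, so treat this only as a rough proxy; the sample data below is illustrative (in practice, collect values from your read_rows scan):

```python
import zlib

def estimate_compression_ratio(sample_values, level=6):
    """Compress a sample of raw cell values and return the raw/compressed ratio."""
    raw = b"".join(sample_values)
    compressed = zlib.compress(raw, level)
    return len(raw) / len(compressed)

# Hypothetical sample of cell values
sample = [b"user#1234,event=click,ts=1700000000"] * 1000
ratio = estimate_compression_ratio(sample)
print(f"Sampled compression ratio: {ratio:.1f}x")
```

Feeding this measured ratio into the estimate above should get you closer than a blanket assumption, though only the Bigtable console's table-level storage metrics reflect the true on-disk size.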