
Is it better to omit prefixes from Bigtable row keys when doing single-value lookups?


Bigtable recommends designing row keys so that a common value acts as a prefix and the most granular value is placed at the end. That way, one can efficiently retrieve a range of related rows. Their example mentions

phone#india#pkc1124#1941

I have a pipeline that receives data from a relatively small number of accounts (about 20). I use Bigtable to store and retrieve a single identifier from these events (at most one row per retrieval).

Let's say one of those accounts is called "FOO", and the event in question has a unique identifier called "SomeUniqueIdentifier" with value "BAR". My row key would then be FOO#SomeUniqueIdentifier#BAR.

However, this approach seems problematic to me, because the input data is not evenly distributed across accounts. One account in particular is responsible for around 90% of the pipeline's input, so its partition would be huge.

Removing the account ID prefix would lead to an even bigger issue, as the partition SomeUniqueIdentifier would be a lot bigger than FOO#SomeUniqueIdentifier.

So in this case, wouldn't it be more efficient to just have the row key be BAR? As I said, I'm only interested in retrieving ONE value. My table design is such that I just need to store one row with one cell with one value per identifier, so I'm not interested in retrieving a range of rows. Garbage collection keeps only the latest value, and every value is flagged for GC once it is a few days old.

Would looking for a specific granular identifier still trigger a full table scan and be less efficient than the prefixes approach?


Solution

  • If you're only interested in looking up specific keys in Bigtable, then you won't run into any issues with full table scans, as Bigtable is designed for efficient lookups by row key.

    To your point about FOO having more values than the other accounts: Bigtable can split key ranges deeper than the first prefix, so you could end up with one partition that covers FOO#123 to FOO#456 and another that covers FOO#456 to FOO#789. For your smaller accounts, you might just have a single partition that covers BAR# to DAR# if those aren't getting as much traffic.

    The Bigtable documentation has information on schema design best practices. You can also run some tests and then use Key Visualizer to check whether your schema distributes throughput evenly.
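
To make the two points above concrete, here is a rough toy model in plain Python. It is not how Bigtable is implemented: a sorted in-memory list stands in for Bigtable's sorted row storage, and the account names, row counts, and `MAX_ROWS_PER_RANGE` are made-up values for illustration. It only shows why an exact-key lookup in a sorted store needs a handful of probes rather than a scan, and how a hot prefix can end up split across several key ranges.

```python
# Toy model only: a sorted list stands in for a store that keeps rows
# ordered lexicographically by row key. All names and sizes are hypothetical.

MAX_ROWS_PER_RANGE = 250  # made-up split threshold, not a Bigtable setting


def make_rows():
    """Build skewed sample data in which FOO owns 90% of the rows."""
    rows = {}
    for account, count in (("FOO", 900), ("BAZ", 50), ("QUX", 50)):
        for i in range(count):
            rows[f"{account}#SomeUniqueIdentifier#{i:04d}"] = f"value-{account}-{i}"
    return rows


def build_table(rows):
    """Return parallel lists of keys and values, sorted by row key."""
    keys = sorted(rows)
    return keys, [rows[k] for k in keys]


def point_lookup(keys, values, row_key):
    """Binary-search for an exact key; return (value or None, probes used)."""
    lo, hi, probes = 0, len(keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if keys[mid] < row_key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(keys) and keys[lo] == row_key:
        return values[lo], probes
    return None, probes


def split_into_ranges(keys, max_rows):
    """Cut the sorted key space into contiguous (first_key, last_key) ranges."""
    return [
        (keys[i], keys[min(i + max_rows, len(keys)) - 1])
        for i in range(0, len(keys), max_rows)
    ]


rows = make_rows()
keys, values = build_table(rows)

# Exact-key lookup: the number of probes grows with log(n), not n.
value, probes = point_lookup(keys, values, "FOO#SomeUniqueIdentifier#0421")
print(value, "found after", probes, "probes out of", len(keys), "rows")

# The hot FOO prefix is spread over several ranges instead of one huge one.
ranges = split_into_ranges(keys, MAX_ROWS_PER_RANGE)
for start, end in ranges:
    print(start, "->", end)
```

With the real Python client, the equivalent of `point_lookup` would be a single-row read such as `table.read_row(row_key)`. Whether the prefixed key or the bare identifier works better for your traffic is something to confirm with your own tests and Key Visualizer.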