
Bigtable: Avoiding hotspotting when using timestamps on row keys


Cloud Bigtable docs on schema design for time series say:

In the vast majority of cases, time-series queries are accessing a given dataset for a given time period. Therefore, make sure that all of the data for a given time period is stored in contiguous rows, unless doing so would cause hotspotting.

Additionally, here's what they recommend to avoid hotspotting:

If you're storing a cell phone's battery status, and your row key consists of the word "BATTERY" plus a timestamp, the row key will always increase in sequence. Because Cloud Bigtable stores adjacent row keys on the same server node, all writes will focus only on one node until that node is full, at which point writes will move to the next node in the cluster.

Field promotion is suggested:

Move fields from the column data into the row key to make writes non-contiguous.

For example:

BATTERY#20150301124501001 --> BATTERY#Corrie#20150301124501001

Questions:

  1. Field promotion may solve hotspotting. Still, wouldn't that make querying by time range a little bit difficult?
  2. On the other side, is hotspotting avoidable if you want to query a range ONLY by TIMESTAMP? Don't think so, right?

Solution

    1. Field promotion may solve hotspotting. Still, wouldn't that make querying by time range a little bit difficult?

    That depends on what your query looks like. For example, if you want to query Corrie's battery status from T1 to T2, you can construct a row range easily: [BATTERY#Corrie#T1, BATTERY#Corrie#T2]. However, if you want to query the battery status of all users, then every row with the prefix BATTERY will be scanned.
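As a rough illustration of that row range, here is a minimal sketch that builds the start and end keys and scans a sorted list standing in for the table. It does not call the Bigtable client; `row_key`, `scan`, and the sample rows are hypothetical, and it assumes fixed-width timestamps so that lexicographic order matches time order.

```python
from bisect import bisect_left, bisect_right


def row_key(metric: str, user: str, ts: str) -> bytes:
    """Build a key of the form METRIC#user#timestamp (hypothetical helper)."""
    return f"{metric}#{user}#{ts}".encode("utf-8")


def scan(sorted_keys, start: bytes, end: bytes):
    """Return every key in the inclusive range [start, end].

    The sorted list stands in for Bigtable's lexicographically ordered rows.
    """
    lo = bisect_left(sorted_keys, start)
    hi = bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]


# Sample rows (made up for illustration), kept in sorted order like a table.
rows = sorted([
    row_key("BATTERY", "Corrie", "20150301124501001"),
    row_key("BATTERY", "Corrie", "20150301124501003"),
    row_key("BATTERY", "Corrie", "20150301124502000"),
    row_key("BATTERY", "Jo", "20150301124501002"),
])

# Corrie's readings between T1 and T2: one contiguous range, no other users touched.
t1, t2 = "20150301124501000", "20150301124501999"
corrie = scan(rows, row_key("BATTERY", "Corrie", t1), row_key("BATTERY", "Corrie", t2))
for k in corrie:
    print(k.decode("utf-8"))
```

With a real table, the same start and end keys would be passed to the client's range read instead of the in-memory `scan` above.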

    So, the most important queries you have should dictate which fields you promote to the row key. Also, fields with high cardinality help more when promoted to the row key, as they distribute load across a larger number of tablets.
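To see why promotion spreads writes, the sketch below records where each new key lands in sorted key space. It is only a toy model: the user names are invented, and a list index is a stand-in for "which part of the key space receives the write", not for real tablet assignment.

```python
from bisect import bisect_left


def insert_positions(keys):
    """For each key, in write order, return the index where it lands among the keys stored so far."""
    stored = []
    positions = []
    for k in keys:
        i = bisect_left(stored, k)
        positions.append(i)
        stored.insert(i, k)
    return positions


timestamps = [
    "20150301124501001",
    "20150301124501002",
    "20150301124501003",
    "20150301124501004",
]
users = ["Corrie", "Jo", "Ahmed", "Zoe"]  # hypothetical users

# Timestamp-only keys: every write sorts after all previous ones.
plain = [f"BATTERY#{ts}" for ts in timestamps]
# Promoted keys: the user field decides where the write lands.
promoted = [f"BATTERY#{u}#{ts}" for u, ts in zip(users, timestamps)]

print(insert_positions(plain))     # [0, 1, 2, 3]: always appended at the tail
print(insert_positions(promoted))  # [0, 1, 0, 3]: writes land in different places
```

The more distinct values the promoted field has, the more places in the key space a write can land, which is the intuition behind preferring high-cardinality fields.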

    2. On the other side, is hotspotting avoidable if you want to query a range ONLY by TIMESTAMP? Don't think so, right?

    I am not entirely sure what you mean by "query a range ONLY by TIMESTAMP". Can you provide an example?

    A lot will depend on what "TIMESTAMP" means. If you always want to query for the last 10 minutes and the row key leads with the timestamp, then all of your queries will go to a single server at any given time and you will experience hotspotting on reads.

    Another thing to keep in mind is that if you don't design the row key properly, writes will also hit a hotspot and you will not get good write throughput. It's recommended to design row keys to avoid hotspotting.
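If the key is BATTERY#user#timestamp and you still need a time-only query, one possible compromise is to fan out one row range per user and merge the results. The sketch below assumes the set of users is known in advance and reasonably small; `key`, `ranges_for_window`, `scan_ranges`, and the sample data are hypothetical, and a sorted list again stands in for the table.

```python
from bisect import bisect_left, bisect_right


def key(metric: str, user: str, ts: str) -> bytes:
    """Build a METRIC#user#timestamp row key (hypothetical helper)."""
    return f"{metric}#{user}#{ts}".encode("utf-8")


def ranges_for_window(metric, users, t1, t2):
    """One inclusive (start, end) key range per user for the window [t1, t2]."""
    return [(key(metric, u, t1), key(metric, u, t2)) for u in users]


def scan_ranges(sorted_keys, ranges):
    """Collect the keys that fall inside any of the given inclusive ranges."""
    hits = []
    for start, end in ranges:
        lo = bisect_left(sorted_keys, start)
        hi = bisect_right(sorted_keys, end)
        hits.extend(sorted_keys[lo:hi])
    return hits


# Made-up rows, sorted the way Bigtable would store them.
table = sorted([
    key("BATTERY", "Corrie", "20150301124501001"),
    key("BATTERY", "Corrie", "20150301130000000"),
    key("BATTERY", "Jo", "20150301115900000"),
    key("BATTERY", "Jo", "20150301124501002"),
])

# All users' readings within one minute, as one small range per user.
window = ranges_for_window("BATTERY", ["Corrie", "Jo"], "20150301124500000", "20150301124559999")
hits = scan_ranges(table, window)
for k in hits:
    print(k.decode("utf-8"))
```

This trades a single contiguous scan for many small ones, so it only pays off while the number of ranges stays manageable; otherwise the full BATTERY prefix scan described above may be simpler.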