google-cloud-platform, google-cloud-bigtable, bigtable

How does Cloud Bigtable read rows that are non-contiguous?


Given a large number of known row keys, how does Bigtable read (not scan) those rows? Does it read the rows one after another or all at once? If I have a large number of non-contiguous rows that I want to read, is it better to make separate concurrent or parallel requests to read each one, or to give all of the row keys to Bigtable at once, i.e. a "batch read"?


Solution

  • There are three options for a non-contiguous batch read, and the right choice depends on your latency and CPU requirements. You can do all the reads as get requests in parallel, you can issue a single read-rows request (a scan) whose row set contains one single-row range per row, or you can do a hybrid of the two.

    Reading with multiple parallel get requests

    This option can be great if you have a lot of processing power or don't need to read a huge number of rows. It issues multiple requests to Bigtable, so it will have an impact on your CPU utilization. One Bigtable node supports around 10K reads per second, so if you have 1,000 rows to read individually, that alone can make a real dent in your capacity.

    Also, if you need all of the requests to resolve before you can process the data, you may run into a latency problem: a single slow request holds up the entire result.
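    As a minimal sketch of this approach with the Java client (assuming a BigtableDataClient named dataClient, a tableId, and a rowKeys list of your known keys, all of which are placeholders here), you can issue async point reads and wait on them together:

    import com.google.api.core.ApiFuture;
    import com.google.api.core.ApiFutures;
    import com.google.cloud.bigtable.data.v2.BigtableDataClient;
    import com.google.cloud.bigtable.data.v2.models.Row;
    import java.util.ArrayList;
    import java.util.List;

    // One async point read per key; each is a separate request to Bigtable.
    List<ApiFuture<Row>> futures = new ArrayList<>();
    for (String key : rowKeys) {
      futures.add(dataClient.readRowAsync(tableId, key));
    }
    // Blocking on all of the futures means a single slow read delays the whole result.
    List<Row> rows = ApiFutures.allAsList(futures).get();

    Each key here is its own request, so it counts against the per-node read throughput mentioned above.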

    Scan with multiple rows

    Bigtable supports scans over a set of row ranges based on the row key. You can create a row range that includes exactly one row and do a single scan whose row set contains one such range for each row you need.

    The Bigtable client libraries support queries like this, so you can just pass the row keys and don't need to build all of those single-row ranges yourself. However, it's important to know what is happening under the hood for performance: this one query is processed sequentially on the Bigtable server, so it can take a lot more time than multiple parallel gets.

    In Java, to do this kind of query, you just pass multiple row keys to the Query builder like so:

    // Query and Row are in com.google.cloud.bigtable.data.v2.models; ServerStream is in com.google.api.gax.rpc.
    Query query = Query.create(tableId)
        .rowKey("phone#4c410523#20190501")
        .rowKey("phone#4c410523#20190502");
    ServerStream<Row> rows = dataClient.readRows(query);
    for (Row row : rows) {
      printRow(row);  // your own helper for handling each returned row
    }
    

    Hybrid approach

    Depending on the scale of rows you're working with, it may make sense to take your set of row keys, divide it into chunks, and issue multiple scans in parallel, as in the sketch below. You get the benefit of fewer requests while still potentially getting better latency, since the requests are parallelized.
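    For illustration, here is a rough sketch of the hybrid approach under the same assumptions (dataClient, tableId, and rowKeys are placeholders; the chunk size and thread count are arbitrary values you would tune):

    import com.google.cloud.bigtable.data.v2.models.Query;
    import com.google.cloud.bigtable.data.v2.models.Row;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    int chunkSize = 100;                         // row keys per scan
    ExecutorService pool = Executors.newFixedThreadPool(8);
    List<Future<List<Row>>> pending = new ArrayList<>();
    for (int i = 0; i < rowKeys.size(); i += chunkSize) {
      List<String> chunk = rowKeys.subList(i, Math.min(i + chunkSize, rowKeys.size()));
      pending.add(pool.submit(() -> {
        // Each task scans one chunk as a single multi-row query, so the chunks
        // run in parallel while each chunk stays a single request to Bigtable.
        Query query = Query.create(tableId);
        for (String key : chunk) {
          query.rowKey(key);
        }
        List<Row> rows = new ArrayList<>();
        dataClient.readRows(query).forEach(rows::add);
        return rows;
      }));
    }

    You would then wait on the pending futures (for example with Future.get()) and shut the pool down when you're done.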

    I would recommend experimenting to see which approach works best for your use case, or leave a comment with more details about it and I can see if there is more information I can offer.