hbase, apache-phoenix, google-cloud-bigtable

Google Bigtable secondary indices


In reviewing Google Bigtable I found that it does not offer the ability to define secondary indices.

So if you have a billion transactions, for 10 million customers, it would seem you need a full table scan to pull out all transactions for one customer.

As Google Bigtable seems to be using Apache HBase under the hood, my first thought was: Presumably one can put Apache Phoenix on top.

However, I found surprisingly little in this direction; the most relevant resource seems to be a mailing-list post from 2018 mentioning that "it would be hard because co-processors are not supported".

Well, we are now quite a few years on, and though I confirmed that co-processors still do not appear to be supported, I wondered whether any pattern has emerged to enable secondary indices?


Solution

  • So if you have a billion transactions, for 10 million customers, it would seem you need a full table scan to pull out all transactions for one customer.

    This is a read access pattern. If a per-customer query is a frequent access pattern, then an effective schema design would have row-keys prefixed by customer ID. That way, a per-customer read query will translate to an extremely fast lookup of that customer's data.

    If you can identify additional read and write access patterns, the frequency that each takes place, the requirements relating to latency and throughput for each access pattern, then additional refinements of schema design can be devised. I will be happy to help you think through it, if you provide those details.
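The row-key design described above can be sketched in a few lines. This is a minimal illustration of the idea rather than Bigtable client code: the `make_row_key` helper and the `#` delimiter are my assumptions, and the reversed timestamp is a common trick to make a customer's newest transactions sort first in a prefix scan.

```python
MAX_TS = 10**13  # assumed upper bound on epoch-millisecond timestamps

def make_row_key(customer_id: str, ts_millis: int, txn_id: str) -> str:
    """Row key prefixed by customer ID; reversed timestamp sorts newest first."""
    reversed_ts = MAX_TS - ts_millis
    return f"{customer_id}#{reversed_ts:013d}#{txn_id}"

def prefix_range(customer_id: str) -> tuple[str, str]:
    """Start/end keys bounding a prefix scan over one customer's transactions."""
    prefix = f"{customer_id}#"
    # End key: same prefix with the final delimiter byte incremented,
    # so every key starting with "customer#" falls strictly below it.
    return prefix, prefix[:-1] + chr(ord("#") + 1)
```

With keys shaped like this, "all transactions for one customer" becomes a contiguous range read bounded by `prefix_range(customer_id)` instead of a full table scan.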

    In reviewing Google Bigtable I found that it does not offer the ability to define secondary indices.

    That is correct.

    As Google Bigtable seems to be using Apache HBase under the hood ...

    This is not quite right, but close. HBase is modeled after Bigtable; neither project relies on the other's implementation. However, one way to communicate with Cloud Bigtable is through the Cloud Bigtable HBase client for Java, which is a customized version of the Apache HBase client.

    As Google Bigtable seems to be using Apache HBase under the hood, my first thought was: Presumably one can put Apache Phoenix on top.

    You could refactor Phoenix to be compatible with the Cloud Bigtable HBase client for Java. It would be a year-long project, at least, if you worked on it individually. As for missing features like co-processors, you could likely find another way to replicate the kind of server-side parallelism that HBase co-processors provide.

    Well, now we are quite a few years further and though I confirmed co-processors still do not appear to be supported, I wondered if any pattern had emerged to enable secondary indices?

    While there is nothing worse than a Stack Overflow answer that questions all the underlying technical choices the OP has already made, based on very little information about the OP's use case, I am nonetheless going to do something like that... May I ask why you are planning to use Bigtable? A different storage system might be a better fit. One useful way to break down the data profile is by Volume, Velocity, Variation, Access, and Security. I will be happy to help you think through it if you can provide those details.

    Volume: I need more info, but if the data is indeed a billion transactions, and assuming each transaction is of a reasonable row size (this is an important assumption; please tell me more about the size of each transaction), then your data would easily fit into any of the solutions mentioned here, including Cloud SQL, which has a maximum instance size of 30 to 64 TB. 64 TB divided by one billion rows is 64 KB per row. For comparison, SQL Server has a maximum row length of 8 KB. For an example outside of GCP, SQL Server allows a database size of over 500,000 terabytes.
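The back-of-the-envelope volume arithmetic above can be checked directly. The 64 TB figure is the Cloud SQL instance ceiling mentioned above; the 1 KB average transaction size is an assumption purely for illustration.

```python
TB = 10**12  # decimal terabytes, good enough for a rough estimate

transactions = 1_000_000_000
cloud_sql_max_bytes = 64 * TB

# Largest average row size that still fits a billion rows in one 64 TB instance
max_row_bytes = cloud_sql_max_bytes // transactions  # 64 KB per row

# With an assumed 1 KB average transaction, total volume is modest
assumed_row_bytes = 1_000
total_bytes = transactions * assumed_row_bytes  # 1 TB
```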

    Velocity: I suspect this data is low velocity, for two reasons. One, the nature of the data: web applications and mobile apps that collect and store human-entered data are typically low velocity, at least when measured per individual user. Two, the details you provided about the data: if 10 million customers have a billion transactions, that is 100 transactions per customer. Again, this sounds like SQL Server or MongoDB/Firebase would be appropriate.
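The per-customer figure above is a single division, spelled out here only to make the estimate explicit:

```python
transactions = 1_000_000_000
customers = 10_000_000

# Average transactions per customer over the dataset's lifetime
per_customer = transactions // customers  # 100
```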

    Variation in data: Given the nature of customer transactions, and given that you are trying to wrap your Bigtable in an OLTP DB facade such as Phoenix, this sounds like a typical OLTP relational DB use case, which is to say highly structured and low variation. If the variation is higher, though, you can use Firebase or MongoDB.

    Access: You told me about one read-access pattern, though there could be many others, which I have no way to divine. Questions you may want to answer, for each access pattern:

    1. How much data is retrieved in a read operation? (you already answered this for the access pattern you described in your question)
    2. How much data is written in an insert operation?
    3. How often is data written?
    4. How often is data read?

    As for write access, I would suspect there is a single write-access pattern, characterized by a small amount of data being written at a time, at a low rate.