hbase

HBase - Best approach to change rowkey design on live table


What is the best approach (industry standard) for changing the rowkey design of a table that is live in production and already contains approx. 1.5 million rows, each with a single cell holding JSON? Multiple systems access the table, so we would prefer to end up with the same table name at the end of the process. We have about 40 capacity nodes, and cells are usually under 1 MB. The estimated table size is under 10 GB (and probably even under 2 GB).

At this moment we are thinking of

The cons of this approach are that the conversion will take ages, every read pays the cost of first trying a new key that may not exist yet, and rollback is complicated.

The con is that the only way to rename a table is to clone a snapshot, with an unknown impact on compaction and therefore on the performance of the table.

The cons are downtime and the risk of triggering a compaction that would hinder performance for a long time.

Can you think of a better approach? Or which one would you suggest?


Solution

  • A lot depends on the size of your cluster and how heavily it is being accessed in real-time. But here are some things to keep in mind:

    1. You said you have 1.5 million rows. That is VERY small by HBase standards, so you may be overestimating both the impact on your cluster and how long the various operations will take (again, this depends on the size of your cluster and on how much data is actually stored in those 1.5 million rows).
    2. As far as I know, there is no such thing as 'renaming' a table in HBase. In other words, don't assume that you can create a new table in parallel and then simply rename NewTableName->OldTableName. You might be able to fake it through lots of trickery, but I'd be wary of doing that on a production table.
    3. If your client applications can be changed to point to a new table, things will be a lot easier.
    4. Snapshots and compactions are 'online' operations. Even a major compaction of the live table won't actually interrupt your operations and can finish quite quickly, let alone a compaction of a table that is not live. With snapshots, though, I'm not sure how you'd handle updates that happen after the snapshot is taken.
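
To make the "copy into a new table" idea concrete, here is a minimal sketch of the key-conversion logic only. It is not HBase client code: plain Python dicts stand in for the old and new tables. The new key layout (a salt-bucket prefix) is purely hypothetical, since the question does not say what the new design is, and the names `NUM_BUCKETS`, `new_rowkey`, `migrate` and `read` are made up for illustration.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical salt bucket count, not taken from the question


def new_rowkey(old_key: bytes) -> bytes:
    """Hypothetical redesign: prefix the old key with a two-digit salt bucket."""
    digest = hashlib.md5(old_key).digest()
    bucket = int.from_bytes(digest[:4], "big") % NUM_BUCKETS
    return b"%02d|" % bucket + old_key


def migrate(old_table: dict, new_table: dict) -> int:
    """Copy every row into new_table under its new key; return rows copied."""
    copied = 0
    for old_key, cell in old_table.items():
        new_table[new_rowkey(old_key)] = cell
        copied += 1
    return copied


def read(key: bytes, new_table: dict, old_table: dict):
    """Try the new key first, then fall back to the old key during migration."""
    value = new_table.get(new_rowkey(key))
    if value is None:
        value = old_table.get(key)
    return value
```

Because `new_rowkey` is deterministic, running `migrate` a second time simply overwrites the same new keys, so in this sketch a re-run is one way to pick up rows that changed after the first pass. With about 1.5 million rows, a full pass of this kind should be cheap, but whether that holds on a real cluster is something to measure rather than assume.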