My key has three components: num, type, and name.
The type takes only two values, A and B, while num can take more values, e.g. 0, 1, 2, ..., 30.
I have to fetch data by num and type, i.e. fetch all rows whose keys have the specified num and type.
I can either store data in the form:
1. num|type|name
or
2. type|num|name
Considering how HBase scans through data if I use partial key scanning, which is the best strategy to store data?
This is how I will set my partial key scanning: For 1.
scan.setStartRow(Bytes.toBytes(num));
scan.setStopRow(Bytes.toBytes(num + 1));
For 2.
scan.setStartRow(Bytes.toBytes(type + "|" + num));
scan.setStopRow(Bytes.toBytes(type + "|" + (num + 1)));
First, I would recommend against using pipe as a delimiter. It is ASCII 124, which falls after all letters and digits, so sorting will not be what you expect (unless you left-pad everything, but that makes for overly large keys). For HBase row key delimiters you want something that sorts lexicographically before all of your valid key characters, to preserve correct ordering. Tab, at ASCII 9, works well.
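To make the delimiter point concrete, here is a minimal sketch in plain Java (no HBase dependency) that sorts a few sample keys the way HBase orders row keys, byte by byte. The class name, the sample values, and the use of `Arrays.compareUnsigned` (Java 9+) as a stand-in for HBase's comparator are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of HBase.
final class DelimiterOrderDemo {

    /** Builds num<delim>type<delim>name keys and returns them in unsigned byte order. */
    static List<String> sortedKeys(char delim) {
        String d = String.valueOf(delim);
        int[] nums = {1, 2, 10};                  // sample values, unpadded
        List<byte[]> keys = new ArrayList<>();
        for (int num : nums) {
            String key = num + d + "A" + d + "row";
            keys.add(key.getBytes(StandardCharsets.US_ASCII));
        }
        // Lexicographic comparison on unsigned bytes, as for HBase row keys.
        keys.sort((a, b) -> Arrays.compareUnsigned(a, b));

        List<String> result = new ArrayList<>();
        for (byte[] k : keys) {
            result.add(new String(k, StandardCharsets.US_ASCII));
        }
        return result;
    }

    public static void main(String[] args) {
        // With '|' (124) the keys for num 10 sort before the keys for num 1,
        // because '0' (48) is smaller than '|'.
        System.out.println(sortedKeys('|'));
        // With tab (9) the keys for num 1 sort before the keys for num 10.
        System.out.println(sortedKeys('\t'));
    }
}
```

Note that even with tab, unpadded numbers still sort as text, so the keys for 10 come before the keys for 2. A low delimiter only guarantees that a shorter component sorts before a longer one that starts with it.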
Considering that type has only two valid values, and assuming a random distribution, I would go with num type. This allows you to select on num alone if you need to in the future. Selecting on num alone with the reverse order, type num, takes two fetches, one for type 'A' and another for type 'B'. Not the most efficient.
If you will rarely select on num alone, then it does make sense to go with type num, as that is the most selective at the row level, if inflexible.
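The trade-off between the two layouts can be sketched with plain Java that computes the start and stop rows for a partial-key scan. This is only an illustration under stated assumptions: the helper names are hypothetical, the delimiter is assumed to be tab, and the range check mimics a scan whose start row is inclusive and whose stop row is exclusive.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of the HBase API.
final class PrefixScanBounds {
    static final char DELIM = '\t';   // assumption: tab-delimited key components

    /** Start row: the leading components joined by DELIM, plus a trailing DELIM. */
    static byte[] startRow(String... parts) {
        String prefix = String.join(String.valueOf(DELIM), parts) + DELIM;
        return prefix.getBytes(StandardCharsets.US_ASCII);
    }

    /** Stop row: the start row with its last byte (the delimiter) incremented by one. */
    static byte[] stopRow(byte[] start) {
        byte[] stop = Arrays.copyOf(start, start.length);
        stop[stop.length - 1]++;      // safe here because DELIM is far below 0xFF
        return stop;
    }

    /** True if row falls in [start, stop) under unsigned byte ordering. */
    static boolean inRange(String row, byte[] start, byte[] stop) {
        byte[] r = row.getBytes(StandardCharsets.US_ASCII);
        return Arrays.compareUnsigned(r, start) >= 0
            && Arrays.compareUnsigned(r, stop) < 0;
    }

    public static void main(String[] args) {
        // Layout "num type name": one range covers num = 5 for both types.
        byte[] start = startRow("5");
        byte[] stop = stopRow(start);
        System.out.println(inRange("5\tA\tfoo", start, stop));
        System.out.println(inRange("5\tB\tbar", start, stop));
        System.out.println(inRange("50\tA\tfoo", start, stop));

        // Layout "type num name": selecting on num = 5 alone needs two ranges.
        byte[] startA = startRow("A", "5");
        byte[] startB = startRow("B", "5");
        System.out.println(inRange("A\t5\tfoo", startA, stopRow(startA)));
        System.out.println(inRange("B\t5\tbar", startB, stopRow(startB)));
    }
}
```

Ending the prefix with the delimiter keeps keys for num 50 out of a scan for num 5, which a bare numeric prefix would not do.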
Really you should try them both out and see what works best with your data.