hbase hypertable

Estimate row size HBase/HyperTable


Is there a way to estimate row size if I know what kind of data I'll be storing (with compression in mind)?

I'm looking at something like

bson_id | string (max 200 chars) | int32 | int32 | int32 | bool | bool | DateTime | DateTime | DateTime | int32

I am trying to find the best DB solution for about 2 trillion records like the one above, combined with about 20 times as many like

bson_id | bson_id

Any other recommendations are welcome


Solution

  • Sorry for the very generic answer.

    As far as I know, testing with dummy data is the only reliable way to measure this. “Dummy” here means fake but not repeated, because heavy repetition makes the data compress unrealistically well and spoils the estimate.

    For example, you could load 1m, 2m, 4m, 8m, 32m, 128m and so on… records and check whether the size on disk grows linearly with the record count. If it does, you can extrapolate, with some contingency, to billions and trillions of records.

    In such tests you are also able to check performance against your needs. For example, you can increase the HDFS replication factor to improve read performance.

    And finally, you can check this from a compression viewpoint.

    Good luck with BigData!
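
The measure-and-extrapolate idea above can be tried offline before loading anything into a cluster. What follows is only a rough sketch under several assumptions: the helper names (make_row, compressed_size, fit_line, estimate_bytes) are made up for illustration, zlib stands in for whatever codec the table would really use, DateTime is assumed to be an 8-byte timestamp, and the row layout is a plain concatenation of the fields from the question. It does not account for per-cell key overhead (row key, column family, qualifier, timestamp) that HBase stores with every value, so a real load test is still needed for a trustworthy number.

```python
import random
import struct
import zlib

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def make_row(rng):
    """Build one fake, non-repeating row shaped like the record in the question."""
    bson_id = bytes(rng.getrandbits(8) for _ in range(12))        # 12-byte id
    text = "".join(rng.choices(ALPHABET, k=rng.randint(20, 200))).encode()
    ints = struct.pack("<4i", *(rng.randint(0, 2**31 - 1) for _ in range(4)))
    bools = struct.pack("<2?", rng.random() < 0.5, rng.random() < 0.5)
    # Assumption: each DateTime is stored as an 8-byte millisecond timestamp.
    dates = struct.pack("<3q", *(rng.randint(1_300_000_000_000,
                                             1_400_000_000_000) for _ in range(3)))
    return bson_id + text + ints + bools + dates

def compressed_size(n_rows, seed=0):
    """Compressed size in bytes of n_rows dummy rows (zlib as a stand-in codec)."""
    rng = random.Random(seed)
    comp = zlib.compressobj(6)
    total = 0
    for _ in range(n_rows):
        total += len(comp.compress(make_row(rng)))
    return total + len(comp.flush())

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def estimate_bytes(target_rows, sample_sizes=(1000, 2000, 4000, 8000)):
    """Measure a few doubling sample sizes, fit a line, extrapolate to target_rows."""
    sizes = [compressed_size(n) for n in sample_sizes]
    slope, intercept = fit_line(sample_sizes, sizes)
    return slope * target_rows + intercept

if __name__ == "__main__":
    slope, _ = fit_line((1000, 2000, 4000, 8000),
                        [compressed_size(n) for n in (1000, 2000, 4000, 8000)])
    print(f"about {slope:.0f} compressed bytes per row in this toy model")
    print(f"extrapolated size for 2e12 rows: {estimate_bytes(2e12) / 1e12:.1f} TB")
```

The slope is the interesting number: it is the marginal compressed bytes per row for this toy model. If the measured points stop falling on a line as the samples grow, the extrapolation should not be trusted.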