hbase schema-design

HBase schema design correct?


I would like to ask whether the current schema design of an HBase table is correct for the following scenario: I receive 10 million events per day, each with a unix epoch timestamp and an id. I need to group them by day, so that I can easily scan for the events that happened on a specific day.

Current design: Each event's timestamp is converted to a string in the format "MM-YYYY_DD", which is used as the row key, and the id of every event that occurred on that day is stored as a column in that row. This results in up to 10 million columns in one row. As far as I understand HBase, a write to a row takes a lock on that row, so importing a single day would cause heavy lock contention and hurt performance.

Maybe this would be a better design? Use the unix epoch timestamp as the row key, resulting in many rows with up to several thousand columns each (several events may occur in the same second, because my timestamp has a maximum resolution of one second). To scan a day, one can calculate its start and end time in unix epoch and scan that key range.
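To make the scan idea concrete, here is a minimal sketch of how the start and stop row keys for one day could be computed. It assumes days are cut at UTC midnight and that row keys are the epoch second written as a zero-padded 10-digit string, so that HBase's byte-wise key ordering matches numeric ordering. The function name `day_scan_range` and the padding width are illustrative choices, not something fixed by the design above.

```python
from datetime import datetime, timedelta, timezone

def day_scan_range(year, month, day):
    """Return (start_row, stop_row) for one UTC day; stop_row is exclusive."""
    start = datetime(year, month, day, tzinfo=timezone.utc)
    stop = start + timedelta(days=1)
    # Assumption: keys are 10-digit zero-padded epoch seconds, so that
    # lexicographic byte order equals numeric order.
    start_row = f"{int(start.timestamp()):010d}".encode("ascii")
    stop_row = f"{int(stop.timestamp()):010d}".encode("ascii")
    return start_row, stop_row

start_row, stop_row = day_scan_range(2013, 5, 17)
print(start_row, stop_row)
```

The two values would then be passed as the start row and the (exclusive) stop row of the scan, whichever client is used.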


Solution

  • HBase is best suited to fast random reads and writes; anything else calls for extra care. In your case, using the day as the row key is a bad idea because, as you said, it results in millions of columns in a single row. That is not good practice, and you will most likely run into memory issues when handling rows that large.

    If you want grouping or partitioning by day, a scan with a filter is not a bad approach. You can query on a column value with "SingleColumnValueFilter". Performance will not be as good as with a row-key scan, though. Again, I am not sure what response time you are expecting.
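The difference between the two access paths can be illustrated with a toy in-memory model. This is not HBase client code: the table is a plain dict, the "day" column is hypothetical, and the row keys assume zero-padded epoch seconds. It only shows why a key-range scan reads fewer rows than a column-value filter applied without a key range.

```python
from bisect import bisect_left

SECONDS_PER_DAY = 86400

# Toy table: one row per epoch second, holding the ids seen in that second
# plus a hypothetical "day" column that a value filter could test.
events = [(86399, "a"), (86400, "b"), (86400, "c"), (100000, "d"), (172800, "e")]
table = {}
for ts, event_id in events:
    row = table.setdefault(f"{ts:010d}", {"day": ts // SECONDS_PER_DAY, "ids": []})
    row["ids"].append(event_id)
keys = sorted(table)  # HBase keeps rows sorted by row key

def range_scan(start_row, stop_row):
    """Seek to start_row and read up to stop_row (exclusive)."""
    lo, hi = bisect_left(keys, start_row), bisect_left(keys, stop_row)
    ids = [i for k in keys[lo:hi] for i in table[k]["ids"]]
    return ids, hi - lo  # (matching ids, rows examined)

def filter_scan(day):
    """Test the 'day' column on every row, as a filter without a key range does."""
    ids = [i for k in keys if table[k]["day"] == day for i in table[k]["ids"]]
    return ids, len(keys)  # (matching ids, rows examined)

print(range_scan("0000086400", "0000172800"))
print(filter_scan(1))
```

Both calls return the same ids, but the range scan examines only the rows inside the key range, while the filter examines every row in the table.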