java hadoop hbase distributed

Distributed analysis of HBase data


I'm a bit new to HBase. I've been able to set up HBase and query the data that's stored across multiple Hadoop machines, but I'm wondering if it's possible to distribute the analysis of the data in HBase as well.

Here's my situation: I have a few billion records that I need to analyse quickly. I would like X servers to query the database and each get a distinct part of the result set to work on, instead of having a single server go through the entire dataset. Is this possible, and how can I do it?

I'm very unsure how to approach this because I realize all the queries will need to be coordinated (each server cannot query HBase individually, otherwise HBase will not know how to split the request among the servers). I'm confused, but I thought there might be a native way to do this in Hadoop?

My application is written in Java and I'm running the cluster on EC2 using the Cloudera distribution.


Solution

  • HBase builds on Hadoop for a reason :) You can use Hadoop's MapReduce framework to distribute the analysis and let Hadoop/HBase take care of spreading the load across the cluster. You can start with the docs to see what can be done.

    Another option is to write coprocessors. Coprocessors run on the region servers, so they work close to the data. You can find a nice intro here
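
To make the MapReduce suggestion above more concrete: with HBase's MapReduce integration (the `TableMapReduceUtil` and `TableMapper` classes), a table scan is typically divided into one input split per region, so each map task reads a disjoint row-key range and nobody has to coordinate the queries by hand. The sketch below does not call HBase at all. It is a plain-Java simulation of that idea, with a `TreeMap` standing in for a sorted table and a thread pool standing in for the servers. The class name, method names, split keys, and sample rows are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical simulation: NOT the HBase API, only the partitioning idea.
class RegionSplitSketch {

    // "Map" step: count values for rows in [startKey, endKey).
    // An empty endKey means "scan to the end of the table".
    static Map<String, Long> scanRange(NavigableMap<String, String> table,
                                       String startKey, String endKey) {
        NavigableMap<String, String> slice = endKey.isEmpty()
                ? table.tailMap(startKey, true)
                : table.subMap(startKey, true, endKey, false);
        Map<String, Long> counts = new TreeMap<>();
        for (String value : slice.values()) {
            counts.merge(value, 1L, Long::sum);
        }
        return counts;
    }

    // Splits the key space at the given (sorted) split keys, scans each
    // range on its own worker, then merges the partial counts ("reduce").
    static Map<String, Long> distributedCount(NavigableMap<String, String> table,
                                              List<String> splitKeys,
                                              int workers) throws Exception {
        List<String> bounds = new ArrayList<>();
        bounds.add("");            // start of the first range
        bounds.addAll(splitKeys);  // assumed sorted, like region boundaries
        bounds.add("");            // open end of the last range

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Map<String, Long>>> partials = new ArrayList<>();
            for (int i = 0; i + 1 < bounds.size(); i++) {
                final String start = bounds.get(i);
                final String end = bounds.get(i + 1);
                Callable<Map<String, Long>> task = () -> scanRange(table, start, end);
                partials.add(pool.submit(task));
            }
            Map<String, Long> total = new TreeMap<>();
            for (Future<Map<String, Long>> partial : partials) {
                partial.get().forEach((k, v) -> total.merge(k, v, Long::sum));
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample rows: row key -> event type.
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("a-001", "click");
        table.put("b-001", "view");
        table.put("b-002", "click");
        table.put("c-001", "click");

        // Three disjoint ranges: [start, "b"), ["b", "c"), ["c", end).
        System.out.println(distributedCount(table, Arrays.asList("b", "c"), 3));
    }
}
```

In a real job the framework plays the role of `distributedCount`: it creates the splits from the table's regions, schedules the map tasks near the data, and feeds the mapper output to the reducers. The coprocessor option follows the same pattern, except that the per-range work runs inside the region servers themselves.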