hadoophbasehive

How does Hive compare to HBase?


I'm interested in finding out how the recently-released Hive compares to HBase in terms of performance. The SQL-like interface used by Hive is very much preferable to the HBase API we have implemented.


Solution

  • It's hard to find much about Hive, but I found this snippet on the Hive site that leans heavily in favor of HBase (bold added):

    Hive is based on Hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.

    Since HBase and HyperTable are all about performance (being modeled on Google's BigTable), they sound like they would certainly be much faster than Hive, at the cost of functionality and a higher learning curve (e.g., they don't have joins or the SQL-like syntax).