optimizationjoinhivebuckets

Hive - Efficient join of two tables


I am joining two large tables in Hive (one is over 1 billion rows, one is about 100 million rows) like so:

create table joinedTable as select t1.id, ... from t1 join t2 ON (t1.id = t2.id);

I have bucketed the two tables in the same way, clustering by id into 100 buckets for each, but the query is still taking a long time.

Any suggestions on how to speed this up?


Solution

  • As you bucketed the data by the join keys, you could use the Bucket Map Join. For that the amount of buckets in one table must be a multiple of the amount of buckets in the other table. It can be activated by executing set hive.optimize.bucketmapjoin=true; before the query. If the tables don't meet the conditions, Hive will simply perform the normal Inner Join.

    If both tables have the same amount of buckets and the data is sorted by the bucket keys, Hive can perform the faster Sort-Merge Join. To activate it, you have to execute the following commands:

    set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    set hive.optimize.bucketmapjoin=true;
    set hive.optimize.bucketmapjoin.sortedmerge=true;
    

    You can find some visualizations of the different join techniques under https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf.