hadoophivequery-optimizationhiveqlapache-tez

Hive on TEZ query taking forever at Reducer cross product


I have 2 tables:

db1.main_table (32 GB)
db2.lookup_table (2.5 KB)

lookup table has just one column named id, which is also present and is the primary key of main_ table. The goal is to just look at the values in lookup table and delete from main_table all the rows which has these values present. I am, using this hive query(on TEZ) which has suddenly started creating a cross product at Reduce stage.

insert overwrite table 
db1.main_table 
select * from db1.main_table where nvl(id,'NV') not in (select nvl(id,'RV') from db2.lookup_table);

I am using nvl due to the presence of null values in my main table for id column which I don't want to loose.

My query is getting hung up at Reducer 2(only 3 containers) forever.

I am getting this warning for Reducer 2

INFO : Warning: Shuffle Join MERGEJOIN[34][tables = [$hdt$_0, $hdt$_1]] in Stage 'Reducer 2' is a cross product

And I am getting the following plan for this query which gets hanged at Reducer 2 vertex of TEZ.

enter image description here

Can we please suggest a way by which this Reducer 2 can get more containers or a way to solve this very long running job. Solution would be highly appreciated.


Solution

    1. If lookup table can contain many records with NULLs, this means that at least 'RV' records are not unique in your query, better to use DISTINCT to reduce the size of lookup before join. But you are saying "..id, which is also present and is the primary key of main_ table..." Primary key is unique and NOT NULL. If PK constraint is really enforced by the process loading lookup table, you do not need DISTINCT and NVL. And the same for main table. PK = unique+NOT NULL.

    2. If main table has many NULLs and they all will be transformed to 'NV' before join with, this value can create a skew on JOIN reducer. If 'NV' should be passed, you can exclude it from the join at all.

    3. This is the most significant one. If Lookup table is small enough to fit in memory, use Map-Join. Read this question about mapjoin: Hive Map-Join configuration mystery And it is rather small: (2.5 KB) - Map-join should work definitely.

    set hive.auto.convert.join=true; 
    set hive.mapjoin.smalltable.filesize=157286400; --adjust the figure for mapjoin
    
    insert overwrite table db1.main_table 
    select m.* 
      from db1.main_table m 
           left join (select DISTINCT nvl(id,'RV') id from db2.lookup_table) l 
                  on m.id=l.id --NULLs are not joined
     where l.id is NULL --Not joined records, including NULLs in main table
    ;