Using Hive 1.2.1000.2 on Azure HDInsight 3.6 performing an INNER JOIN
to get the count of records that are present both in Table_1
and Table_2
.
Details of the tables:
Table_1: 310M records
Sample data:
master_id modelkey order_id
---------------------------------------
mi0000bd1444 4874 d988e53cd
mi000097d5 44365 p0905gd44
mi0000d2ab09ea 309141
mi0001d6a 8705 7574
mi00011f7c085 4063 d165804b2
mi0001a57db 314 9c84ft879
Table_2: 35M records
Sample data:
order_id vendor_id
---------------------------------------
81d162f23 7122a0c
6988e53cd 517ba6e
5165804b2 5c5e161
47ba91ea3 7686b2d
f45cab9de 35be1af
Below are the details of what I've tried so far.
Hive query:
SELECT COUNT(*)
FROM db.table_1 t1
INNER JOIN db.table_2 t2 ON t1.order_id = t2.order_id;
Hive properties:
SET hive.tez.container.size=10240;
SET tez.am.resource.memory.mb=10240;
SET tez.task.resource.memory.mb=10240;
SET hive.execution.engine=tez;
SET hive.exec.compress.output=true;
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;
The query is executing for >7 hours and getting stuck at the last Reducer job .e.
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Map 4 .......... SUCCEEDED 715 715 0 0 0 0
Reducer 2 ..... RUNNING 189 188 1 0 0 0
Reducer 3 RUNNING 1 0 1 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/04 [=========================>>-] 99% ELAPSED TIME: 25307.97 s
--------------------------------------------------------------------------------
Is there a way to overcome the last Reducer's issue and, get the result?
Explain for:
Performed the following steps, it helped! and hope it help others:
order_id=''