The join documentation for Hive encourages the use of implicit joins, i.e.
SELECT *
FROM table1 t1, table2 t2, table3 t3
WHERE t1.id = t2.id AND t2.id = t3.id AND t1.zipcode = '02535';
Is this equivalent to
SELECT t1.*, t2.*, t3.*
FROM table1 t1
INNER JOIN table2 t2 ON
t1.id = t2.id
INNER JOIN table3 t3 ON
t2.id = t3.id
WHERE t1.zipcode = '02535'
, or will the above return additional records?
Not always. Your queries are equivalent. But without WHERE t1.id = t2.id AND t2.id = t3.id
it will be CROSS JOIN
.
Update:
This is interesting question and I decided to add some demo. Let's create two tables:
A(c1 int, c2 string)
and B(c1 int, c2 string)
.
Load data:
insert into table A
select 1, 'row one' union all
select 2, 'row two';
insert into table B
select 1, 'row one' union all
select 3, 'row three';
Check data:
hive> select * from A;
OK
1 row one
2 row two
Time taken: 1.29 seconds, Fetched: 2 row(s)
hive> select * from B;
OK
1 row one
3 row three
Time taken: 0.091 seconds, Fetched: 2 row(s)
Check cross join (implicit join without where
transformed to cross):
hive> select a.c1, a.c2, b.c1, b.c2 from a,b;
Warning: Map Join MAPJOIN[14][bigTable=a] in task 'Stage-3:MAPRED' is a cross product
Warning: Map Join MAPJOIN[22][bigTable=b] in task 'Stage-4:MAPRED' is a cross product
Warning: Shuffle Join JOIN[4][tables = [a, b]] in Stage 'Stage-1:MAPRED' is a cross product
OK
1 row one 1 row one
2 row two 1 row one
1 row one 3 row three
2 row two 3 row three
Time taken: 54.804 seconds, Fetched: 4 row(s)
Check inner join (implicit join with where
works as INNER):
hive> select a.c1, a.c2, b.c1, b.c2
from a,b
where a.c1=b.c1;
OK
1 row one 1 row one
Time taken: 38.413 seconds, Fetched: 1 row(s)
Try to perform left join by adding OR b.c1 is null
to the where:
hive> select a.c1, a.c2, b.c1, b.c2
from a,b
where (a.c1=b.c1) OR (b.c1 is null);
OK
1 row one 1 row one
Time taken: 57.317 seconds, Fetched: 1 row(s)
As you can see we got inner join again. or b.c1 is null
is ignored
Now left join
without where
and ON
clause (transformed to CROSS):
select a.c1, a.c2, b.c1, b.c2
from a left join b;
OK
1 row one 1 row one
1 row one 3 row three
2 row two 1 row one
2 row two 3 row three
Time taken: 37.104 seconds, Fetched: 4 row(s)
As you can see we got cross again.
Try left join with where
clause and without ON
(works as INNER):
select a.c1, a.c2, b.c1, b.c2
from a left join b
where a.c1=b.c1;
OK
1 row one 1 row one
Time taken: 40.617 seconds, Fetched: 1 row(s)
We got INNER join
Try left join with where
clause and without ON
+ try to allow nulls:
select a.c1, a.c2, b.c1, b.c2
from a
left join b
where a.c1=b.c1
or b.c1 is null;
OK
1 row one 1 row one
Time taken: 53.873 seconds, Fetched: 1 row(s)
Again got INNER. or b.c1 is null
is ignored.
Left join with on
clause:
hive> select a.c1, a.c2, b.c1, b.c2
from a left join b on a.c1=b.c1;
OK
1 row one 1 row one
2 row two NULL NULL
Time taken: 48.626 seconds, Fetched: 2 row(s)
Yes, it is true left join.
Left join with on
+ where
(got INNER):
hive> select a.c1, a.c2, b.c1, b.c2
from a left join b on a.c1=b.c1
where a.c1=b.c1;
OK
1 row one 1 row one
Time taken: 49.54 seconds, Fetched: 1 row(s)
We got INNER because WHERE does not allow NULLS.
Left join with where + allow nulls:
hive> select a.c1, a.c2, b.c1, b.c2
from a left join b on a.c1=b.c1
where a.c1=b.c1
or b.c1 is null;
OK
1 row one 1 row one
2 row two NULL NULL
Time taken: 55.951 seconds, Fetched: 2 row(s)
Yes, it is left join.
Conclusion: