hadoopjoinhivehiveql

Are Hive's implicit joins always inner joins?


The join documentation for Hive encourages the use of implicit joins, i.e.

SELECT * 
FROM table1 t1, table2 t2, table3 t3 
WHERE t1.id = t2.id AND t2.id = t3.id AND t1.zipcode = '02535';

Is this equivalent to

SELECT t1.*, t2.*, t3.*
FROM table1 t1
INNER JOIN table2 t2 ON
  t1.id = t2.id 
INNER JOIN table3 t3 ON
  t2.id = t3.id
WHERE t1.zipcode = '02535'

, or will the above return additional records?


Solution

  • Not always. Your queries are equivalent. But without WHERE t1.id = t2.id AND t2.id = t3.id it will be CROSS JOIN.

    Update:

    This is interesting question and I decided to add some demo. Let's create two tables:

    A(c1 int, c2 string) and B(c1 int, c2 string).

    Load data:

    insert into table A
     select 1, 'row one' union all
     select 2, 'row two';
    
    insert into table B
     select 1, 'row one' union all
     select 3, 'row three';
    

    Check data:

    hive> select * from A;
    OK
    1       row one
    2       row two
    Time taken: 1.29 seconds, Fetched: 2 row(s)
    hive> select * from B;
    OK
    1       row one
    3       row three
    Time taken: 0.091 seconds, Fetched: 2 row(s)
    

    Check cross join (implicit join without where transformed to cross):

    hive> select a.c1, a.c2, b.c1, b.c2 from a,b;
    
    Warning: Map Join MAPJOIN[14][bigTable=a] in task 'Stage-3:MAPRED' is a cross product
    Warning: Map Join MAPJOIN[22][bigTable=b] in task 'Stage-4:MAPRED' is a cross product
    Warning: Shuffle Join JOIN[4][tables = [a, b]] in Stage 'Stage-1:MAPRED' is a cross product
    
    OK
    1       row one 1       row one
    2       row two 1       row one
    1       row one 3       row three
    2       row two 3       row three
    Time taken: 54.804 seconds, Fetched: 4 row(s)
    

    Check inner join (implicit join with where works as INNER):

    hive> select a.c1, a.c2, b.c1, b.c2 
            from a,b 
           where a.c1=b.c1;
    OK
    1       row one 1       row one
    Time taken: 38.413 seconds, Fetched: 1 row(s)
    

    Try to perform left join by adding OR b.c1 is null to the where:

    hive> select a.c1, a.c2, b.c1, b.c2 
            from a,b 
           where (a.c1=b.c1) OR (b.c1 is null);
    
    OK
    1       row one 1       row one
    Time taken: 57.317 seconds, Fetched: 1 row(s)
    

    As you can see we got inner join again. or b.c1 is null is ignored

    Now left join without where and ON clause (transformed to CROSS):

     select a.c1, a.c2, b.c1, b.c2 
       from a left join b;
    
    OK
    1       row one 1       row one
    1       row one 3       row three
    2       row two 1       row one
    2       row two 3       row three
    Time taken: 37.104 seconds, Fetched: 4 row(s)
    

    As you can see we got cross again.

    Try left join with where clause and without ON (works as INNER):

    select a.c1, a.c2, b.c1, b.c2 
      from a left join b 
     where a.c1=b.c1;
    
    OK
    1       row one 1       row one
    Time taken: 40.617 seconds, Fetched: 1 row(s)
    

    We got INNER join

    Try left join with where clause and without ON+ try to allow nulls:

     select a.c1, a.c2, b.c1, b.c2 
       from a 
       left join b 
      where a.c1=b.c1 
         or b.c1 is null;
    
    OK
    1       row one 1       row one
    Time taken: 53.873 seconds, Fetched: 1 row(s)
    

    Again got INNER. or b.c1 is null is ignored.

    Left join with on clause:

    hive> select a.c1, a.c2, b.c1, b.c2 
            from a left join b on a.c1=b.c1;
    
    OK
    1       row one 1       row one
    2       row two NULL    NULL
    Time taken: 48.626 seconds, Fetched: 2 row(s)
    

    Yes, it is true left join.

    Left join with on + where (got INNER):

    hive> select a.c1, a.c2, b.c1, b.c2 
            from a left join b on a.c1=b.c1 
           where a.c1=b.c1;
    
    OK
    1       row one 1       row one
    Time taken: 49.54 seconds, Fetched: 1 row(s)
    

    We got INNER because WHERE does not allow NULLS.

    Left join with where + allow nulls:

    hive> select a.c1, a.c2, b.c1, b.c2 
            from a left join b on a.c1=b.c1 
           where a.c1=b.c1 
              or b.c1 is null;
    
    OK
    1       row one 1       row one
    2       row two NULL    NULL
    Time taken: 55.951 seconds, Fetched: 2 row(s)
    

    Yes, it is left join.

    Conclusion:

    1. Implicit join works as INNNER (with where) or CROSS if without WHERE clause.
    2. Left join can work as CROSS if without ON and without WHERE, can also work as INNER if WHERE clause does not allows nulls for right table.
    3. Better use ANSI syntax because it is self-explaining and it is easy to understand what do you expect it to work like. Implicit joins or left joins working as INNER or CROSS are difficult to understand and very prone to error.