When performing a join in Hive and then filtering the output with a where clause, the Hive compiler will try to filter data before the tables are joined. This is known as predicate pushdown (http://allabouthadoop.net/what-is-predicate-pushdown-in-hive/)
For example:
SELECT * FROM a JOIN b ON a.some_id=b.some_other_id WHERE a.some_name=6
Rows from table a which have some_name = 6 will be filtered before performing the join, if push down predicates are enabled(hive.optimize.ppd).
However, I have also learned recently that there is another way of filtering data from a table before joining it with another table(https://vinaynotes.wordpress.com/2015/10/01/hive-tips-joins-occur-before-where-clause/).
One can provide the condition in the ON clause, and table a will be filtered before the join is performed
For example:
SELECT * FROM a JOIN b ON a.some_id=b.some_other_id AND a.some_name=6
Do both of these provide the predicate pushdown optimization?
Thank you
Both are valid and in case of INNER JOIN and PPD both will work the same. But these methods works differently in case of OUTER JOINS
ON join condition works before join.
WHERE is applied after join.
Optimizer decides is Predicate push-down applicable or not and it may work, but in case of LEFT JOIN for example with WHERE filter on right table, the WHERE filter
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id
WHERE b.some_name=6 --Right table filter
will restrict NULLs, and LEFT JOIN
will be transformed into INNER JOIN
, because if b.some_name=6, it cannot be NULL.
And PPD does not change this behavior.
You can still do LEFT JOIN with WHERE filter if you add additional OR condition allowing NULLs in the right table:
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id
WHERE b.some_name=6 OR b.some_other_id IS NULL --allow not joined records
And if you have multiple joins with many such filtering conditions the logic like this makes your query difficult to understand and error prune.
LEFT JOIN with ON filter does not require additional OR condition because it filters right table before join, this query works as expected and easy to understand:
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id and b.some_name=6
PPD still works for ON filter and if table b is ORC, PPD will push the predicate to the lowest possible level to the ORC reader and will use built-in ORC indexes for filtering on three levels: rows, stripes and files.
More on the same topic and some tests: https://stackoverflow.com/a/46843832/2700344
So, PPD or not PPD, better use explicit ANSI syntax with ON condition and ON filtering if possible to keep the query as simple as possible and avoid converting to INNER JOIN unintentionally.