I am trying to figure out the Spark SQL query performance with OR vs IN vs UNION ALL.
Option-1:
select cust_id, prod_id, prod_typ
from cust_prod
where prod_typ = '0102' OR prod_typ = '0265';
Option-2:
select cust_id, prod_id, prod_typ
from cust_prod
where prod_typ IN ('0102, '0265');
Option-3:
select cust_id, prod_id, prod_type
from cust_prod
where prod_typ = '0102'
union all
select cust_id, prod_id, prod_type
from cust_prod
where prod_typ = '0265';
I have checked the query plans for all of the above options and have found that with OR, the source table is scanned only once. However, with UNION ALL, I found that the source table is being scanned twice. Hence, logically speaking, the query using OR and IN would be more efficient than the one with UNION ALL.
But I read somewhere that the UNION ALL is preferred (over OR) in such scenarios. Hence, I am bit confused as to which one to follow - OR vs IN vs UNION ALL.
What is the right approach?
Note: We use Spark SQL version 2.4.0.
Union All can also be made to reuse exchanges in specific situations by excluding optimizer's rule PushDownPredicate in some instances.