apache-spark-sql

Performance of OR vs UNION in Spark SQL


I am trying to figure out the Spark SQL query performance with OR vs IN vs UNION ALL.

Option-1:

select cust_id, prod_id, prod_typ
from cust_prod
where prod_typ = '0102' OR prod_typ = '0265';

Option-2:

select cust_id, prod_id, prod_typ
from cust_prod
where prod_typ IN ('0102, '0265');

Option-3:

select cust_id, prod_id, prod_type
from cust_prod
where prod_typ = '0102'
union all
select cust_id, prod_id, prod_type
from cust_prod
where prod_typ = '0265';

I have checked the query plans for all of the above options and have found that with OR, the source table is scanned only once. However, with UNION ALL, I found that the source table is being scanned twice. Hence, logically speaking, the query using OR and IN would be more efficient than the one with UNION ALL.

But I read somewhere that the UNION ALL is preferred (over OR) in such scenarios. Hence, I am bit confused as to which one to follow - OR vs IN vs UNION ALL.

What is the right approach?

Note: We use Spark SQL version 2.4.0.


Solution

  • Union All can also be made to reuse exchanges in specific situations by excluding optimizer's rule PushDownPredicate in some instances.