I got a very simple query which shows significant performance difference when running on Spark SQL and Presto (3 hrs v.s 3 mins) in the same hardware.
SELECT field
FROM test1
WHERE field NOT IN (SELECT field FROM test2)
After some research of the query plan, I found out the reason is how Spark SQL deals with NOT IN
predicate subquery.
To correctly handle the NULL of NOT IN, Spark SQL translate the NOT IN
predicate as Left AntiJoin( (test1=test2) OR isNULL(test1=test2))
.
Spark SQL introduces OR isNULL(test1=test2)
to ensure the correct semantics of NOT IN
.
However, the OR
of Left AntiJoin join predicate causes the only feasible physical join strategy for Left AntiJoin
is BroadcastNestedLoopJoin
. For current stage, I could rewrite NOT IN to NOT EXISTS to workaround this issue. In the query plan of NOT EXISTS, I could see the the join predicate is Left AntiJoin(test1=test2)
which causes a better physical join operator for NOT EXISTS (5 mins to finish).
So far I am lucky since my dataset currently does not have any NULL
attributes, but it may have in the future and the semantics of NOT IN is what I really want.
So I check query plan of Presto, It does not really provides Left AntiJoin
but it uses SemiJoin
with a FilterPredicate = not (expr)
. The query plan of Presto does not provide too much info like Spark.
So my question is more like:
Could I assume Presto has a better physical join operator to handle NOT IN
operation? Not like Spark SQL, it does not rely on the rewrite of join predicates isnull(op1 = op2)
to ensure the correct semantics of NOT IN in the logical plan level.
I am actually the person who implemented NULL
treatment for semi join (IN
predicate) in Presto.
Presto uses "replicate nulls and any row" replication mode in addition to hash-partitioning¹, which allows it to process IN
correctly in the presence of NULL
s on either side of the IN
, without falling back to broadcasting, or making the execution single-threaded or single-node. The runtime performance cost is practically the same as if NULL
values didn't exist at all.
If you want to learn more about Presto internals, join the #dev
channel on Trino Community Slack.
¹) to be precise, semi join is hash-partitioned or broadcast, depending on cost-based decision or configuration.