nullapache-spark-sqlprestoisnull

NOT IN implementation of Presto v.s Spark SQL


I got a very simple query which shows significant performance difference when running on Spark SQL and Presto (3 hrs v.s 3 mins) in the same hardware.

SELECT field 
FROM test1 
WHERE field NOT IN (SELECT field FROM test2)

After some research of the query plan, I found out the reason is how Spark SQL deals with NOT IN predicate subquery. To correctly handle the NULL of NOT IN, Spark SQL translate the NOT IN predicate as Left AntiJoin( (test1=test2) OR isNULL(test1=test2)).

Spark SQL introduces OR isNULL(test1=test2) to ensure the correct semantics of NOT IN.

However, the OR of Left AntiJoin join predicate causes the only feasible physical join strategy for Left AntiJoin is BroadcastNestedLoopJoin. For current stage, I could rewrite NOT IN to NOT EXISTS to workaround this issue. In the query plan of NOT EXISTS, I could see the the join predicate is Left AntiJoin(test1=test2) which causes a better physical join operator for NOT EXISTS (5 mins to finish).

So far I am lucky since my dataset currently does not have any NULL attributes, but it may have in the future and the semantics of NOT IN is what I really want.

So I check query plan of Presto, It does not really provides Left AntiJoin but it uses SemiJoin with a FilterPredicate = not (expr). The query plan of Presto does not provide too much info like Spark.

So my question is more like:

Could I assume Presto has a better physical join operator to handle NOT IN operation? Not like Spark SQL, it does not rely on the rewrite of join predicates isnull(op1 = op2) to ensure the correct semantics of NOT IN in the logical plan level.


Solution

  • I am actually the person who implemented NULL treatment for semi join (IN predicate) in Presto.

    Presto uses "replicate nulls and any row" replication mode in addition to hash-partitioning¹, which allows it to process IN correctly in the presence of NULLs on either side of the IN, without falling back to broadcasting, or making the execution single-threaded or single-node. The runtime performance cost is practically the same as if NULL values didn't exist at all.

    If you want to learn more about Presto internals, join the #dev channel on Trino Community Slack.

    ¹) to be precise, semi join is hash-partitioned or broadcast, depending on cost-based decision or configuration.