I'm trying to compare a staging table ('new data') to another table ('existing data') to identify added/changed/removed rows, and eventually an upsert. This is an expensive operation - a full diff on a large dataset. I really wanted to use the EXCEPT
command for syntactic clarity, but am having serious performance problems with it, and find a LEFT JOIN
is much better.
The two tables have a similar number of rows and the same schema (almost - the 'second' table has an additional created_date
column).
They both share distkey(date)
and sortkey(date, id1, id2)
; I'm even specifying the columns in the 'correct' order within the EXCEPT
statement to help the optimizer.
The query plans for each, on a testing-sized subset of the data, are below.
explain
select date, id1, id2, id3, value, attr1, attr2, attr3 from new_data
except select date, id1, id2, id3, value, attr1, attr2, attr3 from existing_data;
XN SetOp Except (cost=1000002817944.78..1000003266822.61 rows=1995013 width=1637)
-> XN Sort (cost=1000002817944.78..1000002867820.09 rows=19950126 width=1637)
Sort Key: date, id1, id2, id3, value, attr1, attr2, attr3
-> XN Append (cost=0.00..399002.52 rows=19950126 width=1637)
-> XN Subquery Scan "*SELECT* 1" (cost=0.00..199501.26 rows=9975063 width=1637)
-> XN Seq Scan on new_data (cost=0.00..99750.63 rows=9975063 width=1637)
-> XN Subquery Scan "*SELECT* 2" (cost=0.00..199501.26 rows=9975063 width=1636)
-> XN Seq Scan on existing_data (cost=0.00..99750.63 rows=9975063 width=1636)
Compare with my much uglier LEFT JOIN
explain
select t1.* from new_data t1
left outer join existing_data t2 on
t1.date = t2.date
and t1.id1 = t2.id1
and coalesce(t1.id2, -1) = coalesce(t2.id2, -1)
and coalesce(t1.id3, -1) = coalesce(t2.id3, -1)
and coalesce(t1.value, -1) = coalesce(t2.value, -1)
and coalesce(t1.attr1, '') = coalesce(t2.attr1, '')
and coalesce(t1.attr2, '') = coalesce(t2.attr2, '')
and coalesce(t1.attr3, '') = coalesce(t2.attr3, '')
where t2.id1 is null;
XN Merge Left Join DS_DIST_NONE (cost=0.00..68706795.68 rows=9975063 width=1637)
Merge Cond: (("outer".date = "inner".date) AND (("outer".id1)::bigint = "inner".id1))
Join Filter: (((COALESCE("outer".id2, -1))::bigint = COALESCE("inner".id2, -1::bigint)) AND ((COALESCE("outer".id3, -1))::bigint = COALESCE("inner".id3, -1::bigint)) AND ((COALESCE("outer".value, -1::numeric))::double precision = COALESCE("inner".value, -1::double precision)) AND ((COALESCE("outer".attr1, ''::character varying))::text = (COALESCE("inner".attr1, ''::character varying))::text) AND ((COALESCE("outer".attr2, ''::character varying))::text = (COALESCE("inner".attr2, ''::character varying))::text) AND ((COALESCE("outer".attr3, ''::character varying))::text = (COALESCE("inner".attr3, ''::character varying))::text))
Filter: ("inner".id1 IS NULL)
-> XN Seq Scan on new_data t1 (cost=0.00..99750.63 rows=9975063 width=1637)
-> XN Seq Scan on existing_data t2 (cost=0.00..99750.63 rows=9975063 width=1636)
The query costs are 1000003266822.61
vs 68706795.68
. I know I'm not supposed to compare across queries but it's proven out in the execution times. Any ideas why an EXCEPT
statement is so much slower than a LEFT JOIN
here?
The left join
is generating a pile of cross joined rows for each (presumably ordered) key value then filtering out the ones it doesn't want via on
; it can also stop when the (presumably ordered) old key values exceed the new key values since there can't be any more matches--which also involves some inferencing via some coalesce
SARG smarts. The except
is sorting everything first. In this case the sorting costs more than generating & throwing away rows, multiply walking through each key's rows of the right hand table. Of course the optimizer could include an outer join
idiom in its except
planning--but it apparently doesn't.
Related: PostgreSQL: NOT IN versus EXCEPT performance difference