I've seen a great many different forms of left anti-semi join. Allow me to list and name every one that comes to mind. The following queries are intended to return every ROSTER_ID
that's not used by any employee and who is the owner of that roster.
--1) NOT EXISTS, with a particular column selected in the subquery
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT EMP_ID FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--2) NOT EXISTS, with a particular column selected in the subquery and TOP (1) used
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT TOP (1) EMP_ID FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--3) NOT EXISTS, with all data selected in the subquery
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT * FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--4) NOT EXISTS, with all columns selected in the subquery and TOP (1) used
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT TOP (1) * FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--5) NOT EXISTS, but just use SELECT 1
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE NOT EXISTS (SELECT 1 FROM EMPLOYEES WHERE EMP_ID = ROSTERS.EMP_ID)
--6) NOT IN
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
WHERE EMP_ID NOT IN (SELECT EMP_ID FROM EMPLOYEES)
--7) LEFT JOIN
SELECT ROSTER_ID, ROSTER_OWNER
FROM ROSTERS
LEFT OUTER JOIN EMPLOYEES
ON EMPLOYEES.EMP_ID = ROSTERS.EMP_ID
WHERE EMPLOYEES.EMP_ID IS NULL
My question is this: Is there any objective reason - e.g. performance, backwards compatibility, portability, NULL-handling, ease of testing, extensibility, etc - to prefer any particular way of doing a left anti-semi join? I'm also interested to hear subjective reasons, e.g. style concerns or clarity, but only including them would be a non-answer.
My own research only points to the below, but it's all weak and probably subjective:
NULL
handling with IN
is always cause for concern, which is a mild reason to prefer EXISTS
over it.EXCEPT
, but I don't think that it generalises to cases where a column appears in only the outermost SELECT
.I don't think that any of the answers have covered everything, so here's my attempt to put all of the answers and comments together:
EXISTS
examples, are all the same. This is not easy to prove (you might find KumarHarsh's example convincing) and relies on you trusting the optimiser.IN
. You just can't trust it with NULL
s.LEFT JOIN
, can duplicate results if you use it for a normal (i.e. not anti
) semijoin. There is also suspicion around how the optimiser will treat it (see the link below). Subjectively, it takes what appears to be more typing and isn't all that good at communicating your intent. Overall, it's best avoided.In conclusion, use EXISTS
. Examples #2 and #4, which use TOP
, pointlessly do so, so you might as well avoid them. As for if you should use examples #1, #3, or #5, that's just a matter of taste. I think that #5 most clearly communicates intent, but that's completely a matter of opinion.
This very helpful article also suggests OUTER APPLY
, but I don't think that anyone takes that seriously.