I found some similar questions but none exactly matching my use case. When trying to use a MERGE statement Microsoft mentions the following:
It's important to specify only the columns from the target table to use for matching purposes. That is, specify columns from the target table that are compared to the corresponding column of the source table. Don't attempt to improve query performance by filtering out rows in the target table in the ON clause; for example, such as specifying AND NOT target_table.column_x = value. Doing so can return unexpected and incorrect results.
In another place it mentions that you could use views to "filter" the source table.
Define a view on the source or target that returns the filtered rows and reference the view as the source or target table. If the view is defined on the target table, any actions against it must satisfy the conditions for updating views. For more information about updating data by using a view, see Modifying Data Through a View.
But what I cannot find is whether or not you can "safely" use a subquery instead of the source table. For example:
MERGE TableA a
USING (
SELECT *
FROM TableB b
WHERE
b.Status = 5
) b
ON a.ID = b.ForeignID
... MATCH statements ...
Would this cause similar problems as using a CTE or filtering by using the ON statement?
These two methods are exactly the same. A view and a CTE and a derived table (what you are calling a subquery) and an inline TVF behave identically to the compiler: they are inlined into the query.
The documentation is actually incorrect (note I've linked the latest docs):
Use the
WITH <common table expression>
clause to filter out rows from the source or target tables. This method is similar to specifying additional search criteria in the ON clause and may produce incorrect results. We recommend that you avoid using this method or test thoroughly before implementing it.
This is completely false as I will explain.
MERGE
functions like a FULL JOIN
. The source
and target
are full-joined together, and the ON
condition in the MERGE
is the ON
condition of the FULL JOIN
. Then the WHEN MATCHED AND
conditions go in a later filter like a WHERE
. Sometimes the FULL JOIN
becomes a LEFT
or INNER
if the match clauses are not present.
So for example, the following
MERGE TableA a
USING (
SELECT *
FROM TableB b
WHERE
b.Status = 5
) b
ON a.ID = b.ForeignID
WHEN NOT MATCHED THEN
INSERT (...)
VALUES (...)
WHEN MATCHED AND a.x <> b.x THEN
UPDATE SET
x = b.x;
WHEN NOT MATCHED THEN
DELETE;
becomes pseudo-code like:
SELECT -- data to MERGE
a.ID,
b.ForeignID,
b.x,
CASE WHEN b.ForeignID IS NULL THEN 'INSERT'
WHEN b.ForeignID IS NOT NULL AND a.x <> b.x THEN 'UPDATE'
WHEN a.ID IS NULL THEN 'DELETE'
END AS $action
FROM TableA a
FULL JOIN (
SELECT *
FROM TableB b
WHERE
b.Status = 5
) b
ON a.ID = b.ForeignID
WHERE (b.ForeignID IS NULL) -- not matched
OR (a.ID IS NOT NULL AND b.ForeignID IS NOT NULL AND a.x <> b.x) -- matched AND
OR (a.ID IS NULL) -- not matched by source
;
As you can see, it would make no difference where the source or target come from, whether a table, view, derived table or iTVF.
The key here is: if you want to merge to and from a subset of the data, then filter the source and target before you get to the ON
. The ON
clause should only be used for one thing: matching up rows. Do not put any other conditions in there.