Performing a merge against a table with 40 million rows takes many hours, on a database server with significant horsepower (64 GB memory, 16 cores).
I load data into a staging table via SqlBulkCopy
, and then MERGE
into a target table. The MERGE
is performed in batches to minimize impact on TempDb
.
The gist of my statement is:
DECLARE @RowID int = 0,
@RowCount int,
@Batches int = 0,
@BatchSize int = 10000
SELECT @RowCount = COUNT(1)
FROM [someStagingTable]
WHILE @RowID <= @RowCount
BEGIN
MERGE INTO [someTargetTable] AS Target
USING (SELECT * FROM [someStagingTable]
WHERE ID BETWEEN @RowID AND @RowID + @BatchSize - 1) AS Source
ON Target.AccountNumber = Source.AccountNumber
WHEN MATCHED THEN
UPDATE
SET ...
WHEN NOT MATCHED BY TARGET THEN
INSERT ...
SET @RowID = @RowID + @BatchSize
SET @Batches = @Batches + 1
COMMIT
END
Regarding indexing:
[someStagingTable].ID
is an int
identity column with a clustered index[someTargetTable].AccountNumber
is indexedIn this particular case, despite the presence of an obvious index on [someTargetTable].AccountNumber
, the MERGE
statement needed an index hint:
MERGE INTO [someTargetTable] WITH (INDEX=IX_someTargetTableAccountNumber) AS Target
USING (SELECT * FROM [someStagingTable] WHERE ID BETWEEN @RowID AND @RowID + @BatchSize - 1) AS Source
ON Target.AccountNumber = Source.AccountNumber
This became obvious when inspecting the SQL execution plan, which started with a TABLE SCAN
of [someTargetTable]
in each batch. Adding the index hint reduced the execution time of each batch from ~300 seconds to ~1 second. The execution plan shifts from a TABLE SCAN
to an INDEX SEEK
, dropping rows read from 40M to 10K (the batch size).