There are about 2 million records that need to be loaded from a CSV file, and the script below is supposed to load all of them in a single run (roughly 43 iterations of the loop). Instead, it only loads some of the data on each complete run, and I have to run the script many times to get all of the records. I suspect it has something to do with the FETCH NEXT, OFFSET, and/or OPTIMIZE FOR statements.
I have some other code, removed here for brevity, that tracks the migration history, and it shows that each run catches more of the records and adds them to the database. Honestly, I'm at a loss as to why. I've tried logging and investigating and I'm stuck. Below you can see each run: the number of rows in the CSV as ChecksumMigration, the number of rows already in the table when the run started as ChecksumTableBefore, and the number of rows in the table when the run finished as ChecksumTableAfter.
AppliedDateTime | ChecksumMigration | ChecksumTableBefore | ChecksumTableAfter |
---|---|---|---|
2024-05-06 00:20:05 | 2036473 | 1986473 | 2036473 |
2024-05-06 00:06:27 | 2036473 | 1936473 | 1986473 |
2024-05-05 23:57:51 | 2036473 | 1786473 | 1936473 |
2024-05-05 23:54:07 | 2036473 | 1536473 | 1786473 |
2024-05-05 23:49:35 | 2036473 | 1036473 | 1536473 |
2024-05-05 23:42:20 | 2036473 | 0 | 1036473 |
USE FoodData_Central;
DECLARE @beforeChecksum INT = 0;
DECLARE @afterChecksum INT = 0;
DECLARE @migrationName NVARCHAR(40) = N'2024 April Full from 2021 - ';
DECLARE @pathToInputFolder NVARCHAR(40) = N'C:\FoodData_Central_csv_2024-04-18\';
DECLARE @tableName NVARCHAR(40);
DECLARE @startTime DATETIME2 = GETDATE();
BEGIN TRY
    SET @tableName = 'food';

    -- CHECKSUM
    DECLARE @SQL NVARCHAR(MAX) = 'SELECT @ResultVariable = count(*) FROM ' + @tableName;
    EXEC sp_executesql @SQL, N'@ResultVariable INT OUTPUT', @ResultVariable = @beforeChecksum OUTPUT;

    -- TRUNCATE TABLE food; --only truncated the first time and then use fetch next to get through the data, solving problem data along the way

    -- Mapping
    DROP TABLE IF EXISTS #tmp;
    CREATE TABLE #tmp (
        fdc_id NVARCHAR(MAX) NOT NULL,
        data_type NVARCHAR(MAX) NULL,
        description NVARCHAR(MAX) NULL,
        food_category_id NVARCHAR(MAX) NULL,
        publication_date NVARCHAR(MAX) NULL
    );

    BULK INSERT #tmp
    FROM 'C:\FoodData_Central_csv_2024-04-18\food.csv' -- update file name, (!sometimes the file names are different <crosses eyes>)
    WITH
    (
        CODEPAGE = '65001'
        ,FIRSTROW = 2
        ,FIELDTERMINATOR = '\",\"'
        ,ROWTERMINATOR = '0x0A' --Use to shift the control to next row
        ,BATCHSIZE = 500000
        ,TABLOCK
    );

    DECLARE @i INT = 1;
    DECLARE @offsetCount INT = 1;
    DECLARE @nextCount INT = 50000;

    WHILE @i < 43
    BEGIN
        SET @i = @i + 1;

        -- DDL
        INSERT INTO food(fdc_id, data_type, description, food_category_id, publication_date) -- UPDATE file name, and columns
        SELECT
            CAST(REPLACE(t.fdc_id,'"','') AS INT) AS fdc_id
            , t.data_type
            , t.description
            , CAST(t.food_category_id AS SMALLINT) AS food_category_id
            , CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2) AS publication_date
        FROM #tmp t
        WHERE NOT EXISTS ( -- skip duplicates
            SELECT 1 FROM food AS d --UPDATE
            WHERE d.fdc_id = CAST(REPLACE(t.fdc_id,'"','') AS INT)
        )
        ORDER BY fdc_id DESC
        OFFSET @offsetCount - 1 ROWS
        FETCH NEXT @nextCount - @offsetCount + 1 ROWS ONLY
        OPTION ( OPTIMIZE FOR (@offsetCount = 1, @nextCount = 2036474) );

        SET @offsetCount = @offsetCount + 50000;
        SET @nextCount = @nextCount + 50000;
    END

    -- CLEANUP
    DROP TABLE IF EXISTS #tmp;
END TRY
BEGIN CATCH
PRINT 'Error Number: ' + CAST(ERROR_NUMBER() AS NVARCHAR(10));
PRINT 'Error Message: ' + ERROR_MESSAGE();
-- CLEANUP
DROP TABLE IF EXISTS #tmp;
END CATCH;
GO
Edit:
Thanks to everyone who helped out with this, including @MatBailie's recommendations for performance.
I used @MatBailie's recommendation to create a clustered index for the temp table. That wasn't possible for the initial temp table, so I created a second temp table with an index, moved the data there, and dropped the first temp table. It cut the time in half, even considering the additional subquery from the answer below.
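For reference, a rough sketch of what that staging step might look like. The edit above doesn't include the actual code, so the table name #tmp2, the index name, and the column types/sizes below are assumptions:
-- Assumed sketch: copy the raw landing table into a typed temp table that can carry a clustered index.
-- NVARCHAR(MAX) columns can't be clustered index keys, so fdc_id is converted to INT during the copy.
DROP TABLE IF EXISTS #tmp2;
CREATE TABLE #tmp2 (
    fdc_id INT NOT NULL,
    data_type NVARCHAR(100) NULL,
    description NVARCHAR(MAX) NULL,
    food_category_id SMALLINT NULL,
    publication_date DATETIME2 NULL
);
CREATE CLUSTERED INDEX IX_tmp2_fdc_id ON #tmp2 (fdc_id);

INSERT INTO #tmp2 (fdc_id, data_type, description, food_category_id, publication_date)
SELECT
    CAST(REPLACE(t.fdc_id, '"', '') AS INT),
    t.data_type,
    t.description,
    CAST(t.food_category_id AS SMALLINT),
    CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2)
FROM #tmp t;

DROP TABLE IF EXISTS #tmp; -- the raw landing table is no longer needed after the copy
In this version the batched inserts would then select from #tmp2 instead of #tmp, and could drop the CAST/REPLACE cleanup since the staging copy already did it.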
OK, this one is a bit weird, but I believe the OFFSET/FETCH is being applied after the WHERE clause, which throws the offset off. For example, on the second pass the 50,000 rows inserted by the first pass have already been removed by the NOT EXISTS check, so OFFSET 50000 then skips a further 50,000 rows that were never inserted at all.
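Here is a small self-contained repro of that effect (not from the original post: #src, #dest, and the batch size of 3 are made up) showing the same pattern dropping rows:
DROP TABLE IF EXISTS #src;
DROP TABLE IF EXISTS #dest;
CREATE TABLE #src (id INT NOT NULL);
CREATE TABLE #dest (id INT NOT NULL);
INSERT INTO #src (id) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);

DECLARE @i INT = 1;
DECLARE @offsetCount INT = 1;
DECLARE @nextCount INT = 3;

-- Four passes of 3 rows should be enough to copy all 10 rows.
WHILE @i <= 4
BEGIN
    SET @i = @i + 1;
    INSERT INTO #dest (id)
    SELECT s.id
    FROM #src s
    WHERE NOT EXISTS (SELECT 1 FROM #dest d WHERE d.id = s.id) -- already-copied rows are filtered out first...
    ORDER BY s.id
    OFFSET @offsetCount - 1 ROWS                               -- ...so this offset also skips rows
    FETCH NEXT @nextCount - @offsetCount + 1 ROWS ONLY;        -- that were never copied
    SET @offsetCount = @offsetCount + 3;
    SET @nextCount = @nextCount + 3;
END

SELECT COUNT(*) AS copied FROM #dest; -- 6, not 10: rows 4, 5, 6 and 10 were skipped
Only 6 of the 10 rows arrive on the first full run, which matches the way each full run of the real script loads only part of the CSV.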
Instead, try putting the OFFSET/FETCH into a subquery or separate query (e.g., select the next 50,000 records), then do the INSERT INTO with the WHERE clause only afterwards.
I think something like this works
insert into food(fdc_id, data_type, description, food_category_id, publication_date) -- UPDATE file name, and columns
select fdc_id, data_type, description, food_category_id, publication_date
from (select CAST(REPLACE(t.fdc_id,'"','') AS INT) AS fdc_id
, t.data_type
, t.description
, CAST(t.food_category_id AS SMALLINT) AS food_category_id
, CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2) AS publication_date
from #tmp t
ORDER BY fdc_id DESC
OFFSET @offsetCount - 1 ROWS
FETCH NEXT @nextCount - @offsetCount + 1 ROWS ONLY
) AS a
WHERE NOT EXISTS ( -- skip duplicates
SELECT 1 FROM food AS d --UPDATE
WHERE d.fdc_id = a.fdc_id
)
OPTION ( OPTIMIZE FOR (@offsetCount = 1, @nextCount = 2036474) );
As per the comments below, @MatBailie suggested I add this to the answer: select each batch with SELECT TOP (50000) rather than using OFFSET/FETCH. These are probably better approaches than mine above.
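The comment thread isn't reproduced here, so the loop below is only one possible reading of that suggestion (the @@ROWCOUNT stopping condition is my assumption): take the next 50,000 not-yet-loaded rows each pass and let the NOT EXISTS check advance the window, so there is no offset to keep in sync.
DECLARE @rowsInserted INT = 1;

WHILE @rowsInserted > 0
BEGIN
    INSERT INTO food (fdc_id, data_type, description, food_category_id, publication_date)
    SELECT TOP (50000)
        CAST(REPLACE(t.fdc_id, '"', '') AS INT),
        t.data_type,
        t.description,
        CAST(t.food_category_id AS SMALLINT),
        CAST(REPLACE(REPLACE(t.publication_date, '"', ''), CHAR(13), '') AS DATETIME2)
    FROM #tmp t
    WHERE NOT EXISTS (            -- only rows not already in food are candidates,
        SELECT 1 FROM food d      -- so each pass naturally picks up where the last one left off
        WHERE d.fdc_id = CAST(REPLACE(t.fdc_id, '"', '') AS INT)
    )
    ORDER BY CAST(REPLACE(t.fdc_id, '"', '') AS INT) DESC;

    SET @rowsInserted = @@ROWCOUNT;
END
A nice side effect is that the loop no longer needs the hard-coded 43-iteration count: it simply runs until a pass inserts nothing.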