Tags: sql-server, unix, bulkinsert, double-quotes, line-endings

SQL Server bulk insert doesn't recognize double quotes as fieldquote


I'm trying to bulk insert a file in SQL Server 2017 (14.0.1000.169). I would like to take the file exactly as it arrives, save it to the desired location, and then run the bulk insert query without modifying the file at all. I'm having difficulty getting the script to recognize and ignore the double quotes in the text file unless I manually change the line endings from Unix to Windows. I have read quite a few topics, here on SO and elsewhere, that discuss subjects close to this one, but none of them answered my question:

How do I bulk insert my file with Unix line endings and not end up with double quotes?

My file looks like this:

"Report Name","Daily Extract (ID: 111111)"
"Date/Time Generated(UTC)","01-Mar-2020 15:08:51"
"Workspace Name","Company (ID: 22222)"
"Account Name","Client Account"
"Date Range","01-Jan-2019 - 29-Feb-2020"

"Dimension 1","Dimension 2","Dimension 3","Dimension 4","Dimension 5","Dimension 6","Dimension 7","Dimension 8","Dimension 9","Dimension 10","Dimension 11","Dimension 12","Dimension 13","Dimension 14","Dimension 15","Dimension 16","Dimension 17","Metric 1","Metric 2","Metric 3","Metric 4","Metric 5","Metric 6","Metric 7","Metric 8","Metric 9","Metric 10","Metric 11","Metric 12"
"string","string","date as string","string","string","string","string","string","string","string","string","string","string","string","string","string","string","bigint","bigint","decimal","decimal","decimal","bigint","decimal","decimal","bigint","decimal","bigint","bigint"

The query I'm using is as follows:

DROP TABLE IF EXISTS [Table]
GO

CREATE TABLE [dbo].[Table](
    [Dimension 1] [varchar] (255) NULL,
    [Dimension 2] [varchar] (255) NULL,
    [Dimension 3] [varchar] (255) NULL,
    [Dimension 4] [varchar] (255) NULL,
    [Dimension 5] [varchar] (255) NULL,
    [Dimension 6] [varchar] (255) NULL,
    [Dimension 7] [varchar] (255) NULL,
    [Dimension 8] [varchar] (255) NULL,
    [Dimension 9] [varchar] (1000) NULL,
    [Dimension 10] [varchar] (255) NULL,
    [Dimension 11] [varchar] (255) NULL,
    [Dimension 12] [varchar] (255) NULL,
    [Dimension 13] [varchar] (1000) NULL,
    [Dimension 14] [varchar] (1000) NULL,
    [Dimension 15] [varchar] (1000) NULL,
    [Dimension 16] [varchar] (1000) NULL,
    [Metric 1] [varchar] (50) NULL,
    [Metric 2] [varchar] (50) NULL,
    [Metric 3] [varchar] (50) NULL,
    [Metric 4] [varchar] (50) NULL,
    [Metric 5] [varchar] (50) NULL,
    [Metric 6] [varchar] (50) NULL,
    [Metric 7] [varchar] (50) NULL,
    [Metric 8] [varchar] (50) NULL,
    [Metric 9] [varchar] (50) NULL,
    [Metric 10] [varchar] (255) NULL,
    [Metric 11] [varchar] (50) NULL,
    [Metric 12] [varchar] (50) NULL
) ON [PRIMARY]
GO

BULK
INSERT [Table]
FROM 'C:\Users\username\Folder\File.csv'
WITH
(
--FORMAT = 'CSV',
DATAFILETYPE = 'char',
FIELDTERMINATOR = ',',
--ROWTERMINATOR = '\n',
ROWTERMINATOR = '0x0a',
FIRSTROW = 7,
--FIELDQUOTE = '"'
FIELDQUOTE = '0x22'
)
;

As you can see above, I'm importing everything as varchar. Originally I only did this for one metric (due to data quality issues on the supplier's end), since I fully intend to correct every blemish after the file has been loaded. Having run into difficulties, however, I have set all the metrics to varchar, so that at least the file would load and I could see what it looks like and dig further.
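
The cleanup I have in mind for afterwards is along these lines (a sketch only; TRY_CONVERT is available from SQL Server 2012 onward, and float is just an example target type):

-- Flag metric values that would not survive conversion;
-- TRY_CONVERT returns NULL instead of raising an error.
SELECT [Metric 1],
       TRY_CONVERT(float, [Metric 1]) AS [Metric 1 as float]
FROM [dbo].[Table]
WHERE [Metric 1] IS NOT NULL
  AND TRY_CONVERT(float, [Metric 1]) IS NULL;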

Everything I have tried so far resulted in various errors, which all lead back to the same two things: either I can't use FORMAT = 'CSV' (if I leave the Unix line endings in), or the moment I try loading the metrics as float, it errors out because of the double quotes.

I do have a workaround for the time being (I can remove the double quotes and convert the fields after the file has loaded), but I wonder whether I can integrate that step into the bulk insert itself (as I did when I loaded the file with Windows line endings).
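
For reference, the workaround is just a REPLACE per affected column once the data is in (again a sketch, shown for a single metric):

-- Strip the leftover quote characters from a loaded column:
UPDATE [dbo].[Table]
SET [Metric 1] = REPLACE([Metric 1], '"', '')
WHERE [Metric 1] LIKE '%"%';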

N.B. I am aware that FIELDQUOTE hasn't been around for very long; it should, however, apply to my build, as per Microsoft:

"FIELDQUOTE = 'field_quote' Applies to: SQL Server 2017 (14.x) CTP 1.1. Specifies a character that will be used as the quote character in the CSV file. If not specified, the quote character (") will be used as the quote character as defined in the RFC 4180 standard."

Did I forget to disclose anything? If not, any ideas what I might have overlooked?

Thanks in advance!


Solution

  • OK. The biggest problem here is your file. Firstly, the file does not conform to RFC 4180, due to the metadata rows at the top. This makes for a headache.

    Next is the important caveat on FIRSTROW:

    When skipping rows, the SQL Server Database Engine looks only at the field terminators, and does not validate the data in the fields of skipped rows.

    Notice this says field terminators, not row terminators. This is the second problem. For your data, you have this at the start:

    "Report Name","Daily Extract (ID: 111111)"
    "Date/Time Generated(UTC)","01-Mar-2020 15:08:51"
    "Workspace Name","Company (ID: 22222)"
    "Account Name","Client Account"
    "Date Range","01-Jan-2019 - 29-Feb-2020"
    <-- Blank Line -->
    

    That block contains 6 row terminators but only 5 field terminators (one comma per line). Because the skip logic counts field terminators, while each actual data row in the file contains 28 of them, FIRSTROW = 7 does not skip the seven lines you would expect.

    Next, you have more columns in the CSV file than in the table Table: Table doesn't have a column Dimension 17. (And because BULK INSERT without a format file maps fields to columns by ordinal position, the column needs to go between Dimension 16 and Metric 1, not at the end of the table.)

    After adding the missing column, I managed to get this working, with the results I believe you are after, using the statement below:

    BULK INSERT [Table]
    FROM '/tmp/YourFile2.txt'
    WITH (FIELDTERMINATOR = ',',
          ROWTERMINATOR = '\n',
          FIRSTROW = 2,
          FORMAT = 'CSV',
          FIELDQUOTE = '"');
    

    This inserted 1 row into the table.
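
    A quick sanity check on the result (float here is just an example target type for the metrics):

    -- Confirm the row count, and that the quotes are gone
    -- by converting a metric that previously errored out:
    SELECT COUNT(*) AS RowsLoaded FROM [Table];

    SELECT [Dimension 1],
           TRY_CONVERT(float, [Metric 1]) AS [Metric 1 as float]
    FROM [Table];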