sqlsql-serverreplacesubstringsql-server-2016

Find and remove ASCI values from varchar


Context

  1. Reading from CSV files using SQL Server 2016 (no format option available).

  2. The files are crlf/windows and each value will be wrapped in double quotes. Files are located here, note that I changed them to crlf for consistency.

Problem

Every now and then specifically I see this often with numbers, when we attempt to do a replace(value, '"', '') and then cast to int, it will fail with error below

Error

Conversion failed when converting the nvarchar value '331877 ' to data type int.

But if we do a substring() it produces a value that cannot be cast to int.

Super hacky IMHO because you need to guess at the length of the string using len().

I have verified that the results of both replace and substring look exactly alike, but when I run a

nullif(replace(value, '"', ''), substring(value, 2, len(value)-2))

it claims they are different somehow. (crosses eyes) They look the same... they are the same type... they are the same, but nullif recognizes that something is wrong. My code is below. Perhaps there is something with the coalition??

bulk insert #tmp
    From 'C:\acquisition_samples.csv'
    WITH
        (
            CODEPAGE = '65001'
            ,FIRSTROW = 2
            ,FIELDTERMINATOR = ','
            ,ROWTERMINATOR = '0x0A'  
            ,batchsize=10
            ,TABLOCK
        );
    -- DDL
    insert into acquisition_sample(fdc_id_of_sample_food, fdc_id_of_acquisition_food) -- UPDATE file name, and columns
        select
            nullif(REPLACE(t.fdc_id_of_sample_food, '"', ''), substring(t.fdc_id_of_sample_food,2,LEN(t.fdc_id_of_sample_food)-2))
            , nullif(REPLACE(t.fdc_id_of_acquisition_food, '"', ''), substring(t.fdc_id_of_acquisition_food,2,LEN(t.fdc_id_of_acquisition_food)-2)) as wtf
            , CAST(substring(t.fdc_id_of_sample_food,2,LEN(t.fdc_id_of_sample_food)-2) AS INT) 
            , t.fdc_id_of_acquisition_food
            , CAST(substring(t.fdc_id_of_acquisition_food,2,LEN(t.fdc_id_of_acquisition_food)-3) AS INT)
        from #tmp t


Solution

  • Thanks to the hints from @AlwaysLearning and @T N

    This now becomes obvious that the reason why we get a different value is that substring was removing the extra character, but replace was only targeting the double quotes I fed to it.

    Discovering the correct code for the character(s).

    UNICODE(RIGHT(value, 1))
    

    Solution

    REPLACE(Fieldname, CHAR(13), '')
    

    This now becomes obvious that the reason why we get a different value is that substring was removing the extra character, but replace was only targeting the double quotes I fed to it.

    I realized that the reason the text at the end of the line is pushed down by one line in the error console is that it's a carriage return.

    This similar answer also helped. CSV Carriage Return Character