sql-serverutf-8ansisql-server-2019flat-file

Openrowset problems with Umlauts/Codepage when flatfile is ANSI and not UTF8


I've got a Problem with umlauts (Example: Empänger) when using OPENROWSET. In our old envirionment with Windows Server 2012, SQL-Server 2012 and a COLLATION without UTF-8 (Latin1_General_CI_AS) everything runs without Problems. In most Times our Flatfiles are in ANSI (I think cp1252).
Now we changed to Windows Server 2016, SQL-Server 2019 and the UTF-8 Collation (Latin1_General_100_CI_AS_SC_UTF8) and got a problem with the german Umlauts (Umlaute like Übung).

We have many Users that uses the ability to upload files and using them as an extended Version in SSRS. So it would be nice if it is just an option to change.

Our original script works fine with UTF-8 but it returns the wrong characters in ANSI:

SELECT * FROM OPENROWSET('MSDASQL','Driver={Microsoft Access Text Driver (*.txt, *.csv)}','select * from \\localhost\_RepUploader\FlatFileTest_ANSI.csv')

This is what I get: "Emfpfänger" >> "Empf�nger" When I change the Encoding from ANSI tu UTF-8 I get the correct spelling.

Here is the Content of that File:

FirstWord;SecondWord
Empfänger;Üben

In the "Schema.ini" I tried differnet CharakterSet Options: "1252,65001,OEM,ANSI,1250" and also others between the known Codepages. Without luck.

I got it Working With a Polybase External_Table Connection. I also got it working with a BULK INSERT and the Codepage 1252.

drop table if exists #tmp
create table #tmp (FirstWord varchar(100), SecondWord nvarchar(100))
bulk insert #tmp
from '\\localhost\_RepUploader\FlatFileTest_ANSI.csv'
with
(
    fieldterminator = ';',
    codepage = 1252,
    FIRSTROW = 2    
)
select 'ANSI File', * from #tmp

It seems to me, that the Polybase Connection uses the Schema.ini, because when I try CharacterSet=65001 I get the wrong Character. When I uncomment or use CharacterSet=1252, everything is fine. So how can Polybase select the correct characters.


Solution

  • Our temporary Solution is to stay with our old script

    SELECT * FROM OPENROWSET('MSDASQL','Driver={Microsoft Access Text Driver (*.txt, *.csv)}','select * from \\localhost\_RepUploader\FlatFileTest_ANSI.csv')
    

    Our users upload files under use of an c# uploadtool. This Tool now converts all cp1252 File to cp65001 (without BOM). Now we don't need any CharakterSet Setting in the Schema.INI.

    With this converted UTF-8 File we can use our old Syntax to complete the Server switching without too many changes.

    The Solution from Panagiotis also works without this convert step, but we decided to do this way, because UTF-8 is the better encoding for us. Any CharakterSet Setting in the OPENROWSET seems to be ignored!

    SELECT * FROM OPENROWSET('Microsoft.Ace.OLEDB.12.0', 'Text;Database=\\localhost\_RepUploader\', 'Select * from [FlatFileTest_ANSI.csv]')
    

    In the future we would focus BULK Insert. But in my tests i've had many Problems with no results und much more important no error even if there where data. Because the files we get "have passed through many hands" they are not 100% save designed. So it's better to get an error than not getting any data.