sql-server

Detect character encoding of file in SQL Server


I'm writing a program to read the contents of text files on my SQL Server instance. Usually a .log or .txt file is non-Unicode, so SINGLE_CLOB would be used in the query:

SELECT BulkColumn 
FROM OPENROWSET(BULK N'C:\Temp\sample_log_file.log', SINGLE_CLOB) AS Contents;

However, every once in a while I'll stumble across a file that has some Unicode characters in it. Hence, I would have to write the query with SINGLE_NCLOB:

SELECT BulkColumn 
FROM OPENROWSET(BULK N'C:\Temp\sample_text_file.txt', SINGLE_NCLOB) AS Contents;

If I use SINGLE_NCLOB on a file that isn't actually Unicode, I get the error:

SINGLE_NCLOB requires a UNICODE (widechar) input file. The file specified is not Unicode.

These are not ad hoc queries or T-SQL; I'm running a program. So my question is, how can I detect the character encoding of a file before trying to parse it in a query? Or really, how do I check whether a file is Unicode or non-Unicode (ASCII)? It seems simple, but I can't find anything.


Solution

  • There is no foolproof way, but with some logic you can make an educated guess.

    By examining the first few bytes of the file in binary mode, you can look for a Byte Order Mark (BOM), which identifies some encodings with high confidence: FF FE for UTF-16LE (little-endian), FE FF for UTF-16BE (big-endian), and (unofficially) EF BB BF for UTF-8 are the most common. However, a BOM is optional, so this technique is only a partial solution.

    Beyond that, you can use pattern matching to make educated guesses:

      • Alternating (mostly) zero and non-zero bytes likely indicates UTF-16LE or UTF-16BE.
      • All bytes <= 7E usually means plain ASCII or some national variant.
      • If bytes in the range 80-BF are present and each is preceded either by another byte in 80-BF or by a byte in C0-FF, that probably means UTF-8 (additional checks may be needed).
      • Otherwise, you may have some single-byte extended ASCII encoding, such as Windows-1252 or code page 850, among many other possibilities.

    Such tests are not guaranteed. In particular, languages using non-Latin alphabets make encoding detection more difficult.

    As for the implementation details, I believe you can use the SINGLE_BLOB option to get the raw data as VARBINARY(MAX). After that, you can extract individual bytes or byte sequences with SUBSTRING() for analysis. For example:

    SUBSTRING(@blob, 1, 2) = 0xFFFE    -- UTF-16LE BOM
    SUBSTRING(@blob, 1, 3) = 0xEFBBBF  -- UTF-8 BOM
    
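    Put together, a minimal sketch of the BOM check might look like the following (the @blob variable name is just a placeholder, and the path is the sample path from the question; adjust both as needed):

    -- Read the raw bytes of the file and test for a Byte Order Mark
    DECLARE @blob VARBINARY(MAX);

    SELECT @blob = BulkColumn
    FROM OPENROWSET(BULK N'C:\Temp\sample_log_file.log', SINGLE_BLOB) AS Contents;

    SELECT CASE
             WHEN SUBSTRING(@blob, 1, 2) = 0xFFFE   THEN 'UTF-16LE (widechar; use SINGLE_NCLOB)'
             WHEN SUBSTRING(@blob, 1, 2) = 0xFEFF   THEN 'UTF-16BE'
             WHEN SUBSTRING(@blob, 1, 3) = 0xEFBBBF THEN 'UTF-8 with BOM'
             ELSE 'No BOM found - apply further heuristics'
           END AS ProbableEncoding;
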

    The additional pattern matching logic would be much more involved, requiring scanning the entire blob to gather statistics and (for UTF-8) to check for valid multi-byte sequences.
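
    To give a flavor of that, here is a rough, set-based sketch that gathers simple byte statistics from the same @blob variable declared above. The Numbers CTE built from sys.all_objects is just an ad hoc tally for illustration, and the interpretation at the end is my own rule of thumb, not a tested rule:

    -- Count zero bytes and "high" bytes (>= 0x80) across the whole blob
    DECLARE @len INT = DATALENGTH(@blob);

    WITH Numbers AS (
        -- Ad hoc tally of 1..@len; covers files up to a few MB,
        -- swap in a proper numbers table for larger files
        SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
        FROM sys.all_objects AS a
        CROSS JOIN sys.all_objects AS b
    )
    SELECT
        SUM(CASE WHEN SUBSTRING(@blob, n, 1) = 0x00 THEN 1 ELSE 0 END) AS ZeroBytes,
        SUM(CASE WHEN SUBSTRING(@blob, n, 1) >= 0x80 THEN 1 ELSE 0 END) AS HighBytes,
        COUNT(*) AS TotalBytes
    FROM Numbers
    WHERE n <= @len;

    -- Rough interpretation:
    --   many zero bytes               -> probably UTF-16 (LE if the zeros sit in even positions)
    --   no zero bytes, no high bytes  -> plain ASCII
    --   high bytes but no zero bytes  -> UTF-8 or a single-byte extended ASCII code page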