sql-server t-sql unicode supplementary

Cannot produce characters of the Unicode supplementary planes (SC) from their code point


All my attempts to produce supplementary characters (SC) outside the BMP have failed.

I tested in SSMS. The SQL Server version is Microsoft SQL Server 2019 - v15.0.2130.3.

I tried NCHAR() with all code points from 65536 to 100000 and only got NULL. Here is an example with a code point from the Unicode block "GOTHIC" (Wikipedia):

SELECT NCHAR(0x10335)

and it returns NULL.

The documentation for NCHAR() says that producing a supplementary character outside the BMP requires a database collation with the SC flag.

The database collation is Czech_BIN2 and I cannot change it. Using an explicit COLLATE this way:

SELECT NCHAR(0x10335) COLLATE Latin1_General_100_CI_AS_SC

does not work either; it also produces NULL.

I also tried

SELECT CONVERT(NVARCHAR(100), 0x10335)

but this is of course interpreted as two characters and not as one 4-byte UTF-16 character.

My setup can, however, properly process that "GOTHIC" character (it looks like a U) when I copy and paste it from Wikipedia into the query window in SSMS, and

DECLARE @SC_nvc NVARCHAR(2)
SELECT @SC_nvc = N'𐌵'
SELECT @SC_nvc

DECLARE @SC TABLE(SC_vc VARCHAR(4) COLLATE Latin1_General_100_CI_AI_SC_UTF8)
INSERT INTO @SC(SC_vc)
SELECT N'𐌵' 
SELECT SC_vc FROM @SC

works as expected.

Is there a way to produce supplementary characters using NCHAR() without changing the database collation?

Is there any other way to transform a given codepoint for a supplementary character into that character?

What are the ways to store such a character? It seems a VARCHAR variable cannot be made to hold an SC in UTF-8.

Ultimately, I would like to be able to produce any SC from its code point and store it in a query field, table column or variable with a UTF-8 collation such as Latin1_General_100_CI_AI_SC_UTF8 or German_PhoneBook_100_CI_AI_SC_UTF8, without depending on the database collation.

Thanks for your help.


Solution

  • The SQL Server NCHAR documentation is quite clear that the function behaves differently depending on whether or not the database collation supports supplementary characters.

    The documentation also shows an "Old style method" alongside the "Preferred method" (using a database collation with SC support).

    For this method you would need to figure out the surrogate pair values for the code point.

    The website here contains the following algorithm

    Conversion of a Unicode scalar value S to a surrogate pair <H, L>:

    H = (S - 0x10000) / 0x400 + 0xD800

    L = (S - 0x10000) % 0x400 + 0xDC00

    where the operator “/” is ... “integer division (rounded down),” and “%” is the “modulo operation”.

    So in T-SQL this would be:

    DECLARE @ScCodePoint INT = 0x10335;
    
    SELECT NCHAR((@ScCodePoint - 0x10000) / 0x400 + 0xD800)
           + NCHAR((@ScCodePoint - 0x10000) % 0x400 + 0xDC00); 
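
    As a worked check for the example code point: S - 0x10000 = 0x335, so H = 0x335 / 0x400 + 0xD800 = 0xD800 and L = 0x335 % 0x400 + 0xDC00 = 0xDF35. The following is a minimal sketch, not taken from the documentation, that compares the computed pair with the character pasted as a literal. The variable name @ScChar and the column aliases are illustrative assumptions.

    ```sql
    DECLARE @ScCodePoint INT = 0x10335;

    -- Build the surrogate pair with the same formula as above.
    -- @ScChar is a hypothetical variable name used only for this check.
    DECLARE @ScChar NVARCHAR(2) =
          NCHAR((@ScCodePoint - 0x10000) / 0x400 + 0xD800)
        + NCHAR((@ScCodePoint - 0x10000) % 0x400 + 0xDC00);

    -- Show the raw UTF-16 bytes of the computed value next to those of the
    -- pasted literal, and compare the two strings directly.
    SELECT CONVERT(VARBINARY(4), @ScChar) AS ComputedBytes,
           CONVERT(VARBINARY(4), N'𐌵')    AS LiteralBytes,
           CASE WHEN @ScChar = N'𐌵' THEN 1 ELSE 0 END AS IsSameString;
    ```

    Since NVARCHAR data is stored as little-endian UTF-16, both byte columns should read 0x00D835DF if the pair was built correctly.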
    

    Regarding your VARCHAR question: VARCHAR variables again take the default collation of the database. In your case this is Czech_BIN2, so they are restricted to the 256 characters supported by code page 1250.

    So to support characters outside of this set, you would need to use nvarchar parameters/variables in that database context (though assigning them to a varchar column whose collation has been explicitly defined as one supporting UTF-8 works fine).
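
    Putting the two points together, a sketch along the following lines builds the character in an nvarchar variable and then stores it in a varchar column with an explicit UTF-8 collation. It reuses the table variable shape from the question; the names @ScChar, @Target and Utf8Bytes are assumptions made for illustration.

    ```sql
    DECLARE @ScCodePoint INT = 0x10335;

    -- nvarchar variable: holds UTF-16, so it does not depend on the
    -- code page of the database collation (Czech_BIN2 here).
    DECLARE @ScChar NVARCHAR(2) =
          NCHAR((@ScCodePoint - 0x10000) / 0x400 + 0xD800)
        + NCHAR((@ScCodePoint - 0x10000) % 0x400 + 0xDC00);

    -- varchar column with an explicitly declared UTF-8 collation,
    -- as in the question; 4 bytes are enough for one supplementary character.
    DECLARE @Target TABLE
        (SC_vc VARCHAR(4) COLLATE Latin1_General_100_CI_AI_SC_UTF8);

    -- The nvarchar value is converted to UTF-8 on assignment to the column.
    INSERT INTO @Target (SC_vc) VALUES (@ScChar);

    -- Return the stored character together with its raw bytes.
    SELECT SC_vc,
           CONVERT(VARBINARY(4), SC_vc) AS Utf8Bytes
    FROM @Target;
    ```

    The UTF-8 encoding of U+10335 is F0 90 8C B5, so Utf8Bytes would be expected to show 0xF0908CB5 if the value was stored without loss.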