.net unicode mbcs

How to represent Unicode characters in an API


This is more an MBCS question than a Unicode question. I need to create an API that returns a list of structs, each of which holds a Unicode character as one of its members. This is in .NET, so you'd think I'd want UTF-16, but for some Asian characters (those outside the Basic Multilingual Plane) two UTF-16 chars would likely be required. What's the best practice when returning Unicode characters?

  1. Use an array of two UTF-16 chars, and either test the first char to see whether it's a surrogate or carry a count?
  2. Ignore the surrogate issue and leave it to the caller to figure out when a character's encoding spans two structs?
  3. Use a string instead so I don't care if it's one or two chars in length?
  4. Use UTF-32
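
To make the surrogate issue behind these options concrete, here is a minimal sketch. It is written in Python purely for illustration, and the helper names `utf16_units` and `is_high_surrogate` are made up for this example; in .NET the equivalent check would be something like `char.IsHighSurrogate`. It shows that a character outside the Basic Multilingual Plane is one code point but two UTF-16 code units:

```python
def utf16_units(text: str) -> int:
    """Count the UTF-16 code units (what .NET calls a `char`) needed for text."""
    return len(text.encode("utf-16-le")) // 2


def is_high_surrogate(unit: int) -> bool:
    """True if a 16-bit code unit is the first half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDBFF


bmp_char = "\u4e2d"         # U+4E2D, a common CJK ideograph inside the BMP
astral_char = "\U00020000"  # U+20000, a CJK Extension B ideograph outside the BMP

# Read the first 16-bit code unit of the UTF-16 encoding of the astral character.
first_unit = int.from_bytes(astral_char.encode("utf-16-le")[:2], "little")

print(utf16_units(bmp_char))          # one code unit
print(utf16_units(astral_char))       # two code units (a surrogate pair)
print(is_high_surrogate(first_unit))  # the first unit is a high surrogate
```

Options 1 and 2 both expose this split to the caller; options 3 and 4 hide it.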

What do people normally do for UTF-8? I'm guessing they never deal with individual characters and hold everything in a string (for example, searching for a character in a string is really done by looking for a substring). Maybe it's the C++ programmer in me, but a string seems so heavy-handed.

I think I'm going to do #3. What have others done?


Solution

  • You are right about using strings. In Unicode, even a single user-perceived character might require multiple code points (each of which takes a certain number of bytes depending on the encoding), so you can't really work on anything smaller than a string. Even functions like isUpper should take a string and operate only on its first character.

    A character typically requires multiple code points because of combining characters, such as accents.

    See this question in the Unicode FAQ.
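
The combining-character point can be sketched as follows. This is an illustration in Python, not .NET code, and `text_elements` and `is_first_upper` are hypothetical helpers written for this example. The grouping is deliberately simplified (a base code point plus any combining marks that follow it) and does not implement full Unicode grapheme cluster rules; in .NET, `System.Globalization.StringInfo` is the place to look for text-element enumeration.

```python
import unicodedata
from typing import List


def text_elements(text: str) -> List[str]:
    """Group each base code point with the combining marks that follow it.

    Simplified assumption: only combining marks are attached; this is not a
    full grapheme cluster segmentation.
    """
    elements: List[str] = []
    for ch in text:
        if elements and unicodedata.combining(ch) != 0:
            elements[-1] += ch   # attach the mark to the preceding base
        else:
            elements.append(ch)  # start a new element
    return elements


def is_first_upper(text: str) -> bool:
    """An isUpper-style check that takes a string and looks at its first element."""
    first = text_elements(text)[:1]
    return bool(first) and first[0].isupper()


decomposed = "e\u0301tude"  # 'e' + COMBINING ACUTE ACCENT + "tude"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))                 # code points in the decomposed form
print(len(text_elements(decomposed)))  # user-perceived characters
print(len(composed))                   # code points after NFC normalization
print(is_first_upper("E\u0301tude"))   # upper-case check on the first element
```

Returning a string per character, as in option 3, lets the API carry either form without the caller having to know how many code points are involved.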