This is more of an MBCS question than a Unicode question. I need to create an API that returns a list of structs, each of which holds a Unicode character as one of its members. This is in .NET, so you'd think I'd want UTF-16, but then for some Asian characters there would likely be two chars (a surrogate pair) required. What's the best practice when returning Unicode characters?
What do people normally do for UTF-8? I'm guessing they never deal with individual characters and everything is held in a string (for example, searching for a character in a string is really done by looking for a substring). Maybe it's the C++ programmer in me, but a string seems so heavy-handed.
I think I'm going to do #3. What have others done?
You are right about using strings. In Unicode, even a single character might require multiple code points (each of which takes a certain number of bytes depending on the encoding), so you can't really work on anything smaller than a string. Even functions like isUpper or such should take a string and only work on the first element of it.
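To make the "depending on the encoding" part concrete, here is a minimal sketch in Python (chosen only for brevity; the helper name `encoded_sizes` is made up for this illustration). It counts how many code units a single code point needs in UTF-8, UTF-16, and UTF-32. Note that a common CJK character such as U+4E2D still fits in one UTF-16 unit; only code points above U+FFFF need a surrogate pair.

```python
def encoded_sizes(ch: str) -> dict:
    """Count the code units one code point needs in each encoding.

    Hypothetical helper for illustration only.
    """
    return {
        "utf8_bytes": len(ch.encode("utf-8")),
        # The "-le" variants avoid a byte-order mark in the output.
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
        "utf32_units": len(ch.encode("utf-32-le")) // 4,
    }


# ASCII letter, accented Latin letter, common CJK ideograph,
# and a CJK ideograph outside the Basic Multilingual Plane.
for ch in ("A", "\u00e9", "\u4e2d", "\U00020000"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

In .NET terms, the last case is the one where a single `char` is not enough and a `string` (or a 32-bit code point value) is needed.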
The reason a character might require multiple code points is typically combining characters, which add accents and similar marks to a base character.
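As a rough illustration of the combining-character case, the Python sketch below groups each base code point with the combining marks that follow it. The function name `split_user_characters` is invented for this example, and the grouping rule is a deliberate simplification: real grapheme segmentation (Unicode Standard Annex #29) covers more cases, and in .NET `System.Globalization.StringInfo` is the usual place to look for text-element enumeration.

```python
import unicodedata


def split_user_characters(text: str) -> list:
    """Group each base code point with its trailing combining marks.

    Simplified sketch: only looks at the canonical combining class,
    so it does not handle every grapheme cluster rule.
    """
    clusters = []
    for cp in text:
        if clusters and unicodedata.combining(cp) != 0:
            clusters[-1] += cp  # attach the mark to the previous base
        else:
            clusters.append(cp)  # start a new user-perceived character
    return clusters


# "e" followed by U+0301 COMBINING ACUTE ACCENT renders as one character.
decomposed = "e\u0301tude"
print(len(decomposed))                         # number of code points
print(len(split_user_characters(decomposed)))  # number of grouped characters
```

Each element of the returned list is itself a string, which is why an API that hands back "one character" per struct ends up storing a string rather than a single code unit.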
See this question in the Unicode FAQ.