x86 · language-design · language-implementation

Unicode strings in process memory


What is the preferred format for Unicode strings in memory while they are being processed, and why?

I am implementing a programming language by producing an executable file image for it. Obviously, a working implementation requires a protocol for processing strings.

I've thought about using dynamic arrays as the basis for strings because they are simple to implement and efficient for short strings. I just have no idea what the best character format is when strings are handled this way.
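For concreteness, here is a minimal sketch of such a dynamic-array string in C; the String struct and string_append helper are hypothetical names, and the element type would change with whatever character encoding is chosen:

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical layout: a length, a capacity, and a heap buffer of
     * code units. The element type (char here) would become uint16_t
     * for UTF-16, or stay as bytes for UTF-8. */
    typedef struct {
        size_t len;   /* code units in use    */
        size_t cap;   /* code units allocated */
        char  *data;  /* not NUL-terminated   */
    } String;

    /* Append n code units, doubling the buffer when it runs out;
     * amortized O(1) per unit, which is what makes dynamic arrays
     * cheap for short, incrementally built strings. */
    static int string_append(String *s, const char *src, size_t n) {
        if (s->len + n > s->cap) {
            size_t cap = s->cap ? s->cap : 8;
            while (cap < s->len + n) cap *= 2;
            char *p = realloc(s->data, cap);
            if (p == NULL) return -1;
            s->data = p;
            s->cap = cap;
        }
        memcpy(s->data + s->len, src, n);
        s->len += n;
        return 0;
    }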


Solution

  • UTF-16 is the most widely used in-memory format; Windows, Java, JavaScript, and .NET all use it for their native string types.

    The advantage of UTF-16 over UTF-8 is that, despite being less compact, every character in the Basic Multilingual Plane has a constant size of 2 bytes (16 bits). As long as you never step outside the BMP you never need surrogate pairs, and the encoding degenerates to its fixed-width subset, UCS-2 (see the surrogate sketch after this list).

    In UTF-8 only the ASCII range (U+0000 to U+007F) is encoded in a single byte; every other character takes 2 to 4 bytes. This variable length makes character processing less direct and more error-prone (see the lead-byte sketch after this list).

    Of course, using Unicode is preferred in either case, since it lets you handle international text.
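To make the UTF-16 trade-off concrete, here is a minimal sketch in C of decoding one code point; utf16_decode is a hypothetical helper, and it assumes the input is well-formed (a high surrogate is always followed by a low one):

    #include <stddef.h>
    #include <stdint.h>

    /* Inside the BMP (the UCS-2 subset) one 16-bit unit is one
     * character. Outside it, a high surrogate (0xD800-0xDBFF) pairs
     * with a low surrogate (0xDC00-0xDFFF) to encode U+10000..U+10FFFF. */
    static uint32_t utf16_decode(const uint16_t *s, size_t *i) {
        uint16_t u = s[(*i)++];
        if (u >= 0xD800 && u <= 0xDBFF) {   /* high surrogate */
            uint16_t lo = s[(*i)++];        /* assumed valid low surrogate */
            return 0x10000 +
                (((uint32_t)(u - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
        }
        return u;                           /* BMP: one unit, fixed width */
    }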
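And a sketch of why UTF-8 processing is less direct: the length of each sequence must be read off its lead byte before the character can be consumed (utf8_seq_len is likewise a hypothetical name):

    #include <stdint.h>

    /* UTF-8 lead bytes: 0xxxxxxx = 1 byte (ASCII), 110xxxxx = 2,
     * 1110xxxx = 3, 11110xxx = 4; continuation bytes are 10xxxxxx.
     * The variable length is what makes indexing the n-th character
     * O(n) instead of O(1). */
    static int utf8_seq_len(uint8_t lead) {
        if (lead < 0x80)           return 1;  /* U+0000..U+007F    */
        if ((lead & 0xE0) == 0xC0) return 2;  /* U+0080..U+07FF    */
        if ((lead & 0xF0) == 0xE0) return 3;  /* U+0800..U+FFFF    */
        if ((lead & 0xF8) == 0xF0) return 4;  /* U+10000..U+10FFFF */
        return -1;                            /* continuation/invalid lead */
    }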