
Why does UTF-32 exist when only 21 bits are needed to encode every character?


We know that code points fall in the range 0..10FFFF, which is less than 2^21. So why do we need UTF-32 when every code point can be represented in 3 bytes? A hypothetical UTF-24 should be enough.
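
A quick sanity check (Python here purely for illustration) confirms the arithmetic behind the question:

```python
# Highest Unicode code point is U+10FFFF.
max_cp = 0x10FFFF

print(max_cp.bit_length())        # 21 -> fits in 21 bits
print(max_cp < 2**21)             # True
print(max_cp.to_bytes(3, "big"))  # b'\x10\xff\xff' -> fits in 3 bytes
```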


Solution

  • Computers are generally much better at dealing with data on 4-byte boundaries. The benefit in reduced memory consumption is relatively small compared with the pain of working on 3-byte boundaries (see the sketch after this answer).

    (I speculate there was also a reluctance, when the original design was drawn up, to pick a limit that was "only what we can currently imagine being useful". After all, that approach has caused a lot of problems in the past, e.g. with IPv4. While I can't see us ever needing more than 24 bits, if 32 bits is more convenient anyway then it seems reasonable to avoid a limit which might one day be hit, e.g. via reserved ranges.)

    I guess this is a bit like asking why we often have 8-bit, 16-bit, 32-bit and 64-bit integer datatypes (byte, int, long, whatever) but not 24-bit ones. I'm sure there are lots of occasions where we know that a number will never go beyond 2^21, but it's just simpler to use int than to create a 24-bit type.
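
To make the alignment point concrete, here is a minimal sketch comparing UTF-32's fixed 4-byte code units with a hypothetical packed 3-byte encoding. The `encode_utf24` / `nth_code_point_utf24` helpers are made up for illustration; only the UTF-32 side corresponds to a real encoding (UTF-32LE):

```python
import struct

def encode_utf32(code_points):
    # UTF-32 (little-endian, no BOM): each code point is one aligned 4-byte unit.
    return b"".join(struct.pack("<I", cp) for cp in code_points)

def encode_utf24(code_points):
    # Hypothetical "UTF-24": each code point packed into 3 bytes.
    return b"".join(cp.to_bytes(3, "little") for cp in code_points)

def nth_code_point_utf32(buf, n):
    # Fixed 4-byte units: the n-th code point is a normal aligned 32-bit read.
    return struct.unpack_from("<I", buf, 4 * n)[0]

def nth_code_point_utf24(buf, n):
    # 3-byte units: still O(1) arithmetic, but every read is an odd-sized,
    # unaligned load that most hardware and languages have no native type for.
    return int.from_bytes(buf[3 * n : 3 * n + 3], "little")

text = [ord(c) for c in "héllo✓"]
u32 = encode_utf32(text)
u24 = encode_utf24(text)

assert nth_code_point_utf32(u32, 5) == nth_code_point_utf24(u24, 5) == ord("✓")
print(len(u32), len(u24))  # 24 vs 18 bytes: ~25% saved, at the cost of awkward access
```

The saving is only one byte in four, while every access to the 3-byte form has to be assembled and disassembled by hand, which is essentially the trade-off described above.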