I've recently been reading up on the UTF-8 variable-width encoding, and I found it strange that UTF-8 specifies the first two bits of every continuation byte to be 10.
Range | Encoding
-----------------+-----------------
0 - 7f | 0xxxxxxx
80 - 7ff | 110xxxxx 10xxxxxx
800 - ffff | 1110xxxx 10xxxxxx 10xxxxxx
10000 - 10ffff | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
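For concreteness, here is a rough sketch of how a code point gets packed into those bit patterns (the function name is mine, and it ignores the surrogate range):

```python
def utf8_encode(cp: int) -> bytes:
    """Pack a code point into the patterns from the table above."""
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")

assert utf8_encode(0x20AC) == "\u20ac".encode("utf-8")  # € -> e2 82 ac
```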
I was playing around with other possible variable-width encodings, and found that with the following scheme, at most 3 bytes are needed to store all of Unicode. If the first bit of a byte is 1, then at least one more byte follows (keep reading until you hit a byte whose first bit is 0).
Range | Encoding
-----------------+-----------------
0 - 7f | 0xxxxxxx
80 - 407f | 1xxxxxxx 0xxxxxxx
4080 - 20407f | 1xxxxxxx 1xxxxxxx 0xxxxxxx
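For comparison, here is a minimal sketch of that scheme, assuming the ranges in the table imply an offset (bias) so each length starts where the previous one ends; the function name is mine:

```python
def alt_encode(cp: int) -> bytes:
    """Encode per the proposed scheme: top bit 1 means another byte follows."""
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x407F:
        v = cp - 0x80                  # bias so 0x80 is the smallest 2-byte value
        return bytes([0x80 | (v >> 7), v & 0x7F])
    if cp <= 0x20407F:
        v = cp - 0x4080                # bias so 0x4080 is the smallest 3-byte value
        return bytes([0x80 | (v >> 14), 0x80 | ((v >> 7) & 0x7F), v & 0x7F])
    raise ValueError("code point out of range")

assert len(alt_encode(0x10FFFF)) == 3  # the highest code point fits in 3 bytes
```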
Are the continuation bits in UTF-8 really that important? The second encoding seems much more efficient.
UTF-8 is self-validating, fast to step forward through, and easier to step backward through.
Self-validating: Since the first byte in the sequence specifies the length, the next X bytes must match 10xxxxxx, or you have an invalid sequence. A 10xxxxxx byte seen on its own is immediately recognizable as invalid. Your suggested encoding has no validation built in.
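For example, a strict decoder can reject malformed input on sight. A rough sketch (helper names are mine; it ignores overlong encodings and surrogates):

```python
def utf8_expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it cannot start a sequence."""
    if lead < 0x80: return 1
    if lead < 0xC0: return 0        # a lone 10xxxxxx byte is never a lead byte
    if lead < 0xE0: return 2
    if lead < 0xF0: return 3
    if lead < 0xF8: return 4
    return 0

def utf8_sequence_is_valid(seq: bytes) -> bool:
    n = utf8_expected_length(seq[0])
    # length must match, and every trailing byte must fit 10xxxxxx
    return n == len(seq) and all(b & 0xC0 == 0x80 for b in seq[1:])
```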
Fast to step forward: If you have to skip a character, you can immediately skip X bytes as determined by the first byte, without having to examine each intermediate byte.
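A sketch of that forward skip, assuming well-formed input (the function name is mine):

```python
def utf8_next(buf: bytes, i: int) -> int:
    """Index of the next code point; only buf[i] is examined."""
    lead = buf[i]
    if lead < 0x80: return i + 1     # 0xxxxxxx
    if lead < 0xE0: return i + 2     # 110xxxxx (assumes valid input)
    if lead < 0xF0: return i + 3     # 1110xxxx
    return i + 4                     # 11110xxx
```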
Easier to step backward: If you have to read the bytes backwards, you can immediately recognize a continuation byte by its 10xxxxxx pattern. You can then scan backwards past the 10xxxxxx bytes to the 11xxxxxx lead byte, with no risk of scanning past it.
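Sketched the same way (again, the name is mine and the input is assumed well-formed):

```python
def utf8_prev(buf: bytes, i: int) -> int:
    """Start index of the code point that ends just before index i."""
    j = i - 1
    while j > 0 and buf[j] & 0xC0 == 0x80:   # step back over 10xxxxxx bytes
        j -= 1
    return j                                  # now at a lead byte (or index 0)
```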
See UTF-8 § Invalid sequences and error handling on Wikipedia.