I have heard many times, when reading about Unicode, that UTF-32 is a fixed width encoding.
Taking fixed width encoding to mean "a code which maps source symbols to a set number of bits," and, assuming that the source symbols in question are Unicode code points, this all makes sense. However, if you think of the underlying language of source symbols being graphemes, things get a lot more complicated.
So my question is this: in the sense of graphemes, is UTF-32 truly a fixed length encoding? And if not, is there a possible fixed length encoding in that sense?
One of the comments referenced Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) article, which was written in 2003. At the time, it served as a wake-up call (it probably still does in some places). However, it is not without its (minor, but significant) technical problems — though the overall thesis ('you need to know about Unicode, and you need to know which encoding a string is in') remains valid. The comment then continued:
And yes, UTF-16 and UTF-32 are both fixed width. UTF-8 … isn't.
UTF-16 isn't really fixed width: some Unicode code points are one 16-bit code unit, while others require two 16-bit code units (a surrogate pair). Likewise, UTF-8 isn't fixed width: some Unicode code points require one 8-bit code unit, others require two, three or even four 8-bit code units (but not five or six, despite the comment in Joel's article that mentions the possibility). UTF-32, on the other hand, is fixed width; all Unicode code points can be encoded in a single 32-bit code unit. (Indeed, the maximum possible Unicode code point is U+10FFFF, so Unicode is a 21-bit code set, though it does not use all possible combinations of 21 bits.)
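The difference in code unit counts can be checked with a few lines of Python. This is only an illustrative sketch: `code_units` is a made-up helper name, and the little-endian codec variants are used so that no byte order mark is counted.

```python
def code_units(ch: str) -> dict:
    """Return how many code units one code point needs in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }

# One sample code point for each UTF-8 length: 1, 2, 3 and 4 bytes.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

Only the UTF-32 column stays at 1 for every code point; the last sample (outside the Basic Multilingual Plane) needs four UTF-8 code units and two UTF-16 code units.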
However, code points are not identical to characters, let alone graphemes. The Unicode FAQ has a section on Characters and Combining Marks that discusses graphemes, referencing the glossary definition.
The better word for what end-users think of as characters is grapheme (as defined in the Unicode glossary): a minimally distinctive unit of writing in the context of a particular writing system.
Graphemes are not necessarily combining character sequences, and combining character sequences are not necessarily graphemes.
Q: How are characters counted when measuring the length or position of a character in a string?
A: Computing the length or position of a "character" in a Unicode string can be a little complicated, as there are four different approaches to doing so, plus the potential confusion caused by combining characters. The correct choice of which counting method to use depends on what is being counted and what the count or position is used for.
To address the question here:
If you mean that it can take multiple Unicode code points to get a complete character (grapheme) with its associated diacritics (combining marks, etc.), then yes: even UTF-32 isn't fixed width in that sense, and there is no fixed width encoding for Unicode.
UTF-32 uses one fixed-width code unit for each Unicode code point, but since it can take multiple code points to create a complete grapheme, even UTF-32 does not have a 1:1 mapping between code units and graphemes.
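As a rough illustration, the following Python sketch groups each base character with the marks that follow it and reports how much UTF-32 data each group takes. `rough_clusters` is a hypothetical helper, and attaching every code point in a "Mark" general category to the preceding character is a deliberate simplification: real grapheme cluster segmentation (Unicode Standard Annex #29) has more rules, covering things such as Hangul jamo and emoji joiner sequences, and needs a dedicated library.

```python
import unicodedata

def rough_clusters(text: str) -> list:
    """Approximate grapheme clusters: a character plus any following marks."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining/enclosing marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

sample = "e\u0301x"  # 'e' + COMBINING ACUTE ACCENT, followed by plain 'x'
for cluster in rough_clusters(sample):
    size = len(cluster.encode("utf-32-le"))
    print(f"{len(cluster)} code point(s), {size} bytes in UTF-32")
```

The first cluster takes 8 bytes and the second takes 4, even though both are what a reader would call a single character.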
Of course, you can also find interesting character stacks in some comments on SO. For example:
@̮̘̮̜̤͓͓̓ͪ̓͆͗̑Ṷ̫̠̤̙̻͚̗ͭs̹͓̰̫͉̲̺̈̏̽̅̑ͩ̇̓̉e͖̝̦̦̿r͔̒̿̋̂̓n̹͖̥ͥͦͤ̍͊̏ä͇͖͚͖̃̎͊m̭͇̂͆͋̋͒e̫̠͇̰̱̦̹͗͋̓̿͒ ͔͖̫̬̗̪̪̳ͧ̄ͫB̜̥̣̬̮͈͒̄ͪ͊l̮͉̣̟̪̪̿̍ͫ͋͐̑a̜̦̪͗͗̈́ͣ͊ḫ̘̯͈̠̞͒ͯ ̣͕͚̗̠͖̫̆͌͒̓͛b̖̣͇̖̦̃̑ͬͭͥl͔͍͚͕̲̪̼͎ͧ̇̏ạ̖̪͚̯̊ͤͣͦͮ̌h̘͓͔̟͔͍̏ͣͦ̓̓ ̫̼̫ͮ͌̄ͤ̿̈͆b̙͍̼̜͍̹̬̬͎ͥ̓ͯ̂ḽ̜̟̲̾̅̆ͦ̃ͨa͇̰̝̺͊ͧͫ͛h̯̻͉̉̒̉̈́́ͥ̀.̖̩̭͇̭͔̹̈́̇͐ͬͦͦͨ̾̇.͍̪̣͂ͬ.̞͍̥̪̺̤̣̜͆ͫ̈́͑ͦ͂͑͑
Why/how do "Zalgo pings" work?
Ȩ̸҉̟͎͚̹͚̙̟̖x̨͙̰͕̖͉̼̜̲̦̟͈́ͅͅą̷̘͕͈̹͓̣̮̼̣̠̹́c̼͙̠̭̫̰͈͍̮͢͡ţ̢̛̠͇̬̖̟̺͈̲̻̣̲͙͈̼͍̘̱ͅl̶̷̨̲͙͖̻̲̗̦͚͙̮͘͠y̭̖̰͚̞̣̗̳̠͕̻̼͡ͅ!̛͖̮͔͍̰͉͢ ̭̙̖͔̩̗̠͕̦̬͓͞͝ͅO҉҉̣̜̺̪̳͕̖͔̠͙͎͕̙̦ͅn̩͓͖̝̟̭͙͙͓͚̼͖͖͜͞ȩ̧̬̱̦̠̙̥͇͔̪́ ҉̸̗̦͇̰̪̰̭̘̹͘͢i̴͞͏̩̤̹̗̖̰͎̖̲̲̘͓̗̯͚̞͖̥̻͝s͞҉̲͈̙̹̤̫͇ ͚̭͎͉̠̺͉̮̞̻̣̰̺̖͖̀́͢͞e̷̪̭̯̼͓͎̹̠͖̲͔̪͈̦͈̱͍̭̩͠ņ͞҉̮̳͓͙͈̼͉̬͕͈̺͈̭̩̪o͇̗̱̠̱̠̯̕͢u̸̳̦̩̳̫̖̜ͅǵ̢̲̣͎̮̮̼̫̥̠͙̱̝̘͕͎̳̜̲̖h̸̛̩͚̮̤̖̹͙.̶̨̳̖̠̗̼̩͕͇͉͓̟̦͜͞ͅ
What you see, of course, depends on the quality of the Unicode support in your browser (which, in turn, depends in part on the quality of the O/S support). I get to see different results on two different Macs running rather different versions of Firefox, even though they're running the same base O/S version (10.10.4 Yosemite).
The second of those examples can be decoded from UTF-8 into the following sequence of Unicode code points — it is only 700 bytes on disk:
0xC8 0xA8 = U+0228
0xCC 0xB8 = U+0338
0xD2 0x89 = U+0489
0xCC 0x9F = U+031F
0xCD 0x8E = U+034E
0xCD 0x9A = U+035A
0xCC 0xB9 = U+0339
0xCD 0x9A = U+035A
0xCC 0x99 = U+0319
0xCC 0x9F = U+031F
0xCC 0x96 = U+0316
0x78 = U+0078
0xCC 0xA8 = U+0328
0xCD 0x99 = U+0359
0xCC 0xB0 = U+0330
0xCD 0x95 = U+0355
0xCC 0x96 = U+0316
0xCD 0x89 = U+0349
0xCC 0xBC = U+033C
0xCC 0x9C = U+031C
0xCC 0xB2 = U+0332
0xCC 0xA6 = U+0326
0xCC 0x9F = U+031F
0xCD 0x88 = U+0348
0xCC 0x81 = U+0301
0xCD 0x85 = U+0345
0xCD 0x85 = U+0345
0xC4 0x85 = U+0105
0xCC 0xB7 = U+0337
0xCC 0x98 = U+0318
0xCD 0x95 = U+0355
0xCD 0x88 = U+0348
0xCC 0xB9 = U+0339
0xCD 0x93 = U+0353
0xCC 0xA3 = U+0323
0xCC 0xAE = U+032E
0xCC 0xBC = U+033C
0xCC 0xA3 = U+0323
0xCC 0xA0 = U+0320
0xCC 0xB9 = U+0339
0xCC 0x81 = U+0301
0x63 = U+0063
0xCC 0xBC = U+033C
0xCD 0x99 = U+0359
0xCC 0xA0 = U+0320
0xCC 0xAD = U+032D
0xCC 0xAB = U+032B
0xCC 0xB0 = U+0330
0xCD 0x88 = U+0348
0xCD 0x8D = U+034D
0xCC 0xAE = U+032E
0xCD 0xA2 = U+0362
0xCD 0xA1 = U+0361
0xC5 0xA3 = U+0163
0xCC 0xA2 = U+0322
0xCC 0x9B = U+031B
0xCC 0xA0 = U+0320
0xCD 0x87 = U+0347
0xCC 0xAC = U+032C
0xCC 0x96 = U+0316
0xCC 0x9F = U+031F
0xCC 0xBA = U+033A
0xCD 0x88 = U+0348
0xCC 0xB2 = U+0332
0xCC 0xBB = U+033B
0xCC 0xA3 = U+0323
0xCC 0xB2 = U+0332
0xCD 0x99 = U+0359
0xCD 0x88 = U+0348
0xCC 0xBC = U+033C
0xCD 0x8D = U+034D
0xCC 0x98 = U+0318
0xCC 0xB1 = U+0331
0xCD 0x85 = U+0345
0x6C = U+006C
0xCC 0xB6 = U+0336
0xCD 0x98 = U+0358
0xE2 0x80 0x8C = U+200C
0xE2 0x80 0x8B = U+200B
0xCC 0xB7 = U+0337
0xCC 0xA8 = U+0328
0xCC 0xB2 = U+0332
0xCD 0x99 = U+0359
0xCD 0x96 = U+0356
0xCC 0xBB = U+033B
0xCC 0xB2 = U+0332
0xCC 0x97 = U+0317
0xCC 0xA6 = U+0326
0xCD 0x9A = U+035A
0xCD 0x99 = U+0359
0xCC 0xAE = U+032E
0xCD 0xA0 = U+0360
0x79 = U+0079
0xCC 0xAD = U+032D
0xCC 0x96 = U+0316
0xCC 0xB0 = U+0330
0xCD 0x9A = U+035A
0xCC 0x9E = U+031E
0xCC 0xA3 = U+0323
0xCC 0x97 = U+0317
0xCC 0xB3 = U+0333
0xCC 0xA0 = U+0320
0xCD 0x95 = U+0355
0xCC 0xBB = U+033B
0xCC 0xBC = U+033C
0xCD 0xA1 = U+0361
0xCD 0x85 = U+0345
0x21 = U+0021
0xCC 0x9B = U+031B
0xCD 0x96 = U+0356
0xCC 0xAE = U+032E
0xCD 0x94 = U+0354
0xCD 0x8D = U+034D
0xCC 0xB0 = U+0330
0xCD 0x89 = U+0349
0xCD 0xA2 = U+0362
0x20 = U+0020
0xCC 0xAD = U+032D
0xCC 0x99 = U+0319
0xCC 0x96 = U+0316
0xCD 0x94 = U+0354
0xCC 0xA9 = U+0329
0xCC 0x97 = U+0317
0xCC 0xA0 = U+0320
0xCD 0x95 = U+0355
0xCC 0xA6 = U+0326
0xCC 0xAC = U+032C
0xCD 0x93 = U+0353
0xCD 0x9E = U+035E
0xCD 0x9D = U+035D
0xCD 0x85 = U+0345
0x4F = U+004F
0xD2 0x89 = U+0489
0xD2 0x89 = U+0489
0xCC 0xA3 = U+0323
0xCC 0x9C = U+031C
0xCC 0xBA = U+033A
0xCC 0xAA = U+032A
0xCC 0xB3 = U+0333
0xCD 0x95 = U+0355
0xCC 0x96 = U+0316
0xCD 0x94 = U+0354
0xCC 0xA0 = U+0320
0xCD 0x99 = U+0359
0xCD 0x8E = U+034E
0xCD 0x95 = U+0355
0xCC 0x99 = U+0319
0xCC 0xA6 = U+0326
0xCD 0x85 = U+0345
0x6E = U+006E
0xCC 0xA9 = U+0329
0xCD 0x93 = U+0353
0xCD 0x96 = U+0356
0xCC 0x9D = U+031D
0xCC 0x9F = U+031F
0xCC 0xAD = U+032D
0xCD 0x99 = U+0359
0xCD 0x99 = U+0359
0xCD 0x93 = U+0353
0xCD 0x9A = U+035A
0xCC 0xBC = U+033C
0xCD 0x96 = U+0356
0xCD 0x96 = U+0356
0xCD 0x9C = U+035C
0xCD 0x9E = U+035E
0xC8 0xA9 = U+0229
0xCC 0xA7 = U+0327
0xCC 0xAC = U+032C
0xCC 0xB1 = U+0331
0xCC 0xA6 = U+0326
0xCC 0xA0 = U+0320
0xCC 0x99 = U+0319
0xCC 0xA5 = U+0325
0xCD 0x87 = U+0347
0xCD 0x94 = U+0354
0xCC 0xAA = U+032A
0xCC 0x81 = U+0301
0x20 = U+0020
0xD2 0x89 = U+0489
0xCC 0xB8 = U+0338
0xCC 0x97 = U+0317
0xCC 0xA6 = U+0326
0xCD 0x87 = U+0347
0xCC 0xB0 = U+0330
0xCC 0xAA = U+032A
0xCC 0xB0 = U+0330
0xCC 0xAD = U+032D
0xCC 0x98 = U+0318
0xCC 0xB9 = U+0339
0xCD 0x98 = U+0358
0xCD 0xA2 = U+0362
0x69 = U+0069
0xCC 0xB4 = U+0334
0xCD 0x9E = U+035E
0xCD 0x8F = U+034F
0xCC 0xA9 = U+0329
0xCC 0xA4 = U+0324
0xCC 0xB9 = U+0339
0xCC 0x97 = U+0317
0xCC 0x96 = U+0316
0xCC 0xB0 = U+0330
0xCD 0x8E = U+034E
0xCC 0x96 = U+0316
0xCC 0xB2 = U+0332
0xCC 0xB2 = U+0332
0xCC 0x98 = U+0318
0xCD 0x93 = U+0353
0xCC 0x97 = U+0317
0xCC 0xAF = U+032F
0xCD 0x9A = U+035A
0xCC 0x9E = U+031E
0xCD 0x96 = U+0356
0xCC 0xA5 = U+0325
0xCC 0xBB = U+033B
0xCD 0x9D = U+035D
0x73 = U+0073
0xCD 0x9E = U+035E
0xD2 0x89 = U+0489
0xCC 0xB2 = U+0332
0xCD 0x88 = U+0348
0xCC 0x99 = U+0319
0xCC 0xB9 = U+0339
0xCC 0xA4 = U+0324
0xCC 0xAB = U+032B
0xCD 0x87 = U+0347
0x20 = U+0020
0xCD 0x9A = U+035A
0xCC 0xAD = U+032D
0xCD 0x8E = U+034E
0xCD 0x89 = U+0349
0xCC 0xA0 = U+0320
0xCC 0xBA = U+033A
0xCD 0x89 = U+0349
0xCC 0xAE = U+032E
0xCC 0x9E = U+031E
0xCC 0xBB = U+033B
0xCC 0xA3 = U+0323
0xCC 0xB0 = U+0330
0xCC 0xBA = U+033A
0xCC 0x96 = U+0316
0xCD 0x96 = U+0356
0xCC 0x80 = U+0300
0xCC 0x81 = U+0301
0xCD 0xA2 = U+0362
0xCD 0x9E = U+035E
0x65 = U+0065
0xCC 0xB7 = U+0337
0xCC 0xAA = U+032A
0xCC 0xAD = U+032D
0xCC 0xAF = U+032F
0xCC 0xBC = U+033C
0xCD 0x93 = U+0353
0xCD 0x8E = U+034E
0xCC 0xB9 = U+0339
0xCC 0xA0 = U+0320
0xCD 0x96 = U+0356
0xCC 0xB2 = U+0332
0xCD 0x94 = U+0354
0xCC 0xAA = U+032A
0xCD 0x88 = U+0348
0xCC 0xA6 = U+0326
0xCD 0x88 = U+0348
0xCC 0xB1 = U+0331
0xCD 0x8D = U+034D
0xCC 0xAD = U+032D
0xCC 0xA9 = U+0329
0xCD 0xA0 = U+0360
0xC5 0x86 = U+0146
0xCD 0x9E = U+035E
0xD2 0x89 = U+0489
0xCC 0xAE = U+032E
0xCC 0xB3 = U+0333
0xCD 0x93 = U+0353
0xCD 0x99 = U+0359
0xCD 0x88 = U+0348
0xCC 0xBC = U+033C
0xCD 0x89 = U+0349
0xCC 0xAC = U+032C
0xCD 0x95 = U+0355
0xCD 0x88 = U+0348
0xCC 0xBA = U+033A
0xCD 0x88 = U+0348
0xCC 0xAD = U+032D
0xCC 0xA9 = U+0329
0xCC 0xAA = U+032A
0x6F = U+006F
0xCD 0x87 = U+0347
0xCC 0x97 = U+0317
0xCC 0xB1 = U+0331
0xCC 0xA0 = U+0320
0xCC 0xB1 = U+0331
0xCC 0xA0 = U+0320
0xCC 0xAF = U+032F
0xCC 0x95 = U+0315
0xCD 0xA2 = U+0362
0x75 = U+0075
0xCC 0xB8 = U+0338
0xCC 0xB3 = U+0333
0xCC 0xA6 = U+0326
0xCC 0xA9 = U+0329
0xCC 0xB3 = U+0333
0xCC 0xAB = U+032B
0xCC 0x96 = U+0316
0xCC 0x9C = U+031C
0xCD 0x85 = U+0345
0xE2 0x80 0x8C = U+200C
0xE2 0x80 0x8B = U+200B
0xC7 0xB5 = U+01F5
0xCC 0xA2 = U+0322
0xCC 0xB2 = U+0332
0xCC 0xA3 = U+0323
0xCD 0x8E = U+034E
0xCC 0xAE = U+032E
0xCC 0xAE = U+032E
0xCC 0xBC = U+033C
0xCC 0xAB = U+032B
0xCC 0xA5 = U+0325
0xCC 0xA0 = U+0320
0xCD 0x99 = U+0359
0xCC 0xB1 = U+0331
0xCC 0x9D = U+031D
0xCC 0x98 = U+0318
0xCD 0x95 = U+0355
0xCD 0x8E = U+034E
0xCC 0xB3 = U+0333
0xCC 0x9C = U+031C
0xCC 0xB2 = U+0332
0xCC 0x96 = U+0316
0x68 = U+0068
0xCC 0xB8 = U+0338
0xCC 0x9B = U+031B
0xCC 0xA9 = U+0329
0xCD 0x9A = U+035A
0xCC 0xAE = U+032E
0xCC 0xA4 = U+0324
0xCC 0x96 = U+0316
0xCC 0xB9 = U+0339
0xCD 0x99 = U+0359
0x2E = U+002E
0xCC 0xB6 = U+0336
0xCC 0xA8 = U+0328
0xCC 0xB3 = U+0333
0xCC 0x96 = U+0316
0xCC 0xA0 = U+0320
0xCC 0x97 = U+0317
0xCC 0xBC = U+033C
0xCC 0xA9 = U+0329
0xCD 0x95 = U+0355
0xCD 0x87 = U+0347
0xCD 0x89 = U+0349
0xCD 0x93 = U+0353
0xCC 0x9F = U+031F
0xCC 0xA6 = U+0326
0xCD 0x9C = U+035C
0xCD 0x9E = U+035E
0xCD 0x85 = U+0345
0x0A = U+000A
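A listing in this format could be produced with a short script along the following lines. This is a sketch rather than the tool actually used for the listing above; `dump_code_points` is an illustrative name, and the sample bytes are just the first three code points of the example followed by the letter x.

```python
def dump_code_points(data: bytes) -> list:
    """Decode UTF-8 and return one 'bytes = code point' line per code point."""
    lines = []
    for ch in data.decode("utf-8"):
        raw = ch.encode("utf-8")  # re-encode to recover this code point's bytes
        hex_bytes = " ".join(f"0x{b:02X}" for b in raw)
        lines.append(f"{hex_bytes} = U+{ord(ch):04X}")
    return lines

sample = bytes([0xC8, 0xA8, 0xCC, 0xB8, 0xD2, 0x89, 0x78])
print("\n".join(dump_code_points(sample)))
```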
It gets tricky to decipher which parts of that are graphemes, but clearly, with all the stacked characters, this is not a fixed amount of data per grapheme. There is no sane way to make Unicode work with a fixed width encoding per grapheme because, as the 'Zalgo' examples show, combining marks can be attached to a base character in essentially arbitrary sequences of arbitrary length.
The first grapheme in the second 'Zalgo' example contains:
0xC8 0xA8 = U+0228 LATIN CAPITAL LETTER E WITH CEDILLA
0xCC 0xB8 = U+0338 COMBINING LONG SOLIDUS OVERLAY
0xD2 0x89 = U+0489 COMBINING CYRILLIC MILLIONS SIGN
0xCC 0x9F = U+031F COMBINING PLUS SIGN BELOW
0xCD 0x8E = U+034E COMBINING UPWARDS ARROW BELOW
0xCD 0x9A = U+035A COMBINING DOUBLE RING BELOW
0xCC 0xB9 = U+0339 COMBINING RIGHT HALF RING BELOW
0xCD 0x9A = U+035A COMBINING DOUBLE RING BELOW
0xCC 0x99 = U+0319 COMBINING RIGHT TACK BELOW
0xCC 0x9F = U+031F COMBINING PLUS SIGN BELOW
0xCC 0x96 = U+0316 COMBINING GRAVE ACCENT BELOW
The next code point is U+0078 LATIN SMALL LETTER X, the start of a new grapheme. Two of the combining marks (U+031F and U+035A) each appear twice in that list.
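The character names and the repeated marks can be checked with Python's standard `unicodedata` module. The sketch below hard-codes the eleven code points listed above; the variable names are illustrative only.

```python
import unicodedata
from collections import Counter

# The eleven code points of the first grapheme, in order.
first = "\u0228\u0338\u0489\u031F\u034E\u035A\u0339\u035A\u0319\u031F\u0316"

for ch in first:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Code points that occur more than once within this single grapheme.
repeats = {f"U+{ord(ch):04X}": n for ch, n in Counter(first).items() if n > 1}
print(repeats)

# Storage for this one grapheme in two encoding forms.
print(len(first.encode("utf-8")), "bytes in UTF-8")
print(len(first.encode("utf-32-le")), "bytes in UTF-32")
```

This one grapheme occupies 22 bytes in UTF-8 and 44 bytes in UTF-32, while the plain x that follows it would occupy 1 and 4 bytes respectively.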