python, python-3.x, unicode, utf-32

Length of a single character encoded in UTF-32


Wikipedia tells me that UTF-32 uses 32 bits per character, so why does this give me a 64-bit length?

>>> from bitstring import Bits
>>> Bits(bytes='a'.encode('utf-32')).bin
'1111111111111110000000000000000001100001000000000000000000000000'
>>> len(Bits(bytes='a'.encode('utf-32')).bin)
64

UTF-32 is supposed to be a fixed-length encoding in which every character is represented by exactly 4 bytes (32 bits), yet the output of the code above is 64 bits long. How is this?


Solution

  • Encoding to UTF-32 usually includes a Byte Order Mark (BOM); your output contains two code points encoded to UTF-32, not one. The BOM is usually required, as it lets the decoder know whether the data was encoded in little-endian or big-endian byte order. The BOM is really just the U+FEFF ZERO WIDTH NO-BREAK SPACE code point, which in your example is encoded to '11111111111111100000000000000000' (little-endian).
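
    You can see the BOM directly by slicing the encoded bytes; this quick sketch assumes a little-endian machine, matching the output above:

    >>> import codecs
    >>> 'a'.encode('utf-32')[:4]   # the first 4 bytes are the BOM
    b'\xff\xfe\x00\x00'
    >>> 'a'.encode('utf-32')[:4] == codecs.BOM_UTF32_LE
    True
    >>> 'a'.encode('utf-32')[4:]   # the remaining 4 bytes encode 'a' itself
    b'a\x00\x00\x00'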

    Encode with one of the two endian-specific codecs Python provides ('utf-32-le' or 'utf-32-be') to get a single character:

    >>> Bits(bytes='a'.encode('utf-32-le')).bin
    '01100001000000000000000000000000'
    >>> len(Bits(bytes='a'.encode('utf-32-le')).bin)
    32
    

    The -le and -be variants let you encode or decode UTF-32 without a BOM, because you explicitly set the byte order.
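
    Decoding works the same way; a small sketch (again assuming a little-endian machine for the BOM-stripping line) showing that the explicit codec round-trips without a BOM:

    >>> b'a\x00\x00\x00'.decode('utf-32-le')          # no BOM needed; byte order is explicit
    'a'
    >>> 'a'.encode('utf-32')[4:].decode('utf-32-le')  # BOM stripped, then decoded explicitly
    'a'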

    Had you encoded more than one character, you'd have noticed that there are always 4 bytes more than the number of characters would require:

    >>> len('abcd'.encode('utf-32'))  # (BOM + 4 chars) * 4 bytes == 20 bytes
    20
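
    A quick check of the pattern, where the length is (1 + number of characters) * 4 bytes with a BOM and exactly 4 bytes per character without one (the 'x' * n strings are just for illustration):

    >>> [len(('x' * n).encode('utf-32')) for n in range(1, 5)]     # (1 + n) * 4
    [8, 12, 16, 20]
    >>> [len(('x' * n).encode('utf-32-le')) for n in range(1, 5)]  # n * 4, no BOM
    [4, 8, 12, 16]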