Tags: javascript, utf-32

UTF-32 decoding in ECMAScript


I have UTF-32 data, an array buffer. I need to convert it into an ECMAScript string.

I've been told that I can just use TextDecoder with UTF-8 and that it is supposed to "just work." I highly doubted the person who told me this, but it worked anyway.

Except... the output text is riddled with null characters (three per character), because each null padding byte is decoded as a separate null character instead of all four bytes being read as one character. For example:
\x70\x00\x00\x00
becomes
p (UTF-32: all four bytes are read as one character)
p\0\0\0 (UTF-8: each byte is read separately)
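
A minimal reproduction of what I'm seeing (this assumes the buffer holds little-endian UTF-32, which mine does):

```javascript
// UTF-32LE bytes for "p" (U+0070), decoded with the wrong encoding.
const bytes = new Uint8Array([0x70, 0x00, 0x00, 0x00]);
const text = new TextDecoder('utf-8').decode(bytes);
console.log(JSON.stringify(text)); // "p\u0000\u0000\u0000"
```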

According to the WHATWG Encoding spec, UTF-32 is not defined as an encoding label; only UTF-8 and UTF-16 are. Does anyone have any suggestions on how I can achieve proper UTF-32 decoding within a browser?

To be clear, I only care about modern browsers, so I'm excluding IE, Amaya, Android WebView, Netscape Navigator, and the like.


Solution

  • Decoding it as UTF-8 is definitely wrong, as you found out! In addition to the stray NULs, it will fail entirely to decode characters outside of ASCII.

    You can read the codepoints one by one with a DataView to decode:

    const utf32Decode = bytes => {
      // View the bytes so 32-bit reads honour the array's offset and length.
      const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
      let result = '';

      // Each code point is four bytes; true = little-endian (UTF-32LE).
      for (let i = 0; i < bytes.length; i += 4) {
        result += String.fromCodePoint(view.getInt32(i, true));
      }

      return result;
    };
    
    const result = utf32Decode(new Uint8Array([0x70, 0x00, 0x00, 0x00]));
    console.log(JSON.stringify(result)); // "p"

    Invalid UTF-32 will throw an error, thanks to getInt32 (invalid lengths) and String.fromCodePoint (invalid code points).
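
    If your data might also be big-endian or carry a byte-order mark, the same idea extends to a BOM-sniffing variant. This is only a sketch under assumptions the question doesn't state (utf32DecodeAuto is a made-up name, and it assumes a leading BOM should be consumed rather than kept):

    ```javascript
    // Sketch: consume an optional BOM (U+FEFF) to pick the byte order,
    // defaulting to little-endian when no BOM is present.
    const utf32DecodeAuto = bytes => {
      const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
      let offset = 0;
      let littleEndian = true;
      if (bytes.byteLength >= 4) {
        if (view.getUint32(0, true) === 0xFEFF) {
          offset = 4; // UTF-32LE BOM: FF FE 00 00
        } else if (view.getUint32(0, false) === 0xFEFF) {
          offset = 4; // UTF-32BE BOM: 00 00 FE FF
          littleEndian = false;
        }
      }
      let result = '';
      for (let i = offset; i < bytes.byteLength; i += 4) {
        result += String.fromCodePoint(view.getUint32(i, littleEndian));
      }
      return result;
    };
    ```

    Note that a leading U+FEFF that is genuine content (a zero-width no-break space) would be swallowed by this heuristic; that trade-off is inherent to BOM sniffing.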