How to convert a string to file with UTF-32LE encoding in JS?


Based on this thread, I tried to create a blob with UTF-32 encoding and a BOM of FF FE 00 00 (the UTF-32LE representation) as follows:

var BOM = new Uint8Array([0xFF,0xFE,0x00,0x00]);
var b = new Blob([ BOM, "➀➁➂ Test" ]);
var url = URL.createObjectURL(b);
open(url);

But the file content is corrupted: the text shows up as unrelated characters that look like a different language. What is the correct way to convert a string to a file in UTF-32LE format?

Edit: I'm trying this in a browser environment.


Solution

  • Note: I'm assuming you're doing this in a browser, since you used Blob and Node.js only recently got Blob support, and you referenced a question doing this in a browser.

    You're only setting the BOM; you're not converting the data. As it says in MDN's documentation, any strings in the array will be encoded using UTF-8. So you have a UTF-32LE BOM followed by UTF-8 data.
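
To see the mismatch concretely, here is a small sketch that uses TextEncoder to show the UTF-8 bytes a string turns into (the same encoding Blob applies to string parts), next to the bytes a UTF-32LE reader would expect for that character. The helper name `toHex` is made up for this illustration.

```javascript
// Hypothetical helper: format bytes as space-separated hex for display
function toHex(bytes) {
    return Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

// What a string part of a Blob turns into: UTF-8
const utf8Bytes = new TextEncoder().encode("➀");
console.log(toHex(utf8Bytes)); // E2 9E 80

// What a UTF-32LE reader expects for the same character (U+2780)
const expected = new Uint8Array([0x80, 0x27, 0x00, 0x00]);
console.log(toHex(expected)); // 80 27 00 00
```

A reader that trusts the BOM groups the UTF-8 bytes four at a time, which is why the output looks like unrelated characters.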

    Surprisingly (to me), the browser platform doesn't seem to have a general-purpose text encoder (TextEncoder just encodes UTF-8), but JavaScript strings provide a means of iterating through their code points (not just code units) and getting the actual Unicode code point value. (If those terms are unfamiliar, my blog post What is a string? may help.) So you can get that number and convert it into four little-endian bytes. DataView provides a convenient way to do that.
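
A minimal sketch of those two pieces, code point iteration and DataView's little-endian write, using an emoji outside the Basic Multilingual Plane so the difference between code units and code points is visible. The function name `codePointToUTF32LE` is hypothetical, chosen only for this example.

```javascript
const s = "a😀";

// .length counts UTF-16 code units; the emoji takes two (a surrogate pair)
console.log(s.length); // 3

// String iteration (spread, for-of, Array.from) walks code points instead
const chars = [...s];
console.log(chars.length); // 2
console.log(chars[1].codePointAt(0).toString(16)); // 1f600

// Hypothetical helper: one code point -> four little-endian bytes
function codePointToUTF32LE(codePoint) {
    const bytes = new Uint8Array(4);
    new DataView(bytes.buffer).setUint32(0, codePoint, true); // true = little-endian
    return bytes;
}

console.log(Array.from(codePointToUTF32LE(0x1F600))); // [ 0, 246, 1, 0 ]
```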

    Finally, you'll want to specify the charset in the blob's MIME type (the BOM itself isn't sufficient). I expected text/plain; charset=UTF-32LE to work, but it doesn't, at least not in Chromium browsers. There's probably some legacy reason for that. text/html works (on its own), but we may as well be specific and do text/html; charset=UTF-32LE.

    See comments:

    function getUTF32LEUrl(str) {
        // The UTF-32LE BOM
        const BOM = new Uint8Array([0xFF,0xFE,0x00,0x00]);
        // A byte array and DataView to use when converting 32-bit LE to bytes;
        // they share an underlying buffer
        const uint8 = new Uint8Array(4);
        const view = new DataView(uint8.buffer);
        // Convert the payload to UTF-32LE
        const utf32le = Array.from(str, (ch) => {
            // Get the code point
            const codePoint = ch.codePointAt(0);
            // Write it as a 32-bit LE value in the buffer
            view.setUint32(0, codePoint, true);
            // Read it as individual bytes and create a plain array of them
            return Array.from(uint8);
        }).flat(); // Flatten the array of arrays into one long byte sequence
        // Create the blob and URL
        const b = new Blob(
            [ BOM, new Uint8Array(utf32le)],
            { type: "text/html; charset=UTF-32LE"} // Set the MIME type
        );
        const url = URL.createObjectURL(b);
        return url;
    }
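
If the intermediate array of arrays feels heavy for long strings, one alternative is to count the code points first and write straight into a single buffer. This is only a sketch of that variation, not a replacement for the function above. `encodeUTF32LE` is a name chosen for this example, and it returns the bytes (BOM included) rather than a URL so the result can be inspected before it goes into a Blob.

```javascript
function encodeUTF32LE(str) {
    // Count code points (not code units) to size the buffer
    let count = 0;
    for (const _ of str) count++;
    // 4 bytes for the BOM plus 4 bytes per code point
    const buffer = new ArrayBuffer(4 + count * 4);
    const view = new DataView(buffer);
    // U+FEFF written little-endian gives FF FE 00 00
    view.setUint32(0, 0xFEFF, true);
    let offset = 4;
    for (const ch of str) {
        view.setUint32(offset, ch.codePointAt(0), true);
        offset += 4;
    }
    return new Uint8Array(buffer);
}

const bytes = encodeUTF32LE("➀ T");
console.log(bytes.length); // 16: BOM + 3 code points
```

In a browser, the result could then be passed along as `new Blob([encodeUTF32LE(str)], { type: "text/plain; charset=UTF-32LE" })`.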
    

    Beware, though, that the specification "prohibits" browsers from supporting UTF-32 (either LE or BE) for HTML:

    13.2.3.3 Character encodings

    User agents must support the encodings defined in Encoding, including, but not limited to, UTF-8, ISO-8859-2, ISO-8859-7, ISO-8859-8, windows-874, windows-1250, windows-1251, windows-1252, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, GBK, Big5, ISO-2022-JP, Shift_JIS, EUC-KR, UTF-16BE, UTF-16LE, UTF-16BE/LE, and x-user-defined. User agents must not support other encodings.

    Note: The above prohibits supporting, for example, CESU-8, UTF-7, BOCU-1, SCSU, EBCDIC, and UTF-32. This specification does not make any attempt to support prohibited encodings in its algorithms; support and use of prohibited encodings would thus lead to unexpected behavior. [CESU8] [UTF7] [BOCU1] [SCSU]

    You might be better off with one of the UTF-16s, given that you're using window.open to open the result. (For downloading, UTF-32 is fine if you really want a UTF-32 file.)
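
For the UTF-16 route, JavaScript strings are already sequences of UTF-16 code units, so the conversion amounts to writing each unit as two little-endian bytes after an FF FE BOM. A sketch under that assumption follows. `encodeUTF16LE` is a hypothetical name, and whether a given browser then displays the blob correctly still depends on the MIME type and charset passed to Blob.

```javascript
function encodeUTF16LE(str) {
    // 2 bytes for the BOM plus 2 bytes per UTF-16 code unit
    const buffer = new ArrayBuffer(2 + str.length * 2);
    const view = new DataView(buffer);
    view.setUint16(0, 0xFEFF, true); // BOM, stored as FF FE
    for (let i = 0; i < str.length; i++) {
        // charCodeAt returns the code unit, so surrogate pairs pass through as-is
        view.setUint16(2 + i * 2, str.charCodeAt(i), true);
    }
    return new Uint8Array(buffer);
}

console.log(Array.from(encodeUTF16LE("➀"))); // [ 255, 254, 128, 39 ]
```

The bytes could be wrapped the same way as before, for example `new Blob([encodeUTF16LE(str)], { type: "text/html; charset=UTF-16LE" })`.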


    Sadly, Stack Snippets won't let you open a new window, but here's a full example you can copy and paste to run locally:

    <!doctype html>
    <html>
    <head>
        <meta charset="UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=Edge">
        <title>UTF-32 Conversion</title>
        <link rel="shortcut icon" href="favicon.png">
        <style>
        body, html {
            height: 100%;
            width: 100%;
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }
        *, *:before, *:after {
            box-sizing: inherit;
        }
        </style>
    </head>
    <body>
    <input type="button" value="Open" id="open">
    <input type="button" value="Download" id="download">
    <script type="module">
    function getUTF32LEUrl(str, mimeType) {
        // The UTF-32LE BOM
        const BOM = new Uint8Array([0xFF,0xFE,0x00,0x00]);
        // A byte array and DataView to use when converting 32-bit LE to bytes;
        // they share an underlying buffer
        const uint8 = new Uint8Array(4);
        const view = new DataView(uint8.buffer);
        // Convert the payload to UTF-32LE
        const utf32le = Array.from(str, (ch) => {
            // Get the code point
            const codePoint = ch.codePointAt(0);
            // Write it as a 32-bit LE value in the buffer
            view.setUint32(0, codePoint, true);
            // Read it as individual bytes and create a plain array of them
            return Array.from(uint8);
        }).flat(); // Flatten the array of arrays into one long byte sequence
        // Create the blob and URL
        const b = new Blob(
            [ BOM, new Uint8Array(utf32le)],
            mimeType // Blob options object, e.g. { type: "text/plain; charset=UTF-32LE" }
        );
        const url = URL.createObjectURL(b);
        return url;
    }
    document.getElementById("open").addEventListener("click", () => {
        const str = "➀➁➂ Test";
        const url = getUTF32LEUrl(str, { type: "text/html; charset=UTF-32LE" });
        window.open(url);
    });
    document.getElementById("download").addEventListener("click", () => {
        const str = "➀➁➂ Test";
        const url = getUTF32LEUrl(str, { type: "text/plain; charset=UTF-32LE" });
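        const a = document.createElement("a");
        a.download = "utf-32_file.txt";
        a.href = url;
        // The link has to be in the document before removeChild can remove it
        document.body.appendChild(a);
        a.click();
        document.body.removeChild(a);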
    });
    </script>
    </body>
    </html>