Tags: javascript, string, encoding, iso-8859-1, typed-arrays

TextDecoder with latin1 encoding gives a different result from String.fromCharCode


I want to understand what is causing the difference. I have a list of the first 256 codes.

const codes = new Array(256).fill(0).map((_,i) => i); //[0, 1, 2, 3, ..., 255]
const chars1 = new TextDecoder('latin1').decode(new Uint8Array(codes));

This gives the following string:

'\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F€\x81‚ƒ„…†‡ˆ‰Š‹Œ\x8DŽ\x8F\x90‘’“”•–—˜™š›œ\x9DžŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
const chars2 = String.fromCharCode(...codes);

This gives the following string:

'\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
console.log(chars1 === chars2) //prints false

To check which indexes/codes are causing the difference, I used the following function:

function findDiff(str1,str2) {
    const diff = [];
    for(let i=0;i<str1.length;i++) {
        if(str1.charAt(i) !== str2.charAt(i)) {
            diff.push(i);
        }
    }
    return diff;
}

findDiff(chars1, chars2); // [128, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 142, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 158, 159]
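
Inspecting the first differing index shows the two strings disagree on what code 128 (0x80) becomes:

console.log(chars1.charCodeAt(128).toString(16)); // "20ac"
console.log(chars2.charCodeAt(128).toString(16)); // "80"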

Why do these codes produce a different character with TextDecoder than with String.fromCharCode?

I was expecting both methods to give the same string output for the first 256 code points.


Solution

    I figured out what was going on.

    new TextDecoder("latin1")
    

    does not actually decode latin1, but windows-1252.

    console.log(new TextDecoder("latin1").encoding) //"windows-1252"
    

    There is a difference between latin1 (aka ISO-8859-1) and windows-1252 in the 0x80 to 0x9F range.

    Those differences are exactly the indexes I printed in the question above:

    [128, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 142, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 158, 159]
    

    These code points are the only printable windows-1252 characters in the 0x80 to 0x9F range.

    Latin1 does not have any printable characters in the 0x80 to 0x9F range; they are all C1 control characters there.


    new TextDecoder("latin1").decode(new Uint8Array([128, 130, 131, ...])); //"€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ"
    
    String.fromCharCode(128, 130, 131, ...); // "\x80\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8E\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9E\x9F"
    

    Conclusion:

    String.fromCharCode() will actually give you the correct latin1 characters, and in Node.js, Buffer.from().toString("latin1") will do the same.
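
    A quick sanity check of that claim (assumes a Node.js environment where Buffer is available):

    // Buffer's "latin1" maps every byte to the code point of the same value, just like String.fromCharCode
    const bytes = Uint8Array.from({ length: 256 }, (_, i) => i);
    const viaBuffer = Buffer.from(bytes).toString("latin1");
    const viaFromCharCode = String.fromCharCode(...bytes);
    console.log(viaBuffer === viaFromCharCode); // true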

    TextDecoder("latin1") will, however, give you windows-1252 characters.

    I am curious why latin1 is treated as windows-1252 by TextDecoder.
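
    For what it's worth, the WHATWG Encoding Standard that TextDecoder implements defines "latin1" and "iso-8859-1" simply as labels for the windows-1252 encoding, for compatibility with how browsers have historically treated those labels, so all of the following resolve to the same decoder:

    ["latin1", "iso-8859-1", "iso8859-1", "ascii", "windows-1252"].forEach((label) => {
        console.log(label, "->", new TextDecoder(label).encoding); // always "windows-1252"
    });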