I want to understand what causes a difference between two ways of converting the first 256 codes to a string. I start with a list of those codes:
const codes = new Array(256).fill(0).map((_,i) => i); //[0, 1, 2, 3, ..., 255]
const chars1 = new TextDecoder('latin1').decode(new Uint8Array(codes));
This gives the following string:
'\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F€\x81‚ƒ„…†‡ˆ‰Š‹Œ\x8DŽ\x8F\x90‘’“”•–—˜™š›œ\x9DžŸ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
const chars2 = String.fromCharCode(...codes);
This gives the following string:
'\x00\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
console.log(chars1 === chars2); // prints false
To check which indices/codes cause the difference, I used the following function:
function findDiff(str1, str2) {
  const diff = [];
  for (let i = 0; i < str1.length; i++) {
    if (str1.charAt(i) !== str2.charAt(i)) {
      diff.push(i);
    }
  }
  return diff;
}
const diff = findDiff(chars1, chars2); // [128, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 142, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 158, 159]
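To see what each differing index actually decodes to in both strings, you can print the character and both resulting code points side by side (a quick sketch using the strings defined above; the cp helper is mine, just for formatting — chars2's code point always equals the byte value):
// For each differing index, show the character TextDecoder produced
// plus the code point each method yielded.
const cp = n => 'U+' + n.toString(16).toUpperCase().padStart(4, '0');
for (const i of diff) {
  console.log(i, chars1[i], cp(chars1.codePointAt(i)), cp(chars2.codePointAt(i)));
}
// e.g. 128 € U+20AC U+0080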
Why do these code points produce different characters with TextDecoder than with String.fromCharCode? I was expecting both methods to give the same string for the first 256 code points.
I figured out what was going on.
new TextDecoder("latin1")
does not actually decode latin1, but windows-1252.
console.log(new TextDecoder("latin1").encoding) //"windows-1252"
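This isn't specific to the spelling "latin1" — all of the common Latin-1 labels resolve to the same decoder. A quick check, with a few labels picked from the Encoding Standard's label table:
// Every one of these labels is defined as an alias for windows-1252.
for (const label of ['latin1', 'iso-8859-1', 'l1', 'ascii', 'us-ascii']) {
  console.log(label, '->', new TextDecoder(label).encoding); // always "windows-1252"
}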
There is a difference between latin1 (aka ISO-8859-1) and windows-1252 in the 0x80–0x9F range.
Those differences are exactly the indices I printed in the code I posted in the question:
[128, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 142, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 158, 159]
These code points are the only printable windows-1252 characters in the 0x80–0x9F range.
Latin1 has no printable characters there at all; in ISO-8859-1 that range is the C1 control block.
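The five bytes missing from the diff list (0x81, 0x8D, 0x8F, 0x90, 0x9D) are unassigned in windows-1252; the Encoding Standard decodes them to the C1 control code points with the same value, which is why they never show up as differences. A quick check:
// Unassigned windows-1252 bytes pass through to the identical code point.
for (const b of [0x81, 0x8D, 0x8F, 0x90, 0x9D]) {
  const c = new TextDecoder('latin1').decode(new Uint8Array([b]));
  console.log('0x' + b.toString(16), c.codePointAt(0) === b); // true
}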
new TextDecoder("latin1").decode(new Uint8Array([128, 130, 131, ...])); //"€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ"
String.fromCharCode(128, 130, 131, ...); // "\x80\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8E\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9E\x9F"
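If you actually need to decode bytes as real ISO-8859-1 in the browser, a minimal sketch follows; the helper name latin1Decode is mine, not a built-in. It relies on the fact that the first 256 Unicode code points coincide with ISO-8859-1, so each byte maps straight to the code point with the same value:
// Hypothetical helper: decode true ISO-8859-1 by mapping byte -> code point 1:1.
function latin1Decode(bytes) {
  let out = '';
  for (const b of bytes) out += String.fromCharCode(b);
  return out;
}
console.log(latin1Decode(new Uint8Array(codes)) === chars2); // true
Building the string byte by byte also sidesteps the argument-count limit you can hit with String.fromCharCode(...bytes) on very large inputs.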
Conclusion:
String.fromCharCode() actually gives you the correct latin1 characters, since the first 256 Unicode code points are identical to ISO-8859-1; in Node.js, Buffer.from(bytes).toString("latin1") does the same.
new TextDecoder("latin1") will, however, give you windows-1252 characters.
I was curious why latin1 is considered windows-1252 for TextDecoder: the WHATWG Encoding Standard, which TextDecoder implements, defines "latin1", "iso-8859-1", and related names as labels for windows-1252, because browsers have historically decoded content declared as latin1 that way for web compatibility.