html node.js character-encoding windows-1252 quoted-printable

Decoding a combination of windows-1252 and quoted printable HTML

I have been given a piece of text representing HTML e.g.:

<html>\r\n<head>\r\n<meta http-equiv=3D\"Content-Type\" content=3D\"text/html; charset=3DWindows-1=\r\n252\">\r\n<style type=3D\"text/css\" style=3D\"display:none;\"><!-- P {margin-top:0;margi=\r\nn-bottom:0;} --></style>\r\n</head>\r\n<body dir=3D\"ltr\">This should be a pound sign: =A3 and this should be a long dash: =96 \r\n</body>\r\n</html>\r\n

From the HTML <meta> tag I can see that the piece of HTML should be encoded as Windows-1252.

I am using node.js to parse this piece of text with cheerio. However, decoding it with https://github.com/mathiasbynens/windows-1252 does not help: windows1252.decode(myString); gives back the same string I passed in.

I think the reason is that the input string is already encoded in the standard node.js charset, but it actually represents a windows-1252 encoded piece of HTML (if that makes sense?).

Checking those strange hex numbers prefixed by =, I can see valid windows-1252 codes, e.g.:

this =\r\n and this \r\n should somehow represent a carriage return in the Windows world,
=3D: HEX 3D is DEC 61 which is an equals sign: =,
=96: HEX 96 is DEC 150 which is an 'en dash' sign: – (some sort of "long minus symbol"),
=A3: HEX A3 is DEC 163 which is a pound sign: £
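
The hex-to-decimal reading above can be checked mechanically. A minimal sketch using only built-in JavaScript; the helper name `escapeToByte` is made up for illustration and is not part of any library:

```javascript
// Hypothetical helper: turn a single =XX escape (for example "=A3")
// into the numeric byte value it stands for.
function escapeToByte(escape) {
  const match = /^=([0-9A-Fa-f]{2})$/.exec(escape);
  if (!match) {
    throw new Error(`Not a =XX escape: ${escape}`);
  }
  return parseInt(match[1], 16);
}

for (const escape of ['=3D', '=96', '=A3']) {
  console.log(`${escape} -> ${escapeToByte(escape)}`);
}
// =3D -> 61
// =96 -> 150
// =A3 -> 163
```

These are byte values only; which character a byte such as 150 stands for depends on the charset used to interpret it.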

I don't have control over the generation of that piece of HTML, but I am supposed to parse it and clean it, giving back £ (instead of =A3) etc.

Now, I know I could keep an in-memory map with the conversions, but I was wondering whether there is already a programmatic solution that covers the whole windows-1252 charset.

Cf. this for the whole conversion table: https://www.w3schools.com/charsets/ref_html_ansi.asp

Edit:

The input HTML comes from an IMAP session, so it seems there's a 7bit/8bit "quoted printable encoding" going on upstream that I cannot control (cf. https://en.wikipedia.org/wiki/Quoted-printable).

In the meantime I became aware of this extra encoding and tried the quoted-printable library (cf. https://github.com/mathiasbynens/quoted-printable), with no luck.

The following is an MCVE (as per request):

var cheerio = require('cheerio');
var windows1252 = require('windows-1252');
var quotedPrintable = require('quoted-printable');

const inputString = '<html>\r\n<head>\r\n<meta http-equiv=3D\"Content-Type\" content=3D\"text/html; charset=3DWindows-1=\r\n252\">\r\n<style type=3D\"text/css\" style=3D\"display:none;\"><!-- P {margin-top:0;margi=\r\nn-bottom:0;} --></style>\r\n</head>\r\n<body dir=3D\"ltr\">This should be a pound sign: =A3 and this should be a long dash: =96 \r\n</body>\r\n</html>\r\n'
const $ = cheerio.load(inputString, {decodeEntities: true});
const bodyContent = $('html body').text().trim();
const decodedBodyContent = windows1252.decode(bodyContent);

console.log(`The input string: "${bodyContent}"`);
console.log(`The output string: "${decodedBodyContent}"`);

if (bodyContent === decodedBodyContent) {
  console.log('The windows1252 output seems the same of as the input');
}

const decodedQp = quotedPrintable.decode(bodyContent)
console.log(`The decoded QP string: "${decodedQp}"`);

The previous script is producing the following output:

The input string: "This should be a pound sign: =A3 and this should be a long dash: =96"
The output string: "This should be a pound sign: =A3 and this should be a long dash: =96"
The windows1252 output seems the same of as the input
The decoded QP string: "This should be a pound sign: £ and this should be a long dash: "

On my command line I cannot see the long dash, and I am not sure how to properly decode all these =<something> encoded characters.

Solution

It seems the message received via IMAP uses a combination of two different encodings:

the actual string is encoded with the "quoted printable" transfer encoding (https://en.wikipedia.org/wiki/Quoted-printable), I think because of the 7bit/8bit constraints when transporting that information via the IMAP channel (a TCP socket connection)
the logical content (an email body) is HTML whose <meta> tag declares a Windows-1252 charset

There is also an "issue" with these HTML chunks: they contain a lot of Windows-style line breaks (\r\n). I had to pre-process the string to deal with that, which in my case meant removing those line breaks.
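
One caveat with stripping \r\n before decoding: in quoted-printable, a line ending in = is a soft line break, and the = has to be removed together with the \r\n that follows it. If only the \r\n is removed first, the leftover = can merge with the next two characters into a bogus escape. The sketch below shows this on the charset attribute from the question. It uses a deliberately minimal stand-in decoder: `decodeQpSketch` is a hypothetical name, not the quoted-printable package's API, and it simply maps each decoded byte to the character with the same code.

```javascript
// Hypothetical, minimal quoted-printable decoder for illustration only.
function decodeQpSketch(input) {
  return input
    // Remove soft line breaks: "=" immediately followed by CRLF.
    .replace(/=\r\n/g, '')
    // Turn each =XX escape into the character with that code.
    .replace(/=([0-9A-Fa-f]{2})/g, (_, hex) =>
      String.fromCharCode(parseInt(hex, 16)));
}

// Taken from the question's input: the charset value is split by a soft break.
const sample = 'charset=3DWindows-1=\r\n252';

// Stripping CRLF first leaves "=252", and "=25" then decodes to "%".
const strippedFirst = decodeQpSketch(sample.replace(/\r\n/g, ''));
// Decoding the untouched string removes "=\r\n" as a unit.
const decodedDirectly = decodeQpSketch(sample);

console.log(strippedFirst);   // charset=Windows-1%2
console.log(decodedDirectly); // charset=Windows-1252
```

As far as I can tell, the quoted-printable package's decode() deals with soft line breaks on its own, so the raw string can probably be passed to it without the stripping step; any \r\n that remains afterwards is ordinary whitespace as far as HTML is concerned.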

The following MCVE shows the process of cleaning and decoding a string that represents an email body:

const quotedPrintable = require('quoted-printable');
const windows1252 = require('windows-1252');

const inputStr = 'This should be a pound sign: =A3 \r\nand this should be a long dash: =96\r\n';
console.log(`The original string: "${inputStr}"`);

// 1. Strip the Windows-style line breaks (\r\n). This is fine for this sample,
//    which contains no quoted-printable soft line breaks (a line ending in "=").
const cleanedStr = inputStr.replace(/\r\n/g, '');
console.log(`The string without carriage returns: "${cleanedStr}"`);

// 2. Undo the quoted-printable transfer encoding: each =XX becomes one byte,
//    returned as a "byte string" (one character per byte).
const decodedQp = quotedPrintable.decode(cleanedStr);
console.log(`The decoded QP string: "${decodedQp}"`);

// 3. Interpret those bytes as Windows-1252 to get the actual characters.
const windows1252DecodedQp = windows1252.decode(decodedQp);
console.log(`The windows1252 decoded QP string: "${windows1252DecodedQp}"`);

Which gives this output:

The original string: "This should be a pound sign: =A3
and this should be a long dash: =96
"
The string without carriage returns: "This should be a pound sign: =A3 and this should be a long dash: =96"
The decoded QP string: "This should be a pound sign: £ and this should be a long dash: "
The windows1252 decoded QP string: "This should be a pound sign: £ and this should be a long dash: –"

Notice the "long dash character" that is rendered differently before/after the Windows-1252 decoding phase.
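
The difference is easier to see by printing code points instead of relying on what the terminal renders. A small sketch (the helper name `toCodePoints` is made up here); it assumes the quoted-printable step yields one character per decoded byte, which is consistent with the output above:

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX.
function toCodePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

// After quoted-printable decoding, =96 is the byte 0x96 carried as U+0096,
// a C1 control character that most terminals do not display.
const afterQp = '\x96';
// After Windows-1252 decoding, the same byte becomes U+2013 (EN DASH).
const afterCp1252 = '\u2013';

console.log(toCodePoints(afterQp).join(' '));     // U+0096
console.log(toCodePoints(afterCp1252).join(' ')); // U+2013
```

The pound sign looks the same in both lines of output because byte 0xA3 and code point U+00A3 happen to coincide.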

Afaik, this had nothing to do with UTF-8 encoding/decoding. I was able to figure out the "decoding order" of the procedure from this: https://github.com/mathiasbynens/quoted-printable/issues/5
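
The decoding order (quoted-printable first, charset second) can also be sketched without third-party packages. This is an illustration only, not how either npm package is implemented: `qpToBytes`, `cp1252ToString` and the `CP1252_HIGH` table are names invented for this example. The table covers bytes 0x80 to 0x9F, the range where Windows-1252 differs from ISO-8859-1.

```javascript
// Windows-1252 code points for bytes 0x80..0x9F (null = unassigned byte).
const CP1252_HIGH = [
  0x20AC, null,   0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, null,   0x017D, null,
  null,   0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, null,   0x017E, 0x0178,
];

// Step 1: undo quoted-printable, producing raw bytes.
function qpToBytes(input) {
  const unfolded = input.replace(/=\r\n/g, ''); // drop soft line breaks
  const bytes = [];
  for (let i = 0; i < unfolded.length; i++) {
    const hex = unfolded.slice(i + 1, i + 3);
    if (unfolded[i] === '=' && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16)); // =XX escape -> one byte
      i += 2;                        // skip the two hex digits
    } else {
      // Literal character; this sketch assumes the input is 7-bit ASCII.
      bytes.push(unfolded.charCodeAt(i) & 0xff);
    }
  }
  return Uint8Array.from(bytes);
}

// Step 2: interpret the bytes as Windows-1252.
function cp1252ToString(bytes) {
  return Array.from(bytes, (b) => {
    if (b < 0x80 || b > 0x9f) return String.fromCharCode(b);
    const mapped = CP1252_HIGH[b - 0x80];
    // Unassigned bytes are passed through as the code point of the same value.
    return String.fromCharCode(mapped === null ? b : mapped);
  }).join('');
}

const decoded = cp1252ToString(qpToBytes('pound: =A3, dash: =96'));
console.log(decoded === 'pound: \u00A3, dash: \u2013'); // true
```

Doing the steps in the other order would not work, because before quoted-printable decoding the string contains only ASCII characters and there are no Windows-1252 bytes to interpret yet. Node's built-in TextDecoder may also accept the 'windows-1252' label on builds with full ICU data, which would replace the table, but I have not verified that here.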

One thing I am not sure about is whether the operating system I am running this code on has any impact on the charsets/encodings of files or streams of strings.

The npm packages I have used are: