javascript unicode encoding rtf windows-1251

Write non-Latin characters and special symbols (Ø) to an RTF file (JavaScript)


Update:

Yes, thank you very much, satesrah, for the idea! You are right: there is a mashup of encodings, and I can't convert the whole text to either Win-1251 or Win-1252.

I didn't want to insert Unicode; I wanted to keep a single encoding in this file. But the only way I see is to convert all text containing such symbols into RTF escapes of the form \u1234?. So I created this function:

function unicode_to_rtf_representation_u(srcStr) {
  if (srcStr == undefined) return "";

  let tgtStr = "";

  // Emit every UTF-16 code unit as an RTF \uN control word; "?" is the fallback character
  for (let i = 0; i < srcStr.length; i++) {
    const c = srcStr.charCodeAt(i);
    tgtStr += "\\u" + c + "?";
  }
  console.log("result string is: " + tgtStr);
  return tgtStr;
}

It produces output like this:

Abc Ø абв --> \u65?\u98?\u99?\u32?\u216?\u32?\u1072?\u1073?\u1074?

and this works.
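
A slightly more defensive variant, as a sketch only: the RTF \uN control word takes a signed 16-bit decimal, so code units above 32767 have to be written as negative numbers, and printable ASCII can stay as-is as long as \, { and } are escaped. The name toRtfUnicodeEscapes is made up for illustration, the "?" fallback assumes the default of one fallback character per \uN, and line breaks are not turned into \par here.

```javascript
// Hypothetical helper: escapes a JavaScript string for use as RTF body text.
function toRtfUnicodeEscapes(srcStr) {
  if (srcStr == null) return "";
  let out = "";
  for (let i = 0; i < srcStr.length; i++) {
    const ch = srcStr[i];
    const code = srcStr.charCodeAt(i); // UTF-16 code unit, 0..65535
    if (ch === "\\" || ch === "{" || ch === "}") {
      out += "\\" + ch; // characters with special meaning in RTF must be escaped
    } else if (code >= 32 && code < 127) {
      out += ch; // printable ASCII passes through unchanged
    } else {
      // \uN expects a signed 16-bit decimal, so wrap values above 32767
      const signed = code > 32767 ? code - 65536 : code;
      out += "\\u" + signed + "?";
    }
  }
  return out;
}

console.log(toRtfUnicodeEscapes("Abc Ø абв"));
// prints: Abc \u216? \u1072?\u1073?\u1074?
```

Because only non-ASCII characters are escaped, the resulting RTF stays readable and is much shorter for mostly-Latin text.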

Thank you very much again!


Can you please help me encode non-Latin (Russian) letters that are mixed with special symbols, for example: Abc Ø абв (English text, the special symbol 'Latin o with stroke', and Russian text)?

I have an existing RTF template with 'placeholder' text inside, and I need to replace this 'placeholder' with 'Abc Ø абв'.

I use the function from here (at the bottom of the page) to decode UTF-8 to Win-1251. It writes the Russian letters successfully, but I end up with "Ш" instead of 'Ø'.

Here is my example code and input and output files:

input rtf: https://mega.nz/file/CtNB2CiY#yid1nLq9P6Jo8zSRAsXeGai-mZLV6xP1OvN1jDpFyG4

output rtf generated by the code below: https://mega.nz/file/asMExKJI#q8oRn1J9oWMlUck6tJ6MdpVGiIjt81kNFRo7T3eSBTU

const http = require('http');
const port = 3100;

function utf8_decode_to_win1251(srcStr) {
  var tgtStr = "",
    c = 0;
  for (var i = 0; i < srcStr.length; i++) {
    c = srcStr.charCodeAt(i);
    if (c > 127) {
      if (c > 1024) {
        if (c === 1025) {
          c = 1016;
        } else if (c === 1105) {
          c = 1032;
        }
        c -= 848;
      }
      // c = c % 256; // ???
    }
    tgtStr += String.fromCharCode(c);
  }
  return tgtStr;
}


const server = http.createServer(function (req, res) {

  const fs = require('fs');

  // read existing file
  fs.readFile("C:\\input.rtf", "utf8", (err, inputText) => { // backslash must be escaped in a JS string literal
    if (err) {
      console.error(err);
      return;
    }

    // I want to replace 'placeholder' text in file with this test text:
    let text = `Abc Ø абв`; // 'Abc Ø абв'

    text = utf8_decode_to_win1251(text); // text with encoded russian letters 'Abc Ø àáâ'

    // replace placeholder from input RTF with text with non-latin characters 'Abc Ø àáâ':
    inputText = inputText.replace("placeholder", text);

    // RTF uses 8-bit so need to convert from unicode
    let buf = Buffer.from(inputText, "ascii"); // "binary" also gives wrong output text https://stackoverflow.com/a/34476862/348736


    // write output file to disk
    fs.writeFile("C:\\output.rtf", buf, function (error) { // result file contains 'Abc Ш абв', which is wrong..
      if (!error) {
        console.info('Created file C:\\output.rtf'); // the writeFile callback only receives the error
      }
      else {
        console.error(error);
      }
    });
  });
});


server.listen(port, function (error) {

  if (error) {
    console.log(`${error}`);
  } else {
    console.log(`listening on port ${port}`);
  }
})

Solution

  • As far as I know, you can't represent "Abc Ø абв" with a single 8-bit encoding.

    I tried to make sense of what happens in your code. Windows-1251 has no character Ø (you can check that in this table: https://www.ascii-code.com/CP1251), but it does have the characters абв. So the function cannot actually be producing valid Windows-1251 for the whole string. Windows-1252, on the other hand, does have Ø but does not have абв (the а here is the Cyrillic а, which is a different character from the Latin a). I think what's happening is that the output is effectively Windows-1252 for Ø, but the data ends up somewhere that interprets it as Windows-1251.

    Playing that through: "Abc Ø абв" is the UTF-8 byte sequence 41 62 63 20 C3 98 20 D0 B0 D0 B1 D0 B2. What comes out of your conversion is the single-byte sequence 41 62 63 20 D8 20 E0 E1 E2. Read as Windows-1252, that prints "Abc Ø àáâ", which is exactly what you got. If you read the same bytes as Windows-1251 instead, they print "Abc Ш абв", which again is what happened in your example. (You can try that out here: https://www.rapidtables.com/convert/number/hex-to-ascii.html)
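
The reinterpretation described above can be checked with a small sketch. The helper names readAsWin1252 and readAsWin1251 are made up for illustration, and they are deliberately partial: they only handle ASCII plus the byte range 0xC0-0xFF, which is all this example needs. In that range Windows-1252 matches the Unicode code points directly, while Windows-1251 holds the Cyrillic letters А..я (U+0410..U+044F).

```javascript
// Bytes the conversion produces for "Abc Ø абв".
const bytes = [0x41, 0x62, 0x63, 0x20, 0xD8, 0x20, 0xE0, 0xE1, 0xE2];

// Partial Windows-1252 reader: valid for ASCII and 0xA0..0xFF,
// where byte values equal Unicode code points.
function readAsWin1252(bs) {
  return bs.map((b) => String.fromCharCode(b)).join("");
}

// Partial Windows-1251 reader: valid for ASCII and 0xC0..0xFF,
// where the bytes map onto U+0410..U+044F.
function readAsWin1251(bs) {
  return bs
    .map((b) => String.fromCharCode(b >= 0xC0 ? b - 0xC0 + 0x0410 : b))
    .join("");
}

console.log(readAsWin1252(bytes)); // prints: Abc Ø àáâ
console.log(readAsWin1251(bytes)); // prints: Abc Ш абв
```

The byte 0xD8 is the crux: it is Ø in Windows-1252 but Ш in Windows-1251, so no single code page choice renders both Ø and абв from these bytes.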