javascriptnode.jsxmlutf-8xmldom

Parse UTF-8 XML in javascript


I'm trying to load and parse a simple utf-8-encoded XML file in javascript using node and the xpath and xmldom packages. There are no XML namespaces used and the same XML parsed when converted to ASCII. I can see in the debugger in VS Code that the string has embedded spaces in between each character (surely due to loading the utf-8 file incorrectly) but I can't find a way to properly load and parse the utf-8 file.

Code:

var xpath = require('xpath')
  , dom = require('xmldom').DOMParser;

const fs = require('fs');

var myXml = "path_to_my_file.xml";

var xmlContents = fs.readFileSync(myXml, 'utf8').toString();

// this line causes errors parsing every single tag as the tag names have spaces in them from improper utf-8 decoding
var doc = new dom().parseFromString(xmlContents, 'application/xml');
var cvNode = xpath.select1("//MyTag", doc);

console.log(cvNode.textContent);

The code works fine if the file is ASCII (textContent has the proper data), but if it is UTF-8 then there are a number of parsing errors and cvNode is undefined.

Is there a proper way to parse UTF-8 XML in node/javascript? I can't for the life of me find a decent example.


Solution

  • When you see additional white spaces between each letter, this suggests that the file isn't actually encoded using utf-8 but uses a 16 bit unicode encoding.

    Try 'utf16le'.

    For a list of supported encodings see Buffers and Character Encodings.