I'm trying to load and parse a simple UTF-8-encoded XML file in JavaScript using Node and the xpath and xmldom packages. There are no XML namespaces used, and the same XML parses fine when converted to ASCII. I can see in the debugger in VS Code that the string has embedded spaces between each character (surely due to loading the UTF-8 file incorrectly), but I can't find a way to properly load and parse the UTF-8 file.
Code:
var xpath = require('xpath')
, dom = require('xmldom').DOMParser;
const fs = require('fs');
var myXml = "path_to_my_file.xml";
var xmlContents = fs.readFileSync(myXml, 'utf8').toString();
// this line causes errors parsing every single tag as the tag names have spaces in them from improper utf-8 decoding
var doc = new dom().parseFromString(xmlContents, 'application/xml');
var cvNode = xpath.select1("//MyTag", doc);
console.log(cvNode.textContent);
The code works fine if the file is ASCII (textContent has the proper data), but if it is UTF-8 then there are a number of parsing errors and cvNode is undefined.
Is there a proper way to parse UTF-8 XML in node/javascript? I can't for the life of me find a decent example.
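When you see additional "spaces" between each letter, this suggests that the file isn't actually encoded as UTF-8 but uses a 16-bit Unicode encoding such as UTF-16. For characters in the ASCII range, every second byte of UTF-16 is zero, and those NUL bytes show up as gaps between the letters when the data is decoded as UTF-8.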
Try 'utf16le' as the encoding argument to fs.readFileSync instead of 'utf8'.
For a list of supported encodings see Buffers and Character Encodings.
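If the encoding of the file isn't known in advance, one option is to read the raw bytes and choose the encoding from the byte order mark (BOM). The following is only a minimal sketch: detectEncoding, decodeXmlBuffer and readXmlText are hypothetical helper names, not part of Node, xmldom or xpath, and the sketch assumes the file is either UTF-8 or UTF-16LE (big-endian UTF-16 is not handled).

```javascript
const fs = require('fs');

// Hypothetical helper: guess the encoding from the first bytes of the file.
// Assumption: the file is either UTF-8 or UTF-16LE.
function detectEncoding(buf) {
  // UTF-16LE BOM: FF FE
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) return 'utf16le';
  // UTF-8 BOM: EF BB BF
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) return 'utf8';
  // No BOM: '<' followed by a zero byte suggests UTF-16LE
  if (buf.length >= 2 && buf[0] === 0x3c && buf[1] === 0x00) return 'utf16le';
  return 'utf8';
}

// Hypothetical helper: decode the bytes and drop a leading BOM character,
// so that the XML parser sees '<' as the first character.
function decodeXmlBuffer(buf) {
  const text = buf.toString(detectEncoding(buf));
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Hypothetical helper: read a file as raw bytes (no encoding argument) and decode it.
function readXmlText(path) {
  return decodeXmlBuffer(fs.readFileSync(path));
}
```

The string returned by readXmlText(myXml) can then be passed to new dom().parseFromString(...) in place of xmlContents from the question.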