I am building a small document parser in Node.js. To test it, I have a raw HTML file; in the real application, the HTML is downloaded from the live website at run time.
From each section of the Console.WriteLine documentation, I want to extract the first code example that matches my constraint: it has to be written in C#. To do that, I have this sample XPath:
//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]
If I test the XPath online, I get the expected results, which are in this Gist.
In my Node.js application, I am using xmldom and xpath to try to parse that exact same information out:
var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
var sampleNodes = xpath.select(exampleLookup,doc);
This does not return anything, however.
What might be going on here?
This is most likely caused by the default namespace (xmlns="http://www.w3.org/1999/xhtml") in your HTML (XHTML).
Looking at the xpath docs, you should be able to bind the namespace to a prefix using useNamespaces and use that prefix in your XPath (untested):
var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::x:div/following-sibling::x:div/x:pre[position()>1]/x:code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
// Call the namespace-aware function returned by useNamespaces, not xpath.select
var sampleNodes = select(exampleLookup, doc);
Instead of binding the namespace to a prefix, you could also use local-name() in your XPath, but I wouldn't recommend it. This is also covered in the docs.
Example:
//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::*[local-name()='div']/following-sibling::*[local-name()='div']/*[local-name()='pre'][position()>1]/*[local-name()='code'][contains(@class,'lang-csharp')]
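If you do go the local-name() route, writing the predicates by hand gets noisy quickly. A small helper can build them; this is only a sketch, and `byLocalName` and `targetId` are hypothetical names, not part of the xpath library.

```javascript
// Hypothetical helper: builds a namespace-agnostic XPath step for an element name.
function byLocalName(name) {
  return "*[local-name()='" + name + "']";
}

const targetId =
  'System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_';

// Assemble the same query as above, one location step per array entry.
const lookup = [
  "//*[@id='" + targetId + "']",
  'parent::' + byLocalName('div'),
  'following-sibling::' + byLocalName('div'),
  byLocalName('pre') + '[position()>1]',
  byLocalName('code') + "[contains(@class,'lang-csharp')]",
].join('/');

console.log(lookup);
```

The resulting string can be passed to xpath.select as-is, since it contains no prefixes that need resolving.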