My question is what logic might explain the following behavior or, if it is a bug (in MSXML6 under Windows), what failure of logic could underpin it.
Consider the following input XML file:
<?xml version="1.0" encoding="utf-8"?>
<root>
<item>first item</item>
<item>second item</item>
</root>
The following XSLT attempts to extract the items in text format, one per line, with the standard Windows CR-LF line endings.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE xsl:stylesheet [<!ENTITY eol "<![CDATA[
]]>">]> <!-- (a) !?? -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" version="1.0" encoding="utf-8" media-type="text/plain"/>
<xsl:strip-space elements='*'/>
<xsl:template match="item"> <!-- list items, one per line -->
<xsl:value-of select="."/>
<xsl:text disable-output-escaping="yes">&eol;</xsl:text>
</xsl:template>
</xsl:stylesheet>
However, the output that I am getting includes extraneous escaped CRs, literally output as "&#13;", at the end of each line.
first item&#13;
second item&#13;
The question, again, is about the particular behavior above, which I find quite odd. I am specifically not asking for alternatives or workarounds; in fact, variations of the declaration appear to work fine:
<!DOCTYPE xsl:stylesheet [<!ENTITY eol "<![CDATA[
]]>">]> <!-- (b) works -->
<!DOCTYPE xsl:stylesheet [<!ENTITY eol "&#xA;">]> <!-- (c) no newlines in output -->
<!DOCTYPE xsl:stylesheet [<!ENTITY eol "&#xA;">]> <!-- (d) works -->
<!DOCTYPE xsl:stylesheet [<!ENTITY eol "
">]> <!-- (e) no newlines in output -->
<!DOCTYPE xsl:stylesheet [<!ENTITY eol "
">]> <!-- (f) works -->
var vArgs = WScript.Arguments;
var xmlFile = vArgs(0);
var xslFile = vArgs(1);
var xmlDOMDocProgID = "MSXML2.DOMDocument.6.0";
var xmlDoc = new ActiveXObject(xmlDOMDocProgID);
xmlDoc.setProperty("NewParser", true);
xmlDoc.validateOnParse = false;
xmlDoc.async = false;
xmlDoc.load(xmlFile);
var xslDoc = new ActiveXObject(xmlDOMDocProgID);
xslDoc.setProperty("NewParser", true);
xslDoc.setProperty("ProhibitDTD", false);
xslDoc.validateOnParse = false;
xslDoc.async = false;
xslDoc.load(xslFile);
WScript.StdOut.Write(xmlDoc.transformNode(xslDoc));
Assuming it's saved as test.js and the XML/XSLT files are test.xml and test.xslt respectively, the transformation can be run at the cmd prompt as follows:
C:\etc>cscript //nologo test.js test.xml test.xslt
first item
second item
C:\etc>
I think it is a bug in MSXML 6 and the "new parser" that you enable with xslDoc.setProperty("NewParser", true). Even without using any XSLT at all, you can load a document like
<!DOCTYPE root [<!ENTITY eol "<![CDATA[
]]>">]>
<root>&eol;</root>
with MSXML 6 and the "new parser" and check the text property of the root/document element:
var xmlDOMDocProgID = "MSXML2.DOMDocument.6.0";
var xmlDoc = new ActiveXObject(xmlDOMDocProgID);
xmlDoc.setProperty("NewParser", true);
xmlDoc.setProperty("ProhibitDTD", false);
xmlDoc.validateOnParse = false;
xmlDoc.async = false; // load synchronously so the document is complete before it is read
xmlDoc.load('cdata-input2.xml');
WScript.Echo(xmlDoc.documentElement.text);
and it shows &#13;.
If you also output WScript.Echo(xmlDoc.documentElement.firstChild.firstChild.nodeValue); you get the same value, so somehow the entity parsing ends up "converting" the <!ENTITY eol "<![CDATA[
]]>"> declaration from the DTD subset and the &eol; reference into an entity reference node containing a CDATA section node, with a node value in which the escaped hexadecimal character reference &#xD; is now an escaped decimal one, &#13;.
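When inspecting such values it can be hard to tell an actual carriage return from the five literal characters of an escaped reference, since the console renders a real CR invisibly. A minimal sketch of a helper that makes the difference visible is shown here; `showCodes` is a hypothetical name and not part of MSXML or WSH, and the code sticks to plain ES3 so it should also run under cscript.

```javascript
// Hypothetical helper (an assumption for illustration, not an MSXML API):
// render each character of a string as its hexadecimal UTF-16 code unit,
// so invisible characters such as CR and LF become visible.
function showCodes(s) {
    var parts = [];
    for (var i = 0; i < s.length; i++) {
        var hex = s.charCodeAt(i).toString(16).toUpperCase();
        // Left-pad to four hex digits, e.g. "D" becomes "000D".
        while (hex.length < 4) {
            hex = "0" + hex;
        }
        parts.push("U+" + hex);
    }
    return parts.join(" ");
}
```

Under WSH it could be called as WScript.Echo(showCodes(xmlDoc.documentElement.text)). A real CR-LF pair comes out as "U+000D U+000A", whereas the literal text of a decimal reference to CR comes out as "U+0026 U+0023 U+0031 U+0033 U+003B".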