
Comments getting escaped with NekoHTML (or JTidy) + XOM


I'm using NekoHTML to clean up some HTML, and then feeding it to XOM to get an object model. Somewhere in the course of this, comments are getting escaped.

Here's a relevant example of the input HTML (most of the <head> cut for clarity):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
    <script type="text/JavaScript">
        <!-- // Hide the JS
        startTimeout(6000000, "/");
        // -->
    </script>

Here's the code:

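import java.io.Reader;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Serializer;
import org.xml.sax.XMLReader;

// XOMSafeSAXParser is the Neko SAXParser extended to allow
// XOM to set the (unnecessary in this case) features
// external-general-entities and external-parameter-entities
XMLReader reader = new XOMSafeSAXParser();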

Builder xomBuilder = new Builder(reader);
Reader input = ...; // file, resource, etc.
Document doc = xomBuilder.build(input);

Serializer s = new Serializer(System.out, "UTF-8");
s.setIndent(4);
s.setMaxLength(200);
s.write(doc);
s.flush();

Here's the corresponding output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML lang="en">
    <HEAD>
        <SCRIPT type="text/JavaScript"> &lt;!-- // Hide the JS startTimeout(6000000, "/"); // --&gt; </SCRIPT>
    </HEAD>

When I extract the script element from the XOM document, it looks like it's already been mangled (the SCRIPT element has one Text node as a child, not the sequence of Texts and Comments I would expect), so I don't think it's the Serializer that's going wrong.

Now, I don't expect the line breaks to be preserved, and in fact I'm going to throw the script tags out anyway. But there are other places where I'd like comments to be preserved, or at minimum where I'd like to be able to get the text without escaped comments embedded in it.

Any ideas?


Update: NekoHTML was mangling some tags, so I switched to JTidy, and I have the same problem. Interestingly, though, it's only a problem for the script tag in the header; other comments come through fine. And there are weird extra JavaScript comments that I suspect (hope and pray) are JTidy's fault.

    <script type="text/JavaScript"> // &lt;!-- // Hide the JS startTimeout(6000000, "/"); // --&gt; // </script>

It looks as though what JTidy's doing is converting <script> contents to CDATA; when I send JTidy's raw output to stdout, I get this:

<script type="text/JavaScript">
//<![CDATA[
        <!-- // Hide the JS
        startTimeout(6000000, "/");
        // -->
    //]]>
</script>
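
To check that the CDATA wrapping alone is enough to explain the missing Comment nodes, here's a minimal sketch. It uses the JDK's built-in DOM parser rather than XOM, purely so it runs without extra jars, and turns on coalescing as a rough stand-in for XOM folding CDATA into Text. The class and method names (CdataCommentDemo, hasCommentChild, textOf) are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the JDK DOM parser stands in for XOM here.
final class CdataCommentDemo {

    // Parses a small XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Coalescing converts CDATA sections to plain text nodes,
            // roughly mirroring the CDATA-into-Text merging seen with XOM.
            factory.setCoalescing(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    // True if the root element has at least one Comment child.
    static boolean hasCommentChild(String xml) {
        NodeList children = parseRoot(xml).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.COMMENT_NODE) {
                return true;
            }
        }
        return false;
    }

    // Text content of the root element, with any CDATA already unwrapped.
    static String textOf(String xml) {
        return parseRoot(xml).getTextContent();
    }

    public static void main(String[] args) {
        // A bare comment inside the element, as in the original HTML.
        String plain = "<script><!-- hidden --></script>";
        // The same kind of comment wrapped in CDATA, as in JTidy's output.
        String wrapped =
                "<script>//<![CDATA[\n<!-- // Hide the JS\n// -->\n//]]></script>";
        System.out.println("plain has comment:   " + hasCommentChild(plain));
        System.out.println("wrapped has comment: " + hasCommentChild(wrapped));
        System.out.println("wrapped text: " + textOf(wrapped));
    }
}
```

If the CDATA explanation is right, the bare version should report a comment child and the wrapped version should not, with the "<!--" and "-->" markers showing up as ordinary characters in the text instead.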

Solution

  • All right. I seem to have found the explanation at least for the JTidy case:

    the basic issue is that browser scripts will often contain special XML characters: '&', '<', ']]>' and '<' + '/' + Letter. If these are escaped to make XML processors happy, it will break the script. The agreed solution is to place source within a CDATA section. This is now done for both <script> and <style> tags. So far, so good. But there are a number open issues and possible unintended consequences. ... script source is often embedded in HTML comments to prevent parsing by older browsers that do not support Javascript.

    HTML comments in general are okay; it's just HTML comments inside <script> tags that get mangled, because they're turned into (and escaped within) CDATA. XOM, in turn, merges CDATA into Text.

    Technically, I think this means JTidy is broken, but it's good enough for my purposes since I don't need the <script> tags at all.

    Still, if anybody has a solution that gets out exactly what I put in, I'd like to hear it.
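
For the narrower goal of getting the script text without the comment markers embedded in it, a post-processing step on the Text value might be enough. This is only a sketch, not a fix for JTidy or XOM: it doesn't restore Comment nodes, ScriptTextCleaner and stripHideWrapper are names invented here, and it assumes the line breaks inside the script are still intact (true of the raw JTidy output above, not of the indented Serializer output).

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of XOM, JTidy, or NekoHTML.
final class ScriptTextCleaner {

    // Opening hide marker: optional "//" guard left by JTidy, then "<!--"
    // and the rest of that line (browsers treat it as a one-line comment).
    private static final Pattern OPEN =
            Pattern.compile("^\\s*(//\\s*)?<!--[^\\n]*\\n?");

    // Closing hide marker: optional "//" before "-->", plus an optional
    // trailing "//" guard left by JTidy, anchored at the end of the text.
    private static final Pattern CLOSE =
            Pattern.compile("\\s*(//\\s*)?-->\\s*(//\\s*)?$");

    // Removes the legacy "<!-- ... // -->" wrapper from script text.
    // Assumes line breaks are preserved; on single-line input the
    // opening pattern would swallow the whole script.
    static String stripHideWrapper(String scriptText) {
        String s = OPEN.matcher(scriptText).replaceFirst("");
        s = CLOSE.matcher(s).replaceFirst("");
        return s.trim();
    }

    public static void main(String[] args) {
        // Text as it would appear after the CDATA section is merged.
        String merged = "\n//\n        <!-- // Hide the JS\n"
                + "        startTimeout(6000000, \"/\");\n"
                + "        // -->\n    //\n";
        System.out.println(stripHideWrapper(merged));
    }
}
```

The idea would be to call stripHideWrapper on the value of the script element's Text child before using it; text that has no wrapper at all should come back unchanged apart from trimming.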