Tags: java, web-crawler, heritrix

Heritrix not finding CSS files in conditional comment blocks


The Problem/evidence

Heritrix is not detecting files referenced inside conditional comments whose opening line both opens and closes a comment, such as this:

<!--[if (gt IE 8)|!(IE)]><!--> 
<link rel="stylesheet" href="/css/mod.css" />
<!--<![endif]-->

However, standard conditional blocks like this work fine:

<!--[if lte IE 9]>
<script src="/js/ltei9.js"></script>
<![endif]-->

I've identified the problem as being with this part of the comment:

<!-->

Removing that part in a test case allows Heritrix to discover the CSS file.

The Question

How should I overcome this? Is it a Heritrix bug, or is it something we can get around with a crawler-beans declaration? I'm aware that the comment block is there to "trick" certain browser versions, and changing the website code is not an option. Can Heritrix be adapted to negate comments?


Solution

  • ExtractorHTML parses the page using the following regex:

    static final String RELEVANT_TAG_EXTRACTOR =
      "(?is)<(?:((script[^>]*+)>.*?</script)" + // 1, 2
      "|((style[^>]*+)>.*?</style)" + // 3, 4
      "|(((meta)|(?:\\w{1,"+MAX_ELEMENT_REPLACE+"}))\\s+[^>]*+)" + // 5, 6, 7
      "|(!--(?!\\[if).*?--))>"; // 8
    

    Cases 1 to 7 match the tags that matter for link extraction, and case 8 matches HTML comments so that they can be discarded. Case 8 deliberately avoids matching comments that start with <!--[if, so conditional comments are not discarded. In your case, however, the <!--> that follows the condition is itself matched as the start of a comment, and everything up to the next closing --> is discarded, including the <link> element.

    <!--[if (gt IE 8)|!(IE)]><!--> is a trick that keeps the XHTML valid while letting every non-IE browser parse the conditional content. Heritrix could be fixed by making RELEVANT_TAG_EXTRACTOR not treat <!--> as the start of a comment. This should work:

    static final String RELEVANT_TAG_EXTRACTOR =
      "(?is)<(?:((script[^>]*+)>.*?</script)" + // 1, 2
      "|((style[^>]*+)>.*?</style)" + // 3, 4
      "|(((meta)|(?:\\w{1,"+MAX_ELEMENT_REPLACE+"}))\\s+[^>]*+)" + // 5, 6, 7
      "|(!--(?!\\[if|>).*?--))>"; // 8
    

    You can always compile a Java class that extends org.archive.modules.extractor.ExtractorHTML with this fix, and use your class in place of the original ExtractorHTML.
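
    The behaviour of the two patterns can be checked outside Heritrix before patching anything. The following is a minimal sketch, not Heritrix code: the class name ConditionalCommentDemo, the helper methods, and the value 1024 standing in for the element-length limit are assumptions made for illustration. It runs both versions of the pattern against the markup from the question and lists the matches that are not comments (group 8 is the comment case).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConditionalCommentDemo {
    // Assumption: stand-in for the element-length limit used in the real pattern.
    public static final int MAX_ELEMENT = 1024;

    // Builds the tag pattern quoted above; only the comment lookahead varies.
    public static String tagExtractor(String commentLookahead) {
        return "(?is)<(?:((script[^>]*+)>.*?</script)"                    // 1, 2
             + "|((style[^>]*+)>.*?</style)"                              // 3, 4
             + "|(((meta)|(?:\\w{1," + MAX_ELEMENT + "}))\\s+[^>]*+)"     // 5, 6, 7
             + "|(!--(?!" + commentLookahead + ").*?--))>";               // 8
    }

    public static final Pattern ORIGINAL = Pattern.compile(tagExtractor("\\[if"));
    public static final Pattern PATCHED = Pattern.compile(tagExtractor("\\[if|>"));

    // The markup from the question.
    public static final String HTML =
          "<!--[if (gt IE 8)|!(IE)]><!-->\n"
        + "<link rel=\"stylesheet\" href=\"/css/mod.css\" />\n"
        + "<!--<![endif]-->\n";

    // Returns every match that is not a comment (group 8 is null for those).
    public static List<String> elements(Pattern pattern, String html) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            if (m.group(8) == null) {
                found.add(m.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println("original: " + elements(ORIGINAL, HTML));
        System.out.println("patched:  " + elements(PATCHED, HTML));
    }
}
```

    With the original lookahead the list should come back empty, because the <link> element is swallowed by the comment match that starts at <!-->. With the patched lookahead the <link> element should be reported, while <!--<![endif]--> is still matched as an ordinary comment.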