java, html, xpath, htmlcleaner

HtmlCleaner failing on some xpaths generated by XPather


I am using the HtmlCleaner 2.1 library to evaluate XPaths generated by the XPather plugin against HTML in order to scrape content from it. Sometimes, however, HtmlCleaner fails to evaluate an XPath.

For example, take http://www.megaoutdoors.co.uk/norwegen-army-shirt-zipped-roll-top-collar-278-p.asp

For the product title, the XPath given by XPather is //body/div[11]/div[6]/div[2]/form/div[1]/h1, but this fails when I evaluate it using HtmlCleaner.

How can I overcome this problem? Does the structure of the page change when HtmlCleaner cleans it?
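
The evaluation is done roughly like this (a simplified sketch, not my exact code; the class name is only for illustration):

    import java.net.URL;

    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    public class TitleScraper {
        public static void main(String[] args) throws Exception {
            HtmlCleaner cleaner = new HtmlCleaner();
            TagNode root = cleaner.clean(new URL(
                    "http://www.megaoutdoors.co.uk/norwegen-army-shirt-zipped-roll-top-collar-278-p.asp"));

            // XPath copied from XPather; evaluateXPath returns an empty array
            // when nothing in the cleaned DOM matches.
            Object[] hits = root.evaluateXPath("//body/div[11]/div[6]/div[2]/form/div[1]/h1");
            if (hits.length == 0) {
                System.out.println("XPath matched nothing");
            } else {
                System.out.println(((TagNode) hits[0]).getText());
            }
        }
    }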

Thanks
Jitendra


Solution

  • Does the structure of the page change when HtmlCleaner cleans it?

    According to the intro example on http://htmlcleaner.sourceforge.net/, HtmlCleaner certainly can change the structure of the page when cleaning it up. In that example it adds html and body elements and moves the h1 element out of the table.

    Why don't you run HtmlCleaner on the page and look at the output from it? Then you'll be able to tell whether and how the structure has changed.
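
    A minimal sketch of how you might dump the cleaned document for inspection, assuming HtmlCleaner 2.x and its PrettyXmlSerializer (the class and file names here are arbitrary):

        import java.net.URL;

        import org.htmlcleaner.CleanerProperties;
        import org.htmlcleaner.HtmlCleaner;
        import org.htmlcleaner.PrettyXmlSerializer;
        import org.htmlcleaner.TagNode;

        public class DumpCleanedDom {
            public static void main(String[] args) throws Exception {
                HtmlCleaner cleaner = new HtmlCleaner();
                CleanerProperties props = cleaner.getProperties();

                TagNode root = cleaner.clean(new URL(
                        "http://www.megaoutdoors.co.uk/norwegen-army-shirt-zipped-roll-top-collar-278-p.asp"));

                // Write the cleaned document to a file; open it and count the
                // div elements under body to see where the indexes in the
                // XPather expression stop lining up.
                new PrettyXmlSerializer(props).writeToFile(root, "cleaned.xml", "utf-8");
            }
        }

    Comparing that output with the DOM the browser shows in Firebug/XPather should make it clear which div indexes have shifted.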

    Is there some way to avoid it, or in other words, to keep the DOM generated by HtmlCleaner as close as possible to the DOM built by the browser?

    You could do this by specifying a modified tag info set, different from the default one. This is apparently what configures the "corrections" of the DOM. (See here for how to use it if you're using the command-line interface.)
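
    If you configure it programmatically rather than from the command line, the hook is the ITagInfoProvider you can pass to the HtmlCleaner constructor. The sketch below only shows where the customisation goes; which TagInfo values to return needs to be worked out against the 2.1 javadoc, so treat it as a placeholder rather than a working configuration:

        import org.htmlcleaner.HtmlCleaner;
        import org.htmlcleaner.ITagInfoProvider;
        import org.htmlcleaner.TagInfo;

        public class LenientCleanerFactory {

            public static HtmlCleaner newCleaner() {
                // A provider of your own replaces the default tag info set.
                ITagInfoProvider provider = new ITagInfoProvider() {
                    public TagInfo getTagInfo(String tagName) {
                        // Return customised TagInfo objects here to relax the
                        // rules that make HtmlCleaner restructure the page;
                        // which settings to change depends on which elements
                        // are being moved in your particular page.
                        return null; // placeholder only
                    }
                };
                return new HtmlCleaner(provider);
            }
        }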

    Or could you suggest another HTML parser whose DOM is very close to the DOM built by the browser, so that XPaths generated by the XPather plugin would fail only rarely?

    I would try HTML Tidy and see what it does to the DOM. It's a widely used and mature program for tidying up scraped HTML.
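
    If you want to stay in Java, JTidy (a Java port of HTML Tidy, which is my suggestion here, not something from the question) produces a standard W3C DOM that you can query with javax.xml.xpath, so the XPather expression can be tried unchanged. A rough sketch:

        import java.io.InputStream;
        import java.net.URL;

        import javax.xml.xpath.XPath;
        import javax.xml.xpath.XPathFactory;

        import org.w3c.dom.Document;
        import org.w3c.tidy.Tidy;

        public class TidyScraper {
            public static void main(String[] args) throws Exception {
                Tidy tidy = new Tidy();
                tidy.setQuiet(true);
                tidy.setShowWarnings(false);

                InputStream in = new URL(
                        "http://www.megaoutdoors.co.uk/norwegen-army-shirt-zipped-roll-top-collar-278-p.asp")
                        .openStream();
                try {
                    // Tidy the page into a W3C DOM, then run the XPather expression on it.
                    Document doc = tidy.parseDOM(in, null);
                    XPath xpath = XPathFactory.newInstance().newXPath();
                    String title = xpath.evaluate(
                            "//body/div[11]/div[6]/div[2]/form/div[1]/h1", doc);
                    System.out.println(title);
                } finally {
                    in.close();
                }
            }
        }

    Whether the div indexes still match will of course depend on how much Tidy itself restructures the page, so check its output the same way as suggested above.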