
HtmlUnit: parsing HTML in version 2.70


I have been trying to upgrade HtmlUnit (https://www.htmlunit.org/) from version 2.27 to version 2.70. I noticed that the static method HTMLParser.parseHtml() no longer exists. As I understand it, I now have to instantiate an HtmlUnitNekoHtmlParser, something like this:

HTMLParser htmlParser = new HtmlUnitNekoHtmlParser();
HtmlPage htmlPage = new HtmlPage(tmpResponse, tmpWebWindow);
htmlParser.parse(tmpResponse, htmlPage, true, true);

However, this leads to an error:

No script object associated with the Page. class: 'com.gargoylesoftware.htmlunit.html.HtmlPage'

According to the javadoc here:

https://javadoc.io/doc/net.sourceforge.htmlunit/htmlunit/latest/index.html

The two booleans indicate whether to use the XHTML parser and whether the script was created by JavaScript.

I have tried the following combinations:

htmlParser.parse(tmpResponse, htmlPage, false, true)
Same error: No script object associated with the Page

htmlParser.parse(tmpResponse, htmlPage, false, false)
Same error: No script object associated with the Page

htmlParser.parse(tmpResponse, htmlPage, true, false)
Same error: No script object associated with the Page

What is the correct way to replace the old HTMLParser.parseHtml() calls in this new version of HtmlUnit?


Solution

  • Oh, 2.27 to 2.70 is a huge step.

    Option 1: you simply want to parse HTML from a string (see https://www.htmlunit.org/faq.html#HowToParseHtmlString)

    You can do it like this...

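    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    // browserVersion: e.g. BrowserVersion.BEST_SUPPORTED; htmlCode: the HTML as a String
    try (WebClient webClient = new WebClient(browserVersion)) {
        final HtmlPage page = webClient.loadHtmlCodeIntoCurrentWindow(htmlCode);
        // work with the html page
    }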
    

    Option 2: the hard way (in general, you have to do what the implementation behind Option 1 does)

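    import com.gargoylesoftware.htmlunit.WebWindow;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.parser.HTMLParser;

    // Take the parser and the window from the WebClient instead of creating them yourself
    final HTMLParser htmlParser = webClient.getPageCreator().getHtmlParser();
    final WebWindow webWindow = webClient.getCurrentWindow();

    // Attach the new page to the window before parsing
    final HtmlPage page = new HtmlPage(webResponse, webWindow);
    webWindow.setEnclosedPage(page);

    // xhtml = false, createdByJavascript = false
    htmlParser.parse(webResponse, page, false, false);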
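
    To make Option 2 concrete, here is a minimal, self-contained sketch that parses a string by hand. It assumes HtmlUnit 2.70 (the com.gargoylesoftware.htmlunit packages) is on the classpath. The class name ParseHtmlSketch, the helper titleOf and the dummy URL are made up for illustration and are not part of HtmlUnit. Compared with the snippet in the question, the difference that most likely matters is that the window comes from a WebClient and the page is registered with it via setEnclosedPage before parsing.

    ```java
    import java.io.IOException;
    import java.net.URL;

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.StringWebResponse;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.WebWindow;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.parser.HTMLParser;

    public class ParseHtmlSketch {

        // Hypothetical helper: parses htmlCode by hand and returns the page title.
        public static String titleOf(final String htmlCode) throws IOException {
            try (WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
                // Wrap the string in a response; the URL is an arbitrary placeholder.
                final StringWebResponse webResponse =
                        new StringWebResponse(htmlCode, new URL("http://localhost/dummy.html"));

                // Parser and window both come from the WebClient.
                final HTMLParser htmlParser = webClient.getPageCreator().getHtmlParser();
                final WebWindow webWindow = webClient.getCurrentWindow();

                // Create the page and register it with the window before parsing.
                final HtmlPage page = new HtmlPage(webResponse, webWindow);
                webWindow.setEnclosedPage(page);

                // xhtml = false, createdByJavascript = false
                htmlParser.parse(webResponse, page, false, false);
                return page.getTitleText();
            }
        }

        public static void main(final String[] args) throws IOException {
            System.out.println(titleOf(
                    "<html><head><title>Hello</title></head><body><p>Hi</p></body></html>"));
        }
    }
    ```

    If you already have a WebResponse from somewhere else, you can pass it in place of the StringWebResponse; the rest of the sequence should stay the same.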
    

    Hope that helps