Tags: javascript, java, html, htmlunit, sourceforge

HtmlUnit does not return the fully loaded page with JavaScript


I'm trying to get the content of a web page, namely the right side of the page with the list of apartments (div elements with class="classified"). When viewing the page in a browser, it's clear that it uses JavaScript.

I'm using HtmlUnit for Java, in particular the waitForBackgroundJavaScript(10000) method, to wait until the JavaScript has finished. However, it still doesn't work for me: I get the same HTML as on the initial call, without the elements that show the apartments.

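import java.net.URL;
import java.util.logging.Level;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Silence HtmlUnit and HttpClient logging
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache.http.client").setLevel(Level.OFF);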

URL url = new URL("https://r.onliner.by/pk/#bounds%5Blb%5D%5Blat%5D=53.75074091071493&bounds%5Blb%5D%5Blong%5D=27.301025390625004&bounds%5Brt%5D%5Blat%5D=54.04527964804286&bounds%5Brt%5D%5Blong%5D=27.822875976562504");

WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(50000); 

System.out.println(page.asXml());

webClient.close();

Enabling setThrowExceptionOnScriptError shows some exceptions in the JavaScript code (I'm not sure whether this is relevant to the question, as there are no such issues in a browser).

I've also tried some other approaches, such as:

// option 2
webClient.waitForBackgroundJavaScriptStartingBefore(50000);

// option 3
webClient.setJavaScriptTimeout(50000);

// option 4
JavaScriptJobManager manager = page.getEnclosingWindow().getJobManager();
while (manager.getJobCount() > 0)
    Thread.sleep(1000);

but nothing worked. Could you please advise how to get the content of the page?


Solution

  • Given the problems HtmlUnit has with JavaScript, you need a workaround. Since you know which element you are waiting for, you can poll for it in a while loop. It could look something like this:

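    // asXml() returns the markup; asText() returns only the visible text,
    // so it would never contain a tag such as <div ...>.
    while (!page.asXml().contains("<div id=\"example-id\">")) {
        webClient.waitForBackgroundJavaScript(500);
    }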
    

    If you are worried about getting stuck in this loop, you can add a counter to the while condition. In my experience, this is a reliable way of dealing with this kind of delay.
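
The counter idea can be sketched as a small helper that stops after a fixed number of attempts. This is only a minimal sketch: the class name BoundedPoll, the method waitUntil, and the attempt limit are hypothetical names chosen for illustration and are not part of HtmlUnit. The helper uses only the Java standard library, and the HtmlUnit calls appear only in the usage comment.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of HtmlUnit: polls a condition a bounded number of times.
final class BoundedPoll {
    private BoundedPoll() {}

    /**
     * Checks the condition up to maxAttempts times, running waitStep after
     * each failed check, then checks once more after the last wait.
     * Returns true if the condition held, false if the limit was reached.
     */
    static boolean waitUntil(BooleanSupplier condition, int maxAttempts, Runnable waitStep) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            waitStep.run();
        }
        return condition.getAsBoolean();
    }

    // Possible use with the page and webClient from the question
    // (the element id and the limit of 20 attempts are assumptions):
    //
    //   boolean loaded = BoundedPoll.waitUntil(
    //       () -> page.asXml().contains("<div id=\"example-id\">"),
    //       20,
    //       () -> webClient.waitForBackgroundJavaScript(500));
}
```

With a 500 ms wait per attempt, 20 attempts cap the total wait at roughly ten seconds. A false result means the element never appeared, which the caller can treat as a failed load rather than looping forever.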