java dom htmlunit

HtmlUnit returning empty list of DomElements


I am having trouble retrieving the list of DOM elements with the getElementsByName method of HtmlPage.

Here is the part of the HTML page that I need (I am trying to get CategoriaAgente from the select tag):

<select name="CategoriaAgente">
  <option value="-">Escolha uma categoria</option>
  <option value="t">Todos</option>
  <option value="p">Permissionária de Distribuição</option>
  <option value="d">Concessionária de Distribuição</option>
</select>

Snippet of the Java code (Using HtmlUnit):

    public List<HtmlOption> listaAgentes() {
        List<HtmlOption> listaAgentes = null;

        try (WebClient webClient = new WebClient()) {
            log.info("COLETANDO AGENTES"); // "collecting agents"

            // WebClient parameters
            webClient.setJavaScriptTimeout(15000);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setUseInsecureSSL(true);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setTimeout(300000);

            String url = "https://www2.aneel.gov.br/aplicacoes_liferay/tarifa/";
            HtmlPage page = webClient.getPage(url);

            // select the agent category
            List<DomElement> listaCategoriaAgente = page.getElementsByName("CategoriaAgente");
            //...

The list listaCategoriaAgente is ALWAYS empty. I tried some solutions found on S.O., but none of them worked. Any help? Thanks in advance!

EDIT: After the comment from @hooknc, I found that the page is protected by some kind of captcha from Cloudflare. Requesting the URL from Postman returns a challenge form rather than the page itself.

Does anyone know how to get past this challenge form using HtmlUnit? Thanks!
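One way to confirm this from Java, without Postman, is to look for the challenge markers in whatever source the client actually received. The following is only a minimal sketch: the class name ChallengeCheck and the method looksLikeChallenge are hypothetical, and the marker strings are simply the ones that appear in this post, so Cloudflare may change them at any time.

```java
import java.util.List;

/** Heuristic check for a challenge page; the markers are taken from this post only. */
final class ChallengeCheck {

    // Assumption: these strings identify the challenge page. They are the ones
    // quoted in this post, not an official or stable list.
    private static final List<String> MARKERS = List.of(
            "Checking if the site connection is secure",
            "challenge-form",
            "challenge-success");

    private ChallengeCheck() {
    }

    /** Returns true if the page source contains any of the known challenge markers. */
    static boolean looksLikeChallenge(String pageSource) {
        if (pageSource == null) {
            return false;
        }
        for (String marker : MARKERS) {
            if (pageSource.contains(marker)) {
                return true;
            }
        }
        return false;
    }
}
```

With HtmlUnit this could be called as ChallengeCheck.looksLikeChallenge(page.asXml()) right after getPage, which makes it clear whether an empty result from getElementsByName is caused by the challenge page or by something else.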

EDIT 2:

Well, I think I made some progress(?)...

This is the code so far....

    try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        webClient.getOptions().setCssEnabled(false);
        webClient.setJavaScriptTimeout(0);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(0);
        webClient.getCookieManager().setCookiesEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setRedirectEnabled(true);
        webClient.getCache().setMaxSize(0);
        webClient.waitForBackgroundJavaScript(10_000);
        webClient.waitForBackgroundJavaScriptStartingBefore(10_000);

        HtmlPage page = null;
        String url = null;

        url = "https://www2.aneel.gov.br/aplicacoes_liferay/tarifa/";
        page = webClient.getPage(url);

        if (page.asXml().contains("Checking if the site connection is secure")) {
            log.info(page.asXml());

            synchronized(page) {
                page.wait(10_000);
            }
            webClient.waitForBackgroundJavaScript(10_000);
        }

And... this is what I get from the log...

<div id="challenge-success" style="display: none;">
      <div class="h2">
        <span class="icon-wrapper">
          <img class="heading-icon" alt="Success icon" src=""/>
        </span>
        Connection is secure
      </div>
      <div class="core-msg spacer">
        Proceeding...
      </div>
    </div>

So it says "Proceeding...", but nothing happens. I waited a very long time, and it just stays stuck on "Proceeding...".

Any thoughts?? Thanks!!!


Solution

  • Here is what happened. I posted a related question, and someone (possibly from the HtmlUnit team) pushed an update to the Git repository to fix the cookie problem. With that updated version (2.68.0-SNAPSHOT; I also had to update apache-commons-lang3), all the problems disappeared: Cloudflare accepted the connection and everything worked. Here is the final version of the code:

    try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        String url = "https://www2.aneel.gov.br:443/aplicacoes_liferay/tarifa/";

        // WebClient parameters
        webClient.getOptions().setCssEnabled(true);
        webClient.setJavaScriptTimeout(0);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(0);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setRedirectEnabled(true);

        // fresh cookie manager with cookies enabled
        CookieManager cookies = new CookieManager();
        cookies.setCookiesEnabled(true);
        webClient.setCookieManager(cookies);

        webClient.setAjaxController(new NicelyResynchronizingAjaxController());

        webClient.waitForBackgroundJavaScript(10000);
        webClient.waitForBackgroundJavaScriptStartingBefore(10000);

        webClient.getCache().setMaxSize(0);

        // silence HtmlUnit and HttpClient logging
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
        java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.OFF);
        java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(java.util.logging.Level.OFF);

        HtmlPage page = webClient.getPage(url);
        webClient.getRefreshHandler().handleRefresh(page, new URL(url), 10);

        // wait 10 seconds before inspecting the page
        synchronized(page) {
            page.wait(10000);
        }

        if (page.asXml().contains("Checking if the site connection is secure")) {
            log.info(page.asXml());
            webClient.waitForBackgroundJavaScript(10_000);
        }

        List<DomElement> listaCategoriaAgente = page.getElementsByName("CategoriaAgente");
    

    With the updated version and this piece of code, I got the list of DOM elements I needed. Thank you all for the help!
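    The code above uses a fixed 10-second wait. A possible refinement is to poll until the element shows up or a timeout expires. This is just a sketch under stated assumptions: Waiter and waitUntil are hypothetical names, nothing here is part of HtmlUnit, and it is not what the code above does.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Small polling helper (hypothetical, not part of HtmlUnit). */
final class Waiter {

    private Waiter() {
    }

    /**
     * Checks the condition repeatedly until it is true or the timeout expires.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000;
            if (remainingMillis <= 0) {
                return false;
            }
            try {
                // Sleep for one interval, but never past the deadline.
                Thread.sleep(Math.max(1, Math.min(interval.toMillis(), remainingMillis)));
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

    A call might look like Waiter.waitUntil(() -> !page.getElementsByName("CategoriaAgente").isEmpty(), Duration.ofSeconds(30), Duration.ofMillis(500)). If the challenge replaces the page in the window, the condition would probably need to re-read the current page from the WebClient instead of reusing the page variable.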