
crawler4j - I can't get the title


In short: I can't get the title of this page: http://www.namlihipermarketleri.com.tr/default.asp?git=9&urun=10277 (the link is broken as of 18-11-2015).

In my WebCrawler implementation:

     @Override
     public void visit(Page page) {          
         System.out.println(page.getWebURL().getURL()); // when this prints the url
         if (page.getParseData() instanceof HtmlParseData) {
             HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
             System.out.println(htmlParseData.getTitle()); // This line prints an empty line!
         }
     }

Note: the title itself contains some commas (","). Can you suggest a solution? Is this a bug?

Thanks in advance.


Solution

  • The problem was probably that the HTML document contained four title tags.

    I worked around it by parsing the HTML with Jsoup (http://jsoup.org/) and reading the first title element myself:

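    // Needs org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.nodes.Element
    HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
    String html = htmlParseData.getHtml();
    Document htmlDocument = Jsoup.parse(html);
    // Take the first <title> element; first() returns null if the page has none
    Element firstTitle = htmlDocument.getElementsByTag("title").first();
    String title = (firstTitle != null) ? firstTitle.text() : "";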
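
To check whether a page really does contain several title tags, something like the following can help. It is only a rough sketch that uses the standard library and a regular expression rather than Jsoup, so it is a diagnostic aid and not a robust HTML parser. The class name TitleProbe, its two helper methods and the sample HTML are all made up for illustration. Picking the first non-empty title is a variation on the workaround above, which simply takes the first title element.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j or Jsoup.
class TitleProbe {
    // Matches <title ...>text</title>, case-insensitively and across line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed text of every title element, in document order.
    static List<String> allTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1).trim());
        }
        return titles;
    }

    // Returns the first title that is not empty, or "" if there is none.
    static String firstNonEmptyTitle(String html) {
        for (String t : allTitles(html)) {
            if (!t.isEmpty()) {
                return t;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Made-up sample with duplicated title tags, for illustration only.
        String html = "<html><head><title>Shop, Products, Offers</title>"
                + "<title></title></head><body><title>Other</title></body></html>";
        System.out.println("title tags found: " + allTitles(html).size());
        System.out.println("first non-empty title: " + firstNonEmptyTitle(html));
    }
}
```

Inside visit(Page page), the string returned by htmlParseData.getHtml() could be passed to allTitles to see how many title tags the crawled page actually has.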