In short: I can't get the title of this URL: http://www.namlihipermarketleri.com.tr/default.asp?git=9&urun=10277 (the link is broken as of 18-11-2015).
In my WebCrawler implementation:
@Override
public void visit(Page page) {
    // The URL itself prints correctly.
    System.out.println(page.getWebURL().getURL());
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        // The title, however, prints as an empty line.
        System.out.println(htmlParseData.getTitle());
    }
}
Note: The title itself contains some commas (","). Can you suggest a solution? Is this a bug?
Thanks in advance.
The problem was probably that the HTML document contained 4 title tags.
I worked around it by parsing the page's HTML with Jsoup (http://jsoup.org/) and taking the first title element:
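HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String html = htmlParseData.getHtml();
// Parse the raw HTML with Jsoup (org.jsoup.Jsoup, org.jsoup.nodes.Document).
Document htmlDocument = Jsoup.parse(html);
// Take the text of the first <title> element.
String title = htmlDocument.getElementsByTag("title").get(0).text();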
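If one of the duplicate title tags is empty, taking the element at index 0 could still return an empty string. Below is a minimal sketch of picking the first non-empty title instead. To keep it runnable without extra dependencies, it uses java.util.regex rather than Jsoup; for real pages the Jsoup approach above is more robust. The class name TitleExtractor, the method firstNonEmptyTitle, and the sample HTML are hypothetical and not part of crawler4j or Jsoup. HTML entities are left undecoded.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of crawler4j or Jsoup.
class TitleExtractor {
    // Matches <title ...>text</title>, ignoring case and spanning line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of the first <title> element that is not blank.
    static Optional<String> firstNonEmptyTitle(String html) {
        Matcher matcher = TITLE.matcher(html);
        while (matcher.find()) {
            // Collapse runs of whitespace and trim the ends.
            String text = matcher.group(1).replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                return Optional.of(text);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample page with an empty title followed by a real one.
        String html = "<html><head><title></title>"
                + "<title>Rice, 1 kg, bag</title></head><body></body></html>";
        System.out.println(firstNonEmptyTitle(html).orElse("(no title)"));
    }
}
```

Inside visit(Page page), the string returned by htmlParseData.getHtml() could be passed to firstNonEmptyTitle in the same way.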