Searching SO and Google, I've found that there are a few Java HTML parsers which are consistently recommended by various parties. Unfortunately it's hard to find any information on the strengths and weaknesses of the various libraries. I'm hoping that some people have spent some time comparing these libraries, and can share what they've learned.
Here's what I've seen:
And if there's a major parser that I've missed, I'd love to hear about its pros and cons as well.
Thanks!
Almost all known HTML parsers implement the W3C DOM API (part of the JAXP API, Java API for XML Processing) and give you an org.w3c.dom.Document back which is ready for direct use by the JAXP API. The major differences are usually to be found in the features of the parser in question. Most parsers are to a certain degree forgiving and lenient with non-well-formed HTML ("tagsoup"), like JTidy, NekoHTML, TagSoup and HtmlCleaner. You usually use this kind of HTML parser to "tidy" the HTML source (e.g. replacing the HTML-valid <br> with an XML-valid <br />), so that you can traverse it "the usual way" using the W3C DOM and JAXP APIs.
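To illustrate the "tidying" step itself, here's a minimal sketch assuming JTidy is on the classpath; the input string is a made-up piece of tagsoup, and the exact output formatting depends on JTidy's configuration:

```java
import java.io.ByteArrayInputStream;

import org.w3c.tidy.Tidy;

public class TidyExample {
    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);         // emit XML-valid output, e.g. <br> becomes <br />
        tidy.setQuiet(true);         // suppress the summary report
        tidy.setShowWarnings(false); // suppress warnings about the sloppy input

        // Deliberately sloppy input: unclosed <p>, HTML-style <br>.
        String tagsoup = "<p>First line<br>Second line";
        tidy.parse(new ByteArrayInputStream(tagsoup.getBytes("UTF-8")), System.out);
    }
}
```

The tidied output is well-formed XHTML, which you can then feed to any standard XML tooling.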
The only ones which jump out are HtmlUnit and Jsoup.
HtmlUnit provides its own API which allows you to programmatically act like a web browser -- i.e. enter form values, click elements, invoke JavaScript, etc. It's much more than an HTML parser alone. It's a real "GUI-less web browser" and HTML unit testing tool.
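A minimal sketch of that browser-like API, assuming HtmlUnit is on the classpath; the URL and the form/field names ("search", "q", "go") are hypothetical and depend entirely on the page you target:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class HtmlUnitExample {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Load a page, fill in a form field and submit it, as a real
            // browser would. Form and input names here are made up.
            HtmlPage page = webClient.getPage("http://example.com/search");
            HtmlForm form = page.getFormByName("search");
            HtmlTextInput field = form.getInputByName("q");
            field.setValueAttribute("html parsers");
            HtmlSubmitInput button = form.getInputByName("go");
            HtmlPage result = button.click();
            System.out.println(result.getTitleText());
        }
    }
}
```

Because HtmlUnit executes JavaScript as well, this also works on pages that build their content client-side -- something a pure parser cannot do.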
Jsoup also provides an API which is completely its own. It lets you select elements using jQuery-like CSS selectors and provides a slick API to traverse the HTML DOM tree and get the elements of interest.
Particularly the traversal of the HTML DOM tree is Jsoup's major strength. Anyone who has worked with org.w3c.dom.Document knows what a pain it is to traverse the DOM using the verbose NodeList and Node APIs. True, XPath makes life easier, but it's still another learning curve, and it can still end up being verbose.
Here's an example which uses a "plain" W3C DOM parser like JTidy in combination with XPath to extract the first paragraph of your question and the names of all answerers (I am using XPath because without it, the code needed to gather the information of interest would be about ten times as long, unless you write utility/helper methods).
import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

String url = "http://stackoverflow.com/questions/3152138";
Document document = new Tidy().parseDOM(new URL(url).openStream(), null);
XPath xpath = XPathFactory.newInstance().newXPath();
Node question = (Node) xpath.compile("//*[@id='question']//*[contains(@class,'post-text')]//p[1]").evaluate(document, XPathConstants.NODE);
System.out.println("Question: " + question.getFirstChild().getNodeValue());
NodeList answerers = (NodeList) xpath.compile("//*[@id='answers']//*[contains(@class,'user-details')]//a[1]").evaluate(document, XPathConstants.NODESET);
for (int i = 0; i < answerers.getLength(); i++) {
    System.out.println("Answerer: " + answerers.item(i).getFirstChild().getNodeValue());
}
And here's an example of how to do exactly the same thing with Jsoup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String url = "http://stackoverflow.com/questions/3152138";
Document document = Jsoup.connect(url).get();
Element question = document.select("#question .post-text p").first();
System.out.println("Question: " + question.text());
Elements answerers = document.select("#answers .user-details a");
for (Element answerer : answerers) {
    System.out.println("Answerer: " + answerer.text());
}
Do you see the difference? It's not only less code; Jsoup is also relatively easy to grasp if you already have moderate experience with CSS selectors (e.g. from developing websites and/or using jQuery).
The pros and cons of each should be clear enough now. If you just want to use the standard JAXP API to traverse the result, then go for one of the first group of parsers I mentioned. There are quite a lot of them. Which one to choose depends on the features it provides (how is HTML cleaning made easy for you? are there listeners/interceptors and tag-specific cleaners?) and on the robustness of the library (how often is it updated/maintained/fixed?). If you want to unit test the HTML, then HtmlUnit is the way to go. If you want to extract specific data from the HTML (which is more often than not the real-world requirement), then Jsoup is the way to go.
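Note that Jsoup doesn't need a network connection to be useful: it parses tagsoup directly from a string as well. A small sketch, assuming jsoup is on the classpath; the HTML snippet is made up to show the leniency:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupLeniency {
    public static void main(String[] args) {
        // Deliberately broken input: unclosed <p> and <b>, unquoted attribute.
        String html = "<html><body><p class=lead>Hello <b>world";

        // Jsoup repairs the tagsoup into a sensible DOM, then CSS selectors work as usual.
        Document doc = Jsoup.parse(html);
        System.out.println(doc.select("p.lead").first().text()); // prints "Hello world"
    }
}
```

This makes it handy for cleaning or scraping HTML you've already fetched or stored, not just live pages via Jsoup.connect().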