
Which HTML parser should I use?


I am working on a product where I need to parse an HTML document. I have looked at Jericho, TagSoup, Jsoup, and crawler4j. Which parser should I use, given that I need to run this process in a multithreaded environment using Quartz?

If 10 threads run at the same time, I need an API that consumes little memory. I read somewhere that Jericho is a text-based search API and consumes less memory. Is that right? Or should I go with another parser, and if so, why?


Solution

  • Test them out and check their memory footprint. It's hard to make predictions on memory profiles without knowing and testing the HTML you're going to parse.

    FWIW, I've used Jsoup in a number of different systems and I find that it works really well. I have never noticed any rampant memory issues with it either.
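
    One rough way to "test them out" is a small harness that runs the same parse on several threads and compares heap usage before and after. The sketch below is only an illustration: `ParserMemoryProbe`, `Result`, and `measure` are hypothetical names, the stand-in lambda is not a real HTML parser, and the heap delta from `Runtime` is an approximation (`System.gc()` is only a hint), so a profiler will give more reliable numbers. To compare real libraries, you would pass each one in as the `parser` argument, for example `Jsoup::parse` with Jsoup on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: a rough heap probe, not a substitute for a profiler.
final class ParserMemoryProbe {

    /** Outcome of one run: parses completed and approximate heap growth. */
    static final class Result {
        final int parsedDocuments;
        final long heapDeltaBytes;

        Result(int parsedDocuments, long heapDeltaBytes) {
            this.parsedDocuments = parsedDocuments;
            this.heapDeltaBytes = heapDeltaBytes;
        }
    }

    private static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Parses the same HTML once per thread and reports the approximate
     * change in used heap while all parsed results are still reachable.
     */
    static Result measure(Function<String, ?> parser, String html, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            System.gc(); // a hint only; the JVM may ignore it
            long before = usedHeap();

            List<Future<Object>> futures = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                Callable<Object> task = () -> parser.apply(html);
                futures.add(pool.submit(task));
            }

            // Keep every parsed result referenced so it counts toward the heap.
            List<Object> retained = new ArrayList<>();
            for (Future<Object> future : futures) {
                retained.add(future.get());
            }

            long after = usedHeap();
            return new Result(retained.size(), after - before);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>sample</p></body></html>";
        // Stand-in "parser" for illustration; swap in the library under test.
        Result result = measure(s -> s.split("<"), html, 10);
        System.out.println("Parsed " + result.parsedDocuments
                + " documents, approx. heap delta: "
                + result.heapDeltaBytes + " bytes");
    }
}
```

    Run it several times per parser, with HTML that resembles your real input, and compare the deltas; a single run says very little because garbage collection timing varies.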