What is the difference between a web crawler and a parser?
In Java, libraries that fetch web pages go by different names. For example, Nutch is called a crawler and jsoup a parser.
Do they serve the same purpose?
Are they interchangeable for the job?
thanks
The jsoup library is a Java library for working with real-world HTML. It can fetch a page and parse its HTML. However, it is not a web crawler by itself, because it fetches only one page at a time. To crawl with it, you would have to write a custom program (a crawler) around jsoup that fetches a page, extracts its URLs, and then fetches those new URLs.
A web crawler uses an HTML parser to extract URLs from a page it has already fetched and adds each newly discovered URL to its frontier, the queue of URLs still waiting to be fetched.
A general sequence diagram of a Web crawler can be found in this answer: What sequence of steps does crawler4j follow to fetch data?
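The fetch, parse, and enqueue loop described above can be sketched roughly as follows. This is only an illustration, not how Nutch or crawler4j are actually implemented: the class name MiniCrawler, the method crawl, and the fetchAndExtractLinks parameter are all made up for this example. The function parameter stands in for "download the page and let an HTML parser (jsoup, for instance) return its outgoing links", so the sketch uses only the JDK.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a crawl loop; names are illustrative only.
class MiniCrawler {

    /**
     * Breadth-first crawl starting at seedUrl.
     * fetchAndExtractLinks is an assumed helper: given a URL, it fetches the
     * page and returns the links an HTML parser found in it.
     * Returns the URLs in the order they were fetched.
     */
    static List<String> crawl(String seedUrl,
                              Function<String, List<String>> fetchAndExtractLinks,
                              int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();          // URLs already queued or fetched
        List<String> fetched = new ArrayList<>();    // fetch order, for inspection

        frontier.add(seedUrl);
        seen.add(seedUrl);

        while (!frontier.isEmpty() && fetched.size() < maxPages) {
            String url = frontier.poll();
            // This call is the only part a parser library covers.
            List<String> links = fetchAndExtractLinks.apply(url);
            fetched.add(url);
            for (String link : links) {
                // Set.add returns false for URLs we already know about,
                // so each URL enters the frontier at most once.
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return fetched;
    }
}
```

Everything outside the single fetchAndExtractLinks call (the frontier, the seen set, the page limit) is what the crawler contributes on top of the parser.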
To summarize:
An HTML parser is a necessary component of a web crawler: it parses the HTML input and extracts URLs from it. However, an HTML parser alone is not a web crawler, because it lacks necessary features such as keeping track of previously visited URLs, politeness, etc.
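As one illustration of the politeness point: crawlers commonly avoid hitting the same host too often in a short time. Below is a minimal sketch of a per-host delay, assuming a fixed minimum gap between requests to the same host. The class PolitenessGate and its reserve method are hypothetical names for this example; real crawlers typically also honor robots.txt, which is not shown here.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: enforces a minimum gap between requests to one host.
class PolitenessGate {
    private final long minDelayMillis;
    // Earliest time (in ms) at which each host may be contacted again.
    private final Map<String, Long> nextAllowedByHost = new HashMap<>();

    PolitenessGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Reserves the next fetch slot for the URL's host and returns how many
     * milliseconds the caller should wait, measured from nowMillis.
     */
    long reserve(String url, long nowMillis) {
        // getHost() may return null for unusual URIs; HashMap accepts a null key.
        String host = URI.create(url).getHost();
        long nextAllowed = nextAllowedByHost.getOrDefault(host, nowMillis);
        long fetchAt = Math.max(nowMillis, nextAllowed);
        nextAllowedByHost.put(host, fetchAt + minDelayMillis);
        return fetchAt - nowMillis;
    }
}
```

A crawl loop would call reserve before each fetch and sleep for the returned number of milliseconds. A parser library has no reason to track any of this, which is part of why it is not a crawler on its own.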