I'd like to learn: in general, what sequence of steps does a web crawler follow? I'm looking for a descriptive explanation. Thanks.
The process for a typical multi-threaded crawler is as follows:
1. We have a queue data structure, called the `frontier`. Newly discovered URLs (or starting points, the so-called seeds) are added to this data structure. In addition, every URL is assigned a unique ID in order to determine whether it has been visited before.
2. Crawler threads then obtain URLs from the `frontier` and schedule them for later processing.
3. The actual processing starts:
   - The `robots.txt` for the given URL is fetched and parsed to honour exclusion criteria and be a polite web crawler (this is configurable).
   - The page content is downloaded and parsed, and newly extracted URLs are added to the `frontier` (in `crawler4j` this can be controlled via `shouldVisit(...)`; see the sketch after this list).
4. The whole process is repeated until no new URLs are added to the `frontier`.
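To illustrate, here is a minimal `crawler4j` sketch (based on the 4.x API). `shouldVisit(...)` decides which extracted URLs enter the frontier, and `visit(...)` is called once a page has been fetched and parsed. The seed URL, domain filter, thread count and storage folder are placeholder assumptions for this example, not part of the answer above:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Decide whether an extracted URL should be added to the frontier.
    // Here: stay on one (placeholder) domain.
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return url.getURL().toLowerCase().startsWith("https://www.example.com/");
    }

    // Called after a page has been fetched and parsed.
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL()
                    + " -> " + html.getOutgoingUrls().size() + " outgoing links");
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");   // frontier + docid store live here
        config.setPolitenessDelay(1000);              // politeness: 1 s between requests per host

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher); // robots.txt handling

        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("https://www.example.com/"); // seed goes into the frontier

        controller.start(MyCrawler.class, 4);           // 4 crawler threads consume the frontier
    }
}
```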
Besides the implementation details of `crawler4j`, a more or less general (focused) crawler architecture (on a single server/PC) looks like this:
Disclaimer: Image is my own work. Please respect this by referencing this post.
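To make that general single-machine architecture more concrete (independent of `crawler4j`), here is a deliberately simplified, single-threaded sketch of the crawl loop. The helpers `fetch`, `extractLinks`, `isAllowedByRobotsTxt` and `shouldVisit` are placeholders I made up for illustration, not a real library API:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {

    public static void crawl(String seed) throws Exception {
        Queue<String> frontier = new ArrayDeque<>(); // URLs waiting to be processed
        Set<String> seen = new HashSet<>();          // stands in for the "unique ID per URL" check

        frontier.add(seed);
        seen.add(seed);

        while (!frontier.isEmpty()) {                // stop when no new URLs are added
            String url = frontier.poll();

            if (!isAllowedByRobotsTxt(url)) {        // honour robots.txt exclusion rules
                continue;
            }

            String html = fetch(url);                // download the page
            for (String link : extractLinks(html)) { // parse content, collect outgoing URLs
                if (seen.add(link) && shouldVisit(link)) {
                    frontier.add(link);              // only unseen, wanted URLs enter the frontier
                }
            }

            Thread.sleep(1000);                      // crude politeness delay
        }
    }

    // Placeholders: a real crawler would use an HTTP client, an HTML parser
    // and a robots.txt parser here.
    static boolean isAllowedByRobotsTxt(String url) { return true; }
    static boolean shouldVisit(String url)          { return true; }
    static String fetch(String url)                 { return ""; }
    static List<String> extractLinks(String html)   { return List.of(); }
}
```

A real multi-threaded crawler replaces the single loop with several worker threads sharing a thread-safe frontier and seen-set, which is essentially what `crawler4j` does internally.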