java, web-crawler, crawler4j

What sequence of steps does crawler4j follow to fetch data?


I'd like to learn:

  1. How does crawler4j work?
  2. Does it fetch a web page, then download its content and extract it?
  3. What about the .db and .csv files and their structures?

Generally, what sequence of steps does it follow?

A descriptive answer would be appreciated.

Thanks


Solution

  • General Crawler Process

    The process for a typical multi-threaded crawler is as follows:

    1. We have a queue data structure, called the frontier. Newly discovered URLs (or start points, so-called seeds) are added to this data structure. In addition, every URL is assigned a unique ID in order to determine whether it was previously visited (see the frontier sketch after this list).

    2. Crawler threads then obtain URLs from the frontier and schedule them for later processing.

    3. The actual processing starts:

      • The robots.txt for the given URL's host is fetched and parsed to honour exclusion criteria and be a polite web crawler (configurable).
      • Next, the thread checks for politeness, i.e. the time to wait before visiting the same host again.
      • The actual URL is visited by the crawler and its content is downloaded (this can be literally anything: HTML, images, PDFs, etc.).
      • If the content is HTML, it is parsed and newly discovered URLs are extracted and added to the frontier (in crawler4j this can be controlled via shouldVisit(...); see the example after this list).
    4. The whole process is repeated until the frontier is empty, i.e. all scheduled URLs have been processed and no new ones are added.
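
    To make steps 1 and 2 concrete, here is a minimal sketch of a thread-safe frontier with duplicate detection. This is an illustration with made-up names, not crawler4j's actual internals:

        import java.util.Set;
        import java.util.concurrent.BlockingQueue;
        import java.util.concurrent.ConcurrentHashMap;
        import java.util.concurrent.LinkedBlockingQueue;

        // Hypothetical frontier: a queue of URLs to crawl plus a "seen" set,
        // so that every URL is scheduled at most once.
        class Frontier {
            private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
            private final Set<String> seen = ConcurrentHashMap.newKeySet();

            // Called for seeds and for every newly extracted link.
            void schedule(String url) {
                if (seen.add(url)) {   // plays the role of the unique-ID check
                    queue.offer(url);
                }
            }

            // Crawler threads block here until a URL becomes available.
            String next() throws InterruptedException {
                return queue.take();
            }
        }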
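
    In crawler4j itself, steps 3 and 4 surface in a WebCrawler subclass: shouldVisit(...) decides which extracted links enter the frontier, and visit(...) receives the downloaded, parsed content. A sketch along the lines of the project's documentation (the filter pattern and domain are assumptions; older versions use a one-argument shouldVisit(WebURL)):

        import java.util.regex.Pattern;

        import edu.uci.ics.crawler4j.crawler.Page;
        import edu.uci.ics.crawler4j.crawler.WebCrawler;
        import edu.uci.ics.crawler4j.parser.HtmlParseData;
        import edu.uci.ics.crawler4j.url.WebURL;

        public class MyCrawler extends WebCrawler {

            // Skip typical binary/static resources (example pattern, adjust as needed).
            private static final Pattern FILTERS =
                    Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|mp3|zip|gz))$");

            // Decides whether an extracted link is added to the frontier.
            @Override
            public boolean shouldVisit(Page referringPage, WebURL url) {
                String href = url.getURL().toLowerCase();
                return !FILTERS.matcher(href).matches()
                        && href.startsWith("https://www.example.com/");
            }

            // Called after the page has been downloaded and parsed.
            @Override
            public void visit(Page page) {
                if (page.getParseData() instanceof HtmlParseData) {
                    HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                    String text = htmlParseData.getText(); // extracted page text
                    int outgoing = htmlParseData.getOutgoingUrls().size();
                    System.out.println("Visited: " + page.getWebURL().getURL()
                            + " (" + outgoing + " outgoing links, "
                            + text.length() + " chars of text)");
                }
            }
        }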

    General (Focused) Crawler Architecture

    Beyond the implementation details of crawler4j, a more or less general (focused) crawler architecture (on a single server/PC) looks like this:

    [Image: basic crawler architecture]

    Disclaimer: Image is my own work. Please respect this by referencing this post.
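
    For completeness, the crawl is wired up and started through a CrawlController. This is also where the politeness delay, the robots.txt handling and the crawl storage folder are configured; the storage folder is where crawler4j persists its internal frontier/dedup state as embedded Berkeley DB files (the .db files mentioned in the question). Again a sketch following the project's documentation; folder, seed URL and thread count are placeholders:

        import edu.uci.ics.crawler4j.crawler.CrawlConfig;
        import edu.uci.ics.crawler4j.crawler.CrawlController;
        import edu.uci.ics.crawler4j.fetcher.PageFetcher;
        import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
        import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

        public class Controller {
            public static void main(String[] args) throws Exception {
                CrawlConfig config = new CrawlConfig();
                // Internal state (frontier, visited-URL DB) is persisted here.
                config.setCrawlStorageFolder("/tmp/crawler4j/");
                // Politeness: minimum delay between requests to the same host.
                config.setPolitenessDelay(1000);

                PageFetcher pageFetcher = new PageFetcher(config);
                // robots.txt fetching and parsing (step 3, first bullet).
                RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
                RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

                CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

                // Seeds: start points added to the frontier (step 1).
                controller.addSeed("https://www.example.com/");

                // Blocks until the frontier is exhausted (step 4); 4 crawler threads.
                controller.start(MyCrawler.class, 4);
            }
        }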