Tags: hadoop, solr, hdfs, nutch, nutch2

Nutch 1.17 web crawling with storage optimization


I am using Nutch 1.17 to crawl over a million websites. I have to perform the following things for this.

  1. Run the crawler once as a deep crawl so that it fetches the maximum number of URLs from the given (1 million) domains. This first run may take up to 48 hours (see the sketch after this list).
  2. After this, run the crawler again on the same 1 million domains after 5 to 6 hours and only select those URLs that are new on those domains.
  3. After the job completes, index the URLs in Solr.
  4. Later on, there is no need to store the raw HTML, so to save storage (HDFS) remove only the raw data and keep each page's metadata, so that the next job avoids re-fetching a page (before its scheduled time).
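For reference, here is a minimal sketch of what one such crawl looks like as a sequence of generate/fetch/parse/updatedb rounds, driving the low-level bin/nutch commands from Python. The paths (crawl/crawldb, crawl/segments, urls/) and parameter values (-topN, -threads, number of rounds) are illustrative assumptions; in practice the bin/crawl script shipped with Nutch automates this loop.

```python
import subprocess

# Assumed HDFS layout; adjust to your own crawl directory.
CRAWLDB, SEGMENTS, SEEDS = "crawl/crawldb", "crawl/segments", "urls"

def nutch(*args):
    """Run one bin/nutch sub-command and fail fast on errors."""
    subprocess.run(["bin/nutch", *args], check=True)

# One-time: inject the 1 million seed domains into the CrawlDb.
nutch("inject", CRAWLDB, SEEDS)

for round_no in range(4):  # number of crawl rounds within the 48-hour window
    # Select the URLs due for fetching in this round (creates a new segment).
    nutch("generate", CRAWLDB, SEGMENTS, "-topN", "5000000")

    # Pick the segment that generate just created (newest timestamped directory).
    out = subprocess.run(["hadoop", "fs", "-ls", "-C", SEGMENTS],
                         capture_output=True, text=True, check=True)
    segment = sorted(out.stdout.split())[-1]

    nutch("fetch", segment, "-threads", "50")   # fetch the generated URLs
    nutch("parse", segment)                     # parse the fetched content
    nutch("updatedb", CRAWLDB, segment)         # feed new links and status back
```

On later runs the same loop re-uses the existing CrawlDb, so generate only selects URLs that are new or due for re-fetch.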

There isn't any other processing or post-analysis. I have the option of using a medium-sized Hadoop cluster (max 30 machines). Each machine has 16 GB RAM, 12 cores and 2 TB storage. The Solr machine(s) have the same specs. To achieve the above, I am curious about the following:

a. How can I achieve the above document crawl rate, i.e., how many machines are enough?
b. Do I need to add more machines, or is there a better solution?
c. Is it possible to remove raw data from Nutch and keep only the metadata?
d. Is there a best strategy to achieve the above objectives?

Solution

  • a. How can I achieve the above document crawl rate, i.e., how many machines are enough?

    Assuming a polite delay between successive fetches to the same domain is observed: if 10 pages can be fetched per domain per minute, the maximum crawl rate is 600 million pages per hour (10^6 * 10 * 60). A cluster with 360 cores should be enough to come close to this rate. Whether it's possible to crawl the one million domains exhaustively within 48 hours depends on the size of each domain. Keep in mind that at the mentioned rate of 10 pages per domain per minute, it's only possible to fetch 10 * 60 * 48 = 28,800 pages per domain within 48 hours.
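    As a quick sanity check, these bounds can be reproduced with a few lines of arithmetic (the 10 pages per domain per minute figure is an assumption that depends on the configured politeness delay and fetcher settings):

    ```python
    # Back-of-the-envelope crawl-rate estimate under assumed politeness settings.
    DOMAINS = 1_000_000          # seed domains
    PAGES_PER_DOMAIN_MIN = 10    # assumed polite fetch rate per domain
    HOURS = 48                   # length of the first deep crawl

    pages_per_hour = DOMAINS * PAGES_PER_DOMAIN_MIN * 60
    pages_per_domain_48h = PAGES_PER_DOMAIN_MIN * 60 * HOURS

    print(f"max crawl rate: {pages_per_hour:,} pages/hour")           # 600,000,000
    print(f"max pages per domain in 48 h: {pages_per_domain_48h:,}")  # 28,800
    ```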

  • c. Is it possible to remove raw data from Nutch and keep only the metadata?

    As soon as a segment has been indexed you can delete it. The CrawlDb is sufficient to decide whether a link found on one of the 1 million home pages is new; it also keeps each page's fetch status and schedule, so pages are not re-fetched before they are due.
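    As an illustration, a minimal cleanup sketch (the crawl/segments path and the segment name are placeholder assumptions) that drops an already-indexed segment from HDFS while leaving the CrawlDb in place:

    ```python
    import subprocess

    def delete_indexed_segment(segment_path):
        """Remove a segment's raw data (content, parse output) once it is in Solr."""
        # The CrawlDb lives in a separate directory and is not touched by this call.
        subprocess.run(["hadoop", "fs", "-rm", "-r", "-skipTrash", segment_path],
                       check=True)

    delete_indexed_segment("crawl/segments/20240101123000")
    ```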

  • 3. After the job completes, index the URLs in Solr

    If possible, index each segment immediately after its crawl cycle completes; it can then be deleted right away (see point c).
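    A possible per-cycle indexing step, appended to the loop sketched above (the segment name is a placeholder; in Nutch 1.17 the Solr endpoint is configured for the indexer-solr writer in conf/index-writers.xml, and the exact indexer options differ slightly between versions):

    ```python
    import subprocess

    def nutch(*args):
        subprocess.run(["bin/nutch", *args], check=True)

    # Index the segment fetched in the current cycle; once this succeeds,
    # the segment's raw data can be deleted as described under point c.
    nutch("index", "crawl/crawldb", "crawl/segments/20240101123000")
    ```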

  • b. Do I need to add more machines, or is there a better solution? d. Is there a best strategy to achieve the above objectives?

    A lot depends on whether the domains are of similar size or not. If they follow a power-law distribution (which is likely), there will be a few domains with multiple millions of pages (hardly ever crawled exhaustively) and a long tail of domains with only a few pages (at most a few hundred). In this situation you need fewer resources but more time to achieve the desired result.