I'm crawling web pages with Apache Nutch (version 1.18).
I thought that adding more Hadoop nodes would make Nutch crawl web pages faster.
However, it doesn't: there is almost no difference between crawling with 3 datanodes and with 5 datanodes.
I've also added the --num-fetchers parameter (value 5, because I have 5 Hadoop datanodes).
Please help me find the problem.
Only a broad web crawl covering many web sites (hosts / domains) will profit from adding more Hadoop nodes. If only a small number of sites is crawled, parallelization will not make Nutch faster: Nutch is configured to behave politely by default, so it does not access a single site in parallel and it waits between successive fetches from the same site. For example, with the default fetcher.server.delay of 5 seconds, a single site is fetched at a rate of at most about 12 pages per minute, no matter how many nodes you add.
But there are ways to make Nutch crawl a single site faster.
To make a single fetcher task faster (and fetch more aggressively from a single host, or domain, depending on the value of partition.url.mode), the following configuration properties need to be adapted: fetcher.server.delay, fetcher.threads.per.queue, and possibly other fetcher properties.
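As an illustration, overrides along these lines in conf/nutch-site.xml would shorten the per-host delay and fetch a host with several threads at once. The values shown are assumptions for the sketch, not recommendations; note that when fetcher.threads.per.queue is greater than 1, Nutch applies fetcher.server.min.delay instead of fetcher.server.delay.

```xml
<!-- Hypothetical overrides in conf/nutch-site.xml; tune the values to your needs. -->
<property>
  <name>fetcher.server.delay</name>
  <!-- Seconds to wait between successive requests to the same host (default 5.0). -->
  <value>1.0</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <!-- Number of threads fetching from the same host in parallel (default 1). -->
  <value>4</value>
</property>
```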
To allow multiple fetcher tasks (Hadoop nodes) to crawl the same web site in parallel, the getPartition method of URLPartitioner needs to be modified, see this discussion.
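To illustrate the idea, here is a minimal, self-contained sketch (not Nutch's actual class; in Nutch the logic lives in org.apache.nutch.crawl.URLPartitioner) contrasting the default host-based partitioning, which sends all URLs of one host to the same fetcher task, with a hypothetical per-URL hash that spreads one host across tasks:

```java
// Sketch only: simplified stand-in for URLPartitioner#getPartition.
public class UrlHashPartitioner {

    // Default Nutch behavior (partition.url.mode=byHost): all URLs of one
    // host hash to the same partition, so one fetcher task handles the site.
    public static int partitionByHost(String url, int numReduceTasks) {
        try {
            String host = new java.net.URL(url).getHost();
            return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        } catch (java.net.MalformedURLException e) {
            return 0; // malformed URLs fall into partition 0
        }
    }

    // Modified behavior: hash the whole URL, so URLs of the same host are
    // spread across partitions and several nodes fetch the site in parallel.
    public static int partitionByUrl(String url, int numReduceTasks) {
        return (url.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

With partitionByHost, two URLs on the same host always land in the same partition; with partitionByUrl they generally do not, which is exactly what defeats the per-host politeness guarantees described above.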
Be aware that making Nutch more aggressive without consent will likely result in complaints from the admins of the crawled web sites and increases the likelihood of getting blocked!