dns network-programming web-crawler

Why is the DNS Resolver necessary in a crawler architecture?


In every paper I have read about crawler proposals, I see that one important component is the DNS Resolver.

My question is:

Why is it necessary? Can't we just make a request to http://www.some-domain.com/?


Solution

  • DNS resolution is a well-known bottleneck in web crawling. Because the Domain Name System is distributed, resolving a single name may entail multiple requests and round-trips across the internet, taking seconds and sometimes even longer. Right away, this jeopardizes the goal of fetching several hundred documents a second.

    There is another important difficulty in DNS resolution: the lookup implementations in standard libraries (likely to be used by anyone developing a crawler) are generally synchronous. This means that once a request is made to the Domain Name System, other crawler threads at that node are blocked until the first request is completed. To circumvent this, most web crawlers implement their own DNS resolver as a component of the crawler.

    Source: http://nlp.stanford.edu/IR-book/html/htmledition/dns-resolution-1.html
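
To make the idea concrete, here is a minimal sketch of what such a resolver component might look like in Python. It is not taken from the book or from any particular crawler. The names `CachingResolver` and `resolve_all` are made up for illustration, and it assumes that overlapping lookups via `asyncio` plus an in-memory cache is good enough for the example. It ignores DNS TTLs entirely, which a real crawler would have to respect.

```python
import asyncio
import socket


class CachingResolver:
    """Hypothetical helper: overlaps DNS lookups and caches the results."""

    def __init__(self, lookup=None):
        # host -> asyncio.Task, so concurrent callers share one lookup
        self._cache = {}
        # `lookup` is injectable (an assumption of this sketch) so the
        # resolver can be exercised without touching the network.
        self._lookup = lookup or self._system_lookup

    @staticmethod
    async def _system_lookup(host):
        # loop.getaddrinfo runs the blocking system resolver in the
        # event loop's default executor, so other coroutines keep running.
        loop = asyncio.get_running_loop()
        infos = await loop.getaddrinfo(host, 80, type=socket.SOCK_STREAM)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address.
        return sorted({info[4][0] for info in infos})

    async def resolve(self, host):
        if host not in self._cache:
            # Cache the task itself: a second request for the same host
            # waits on the in-flight lookup instead of starting a new one.
            self._cache[host] = asyncio.create_task(self._lookup(host))
        return await self._cache[host]


async def resolve_all(resolver, hosts):
    # Start every lookup at once; a failed lookup is returned as an
    # exception object instead of aborting the whole batch.
    results = await asyncio.gather(
        *(resolver.resolve(h) for h in hosts), return_exceptions=True
    )
    return dict(zip(hosts, results))
```

Usage would be something like `asyncio.run(resolve_all(CachingResolver(), ["www.some-domain.com"]))`. The point is only that a slow lookup for one host does not hold up the lookups (or fetches) for the others, and that a host seen many times during a crawl is resolved once rather than on every request.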