web-crawler apache-storm stormcrawler

How to crawl a login-protected site or page?


I want to crawl a site that requires a login to view its pages. I am able to crawl the pages visible to guests, but how do I crawl the login-protected ones? It would be great if somebody could share the steps to configure, or skip, the authentication mechanism when crawling a page with StormCrawler.

Thank you very much in advance.


Solution

  • If the site uses HTTP Basic authentication, you can set the following keys, with your credentials as their values, in the configuration of your topology:

    http.basicauth.user
    http.basicauth.password
    

    See the StormCrawler wiki page on configuration.
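
    As a rough illustration of what these two keys amount to, here is a minimal Java sketch. It is not StormCrawler code: a plain Map stands in for the topology configuration, the class and method names (BasicAuthConfigSketch, buildConf, basicAuthHeader) and the credentials are made up for the example, and the header is built the way HTTP Basic authentication defines it (RFC 7617). How StormCrawler itself reads the keys may differ between versions, so treat this as a sketch under those assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class BasicAuthConfigSketch {

    // Builds a configuration map holding the two keys named in the answer.
    // A plain Map is used here as a stand-in for the topology configuration.
    static Map<String, Object> buildConf(String user, String password) {
        Map<String, Object> conf = new HashMap<>();
        conf.put("http.basicauth.user", user);
        conf.put("http.basicauth.password", password);
        return conf;
    }

    // Computes the Authorization header value that HTTP Basic authentication
    // defines: "Basic " followed by Base64 of "user:password".
    static String basicAuthHeader(Map<String, Object> conf) {
        String credentials = conf.get("http.basicauth.user")
                + ":" + conf.get("http.basicauth.password");
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Hypothetical credentials, for illustration only.
        Map<String, Object> conf = buildConf("crawler", "s3cret");
        System.out.println("Authorization: " + basicAuthHeader(conf));
    }
}
```

    Note that this only helps when the server challenges with HTTP Basic authentication. A site that logs users in through an HTML form and a session cookie will not accept these credentials.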