stormcrawler

Dealing with redirect domains in StormCrawler


I am working on a StormCrawler-based project. One of our requirements is to find domains that redirect to another domain. In StormCrawler, each redirect counts as one level of crawl depth. For example, for a domain with two redirect steps, we need to crawl with depth=2. How can I resolve all redirected domains without the redirects counting towards the crawl depth?


Solution

  • The URL filters do not distinguish between URLs found through redirections and those coming from links in a page. You could simply deactivate the depth-based filter and instead use a custom parse filter to restrict the outlinks if necessary.
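
As a rough illustration of the "restrict the outlinks" part, here is a minimal plain-Java sketch of the decision such a custom parse filter might make. It deliberately does not use any StormCrawler classes: `OutlinkRestrictor`, `hostOf`, `sameHost` and `restrict` are hypothetical names, and the "keep only same-host links" policy is just one assumed way of restricting outlinks. Wiring this into an actual parse filter is left out.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of StormCrawler: decides which outlinks to keep.
final class OutlinkRestrictor {

    private OutlinkRestrictor() {}

    /** Returns the lower-cased host of a URL, or null if it cannot be parsed or has no host. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null; // malformed URL
        }
    }

    /** True when both URLs parse and share exactly the same host. */
    static boolean sameHost(String sourceUrl, String targetUrl) {
        String source = hostOf(sourceUrl);
        return source != null && source.equals(hostOf(targetUrl));
    }

    /** Keeps only the outlinks that stay on the host of the page they were found on. */
    static List<String> restrict(String sourceUrl, List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String link : outlinks) {
            if (sameHost(sourceUrl, link)) {
                kept.add(link);
            }
        }
        return kept;
    }
}
```

The same `sameHost` check, negated, could also be used to flag a redirect whose target is on a different host than the original URL. Note that it compares full host names, so `example.com` and `www.example.com` count as different hosts here; comparing registered domains instead would need a public-suffix-aware library.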