Tags: web-crawler, stormcrawler

Crawl URLs based on their priorities in StormCrawler


I am working on a crawler based on the StormCrawler project, and I have a requirement to crawl URLs according to their priority. For example, there are two priority levels, HIGH and LOW, and I want HIGH priority URLs to be fetched as soon as possible, before any LOW priority URLs. How can I handle this requirement in Apache Storm and StormCrawler?


Solution

With Elasticsearch as a backend, you can configure the spouts to sort the URLs within a bucket by whichever field you want. The fields are sorted in ascending order, so store a value of 0 for HIGH and 1 for LOW in the URL metadata and add the key name to es.status.bucket.sort.field in the configuration. (The string values HIGH and LOW would work as well, since they also sort in that order.) A sketch of how to attach such a value to the URLs follows the configuration below.

The default values in the ES archetype are:

```yaml
es.status.bucket.sort.field:
 - "nextFetchDate"
 - "url"
```

You should keep nextFetchDate so that URLs with the same priority are sorted by it, and have for instance:

```yaml
es.status.bucket.sort.field:
 - "metadata.priority"
 - "nextFetchDate"
 - "url"
```

Note that this won't affect how the buckets themselves are sorted, just the order of the URLs within them.
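
To attach the priority value to the URLs in the first place, one option is a custom ParseFilter that sets it in the metadata of every discovered outlink. Below is a minimal sketch, not StormCrawler's built-in mechanism: the class name, the host-based rule, and the host important.example.com are made up for illustration, and the package names assume a com.digitalpebble.stormcrawler release (adjust them if your version differs).

```java
import org.w3c.dom.DocumentFragment;

import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.Outlink;
import com.digitalpebble.stormcrawler.parse.ParseFilter;
import com.digitalpebble.stormcrawler.parse.ParseResult;

/**
 * Hypothetical filter that tags every discovered outlink with a
 * "priority" metadata value: "0" (HIGH) for URLs on an assumed
 * important host, "1" (LOW) for everything else. Replace the rule
 * with whatever defines HIGH priority in your crawl.
 */
public class PriorityParseFilter extends ParseFilter {

    @Override
    public void filter(String URL, byte[] content, DocumentFragment doc,
            ParseResult parse) {
        for (Outlink outlink : parse.getOutlinks()) {
            Metadata md = outlink.getMetadata();
            // The spout sorts ascending, so "0" is fetched before "1"
            if (outlink.getTargetURL().contains("important.example.com")) {
                md.setValue("priority", "0"); // HIGH
            } else {
                md.setValue("priority", "1"); // LOW
            }
        }
    }
}
```

For this to work end to end, the filter has to be registered in parsefilters.json like the other parse filters, and the priority key will normally also need to be listed under metadata.persist in the crawler configuration so that it is written to the status index where the spout can sort on it. Seed URLs can be given the value directly, e.g. as a key=value pair after the URL in the seeds file.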