web-crawler stormcrawler

What is the meaning of bucket in StormCrawler spouts?


What does "bucket" mean in the StormCrawler project? I have seen the term in several of the project's spouts, for example the Solr- and SQL-based ones.


Solution

  • A bucket is simply a way of partitioning the data from the backend in order to guarantee a good diversity of sources while crawling. The values are usually set to be the hostnames, domains or IPs of the pages.

    Without buckets, the spout could return a lot of URLs for the same website. The FetcherBolt enforces politeness and stores URLs internally in queues, so in the worst case it would end up with a single queue holding loads of URLs and fetch them one by one, waiting for the politeness delay between requests.

    With bucketing, you get a number of URLs from various sites and fetch them in parallel. Internally, the FetcherBolt would have a lot of queues with a few URLs in each of them.

    You can see the number of internal queues and active threads in the FetcherBolt when using the Grafana dashboard (or the Kibana one).

    [Image: FetcherBolt Grafana Queues]

    Performance-wise, it is better to have the best possible diversity of sources.
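
To make the effect of bucketing concrete, here is a minimal sketch in plain Java. It is not StormCrawler code: `BucketSketch`, `selectWithBuckets`, `toQueues`, `longestQueue` and the `maxPerBucket` / `maxBuckets` parameters are hypothetical names made up for illustration. It assumes the hostname is the bucket key and that the fetcher keeps one queue per host; the real spouts and the FetcherBolt may well differ in the details.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: not part of StormCrawler.
final class BucketSketch {

    // Assumption: the bucket key is the URL's hostname.
    static String bucketOf(String url) {
        return URI.create(url).getHost();
    }

    // Walk the candidates in order, keeping at most maxPerBucket URLs per
    // bucket and opening at most maxBuckets distinct buckets.
    static List<String> selectWithBuckets(List<String> candidates,
                                          int maxPerBucket, int maxBuckets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : candidates) {
            String bucket = bucketOf(url);
            boolean newBucket = !counts.containsKey(bucket);
            int seen = counts.getOrDefault(bucket, 0);
            if (newBucket && counts.size() >= maxBuckets) continue; // no room for another bucket
            if (seen >= maxPerBucket) continue;                      // this bucket is full
            counts.put(bucket, seen + 1);
            selected.add(url);
        }
        return selected;
    }

    // Group a batch into per-host queues (assumed: one queue per host).
    static Map<String, List<String>> toQueues(List<String> batch) {
        Map<String, List<String>> queues = new LinkedHashMap<>();
        for (String url : batch) {
            queues.computeIfAbsent(bucketOf(url), k -> new ArrayList<>()).add(url);
        }
        return queues;
    }

    // Length of the longest queue, i.e. how many fetches are forced to be serial.
    static int longestQueue(Map<String, List<String>> queues) {
        int longest = 0;
        for (List<String> q : queues.values()) {
            longest = Math.max(longest, q.size());
        }
        return longest;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
                "https://a.example/1", "https://a.example/2",
                "https://a.example/3", "https://a.example/4",
                "https://b.example/1", "https://c.example/1");

        // Without buckets: simply take the first four URLs the backend returns.
        Map<String, List<String>> plain = toQueues(candidates.subList(0, 4));
        // With buckets: at most two URLs per host.
        Map<String, List<String>> bucketed = toQueues(selectWithBuckets(candidates, 2, 10));

        System.out.println("no buckets: queues=" + plain.size()
                + ", longest=" + longestQueue(plain));
        System.out.println("buckets:    queues=" + bucketed.size()
                + ", longest=" + longestQueue(bucketed));
    }
}
```

With the six sample URLs in `main`, the unbucketed batch of four ends up in a single queue, whereas the bucketed batch of four is spread over three queues. The longest queue is what the politeness delay serializes, so a shorter longest queue means more of the batch can be fetched in parallel.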