What is the meaning of "bucket" in the StormCrawler project? I have seen buckets used in several of the spouts, for example the Solr- and SQL-based ones.
A bucket is simply a way of partitioning the data from the backend in order to guarantee a good diversity of sources while crawling. The values are usually set to the hostnames, domains or IPs of the pages.
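For illustration, the key used for partitioning is driven by the crawler configuration. A minimal sketch, assuming a standard `crawler-conf.yaml` (the key name below matches the default StormCrawler configuration; adjust the value to your crawl):

```yaml
# crawler-conf.yaml (fragment)
# Controls what the bucket / partition key is derived from:
# "byHost", "byDomain" or "byIP"
partition.url.mode: "byDomain"
```

The spouts then group URLs per bucket when querying the backend, which is why the same notion shows up in both the Solr and SQL spouts.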
Without buckets, the spout could get a lot of URLs for the same website. Since the FetcherBolt enforces politeness and internally stores URLs in queues, in the worst-case scenario it would end up with a single queue holding loads of URLs and fetch them one by one, with a politeness delay between each request.
With bucketing, you get a number of URLs from various sites and fetch them in parallel. Internally, the FetcherBolt would have a lot of queues with a few URLs in each of them.
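For reference, the per-queue politeness behaviour described above is also configurable. A hedged sketch of the related settings (names as found in the default StormCrawler configuration; values are examples, not recommendations):

```yaml
# crawler-conf.yaml (fragment)
fetcher.threads.number: 50     # total fetch threads in the FetcherBolt
fetcher.threads.per.queue: 1   # threads allowed to work on one internal queue
fetcher.server.delay: 1.0      # delay (seconds) between requests to the same queue
```

With one thread per queue, overall throughput scales with the number of distinct queues (i.e. the diversity of hosts) rather than with the size of any single queue.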
You can see the number of internal queues and active threads of the FetcherBolt in the Grafana (or Kibana) dashboard.
Performance-wise, the greater the diversity of sources, the better.