I'm trying to crawl pages from one particular domain using Heritrix.
The crawl rate is really slow. One thing I notice is that although there are 25 threads, 24 of them are always idle. Only one thread seems to be actively taking URIs from the queue and fetching data from the server.
Rates
0.33 URIs/sec (0.34 avg); 18 KB/sec (20 avg)
Load
1 active of 25 threads; 1 congestion ratio; 13193 deepest queue; 13193 average depth
Elapsed
1h32m3s424ms
Threads
25 threads: 24 ABOUT_TO_GET_URI, 1 ABOUT_TO_BEGIN_PROCESSOR; 24 noActiveProcessor, 1 fetchHttp
Frontier
RUN - 2 URI queues: 1 active (1 in-process; 0 ready; 0 snoozed); 0 inactive; 0 ineligible; 0 retired; 1 exhausted
Memory
79933 KiB used; 143508 KiB current heap; 253440 KiB max heap
Is there any configuration I can use to put all 25 threads to work? I've already found and changed the politeness settings (min/max delay). Thanks!
Found an answer on the mailing list: set parallelQueues on the queueAssignmentPolicy bean.
parallelQueues: default value (and historical behavior) is '1'. If instead N, all URIs that previously went into the same single-named queue will go into N related queues (via a consistent hash-mapping of the path?query portion of the URL). Each queue is considered separately for traditional politeness based on one-at-a-time connections and snooze-delays-between-fetches -- so N queues means N fetches could be in progress against a site at once. Thus, should only be used in an overlay setting, applied to sites likely to handle multiple connections well.
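For illustration, here is a minimal sketch of what that setting might look like in crawler-beans.cxml. It assumes a Heritrix 3 job whose queueAssignmentPolicy bean uses the SurtAuthorityQueueAssignmentPolicy class, as in the stock profile; if your job uses a different policy class, keep that class and only change the property. The value 4 is an arbitrary example, not a recommendation.

```xml
<!-- Sketch only: assumes the stock Heritrix 3 queue assignment policy bean.
     Check the class name against your own crawler-beans.cxml. -->
<bean id="queueAssignmentPolicy"
      class="org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy">
  <!-- Default is 1 (one queue per site). With N, URIs for a site are spread
       over N related queues, so up to N fetches can run against it at once.
       4 is an illustrative value. -->
  <property name="parallelQueues" value="4" />
</bean>
```

Setting the property on the bean itself applies it to every site in the crawl. For a single-domain crawl like the one above that may be acceptable, but the description quoted above recommends applying it through an overlay, limited to sites that can handle multiple simultaneous connections.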