I'm evaluating crawler4j for ~1M crawls per day. My scenario is this: I'm fetching a URL and parsing its description, keywords, and title; now I would like to save each URL and its words into a single file.
I've seen how it's possible to save crawled data to files. However, since I have many crawls to perform, I want different threads performing the save-file operation on the file system (in order not to block the fetcher thread). Is that possible with crawler4j? If so, how?
Thanks
Consider using a Queue (BlockingQueue or similar) where you put the data to be written, which is then processed by one or more worker threads (this approach is not crawler4j-specific). Search for "producer consumer" to get some general ideas.
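For illustration, here is a minimal sketch of that pattern. The Data holder class, the FileWriterWorker name, and the crawl-results.txt file are my own placeholders, not anything crawler4j provides:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;

// Hypothetical holder for one crawl result (URL plus the extracted words).
final class Data {
    final String url;
    final String words;

    Data(String url, String words) {
        this.url = url;
        this.words = words;
    }
}

// Consumer side: drains the queue and appends each entry to a single file,
// so the crawler (producer) threads never touch the file system themselves.
class FileWriterWorker implements Runnable {
    private final BlockingQueue<Data> queue;

    FileWriterWorker(BlockingQueue<Data> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get("crawl-results.txt"),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            while (!Thread.currentThread().isInterrupted()) {
                Data data = queue.take(); // blocks until a crawler produces something
                out.write(data.url + "\t" + data.words);
                out.newLine();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // shutdown requested
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Start it with something like new Thread(new FileWriterWorker(queue)).start() before the crawl begins. At ~1M crawls/day I'd use a bounded queue (e.g. new LinkedBlockingQueue<>(10_000)): if the writer falls behind, put() briefly blocks the crawlers instead of letting the queue grow without bound.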
Concerning your follow-up question on how to pass the Queue to the crawler instances, this should do the trick (note that this is only from looking at the source code; I haven't used crawler4j myself):
final BlockingQueue<Data> queue = …

// Use a factory, instead of supplying the crawler type, to pass the queue:
controller.start(new WebCrawlerFactory<MyCrawler>() {
    @Override
    public MyCrawler newInstance() throws Exception {
        return new MyCrawler(queue);
    }
}, numberOfCrawlers);
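MyCrawler would then put each parsed page onto the queue, roughly like this (again only a sketch: getTitle(), getParseData(), and getWebURL() do exist on crawler4j's HtmlParseData/Page, while Data is the hypothetical holder from above):

import java.util.concurrent.BlockingQueue;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class MyCrawler extends WebCrawler {

    private final BlockingQueue<Data> queue;

    public MyCrawler(BlockingQueue<Data> queue) {
        this.queue = queue;
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            try {
                // Hand off to the writer thread; put() blocks on a full queue,
                // applying back-pressure rather than buffering without bound.
                queue.put(new Data(page.getWebURL().getURL(), html.getTitle()));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}

This keeps the fetcher threads free of file I/O entirely: their only extra work per page is one queue insertion.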