java, asynchronous, web-scraping, crawler4j

crawler4j asynchronously saving results to file


I'm evaluating crawler4j for ~1M crawls per day. My scenario is this: I'm fetching a URL and parsing its description, keywords and title; now I would like to save each URL and its words into a single file.

I've seen how it's possible to save crawled data to files. However, since I have many crawls to perform, I want separate threads to perform the file-save operation on the file system (in order not to block the fetcher threads). Is that possible with crawler4j? If so, how?

Thanks


Solution

  • Consider using a Queue (a BlockingQueue or similar) onto which you put the data to be written and which one or more worker Threads then process (this approach is nothing crawler4j-specific). Search for "producer consumer" to get some general ideas; sketches of both sides follow the factory code below.

    Concerning your follow-up question on how to pass the Queue to the crawler instances, this should do the trick (this is only from looking at the source code; I haven't used crawler4j myself):

    final BlockingQueue<Data> queue = new LinkedBlockingQueue<>(); // any BlockingQueue implementation works
    
    // use a factory, instead of supplying the crawler type to pass the queue
    controller.start(new WebCrawlerFactory<MyCrawler>() {
        @Override
        public MyCrawler newInstance() throws Exception {
            return new MyCrawler(queue);
        }
    }, numberOfCrawlers);
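
    Concretely, the producer side could look like the sketch below. This only illustrates the idea: the Data holder class, its fields, and toLine() are hypothetical names made up for this answer, and reading the "description"/"keywords" meta tags assumes your crawler4j version exposes getMetaTags() on HtmlParseData (otherwise extract them from the HTML yourself).

    import java.util.concurrent.BlockingQueue;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;

    // hypothetical value holder for one crawled page
    class Data {
        final String url, title, description, keywords;

        Data(String url, String title, String description, String keywords) {
            this.url = url;
            this.title = title;
            this.description = description;
            this.keywords = keywords;
        }

        // one tab-separated line per URL, matching the "single file" requirement
        String toLine() {
            return url + "\t" + title + "\t" + description + "\t" + keywords;
        }
    }

    public class MyCrawler extends WebCrawler {

        private final BlockingQueue<Data> queue;

        public MyCrawler(BlockingQueue<Data> queue) {
            this.queue = queue;
        }

        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData html = (HtmlParseData) page.getParseData();
                // hand the parsed result off to the writer thread; the fetcher
                // thread returns immediately instead of blocking on file I/O
                queue.offer(new Data(page.getWebURL().getURL(),
                                     html.getTitle(),
                                     html.getMetaTags().get("description"),
                                     html.getMetaTags().get("keywords")));
            }
        }
    }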
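
    On the consumer side, a single worker thread can drain the queue and append each entry to the output file. Again a minimal sketch under the same assumptions (Data and FileWriterWorker are made-up names); start it before controller.start(...) and interrupt it once the crawl is done:

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.BlockingQueue;

    public class FileWriterWorker implements Runnable {

        private final BlockingQueue<Data> queue;
        private final Path outputFile;

        public FileWriterWorker(BlockingQueue<Data> queue, Path outputFile) {
            this.queue = queue;
            this.outputFile = outputFile;
        }

        @Override
        public void run() {
            try (BufferedWriter out = Files.newBufferedWriter(
                    outputFile, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                while (!Thread.currentThread().isInterrupted()) {
                    Data data = queue.take(); // blocks until an item is available
                    out.write(data.toLine());
                    out.newLine();
                }
            } catch (InterruptedException e) {
                // asked to shut down; a real shutdown would drain remaining items first
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                e.printStackTrace(); // real code would log and handle this
            }
        }
    }

    // usage:
    // Thread writer = new Thread(new FileWriterWorker(queue, Paths.get("results.tsv")));
    // writer.start();
    // controller.start(...); // blocks until the crawl finishes
    // writer.interrupt();    // then stop the writer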