java web-crawler crawler4j

Is there a way to clear the "to visit" queue in crawler4j during crawling?


I am trying to figure out a way to change the seed while a crawl is running and to completely clear the "to visit" database/queue.

In particular, I would like to remove all the URLs currently in the queue and add a new seed. Something along the lines of:

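import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Number of URLs discarded since the last reseed
    private int discarded = 0;

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        boolean isDiscarded = checkPage(referringPage, url);
        if (isDiscarded) {
            this.discarded++;
            if (discarded >= 100) {
                // Clear all the urls that need to be visited
                ?_____?
                // Add the new seed
                this.myController.addSeed("http://new_seed.com");
                discarded = 0;
            }
        }
        // Visit only the URLs that were not discarded
        return !isDiscarded;
    }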

....

I know I can call controller.shutdown() and start everything again but it's kind of slow.


Solution

  • There is no built-in way to do this without modifying the original source code (by forking it or by using the Reflection API).

    Every WebCrawler obtains new URLs from a Frontier instance, which stores the URLs that have been discovered but not yet fetched, for all web crawlers. Unfortunately, this field has private access in WebCrawler.

    If you want to remove all pending URLs, you need to reset the Frontier object. That is only possible with a custom Frontier implementation (see the source code) that offers this functionality.
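
To make the idea more concrete, here is a minimal, self-contained sketch of the two pieces described above: a frontier that can be reset to a new seed, and reflection-based access to a private frontier field. Note that ClearableFrontier, StandInCrawler and FrontierAccess are hypothetical stand-ins written for this illustration. They are not crawler4j classes, and the real Frontier is backed by a database rather than an in-memory queue, so treat this as an outline of the approach rather than a drop-in patch.

```java
import java.lang.reflect.Field;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a frontier (NOT crawler4j's Frontier class).
// It holds the URLs that were discovered but not fetched yet.
class ClearableFrontier {
    private final Queue<String> toVisit = new ArrayDeque<>();

    // Adds a newly discovered URL to the pending queue.
    synchronized void schedule(String url) {
        toVisit.add(url);
    }

    // Hands out up to max pending URLs and removes them from the queue.
    synchronized List<String> getNextUrls(int max) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < max && !toVisit.isEmpty()) {
            batch.add(toVisit.poll());
        }
        return batch;
    }

    // Drops every pending URL, schedules newSeed, and returns how many were dropped.
    synchronized int resetTo(String newSeed) {
        int dropped = toVisit.size();
        toVisit.clear();
        toVisit.add(newSeed);
        return dropped;
    }

    synchronized int size() {
        return toVisit.size();
    }
}

// Hypothetical stand-in for a crawler that keeps its frontier private.
class StandInCrawler {
    private final ClearableFrontier frontier;

    StandInCrawler(ClearableFrontier frontier) {
        this.frontier = frontier;
    }
}

class FrontierAccess {
    // Reads the private "frontier" field of the stand-in crawler via reflection.
    static ClearableFrontier frontierOf(StandInCrawler crawler) {
        try {
            Field field = StandInCrawler.class.getDeclaredField("frontier");
            field.setAccessible(true);
            return (ClearableFrontier) field.get(crawler);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not read the frontier field", e);
        }
    }
}

class FrontierDemo {
    public static void main(String[] args) {
        ClearableFrontier frontier = new ClearableFrontier();
        frontier.schedule("http://old.example/a");
        frontier.schedule("http://old.example/b");

        StandInCrawler crawler = new StandInCrawler(frontier);
        int dropped = FrontierAccess.frontierOf(crawler).resetTo("http://new_seed.com");
        System.out.println("Dropped " + dropped + ", pending now: " + frontier.size());
    }
}
```

The methods are synchronized because several crawler threads share one frontier, so clearing and reseeding should happen as a single step. URLs that a thread has already pulled out of the queue before the reset would still be processed in this sketch.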