I am trying to find a way to change the seed while a crawl is running and completely clear the "to visit" database/queue.
In particular, I would like to remove all the URLs currently in the queue and add a new seed. Something along the lines of:
    public class MyCrawler extends WebCrawler {
        private int discarded = 0;

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            boolean isDiscarded = checkPage(referringPage, url);
            if (isDiscarded) {
                this.discarded++;
                if (discarded >= 100) {
                    // Clear all the urls that need to be visited
                    ?_____?
                    // Add the new seed
                    this.myController.addSeed("http://new_seed.com");
                    discarded = 0;
                }
            }
            return isDiscarded;
        }
....
I know I can call controller.shutdown() and start everything again, but that is kind of slow.
There is no built-in functionality for achieving this without modifying the original source code (by forking it or by using the Reflection API).
Every WebCrawler obtains new URLs via a Frontier instance, which stores the current (discovered and not yet fetched) URLs for all web crawlers. Sadly, this variable has private access in WebCrawler.
If you want to remove all current URLs, you need to reset the Frontier object. Unless you implement a custom Frontier (see the source code) that offers this functionality, resetting will not be possible.
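
To illustrate the Reflection route, here is a minimal, self-contained sketch. It does not use crawler4j at all: StubFrontier and StubCrawler are hypothetical stand-ins I made up, and the field name "frontier" as well as the clear() method are assumptions. Check the actual field name and the Frontier internals in the source of the version you use before trying anything like this on the real classes.

```java
import java.lang.reflect.Field;
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical stand-in for the library's Frontier:
// holds URLs that were discovered but not yet fetched.
class StubFrontier {
    private final Queue<String> pending = new ArrayDeque<>();

    synchronized void schedule(String url) { pending.add(url); }

    // The "reset" operation a custom Frontier would have to provide.
    synchronized void clear() { pending.clear(); }

    synchronized int size() { return pending.size(); }
}

// Hypothetical stand-in for WebCrawler: the frontier is a private field.
class StubCrawler {
    private final StubFrontier frontier = new StubFrontier();

    void discover(String url) { frontier.schedule(url); }
}

class FrontierResetSketch {
    // Reads the private "frontier" field of a crawler via reflection.
    static StubFrontier frontierOf(StubCrawler crawler) throws ReflectiveOperationException {
        Field field = StubCrawler.class.getDeclaredField("frontier");
        field.setAccessible(true); // bypass the private modifier
        return (StubFrontier) field.get(crawler);
    }

    // Drops every pending URL and schedules a single new seed.
    static void reseed(StubCrawler crawler, String newSeed) throws ReflectiveOperationException {
        StubFrontier frontier = frontierOf(crawler);
        frontier.clear();
        frontier.schedule(newSeed);
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        StubCrawler crawler = new StubCrawler();
        crawler.discover("http://example.com/a");
        crawler.discover("http://example.com/b");

        reseed(crawler, "http://new_seed.com");
        System.out.println("pending URLs after reseed: " + frontierOf(crawler).size());
    }
}
```

With the real library the same pattern would apply to the private field in WebCrawler, but the clearing step is the hard part, since it depends on how Frontier stores its work queue. Keep in mind that other crawler threads share the same Frontier, so any reset needs to be synchronized with them.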