Tags: java, web-crawler, crawler4j

Crawl a list of sites using Crawler4j


I'm having trouble loading a list of links; these links should be passed to controller.addSeed in a loop. Here is the code:

SelectorString selector = new SelectorString();
List<String> lista = selector.leggiFile();
String crawlStorageFolder = "/home/usersstage/Desktop/prova";
for (String x : lista) {
    System.out.println(x);
    System.out.println("----");
}

// numberOfCrawlers is the number of threads started for the crawl

int numberOfCrawlers = 2; // threads
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);

// Don't send more than one request per second (1000 ms or 200 ms?)
config.setPolitenessDelay(200);

// Crawl depth; -1 for unlimited
config.setMaxDepthOfCrawling(-1);

// Maximum number of pages to crawl; -1 for unlimited
config.setMaxPagesToFetch(-1);

config.setResumableCrawling(false);

// Controller instance for this crawl
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig,
        pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher,
        robotstxtServer);
// Loop used to add several websites (more than 100)
for (String seed : lista) {
    controller.addSeed(seed);
}
controller.start(Crawler.class, numberOfCrawlers);

I need to crawl these sites and retrieve only the RSS pages, but the crawled list comes out empty.


Solution

  • The code you posted shows how to configure the CrawlController, but if you only want RSS resources you also need to configure the Crawler itself: the filtering logic belongs in the crawler's shouldVisit method. See the sketch below.
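Here is a minimal sketch of such a crawler, assuming crawler4j 4.x (where shouldVisit also receives the referring page). The class name RssCrawler and the two regular expressions are illustrative, not taken from your code; adapt them to the sites you crawl.

import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class RssCrawler extends WebCrawler {

    // Skip obvious static/binary resources (illustrative filter).
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz))$");

    // Heuristic for feed-like URLs (illustrative; adjust as needed).
    private static final Pattern FEEDS =
            Pattern.compile(".*(rss|feed|atom).*|.*\\.xml$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Keep following ordinary HTML pages as well: if this method only
        // admitted feed URLs, the crawler could never traverse the pages
        // that link to the feeds, and the result would be empty.
        return !FILTERS.matcher(href).matches();
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        String contentType = page.getContentType();
        // Record only pages that look like feeds, by URL or content type.
        if (FEEDS.matcher(url.toLowerCase()).matches()
                || (contentType != null && contentType.contains("xml"))) {
            System.out.println("RSS candidate: " + url);
        }
    }
}

With a class like this, the last line of your snippet becomes controller.start(RssCrawler.class, numberOfCrawlers). Note also that if your current Crawler's shouldVisit rejects everything that is not already an RSS URL, the crawler has no way to reach the feeds through the intervening HTML pages, which would explain the empty output.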