I have been playing around with Crawler4j and have gotten it to crawl some pages successfully, but have had no success with others. For example, I got it to crawl Reddit with this code:
    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class Controller {
        public static void main(String[] args) throws Exception {
            String crawlStorageFolder = "//home/user/Documents/Misc/Crawler/test";
            int numberOfCrawlers = 1;
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder(crawlStorageFolder);
            /*
             * Instantiate the controller for this crawl.
             */
            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
            /*
             * For each crawl, you need to add some seed urls. These are the first
             * URLs that are fetched and then the crawler starts following links
             * which are found in these pages
             */
            controller.addSeed("https://www.reddit.com/r/movies");
            controller.addSeed("https://www.reddit.com/r/politics");
            /*
             * Start the crawl. This is a blocking operation, meaning that your code
             * will reach the line after this only when crawling is finished.
             */
            controller.start(MyCrawler.class, numberOfCrawlers);
        }
    }
And with:
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.reddit.com/");
    }
in MyCrawler.java. However, when I try to crawl http://www.ratemyprofessors.com/, the program just hangs without any output and does not crawl anything. I use the same code as above, with these seeds in myController.java:
controller.addSeed("http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222");
controller.addSeed("http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044");
And in MyCrawler.java:
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ratemyprofessors.com/");
    }
So I am wondering:
crawler4j respects crawler politeness rules such as robots.txt. In your case, this is the robots.txt file of the site you are crawling (http://www.ratemyprofessors.com/robots.txt).
Inspecting this file reveals that crawling your seed URLs is disallowed:
Disallow: /ShowRatings.jsp
Disallow: /campusRatings.jsp
This theory is supported by the crawler4j log output:
2015-12-15 19:47:18,791 WARN [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222
2015-12-15 19:47:18,793 WARN [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044
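To see why both seeds are rejected, here is a minimal sketch of the prefix check that the two Disallow rules imply. This is not crawler4j's actual implementation: the class name RobotsCheck and the method isAllowed are hypothetical, and the sketch assumes plain path-prefix matching only, ignoring User-agent groups, Allow rules and wildcards.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of crawler4j.
public class RobotsCheck {

    /** Returns true if none of the Disallow prefixes matches the URL's path. */
    public static boolean isAllowed(List<String> disallowedPrefixes, String url) {
        // getPath() drops the query string, e.g. "/campusRatings.jsp" for "...jsp?sid=1222".
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        for (String prefix : disallowedPrefixes) {
            // An empty Disallow value disallows nothing, so skip it.
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The two rules quoted above from the site's robots.txt.
        List<String> disallowed = Arrays.asList("/ShowRatings.jsp", "/campusRatings.jsp");
        String[] seeds = {
            "http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222",
            "http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044"
        };
        for (String seed : seeds) {
            String verdict = isAllowed(disallowed, seed) ? "allowed" : "disallowed";
            System.out.println(seed + " -> " + verdict);
        }
    }
}
```

Both seeds start with a disallowed path prefix, so a polite crawler drops them before fetching anything, which is why the crawl appears to hang with nothing to do.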