I'm looking for a web crawler with the ability to grab the page's CSS. I don't need any other fancy crawling abilities.
I'm trying to make my way through Xapian, Nutch and Heritrix, but they all seem a bit complex. If anyone has experience or a recommendation, I would love to hear it. An accessible tutorial for any of the above platforms would also be welcome.
David
You are right: don't use those, they are way too heavy for this.
Use Crawler4j instead.
Follow the onsite tutorial for a simple crawler.
The only change you need is in MyCrawler.java:
1. Remove "css" from the FILTERS pattern.
2. In the visit() method, put a simple condition as follows:
if (url.contains(".css")) {
    // do what you need with it
}
That's it - you are good!
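For reference, here is a rough stand-alone sketch of the two URL checks described above, written without the crawler4j dependency so it can be run on its own. The class name CssFilterSketch, the helper isCss(), and the exact extension list in FILTERS are assumptions made for illustration; in a real crawler this logic would sit inside MyCrawler. Note that isCss() swaps the contains(".css") test for an endsWith check on the URL with its query string and fragment stripped, which is my own tightening and not part of the tutorial.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical stand-alone class; in practice these checks belong in MyCrawler.
class CssFilterSketch {

    // Extensions the crawler should skip. "css" is deliberately NOT listed.
    // The exact list here is an assumption, not copied from the tutorial.
    static final Pattern FILTERS = Pattern.compile(
            ".*(\\.(js|gif|jpe?g|png|bmp|mp3|mp4|zip|gz|pdf))$");

    // Plays the role of shouldVisit(): reject URLs ending in a filtered extension.
    static boolean shouldVisit(String url) {
        return !FILTERS.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }

    // Tighter alternative to url.contains(".css"): drop any query string or
    // fragment, then require the remaining URL to end in ".css".
    static boolean isCss(String url) {
        String path = url.toLowerCase(Locale.ROOT).split("[?#]", 2)[0];
        return path.endsWith(".css");
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/style.css",
            "http://example.com/theme.CSS?v=2",
            "http://example.com/logo.png",
            "http://example.com/notes.css.html"
        };
        for (String url : samples) {
            System.out.println(url + " visit=" + shouldVisit(url) + " css=" + isCss(url));
        }
    }
}
```

The plain contains(".css") check from the answer is fine for a quick test; the endsWith variant just avoids matching pages such as notes.css.html.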