Tags: java, lucene, web-crawler, nutch, xapian

Java CSS Crawler


I'm looking for a web crawler with the ability to grab the page's CSS. I don't need any other fancy crawling abilities.

I'm trying to make my way through Xapian, Nutch and Heritrix, but they all seem a bit complex. If anyone has experience with these or a recommendation, I would love to hear it. An accessible tutorial for any of the above platforms is also welcome.

David


Solution

  • You are right. Don't use those; they are way too heavy.

    Use: Crawler4j

    Follow the onsite tutorial for a simple crawler.

    The only changes you need are in MyCrawler.java. First, remove "css" from the FILTERS pattern so that stylesheet URLs are no longer skipped. Then, in the visit() method, add a simple condition as follows:

    if (url.contains(".css")) {
        // do what you need with it
    }
    

    That's it - you are good!
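
One caveat with the check above: url.contains(".css") also matches URLs such as /docs/page.css.html, because it looks anywhere in the string. A slightly stricter test could look only at the path part of the URL. The sketch below is plain Java and does not depend on crawler4j; the class CssUrlCheck and its methods are hypothetical names made up for illustration, and the FILTERS pattern is a shortened stand-in for the one in the tutorial, with "css" removed.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of crawler4j.
final class CssUrlCheck {

    // Shortened, illustrative version of a tutorial-style FILTERS pattern.
    // "css" is deliberately absent so stylesheet URLs are not skipped.
    static final Pattern FILTERS =
            Pattern.compile(".*(\\.(js|gif|jpe?g|png|mp3|mp4|zip|gz))$");

    private CssUrlCheck() {}

    // True when the URL's path ends in ".css", ignoring any query string or fragment.
    static boolean isCssUrl(String url) {
        if (url == null) {
            return false;
        }
        try {
            String path = new URI(url.trim()).getPath();
            return path != null && path.toLowerCase(Locale.ROOT).endsWith(".css");
        } catch (URISyntaxException e) {
            // Malformed URLs are treated as "not a stylesheet".
            return false;
        }
    }

    // True when the URL matches the filter pattern, i.e. the crawler should skip it.
    static boolean isFiltered(String url) {
        return FILTERS.matcher(url.toLowerCase(Locale.ROOT)).matches();
    }
}
```

Inside visit(), the condition would then read if (CssUrlCheck.isCssUrl(url)) instead of url.contains(".css"), assuming url holds the page's URL as a string, as in the snippet above.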